Monday, March 7, 2016

Cleaning up after aws cli on Mac OSX...

I've installed the aws command line on my Mac. It's super handy. However, the aws s3 command creates $folder$ files for every "directory" when a recursive copy is performed. It's super annoying. 

For example, you could have a "directory" in S3 named "myfiles". When you download the objects with "myfiles" in the path you will end up with a file named "myfiles_$folder$".

Running aws --version returns this info:

    aws-cli/1.10.6 Python/2.7.10 Darwin/14.5.0 botocore/1.3.28

I haven't found anything that explains how I can prevent those files from being created, so I've been doing manual cleanup afterwards.  This is the command I run:

    > find . -name '*_$folder$' -delete
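If you'd rather see what is about to be removed, here is a minimal sketch of the same cleanup wrapped in a shell function. The function name clean_folder_markers is my own invention, not part of the aws cli. It assumes the marker files always end in the literal text _$folder$, as in the "myfiles" example above. The single quotes matter: inside double quotes the shell would try to expand $folder as a variable.

```shell
#!/bin/sh
# Hypothetical helper (not part of the aws cli): removes the
# "<name>_$folder$" marker files found under a directory.
clean_folder_markers() {
    dir="${1:-.}"   # default to the current directory

    # Print the matches first so you can see what will be removed.
    # Single quotes keep the shell from expanding $folder as a variable.
    find "$dir" -type f -name '*_$folder$' -print

    # -exec ... {} + passes the paths straight to rm, so file names
    # containing spaces are handled correctly.
    find "$dir" -type f -name '*_$folder$' -exec rm -f {} +
}
```

Run it as "clean_folder_markers ." from the directory you downloaded into, or pass a different path as the first argument.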

Tuesday, January 12, 2016

Debugging a local Spark job using IntelliJ

A coworker was working on a local Spark job and shared how he set up his environment for debugging the job (which is basically the same as debugging any other remote process). These are the instructions I followed:

1. Create a remote debug configuration.

Go to IntelliJ's "Run | Edit Configurations" screen
Click on the "+" to "Add New Configuration"
Select "Remote"

2. Copy the command line argument to use and modify it however you see fit.

I'm using Java 8, so I used the example command line arguments from the top edit box. The only change I made was to set "suspend=y" so the Spark job would stop and wait for me to start my "Remote Debug" process.

This is what I used:

-agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=5005

3. Export the command line arg as SPARK_JAVA_OPTS (Spark uses this value when you submit a spark job).

I set the SPARK_JAVA_OPTS like this:

export SPARK_JAVA_OPTS=-agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=5005
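If you change the port or the suspend flag often, a small helper can build the option string for you. This is only a sketch: jdwp_opts is a made-up function name, and the defaults (port 5005, suspend=y) are simply the values from the export above.

```shell
#!/bin/sh
# Hypothetical helper: builds the JDWP agent option string.
#   $1 = debug port (defaults to 5005)
#   $2 = suspend flag, y or n (defaults to y)
jdwp_opts() {
    port="${1:-5005}"
    suspend="${2:-y}"
    printf '%s' "-agentlib:jdwp=transport=dt_socket,server=y,suspend=${suspend},address=${port}"
}

# Quote the substitution so the value is exported as a single word.
export SPARK_JAVA_OPTS="$(jdwp_opts 5005 y)"
echo "$SPARK_JAVA_OPTS"
```

With suspend=n the job would start without waiting for the debugger, which is handy when you only want to attach occasionally.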

4. Start the spark job.

You should see your spark job start up, and then pause with the following line printed on the console:

Listening for transport dt_socket at address: 5005

5. In IntelliJ, create whatever breakpoints you want to use and start the remote debug configuration.