Tuesday, February 9, 2021

Use Jupyter Lab with PySpark and S3

If you're having an issue accessing S3 data from JupyterLab, then read on! Perhaps the info here, or in the linked GitHub repo, might help you discover and resolve your issue.

Before I get into any of the details, feel free to use this GitHub repo as an example of how to configure your JupyterLab PySpark notebook (running in a docker container on your local machine) to access data in S3.

I recently wanted to use JupyterLab to view some Parquet data in S3. I chose the pyspark-notebook image from the Jupyter Docker Stacks repo as a base Docker image and added the jar files that allow Spark to connect to S3 and read and write data there.

It seemed like it should be pretty easy, and technically it was. However, I ended up getting 403 errors whenever the PySpark code tried to read data from S3. The reason for the 403 error was understandable once I realized what I had done, but I didn't find any posts or documentation that illustrated my exact problem, so I figured I would share it here.

I wasn't setting the credentials provider explicitly when configuring the Spark session. I had passed in temporary credentials as environment variables when running a script to start the JupyterLab container. I retrieved the credentials by using the following command:

aws --profile <your aws profile name> --region us-west-2 sts get-session-token
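
The command prints a JSON document containing a Credentials object. As a minimal sketch of how that output maps onto the standard AWS environment variable names, here is a small helper. The function name credentials_to_env is my own for illustration, and the sample values are placeholders, not real credentials.

```python
import json


def credentials_to_env(sts_output: str) -> dict:
    """Map the JSON printed by `aws sts get-session-token` to AWS env var names."""
    creds = json.loads(sts_output)["Credentials"]
    return {
        "AWS_ACCESS_KEY_ID": creds["AccessKeyId"],
        "AWS_SECRET_ACCESS_KEY": creds["SecretAccessKey"],
        # Temporary credentials only work when the session token goes along with them.
        "AWS_SESSION_TOKEN": creds["SessionToken"],
    }


# Placeholder output in the shape the CLI returns (values are made up).
sample_output = json.dumps({
    "Credentials": {
        "AccessKeyId": "EXAMPLE_ACCESS_KEY",
        "SecretAccessKey": "EXAMPLE_SECRET_KEY",
        "SessionToken": "EXAMPLE_SESSION_TOKEN",
        "Expiration": "2021-02-09T12:00:00Z",
    }
})

env = credentials_to_env(sample_output)
for name in sorted(env):
    print(name)
```

All three values need to reach the container. Passing only the access key and secret key is what sets up the 403 described in this post.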

The default credentials provider was being used because I didn't explicitly set one. I assume that the access key and secret key were tied to a session, so AWS rejected the request with a 403 because the default provider never passed the session token along.

Here is the Hadoop documentation that lists the property names and values to use when configuring Spark to access data in S3. Reading it is what made me realize that I needed to set the credentials provider explicitly.
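
To make that concrete, here is a minimal sketch of setting the provider when building the Spark session. It assumes the provider to use is Hadoop's TemporaryAWSCredentialsProvider, which is the S3A provider meant for session credentials; check the notebook in the repo for the exact value used there. The helper names s3a_spark_conf and build_session and the app name are hypothetical.

```python
# Spark forwards any option prefixed with "spark.hadoop." to the Hadoop configuration.
S3A_PROVIDER_KEY = "spark.hadoop.fs.s3a.aws.credentials.provider"

# Assumption: the S3A provider for access key + secret key + session token.
TEMPORARY_PROVIDER = "org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider"


def s3a_spark_conf() -> dict:
    """Spark options that select the credentials provider explicitly."""
    return {S3A_PROVIDER_KEY: TEMPORARY_PROVIDER}


def build_session(app_name: str = "s3-example"):
    """Create a SparkSession with the S3A options applied."""
    # Imported here so the config helper can be used without PySpark installed.
    from pyspark.sql import SparkSession

    builder = SparkSession.builder.appName(app_name)
    for key, value in s3a_spark_conf().items():
        builder = builder.config(key, value)
    return builder.getOrCreate()
```

In the notebook you would call spark = build_session() before reading anything from S3. This sketch relies on the credentials themselves arriving through the environment, so verify that your Spark and Hadoop versions pick them up that way.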

The Hadoop documentation also reminded me that you don't need to explicitly set the access key, secret key, and session token in the Spark session configuration if you use the standard AWS environment variable names. It makes sense; I had just been setting the credentials explicitly before.
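
Since everything then depends on those environment variables being present inside the container, a quick check in the first notebook cell can save some head scratching. This is only a sketch; the helper name missing_aws_vars is my own.

```python
import os

# The standard AWS environment variable names for temporary credentials.
REQUIRED_AWS_VARS = (
    "AWS_ACCESS_KEY_ID",
    "AWS_SECRET_ACCESS_KEY",
    "AWS_SESSION_TOKEN",
)


def missing_aws_vars(environ=None) -> list:
    """Return the required variable names that are unset or empty."""
    environ = os.environ if environ is None else environ
    return [name for name in REQUIRED_AWS_VARS if not environ.get(name)]


missing = missing_aws_vars()
if missing:
    print("Missing AWS environment variables:", ", ".join(missing))
else:
    print("All AWS credential environment variables are set.")
```

If AWS_SESSION_TOKEN shows up as missing, the run script probably did not pass it through to the container.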

You can view a GitHub repo that includes a Dockerfile, a run script that starts the Docker container and passes the AWS credentials as environment variables, and an example JupyterLab notebook that uses PySpark to connect to S3 and download data. The bucket referenced in the example is private, so you'll need to replace the S3 URI with one that points to publicly available data or to a bucket that you have permission to read.
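
For reference, the read itself is short. The sketch below assumes Hadoop's S3A connector, which uses the s3a:// scheme, and an existing SparkSession named spark. The helper name read_parquet_from_s3 and the bucket path in the usage comment are placeholders, not the ones from the repo.

```python
def read_parquet_from_s3(spark, uri: str):
    """Read Parquet data through the S3A connector and return a DataFrame."""
    # Assumption: the S3A connector is on the classpath, so URIs use s3a://.
    if not uri.startswith("s3a://"):
        raise ValueError(f"Expected an s3a:// URI, got: {uri}")
    return spark.read.parquet(uri)


# Usage in a notebook cell (placeholder bucket and prefix):
# df = read_parquet_from_s3(spark, "s3a://your-bucket/path/to/parquet/")
# df.printSchema()
```

The scheme check is there only to catch a pasted s3:// URI early. Drop it if your setup maps other schemes to the connector.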

If you're interested in the data I used then you can find it here: the CDC's COVID-19 dataset.

Good luck, and have fun Sparking!