Tuesday, February 9, 2021

Use Jupyter Lab with PySpark and S3

If you're having an issue accessing S3 data from JupyterLab, then read on! Perhaps the info here, or in the linked GitHub repo, might help you discover and resolve your issue.

Before I get into any of the details, feel free to use this GitHub repo as an example of how to configure your JupyterLab PySpark notebook (running in a docker container on your local machine) to access data in S3.

I recently wanted to use JupyterLab to view some Parquet data in S3. I chose the pyspark-notebook image from the Jupyter Docker Stacks repo as a base Docker image and added jar files that would allow Spark to connect to S3 and read and write data there.

It seemed like it should be pretty easy, and technically it was. However, I ended up getting 403 errors whenever the PySpark code tried to read data from S3. The reason for the 403 error was understandable once I realized what I had done, but I didn't find any posts or documentation that illustrated my exact problem, so I figured I would share it here.

I wasn't setting the credentials provider explicitly when configuring the Spark session. I had passed in temporary credentials as environment variables when running a script to start the JupyterLab container. I retrieved the credentials by using the following command:

aws --profile <your aws profile name> --region us-west-2 sts get-session-token
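The JSON that this command prints contains a Credentials object with AccessKeyId, SecretAccessKey, and SessionToken fields. Here is a minimal sketch of mapping those fields onto the standard AWS environment variable names and turning them into docker run arguments. The function names are hypothetical and are not taken from the linked repo.

```python
import json


def credentials_to_env(sts_output):
    """Map `aws sts get-session-token` JSON output to AWS env var names."""
    creds = json.loads(sts_output)["Credentials"]
    return {
        "AWS_ACCESS_KEY_ID": creds["AccessKeyId"],
        "AWS_SECRET_ACCESS_KEY": creds["SecretAccessKey"],
        "AWS_SESSION_TOKEN": creds["SessionToken"],
    }


def docker_env_args(env):
    """Build the `-e NAME=value` arguments to pass to `docker run`."""
    args = []
    for name, value in env.items():
        args += ["-e", f"{name}={value}"]
    return args
```

All three values belong together: temporary credentials are only valid when the session token is sent along with the access key and secret key.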

The default credentials provider was being used because I didn't explicitly set one. I assume that the access key and secret key were tied to a session, so the default credentials provider received an error when attempting to connect because no session token was being passed to AWS.

Here is the documentation from Hadoop that shows the property names and values to use when configuring Spark to access data in S3. Reading it is what made me realize that I needed to set the credentials provider explicitly.

The Hadoop documentation also reminded me that you don't need to explicitly set the access key, secret key, and session token in the Spark session configuration if you use the standard AWS environment variable names. It makes sense - I just had been explicitly setting the credentials previously.
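As a rough sketch of what setting the provider explicitly can look like (this is not necessarily what the linked repo does), the helper below builds the Spark settings for two options. The first names the AWS SDK's EnvironmentVariableCredentialsProvider, which reads AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, and AWS_SESSION_TOKEN from the environment. The second names Hadoop's TemporaryAWSCredentialsProvider and passes the three values in explicitly. The helper names are hypothetical, and which provider classes are available depends on the hadoop-aws and AWS SDK jars on your classpath.

```python
import os

ENV_PROVIDER = "com.amazonaws.auth.EnvironmentVariableCredentialsProvider"
TEMP_PROVIDER = "org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider"


def s3a_conf(env=None, explicit_keys=False):
    """Return Spark config entries that set the S3A credentials provider."""
    env = os.environ if env is None else env
    if not explicit_keys:
        # The provider reads the standard AWS_* variables itself.
        return {"spark.hadoop.fs.s3a.aws.credentials.provider": ENV_PROVIDER}
    # Otherwise hand all three temporary credential values to S3A directly.
    return {
        "spark.hadoop.fs.s3a.aws.credentials.provider": TEMP_PROVIDER,
        "spark.hadoop.fs.s3a.access.key": env["AWS_ACCESS_KEY_ID"],
        "spark.hadoop.fs.s3a.secret.key": env["AWS_SECRET_ACCESS_KEY"],
        "spark.hadoop.fs.s3a.session.token": env["AWS_SESSION_TOKEN"],
    }


def build_session(app_name="s3-example", explicit_keys=False):
    """Create a SparkSession with the S3A settings applied."""
    # Imported here so the config helper can be used without pyspark installed.
    from pyspark.sql import SparkSession

    builder = SparkSession.builder.appName(app_name)
    for key, value in s3a_conf(explicit_keys=explicit_keys).items():
        builder = builder.config(key, value)
    return builder.getOrCreate()
```

With a session from build_session(), a read would look like spark.read.parquet("s3a://<your bucket>/<your prefix>"), where the bucket and prefix are placeholders for data you have permission to read.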

You can view a GitHub repo that includes a Dockerfile, a run script that starts the Docker container and passes the AWS credentials as environment variables, and an example JupyterLab notebook that uses PySpark to connect to S3 and download data. The bucket referenced in the example is private, so you'll need to replace the S3 URI with one that points to publicly available data, or use a bucket that you have permission to access and read.

If you're interested in the data I used then you can find it here: the CDC's COVID-19 dataset.

Good luck, and have fun Sparking!

Monday, January 4, 2021

Notes from installing K3s on my Raspberry Pi cluster...

I put a Raspberry Pi cluster together - now what should I do?

I have a problem when it comes to Raspberry Pis. Actually, I have a problem with impulsivity, but I'll pretend that the issue is exclusive to Raspberry Pis. The great thing is that the cost of the various Raspberry Pi models is low enough that it's relatively cheap to build a cluster. 

I knew I wanted to do the following:

  • Learn more about Kubernetes without using work resources.
  • Try some of the very neat open source projects that are mentioned on the Kubernetes podcast (e.g., OpenFaaS, MinIO, etc.).
  • Play around with some DIY home automation. I'm not completely sure what I want to do yet, but I have a bunch of Philips Hue lights that are begging to be controlled from the cluster.

I did a search for "Raspberry Pi cluster" and "Kubernetes", and found Jeff Geerling's Raspberry Pi Cluster Ep 2 - Setting up the Cluster YouTube video.

Jeff's Kubernetes and Raspberry Pi videos are packed with useful information and he communicates very clearly. Definitely check them out!

I mostly followed his related blog post when I went to install K3s on my cluster. The K3s site has great information as well, and the install steps (all two of them) are very simple to follow. 

However, I ran into an issue or two when I was first setting up the cluster, so I took some notes that I'm providing here. 

Initial OS Selection and Setup

I used Raspberry Pi OS Lite from the Raspberry Pi OS Download page.

Jeff's blog post lists steps for how to copy the Raspberry Pi OS image onto the SD cards, so I'll leave those steps out. However, before I unmounted the SD card I copied the following text to the /boot/cmdline.txt file:

cgroup_enable=cpuset cgroup_memory=1 cgroup_enable=memory
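The kernel parameters in /boot/cmdline.txt all have to stay on a single line, so the flags get added to the end of the existing line rather than on a new one. Here is a small sketch of that edit as a function on the file's text. The function name is hypothetical, and it only adds flags that are not already present.

```python
CGROUP_FLAGS = ["cgroup_enable=cpuset", "cgroup_memory=1", "cgroup_enable=memory"]


def add_cgroup_flags(cmdline):
    """Return cmdline.txt content with the cgroup flags added, on one line."""
    params = cmdline.split()
    for flag in CGROUP_FLAGS:
        if flag not in params:
            params.append(flag)
    # Rejoin on single spaces so everything stays on one line.
    return " ".join(params) + "\n"
```

Running it twice on the same text gives the same result, so it is safe to apply to a card that has already been edited.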

On First Boot

I made the following changes on first boot of each image:

  1. Changed the password
  2. sudo apt-get update && sudo apt-get upgrade -y && sudo apt-get install -y vim
  3. Set the hostname. I named my Raspberry Pis so they matched my network cable colors, so I ran a command similar to this: echo -e "green" | sudo tee /etc/hostname
  4. Set locale, timezone, and WLAN channel country via raspi-config
  5. Updated raspi-config
  6. Changed memory available to the GPU to 16 MB (A3 Memory Split)
  7. Expanded the file system to max size available

There is (or was) an issue with Raspberry Pi OS Lite (Raspbian Buster) after a recent update: the cgroup settings added to /boot/cmdline.txt were being ignored. The workaround is to run a Raspberry Pi update.


Before Installing K3s

A section on rancher.com says that if you are using Raspbian Buster (that is, Raspberry Pi OS Lite), then you should enable legacy iptables:

sudo iptables -F
sudo update-alternatives --set iptables /usr/sbin/iptables-legacy
sudo update-alternatives --set ip6tables /usr/sbin/ip6tables-legacy
sudo reboot

I ran the following on each of the Raspberry Pis to add the nodes to /etc/hosts (each entry needs that node's IP address in front of the tab):

echo -e "\twhite" | sudo tee -a /etc/hosts
echo -e "\tred"   | sudo tee -a /etc/hosts
echo -e "\tgreen" | sudo tee -a /etc/hosts
echo -e "\tblue"  | sudo tee -a /etc/hosts
echo -e "\tblack" | sudo tee -a /etc/hosts

Installing K3s


On the leader node, run the install script without K3S_URL set, which tells the installer to set the node up as a server:

curl -sfL https://get.k3s.io | K3S_KUBECONFIG_MODE="644" sh -s -

Get the token from the leader node and trigger the download and execution of the install script. The first Pi in the cluster has a white network cable, so I arbitrarily chose it as the leader node. I ran the following on each of the follower nodes.

export TOKEN=`ssh -t pi@white sudo cat /var/lib/rancher/k3s/server/node-token`
curl -sfL https://get.k3s.io | K3S_URL= K3S_TOKEN=$TOKEN sh -

Test the Cluster

If you haven’t already installed kubectl, then install it now.

Copy the /etc/rancher/k3s/k3s.yaml file from the leader node of your Raspberry Pi cluster to ~/.kube/config on whatever computer you plan to use to access the cluster. The location might be different on a Windows machine, but that is most likely the correct path if you’re using the bash shell.

scp pi@white:/etc/rancher/k3s/k3s.yaml ~/.kube/config
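One thing to watch for: the k3s.yaml that K3s generates on the server typically points at https://127.0.0.1:6443, which only works on the leader node itself. Here is a minimal sketch of rewriting that address in the copied file, assuming the leader is reachable as "white" as in this cluster. The function name is hypothetical.

```python
def point_kubeconfig_at(kubeconfig_text, host, port=6443):
    """Replace the loopback API server address with the leader node's address."""
    return kubeconfig_text.replace(
        "https://127.0.0.1:6443", f"https://{host}:{port}"
    )
```

For example, read ~/.kube/config, pass its contents through point_kubeconfig_at(text, "white"), and write the result back to the same file.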

Now run kubectl get pods --all-namespaces and see that the cluster already has pods running on the nodes.
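To see how those pods are spread across the nodes, the same command can emit JSON with -o json. Here is a small sketch that counts pods per node from that output. The function name is hypothetical, and pods that have not been scheduled yet are grouped under a placeholder label.

```python
import json
from collections import Counter


def pods_per_node(pods_json):
    """Count pods per node from `kubectl get pods --all-namespaces -o json`."""
    items = json.loads(pods_json)["items"]
    # Pods that are still pending may not have a nodeName yet.
    return Counter(item["spec"].get("nodeName", "<unscheduled>") for item in items)
```

One way to feed it is subprocess.run(["kubectl", "get", "pods", "--all-namespaces", "-o", "json"], capture_output=True, text=True).stdout.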

Let's Run Something!

I found Alex Ellis's blog post called "Will it cluster?" and followed his instructions for installing OpenFaaS. It seems funny now because I had no idea who Alex Ellis was at the time. I only tried OpenFaaS because of the "Will it cluster?" blog post.

I installed OpenFaaS, copied the service and deployment YAML files Alex provided in the "Will it cluster?" post, and was able to use the Figlet application in just a few minutes. Very neat stuff!

I recommend going to the OpenFaas.com site to learn more about using OpenFaaS.

This example shows that sending text to one of the nodes in the cluster (it doesn't matter which node you pick) will run the figlet function - a function that converts your text data into an ASCII art version of the text.

> echo -n "Hello" | curl --data-binary @- http://red:31111
 _   _      _ _
| | | | ___| | | ___
| |_| |/ _ \ | |/ _ \
|  _  |  __/ | | (_) |
|_| |_|\___|_|_|\___/
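The same call can be made from Python with only the standard library. This is a sketch that assumes the node name and port from the curl example above. The function names are hypothetical.

```python
import urllib.request


def figlet_request(text, node="red", port=31111):
    """Build a POST request that sends the text as the raw request body."""
    return urllib.request.Request(
        f"http://{node}:{port}", data=text.encode("utf-8"), method="POST"
    )


def figlet(text, node="red", port=31111):
    """Send the text to the figlet function and return the ASCII art."""
    request = figlet_request(text, node, port)
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.read().decode("utf-8")
```

Calling print(figlet("Hello")) from a machine that can resolve the node name should print the same banner as the curl command.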