To complete your projects, there are several kinds of computing resources available:
Amazon Web Services (AWS) offers cloud computing services that you can rent. There are many products, but the basic ones are the Elastic Compute Cloud (EC2; rent as many servers as you need) and the Simple Storage Service (S3; rent as much storage as you need). EC2 includes general-purpose servers (the T3 instances) as well as servers with powerful GPUs (the P3 instances); check their instance types page for details.
The class has a shared Amazon EMR cluster with Spark installed; see below.
AWS has a free tier of services when you're starting out; use it while you figure out how to get things installed and set up. When you need more computing power, the AWS Educate program offers students $40 in AWS credit.
Note that AWS bills for services by the hour. If you work on your project only a few days a week, you can dramatically reduce your bill just by shutting off your servers while you're not using them. I recommend starting small until your code works, and only then scaling up to many machines.
If you’re likely to use up the $40 credit, ask well in advance about getting more resources; we can likely cover some additional expenses, but ask in advance, and definitely ask before spending any of your own money.
I have set up an Amazon EMR cluster for everyone using Spark in this course to share.
These instructions will expand as I learn more about Spark.
I need your SSH public key to grant you access. If you have used SSH before, look in the
~/.ssh hidden directory for an
id_ed25519.pub file. The contents should look something like this:
ssh-ed25519 LOTSOFAPPARENTLYRANDOMCHARACTERS you@yourcomputer
Paste the entire thing into an email and send it to me.
If you have no SSH public key, follow these instructions for generating one (using your GitHub email address isn't important for us, though) and send me the public key as above. (You don't need to follow their later steps about adding the key to your GitHub account.)
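If you'd rather generate the key directly from the command line, a minimal sketch is below. The file name and comment are placeholders; normally you would accept the default path (~/.ssh/id_ed25519) instead of using -f.

```shell
# Generate an Ed25519 keypair non-interactively.
# -f names the key files, -N "" sets an empty passphrase,
# -C attaches a comment (typically your email address).
ssh-keygen -t ed25519 -f mykey -N "" -C "you@yourcomputer"

# The .pub file is the public half -- the part you email.
cat mykey.pub
```

Whatever you do, only ever share the .pub file.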
Do not send me any file beginning with
-----BEGIN OPENSSH PRIVATE KEY-----. It says PRIVATE KEY for a reason.
I will then set up your account with remote access. I will send you the name of the server you need to SSH into to get access.
Once your SSH access is set up, you can use the server like any SSH server. Use
scp to copy files, or configure your text editor to use SFTP to edit files directly on the server.
The server you log in to is separate from the servers doing the calculation. That means that if you have
somedatafile.txt in your home directory on the server, you will not be able to load it in Spark with
spark.read.text("somedatafile.txt"). The servers run the Hadoop Distributed File System (HDFS), so you need to add the file to HDFS first. You can run
hadoop fs -put somedatafile.txt
to copy the file to your HDFS user directory. There are HDFS commands corresponding to familiar commands like ls, cp, and
rm; run
hadoop fs with no arguments to be shown a list of them all.
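As a sketch, the most common HDFS operations look like this. These commands assume you are logged in to the cluster, where the hadoop command is on the PATH; somedatafile.txt stands in for your own file.

```shell
# Common hadoop fs subcommands (run on the EMR server).
hadoop fs -put somedatafile.txt     # copy a local file into your HDFS home
hadoop fs -ls                       # list your HDFS home directory
hadoop fs -cat somedatafile.txt     # print a file stored in HDFS
hadoop fs -rm somedatafile.txt      # delete it when you're done
```

Paths without a leading / are relative to your HDFS user directory, so these commands only touch your own files.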
The EMR servers have Python 2.7 and R 3.4 installed. You can run ordinary Python or R with the usual commands, or you can run the Spark-integrated shells, pyspark and sparkR.
Both automatically connect to the Spark server and store the connection in the
spark variable for you to use.
GraphFrames is also installed and works as usual.
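For example, a minimal GraphFrames session might look like the sketch below, run inside pyspark (which provides the spark variable). The file names and columns here are hypothetical stand-ins for your own data.

```python
# Hedged sketch, run inside a pyspark session (which provides `spark`).
# vertices.csv and edges.csv are hypothetical files already copied into HDFS.
from graphframes import GraphFrame

vertices = spark.read.csv("vertices.csv", header=True)  # must have an `id` column
edges = spark.read.csv("edges.csv", header=True)        # must have `src` and `dst` columns
g = GraphFrame(vertices, edges)
g.inDegrees.show()   # number of incoming edges per vertex
```

The same GraphFrame object then supports the usual graph queries (motif finding, PageRank, and so on) on the cluster.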
Installing Python or R packages poses problems: computation is distributed to all nodes in the cluster, so all of them need your packages installed. Unfortunately there is no easy way to do this automatically. You can send individual Python files and Zip files of Python files, so if you have
some_module.py and want to
import some_module in your script, use
sc.addPyFile("some_module.py") to have it distributed to the worker nodes.
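Putting that together, here is a sketch of shipping a module to the workers. It assumes a pyspark session, where sc is the SparkContext created alongside spark; some_module.py is your own file, and its transform function is hypothetical.

```python
# Hedged sketch, run inside pyspark: `sc` is the SparkContext that pyspark
# creates for you. some_module.py is your own file; its `transform`
# function is hypothetical.
sc.addPyFile("some_module.py")  # distribute the file to every worker node

import some_module              # now importable on the driver and in tasks

rdd = sc.parallelize([1, 2, 3])
result = rdd.map(some_module.transform).collect()
```

Call addPyFile before any job that uses the module, so the file reaches the workers before your tasks try to import it.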
You may want to separate your Spark code from the code that uses other packages, if possible, by having a Spark file that runs data processing and a separate file that uses packages to do other things to the processed data.