36-651/751: Computing Resources

– Spring 2019, mini 3 (last updated March 5, 2019)

Several kinds of computing resources are available for completing your projects:

General-purpose department servers
These servers run Linux, have 32GB of RAM and 8 cores, and already have R, Python, and common compilers installed. They’re accessible via SSH.
Department GPU servers
The department has two servers with GPUs installed, along with common deep learning frameworks. Ask about getting access.
Amazon Web Services

This offers cloud computing services that you can rent. There are a bunch of products, but the basic ones are the Elastic Compute Cloud (rent as many servers as you need) and the Simple Storage Service (rent as much storage as you need). EC2 includes general computing servers (the T3 instances) as well as ones with powerful GPUs (the P3 instances); check their instance types page for details.

The class has a shared Amazon EMR cluster with Spark installed; see below.

AWS has a free tier of services for accounts that are starting out, which you can use while you figure out how to get things installed and set up. When you need more computing power, the AWS Educate program offers $40 in AWS credit to students.

Note that AWS bills for services by the hour. If you work on your project a few days a week, you can dramatically reduce your bill just by shutting off your servers while you’re not using them. I recommend starting small until you’ve gotten your code working, and only then scaling up to many machines.

If you’re likely to use up the $40 credit, ask well in advance about getting more resources; we can likely cover some additional expenses, but ask in advance, and definitely ask before spending any of your own money.

Our EMR cluster

I have set up an Amazon EMR cluster for everyone using Spark in this course to share.

These instructions will expand as I learn more things about Spark.

Getting access

I need your SSH public key to grant you access. If you have used SSH before, look in the hidden ~/.ssh directory for an id_rsa.pub or id_ed25519.pub file. The contents should look something like this:
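For example, an Ed25519 public key is a single line of this shape (the key data and address below are a made-up placeholder, not a real key):

```
ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAEXAMPLEKEYDATAEXAMPLEKEYDATA you@yourlaptop
```

RSA keys look similar but start with ssh-rsa and have a much longer block of key data.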


Paste the entire thing into an email and send it to me.

If you have no SSH public key, follow these instructions for generating one (using your GitHub email address isn’t important for us, though) and send me the public key as above. (You don’t need to follow their ssh-agent steps.)
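The short version of those instructions is a single ssh-keygen command (the email address here is just a placeholder used as a comment to label the key; substitute your own):

```
ssh-keygen -t ed25519 -C "you@example.com"
```

Accept the default file location when prompted, and the public key will be written to ~/.ssh/id_ed25519.pub.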

Do not send me any file beginning with -----BEGIN OPENSSH PRIVATE KEY-----. It says PRIVATE KEY for a reason.

I will then set up your account with remote access. I will send you the name of the server you need to SSH into to get access.

Logging in and adding files

Once your SSH access is set up, you can use the server like any SSH server. Use scp to copy files, or configure your text editor to use SFTP to edit files directly on the server.
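For instance, to upload a data file from your laptop (the username and server name below are placeholders; use the ones I send you):

```
scp somedatafile.txt yourusername@cluster.example.com:
```

The trailing colon puts the file in your home directory on the server.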

The server you log in to is separate from the servers doing the calculation. That means that if you have somedatafile.txt in your home directory on the server, you will not be able to load it in Spark with spark.read.text("somedatafile.txt"). The servers run the Hadoop Distributed File System (HDFS), so you need to add the file to HDFS first. You can run hadoop fs -put somedatafile.txt to copy the file to your HDFS user directory. There are HDFS commands corresponding to the familiar commands like cp, mv, ls, rm, and so on; run hadoop fs with no arguments to see a list of them all.
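The basic workflow on the login server looks like this (these are standard hadoop fs subcommands; they only work on the cluster itself):

```
# copy a local file into your HDFS user directory
hadoop fs -put somedatafile.txt

# check that it arrived
hadoop fs -ls

# remove it when you no longer need it
hadoop fs -rm somedatafile.txt
```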

Installed software

The EMR servers have Python 2.7 and R 3.4 installed. You can run ordinary Python or R with the usual commands, or you can run the Spark-integrated versions:
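The Spark-integrated consoles are launched with the standard Spark command names:

```
pyspark    # Python console with a Spark connection ready to use
sparkR     # R console with a Spark connection ready to use
```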

Both automatically connect to the Spark server and store the connection in the spark variable for you to use.

GraphFrames works as usual:
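For example, you can load the GraphFrames package when launching pyspark (the version coordinate below is only illustrative; check the GraphFrames package listing for the coordinate matching the cluster's Spark version):

```
pyspark --packages graphframes:graphframes:0.6.0-spark2.3-s_2.11
```

The same --packages argument works with spark-submit.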

Note that there is not a convenient way to load GraphFrames in the EMR Notebooks feature; the notebooks don’t provide any way to load Spark packages. You may have to stick with the pyspark console or the spark-submit command, passing the --packages argument each time.

Installing Python or R packages poses problems: computation is distributed to all nodes in the cluster, so all of them need your packages installed. Unfortunately there is no easy way to do this automatically. You can send individual Python files and Zip files of Python files, so if you have some_module.py and want to import some_module in your script, use sc.addPyFile("some_module.py") to have it distributed to the worker nodes.
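A sketch of that workflow, bundling helper code into a Zip archive that sc.addPyFile() can ship to the workers. The module name and its contents are made up for illustration, and the commented calls at the end only work inside a running Spark session:

```python
import zipfile

# a stand-in helper module, written out just for this demo
with open("some_module.py", "w") as f:
    f.write("def greet():\n    return 'hello from a worker node'\n")

# bundle one or more helper modules into a Zip archive
with zipfile.ZipFile("helpers.zip", "w") as zf:
    zf.write("some_module.py")

# inside pyspark or a spark-submit script, you would then run:
# sc.addPyFile("helpers.zip")   # distributed to every worker node
# import some_module            # now importable inside Spark tasks
```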

If possible, separate your Spark code from the code that uses other packages: have one file that does the Spark data processing and a separate file that uses those packages on the processed output.