36-651/751: Parallelism in Practice

– Spring 2019, mini 3 (last updated January 23, 2019)

Today, let’s talk about parallelism in practice. We talked about the principles of concurrency and parallelism, and the challenges they bring, but how do we actually make parallel code?

How processes are made

Let’s talk a bit about how processes are made. If I create multiple processes, how exactly do I create them? What does the operating system do?

We’ll talk about the only operating systems that matter: Unix-like systems, including Linux and macOS. Sorry, Windows.

The core operations are fork and exec. These are system calls, meaning they are requests made by programs to the operating system.

fork creates a copy of a process. It’s quite an interesting system call; here’s example code from Wikipedia:

#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
   pid_t pid = fork();

   if (pid == -1) {
      perror("fork failed");
      exit(EXIT_FAILURE);
   }
   else if (pid == 0) {
      // this branch runs in the child process
      printf("Hello from the child process!\n");
      _exit(EXIT_SUCCESS);
   }
   else {
      // this branch runs in the parent process;
      // waitpid waits until the child exits
      int status;
      (void)waitpid(pid, &status, 0);
   }
   return EXIT_SUCCESS;
}

Notice that:

  1. The child process is an exact copy. It even picks up execution where the parent left off, immediately after the call to fork.
  2. fork returns either (a) an error value, (b) 0, in the child process, or (c) the process ID of the child process, in the parent process.
  3. The child inherits a copy of all the memory of the parent.

This last part seems wasteful. Surely we don’t want to have to copy all that memory. Modern operating systems don’t actually copy it; instead they use copy-on-write: both processes share the same memory until one of them modifies a chunk, at which point the modifying process gets its own private copy of that chunk.

fork is traditionally the only way to create processes on Unix-like systems, though there are now several variations and related system calls.

But what if I want to create a different process, not a copy of myself? That’s when exec comes in.

The exec system call replaces a process (and all its memory) with another one. If I want to run another program, I do this:

  1. fork the process.
  2. In the parent process, continue doing stuff.
  3. In the child process, exec the program I want to run. The child is now replaced with that program.
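Sketched in C (the program being launched, ls -l, is just a stand-in for whatever you actually want to run):

#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
   pid_t pid = fork();

   if (pid == -1) {
      perror("fork failed");
      exit(EXIT_FAILURE);
   }
   else if (pid == 0) {
      // child: replace this process with another program
      execlp("ls", "ls", "-l", (char *)NULL);
      perror("exec failed");   // only reached if exec itself fails
      _exit(EXIT_FAILURE);
   }
   else {
      // parent: continue doing stuff, then wait for the child
      int status;
      (void)waitpid(pid, &status, 0);
   }
   return EXIT_SUCCESS;
}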

This may seem like pointless historical detail, but fork and its copy-on-write semantics actually matter: they affect how you share data between processes in multi-process systems.

Solving embarrassing problems with map

Example 1: You have a dataset. You want to fit a model to your dataset. You need to try many different tuning parameter settings and evaluate their test error (or some other metric) to pick the best one.

Python and multiprocessing
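The multiprocessing module provides a Pool of worker processes and a map over that pool. Here is a minimal sketch for Example 1; the body of fit_and_evaluate is only a stand-in for your real model fitting and evaluation:

from multiprocessing import Pool
import random

NUM_TRIALS = 20

def fit_and_evaluate(tuning_param):
    """Fit the model with this tuning parameter setting and return its
    average test error over NUM_TRIALS train/test splits."""
    errors = []
    for trial in range(NUM_TRIALS):
        # stand-in for the real work: fit the model, evaluate on held-out data
        errors.append(random.random() * tuning_param)
    return sum(errors) / len(errors)

if __name__ == "__main__":
    tuning_params = [0.01, 0.1, 1, 10, 100]

    # four worker processes; pool.map splits tuning_params among them
    with Pool(processes=4) as pool:
        test_errors = pool.map(fit_and_evaluate, tuning_params)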

pool.map can take arbitrary functions and arbitrary sequences, and will automatically divide the sequence of data up to be given to different processes in the pool. This does have caveats:

  1. The function (the first argument) must be pickle-able, because multiprocessing pickles the function to pass it to the other processes. (Remember, processes don’t share memory, so the function has to be sent to them.) Unfortunately, while functions can be pickled, object methods cannot be.
  2. The data must also be pickle-able.
  3. Global variables are problematic. The NUM_TRIALS variable is available in all the pool processes. Why? fork! We can read its value without causing problems, but note that if we modify it inside fit_and_evaluate, each process will have its own value, not a shared value.

    But this doesn’t work for every kind of variable.

    For example, suppose you’re using PostgreSQL, open a connection and cursor before creating the pool, and then use that cursor inside the worker function (there’s a sketch of this after the list). This will not work correctly, because PostgreSQL cursor objects are not meant to be shared between processes. You need to create one cursor per process.

    Also be careful if you have open files or datasets and try to write to them from multiple processes.
  4. Global variables are also nice. Suppose your parallel job involves some 10GB dataset; you want to run your analysis on 8 cores, but you do not have 80GB of RAM. If you load the dataset before creating the Pool, that’s fine: as long as the worker processes only read from it and never modify it, copy-on-write means they will not create their own copies of it.
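Here is the kind of PostgreSQL code the cursor warning above is about, sketched with psycopg2; the database name, table, and query are invented for illustration:

import psycopg2
from multiprocessing import Pool

# one connection and cursor, created before the Pool: every worker
# process inherits these same objects through fork
conn = psycopg2.connect("dbname=experiments")
cur = conn.cursor()

def fetch_and_fit(setting):
    # every worker talks over the same inherited connection --
    # exactly the sharing that breaks
    cur.execute("SELECT * FROM trials WHERE setting = %s", (setting,))
    rows = cur.fetchall()
    # ... fit the model to rows ...
    return len(rows)

if __name__ == "__main__":
    with Pool(processes=4) as pool:
        results = pool.map(fetch_and_fit, [1, 2, 3, 4])

The fix is to open the connection and cursor inside each worker (for example, at the top of fetch_and_fit, or in a Pool initializer) rather than before the fork.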

R and parallel

In R, the parallel package works in a similar way. Instead of creating a pool of processes, you make a “cluster”, then submit tasks to the cluster for computation. The cluster can just be a bunch of processes on one computer – created via fork, for example – or it can consist of R processes on many different machines, which parallel connects to via SSH.

Let’s start with the simple case. We can make a fork cluster and submit tasks to it:
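For instance, a sketch along the lines of Example 1, where fit_and_evaluate and tuning_params stand in for your own function and settings:

library(parallel)

cl <- makeForkCluster(4)   # four worker processes on this machine, created via fork

results <- parLapply(cl, tuning_params, fit_and_evaluate)

stopCluster(cl)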

Note that fork clusters do not work on Windows, because there is no fork on Windows.

On a Mac, forking causes problems with graphics: the Quartz graphics window can only be controlled from one process at a time, so it will be disabled in the children.

We can alternatively make a PSOCK cluster, which creates new R processes not by forking but by creating Rscript processes from scratch and sending them code to evaluate over sockets.
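For example, here’s a sketch; the host names and the Rscript path are made up, so substitute the servers you actually have access to:

library(parallel)

cl <- makePSOCKcluster(
    c("hydra1.example.edu", "hydra2.example.edu"),
    rscript = "/usr/bin/Rscript"
)

results <- parLapply(cl, tuning_params, fit_and_evaluate)

stopCluster(cl)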

The rscript argument tells R where to find the Rscript program on the servers; otherwise it tries to find it in the same place it exists on your computer, which on Macs is a bizarre location that does not exist on Linux.

This will create one process per host; repeat the hostname in the first argument if you want to run multiple processes per server. Please do not use all eight cores on every hydra.

Note that if you have to send large datasets for computation, R will have to take the time to send those datasets to each server in the cluster. You might instead arrange to have the data in files on each server (e.g. in your home directory) and have the R processes load the data from there.

Be careful not to have each worker output to the same file and cause conflicts.

The threaded way, with OpenMP

OpenMP is a framework for multithreaded programming that’s largely intended to parallelize different kinds of for loops. It works as part of the C/C++/Fortran compiler, which reads a specially formatted directive (a #pragma, in C and C++) placed above the loop telling OpenMP how the loop should be parallelized.

Because it works for C and C++, it’s available in Rcpp (see the example gallery here) and in Cython (see the documentation here). It requires passing an extra flag to the compiler, usually -fopenmp. If you forget to pass the flag, the compiler ignores the OpenMP directives and compiles everything to run single-threaded.

Let’s use an example from this article. Suppose you’re doing a calculation that involves heavy use of the multivariate normal density; you use the dmvnorm function from the mvtnorm package, but it’s slow. Profiling reveals that it’s a major part of the execution time of your code.

You investigate your code and see that you calculate the density for many points, and you could parallelize these calculations.

Here’s an implementation in Rcpp, using the RcppArmadillo package for its linear algebra.
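The following is a sketch of the approach rather than the article’s exact code; the function name dmvnorm_omp and its arguments are mine, but the Cholesky-based log-density calculation is standard:

// [[Rcpp::depends(RcppArmadillo)]]
// [[Rcpp::plugins(openmp)]]
#include <RcppArmadillo.h>

// [[Rcpp::export]]
arma::vec dmvnorm_omp(const arma::mat& x, const arma::rowvec& mean,
                      const arma::mat& sigma) {
  int n = x.n_rows;
  int d = x.n_cols;
  arma::vec logdens(n);

  // constants that don't depend on the data point, computed once
  arma::mat rooti = arma::inv(arma::trimatu(arma::chol(sigma)));
  double rootisum = arma::sum(arma::log(rooti.diag()));
  double constant = -0.5 * d * std::log(2.0 * arma::datum::pi);

  // each data point's density is independent, so split the loop across threads
  #pragma omp parallel for schedule(static)
  for (int i = 0; i < n; i++) {
    arma::vec z = rooti.t() * arma::trans(x.row(i) - mean);
    logdens(i) = constant + rootisum - 0.5 * arma::dot(z, z);
  }
  return arma::exp(logdens);
}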

Notice the #pragma. We’re telling OpenMP to run the loop in parallel on a “static” schedule, meaning it decides in advance how to divide up the data between threads. (There are other schedules, like “dynamic”, which hands out work to each thread as it finishes, so if some loop iterations are much faster than others, the threads that finish their chunks early get more data to work on.)

There are other options that can be included in the #pragma line. There are a bunch – review the very detailed guide here – so I’ll just cover a few. Some can be used to declare whether variables inside the loop should be shared or private:
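A small sketch in C; the per-iteration work here is trivial just to keep the example self-contained:

#include <stdio.h>

int main(void)
{
    int n = 1000;
    double a;
    double b = 0;

    #pragma omp parallel for private(a) shared(b)
    for (int i = 0; i < n; i++) {
        a = i * 0.5;      // each thread works with its own copy of a

        #pragma omp atomic
        b += a;           // additions to the shared b use an atomic instruction
    }

    printf("b = %f\n", b);
    return 0;
}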

Here a is private – each thread gets its own independent copy – and b is shared between cores, with a note to OpenMP requesting it to use an atomic instruction to do the addition. There are other options for determining what the value of a should be after the loop; e.g. lastprivate(a) says that after the loop, a should have the value it had in the last loop iteration.

OpenMP also understands certain kinds of reductions, including +, -, and *:
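For example, a sketch that multiplies the loop counter into fac with a * reduction:

#include <stdio.h>

int main(void)
{
    int number = 10;
    int fac = 1;

    // each thread keeps a private partial product;
    // OpenMP multiplies them together when the loop finishes
    #pragma omp parallel for reduction(*:fac)
    for (int i = 2; i <= number; i++) {
        fac *= i;
    }

    printf("%d! = %d\n", number, fac);
    return 0;
}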

This calculates the factorial in parallel. (Obviously this isn’t terribly useful, because when number is large enough for the parallelization to help, fac will overflow the maximum size of an integer anyway.)

You can have loops with multiple reduction variables. You can also manually construct parallel loops with atomic operations, SIMD, and parallel chunks that aren’t loops (e.g. you can ask for the same chunk of code to be run on every thread).

Cython works by spitting out C or C++ code from the Cython code you write, and it supports OpenMP as well. For example:
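Here’s a sketch modeled on the parallel-sum example in the Cython documentation; the function name is mine:

# distutils: extra_compile_args = -fopenmp
# distutils: extra_link_args = -fopenmp
from cython.parallel import prange

def sum_of_squares(int n):
    cdef int i
    cdef long s = 0
    for i in prange(n, nogil=True):
        s += i * i
    return s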

prange tells Cython that this is a parallel range, and so it generates a for loop with the #pragma attached. Cython automatically recognizes that s is a reduction variable and generates the appropriate #pragma.

Parallel loops are not the only way

Tools like OpenMP are well-adapted for scientific computation, which often involves lots of big for loops on big datasets. The parallel and multiprocessing packages are good at solving this kind of problem (though multiprocessing also provides ways to start different processes and manually have them do different things).

But not every parallel task is a for loop or a map. In other cases, we have tasks to run and we don’t need to wait for them to finish.

One way is to manually create and manage multiple threads. But there is another abstraction that is quite useful here: the idea of futures and promises.

These were designed for dealing with concurrency, though parallelism is an added benefit.

Some motivating examples for concurrency:

  1. When a user visits a website, the server does database queries to build the page for them. Why should processing stop while waiting for the query results?
  2. I have an interactive application that users access to analyze data. When they press the big “Analyze Data” button, it will take several minutes to run, but I want them to be able to press other buttons while waiting.
  3. After each step of my analysis, I write the intermediate results to disk in case my code crashes – but there’s no need to wait for this to finish before starting the next step.

Promises and futures are a way to solve this problem, built on top of threading ideas.

I’ll write examples with the R future package, since the syntax will be familiar.

A future represents some value that may be available in the future. It may already have been calculated, or it may not have been.

We’ll use a statistical example: I’d like to run several models, but I only need their results later, to compare them.

library(future)

linear.fit %<-% lm(y ~ ., data=huge_dataset)
lasso.fit %<-% lasso(y ~ ., data=huge_dataset)
## and many more

linear.predictions <- predict(linear.fit, test_data)
## and so on...

The first few lines run without waiting for the fits to complete; instead, linear.fit and lasso.fit are promises: placeholders representing results that will be available later. Only when we try to use these values does R block and wait for the calculation to complete.

This can happen sequentially, with R calculating one value after another, or it can use methods from parallel to calculate the values in different processes automatically.
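Which of these happens is controlled by the plan; a sketch using two of the package’s built-in backends:

plan(multisession, workers = 4)   # resolve futures in four background R sessions
# plan(sequential)                # or resolve them one after another, in this session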

Naturally there’s a future.apply package providing apply functions built on futures, and various other adaptations.
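For instance, a sketch reusing the placeholders from Example 1:

library(future.apply)

plan(multisession)
test_errors <- future_lapply(tuning_params, fit_and_evaluate)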

There are many variations on futures and promises in different programming languages, often involving the keywords async and await – Python has recently added these to the core language, JavaScript has made extensive use of them in recent years, and Java, C++, C#, and other major languages have their own variations.

Is parallelism what you need?

Before investing significant effort into writing parallel code, you should ask yourself: is parallelism what you really need?

Use your language’s profiling tools. Find out what part of the code takes up the most time. Remember Amdahl’s law: the total run time can never drop below the time spent in the parts you cannot parallelize, so if you can only parallelize a function that takes 20% of your analysis’s running time, you are still stuck with the other 80%. Even with infinitely many cores, the best possible speedup in that case is 1/0.8 = 1.25×.

In R and Python, profiling can often reveal ways of speeding up your code that don’t require parallelism: places where you can use better data structures, redundant calculations you can avoid, expensive operations you can move outside of hot loops. Parallelization can often make code more confusing or hard to use, so use it after you’ve done other performance tricks, not before.