36-651/751: Parallel Computing

Spring 2019, mini 3 (last updated January 22, 2019)

(Based on notes originally developed by Chris Genovese.)

Moore’s Law roughly captures the rapid growth in processing power, especially in the increasingly dense packing of transistors on a single chip.

But with current technologies, there are decreasing returns from increasing this density.

And as researchers, we are always pushing the edge of what is feasible with whatever technology we are using: larger problems, bigger data sets, more intensive methods.

The result is that we need more power.

As we push the limitations of a single processor, it makes sense to start thinking about how to use multiple processors effectively, especially as ordinary computers – even phones and tablets – routinely ship with multiple processors that can work in parallel.

Indeed, there is a significant current trend toward distributed computing: computation run on a network of processors that may be in different locations and may share various resources.

Concurrency versus Parallelism

There are two related but distinct concepts:

Concurrency means managing multiple tasks whose execution overlaps in time, even if only one is actually running at any instant. It is common in applications like web servers, which must be able to deal with thousands of people trying to visit a website at the same time. Even if the server only has one processor and can only run one set of instructions at a time, it can handle the requests concurrently: for example, while waiting for a file to be read from the hard drive to serve to one user, it might do some processing for several other users, returning to serve the first user once the file is ready.

Parallelism involves having several processors or CPU cores doing work simultaneously.

The benefits of exploiting concurrency and parallelism in our algorithms and computations are clear: we can get more work done in the same wall-clock time, keep the processor busy while slow operations like disk reads and network requests complete, and tackle problems too large for any single processor.

But concurrency and parallelism come with serious challenges: programs become much more complex, their results can depend on the order in which tasks happen to run, and their bugs can be very hard to reproduce and diagnose.

The complexity is a particular problem, often leading to infuriating Heisenbugs: because the code's behavior depends on the order in which processes or tasks happen to run, even minor or apparently unrelated changes can drastically alter what it does. It can be quite easy to accidentally write incorrect code.

Concurrency Concepts

Concurrency involves managing multiple tasks or processes that share resources, like data in memory or files on the hard drive. The processes may run successively, asynchronously, or simultaneously.

The challenge of concurrent programming centers on controlling access to the shared resources. This can get complicated: concurrent execution of even simple tasks can lead to nondeterminism.

Operating systems have some basic features designed to allow for concurrency, such as processes and threads.

Even with these features, there are three main concurrency challenges, each discussed below: race conditions, mutual exclusion, and deadlock.

Processes and Threads

In early computers, there was just one set of instructions running on the processor – one chunk of memory defining what the processor should do. Those instructions had to handle everything. Early operating systems, like DOS, did very little: DOS only allows one program to run at a time, and provides very basic services (like access to the hard drive) to that program.

Modern operating systems are much more sophisticated. The fundamental unit of abstraction is the process. You can think of a process as a set of instructions – a computer program – combined with the current activity of that process, including things like the data it stores in memory and the files it currently has open. Processes are usually isolated from each other: without special privileges from the operating system, they cannot access each other’s memory or interfere with the execution of each other.

Modern multitasking operating systems allow users to start multiple processes at the same time, and you likely have hundreds running on your computer right now. Processes may run in parallel if there are several processors, but they are always concurrent.

Beneath the process is the thread. Sometimes, a single program may want to be able to run several things simultaneously, all having access to its memory and resources. You can hence create multiple threads inside a single process, each sharing access to its memory. For example, your web browser may have a thread to respond to user input (typing, clicks on menus and buttons, and so on) and separate threads for tasks such as decoding compressed images or parsing HTML, so the browser does not appear unresponsive when viewing a large image or webpage.

Operations that can be executed properly as part of multiple concurrent threads are said to be thread safe. Beware the use of non-thread-safe operations in a concurrent context.

Note that processes and threads are not free. Spawning them (yes, that’s the term) requires your program to make a request to the operating system to make a new program or thread, and this takes some time for the operating system to do the necessary bookkeeping and fulfill the request. Creating new threads every time you perform a matrix operation, say, could be quite slow; concurrency and parallelism frameworks often create a thread pool of pre-made threads ready to run whatever tasks are needed.
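For example, Python's standard-library concurrent.futures module provides ready-made thread (and process) pools. Here is a minimal sketch, where the analyze function and the list of chunks are hypothetical placeholders for whatever work you need done:

import concurrent.futures

def analyze(chunk):
    # Stand-in for some per-chunk computation
    return sum(chunk)

chunks = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]

# The pool's threads are created once, up front, and reused for every
# task submitted to it -- much cheaper than spawning a thread per task.
with concurrent.futures.ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(analyze, chunks))

print(results)  # [6, 15, 24]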

Common Case: Multiple programs running on your laptop

Q: How does a single-processor computer seemingly execute multiple processes and threads at the same time?

For instance, you are running your data analysis while reading Twitter in your web browser.

Basic strategy: Slice time into small intervals and allow each process to run exclusively during any single interval. If the interval is small enough, then our perception is that they are running simultaneously. Your operating system does this automatically, dozens or hundreds of times each second.

When a process is operating during an interval it is said to be running; when it loses that privilege it is said to be blocked. We sometimes say things like “reading a file is a blocking operation”, meaning that execution of the process is suspended (and other programs run) until the hard drive fetches the requested file and has it ready.

To switch processes or threads requires a context switch that saves the current register values and instruction state (i.e. which instruction is next to run) to main memory. This is handled by the operating system.

Context switching is expensive, since it requires main memory access, and we have to make sure that every process gets the time it needs to run. Operating system designers spend a great deal of effort on scheduling algorithms to make sure that, say, your web browser can respond immediately when you click the Retweet button, but your data analysis still gets plenty of time to run in the background.

Making Things Fast with Concurrency

Concurrency need not involve parallelism: there may be just one processor doing many tasks concurrently. It’s obviously helpful for a machine shared between many processes, so they all appear to run. But is there any reason to make one process concurrent? Would it make it faster?

Let’s think of a few examples.

  1. A web server receives a request from a client (“give me this web page”), fetches data from files or SQL databases to produce the requested page, and sends the page back. Many clients send requests at nearly the same time.
  2. A web scraper sends a request to a website (“give me this web page”), waits for the response, extracts the necessary data from the page, stores it somewhere, and perhaps adds new pages to the list of pages to scrape. There are many web pages to scrape.
  3. An analytics dashboard sends many complicated SQL queries to a database server, waits for the server to compute the results, formats the results into graphics, and displays them.

Q: How could concurrency without parallelism make these faster or more efficient?

Example: Shared counter

Suppose you have two threads, each sharing access to an array of data and a variable, Counter. Each thread can read and modify Counter.

These threads may be running in parallel, or they may be running on a single-core machine where the operating system switches back and forth between threads quickly, as described above.

Consider the following code for threads A and B:

Data = [10.2, 11.4, -17, 1.0, 0.0, 7, ...]

A:     Set Counter = 0
       Set Counter = Counter + 1
       Write Data[Counter:3*Counter] to file foo.out

B:     Set Counter = Counter + 1
       Write Data[3*Counter+1:5*Counter] to file foo.out

Ignoring the file writes, this actually breaks down into the following instructions:

A1:  Set Counter = 0
A2:  Read Counter into localx
A3:  Set Counter = localx + 1
B1:  Read Counter into localy
B2:  Set Counter = localy + 1

At any time in the process, the operating system may context switch between threads A and B, saving one thread’s register state and restoring the other’s.

If these are executed in the order A1,A2,A3,B1,B2, the Counter will end up with a value of 2.

But if they are executed in the order A1,A2,B1,A3,B2, the Counter will end up with a value of 1.

If they are executed in the order B1,A1,A2,A3,B2, then the value of Counter can be effectively random. Q: Why?

This is a race condition: both processes reference the same location and its value depends on the order of reads and writes. Since the operating system can switch from one thread to the other essentially at any time, in any order, we cannot predict the results.

Notice this happens even though only one instruction is executed at a time; concurrency is the problem here, not parallelism.
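You can reproduce this in ordinary Python with two threads sharing a counter. Here is a minimal sketch that mirrors the read/write breakdown above; the exact final value varies from run to run, and is typically less than the 2,000,000 increments actually performed:

import threading

counter = 0

def increment_many(n):
    global counter
    for _ in range(n):
        local = counter       # read Counter into a local variable
        counter = local + 1   # write the incremented value back

threads = [threading.Thread(target=increment_many, args=(1_000_000,))
           for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# If a thread switch lands between the read and the write, increments
# made by the other thread in the meantime are silently overwritten.
print(counter)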

One way to help solve this problem would be to make the increment operations atomic: an uninterruptible series of instructions that are guaranteed to execute to completion when started.

Now, when these two processes write output to a file based on their calculations – say, as the coordinates for some particular point – the output can also be garbled. The writes involve many instructions and if they are interleaved between the two processes, the data produced in the file can come out in many orders, leading to the famous knock-knock joke:

“Knock knock”

“Race condition!”

“Who’s there?”

This is the mutual exclusion problem. Unless each thread can get exclusive access to the file while writing to it, the output is potentially useless.

A common statistical case: You run many simulations in parallel. They write their results to simulation-output.csv, one row at a time. To your surprise, not only are the rows in a random order, but individual lines are mashed together, because one line was written while another was being written.

Atomic Operations and Locks

One key feature that helps solve these concurrency problems is atomic operations. Here, “atomic” means “smallest indivisible unit”, in the sense that an atomic operation is indivisible: either the entire operation completes, or none of it does. It is impossible for only part of the operation to complete.

In the deadlock example described below, where two processes each need both input.txt and output.txt, if each process could acquire input.txt and output.txt atomically, deadlock would be impossible, because it would be impossible for one process to hold one file without holding the other – the operating system would not grant it access to one and not the other.

But how do we get atomic operations? They are provided in several ways.

Computer processors frequently support atomic instructions. x86 has a fetch-and-add instruction, for example, that reads a number and adds a value to it atomically; this would solve the problems with Counter in the previous section.

But how do we solve the problem in general, without an atomic CPU instruction for everything we need to do?

One common atomic instruction is test-and-set, which tells the processor to do the following:

  1. Read the value from a particular memory location.
  2. Set that value to 1.
  3. Return the original value.

Crucially, only one test-and-set can operate on a memory location at a time – even if several processors try to run test-and-set on the same memory location at the same time, they will coordinate to ensure that one is run before the other, so the sequence is uninterrupted.

How is this useful? We can use it to implement a lock, a way for two threads or processes to cooperate by declaring who has access to a resource when.

Consider another example with threads A and B:

Data = [10.2, 11.4, -17, 1.0, 0.0, 7, ...]
Lock = 0

A:     Analyze Data[1:100]
       While test_and_set(Lock) == 1, wait
       Write results to file foo.out
       Set Lock = 0


B:     Analyze Data[101:200]
       While test_and_set(Lock) == 1, wait
       Write results to file foo.out
       Set Lock = 0

Each thread can analyze each half of the data simultaneously or in any order, but as soon as one thread begins to write to foo.out, the other will be forced to wait until it is done before doing so – until the first thread releases the lock by setting Lock to 0.

This type of lock is called a spinlock, because the thread stuck waiting for the other to complete simply checks the lock in a loop repeatedly (“spins”) until the lock is released. This can be inefficient if the operating system keeps running this thread, wasting processor time checking on a lock that hasn’t yet changed. More advanced locking mechanisms coordinate with the operating system to inform it that the thread is waiting on a lock to be released, so it can switch to executing another thread that has real work to do.

You won’t usually write your own locks with test-and-set. (There are other atomic instructions, like compare-and-swap, that are also used.) Your programming language libraries usually will provide locking mechanisms.
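In Python, for instance, threading.Lock plays this role, and waiting threads are put to sleep by the operating system rather than spinning. Here is a minimal sketch that protects the shared counter from the race condition example earlier:

import threading

counter = 0
counter_lock = threading.Lock()

def increment_many(n):
    global counter
    for _ in range(n):
        # Only one thread may hold the lock at a time; any other thread
        # blocks here until the lock is released at the end of the block.
        with counter_lock:
            local = counter
            counter = local + 1

threads = [threading.Thread(target=increment_many, args=(1_000_000,))
           for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(counter)  # now always 2,000,000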

Returning briefly to thread safety: you may have heard that Python and R are not thread-safe. What does this mean? The internal data structures – the variables where the R and Python interpreters keep track of your variables, handle memory allocation, and so on – are not protected with locks or atomic operations. If two Python threads ran interpreter code at the same time, they could experience race conditions and corrupt that internal state; hence Python has a Global Interpreter Lock (GIL) that locks the entire interpreter, so only one thread can run Python code at a time. The multiprocessing library gets around this by creating separate processes, which do not share memory and data structures.

Q: Python has a threading module for creating multiple threads. If they can’t run simultaneously because of the GIL, how are they useful? Think of an example case.

Deadlock and Dining Philosophers

Deadlock occurs when multiple processes block waiting for resources that will never be released.

Example: Processes A and B need mutually exclusive access to both input.txt and output.txt and block until they have them. Because of concurrent execution, A gets access to input.txt but, before it gets access to output.txt, B gets access to output.txt first. Both processes then block forever.
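Here is a minimal sketch of that scenario, using Python locks to stand in for exclusive access to the two files. The sleep just makes the unlucky interleaving reliable; run it and it will typically hang forever:

import threading
import time

input_lock = threading.Lock()    # stands in for exclusive access to input.txt
output_lock = threading.Lock()   # stands in for exclusive access to output.txt

def process_a():
    with input_lock:             # A grabs input.txt first...
        time.sleep(0.1)
        with output_lock:        # ...then waits for output.txt, held by B
            print("A has both files")

def process_b():
    with output_lock:            # B grabs output.txt first...
        time.sleep(0.1)
        with input_lock:         # ...then waits for input.txt, held by A
            print("B has both files")

a = threading.Thread(target=process_a)
b = threading.Thread(target=process_b)
a.start()
b.start()
a.join()                         # never returns: both threads are deadlocked
b.join()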

This is illustrated in the famous Dining Philosophers problem: several philosophers are sitting around a circular table, with a fork on the table between each pair. The philosophers will randomly alternate between thinking and eating spaghetti, but they will only eat their spaghetti if they have a fork in both hands. When they are done eating they will lay down their forks; when they are done thinking, they will randomly look left and right, picking up a fork if it is available.

Q: Can the philosophers all starve?

Proposed concurrent systems involving locks and shared resources usually require quite careful design to avoid deadlock; tools like TLA+ exist to allow formal proofs of correctness of concurrent algorithms.

Concurrent Data Structures

One common problem: you would like to have several threads work concurrently. They need to share and modify a data structure to do so. How do you prevent race conditions? Threads share address space, so they have access to the same memory, but accessing the data structure without care can easily cause problems.

Example: Your algorithm involves traversing a large graph using a priority queue. If the priority queue could be shared between threads, multiple threads could process nodes concurrently. This might be useful if every node requires a lot of computation to process.

Option 1: Use one big lock to control access to the data structure, preventing two threads from modifying it at the same time. For example, put locks around enqueue and dequeue so they may only be called from one thread at a time.

Problem: If threads use this data structure a lot, they may spend much of their time waiting for the lock to open.

Option 2: Use a concurrent priority queue. Instead of a single lock, such a queue uses atomic instructions for critical portions of the code, to ensure no thread ever has to wait for a lock. Atomic operations are slower than normal operations, but avoid accidental deadlock and can be faster than naively locking everything.

Many programming languages have concurrent data structure libraries. Python’s queue module, in the standard library, uses locks to share queues between threads; libcds has many lock-free data structures for C++; Java has a collection of concurrent collections; and so on.

A common communication system between threads is message passing, in which each thread has its own queue that other threads can write messages to. Each thread periodically dequeues messages from its queue to process.
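Python's queue.Queue, which is thread-safe, works well for this. Here is a minimal producer/consumer sketch; passing None as a sentinel is just one common convention for saying “no more messages”:

import queue
import threading

inbox = queue.Queue()            # the worker thread's own message queue

def worker():
    while True:
        msg = inbox.get()        # blocks until a message arrives
        if msg is None:          # sentinel: no more work is coming
            break
        print("processing", msg)

t = threading.Thread(target=worker)
t.start()

for item in [1, 2, 3]:
    inbox.put(item)              # send messages to the worker thread
inbox.put(None)                  # tell the worker to shut down
t.join()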

Parallelism Concepts

Parallel Architectures

Computer architectures have introduced parallelism at several levels: vector (SIMD) instructions that apply one operation to several data values at once within a single core, multiple cores on a single chip, multiple processor chips within a single machine, and clusters of many machines connected by a network.

In addition, data and processing capabilities are increasingly distributed, meaning they live and run on many different computers, possibly in different locations. This raises a variety of challenging problems in data storage and management.

(cf. Flynn’s taxonomy: single S or multiple M instructions I or data D, yielding: SIMD, MIMD, SISD, MISD.)

Embarrassing Parallelism

With all these types and levels of parallelism, the first question to ask is: what kinds of problems are suitable for parallelization?

The simplest case is problems that are embarrassingly parallel: problems featuring many tasks that can be trivially run simultaneously. Examples: running the same simulation many times with different parameters or random seeds, computing bootstrap replicates, cross-validating a model over many folds, or applying the same analysis to many separate datasets.

Q: Can you think of other examples?

Many of these problems are maps, in the functional programming sense: they involve having many chunks of data and one operation that must be applied (mapped) to each chunk separately. The results are not interdependent and order does not matter.

Programming languages often provide support for this pattern, since it’s simple to handle with multiple threads or processes each running the same code:

Python
The multiprocessing package provides a map method that can apply a function to each element of a sequence (e.g. a sequence of data chunks or parameter values). multiprocessing automatically starts multiple Python processes, splits up the data, and runs the function on different chunks in different processes.
R
The parallel package is now built in to R, replacing snow and multicore. It provides parallel versions of the apply functions, such as parLapply. furrr provides parallel versions of purrr’s map functions.
Julia
The @distributed macro can turn for loops into parallel loops run on multiple processes/cores.
C, C++, Fortran
OpenMP can turn certain for loops into parallel loops running on multiple threads simultaneously.

The R and Python packages also support remote processes, meaning the code and data can be sent to several machines to run on all of them in parallel.

Note a caveat: as mentioned above, spawning threads or processes has a cost. Parallelizing an operation that already takes only a few milliseconds likely won’t make it appreciably faster. If you do ten million of those operations, the parallelism will only be worthwhile if you create the threads once, not ten million times. This is why Python’s multiprocessing supports creating a “pool” of processes and then using it repeatedly for different tasks.
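For instance, here is a minimal sketch with Python's multiprocessing, where the simulate function and its parameter values are hypothetical stand-ins for whatever embarrassingly parallel task you have:

import multiprocessing

def simulate(param):
    # Stand-in for one independent simulation run
    return param ** 2

if __name__ == "__main__":
    params = range(1000)

    # The pool of worker processes is created once and reused for every task.
    with multiprocessing.Pool(processes=4) as pool:
        results = pool.map(simulate, params)

    print(results[:5])  # [0, 1, 4, 9, 16]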

Less Embarrassing Parallelism

Not every task is embarrassingly parallel, and even when some parts of your code are easily parallelized, other parts may not have obvious parallel versions.

One common departure from embarrassing parallelism is when you want to combine results somehow. Transforming every element in an array is easily parallelized, but adding every element is not: the sum of the first 1000 elements depends on the sum of the first 500, and so on. Similarly, the result of the 47th iteration of an iterative algorithm depends on the 46 previous steps.

Functional programming again saves the day. Some of these problems can be written as reductions, and if the reduction operation is associative, then steps can be run in parallel.

Consider taking the sum of a large array. Addition is associative, so we can sum the first 100 elements on one CPU and the second 100 on another CPU, then add the two final results. Associativity means it does not matter how we split up the elements or which order we add them – our result will be the same.

You can implement this manually by splitting your data into chunks and giving each chunk to a separate process or thread.
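Here is a minimal sketch of that manual approach, again with Python's multiprocessing; the chunk count and worker count are arbitrary choices:

import multiprocessing

def chunk_sum(chunk):
    return sum(chunk)            # each worker reduces its own chunk

if __name__ == "__main__":
    data = list(range(1_000_000))
    n_chunks = 4
    size = len(data) // n_chunks
    chunks = [data[i * size:(i + 1) * size] for i in range(n_chunks)]
    chunks[-1].extend(data[n_chunks * size:])   # any leftover elements

    with multiprocessing.Pool(processes=n_chunks) as pool:
        partial_sums = pool.map(chunk_sum, chunks)

    # Addition is associative, so combining the partial sums gives the same
    # answer as summing the whole array in order.
    print(sum(partial_sums), sum(data))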

Alternately, OpenMP supports parallel reductions, in limited forms. If a for loop updates a variable each iteration using one of OpenMP's supported reduction operators (like + or *), OpenMP can automatically transform the loop into a parallel reduction.

MapReduce – implemented by Hadoop, among others – is a framework for splitting up a large dataset, applying a map function to each element, and then applying a reduction, with all operations automatically parallelized to multiple cores or even multiple machines. It automatically distributes the needed data files and ensures that if one computer crashes before it finishes its part of the work, that part is finished by another machine.

Sometimes one adds “keys”, so that all data with the same key must be reduced on the same core.
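As an illustration of the pattern (not of Hadoop itself), here is the classic word count written serially in plain Python; a MapReduce framework runs the map and reduce steps in parallel and routes all values for a given key to a single reducer:

from collections import defaultdict

documents = ["the cat sat", "the dog sat"]

# Map: emit a (key, value) pair for each word in each document
mapped = [(word, 1) for doc in documents for word in doc.split()]

# Shuffle: group values by key; in a real framework, all pairs with the
# same key are sent to the same reducer (core or machine)
groups = defaultdict(list)
for key, value in mapped:
    groups[key].append(value)

# Reduce: combine each key's values with an associative operation
counts = {key: sum(values) for key, values in groups.items()}
print(counts)   # {'the': 2, 'cat': 1, 'sat': 2, 'dog': 1}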

Examples of parallel algorithms using reductions: summing or averaging a huge dataset, finding its maximum or minimum, counting how often each word appears in a large corpus, or computing a log-likelihood by summing the contributions of independent observations.

Amdahl’s Law is useful to keep in mind: parallelizing your program only speeds up the parallel portions, so the overall speedup is limited by the time spent in the parts that must still run serially.
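In its usual form, Amdahl’s Law says that if a fraction p of a program’s running time can be parallelized and we use n processors, the best possible speedup is

speedup = 1 / ((1 - p) + p / n)

For example, if 90% of the running time is parallelizable, then even with unlimited processors the program can never run more than 1 / (1 - 0.9) = 10 times faster.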