36-651/751: Computer Architecture

– Spring 2019, mini 3 (last updated January 17, 2019)

(Based on notes originally developed by Chris Genovese.)

A Brief Primer on Computer Processors

Here’s a (somewhat simplified) picture of computer architecture:

[Diagram: the CPU (control unit, ALU, registers, and cache), main memory, external storage, and ports, all connected by the bus.]

The Components of Processor Architecture

  1. The Central Processing Unit (CPU)

    The CPU is the control center of the computer. It carries out the instructions given by computer programs for arithmetic, logic, system control, and input/output.

    There are many CPU designs with different performance trade-offs. The main components of the CPU are:

    Control Unit
    manages the execution of instructions
    Arithmetic Logic Unit
    carries out arithmetic, logic, and bitwise operations
    Registers
    specialized, very fast storage used in executing instructions. Some registers have fixed uses and some are general purpose. There are only a few of these available.
    Cache
    a fast but limited memory space to speed computations
    Memory Management Unit
    helps the operating system manage memory resources (optional)
    Firmware
    fixed program for initialization and startup

    In modern CPUs, some or all of these components are on a single integrated circuit chip; at the very least, they are colocated on a common circuit board (the motherboard).

  2. The Cache

    The cache is a small (kilobytes or megabytes) memory unit that is very fast: one or two orders of magnitude faster than main memory. It’s usually directly attached to the processor. It is used to store copies of data currently being used by the processor; the processor automatically decides which data should be in the cache and ensures the main memory is updated with any changes made to the cache.

  3. The Bus

    Specialized communication circuitry that allows transfer of data among the various parts of the computer. A fast bus is as valuable as a fast CPU for achieving high performance.

  4. Main Memory (RAM)

    A large, fast, random-access storage area. Each slot in memory has a unique address, by which it can be read or written (see the sketch just after this list). There are different types of Random Access Memory, with different cost and performance characteristics.

    RAM is volatile: its contents are lost as soon as the power is cut off. RAM can’t be used to store data permanently.

    When you say you have 16GB of RAM, this is the memory you’re referring to.

  5. External Storage (e.g., Hard Disks, USB Drives, …)

    A very large storage area that is very slow compared to memory. Rotating hard disks (and even solid-state drives) impose physical constraints on the order in which information can be accessed.

  6. Ports

    Addressable access points to peripheral devices, allowing expansion of the system in a general way. Special programs called device drivers manage the details of how these devices are controlled.
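
To make main memory’s addressing concrete, here’s a minimal C sketch (C, being close to the machine, is a natural choice for these illustrations). Every object lives at a numeric address, and that address can be used both to read and to write the object; the exact address printed will vary from run to run.

#include <stdio.h>

int main(void) {
    int x = 42;
    int *addr = &x;                  /* the address where x is stored */

    printf("x lives at address %p\n", (void *) addr);
    *addr = 99;                      /* write to that address... */
    printf("x is now %d\n", x);      /* ...and x has changed */
    return 0;
}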

How Programs (aka Apps) Run

At the lowest level, a program is a set of instructions in machine code, each instruction encoded as a short sequence of bits in the following form:

opcode data-to-operate-on-if-any

The opcode is a number representing a specific operation that is hardwired into the CPU circuitry. The data operated on depends on the instruction and consists of register “names”, memory addresses, or numeric constants.

For example, here is a program for the Intel x86 processor to compute the GCD of two integers via Euclid’s algorithm, shown in hexadecimal notation:

55 89 e5 53  83 ec 04 83  e4 f0 e8 31  00 00 00 89  c3 e8 2a 00
00 00 39 c3  74 10 89 b6  00 00 00 00  39 d3 7e 13  29 c3 39 c3
75 86 89 1c  24 e8 6e 00  00 00 8b 5d  fc c9 c3 29  d8 eb eb 90

This is not easy to read or reason about, which is why we have programming languages! Anyone working directly with the processor usually finds it easier to use assembly language, which adds only the barest syntax to this machine code. Here’s the above program in assembly language, for reference:

   pushl %ebp
   movl  %esp, %ebp
   pushl %ebx
   subl  $4, %esp
   andl  $-16, %esp
   call  getint
   movl  %eax, %ebx
   call  getint
   cmpl  %eax, %ebx
   je    C
A: cmpl  %eax, %ebx
   jle   D
   subl  %eax, %ebx
B: cmpl  %eax, %ebx
   jne   A
C: movl  %ebx, (%esp)
   call  putint
   movl  -4(%ebp), %ebx
   leave
   ret
D: subl  %ebx, %eax
   jmp   B

When the CPU runs a program, it is given a starting address that points to the machine code of the program, and it proceeds through the instructions, modifying registers and memory, interacting with external storage and with peripheral devices, and keeping track of where in the code it is.

The CPU uses special registers to manage the control of the program’s flow. For example, the program counter (PC) register (sometimes called the instruction pointer, IP, or instruction address register, IAR) holds the address of the next instruction to be executed. It is normally incremented with each instruction, but instructions like branches (conditionals), jumps, function calls, and function returns can change the PC register directly.

Similarly, the stack pointer (SP) points to an area in main memory that is used as a stack for temporary computation and storage, such as holding the arguments to a function before calling it. This is updated automatically for certain instructions.
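
To see the stack in action, here’s a small C sketch that prints the address of a local variable at each level of a chain of function calls. Each call pushes a new frame onto the stack, so deeper calls land at successive addresses – typically decreasing ones, since most stacks grow downward. (Exact addresses and the direction of growth depend on the platform.)

#include <stdio.h>

void level_two(void) {
    int local = 2;
    printf("level_two's local: %p\n", (void *) &local);
}

void level_one(void) {
    int local = 1;
    printf("level_one's local: %p\n", (void *) &local);
    level_two();                     /* pushes another frame onto the stack */
}

int main(void) {
    int local = 0;
    printf("main's local:      %p\n", (void *) &local);
    level_one();
    return 0;
}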

Where Data Comes From

Many CPU instructions manipulate data – obviously the CPU would not be terribly useful if it had no way to access and use data from the disk or from memory.

If we could, we’d put all the data we need in registers. Registers are physically attached to the CPU and are extremely fast to access. The assembly code above is referring to registers when it uses names like %eax and %ebp.

But there are a limited number of registers, determined by the type of CPU you bought. Registers are hence only useful for the data you’re operating on right now, with these few instructions; you’re not going to, say, store an entire vector or matrix in registers.

There are hence CPU instructions to request that data from main memory (RAM) be loaded into the registers for use. But main memory is slow – a disparity that has only grown as processors have gotten faster and faster. Refer to the famous Latency Numbers Every Programmer Should Know.

This could be very bad: your CPU might theoretically be able to execute billions of instructions per second, but if it has to wait 100 ns every time it needs to fetch data, most of that processing power would be wasted waiting on data.
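
You can measure this latency yourself with the classic “pointer chasing” experiment. The C sketch below (the sizes and seed are arbitrary choices of mine) links the entries of a large array into one random cycle and times how long each hop takes. Since every hop lands somewhere unpredictable in an array far bigger than the cache, nearly every access must go all the way to main memory, and on typical hardware you’ll see something in the neighborhood of that 100 ns figure.

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N (1 << 24)   /* 16M entries (128 MB on a 64-bit machine) */

static unsigned long long rng = 88172645463325252ULL;
static unsigned long long xorshift(void) {   /* tiny PRNG; avoids RAND_MAX limits */
    rng ^= rng << 13; rng ^= rng >> 7; rng ^= rng << 17;
    return rng;
}

int main(void) {
    size_t *next = malloc(N * sizeof *next);
    if (!next) return 1;

    /* Sattolo's algorithm: shuffle so next[] forms one big random cycle,
       so following it visits every entry once, in unpredictable order. */
    for (size_t i = 0; i < N; i++) next[i] = i;
    for (size_t i = N - 1; i > 0; i--) {
        size_t j = (size_t) (xorshift() % i);   /* note: j < i */
        size_t tmp = next[i]; next[i] = next[j]; next[j] = tmp;
    }

    clock_t start = clock();
    size_t p = 0;
    for (size_t hop = 0; hop < N; hop++)
        p = next[p];                 /* each hop is one dependent memory fetch */
    double secs = (double) (clock() - start) / CLOCKS_PER_SEC;

    printf("%.1f ns per hop (ended at %zu)\n", 1e9 * secs / N, p);
    free(next);
    return 0;
}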

CPU makers have several strategies to avoid this:

Cache

Cache memory is slower than registers but much faster than RAM – perhaps two orders of magnitude faster. A small amount of cache memory lives on the CPU. When a program loads data from RAM, the CPU automatically puts it in the cache, so if the program uses it again, it’s available; data that’s been in the cache for a while without being used gets “evicted.”

Caches usually load and evict chunks of memory of a fixed size, called “cache lines” – often 64 bytes on modern processors. The data isn’t moved to the cache, but instead copied; the authoritative source for the data is still the RAM. When data in the cache is modified, the processor keeps track of this and eventually updates the copy in RAM as well.
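
One way to see cache lines directly is a stride experiment, sketched in C below. Summing every eighth double does an eighth of the arithmetic but touches exactly as many cache lines as summing every double – assuming 8-byte doubles and 64-byte lines, which are typical but not universal – so it takes nearly as long:

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N (1 << 23)   /* 8M doubles (64 MB), much bigger than the cache */

static void time_stride(const double *a, size_t stride) {
    double sum = 0.0;
    clock_t start = clock();
    for (int pass = 0; pass < 10; pass++)
        for (size_t i = 0; i < N; i += stride)
            sum += a[i];
    printf("stride %zu: %.3f s (sum = %g)\n", stride,
           (double) (clock() - start) / CLOCKS_PER_SEC, sum);
}

int main(void) {
    double *a = malloc(N * sizeof *a);
    if (!a) return 1;
    for (size_t i = 0; i < N; i++) a[i] = 1.0;

    time_stride(a, 1);   /* all 8M doubles: 8M additions per pass */
    time_stride(a, 8);   /* one double per 64-byte line: 1M additions,
                            but the same number of cache lines loaded */
    free(a);
    return 0;
}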

Effective cache use is important for performance. For example, if you time how long various matrix operations take as a function of the size of the matrices, there will be a sudden jump when the matrices hit a certain size – the point at which the matrices and related data no longer fit in the cache, forcing the processor to wait while they are fetched from main memory.
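
You can see this jump without any matrix algebra: do a fixed total amount of summing over arrays of increasing size, and the time per element rises once the array outgrows the cache. A C sketch (where the jump lands depends on your machine’s cache sizes, and aggressive prefetching may soften it):

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int main(void) {
    /* Arrays from 8 KB to 128 MB, with the same total work for each. */
    for (size_t n = 1 << 10; n <= (size_t) 1 << 24; n <<= 2) {
        double *a = malloc(n * sizeof *a);
        if (!a) return 1;
        for (size_t i = 0; i < n; i++) a[i] = 1.0;

        size_t passes = ((size_t) 1 << 26) / n;
        double sum = 0.0;
        clock_t start = clock();
        for (size_t p = 0; p < passes; p++)
            for (size_t i = 0; i < n; i++)
                sum += a[i];
        double secs = (double) (clock() - start) / CLOCKS_PER_SEC;

        printf("%9zu doubles: %.2f ns/element (sum = %g)\n",
               n, 1e9 * secs / ((double) n * passes), sum);
        free(a);
    }
    return 0;
}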

Prefetching

The processor may automatically load the next instructions it expects to execute before it needs them, and even data: if you’re reading through a big chunk of data in order, it can detect the pattern and fetch the next chunk in advance.

Speculative execution

Suppose the memory being requested will be used to decide which instructions to execute next (e.g. it will be used in an if statement). The processor could try to guess the answer and begin executing subsequent instructions; this is called branch prediction, and is a form of speculative execution. When branches are predicted correctly, the processor can execute subsequent instructions even while waiting for previous instructions to finish.
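
Branch prediction is easy to observe from ordinary code: run a data-dependent if over random values, once in random order and once sorted. The arithmetic is identical, but the sorted version is usually much faster because the branch becomes predictable. A C sketch (note that at high optimization levels the compiler may replace the branch with a conditional move and hide the effect):

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N (1 << 22)   /* 4M values */

static int cmp(const void *a, const void *b) {
    int x = *(const int *) a, y = *(const int *) b;
    return (x > y) - (x < y);
}

static void time_sum(const int *a, const char *label) {
    long long sum = 0;
    clock_t start = clock();
    for (int pass = 0; pass < 20; pass++)
        for (size_t i = 0; i < N; i++)
            if (a[i] >= 128)          /* taken about half the time */
                sum += a[i];
    printf("%s: %.3f s (sum = %lld)\n", label,
           (double) (clock() - start) / CLOCKS_PER_SEC, sum);
}

int main(void) {
    int *a = malloc(N * sizeof *a);
    if (!a) return 1;
    for (size_t i = 0; i < N; i++) a[i] = rand() % 256;

    time_sum(a, "unsorted");          /* branch outcome is random */
    qsort(a, N, sizeof *a, cmp);
    time_sum(a, "sorted  ");          /* same work, predictable branch */
    free(a);
    return 0;
}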

Note that these are all features that are effectively invisible to you: you do not have direct control over the cache, prefetching, or speculative execution. You don’t need to call a special put_in_cache() function. This is handled automatically by the processor.

We can only indirectly influence the cache, through the order in which we load and use data. We prefer to write programs that maintain cache locality, meaning successive instructions tend to access data that’s fresh in the cache. The worst case would be reading from widely disparate chunks of memory and never the same region twice, making the cache worthless and forcing every access to go to main RAM.

Cache locality concerns are why we sometimes care whether our arrays and matrices are stored in row-major or column-major order. In row-major order, physical memory stores all the entries in one row, then all in the next, and so on, so reading one entry tends to bring its neighbors in the same row into the cache; it’s hence more efficient to loop through all entries in one row before moving on to the next, rather than jumping between rows. In column-major order the reverse is true.

(R uses column-major order; Numpy uses row-major order by default.)
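
Here’s a C sketch of the difference (C is row-major). Both loops compute the same sum over the same matrix, but the row-wise loop walks through consecutive memory addresses, using each cache line fully, while the column-wise loop jumps a whole row’s worth of memory at every step; on large matrices the row-wise version is typically several times faster:

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N 4096   /* a 4096 x 4096 matrix of doubles: 128 MB */

int main(void) {
    double *m = malloc((size_t) N * N * sizeof *m);
    if (!m) return 1;
    for (size_t i = 0; i < (size_t) N * N; i++) m[i] = 1.0;

    double sum = 0.0;
    clock_t start = clock();
    for (size_t i = 0; i < N; i++)         /* row-wise: consecutive addresses */
        for (size_t j = 0; j < N; j++)
            sum += m[i * N + j];
    printf("row-wise:    %.3f s\n", (double) (clock() - start) / CLOCKS_PER_SEC);

    start = clock();
    for (size_t j = 0; j < N; j++)         /* column-wise: each step jumps N
                                              doubles, hitting a new cache line */
        for (size_t i = 0; i < N; i++)
            sum += m[i * N + j];
    printf("column-wise: %.3f s\n", (double) (clock() - start) / CLOCKS_PER_SEC);

    printf("(sum = %g)\n", sum);           /* use sum so it isn't optimized away */
    free(m);
    return 0;
}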

In interpreted languages like R and Python, where your program is interpreted by another program that spends lots of time looking up variable names in tables, running garbage collection, and so on, you can worry less about the cache locality of your code because so much other stuff is moving in and out of the cache all the time. But it’s worth thinking about when you’re using Rcpp or Cython or writing code in a faster language like Julia or C.

Multi-core Machines

The basic principles of CPUs – little machines executing instructions and shuffling data around among RAM, registers, and cache – are straightforward enough (though not exactly simple). But what happens when we have multiple CPUs in one machine?

This leads to interesting problems. How do the CPUs communicate with each other? What happens if one CPU modifies some memory that’s also stored in a cache line in another CPU’s cache?

Symmetric Multiprocessing

Refer back to that CPU diagram at the top of the page. In a system using symmetric multiprocessing, there are multiple CPUs connected to the same bus. There’s still only one main memory to refer to, and the processors all request chunks of it through the bus. Each processor usually has its own cache memory, to avoid the expense of fetching from main memory through the bus.

Most multiprocessor computers you use today are symmetric multiprocessors.

For ordinary applications – like running your web browser and Python at the same time – this design works well. Your web browser’s CPU instructions can be executed by one processor, fetching data from RAM and sending requests to the network driver as needed, while another processor can simultaneously be executing Python instructions and fetching data for your analysis program.

Danger arises if two different processors need to modify the same memory at (nearly) the same time. Consider:

  1. Processor 1 is working with some data from memory. This data is loaded into its private cache.
  2. Processor 2 starts reading that same memory, so a copy is loaded into its own cache as well.
  3. Processor 1 modifies the memory. This is written back to RAM.
  4. Now processor 2’s cache is wrong – it stores an old version of the data.

To avoid this inconsistency, known as incoherence, the system must have a method to ensure cache coherence. For example, it may automatically invalidate processor 2’s cache entry when processor 1 modifies the memory, forcing processor 2 to fetch the new version from memory the next time it wants to access it.

Coherence and Performance Concerns

This can have observable effects on the speed of parallel programs. Suppose you write a program that needs to operate on every entry of some big array; you decide to do this in parallel to make it faster. You set up your program to split the data in two, so Processor 1 can operate on half the entries and Processor 2 on the other half.

You could write your program so Processor 1 handles the odd-indexed entries and Processor 2 the even-indexed ones. But when Processor 1 refers to entry 1, that entry and adjacent entries will be loaded into its cache. (Remember, an entire cache line is loaded at once.) When Processor 2 modifies entry 2, it must invalidate Processor 1’s cached line containing entries 1, 3, 5, and possibly others, depending on the size of the cache line, and Processor 1 will have to reload those entries from RAM when it next uses them. This is known as false sharing: the processors share no data, but because their data shares cache lines, they invalidate each other’s caches anyway.

This can dramatically slow down a parallel program.
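
Here’s a distilled version of that effect as a C sketch using POSIX threads (compile with cc -pthread; the names and sizes are mine). Two threads each increment their own counter, never touching the other’s. With the counters padded onto separate (assumed 64-byte) cache lines, the threads run independently; set PAD to 1 so the counters share a line, and each write invalidates the other processor’s cached copy, making the same program run several times slower:

#include <stdio.h>
#include <pthread.h>
#include <time.h>

#define ITERS 100000000L   /* 100M increments per thread */

/* PAD = 8 longs = 64 bytes: each counter gets its own cache line.
   PAD = 1: the counters share a line, causing false sharing. */
#define PAD 8
static volatile long counters[2 * PAD];   /* volatile: keep the writes going to
                                             memory, not hoisted into a register */

static void *work(void *arg) {
    long id = (long) arg;
    for (long i = 0; i < ITERS; i++)
        counters[id * PAD]++;             /* each thread uses only its own slot */
    return NULL;
}

int main(void) {
    pthread_t a, b;
    struct timespec t0, t1;

    clock_gettime(CLOCK_MONOTONIC, &t0);  /* wall-clock time, not CPU time */
    pthread_create(&a, NULL, work, (void *) 0L);
    pthread_create(&b, NULL, work, (void *) 1L);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    clock_gettime(CLOCK_MONOTONIC, &t1);

    printf("%.2f s (counters: %ld, %ld)\n",
           (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9,
           counters[0], counters[PAD]);
    return 0;
}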

The situation is worse if different processors read and modify the same location in memory, because the results will depend on which processor happens to go first; see the Parallel Computing notes for more information.

How Processors Represent Numbers

As statisticians, we have to deal a lot with numbers (though not as much as our relatives tend to assume). Integers are, of course, represented in binary, and that is fairly straightforward, but what about floating-point numbers?

Floating-point numbers are treated a bit mystically by programmers, but they’re not that bad.

For reference, consult