R's Lists and its Detestable Dearth of Data-Structures

Update, September 2018: The R Journal recently published an interesting review of the state of data structures and collections in R: Timothy Barry (2018), “Collections in R: Review and Proposal”, vol 10 no. 1, pages 455-471.

R is the lingua franca of academic statistics. Many papers introducing new statistical methods are accompanied by a package posted on CRAN, R’s repository of useful packages and tools. Undergraduate programs often teach R and use it throughout their courses, as do most graduate programs—most PhD students I know are implementing their work in R as they do their research. R’s popularity is exploding, and even Microsoft has their own R distribution these days.

But R is an unusual language. It was designed by statisticians for statisticians, and has a number of convenience features—for example, there are no scalars. The number 17 is just a vector of length one, and the + operator can add arbitrary vectors elementwise by dispatching to fast C code. On the one hand, this means I can write foo + bar for two large vectors and get an efficient sum, but on the other hand, it means 1 + 2 has to go through the same loop instead of being converted to a fast addition.

There are other curious features of R, like Ross Ihaka’s famous¹ example of a function whose return value x is randomly local or global:

Ihaka says “No sensible language would allow this,” and misfeatures like this seriously limit R’s performance by making it difficult to efficiently run R code. (How can an optimizer deal with a variable that is randomly local or global?)

But the performance of a program depends on more than just how quickly the interpreter can execute the instructions. For a program doing vectorized operations on large arrays of numbers, sure, but what about a program that needs different data structures, like sets, graphs, or trees? What about a hash table? Do we have the tools to efficiently store and search data in appropriate data structures?

Lists lists lists

Besides vectors, and their multidimensional generalizations in matrices and arrays, the other core type in R is the list. If we want to build data structures, we’ll have to build them from lists. A list has named fields which can contain arbitrary data—vectors, other lists, strings, functions, and anything else. For example,

foo <- list(walrus="walrus", func=function(x) { -4 * x + 3 },
            walden=c(1, 2, 3, 4, 5))

foo$func(2)
# -5

foo$walrus
# "walrus"

Lists are the structures upon which much of R’s other features are built. For example, R’s famed data frames are just lists with one named field per column, and S3 classes (such as the lm objects returned from a linear model fit) are usually just lists with their class attribute set to their class name. Generic functions like predict are implemented by checking the class attribute and passing the list to the appropriate function.

Lists also support numbered entries, and some R functions return lists without any names. We can access named and unnamed entries by their numbers:

foo[[3]]
# [1] 1 2 3 4 5

Lists are the only key-value mapping type provided in base R: there are no dictionaries or associative arrays.² Nor are there structure or record types. So you’re encouraged to use lists for everything. Returning a complicated set of results from a model fit? Use a list. Looking up items by their name? Use a list.

It’s great to have such a flexible data structure, but it turns out that lists are a poor foundation to build a language upon. Let’s start with the basic problem: finding elements in a list.

What about the lookups?

Lists support some unusual features. For example, we can access their elements with shortened names, and we’ll get the element with the best match:

foo$wald
# [1] 1 2 3 4 5

Hang on—how does R do that? In a standard hash table, we hash the key (turn the string “wald” into a numeric code), map that into an array, and look up the entry. But if we look up “wald” instead of “walden”, we’d end up in the wrong place and fail to find the entry. How does R do it?

Simple: it’s not a hash table. It’s a named array. Lookup means an O(n) linear search through the array, checking the names as we go, not an O(1) hash table lookup.

We can see how lists are constructed in src/main/builtin.c, inside do_makelist. There are two steps. First, R iterates through the arguments to list (which are provided as a SEXP—yes, in reference to S-expressions, since R code is internally kept as linked lists) and checks if any of them have names. If they do, it allocates a names vector and fills it up with the names of the list elements, then attaches it as an attribute to the resulting list, which is internally stored as a vector (an array).

So lists are a pair of arrays: one array of pointers to the list elements, one array of their corresponding names.

Subsetting (the $ operator) is implemented in src/main/subset.c, inside R_subset3_dflt, where R iterates through the list to find the element with a matching name:

R_xlen_t i, n, imatch = -1;
int havematch;
nlist = getAttrib(x, R_NamesSymbol);

n = xlength(nlist);
havematch = 0;
for (i = 0 ; i < n ; i = i + 1) {
    switch(pstrmatch(STRING_ELT(nlist, i), input, slen)) {
    case EXACT_MATCH:
        y = VECTOR_ELT(x, i);
        if (NAMED(x) > NAMED(y))
            SET_NAMED(y, NAMED(x));
        UNPROTECT(2); /* input, x */
        return y;
    case PARTIAL_MATCH:
        havematch++;
        if (havematch == 1) {
            /* partial matches can cause aliasing in eval.c:evalseq
               This is overkill, but alternative ways to prevent
               the aliasing appear to be even worse */
            y = VECTOR_ELT(x,i);
            SET_NAMED(y,2);
            SET_VECTOR_ELT(x,i,y);
        }
        imatch = i;
        break;
    case NO_MATCH:
        break;
    }
}

So we iterate through the elements of the list, checking their names. Exact matches are returned immediately, but unique partial matches require extra processing later (R optionally throws warnings on partial matches, since they’re easy ways to screw up). If there are multiple matches, no match is returned, since picking one arbitrarily would surprise the user.

Now, I mentioned this means it’s an O(n) linear search, not a O(1) hash table lookup. What implications does this have for practical programs? Unfortunately, it makes lists unusable for many tasks you’d want a hash table for.

For example, in our statistical computing course, we give students a simple challenge: given a file of dictionary words, find sets of words which are anagrams of each other, and print out the sets of grouped anagrams. (For example, “iceman” and “cinema” would be printed together.) The best way to do this is with a hash table, keyed cleverly so that anagrams have the same key.³

As an example, I read in 91,000 English words from a text file and made them keys of a list, with the values just always 1. I then selected a random subset of 200 of the words.

words <- read.csv("english-words.txt", stringsAsFactors=FALSE,
                  header=FALSE)

wordlist <- as.list(rep(1, length(words$V1)))
names(wordlist) <- words$V1

keys <- sample(words$V1, size=200)

I then wrote a simple function to time how long it takes to look up those keys in lists of varying sizes:

time.lookup <- function(n) {
    sublist <- sample(wordlist, n)

    t <- system.time({
        for (key in keys) { sublist[[key]] }
    })

    t[["elapsed"]]
}

Some quick plotting and we obtain:

svg("list-lookup.svg", width=5, height=4)
sizes <- seq(from=1000, to=90000, by=500)
times <- sapply(sizes, time.lookup)
plot(times ~ sizes, xlab="List size", ylab="Lookup time (seconds)",
     pch=20, las=1)
dev.off()

Yes, it’s roughly linear, so list lookups are O(n). For the anagrams problem, this becomes a prohibitive time cost, making the program orders of magnitude slower than the equivalent in Python or another language with built-in O(1) hash tables.

Obviously, for small lists (like a data frame with five columns), a linear search through such a small array is hardly a cost at all—it may even be faster than calculating a hash function. But once we start storing large datasets in lists, we run into problems.

Lists: powerful but inflexible

Lists are flexible and widely used in R. They’re the basis of S3 classes, they can store heterogeneous and deeply nested data, and the default vectorization machinery (like apply and lapply) produces lists. There’s extended list subsetting syntax with the [ and [[ operators, along with $, and useful features like record types (or structs) can be emulated by simply using a list.

But the next problem with R’s approach to lists is the names: they may only be strings. In Python (or Racket, or many other languages), on the other hand, I could use any immutable type as a key:

foo = {(1, 2): 7, "bar": "baz"}

foo[(1, 2)]
# 7

Any immutable type can be hashed, so there’s no reason it can’t be used as a key in a dictionary. But R only supports strings as list keys. This may seem like a minor niggle, but it turns out that arbitrary keys can be amazingly useful for O(1) lookups of things other than strings. They’re also useful for storing sets, collections of items for which every item appears only once and which support efficient unions and intersections. I often make use of sets in my own code:

Finding duplicates in sets of data other than strings.
Storing sets; for example, my Conway’s Game of Life code (in Racket) keeps track of which cells are live by storing the current set of live cells, in terms of their (x, y) coordinates. (This way, I don’t need to store a matrix of all the grid cells—I can just store the live ones, and hence support an infinitely large grid. This trick came from Chris Genovese.) But in R, I couldn’t use this trick: there’s no way to look up if a cell is live without an O(n) search, and I’d have to convert the coordinates to “x,y” strings instead of using more natural c(x,y) vectors.

Code that tries to get around these restrictions has to use convoluted workarounds. The sets package, for example, offers native set data structures in R, but stores them as sorted lists. (Sorting is necessary to make it easy to check if two sets are equal.) For sets containing non-numeric elements, the elements are sorted by their string representation, so every object has to be converted to a string. To perform set intersection operations, or others that would require O(1) access, the list is converted to a hash table by converting all its elements to strings and using an R environment.

All these convolutions cost efficiency and complexity. Of course, the user of the sets package need not worry with the details, but if they care about performance, they’ll notice the cost.

The problem with absent data structures

R users have various superstitions about writing fast code: vectorize everything, avoid for loops, use built-in functions when you can, and write hot functions in C++ (through Rcpp) whenever necessary. And that’s about it.

And yes, vectorization can gain you a large constant factor in performance over a for loop, since the loop is done in fast C code instead of a slow bytecode interpreter. But if you want to go from O(n²) to O(n), you need to use the right data structure instead of munging together lists of lists using vapply and merge and whatever other functions you can piece together.

I’d love to see basic sets, hash tables, and trees built into R. I don’t know if it can be done while keeping compatibility, but it’s necessary if R is to become a more general-purpose programming language. I don’t know if that will happen or if Julia and Python eat its lunch.

For certain values of “famous”.↩
I realize there are environments, but they are a curious beast of their own. The hash package uses them to implement hash tables with string keys.↩
Stat computing students: it’s cheating to read the rest of this footnote. But, in short, sort the characters in the words. “iceman” and “cinema” both sort to “aceimn”, so they’d end up at the same key in the hash table.↩