Statistical Computing Postmortem

In the fall of 2015 I had an unusual job: teaching computing to statistics PhD students.¹ Chris Genovese asked me to be co-instructor of his newly rewritten course 36-750, which takes a somewhat unusual approach to the subject: Students are not expected to know a programming language, but we do not teach one – they must be able to pick up all the programming on their own, and we must be able to grade the blend of R, Python, Java, Ruby, Clojure, Julia, Haskell, CoffeeScript, and C we get as homework submissions.

The course doesn’t aim to teach functions and loops and basic syntax. Nor do we teach specific tools, like knitr, or many numerical methods, like optimization or MCMC. Instead, it’s a course on computing:

We only had one semester, so we covered these topics at high speed. Chris and I split the lecturing, and a department TA shortage meant that I also did all the grading. Class met twice a week, so students had two homework assignments per week, chosen at their own discretion from a bank of problems covering all the topics in the course so far. Typical assignments might have the students write a maze solver with graph algorithms, implement a Bloom filter, solve the Towers of Hanoi puzzle, or find all words that are anagrams of each other in a dictionary. Assignments were submitted as GitHub pull requests on each student’s private repository, so I could leave feedback with GitHub’s review features.

Our goal was to force students to immediately apply the principles they learned in class, so assignments exercised everything we taught: some were impossible to efficiently solve without clever data structures,² some required dynamic programming, some required graph algorithms, and so on. And everything had to be submitted with unit tests and with Git, so students had plenty of practice by the end.

Results

I think we did reasonably well. Many students entered the course thinking “Why do I need to learn how to write efficient code? Why do I need data structures and algorithms? I’ll just use an R package that calls C++ code to do everything fast.”³ Most held the common R user’s belief that the only way to speed up code is by vectorizing loops and calling C++ or Fortran. By the end, they could choose the right data structures and algorithms to make their own ideas fast even without compiled code, and hopefully they saw the value of unit tests and automation. (I certainly left comments on enough untested and unnoticed bugs throughout the course, and so they gradually expanded their coverage.)

It wasn’t all perfect. Scheduling hurt some. With most students taking advanced statistical theory and convex optimization, while also finishing their Advanced Data Analysis projects, most treated stat computing as their last priority and started submitting incomplete assignments or choosing the easiest problems from the bank. We encouraged this by being flexible about deadlines and completion, and also by allowing students to choose any assignment at all from the problem bank.

The grading incentive structure was a problem. I graded assignments and left comments on each – sometimes quite a few comments – but students had no incentive to read the comments and revise their code. My classmate Jerzy Wieczorek experimented with a revision system in his Statistical Graphics course, where students were required to submit a certain number of high-quality assignments to get an A, and they could revise and resubmit previous assignments until they were good enough to count. We can probably adapt this. We could also do a better job providing test cases to students so they catch common errors in their code before submission; there were several assignments where many students misunderstood the problem statement and submitted incorrect solutions.

To give the students experience with more complex programs, at the end of the course we introduced challenge problems: one had them implement R-trees and use them for spatial search in a real dataset, and the other had them use locality-sensitive hashing and spectral analysis to implement a music-matching system like Shazam. Students chose one of the two, wrote their code over a week or two, and then peer-reviewed another student’s code on GitHub, which turned out to be a very good process: there were some very thorough reviews, and the students reported they learned a lot from it. The next time we run the course, we’ll probably introduce challenge problems much earlier and do more routine peer review throughout the semester.

But, overall, I’m happy. I’ve taught before, but only introductory courses to undergraduates, and it’s a very different experience lecturing to PhD students just a year or two behind you.⁴ I think the course is valuable and, at the least, made students realize there’s more to computing than just cobbling together R scripts – that with careful design, good algorithms, and thorough testing, you can save yourself time and effort while solving your research questions. Let’s hope it sticks.

We actually drew several non-statistics students as well, despite not widely advertising the course. Some of them dropped, but several dedicated students made it through to the end, and did a very good job.↩
Making them difficult to solve in R, which suffers from a paucity of data structures: the built-in list is O(n), so it can’t suffice as a hash map, and the hash package accepts limited data types as keys. Fortunately the dequer package appeared just in time for the course.↩
This is probably why R’s data structures are so poor: everyone abandons R at the first glimpse of slowness without thinking about choosing faster data structures. I hope the Julia team can convince the world that interactive languages can also be efficient languages.↩
It’s easier to make weird jokes without getting blank stares from the audience.↩