The purpose of Advanced Statistical Computing is to give students real-world experience working on complicated software projects using techniques and technologies related to statistics, statistical modeling, and data science. Each student will either choose a project from the examples below or propose their own topic.
Warning: The information on this page is not yet final. Project details may change before the semester starts.
All projects must:
- Involve a substantial software development component. Projects can’t just be about setting up or combining existing software packages.
- Involve programming methods or software tools that are relevant to statistics and data science, as judged by the instructor. For example, projects that implement algorithms to analyze data or that use tools commonly used to analyze data (like SQL or Hadoop) are acceptable, while a project to use a Raspberry Pi to display the latest bus schedule on a TV in the graduate student offices is insufficiently statistical.
- Involve interesting algorithms, data structures, or computing tools (like databases or distributed systems). Rewriting some R code in Rcpp to make it faster isn’t enough; writing Rcpp that uses a better data structure or calculates results in parallel is.
- Have clearly identified goals and milestones. Your project proposal should be specific enough that, at the end of the mini, we can tell whether you met your goals.
I am happy to work with you to develop project ideas. If you have an idea and you aren’t sure if it meets the requirements, just email or talk to me and we can work it out.
Projects can be related to work you’re already doing. For example, if your thesis research involves a new algorithm for analyzing social networks, your project could involve parallelizing that algorithm or implementing it on large network data with Apache Spark.
These are examples of projects that would be suitable for the course. You can choose one as your own project, or suggest a topic of your own.
- Implement a network analysis method on a very large dataset using Apache Spark’s tools for graph processing. For example, do network analysis on all links between Wikipedia pages using a database dump, or do large-scale network analysis on citations extracted from arXiv data dumps.
- Implement a distributed statistics or machine learning algorithm using Spark’s SparkR or MLlib packages; for example, take a machine learning method not already implemented in MLlib and make a new implementation.
- Write a parallel implementation of an existing statistical method using OpenMP or MPI. Perform extensive performance tests of the method on different numbers of CPU cores, demonstrating the benefit of parallelization.
- Implement a statistics or machine learning algorithm in parallel on a graphics card, using CUDA or OpenCL. For example, pick an interesting algorithm that involves a lot of small matrix operations (like deep learning) and implement it from scratch on the GPU.
- Build off of a project from 36-650/750; for example, parallelize your anomaly-speed implementation and automate it to run on many videos at once, or build a sophisticated non-destructive classification tree that can find optimal pruning without repeatedly rebuilding the tree and can build forests in parallel.
- Build statistical “tooling”, meaning helpful tools that make it easier for statisticians to do their work. shinystan, for example, makes it easy to look at MCMC diagnostics; maybe a method or model you use could benefit from fancy interactive diagnostics and visualizations.
- Implement a “statistical QuickCheck”. QuickCheck lets programmers write unit tests specifying properties of functions that must be true for all inputs, and ensures these properties hold on random inputs. A statistical QuickCheck might let the user specify “the estimator implemented by this R function should be asymptotically normal when the data comes from this distribution”, or “this estimate should converge at this rate”, or “this estimate is invariant to linear transformations of the input”, or other statistical properties of an estimator, then automatically conduct appropriate simulation studies to verify that the properties are true.
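To make the “statistical QuickCheck” idea concrete, here is a minimal sketch of what a single property check might look like. It is written in Python with only the standard library, purely for illustration; a real project would more likely target R functions. Every name in it (`random_sample`, `is_equivariant`, `check_property`, `clipped_mean`) is hypothetical. The property checked is location-scale equivariance, a close relative of the "invariant to linear transformations" example: applying the estimator to a*x + b should give a times the original estimate, plus b. Asymptotic properties such as normality or convergence rates would need actual simulation studies, which this sketch does not attempt.

```python
import math
import random
import statistics


def random_sample(rng, max_size=50):
    """Draw a normal sample with a random size, mean, and standard deviation."""
    n = rng.randint(2, max_size)
    mu = rng.uniform(-10, 10)
    sigma = rng.uniform(0.1, 5)
    return [rng.gauss(mu, sigma) for _ in range(n)]


def is_equivariant(estimator, sample, rng, tol=1e-8):
    """Check estimator(a*x + b) == a*estimator(x) + b for one random a > 0 and b."""
    a = rng.uniform(0.5, 3)
    b = rng.uniform(-5, 5)
    transformed = [a * x + b for x in sample]
    expected = a * estimator(sample) + b
    return math.isclose(estimator(transformed), expected, rel_tol=tol, abs_tol=tol)


def check_property(estimator, prop, n_trials=200, seed=0):
    """Return the first random sample on which prop fails, or None if all trials pass."""
    rng = random.Random(seed)
    for _ in range(n_trials):
        sample = random_sample(rng)
        if not prop(estimator, sample, rng):
            return sample  # counterexample for the user to inspect
    return None


def clipped_mean(xs):
    """Deliberately broken estimator (for the demo): a mean truncated at zero."""
    return max(0.0, statistics.mean(xs))


mean_counterexample = check_property(statistics.mean, is_equivariant)
median_counterexample = check_property(statistics.median, is_equivariant)
clipped_counterexample = check_property(clipped_mean, is_equivariant)
```

Here the mean and median pass every trial, while the truncated mean yields a counterexample sample. A full project would go well beyond this, for instance by shrinking counterexamples to small readable cases and by supporting distributional properties through simulation.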
Proposing a project
The project proposal is a short written document with the following sections:
- A summary of the project.
- A list of specific goals (analyses to conduct, methods to implement).
- An expected schedule of which goals will be achieved at which points in the mini.
- A brief description of your prior experience with the technologies and methods proposed. Have you used the technology before? Is this a statistical method you’ve already used or have developed for your thesis?
- If applicable, a description of the data you intend to use.
- Links to references, examples, packages, tutorials, and documentation you expect to use. (For example, if you want to base your OpenMP code on code you found in another R package, give a link, and link to other tutorials and references you find.) The point of this step is to encourage you to find examples and references before you start, so you know where to look when you get stuck. Spend some time Googling for relevant material; for R users, the CRAN Task Views are good ways to find relevant packages.
Project proposals are subject to the approval of the instructor. I invite you to send a draft proposal before the deadline, even before the semester begins, so you can get feedback and start the project right away.
Proposals should be submitted as text, HTML, or PDF files, through Canvas.
Each Tuesday before class, you will be asked to turn in a brief project status update through Canvas. This need only be a paragraph or a few bullet points. It should cover:
- Which of your goals (from your proposal) you worked on this week, and what you accomplished.
- Any unexpected roadblocks or problems you encountered, what you plan to do about them, and what kind of assistance or information would help solve the problems.
- What you plan to do in the next week.
Also, you must track your project with a version control system like Git and share your repository with me on GitHub or Bitbucket, whichever you prefer. Make regular commits, and push before each weekly status update so I can see your latest code.
I recommend using your commit messages to narrate your progress, with the messages explaining the purpose of each commit, what problems you’re trying to solve, and so on, so your weekly status update can essentially be a summary of the commit messages.
I will use your status updates and code to help guide your projects.
Through your status updates, and through access to your code, I’ll grade your projects on several criteria, similar to the grading in 36-650/750:
- Is your code readable, well-organized, and easy to understand?
- Is your code well-designed, modular, and easy to maintain and extend?
- Representations, Algorithms, and Data Structures: Did you use appropriate algorithms and data structures to meet the goals you set for yourself?
- Correctness, Elegance, and Performance: Is your code correct, elegant, and fast?
To judge whether your project succeeded, I’ll also consider the goals you set for yourself in your proposal. It’s okay if you don’t meet all these goals, as long as your status updates are clear about the problems you faced while trying to do so. In other words, “I didn’t meet any of my goals” is bad, but “I tried X, Y, and Z, but still couldn’t meet this goal for this reason” is good.
As you work on the project, you will also prepare a written tutorial on a method or technology you used. For example, if your project parallelizes a statistical algorithm, you can write a tutorial on using the parallelization tools you used; if your project uses Spark to analyze network data, you can write a tutorial on Spark’s network features. Your tutorial should be about the methods or technologies you used, not about the specific thing you made – if you implemented a particular parallel algorithm, write a tutorial on how to parallelize algorithms, not a tutorial on how to use your specific implementation of a specific algorithm.
Think of the tutorial as a blog post or wiki page for your fellow classmates to use if they are interested in the same topic. What should they know to get started? Describe the basic principles of the topic, give example code and setup instructions, and link to other tutorials or references you found useful during your project.
Ideally, these tutorials can be posted on the course website, for future graduate students to benefit from.
Tutorials should be submitted in a convenient editable format, ideally Markdown or RMarkdown files. No Word files; LaTeX only if there is no other option.
The final presentation
In the last week of class, you will give a brief 15-minute presentation summarizing your project, what you accomplished, and what tools you used. This presentation can summarize the written tutorial and also discuss any interesting obstacles you hit, results you obtained, and so on. (If you parallelized an algorithm, tell us how fast it is; if you made a new analysis tool, show off its results.)