Statistical programming languages

I do not find R to be a very satisfying statistical programming language. It has a lot of features for easy interactive use, but misses features where they count. There are few good data structures beyond data frames, weird scoping rules that make a high performance interpreter nearly impossible, and limited ability to manage code complexity and ensure high quality.

I’d describe R as “a programming language designed by statisticians.” But let’s turn the question around: what would a statistical programming language look like? One where statistical concepts were a core part of the language? Where, say, distributions are first-class objects, rather than just a collection of functions to sample and get quantiles? Where our code expresses the assumptions and claims we make about procedures instead of just mundane computational details?

Prior discussion

Amelia McNamara did her PhD on tools for statistical programming.
- Her dissertation is the master source.
- “On the State of Computing in Statistics Education: Tools for Learning and for Doing”, a review of current tools for computing in teaching statistics. https://arxiv.org/abs/1610.00984
- McNamara, A. (2018). Key attributes of a modern statistical computing tool. The American Statistician, 73(4), 375–384. doi:10.1080/00031305.2018.1482784
  Basically: a statistical language needs to handle the entire workflow, from playing with interactive graphics as EDA, reformatting data, turning data into a narrative, and publishing a reproducible document with the embedded analysis. It should be easy to play with randomization and bootstrapping, be self-documenting, and be easily extended with new models and new features. And it should be easy for beginners.
John Myles White has written about type safety in statistical computing, with the idea that the types in the program can justify the statistics: if variable types include information like “this is an IID random sample of asymptotically normal quantities”, then inference functions can be justified. If this could be codified in the function declarations, we could make assumptions explicit and prevent mistakes.

Next, some prior work.

Statistical DSLs

Tea (GitHub), a Python-based DSL in which you specify your data, assumptions, and hypotheses, and a constraint solver automatically selects appropriate hypothesis tests and outputs the results. Not a very elegant DSL (dictionaries of strings), but a very interesting idea, because it makes all the assumptions and hypotheses explicit in the code. Unfortunately it only does NHST, not estimation, though the authors claim it is extensible and can easily do more things.

infer, an R package which provides a DSL for specifying hypotheses, sampling from null distributions, and calculating test statistics:

mtcars %>%
  specify(am ~ vs, success = "1") %>%
  hypothesize(null = "independence") %>%
  generate(reps = 100, type = "permute") %>%
  calculate(stat = "diff in props", order = c("1", "0"))

Ross, K., & Sun, D. L. (2019). Symbulate: Simulation in the language of probability. Journal of Statistics Education, 27(1), 12–28. doi:10.1080/10691898.2019.1600387
Symbulate, a Python-based DSL for probability that has clear concepts of probability spaces, random variables, conditioning, and simulation, with the goal of making it easy to do in-class simulation activities in a syntax similar to probability theory and where probability concepts are first-class objects in the language, so students do not need to switch to programmatic thinking instead. Also can detect common misconceptions by making them errors.
Declarative Statistics, a scheme where models are built on constraint programming: specify the variables, specify the constraints a good model meets, and let a constraint solver package find the parameter values that are consistent with the data. Conceptually straightforward, even if some simple models (like least-squares regression) are difficult to express. May be difficult to teach from, because the constraints require parametric distributions but the distributions aren’t motivated from the method.
Whittemore (GitHub), a Clojure-based DSL for performing causal queries.
DrBayes (thesis), a Racket-based language implementing measure theory. Arbitrary Bayesian models can be defined even when there are no densities and Bayes’s law is impossible to use (e.g. conditioning on a non-axis-aligned set). Has various clever tricks to make sampling and inference fast in many cases.
Stan, a compiled DSL for Bayesian inference and penalized maximum likelihood problems. Meant to be used to define a model and then called from R, Python, or Julia code which does the data cleaning and result-munging. A descendant of tools like JAGS and BUGS.
HLearn understands the algebraic structure of common ML algorithms, exploiting it for fast generic implementations of things like cross-validation. See a toy example of usage here.

The simulator package for R (CRAN), which creates a simulation DSL which can generate from models with different parameters, use different estimators, and report results:

new_simulation(name = "bet-on-sparsity", label = "Bet on sparsity") %>%
    generate_model(make_sparse_linear_model, n = 200, p = 500,
                   k = as.list(seq(5, 80, by = 5)), vary_along = "k") %>%
    simulate_from_model(nsim = 5, index = 1) %>%
    run_method(list(lasso, ridge)) %>%
    evaluate(list(mse, bestmse, df))

The mosaic package for R tries to create a consistent formula-based DSL for inference; see this R Journal article for examples.
Lisp-Stat, built on a subset of Common Lisp and adding vector operations, statistical functions, and linear algebra. Supported dynamic graphics that many still wistfully remember. J. Stat. Software has an entire special issue on Lisp-Stat, its useful features, and its slow death.

Common Lisp Statistics seems dead-ish, but a presentation by the authors proposes a fascinating idea: can’t our statistical language encode the assumptions and claims we make about statistical procedures, and even check them through simulation?

(define-theorem my-proposed-theorem
  (:theorem-type '(distribution-properties
                   frequentist likelihood))
  (:assumes '(assumption-1 assumption-2))
  (:likelihood-form
   (defun likelihood (data theta gamma)
       (exponential-family theta gamma)))
  (:compute-by
   '(progn
      (compute-start-values thetahat gammahat)
      (until (convergence)
          (setf convergence
                (or (step-1 thetahat)
                    (step-2 gammahat))))))
  (:claim (equal-distr '(thetahat gammahat) 'normal)))

Data querying DSLs

It’d be nice to build tabular data straight in, like R’s data frames but more so: with built-in efficient querying, joining, and transformations. It is useful to think about how data frames differ from SQL-style tables, though:

Wu, Y. (2019). Is a dataframe just a table? In PLATEAU workshop. https://plateau-workshop.org/assets/papers-2019/10.pdf

SQL-style tables do not have row ordering, while data frames (in R, Python, and similar implementations) do. This implies very different ways of accessing and using data. The paper considers the consequences of these differences, and encourages future implementations to think more carefully about the data operations they are supporting.

Previous DSLs for data munging:

Postmodern, a PostgreSQL package for Common Lisp, which looks a bit like this:

(:select 'relname :from 'pg-catalog.pg-class
         :inner-join 'pg-catalog.pg-namespace :on (:= 'relnamespace 'pg-namespace.oid)
         :where (:and (:= 'relkind "r")
                      (:not-in 'nspname (:set "pg_catalog" "pg_toast"))

SxQL, another Common Lisp SQL generator:

(select (:id :name :sex)
        (from (:as :person :p))
        (where (:and (:>= :age 18)
                     (:< :age 65)))
        (order-by (:desc age)))

LINQ from .NET.

dplyr for R, which chains functions for manipulating data, based on the basic operations: filtering, selecting, grouping, summarizing, and arranging:

flights %>%
  group_by(year, month, day) %>%
  select(arr_delay, dep_delay) %>%
  summarise(
    arr = mean(arr_delay, na.rm = TRUE),
    dep = mean(dep_delay, na.rm = TRUE)
  ) %>%
  filter(arr > 30 | dep > 30)

data.table has its own peculiar R syntax for these operations, and supports indexed tables.
StructuredQueries.jl builds a graph representing the query (in a macro), then a collect call executes the query and collects its results into a table. Supports building queries with placeholders, then passing in placeholders later. Would be an interesting exercise with S-expressions.
The Pyret programming language has tabular data as a core type, with built-in operations for querying and filtering (but not joining, as far as I can see?).

Incidentally, Feather is meant to be an interoperable standard for data tables.

It would be good to provide uniform access to all kinds of tables: CSV files, database tables, built-in data frames, etc. I think StructuredQueries.jl has the right idea: an abstract query interface which can be applied to data from many sources makes the most sense, and lets us avoid inefficient idioms like df[df$foo > mean(df$foo)], which involves all sorts of unnecessary allocation and looping.

Report generation

A key motivator here is a simple problem: code is disconnected from presentation of results. We often see people copying and pasting numbers from results tables into their papers, often screwing up results as they go. Or people work in the R console, pasting bits into a script or R Markdown file as they go, but lose pieces on the way, or do some of their data manipulation with ad hoc commands that are never recorded. Data might be transformed or outliers dropped, the results saved to a file, and those steps were never recorded.

knitr embeds R within Markdown and LaTeX, with caching and other smart features to make it easier to incrementally develop reports. It’s still too easy to spend most of your time in the console, only jumping to R Markdown for “final” stuff.
Scribble is a document format that is executable Racket code, with a custom reader and library for headings, figures, etc. Easily embeds Racket (or any language implemented in Racket) and output, and the document format can be extended with new elements with a few lines of code.
Org mode provides literate programming features crossed with notebooks, where an Org file can contain embedded code chunks with their output, and can weave the chunks together into standalone code files.
Jupyter takes Mathematica’s notebook concept to the next level, with embedded code in many different languages interspersed with graphics and Markdown text. Tends to be used in practice as a fancy REPL – it’s still easy to get into a state where the code in the notebook doesn’t actually match the current interpreter state, since you’ve edited and removed code.

But let’s consider another aspect of making report generation practical. We should be able to write our analysis into a script or report and tinker with it, without waiting for the entire analysis to re-run every time we change a figure caption. How do we avoid that?

knitr has a chunk cache to store variables created by each chunk and avoid rerunning chunks unnecessarily.
trackr captures R “results” – plots, model objects, and so on – and records metadata about them, such as the data used to generate them, the type of plot, the data, transformations involved, and so on. The database of results then can be easily searched to find a particular result and retrieve it. There’s built-in knitr integration to store all output from code chunks in knitr documents.
drake lets R users declare dependencies in their code and generates targets from their dependencies, like Make implemented inside R.
miniAdapton is a Scheme adaptation of Adapton, a system for incremental computation: by keeping a directed graph of computation inputs and outputs, only the necessary parts can be recomputed when the user requests a certain result.

Modeling and hypotheses

Suppose we build a system like Common Lisp Statistics above, where we can annotate estimators with their assumptions and their distributional properties. Then we could specify a null and automatically derive a test: under the null, the parameter has a certain value, and we can run the estimator, and pass its result to some function that uses the estimate and the distributional assumptions to return our p value.

Could we compose estimators this way and get correct p values automatically? For example, consider a simple test of means where I first test for normality and then decide whether to do a t test or a nonparametric test. The overall test has a different alpha level because of the choice being made – can I capture that and get a correct p value automatically?

We can treat hypotheses like generators in a QuickCheck framework: combinators of properties (e.g. (Vectorof Normal)) which support generation of arbitrary data from the hypothesis. This would let us draw from models under hypotheses, either from parametric simulation or bootstrapping (since presumably you could write a hypothesis such as “distributed like this data”).

Let’s consider the pieces we need:

Models

A model has some parameters, a generative process, data slots, and perhaps a log-likelihood or other objective function. There might also be some parameters for the type of model, like the assumptions it places on the error distribution or the number of variables in an ANOVA. There are also inherent assumptions.

Consider a model for a simple two-group comparison. We have a parameter for the distribution of errors, then parameters for the means.

;; groups is a vector of a binary factor indicating which group each data point
;; is in
;; following Typed Racket style of type declarations
(define-model (groupwise-means [groups : (Vectorof Factor)]
                               #:error-dist [error-dist Normal]) : (Vectorof Real)
  #:parameters ([group-means : (Map Factor Real)]
                [variance : Real])
  #:generate-one (lambda (group) (+ (group-means group)
                                    (sample error-dist)))
  #:log-likelihood (+ (sum (error-dist (.- xs mean-x)))
                      (sum (error-dist (.- ys (+ mean-x mean-diff))))))

We can instantiate a model with, say, (two-means Normal-Error), where Normal-Error is a normal distribution centered at 0 whose error is fit automatically.

Questions: Can we encode the independence assumption on xs and ys? How do we write a generating function for xs and ys for models where they might depend on each other (e.g. regression)? Can the log-likelihood be determined automatically from the generating process?

Estimators

An estimator is for a specific parameter in a model. The estimator, under certain assumptions, has some useful properties, such as distributional properties. There can be multiple ways to estimate a given parameter, so we expose multiple estimators.

(define-estimator (mean-difference xs ys)
  #:for two-groups-mean-diff
  ;; Require the inputs to have the same assumed standard
  ;; deviation.
  #:assuming ([xs (Vectorof (Normal mean1 stddev))]
              [ys (Vectorof (Normal mean2 stddev))])
  ;; my stddev calculation is wrong, but whatever
  #:distributed-as (Normal (- mean2 mean1)
                           (/ (* stddev 2) (length xs)))
  (- (mean ys) (mean xs)))

Questions: Are there estimators for which there are multiple different distributional properties, under different sets of assumptions? How do we fit a model to data, given some estimators?

Some thought needs to go into the API so it’s easy to do estimators for multiple parameters (e.g. if we estimate both the slope and intercept of a linear regression simultaneously).

There are also cases when we can use the log-likelihood to derive an estimator automatically.

Tests

How do we compose these features? Suppose we have, in some imaginary syntax,

;; Returns distribution object representing the distribution
;; under this hypothesis. Since mean-difference gives a
;; parametric form for its sampling distribution, in terms
;; of the model parameters, and the assumptions are met,
;; the sampling is done parametrically.
(assuming ([xs (Vectorof (Normal 1 1))]
           [ys (Vectorof (Normal 0 1))])
          (mean-difference xs ys))

;; On the other hand, if the assumptions aren't met, we recognize
;; this and do automatic parametric bootstrapping instead.
;; Notice the differing standard deviations.
(assuming ([xs (Vectorof (Normal 1 2))]
           [ys (Vectorof (Normal 0 1))])
          (mean-difference xs ys))

Notice that define-estimator gives assumptions for its distribution, which would require some kind of structural pattern matching on the assumptions to be sure they all hold. (There’d have to be some kind of trust-me-really override too.)

Iterative estimators

It should be pretty easy to automate iterative fitting processes. Instead of making the estimator take arguments for tolerances and iterations and such, just supply code to run a single update step:

(define-estimator (mean-difference xs ys)
  #:for two-normal-means-mean-diff
  #:assuming ...

  #:iterative #t

  ;; calculate a reasonable starting value from the data
  #:start-values (some-function xs ys)
  ;; given a parameter value, update using this function
  #:update-step (some-other-function xs ys))

Then the estimator would automatically run an iterative method until convergence, and would have keyword arguments for convergence parameters:

;; use default options
(mean-difference xs ys)

;; set tolerance and maximum iteration count
(mean-difference xs ys #:tolerance 1e-10 #:max-iterations 200)

Then there could be methods to get convergence diagnostics, trace the steps taken to convergence, and so on.

Automated diagnostics

If we provide data and an estimator, we can fit a model. Since we’ve described everything in such an abstract, high-level way, presumably we can do interesting things like sensitivity analyses: automatically drop data points and view the differences in estimates. Or automatic power analysis: specify a model, estimator, and assumptions on the data, and immediately get a power estimate based on a bunch of simulations made under those assumptions.

First-class objects

The models and estimators above should be first-class: besides declaring them, we should be able to perform operations upon them. The syntax above doesn’t really provide much leeway for this, so maybe it needs a rethink. Consider the operations we’d like to perform:

Adding terms to models. We might add a penalty term to regression to see what happens, or fuse coefficients, or whatever. How do we do this? Adding a penalty term may mean adding to the log-likelihood, adding another parameter, changing the generative model, and so on.
Compose estimators. An estimator can be made from several others, e.g. regression coefficients building on standard error estimators.
Combine several sets of assumptions, or remove assumptions from a set to see what happens.

Why bother?

For data analysts, the language proposed above provides a few things:

Explicit statements of models, hypotheses, and assumptions. Rather than writing code that does a bunch of math and prints out numbers, we see directly what models are used under what assumptions as part of the code, not just the descriptive text surrounding it (or often omitted).
Each estimator and model self-documents its assumptions, distributional properties, and so on.
For any hypothesis, model, or estimator we want, much of the supporting code is already built in – optimizers, sampling from models, power analysis, and so on.

But a trickier question is: what’s in it for statisticians developing new methods, rather than using existing ones?