Statistical programming languages

Alex Reinhart – Updated August 8, 2017 notebooks · refsmmat.com

Contrast with Probabilistic programming.

I do not find R to be a very satisfying statistical programming language. It has a lot of features for easy interactive use, but misses features where they count. There are few good data structures beyond data frames, weird scoping rules that make a high performance interpreter nearly impossible, and limited ability to manage code complexity and ensure high quality.

I’d describe R as “a programming language designed by statisticians.” But let’s turn the question around: what would a statistical programming language look like? One where statistical concepts were a core part of the language? Where, say, distributions are first-class objects, rather than just a collection of functions to sample and get quantiles? Where our code expresses the assumptions and claims we make about procedures instead of just mundane computational details?

See also Teaching programming.

Prior discussion

Next, some prior work.

Statistical DSLs

Data querying DSLs

It’d be nice to build tabular data straight in, like R’s data frames but more so: with built-in efficient querying, joining, and transformations. Previous DSLs for data munging:

Incidentally, Feather is meant to be an interoperable standard for data tables.

It would be good to provide uniform access to all kinds of tables: CSV files, database tables, built-in data frames, etc. I think StructuredQueries.jl has the right idea: an abstract query interface which can be applied to data from many sources makes the most sense, and lets us avoid inefficient idioms like df[df$foo > mean(df$foo)], which involves all sorts of unnecessary allocation and looping.

Report generation

A key motivator here is a simple problem: code is disconnected from presentation of results. We often see people copying and pasting numbers from results tables into their papers, often screwing up results as they go. Or people work in the R console, pasting bits into a script or R Markdown file as they go, but lose pieces on the way, or do some of their data manipulation with ad hoc commands that are never recorded. Data might be transformed or outliers dropped, the results saved to a file, and those steps were never recorded.

But let’s consider another aspect of making report generation practical. We should be able to write our analysis into a script or report and tinker with it, without waiting for the entire analysis to re-run every time we change a figure caption. How do we avoid that?

Modeling and hypotheses

Suppose we build a system like Common Lisp Statistics above, where we can annotate estimators with their assumptions and their distributional properties. Then we could specify a null and automatically derive a test: under the null, the parameter has a certain value, and we can run the estimator, and pass its result to some function that uses the estimate and the distributional assumptions to return our p value.

Could we compose estimators this way and get correct p values automatically? For example, consider a simple test of means where I first test for normality and then decide whether to do a t test or a nonparametric test. The overall test has a different alpha level because of the choice being made – can I capture that and get a correct p value automatically?

We can treat hypotheses like generators in a QuickCheck framework: combinators of properties (e.g. (Vectorof Normal)) which support generation of arbitrary data from the hypothesis. This would let us draw from models under hypotheses, either from parametric simulation or bootstrapping (since presumably you could write a hypothesis such as “distributed like this data”).

Let’s consider the pieces we need:

Models

A model has some parameters, a generative process, data slots, and perhaps a log-likelihood or other objective function. There might also be some parameters for the type of model, like the assumptions it places on the error distribution or the number of variables in an ANOVA. There are also inherent assumptions.

Consider a model for a simple two-group comparison. We have a parameter for the distribution of errors, then parameters for the means.

;; groups is a vector of a binary factor indicating which group each data point
;; is in
;; following Typed Racket style of type declarations
(define-model (groupwise-means [groups : (Vectorof Factor)]
                               #:error-dist [error-dist Normal]) : (Vectorof Real)
  #:parameters ([group-means : (Map Factor Real)]
                [variance : Real])
  #:generate-one (lambda (group) (+ (group-means group)
                                    (sample error-dist)))
  #:log-likelihood (+ (sum (error-dist (.- xs mean-x)))
                      (sum (error-dist (.- ys (+ mean-x mean-diff))))))

We can instantiate a model with, say, (two-means Normal-Error), where Normal-Error is a normal distribution centered at 0 whose error is fit automatically.

Questions: Can we encode the independence assumption on xs and ys? How do we write a generating function for xs and ys for models where they might depend on each other (e.g. regression)? Can the log-likelihood be determined automatically from the generating process?

Estimators

An estimator is for a specific parameter in a model. The estimator, under certain assumptions, has some useful properties, such as distributional properties. There can be multiple ways to estimate a given parameter, so we expose multiple estimators.

(define-estimator (mean-difference xs ys)
  #:for two-groups-mean-diff
  ;; Require the inputs to have the same assumed standard
  ;; deviation.
  #:assuming ([xs (Vectorof (Normal mean1 stddev))]
              [ys (Vectorof (Normal mean2 stddev))])
  ;; my stddev calculation is wrong, but whatever
  #:distributed-as (Normal (- mean2 mean1)
                           (/ (* stddev 2) (length xs)))
  (- (mean ys) (mean xs)))

Questions: Are there estimators for which there are multiple different distributional properties, under different sets of assumptions? How do we fit a model to data, given some estimators?

Some thought needs to go into the API so it’s easy to do estimators for multiple parameters (e.g. if we estimate both the slope and intercept of a linear regression simultaneously).

There are also cases when we can use the log-likelihood to derive an estimator automatically.

Tests

How do we compose these features? Suppose we have, in some imaginary syntax,

;; Returns distribution object representing the distribution
;; under this hypothesis. Since mean-difference gives a
;; parametric form for its sampling distribution, in terms
;; of the model parameters, and the assumptions are met,
;; the sampling is done parametrically.
(assuming ([xs (Vectorof (Normal 1 1))]
           [ys (Vectorof (Normal 0 1))])
          (mean-difference xs ys))

;; On the other hand, if the assumptions aren't met, we recognize
;; this and do automatic parametric bootstrapping instead.
;; Notice the differing standard deviations.
(assuming ([xs (Vectorof (Normal 1 2))]
           [ys (Vectorof (Normal 0 1))])
          (mean-difference xs ys))

Notice that define-estimator gives assumptions for its distribution, which would require some kind of structural pattern matching on the assumptions to be sure they all hold. (There’d have to be some kind of trust-me-really override too.)

Iterative estimators

It should be pretty easy to automate iterative fitting processes. Instead of making the estimator take arguments for tolerances and iterations and such, just supply code to run a single update step:

(define-estimator (mean-difference xs ys)
  #:for two-normal-means-mean-diff
  #:assuming ...

  #:iterative #t

  ;; calculate a reasonable starting value from the data
  #:start-values (some-function xs ys)
  ;; given a parameter value, update using this function
  #:update-step (some-other-function xs ys))

Then the estimator would automatically run an iterative method until convergence, and would have keyword arguments for convergence parameters:

;; use default options
(mean-difference xs ys)

;; set tolerance and maximum iteration count
(mean-difference xs ys #:tolerance 1e-10 #:max-iterations 200)

Then there could be methods to get convergence diagnostics, trace the steps taken to convergence, and so on.

Automated diagnostics

If we provide data and an estimator, we can fit a model. Since we’ve described everything in such an abstract, high-level way, presumably we can do interesting things like sensitivity analyses: automatically drop data points and view the differences in estimates. Or automatic power analysis: specify a model, estimator, and assumptions on the data, and immediately get a power estimate based on a bunch of simulations made under those assumptions.

First-class objects

The models and estimators above should be first-class: besides declaring them, we should be able to perform operations upon them. The syntax above doesn’t really provide much leeway for this, so maybe it needs a rethink. Consider the operations we’d like to perform:

Why bother?

For data analysts, the language proposed above provides a few things:

But a trickier question is: what’s in it for statisticians developing new methods, rather than using existing ones?