Contrast with Probabilistic programming.
I do not find R to be a very satisfying statistical programming language. It has a lot of features for easy interactive use, but misses features where they count. There are few good data structures beyond data frames, weird scoping rules that make a high performance interpreter nearly impossible, and limited ability to manage code complexity and ensure high quality.
I’d describe R as “a programming language designed by statisticians.” But let’s turn the question around: what would a statistical programming language look like? One where statistical concepts were a core part of the language? Where, say, distributions are first-class objects, rather than just a collection of functions to sample and get quantiles? Where our code expresses the assumptions and claims we make about procedures instead of just mundane computational details?
See also Teaching programming.
“Key attributes of a modern statistical computing tool”, on what a statistical language should include. https://arxiv.org/abs/1610.00985Basically: a statistical language needs to handle the entire workflow, from playing with interactive graphics as EDA, reformatting data, turning data into a narrative, and publishing a reproducible document with the embedded analysis. It should be easy to play with randomization and bootstrapping, be self-documenting, and be easily extended with new models and new features. And it should be easy for beginners.
Next, some prior work.
infer, an R package which provides a DSL for specifying hypotheses, sampling from null distributions, and calculating test statistics:
Ross, K., & Sun, D. L. (2019). Symbulate: Simulation in the language of probability. Journal of Statistics Education, 27(1), 12–28. doi:10.1080/10691898.2019.1600387Symbulate, a Python-based DSL for probability that has clear concepts of probability spaces, random variables, conditioning, and simulation, with the goal of making it easy to do in-class simulation activities in a syntax similar to probability theory and where probability concepts are first-class objects in the language, so students do not need to switch to programmatic thinking instead. Also can detect common misconceptions by making them errors.
Common Lisp Statistics seems dead-ish, but a presentation by the authors proposes a fascinating idea: can’t our statistical language encode the assumptions and claims we make about statistical procedures, and even check them through simulation?
(define-theorem my-proposed-theorem (:theorem-type '(distribution-properties frequentist likelihood)) (:assumes '(assumption-1 assumption-2)) (:likelihood-form (defun likelihood (data theta gamma) (exponential-family theta gamma))) (:compute-by '(progn (compute-start-values thetahat gammahat) (until (convergence) (setf convergence (or (step-1 thetahat) (step-2 gammahat)))))) (:claim (equal-distr '(thetahat gammahat) 'normal)))
It’d be nice to build tabular data straight in, like R’s data frames but more so: with built-in efficient querying, joining, and transformations. Previous DSLs for data munging:
Postmodern, a PostgreSQL package for Common Lisp, which looks a bit like this:
SxQL, another Common Lisp SQL generator:
LINQ from .NET.
dplyr for R, which chains functions for manipulating data, based on the basic operations: filtering, selecting, grouping, summarizing, and arranging:
data.table has its own peculiar R syntax for these operations, and supports indexed tables.
StructuredQueries.jl builds a graph representing the query (in a macro), then a
collect call executes the query and collects its results into a table. Supports building queries with placeholders, then passing in placeholders later. Would be an interesting exercise with S-expressions.
Incidentally, Feather is meant to be an interoperable standard for data tables.
It would be good to provide uniform access to all kinds of tables: CSV files, database tables, built-in data frames, etc. I think StructuredQueries.jl has the right idea: an abstract query interface which can be applied to data from many sources makes the most sense, and lets us avoid inefficient idioms like
df[df$foo > mean(df$foo)], which involves all sorts of unnecessary allocation and looping.
A key motivator here is a simple problem: code is disconnected from presentation of results. We often see people copying and pasting numbers from results tables into their papers, often screwing up results as they go. Or people work in the R console, pasting bits into a script or R Markdown file as they go, but lose pieces on the way, or do some of their data manipulation with ad hoc commands that are never recorded. Data might be transformed or outliers dropped, the results saved to a file, and those steps were never recorded.
But let’s consider another aspect of making report generation practical. We should be able to write our analysis into a script or report and tinker with it, without waiting for the entire analysis to re-run every time we change a figure caption. How do we avoid that?
Suppose we build a system like Common Lisp Statistics above, where we can annotate estimators with their assumptions and their distributional properties. Then we could specify a null and automatically derive a test: under the null, the parameter has a certain value, and we can run the estimator, and pass its result to some function that uses the estimate and the distributional assumptions to return our p value.
Could we compose estimators this way and get correct p values automatically? For example, consider a simple test of means where I first test for normality and then decide whether to do a t test or a nonparametric test. The overall test has a different alpha level because of the choice being made – can I capture that and get a correct p value automatically?
We can treat hypotheses like generators in a QuickCheck framework: combinators of properties (e.g.
(Vectorof Normal)) which support generation of arbitrary data from the hypothesis. This would let us draw from models under hypotheses, either from parametric simulation or bootstrapping (since presumably you could write a hypothesis such as “distributed like this data”).
Let’s consider the pieces we need:
A model has some parameters, a generative process, data slots, and perhaps a log-likelihood or other objective function. There might also be some parameters for the type of model, like the assumptions it places on the error distribution or the number of variables in an ANOVA. There are also inherent assumptions.
Consider a model for a simple two-group comparison. We have a parameter for the distribution of errors, then parameters for the means.
;; groups is a vector of a binary factor indicating which group each data point ;; is in ;; following Typed Racket style of type declarations (define-model (groupwise-means [groups : (Vectorof Factor)] #:error-dist [error-dist Normal]) : (Vectorof Real) #:parameters ([group-means : (Map Factor Real)] [variance : Real]) #:generate-one (lambda (group) (+ (group-means group) (sample error-dist))) #:log-likelihood (+ (sum (error-dist (.- xs mean-x))) (sum (error-dist (.- ys (+ mean-x mean-diff))))))
We can instantiate a model with, say,
(two-means Normal-Error), where
Normal-Error is a normal distribution centered at 0 whose error is fit automatically.
Questions: Can we encode the independence assumption on
ys? How do we write a generating function for
ys for models where they might depend on each other (e.g. regression)? Can the log-likelihood be determined automatically from the generating process?
An estimator is for a specific parameter in a model. The estimator, under certain assumptions, has some useful properties, such as distributional properties. There can be multiple ways to estimate a given parameter, so we expose multiple estimators.
(define-estimator (mean-difference xs ys) #:for two-groups-mean-diff ;; Require the inputs to have the same assumed standard ;; deviation. #:assuming ([xs (Vectorof (Normal mean1 stddev))] [ys (Vectorof (Normal mean2 stddev))]) ;; my stddev calculation is wrong, but whatever #:distributed-as (Normal (- mean2 mean1) (/ (* stddev 2) (length xs))) (- (mean ys) (mean xs)))
Questions: Are there estimators for which there are multiple different distributional properties, under different sets of assumptions? How do we fit a model to data, given some estimators?
Some thought needs to go into the API so it’s easy to do estimators for multiple parameters (e.g. if we estimate both the slope and intercept of a linear regression simultaneously).
There are also cases when we can use the log-likelihood to derive an estimator automatically.
How do we compose these features? Suppose we have, in some imaginary syntax,
;; Returns distribution object representing the distribution ;; under this hypothesis. Since mean-difference gives a ;; parametric form for its sampling distribution, in terms ;; of the model parameters, and the assumptions are met, ;; the sampling is done parametrically. (assuming ([xs (Vectorof (Normal 1 1))] [ys (Vectorof (Normal 0 1))]) (mean-difference xs ys)) ;; On the other hand, if the assumptions aren't met, we recognize ;; this and do automatic parametric bootstrapping instead. ;; Notice the differing standard deviations. (assuming ([xs (Vectorof (Normal 1 2))] [ys (Vectorof (Normal 0 1))]) (mean-difference xs ys))
define-estimator gives assumptions for its distribution, which would require some kind of structural pattern matching on the assumptions to be sure they all hold. (There’d have to be some kind of trust-me-really override too.)
It should be pretty easy to automate iterative fitting processes. Instead of making the estimator take arguments for tolerances and iterations and such, just supply code to run a single update step:
Then the estimator would automatically run an iterative method until convergence, and would have keyword arguments for convergence parameters:
Then there could be methods to get convergence diagnostics, trace the steps taken to convergence, and so on.
If we provide data and an estimator, we can fit a model. Since we’ve described everything in such an abstract, high-level way, presumably we can do interesting things like sensitivity analyses: automatically drop data points and view the differences in estimates. Or automatic power analysis: specify a model, estimator, and assumptions on the data, and immediately get a power estimate based on a bunch of simulations made under those assumptions.
The models and estimators above should be first-class: besides declaring them, we should be able to perform operations upon them. The syntax above doesn’t really provide much leeway for this, so maybe it needs a rethink. Consider the operations we’d like to perform:
For data analysts, the language proposed above provides a few things:
But a trickier question is: what’s in it for statisticians developing new methods, rather than using existing ones?