# Probabilistic programming

Alex Reinhart – Updated February 28, 2017

Contrast with Statistical programming languages.

Warning: I do not know what I am talking about.

For the basic concepts, see The Design and Implementation of Probabilistic Programming Languages, by Noah D. Goodman and Andreas Stuhlmüller. However, I can’t find a good high-level overview (that is, well-suited to a statistician who already understands probabilistic models, not a computer scientist) anywhere.

A almost-but-not-quite-entirely-wrong description would be this: A probabilistic program generates outputs via some code, which may depend on random events (a coin flip, a draw from a distribution, etc.) dependent on some parameters. Naturally we can sample from the outputs by just repeatedly running the program; however, the number of possible execution paths through the program is exponential in the number of random choices it involves, so exploring the entire distribution quickly becomes intractable. (Continuous random variables make the space of execution paths infinite.) There are a variety of MCMC and other approaches to get an approximate distribution efficiently.

We can also work in reverse: condition on data. If we’ve naively enumerated all possible execution paths through the program, we can select out those that result in the observed data, and obtain the conditional distributions of all the random values we used in the program. (These could be parameters drawn from a prior distribution, say.) Hence we can perform inference on useful model parameters. Again, enumeration is ridiculously inefficient for most types of models, so research focuses on clever ways to avoid it and obtain approximate solutions.

A simple example from the Venture tutorial:

``````assume intercept = normal(0, 10);
assume slope = normal(0, 10)

assume line = proc(x) { slope * x + intercept }
assume obs = proc(x) { normal(line(x), 1) }

observe obs(1) = 2;
observe obs(2) = 2;

infer conditional;

sample list(intercept, slope)
``````

So we specify priors on the slope and intercept, specify how the line and observations are computed, specify observations, and can then condition on those observations, and then sample from the slope and intercept conditional distributions. We could also draw samples from `obs`, presumably, by repeatedly sampling new `slope` and `intercept` values from their priors.

## Languages

A good overview (from March 2015) is The State of Probabilistic Programming. There’s a huge variety of probabilistic programming languages; it seems many explore the tradeoffs between expressivity and efficiency, by either (a) focusing on expressing certain types of models and then having very efficient inference algorithms which exploit their structure, or (b) being more flexible but hence requiring more general (and less performant) inference algorithms.

Since there are a million different probabilistic languages in various states of maturity, I’ll just list whichever caught my eye here. A bigger list is available here.

• Stan is specialized for Bayesian models (particularly hierarchical models) by specifying the density of the data in terms of parameters and their priors, then supports specialized Monte Carlo methods to simulate from posteriors of parameters (among other things).

• WebPPL runs in the browser and seems to focus on getting the outcome distribution, not conditioning on observations.

• Venture adds all sorts of algorithms for conditioning on observations.

• Edward (paper) introduces a notation for specifying the inference algorithm as well as the model, so it’s possible to build general models but obtain good performance by specializing the inference for that particular model, instead of relying on some very general algorithm provided by the language.

• DrBayes (paper, dissertation) implements measure-theoretic probability directly. (I guess this is the nuclear option.) Random variables are functions from a sample space to the state space. No densities are used, so variables with no density (e.g. truncated values) are handled easily. To condition on an outcome, obtain its preimage in the sample space and obtain the distribution of other variables from the preimage. (The actual implementation of this is beyond me.)