Blockly for statistics

Alex Reinhart – Updated March 18, 2017 · notebooks · refsmmat.com

A concept for an interactive statistics exploration tool built using Blockly.

High-level overview: the first block is the data frame. Blocks attached to it operate on it: draw a resampled data frame, permute a column, group by a variable, calculate aggregate functions, and plot results. Students can experiment with statistical methods by rearranging blocks to try new things.

Concept outline below. Is this flexible enough to handle everything we’d want in intro stats?

The data block

We start with a data block. This would be provided and not configurable. It would specify the variables, column types, and so on.

Ideally we’d have a bank of datasets with datapackage.json descriptions, from which the user could choose any dataset.

Data would be represented as an object with a data dictionary (column name => array of data), a set of type specifications, and an array of grouping variables. The grouping array would start empty.
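
A minimal sketch of that representation, written in Python only to pin down the idea (a real Blockly tool would more likely run in JavaScript). The class name `Dataset`, its field names, and the type labels "factor" and "numeric" are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Dataset:
    """Hypothetical container with the three parts described above."""
    columns: dict                               # column name -> list of values
    types: dict                                 # column name -> type label (assumed labels)
    groups: list = field(default_factory=list)  # grouping variables, empty at first

    def n_rows(self):
        # Every column is assumed to have the same length.
        first = next(iter(self.columns.values()), [])
        return len(first)


data = Dataset(
    columns={"X": ["a", "a", "b", "b"], "Y": [1.0, 2.0, 3.0, 5.0]},
    types={"X": "factor", "Y": "numeric"},
)
```

Storing columns rather than rows keeps "apply a function to one variable" cheap, which is what most of the blocks below do.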

Resampling and permutation

A “sample” block, with a “number of samples” argument, would select rows at random from the data, returning a new dataset with the chosen number of rows. There’d be a “with replacement”/“without replacement” selector.
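
One way the “sample” block could behave, as a Python sketch. The function name `sample`, its parameters, and the plain dictionary-of-columns input are hypothetical; the only idea taken from the outline is drawing whole rows, with or without replacement.

```python
import random


def sample(columns, n, replace=True, rng=random):
    """Return a new column dictionary holding n randomly chosen rows."""
    n_rows = len(next(iter(columns.values())))
    if replace:
        idx = [rng.randrange(n_rows) for _ in range(n)]
    else:
        if n > n_rows:
            raise ValueError("cannot draw more rows than exist without replacement")
        idx = rng.sample(range(n_rows), n)
    # Use the same row indices for every column so rows stay intact.
    return {name: [values[i] for i in idx] for name, values in columns.items()}


data = {"X": ["a", "a", "b", "b"], "Y": [1.0, 2.0, 3.0, 5.0]}
boot = sample(data, 4, replace=True, rng=random.Random(1))   # bootstrap resample
sub = sample(data, 2, replace=False, rng=random.Random(1))   # subsample
```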

A “permute” block would operate on one variable in the dataset, permuting its values at random. If applied to a dataset that had already been grouped, it would work within each group, and it would be an error to try to permute the variable already chosen for grouping. (Do these restrictions make sense? Would within-group permutation be useful in some cases – maybe when you have two factors and only want to test a hypothesis about one of them?)
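
To make the proposed restrictions concrete, here is a hedged Python sketch. The name `permute` and its signature are made up; with an empty grouping array it shuffles the whole column, otherwise it shuffles only within each group and refuses to permute a grouping variable.

```python
import random


def permute(columns, groups, var, rng=random):
    """Shuffle `var`; if `groups` is non-empty, shuffle only within each group."""
    if var in groups:
        raise ValueError(f"cannot permute grouping variable {var!r}")
    # Bucket row indices by their combination of grouping levels.
    # With no grouping, every row lands in the single bucket ().
    buckets = {}
    for i in range(len(columns[var])):
        key = tuple(columns[g][i] for g in groups)
        buckets.setdefault(key, []).append(i)
    new_values = list(columns[var])
    for rows in buckets.values():
        shuffled = [columns[var][i] for i in rows]
        rng.shuffle(shuffled)
        for i, value in zip(rows, shuffled):
            new_values[i] = value
    out = dict(columns)          # other columns are shared, not copied
    out[var] = new_values
    return out


data = {"X": ["a", "a", "a", "b", "b", "b"], "Y": [1, 2, 3, 4, 5, 6]}
overall = permute(data, [], "Y", rng=random.Random(0))
within = permute(data, ["X"], "Y", rng=random.Random(0))
```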

Grouping

A simple “group by” block would take in a data frame and a variable name, and group by the levels of that variable. That variable’s name would be added to the grouping array. The data wouldn’t actually be transformed in any way – subsequent functions just know to operate on the groupings.

“Group by” must be composable, so we can group by multiple factors, e.g. if we have a two-way ANOVA and want to get means for each combination of two separate factor variables.
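
A small Python sketch of composable grouping under these rules. The dataset is a plain dictionary with assumed keys "columns" and "groups", and `group_rows` is a hypothetical helper that later blocks could call to find out which rows belong to which group.

```python
def group_by(dataset, var):
    """Return a dataset with `var` appended to the grouping array."""
    if var not in dataset["columns"]:
        raise KeyError(f"no column named {var!r}")
    groups = list(dataset["groups"])
    if var not in groups:
        groups.append(var)
    # The columns are passed through untouched; only the metadata changes.
    return {"columns": dataset["columns"], "groups": groups}


def group_rows(dataset):
    """Map each combination of grouping levels to its row indices."""
    rows = {}
    n_rows = len(next(iter(dataset["columns"].values())))
    for i in range(n_rows):
        key = tuple(dataset["columns"][g][i] for g in dataset["groups"])
        rows.setdefault(key, []).append(i)
    return rows


data = {
    "columns": {
        "A": ["lo", "lo", "hi", "hi"],
        "B": ["u", "v", "u", "v"],
        "Y": [1.0, 2.0, 3.0, 5.0],
    },
    "groups": [],
}
two_way = group_by(group_by(data, "A"), "B")   # one cell per (A, B) combination
cells = group_rows(two_way)
```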

Functions

Functions would be split into two kinds, ordinary functions and aggregate functions, much as SQL distinguishes scalar functions from aggregate functions.

Ordinary function blocks would be elementwise transformations such as “square” or “log”, maybe with drop-down selectors for various options. There’d also be a selector for the variable to which the function is applied. The output from the block would be a new dataset with the transformation applied.
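
A sketch of an ordinary function block in Python. The `TRANSFORMS` table and the `apply_function` name are hypothetical, and the sketch assumes the transformed values replace the original column under the same name; the outline leaves open whether they should instead get a new name such as “log(Y)”.

```python
import math

# Assumed menu of transformations a drop-down selector might offer.
TRANSFORMS = {
    "square": lambda v: v * v,
    "log": math.log,
}


def apply_function(columns, name, var):
    """Return a new column dictionary with `var` transformed elementwise."""
    func = TRANSFORMS[name]
    out = dict(columns)
    out[var] = [func(v) for v in columns[var]]   # one output row per input row
    return out


data = {"X": ["a", "a", "b", "b"], "Y": [1.0, 2.0, 3.0, 5.0]}
squared = apply_function(data, "square", "Y")
logged = apply_function(data, "log", "Y")
```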

Aggregate functions operate over groups, as defined in the grouping variable array. Each has a variable selected to operate on and applies the function to that variable within each group, reducing each group to a single result. The output would be a new grouped dataset in which the new variable is named, say, “mean(x)” and the grouping variables are retained. All other variables are dropped.
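
A Python sketch of an aggregate block, for comparison with the ordinary functions above. The `aggregate` name and its parameters are assumptions; the output naming and the dropping of other variables follow the description.

```python
from statistics import mean


def aggregate(columns, groups, func, func_name, var):
    """Reduce `var` to one value per group; keep only grouping columns and the result."""
    buckets = {}
    for i, value in enumerate(columns[var]):
        key = tuple(columns[g][i] for g in groups)
        buckets.setdefault(key, []).append(value)
    result_name = f"{func_name}({var})"
    out = {g: [] for g in groups}
    out[result_name] = []
    for key, values in buckets.items():
        for g, level in zip(groups, key):
            out[g].append(level)          # grouping variables are retained
        out[result_name].append(func(values))
    return out


data = {
    "X": ["a", "a", "b", "b"],
    "Y": [1.0, 2.0, 3.0, 5.0],
    "Z": [10, 20, 30, 40],                # dropped by the aggregate
}
means = aggregate(data, ["X"], mean, "mean", "Y")
overall = aggregate(data, [], mean, "mean", "Y")   # no grouping: a single row
```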

Flow control

There’d be a simple “Repeat _ times” block, looking like one of Blockly’s loop blocks, which can take any number of other blocks inside it. This block would run the enclosed statements repeatedly, always starting from the same input, and gather the output of the last enclosed statement from each repetition. It’d simply concatenate those data frames, keeping any grouping applied.
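
Roughly what the repeat block would do, sketched in Python. The enclosed blocks are modelled as one function `body` that takes and returns a pair of columns and grouping array; that interface, and the `bootstrap_mean` stand-in, are assumptions for illustration.

```python
import random


def repeat(times, columns, groups, body):
    """Run `body` on the same input `times` times and concatenate the outputs."""
    combined, out_groups = {}, list(groups)
    for _ in range(times):
        # Each pass starts from the original input, not the previous output.
        result, out_groups = body(columns, groups)
        for name, values in result.items():
            combined.setdefault(name, []).extend(values)
    return combined, out_groups


def bootstrap_mean(columns, groups):
    # Stand-in for the enclosed blocks: resample Y with replacement, then take the mean.
    n = len(columns["Y"])
    draw = [random.choice(columns["Y"]) for _ in range(n)]
    return {"mean(Y)": [sum(draw) / n]}, groups


data = {"Y": [1.0, 2.0, 3.0, 5.0]}
means, groups = repeat(100, data, [], bootstrap_mean)
```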

Visualization

There’d be a few blocks for plotting results. A boxplot block, for example, would take data that’s been grouped and a variable name, and plot the variable within each group.
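
The drawing itself would be left to a plotting library, but the numbers a boxplot block needs could be computed as in this Python sketch. The `boxplot_stats` name is hypothetical, the whiskers here are simply the minimum and maximum, and the quartiles use the “inclusive” method from Python's `statistics` module; a real tool might choose other conventions.

```python
from statistics import quantiles


def boxplot_stats(columns, groups, var):
    """Five-number summary of `var` within each group."""
    buckets = {}
    for i, value in enumerate(columns[var]):
        key = tuple(columns[g][i] for g in groups)
        buckets.setdefault(key, []).append(value)
    stats = {}
    for key, values in buckets.items():
        # Needs at least two values per group.
        q1, median, q3 = quantiles(values, n=4, method="inclusive")
        stats[key] = {
            "min": min(values), "q1": q1, "median": median,
            "q3": q3, "max": max(values),
        }
    return stats


data = {
    "X": ["a"] * 5 + ["b"] * 5,
    "Y": [1, 2, 3, 4, 5, 10, 20, 30, 40, 50],
}
stats = boxplot_stats(data, ["X"], "Y")
```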

Example use case

Suppose the user has selected a simple dataset: a grouping variable X and a continuous outcome Y. Classic t test material.

The data block is on the screen. The block structure needed for a permutation test would be something like:

  1. Data block.
  2. Repeat 1000 times
    1. Permute X
    2. Group by X
    3. mean(Y)
  3. Boxplot of mean(Y)
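
As a sanity check that these blocks compose, here is the same pipeline as a self-contained Python sketch. All function names and the toy data are made up for illustration, and the final boxplot is left out; the sketch only produces the rows that block would receive.

```python
import random
from statistics import mean


def permute(columns, var, rng):
    """Shuffle one column, leaving the others alone."""
    out = dict(columns)
    out[var] = list(columns[var])
    rng.shuffle(out[var])
    return out


def group_means(columns, group_var, var):
    """Group by `group_var`, then reduce `var` to its mean in each group."""
    buckets = {}
    for level, value in zip(columns[group_var], columns[var]):
        buckets.setdefault(level, []).append(value)
    levels = sorted(buckets)
    return {group_var: levels, f"mean({var})": [mean(buckets[lv]) for lv in levels]}


def permutation_means(columns, times, seed=0):
    rng = random.Random(seed)
    combined = {"X": [], "mean(Y)": []}
    for _ in range(times):                          # 2. Repeat
        shuffled = permute(columns, "X", rng)       # 2.1 Permute X
        result = group_means(shuffled, "X", "Y")    # 2.2 Group by X, 2.3 mean(Y)
        for name, values in result.items():
            combined[name].extend(values)           # concatenate each repetition
    return combined


data = {                                            # 1. Data block (toy values)
    "X": ["a"] * 5 + ["b"] * 5,
    "Y": [1.0, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0, 6.0, 7.0],
}
null = permutation_means(data, 1000)                # rows for 3. Boxplot of mean(Y)
```

Each repetition contributes one row per level of X, so the result has two rows per repetition, still grouped by X.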

Problems

Variable scoping

How do we handle scoping? In the example above, X and Y are available inside block 2, but by block 3 the only available variables are X and mean(Y). Blockly seems to prefer global scope, since children learning programming aren’t expected to understand lexical scoping. But here scoping arises naturally, because each operation changes the dataset that later blocks see; the statements contained in block 2 have a different scope from blocks 1 and 3.

Instead of providing each variable as its own block, we could use a single “variable” block with a dynamically populated dropdown menu of the variables currently in scope, and show a view of the data whenever the user clicks on a node so they can see its current structure. That seems unintuitive and clunky, though.
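
The list of variables in scope could at least be derived mechanically from the chain of blocks, as in this Python sketch. The tuple encoding of blocks and the `scope_after` name are hypothetical and have nothing to do with Blockly's own API.

```python
def scope_after(blocks, variables):
    """Return the variables visible after running `blocks` on a dataset."""
    in_scope, groups = list(variables), []
    for block in blocks:
        kind = block[0]
        if kind == "group_by":
            groups.append(block[1])
        elif kind == "aggregate":
            func, var = block[1], block[2]
            # Aggregates keep the grouping variables and drop everything else.
            in_scope = groups + [f"{func}({var})"]
        # Other blocks (permute, sample, ordinary functions) keep the list as-is.
    return in_scope


loop_body = [("permute", "X"), ("group_by", "X"), ("aggregate", "mean", "Y")]
inside = scope_after(loop_body[:2], ["X", "Y"])   # what the mean block can choose from
after = scope_after(loop_body, ["X", "Y"])        # what the boxplot block can choose from
```

In the permutation test example this gives X and Y inside the loop and X and mean(Y) after it.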