# Experimental design

Alex Reinhart – Updated September 29, 2018

Many experimental design textbooks, written for practicing scientists, don’t explain why we’re going to all of this trouble to design elaborate treatment allocations. Why, exactly, do I want to use a Latin square over some other allocation of treatments? In most cases, the purpose behind designs is control of estimation variance: by choosing treatment allocation carefully, we can obtain a treatment effect estimate that has the lowest possible variance, given our sample size constraints. This involves clever tricks like making treatment effects orthogonal to other effects by design.
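As a concrete example of designed orthogonality, here is a minimal sketch (my illustration, not from any particular textbook) of a Latin square built by cyclic shifts: each treatment appears exactly once in every row and every column, so the treatment effect is orthogonal to the row and column blocking effects by construction.

```python
# Build an n x n Latin square by cyclic shifts: each treatment appears
# exactly once in every row and every column, making treatment effects
# orthogonal to row and column (blocking) effects by design.
def latin_square(treatments):
    n = len(treatments)
    return [[treatments[(row + col) % n] for col in range(n)]
            for row in range(n)]

square = latin_square(["A", "B", "C", "D"])
for row in square:
    print(" ".join(row))
```

With four treatments this yields a 4×4 layout where comparing treatment means automatically averages over both blocking factors.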

• Mutz, D. C., Pemantle, R., & Pham, P. (2018). The perils of balance testing in experimental design: Messy analyses of clean data. The American Statistician. doi:10.1080/00031305.2017.1322143

Some people recommend that in a randomized experiment with assorted covariates (like patient demographics), you should test whether the covariates are roughly balanced between treatment and control groups, to check that the randomization was “successful”. This doesn’t make sense: under randomization the null hypothesis of the balance test is true by construction, so any “significant” imbalance is just chance. It’s particularly misguided when used as justification for including or excluding specific covariates from the model.

• [To read] Chaloner, K., & Verdinelli, I. (1995). Bayesian experimental design: A review. Statistical Science, 10(3), 273–304. doi:10.1214/ss/1177009939

## Examples

• Easterling, R. G. (2004). Teaching experimental design. The American Statistician, 58(3), 244–252. doi:10.1198/000313004x1477

Some useful examples for teaching experimental design.

• Hunter, W. G. (1977). Some ideas about teaching design of experiments, with 2^5 examples of experiments conducted by students. The American Statistician, 31(1), 12–17. doi:10.1080/00031305.1977.10479185

Discussion of teaching experimental design by having students design their own experiments, including 2^5 examples of experiments designed by students and some interesting pedagogical ideas, such as having students predict the outcome of the experiment in advance so they can be surprised.

## Online experiments

By which I mean “experiments performed on websites”, not “online” as in “online learning”.

• Tang, D., Agarwal, A., O’Brien, D., & Meyer, M. (2010). Overlapping experiment infrastructure: More, better, faster experimentation. In Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 17–26). doi:10.1145/1835804.1835810. https://ai.google/research/pubs/pub36500

An overview of Google’s 2010-era experiment infrastructure. Splits system parameters into “layers”; parameters from separate layers can be tweaked independently without fear of conflict (e.g. some combination of parameter values producing unreadable pages). Users can then be in one experiment per layer, instead of just one experiment. Uses A/A experiments to track typical variance in metrics, for use in power analyses.
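The layer idea can be sketched with a toy assignment function (the hashing scheme, salts, and layer names here are my own invention for illustration, not Google's actual infrastructure): hashing the user ID with a layer-specific salt makes a user's arm in one layer independent of their arm in every other layer, so each user can be in one experiment per layer.

```python
import hashlib

# Toy sketch of layered experiment assignment (details invented for
# illustration): each layer hashes the user ID with a layer-specific
# salt, so assignments across layers are effectively independent.
def arm(user_id, layer, arms):
    digest = hashlib.sha256(f"{layer}:{user_id}".encode()).hexdigest()
    return arms[int(digest, 16) % len(arms)]

# The same user gets an independent assignment in each layer:
user = "user-12345"
print(arm(user, "ui-layer", ["control", "blue-links", "green-links"]))
print(arm(user, "ranking-layer", ["control", "new-ranker"]))
```

Parameters that could conflict (e.g. produce unreadable pages in combination) live in the same layer, where a user sees only one experiment at a time.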

• Kohavi, R., Deng, A., Frasca, B., Walker, T., Xu, Y., & Pohlmann, N. (2013). Online controlled experiments at large scale. In Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 1168–1176). doi:10.1145/2487575.2488217

A good discussion of how Microsoft uses controlled experiments. Sensitivity is important, because small improvements add up to millions of dollars of revenue annually. Takes the interesting position that “interactions are relatively rare and more often represent bugs than true statistical interactions”, so they do one-at-a-time experiments instead of multi-factor experiments.

• Deng, A., Xu, Y., Kohavi, R., & Walker, T. (2013). Improving the sensitivity of online controlled experiments by utilizing pre-experiment data. In Sixth ACM International Conference on Web Search and Data Mining (pp. 123–132). doi:10.1145/2433396.2433413

Proposes the CUPED (Controlled-experiment Using Pre-Experiment Data) method for reducing variance in treatment effect estimation. Rejecting ANCOVA for pre-treatment covariate adjustment because of its supposedly restrictive assumptions, they propose a method based on “control variates”, which seems to be identical to ANCOVA with one covariate. I find this confusing.
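The core adjustment is simple enough to sketch (my own dependency-free illustration of the idea, not the paper's code): take a pre-experiment metric X correlated with the in-experiment metric Y, and subtract theta * (X - mean(X)) with theta = cov(X, Y) / var(X). The adjusted mean is unchanged, but its variance shrinks by the squared correlation between X and Y.

```python
import random

# Minimal CUPED / one-covariate control-variate sketch: adjust the
# outcome y by a pre-experiment covariate x. The adjusted values have
# the same mean as y but lower variance when x and y are correlated.
def cuped_adjust(y, x):
    n = len(y)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
    var = sum((a - mx) ** 2 for a in x) / n
    theta = cov / var
    return [b - theta * (a - mx) for a, b in zip(x, y)]

random.seed(0)
x = [random.gauss(10, 2) for _ in range(10000)]    # pre-experiment metric
y = [0.8 * a + random.gauss(0, 1) for a in x]      # correlated outcome

adj = cuped_adjust(y, x)
var = lambda v: sum((u - sum(v) / len(v)) ** 2 for u in v) / len(v)
print(var(y), var(adj))  # adjusted variance is much smaller
```

Written this way, it is hard to see how the method differs from regressing Y on X and analyzing the residuals (plus the grand mean), which is exactly what ANCOVA with one covariate does.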

• Poyarkov, A., Drutsa, A., Khalyavin, A., Gusev, G., & Serdyukov, P. (2016). Boosted decision tree regression adjustment for variance reduction in online controlled experiments. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 235–244). doi:10.1145/2939672.2939688

A cross of fancy machine learning and classic experimental design, following up on CUPED. (Fortunately they notice CUPED is just ANCOVA.) If ANCOVA is problematic because it assumes linear effects of covariates (though of course we could do nonparametric regression), why not predict the effects of covariates with decision trees? This would be unappealing in small-data experiments, where there’s hardly enough data to train a fancy machine learning algorithm, but in online experiments with millions of users, a fancy machine learning method may be entirely suitable for estimating covariate effects, and it reduces treatment effect estimation variance for the same reasons ANCOVA does.
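The general recipe works with any predictor fit on pre-experiment data, not just boosted trees. In this sketch (my illustration; the paper's gradient-boosted trees are swapped for a crude binned-mean predictor to keep the example dependency-free), we fit f on a pre-experiment covariate and analyze y - f(x) + mean(f(x)), which removes even a nonlinear covariate effect that linear ANCOVA would miss.

```python
import random

# Regression adjustment with an arbitrary predictor (a crude binned-mean
# model stands in for boosted trees): fit f on a pre-experiment
# covariate x, then analyze y - f(x) + mean(f(x)).
def fit_binned_mean(x, y, bins=20):
    lo, hi = min(x), max(x)
    width = (hi - lo) / bins or 1.0
    sums, counts = [0.0] * bins, [0] * bins
    for a, b in zip(x, y):
        i = min(int((a - lo) / width), bins - 1)
        sums[i] += b
        counts[i] += 1
    overall = sum(y) / len(y)
    means = [s / c if c else overall for s, c in zip(sums, counts)]
    return lambda a: means[min(max(int((a - lo) / width), 0), bins - 1)]

random.seed(1)
x = [random.uniform(0, 10) for _ in range(20000)]   # pre-experiment covariate
y = [(a - 5) ** 2 + random.gauss(0, 1) for a in x]  # nonlinear covariate effect

f = fit_binned_mean(x, y)
fbar = sum(map(f, x)) / len(x)
adj = [b - f(a) + fbar for a, b in zip(x, y)]

var = lambda v: sum((u - sum(v) / len(v)) ** 2 for u in v) / len(v)
print(var(y), var(adj))  # adjustment removes the nonlinear covariate effect
```

(In practice the predictor should be trained on pre-experiment data rather than in-sample, as the authors do; fitting and adjusting on the same data, as above, is only for illustration.)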

• Hohnhold, H., O’Brien, D., & Tang, D. (2015). Focusing on the long-term: It’s good for users and business. In Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 1849–1858). doi:10.1145/2783258.2788583

Describes the challenges facing experimental design in online experiments (ad experiments at Google). Sample size is not a problem. But there are per-user covariates, like ad blindness and willingness to use the product, which affect outcomes (ad revenue) but are also affected by the treatment (different ad placement parameters). A new ad placement scheme might improve revenue in the short term but train users to ignore ads in the long term.

This is counter to the usual problem in experimental design – that covariates might affect the choice of treatment, requiring us to randomize treatments to break the causal link. Here, the treatment affects the covariates, and the authors propose some clever ways to measure this effect, such as keeping some users in the treatment group for a long period while comparing them against users entered into the treatment group for only a day at a time. These measurements led to a change that “decreased the search ad load on Google’s mobile traffic by 50%, resulting in dramatic gains in user experience metrics” which “would be so great that the long-term revenue change would be a net positive.”