Meng, X.-L. (2018). Statistical paradises and paradoxes in big data (I): Law of large populations, big data paradox, and the 2016 US presidential election.

*The Annals of Applied Statistics*, *12*(2), 685–726. doi:10.1214/18-AOAS1161

Meng gives a fundamental identity for the difference between \bar G_N, the population mean of some quantity G in a population of size N, and the sample mean \bar G_n for a sample of size n:

\displaystyle \bar G_n - \bar G_N = \rho_{R,G} \times \sqrt{\frac{1 - f}{f}} \times \sigma_G,

where \rho_{R,G} is the population correlation between sampling response (R = 0 or 1 for each potential respondent, depending on whether they respond to the survey) and G, f = n/N is the sampled fraction, and \sigma_G is the population standard deviation of G. Hence the error factors into three pieces that Meng calls “Data Quality”, “Data Quantity”, and “Problem Difficulty”. Meng demonstrates that data quality matters a lot: even a very small \rho_{R,G} produces huge error when N is large, because for a fixed sampled fraction f the error does not shrink as n grows — this is Meng’s “Law of Large Populations”.
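Because all three factors are finite-population quantities (not expectations), the identity holds exactly for any realized response pattern, which makes it easy to check numerically. A minimal sketch with NumPy — the variable names and the response mechanism are mine, chosen only for illustration:

```python
import numpy as np

rng = np.random.default_rng(42)
N = 100_000
G = rng.normal(loc=0.0, scale=2.0, size=N)  # quantity of interest

# Non-random response: units with above-average G respond slightly more
# often, inducing a small positive correlation between R and G.
R = (rng.random(N) < 0.5 + 0.05 * (G > G.mean())).astype(float)

n = int(R.sum())
f = n / N
error = G[R == 1].mean() - G.mean()          # \bar G_n - \bar G_N

rho = np.corrcoef(R, G)[0, 1]                # Data Quality
quantity = np.sqrt((1 - f) / f)              # Data Quantity
sigma_G = G.std()                            # Problem Difficulty (ddof=0)

# The product reproduces the sampling error to floating-point precision.
print(error, rho * quantity * sigma_G)
```

Note that \sigma_G must be the population standard deviation (`ddof=0`, NumPy’s default) for the identity to be exact; the correlation itself is unaffected by the ddof convention since the normalizations cancel.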

For example, if the expected correlation between inclusion and outcome is just 0.05, sampling *half* the population gives us an MSE for \bar G_n equivalent to a random sample of just *400* people. If that were a poll of eligible US voters, that comes out to a “99.999965% reduction of the sample size, or equivalently estimation efficiency.”
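The 400 figure follows from equating the MSE implied by the identity with the variance of a simple random sample. A back-of-the-envelope sketch, under the assumed simplification that the SRS variance is \sigma_G^2 / n_{\mathrm{eff}} (i.e., ignoring the finite-population correction, which is negligible when n_{\mathrm{eff}} is tiny relative to N):

```python
# Equate sigma^2 / n_eff (SRS variance, FPC ignored) with
# rho^2 * (1 - f) / f * sigma^2 (squared error from the identity):
#   n_eff = f / ((1 - f) * rho^2)
rho = 0.05   # correlation between response and outcome
f = 0.5      # sampled fraction: half the population

n_eff = f / ((1 - f) * rho**2)
print(round(n_eff))  # 400
```

With f = 1/2 the data-quantity factor f/(1 - f) is 1, so the effective size collapses to 1/\rho^2 = 400, regardless of how many millions of people were actually sampled.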