Statistical misconceptions

A key part of teaching statistics is understanding the misconceptions students typically hold about core statistical concepts.

General concepts

Wason, P. C. (1960). On the failure to eliminate hypotheses in a conceptual task. Quarterly Journal of Experimental Psychology, 12(3), 129–140. doi:10.1080/17470216008416717

The article behind this NYTimes interactive demonstration; go play with that first.

Students are presented with a sequence of three numbers and told it follows some rule, then asked to deduce what the rule is. They can try new sequences and are told if those sequences follow the rule. Most students, seeing the obvious pattern in the example, test a few further examples of it and then quit. They never test any other hypotheses, for instance by trying a sequence which shouldn’t work according to their rule, to see if they’re right.
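
A minimal sketch of the difference between confirming and falsifying tests. It assumes the setup usually reported for Wason's task: an example such as 2, 4, 6 and a hidden rule of "any increasing sequence". The function names and test sequences are made up for illustration.

```python
def follows_hidden_rule(seq):
    """Hidden rule (assumed here): the three numbers are strictly increasing."""
    a, b, c = seq
    return a < b < c


def follows_my_guess(seq):
    """A typical first guess: each number is two more than the one before."""
    a, b, c = seq
    return b - a == 2 and c - b == 2


# Confirming tests: sequences the guess says should work.
confirming_tests = [(4, 6, 8), (10, 12, 14), (1, 3, 5)]
# Falsifying tests: sequences the guess says should NOT work.
falsifying_tests = [(1, 2, 3), (5, 10, 20), (3, 2, 1)]

for seq in confirming_tests + falsifying_tests:
    print(seq, "| guess says:", follows_my_guess(seq),
          "| experimenter says:", follows_hidden_rule(seq))
```

Every confirming test agrees with the experimenter, so it cannot tell the guess apart from the real rule. Only the sequences the guess forbids can reveal that the guess is too narrow.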

I see this in statistics when scientists, obtaining a significant result in favor of their scientific hypothesis, do not attempt to falsify their hypothesis or consider alternate hypotheses which could explain the result just as well. Maybe this is also a symptom of the law of small numbers: “I have seen this sequence, therefore it must be the rule, and I won’t try hard to disprove it.”

Also related:
Mahoney, M. J., & DeMonbreun, B. G. (1977). Psychology of the scientist: An analysis of problem-solving bias. Cognitive Therapy and Research, 1(3), 229–238. doi:10.1007/bf01186796

Fine academic trolling. Repeated the same test as used by Wason, but on PhD scientists and conservative Protestant ministers. Also, a telling quote about Wason’s experiments (apparently he did many more after the initial paper):

…After having been told that an early hypothesis was wrong, they would often return to it via later confirmatory tests. This conceptual tenacity was sometimes striking. Moreover, subjects sometimes displayed considerable emotional stress and frustration while participating. They occasionally became upset when informed of their errors and one subject apparently exhibited psychotic behavior sufficient to warrant evacuation by ambulance.

Anyway, the ministers didn’t do any worse than the PhDs, beyond taking longer to come up with hypotheses. The PhDs were also more likely to go back to confirm a hypothesis they were already told was wrong, instead of trying a new one.

Oh, and

And finally, when asked not to discuss the project with his co-workers, another psychologist (PS-6) said, “Good grief! I never talk to any of my colleagues.”

Castro Sotos, A. E., Vanhoof, S., Van den Noortgate, W., & Onghena, P. (2007). Students’ misconceptions of statistical inference: A review of the empirical evidence from research on statistics education. Educational Research Review, 2(2), 98–113. doi:10.1016/j.edurev.2007.04.001

Review of misconceptions, including
- the law of small numbers (below), which had further implications, like students believing that sampling distributions should look more like the population as the sample size increases
- lots of confusion about null and alternative hypotheses, and choosing the right null, and whether a hypothesis refers to the population or the sample
- the usual p value misconceptions (below)
- various confidence interval misinterpretations, like thinking graphical comparisons of overlap are valid tests for differences, or thinking a 95% CI means 95% of replication estimates will fall in the interval
- confusion about the central limit theorem, such as assuming that any large sample must itself be approximately normal, and hence being confused when asked to justify a CLT approximation
- the belief that hypothesis tests are deterministic or give absolute proof of hypotheses
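
To see why eyeballing CI overlap is not a valid test for a difference, here is a small worked example. All numbers are made up, and it assumes two independent estimates with normal sampling distributions and known standard errors.

```python
import math


def normal_cdf(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


# Hypothetical estimates and standard errors for two independent groups.
mean_a, se_a = 0.0, 1.0
mean_b, se_b = 3.0, 1.0
z_crit = 1.96  # two-sided 95% critical value

ci_a = (mean_a - z_crit * se_a, mean_a + z_crit * se_a)
ci_b = (mean_b - z_crit * se_b, mean_b + z_crit * se_b)
# With mean_a < mean_b, the intervals overlap if A's upper end passes B's lower end.
intervals_overlap = ci_a[1] > ci_b[0]

# The test for a difference uses the standard error of the difference instead.
se_diff = math.sqrt(se_a ** 2 + se_b ** 2)
z = (mean_b - mean_a) / se_diff
p_value = 2.0 * (1.0 - normal_cdf(abs(z)))

print("CIs overlap:", intervals_overlap)
print("z =", round(z, 2), "p =", round(p_value, 3))
```

The two intervals overlap, yet the test of the difference is significant at the 5% level, because the standard error of a difference is smaller than the sum of the two standard errors.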

Representativeness

Rabin, M. (2002). Inference by believers in the law of small numbers. The Quarterly Journal of Economics, 117(3), 775–816. doi:10.1162/003355302760193896

The “law of small numbers”: people overestimate how representative a small sample is of the population from which it is drawn. This leads to the gambler’s fallacy (if we get three heads in a row, the next must be tails, because we expect the sequence to be balanced), but also means people are more willing to reject the null when confronted with an unusual set of data, because they are overconfident in its representativeness of the population.
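
A quick way to see how unrepresentative small samples can be is to compute, exactly, how often a fair coin gives a share of heads close to one half. The 40% to 60% window and the sample sizes below are arbitrary choices for illustration.

```python
from math import comb


def prob_share_of_heads_between(n, low_pct, high_pct):
    """Exact probability that a fair coin flipped n times gives a share of
    heads between low_pct and high_pct percent (inclusive)."""
    favourable = sum(
        comb(n, k) for k in range(n + 1)
        if low_pct * n <= 100 * k <= high_pct * n
    )
    return favourable / 2 ** n


p_small = prob_share_of_heads_between(10, 40, 60)
p_large = prob_share_of_heads_between(1000, 40, 60)
print("n = 10:  ", p_small)
print("n = 1000:", p_large)
```

With ten flips, roughly a third of samples fall outside the window even though the coin is fair; with a thousand flips, almost none do.
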
Hirsch, L. S., & O’Donnell, A. M. (2001). Representativeness in statistical reasoning: Identifying and assessing misconceptions. Journal of Statistics Education, 9(2). https://ww2.amstat.org/publications/jse/v9n2/hirsch.html

Results of a test of the “representativeness” misconception, embodied by the idea that, flipping a fair coin six times, the sequence H H H H T T is somehow less likely than H T H T H H, because it’s less “representative” of a fair sequence of flips. Multiple choice questions asked students which sequence is least likely, then gave a multiple choice set of reasons for their choice. Also includes an experiment where students predicted probabilities of sequences of draws of marbles, then actually drew marbles, though this did not seem to help.
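
The two sequences can be checked by brute force. This sketch enumerates every possible result of six fair flips; the variable names are my own.

```python
from fractions import Fraction
from itertools import product

# All 2**6 equally likely sequences of six fair coin flips.
sequences = ["".join(flips) for flips in product("HT", repeat=6)]
prob_of_one_sequence = Fraction(1, len(sequences))

for seq in ("HHHHTT", "HTHTHH"):
    print(seq, "has probability", prob_of_one_sequence)

# What differs between outcomes is the number of orderings per head count,
# not the probability of any one specific ordering.
four_heads = [s for s in sequences if s.count("H") == 4]
prob_four_heads_any_order = Fraction(len(four_heads), len(sequences))
print("four heads in any order:", prob_four_heads_any_order)
```

Both sequences contain four heads and both have probability 1/64. The intuition that one is "more random" confuses a specific ordering with the class of outcomes it resembles.
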
Konold, C. (1995). Issues in assessing conceptual understanding in probability and statistics. Journal of Statistics Education, 3(1). http://ww2.amstat.org/publications/jse/v3n1/konold.html

More representativeness results. Interestingly, students choose correctly (that all outcomes are equally likely) when asked which is most likely, but not when asked which is least likely. Konold calls this the “outcome approach”: students interpret probabilities as statements of what will happen, not long-run relative frequencies. None is most likely because all could happen and they can’t pick one in particular. Similarly, “70% chance of rain” means “it will rain” to students, without any conception of calibration.

Pfannkuch, M., & Brown, C. M. (1996). Building on and challenging students’ intuitions about probability: Can we improve undergraduate learning? Journal of Statistics Education, 4(1). https://ww2.amstat.org/publications/jse/v4n1/pfannkuch.html

When presented with classic probability examples, like dice or roulette wheels, students make reasonable probabilistic statements; when asked similar problems with real data (like the distribution of rare birth defects in New Zealand), they immediately look for deterministic explanations instead of probabilistic ones, assuming there must be some explanation for the results. Describes some activities intended to connect the real-world examples to the classic examples, by having students roll dice to simulate birth defect data and see what results are surprising or not.
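
A rough sketch of that kind of simulation, with a random number generator standing in for the dice. The number of regions, births per region, and defect rate are invented for illustration and are not taken from the paper.

```python
import random

random.seed(2)  # fixed seed so the run is repeatable

REGIONS = 20              # hypothetical number of regions
BIRTHS_PER_REGION = 1000  # hypothetical births in each region
DEFECT_RATE = 0.01        # the same underlying rate everywhere


def simulate_region():
    """Count defects in one region when every birth has the same risk."""
    return sum(random.random() < DEFECT_RATE for _ in range(BIRTHS_PER_REGION))


counts = [simulate_region() for _ in range(REGIONS)]
print("defects per region:", counts)
print("lowest:", min(counts), "highest:", max(counts))
```

Every region has exactly the same risk, yet the counts differ from region to region. A region with a high count does not need a deterministic explanation unless it is higher than chance alone would plausibly produce.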

Probability densities

Chaput, J. S., Crack, T. F., & Onishchenko, O. (2021). What quantity appears on the vertical axis of a normal distribution? A student survey. Journal of Statistics and Data Science Education, 1–23. doi:10.1080/26939169.2021.1933658

“Our survey finds that only 27 out of 148 students surveyed (i.e., 18.2%) could label the vertical axis of the normal distribution correctly, and of these, only five students (i.e., 3.4%) could explain their label.” Students don’t have intuition for probability densities (often interpreting them as probabilities or counts) and do not know they have units. I’ve often been interested in the units problem, because I only started being able to remember common probability formulas when I read a paper from David Hogg explaining probability for physicists, including the dimensional analysis of probability. Perhaps we should teach statistics with dimensional analysis.
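
A small sketch of the units point, using a made-up example of heights with mean 1.7 m and standard deviation 0.1 m. The density has units of "per metre" (or "per centimetre"), so its value changes with the unit of measurement and can exceed 1; only the area under the curve is a probability.

```python
import math


def normal_pdf(x, mu, sigma):
    """Normal density; its units are 1 / (units of x)."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))


def area_under_pdf(mu, sigma, steps=10000):
    """Midpoint-rule integral of the density over mu +/- 6 sigma."""
    lo, hi = mu - 6 * sigma, mu + 6 * sigma
    width = (hi - lo) / steps
    return sum(normal_pdf(lo + (i + 0.5) * width, mu, sigma) for i in range(steps)) * width


# Hypothetical heights, expressed in metres and then in centimetres.
peak_per_metre = normal_pdf(1.7, 1.7, 0.1)
peak_per_cm = normal_pdf(170.0, 170.0, 10.0)

print("peak density per metre:", peak_per_metre)   # greater than 1
print("peak density per cm:   ", peak_per_cm)      # 100 times smaller
print("area (metres):", area_under_pdf(1.7, 0.1))
print("area (cm):    ", area_under_pdf(170.0, 10.0))
```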

Sampling distributions

Well, A. D., Pollatsek, A., & Boyce, S. J. (1990). Understanding the effects of sample size on variability of the mean. Organizational Behavior and Human Decision Processes, 47(2), 289–312. doi:10.1016/0749-5978(90)90040-G
Chance, B., DelMas, R., & Garfield, J. (2004). Reasoning about sampling distributions. In D. Ben-Zvi & J. Garfield (Eds.), The challenge of developing statistical literacy, reasoning and thinking (pp. 295–323). Kluwer Academic Publishers.

Discusses the great conceptual difficulties students have with sampling distributions, even if they have computer-based interactive sampling activities and labs to practice with. Goes through several rounds of activities and interviews to try to explore the problems.
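
For reference, this is the kind of sampling activity in question, reduced to a few lines. It is my own sketch, not the software used in the chapter, and it assumes a skewed Exponential(1) population (mean 1, SD 1) and 2000 replications per sample size.

```python
import math
import random
import statistics

random.seed(0)  # fixed seed so the run is repeatable

POPULATION_SD = 1.0  # an Exponential(1) population has mean 1 and SD 1


def sample_means(n, replications=2000):
    """Draw many samples of size n and return the mean of each one."""
    return [
        sum(random.expovariate(1.0) for _ in range(n)) / n
        for _ in range(replications)
    ]


spread_by_n = {}
for n in (5, 50):
    observed_sd = statistics.stdev(sample_means(n))
    predicted_sd = POPULATION_SD / math.sqrt(n)
    spread_by_n[n] = (observed_sd, predicted_sd)
    print("n =", n, "| SD of sample means:", round(observed_sd, 3),
          "| sigma / sqrt(n):", round(predicted_sd, 3))
```

The distribution of sample means gets narrower as n grows, roughly following sigma / sqrt(n); it does not come to resemble the population, which stays as spread out and skewed as ever.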

Inference and p values

Falk, R., & Greenbaum, C. W. (1995). Significance tests die hard: The amazing persistence of a probabilistic misconception. Theory & Psychology, 5(1), 75–98. doi:10.1177/0959354395051004

Argues against the backwards logic of conventional significance testing, then reviews research showing that many scientists do not understand the meaning of alpha (interpreting it as the probability their conclusion is wrong). Argues for several root causes: the phrase “Type I error” sounds unconditional, without the important conditioning on the null; there are no easy mechanical alternatives; and it’s hard to push back against everyone else using hypothesis testing.
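
The gap between alpha and "the probability my conclusion is wrong" is easy to show with Bayes' rule. The prior proportions and power below are invented for illustration; they are not figures from the paper.

```python
def prob_null_true_given_rejection(prior_null, alpha, power):
    """P(null is true | test rejected), by Bayes' rule.

    prior_null: share of tested hypotheses for which the null is true
    alpha:      P(reject | null true)
    power:      P(reject | null false)
    """
    false_positives = prior_null * alpha
    true_positives = (1.0 - prior_null) * power
    return false_positives / (false_positives + true_positives)


# Hypothetical field where 90% of tested nulls are true.
mostly_null = prob_null_true_given_rejection(prior_null=0.9, alpha=0.05, power=0.8)
# Hypothetical field where half of tested nulls are true.
even_odds = prob_null_true_given_rejection(prior_null=0.5, alpha=0.05, power=0.8)

print("P(null true | rejected), 90% nulls true:", round(mostly_null, 3))
print("P(null true | rejected), 50% nulls true:", round(even_odds, 3))
```

Alpha is 0.05 in both cases, but the chance that a rejection is mistaken is 0.36 in the first scenario and about 0.06 in the second. Alpha conditions on the null being true, so by itself it says nothing about how often rejections are wrong.
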
Goodman, S. (2008). A dirty dozen: Twelve p-value misconceptions. Seminars in Hematology, 45(3), 135–140. doi:10.1053/j.seminhematol.2008.04.003
Aquilonius, B. C., & Brenner, M. E. (2015). Students’ reasoning about p-values. Statistics Education Research Journal, 14(2), 7–27. http://iase-web.org/documents/SERJ/SERJ14(2)_Aquilonius.pdf

Asked 16 students to solve p value questions, then recorded their reasoning. None of the students could remember the formal definition of p values; nonetheless, they used their calculators to do tests and got the right answers. They couldn’t explain the meaning of the p values to the interviewer. They could draw bell curves and label the rejection regions, but didn’t know why you reject in that region. Several students claimed their instructors never explained why.

There’s an interesting example where students are asked to test if a coin is fair. The coin came up heads 31 times in 50 flips; some rejected the null, then couldn’t believe a coin would be unfair; others accepted the null, but then couldn’t understand how 62% heads could possibly be fair, since “it has to be 50.” Sampling variation didn’t seem to enter into it.
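
For what it's worth, the sampling variation can be computed directly. This is a sketch of an exact two-sided binomial test, my own and not necessarily the procedure the students used on their calculators.

```python
from math import comb


def binomial_pmf(k, n, p):
    """Probability of exactly k heads in n flips with P(heads) = p."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)


def two_sided_binomial_p(k_observed, n, p=0.5):
    """Sum the probabilities of all outcomes no more likely than the observed one."""
    observed = binomial_pmf(k_observed, n, p)
    return sum(
        binomial_pmf(k, n, p) for k in range(n + 1)
        if binomial_pmf(k, n, p) <= observed * (1 + 1e-9)  # tolerance for float ties
    )


p_value = two_sided_binomial_p(31, 50)
print("two-sided p-value for 31 heads in 50 flips:", round(p_value, 3))
```

The p-value is roughly 0.12, so a fair coin produces a result at least this lopsided about one time in eight; 62% heads in 50 flips is unremarkable.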

The students had been exposed to the usual definitions of p values and hypothesis testing, but it seems none of it was retained whatsoever.

Crooks, N. M., Bartel, A. N., & Alibali, M. W. (2019). Conceptual knowledge of confidence intervals in psychology undergraduate and graduate students. Statistics Education Research Journal, 18(1), 46–62. https://iase-web.org/documents/SERJ/SERJ18(1)_Crooks.pdf

Reviews the common misconceptions about confidence intervals, such as that 95% confidence means that 95% of replications would fall in the CI, or that the CI is the range of the samples. Constructs an assessment from prior research and textbook questions and gives it to 21 undergrads and 19 grad students, all in psychology; half the participants completed the assessment in a think-aloud interview and half did it on paper. (The assessment questions were revised by asking pilot participants for feedback, not through interviews.) Finds overall low performance; graduate students were more likely to hold the replication misconception, and half of all participants treated the CI as fixed with a random parameter possibly falling within it. I’m worried by the small sample size and population (students in specific classes at one institution) and would be interested to see this replicated with questions revised via think-alouds.
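
The replication misconception is easy to check by simulation. This sketch is mine, not from the paper; it assumes a normal population with known standard deviation, so the interval is a plain z-interval, and samples of size 25.

```python
import math
import random

random.seed(1)  # fixed seed so the run is repeatable

MU, SIGMA, N, TRIALS = 0.0, 1.0, 25, 4000  # all values are arbitrary choices
HALF_WIDTH = 1.96 * SIGMA / math.sqrt(N)   # 95% z-interval half-width


def sample_mean():
    """Mean of one sample of size N from the population."""
    return sum(random.gauss(MU, SIGMA) for _ in range(N)) / N


covers_parameter = 0
captures_replication = 0
for _ in range(TRIALS):
    original = sample_mean()
    low, high = original - HALF_WIDTH, original + HALF_WIDTH
    if low <= MU <= high:
        covers_parameter += 1
    # An independent replication of the same study.
    if low <= sample_mean() <= high:
        captures_replication += 1

coverage_rate = covers_parameter / TRIALS
capture_rate = captures_replication / TRIALS
print("CIs containing the true mean:       ", coverage_rate)
print("CIs containing the replication mean:", capture_rate)
```

About 95% of intervals contain the true mean, but only roughly 83% contain the mean of a replication, because the replication mean has its own sampling error on top of the original's.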

Graphics

Cooper, L. L., & Shore, F. S. (2008). Students’ misconceptions in interpreting center and variability of data represented via histograms and stem-and-leaf plots. Journal of Statistics Education, 16(2). https://ww2.amstat.org/publications/jse/V16n2/cooper.pdf

A few questions and interviews about histograms and stem-and-leaf plots; they note (though the question was not directly designed to catch this misconception, and the interpretation is based on a small set of interviews):

The more troubling finding is that 50% of the students judged variability by focusing on the varying heights of the bars, implying variability in frequencies, rather than data values.

Many also “incorrectly interpreted the median to be the middle of the horizontal axis” on a histogram.

Kaplan, J. J., Gabrosek, J. G., Curtiss, P., & Malone, C. (2014). Investigating student understanding of histograms. Journal of Statistics Education, 22(2). http://www.amstat.org/publications/jse/v22n2/kaplan.pdf

Notes four misconceptions previously observed:
1. Students do not understand the distinction between a bar chart and a histogram, and why this distinction is important.
2. Students use the frequency (y-axis) instead of the data values (x-axis) when reporting on the center of the distribution and the modal group of values.
3. Students believe that a flatter histogram equates to less variability in the data.
4. For data that has an implied (though unobserved) time component, students read the histogram as a time plot believing (incorrectly) that values on the left side of the graph took place earlier in time.

Reports on a pre/post test used to test these misconceptions in a college intro course, finding that these misconceptions are highly persistent through the course: “nearly 25% of the students still chose a bumpier histogram as having high variability at the post-test”, after 50% did on the pre-test.
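
The flat-versus-bumpy confusion (misconception 3) can be made concrete with two tiny made-up datasets on a 1 to 5 scale. The data are invented for illustration, not taken from the paper.

```python
from collections import Counter
from statistics import pstdev

# Hypothetical data: 20 observations each, on a 1-5 scale.
flat_data = [1] * 4 + [2] * 4 + [3] * 4 + [4] * 4 + [5] * 4
peaked_data = [1] * 1 + [2] * 3 + [3] * 12 + [4] * 3 + [5] * 1


def bar_heights(data):
    """Histogram bar heights: how many observations take each value 1..5."""
    counts = Counter(data)
    return [counts[value] for value in range(1, 6)]


for name, data in (("flat", flat_data), ("peaked", peaked_data)):
    print(name,
          "| bar heights:", bar_heights(data),
          "| SD of data values:", round(pstdev(data), 3),
          "| SD of bar heights:", round(pstdev(bar_heights(data)), 3))
```

The flat histogram has bars of identical height but the larger standard deviation of data values; the bumpy one has very unequal bars but data tightly clustered around 3. Judging variability by the bar heights gets the comparison backwards.
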
Cooper, L. L. (2018). Assessing students’ understanding of variability in graphical representations that share the common attribute of bars. Journal of Statistics Education, 26(2), 110–124. doi:10.1080/10691898.2018.1473060

More misunderstanding of variability in bar charts and histograms, in a pre-test given to students before they learn formal measures of variation. Also explores variation in categorical data: in a bar chart of categorical data, high variation means bars of relatively equal height, indicating individual cases are least alike. Students correctly note this, but attempt to apply the same reasoning to histograms and to value bar charts (e.g. a bar chart of rainfall by month, which could just as well be a line graph). Advocates for a formal definition of variability in categorical data, the “coefficient of unalikeability”, and for clearly describing different ideas of variability in class: variation from the mean vs. unalikeability.
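
A sketch of how such a coefficient might be computed. The formula is an assumption on my part, since it is not spelled out above: a common formulation is one minus the sum of squared category proportions, which is the chance that two observations drawn with replacement fall in different categories. The example data are made up.

```python
from collections import Counter


def unalikeability(values):
    """Assumed formula: 1 - sum of squared category proportions."""
    n = len(values)
    counts = Counter(values)
    return 1 - sum((count / n) ** 2 for count in counts.values())


all_same = ["a"] * 8                # one tall bar: no variation
evenly_split = ["a", "b", "c", "d"] * 2  # four equal bars: most variation

print("all in one category:", unalikeability(all_same))
print("evenly split:       ", unalikeability(evenly_split))
```

Under this definition, equal-height bars in a categorical bar chart give the largest value and a single dominant bar gives the smallest, which is the opposite of how bar heights should be read in a histogram of numeric data.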