See also Student assessment.

An important part of teaching statistics is understanding the misconceptions students typically hold about key statistical concepts.

Wason, P. C. (1960). On the failure to eliminate hypotheses in a conceptual task. *Quarterly Journal of Experimental Psychology*, *12*(3), 129–140. doi:10.1080/17470216008416717

The article behind this NYTimes interactive demonstration; go play with that first.

Students are presented with a sequence of three numbers and told it follows some rule, then asked to deduce what the rule is. They can try new sequences and are told if those sequences follow the rule. Most students, seeing the obvious pattern in the example, test a few further examples of it, then quit, without testing any other hypotheses – testing a sequence which *shouldn’t* work, according to their rule, to see if they’re right.

I see this in statistics when scientists, obtaining a significant result in favor of their scientific hypothesis, do not attempt to falsify their hypothesis or consider alternate hypotheses which could explain the result just as well. Maybe this is also a symptom of the law of small numbers: “I have seen this sequence, therefore it must be the rule, and I won’t try hard to disprove it.”
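The logic of the task can be sketched in a few lines of Python. The hidden rule used here (any strictly increasing triple) is the one usually reported for Wason's 2-4-6 task; the student's hypothesis and the test triples are illustrative assumptions, not data from the paper.

```python
def hidden_rule(seq):
    """Experimenter's rule (assumed): any strictly increasing triple."""
    a, b, c = seq
    return a < b < c

def student_hypothesis(seq):
    """Hypothetical student guess: each number is two more than the last."""
    a, b, c = seq
    return b - a == 2 and c - b == 2

def informative(seq):
    """A test can only separate the two rules if they disagree about it."""
    return hidden_rule(seq) != student_hypothesis(seq)

# Triples the student expects to pass (confirmatory tests).
confirming_tests = [(4, 6, 8), (10, 12, 14), (1, 3, 5)]
# Triples the student expects to fail (attempts at falsification).
predicted_to_fail = [(1, 2, 10), (3, 2, 1)]

for seq in confirming_tests + predicted_to_fail:
    print(seq, "follows rule:", hidden_rule(seq),
          "| informative:", informative(seq))
```

Every confirmatory test gets a "yes" under both rules, so it cannot tell them apart. Only a triple the student expects to fail, such as `(1, 2, 10)`, can come back "yes" and expose the wrong hypothesis.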

Also related:

Mahoney, M. J., & DeMonbreun, B. G. (1977). Psychology of the scientist: An analysis of problem-solving bias. *Cognitive Therapy and Research*, *1*(3), 229–238. doi:10.1007/bf01186796

Fine academic trolling. Repeated the same test as used by Wason, but on PhD scientists and conservative Protestant ministers. Also, a telling quote about Wason’s experiments (apparently he did many more after the initial paper):

…After having been told that an early hypothesis was wrong, they would often return to it via later confirmatory tests. This conceptual tenacity was sometimes striking. Moreover, subjects sometimes displayed considerable emotional stress and frustration while participating. They occasionally became upset when informed of their errors and one subject apparently exhibited psychotic behavior sufficient to warrant evacuation by ambulance.

Anyway, the ministers didn’t do any worse than the PhDs, beyond taking longer to come up with hypotheses. The PhDs were also more likely to go back to confirm a hypothesis they were already told was wrong, instead of trying a new one.

Oh, and

And finally, when asked not to discuss the project with his co-workers, another psychologist (PS-6) said, “Good grief! I never talk to any of my colleagues.”

Castro Sotos, A. E., Vanhoof, S., Van den Noortgate, W., & Onghena, P. (2007). Students’ misconceptions of statistical inference: A review of the empirical evidence from research on statistics education. *Educational Research Review*, *2*(2), 98–113. doi:10.1016/j.edurev.2007.04.001

Review of misconceptions, including:

- the law of small numbers (below), which had further implications, like students believing that sampling distributions should look more like the population as the sample size increases
- lots of confusion about null and alternative hypotheses, and choosing the right null, and whether a hypothesis refers to the population or the sample
- the usual p value misconceptions (below)
- various confidence interval misinterpretations, like thinking graphical comparisons of overlap are valid tests for differences, or thinking a 95% CI means 95% of replications will fall in the interval
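The last misinterpretation in the list can be checked by simulation. The sketch below assumes normally distributed data with a known standard deviation (so a z interval applies) and arbitrary values for the mean, sample size, and number of repetitions. It estimates how often an interval covers the true mean, and how often the mean of an independent replication lands inside the original interval.

```python
import random
import statistics

def simulate(n=25, mu=0.0, sigma=1.0, reps=10_000, seed=1):
    """Return (coverage of true mean, capture rate of replication means)."""
    rng = random.Random(seed)
    half_width = 1.96 * sigma / n ** 0.5  # z interval; sigma assumed known
    covers_mu = 0
    captures_replication = 0
    for _ in range(reps):
        original = statistics.fmean(rng.gauss(mu, sigma) for _ in range(n))
        replication = statistics.fmean(rng.gauss(mu, sigma) for _ in range(n))
        covers_mu += abs(original - mu) <= half_width
        captures_replication += abs(replication - original) <= half_width
    return covers_mu / reps, captures_replication / reps

coverage, capture = simulate()
print(f"interval covers the true mean:      {coverage:.3f}")
print(f"interval captures replication mean: {capture:.3f}")
```

Under these assumptions the first proportion should sit near 0.95, while the second is noticeably lower (about 0.83 in theory, because the difference between two independent sample means has variance 2σ²/n).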

Rabin, M. (2002). Inference by believers in the law of small numbers. *The Quarterly Journal of Economics*, *117*(3), 775–816. doi:10.1162/003355302760193896

The “law of small numbers”: people overestimate how representative a small sample is of the population from which it is drawn. This leads to the gambler’s fallacy (if we get three heads in a row, the next must be tails, because I expect the sequence to be balanced), but also means people are more willing to reject the null when confronted with an unusual set of data, because they are overconfident in its representativeness of the population.
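The gambler's fallacy is easy to test empirically. The following is a minimal sketch, not anything from Rabin's paper: it simulates a long run of fair coin flips (the run length, streak length, and seed are arbitrary choices) and looks at the flip immediately following every streak of three heads.

```python
import random

def heads_after_streak(n_flips=200_000, streak=3, seed=0):
    """Share of heads among flips that directly follow `streak` heads in a row."""
    rng = random.Random(seed)
    flips = [rng.random() < 0.5 for _ in range(n_flips)]  # True = heads
    following = [
        flips[i]
        for i in range(streak, n_flips)
        if all(flips[i - streak:i])  # the previous `streak` flips were all heads
    ]
    return sum(following) / len(following), len(following)

rate, count = heads_after_streak()
print(f"{count} streaks of three heads; "
      f"the next flip was heads {rate:.3f} of the time")
```

If the coin "owed" us a tail, the reported share would fall well below one half; for independent flips it should stay close to 0.5.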

Hirsch, L. S., & O’Donnell, A. M. (2001). Representativeness in statistical reasoning: Identifying and assessing misconceptions. *Journal of Statistics Education*, *9*(2). https://ww2.amstat.org/publications/jse/v9n2/hirsch.html

Results of a test of the “representativeness” misconception, embodied by the idea that, flipping a fair coin six times, the sequence `H H H H T T` is somehow less likely than `H T H T H H`, because it’s less “representative” of a fair sequence of flips. Multiple choice questions asked students which sequence is least likely, then gave a multiple choice set of reasons for their choice. Also includes an experiment where students predicted probabilities of sequences of draws of marbles, then actually drew marbles, though this did not seem to help.

Konold, C. (1995). Issues in assessing conceptual understanding in probability and statistics. *Journal of Statistics Education*, *3*(1). http://ww2.amstat.org/publications/jse/v3n1/konold.html

More representativeness results. Interestingly, students choose correctly (that all outcomes are equally likely) when asked which is *most* likely, but not when asked which is *least* likely. Konold calls this the “outcome approach”: students interpret probabilities as statements of what will happen, not long-run relative frequencies. None is most likely because all *could* happen and they can’t pick one in particular. Similarly, “70% chance of rain” means “it will rain” to students, without any conception of calibration.

Pfannkuch, M., & Brown, C. M. (1996). Building on and challenging students’ intuitions about probability: Can we improve undergraduate learning? *Journal of Statistics Education*, *4*(1). https://ww2.amstat.org/publications/jse/v4n1/pfannkuch.html

When presented with classic probability examples, like dice or roulette wheels, students make reasonable probabilistic statements; when asked similar problems with real data (like the distribution of rare birth defects in New Zealand), they immediately look for deterministic explanations instead of probabilistic ones, assuming there must be some explanation for the results. Describes some activities intended to connect the real-world examples to the classic examples, by having students roll dice to simulate birth defect data and see what results are surprising or not.
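A computer version of the dice activity might look like the sketch below. The number of regions, births per region, and the event rate are invented for illustration and are not the New Zealand data; every region is given the same true rate, so any differences between regions are pure chance.

```python
import random

def simulate_region_counts(n_regions=12, births=5000, rate=0.002, seed=42):
    """Counts of a rare event in regions that all share the same true rate."""
    rng = random.Random(seed)
    return [
        sum(rng.random() < rate for _ in range(births))
        for _ in range(n_regions)
    ]

births, rate = 5000, 0.002  # hypothetical values
counts = simulate_region_counts(births=births, rate=rate)
print("cases per region:", counts)
print("lowest:", min(counts), "| highest:", max(counts),
      "| expected per region:", births * rate)
```

Rerunning with different seeds gives students a feel for how large a gap between the "best" and "worst" region can arise with no deterministic cause at all.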

Well, A. D., Pollatsek, A., & Boyce, S. J. (1990). Understanding the effects of sample size on variability of the mean. *Organizational Behavior and Human Decision Processes*, *47*(2), 289–312. doi:10.1016/0749-5978(90)90040-G

Falk, R., & Greenbaum, C. W. (1995). Significance tests die hard: The amazing persistence of a probabilistic misconception. *Theory & Psychology*, *5*(1), 75–98. doi:10.1177/0959354395051004

Argues against the backwards logic of conventional significance testing, then reviews research showing that many scientists do not understand the meaning of alpha (interpreting it as the probability their conclusion is wrong). Argues for several root causes: the phrase “Type I error” sounds unconditional, without the important conditioning on the null; there are no easy mechanical alternatives; and it’s hard to push back against everyone else using hypothesis testing.
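The gap between alpha and "the probability my conclusion is wrong" can be made concrete with Bayes' rule. In the sketch below, the share of tested hypotheses for which the null is actually true and the power of the test are made-up values chosen only to show how much the answer moves; neither comes from the paper.

```python
def prob_null_given_rejection(alpha, power, prior_null):
    """P(H0 true | test rejected), by Bayes' rule."""
    false_rejections = alpha * prior_null        # P(reject and H0 true)
    true_rejections = power * (1 - prior_null)   # P(reject and H0 false)
    return false_rejections / (false_rejections + true_rejections)

alpha = 0.05
for prior_null in (0.5, 0.8, 0.95):   # hypothetical shares of true nulls
    for power in (0.8, 0.3):          # hypothetical power values
        p = prob_null_given_rejection(alpha, power, prior_null)
        print(f"prior P(H0)={prior_null:.2f}, power={power:.1f}: "
              f"P(H0 | reject)={p:.3f}")
```

Alpha is fixed at 0.05 throughout, yet the probability that a rejection is mistaken ranges from under 6% to over 75% across these scenarios, which is why alpha only makes sense as a statement conditional on the null.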

Goodman, S. (2008). A dirty dozen: Twelve p-value misconceptions. *Seminars in Hematology*, *45*(3), 135–140. doi:10.1053/j.seminhematol.2008.04.003

Aquilonius, B. C., & Brenner, M. E. (2015). Students’ reasoning about p-values. *Statistics Education Research Journal*, *14*(2), 7–27. http://iase-web.org/documents/SERJ/SERJ14(2)_Aquilonius.pdf

Asked 16 students to solve p value questions, then recorded their reasoning. None of the students could remember the formal definition of p values; nonetheless, they used their calculators to do tests and got the right answers. They couldn’t explain the meaning of the p values to the interviewer. They could draw bell curves and label the rejection regions, but didn’t know *why* you reject in that region. Several students claimed their instructors never explained why.

There’s an interesting example where students are asked to test if a coin is fair. It got 31 heads out of 50; some rejected the null, then couldn’t believe a coin would be unfair; others accepted the null, but then couldn’t understand how 62% heads could possibly be fair, since “it has to be 50.” Sampling variation didn’t seem to enter into it.

The students had been exposed to the usual definitions of p values and hypothesis testing, but it seems none of it was retained whatsoever.
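For the coin example, the sampling variation the students overlooked can be computed exactly. This is a minimal sketch of an exact two-sided binomial test using only the standard library; the paper does not say which test the students ran on their calculators, so the choice of test here is an assumption.

```python
from math import comb

def binomial_two_sided_p(heads, flips, p=0.5):
    """Exact two-sided p-value: total probability of all outcomes
    that are no more likely than the observed one under the null."""
    pmf = [comb(flips, k) * p**k * (1 - p)**(flips - k)
           for k in range(flips + 1)]
    observed = pmf[heads]
    # Small tolerance so ties are not lost to floating-point rounding.
    return sum(prob for prob in pmf if prob <= observed * (1 + 1e-9))

p_value = binomial_two_sided_p(31, 50)
print(f"two-sided p-value for 31 heads in 50 flips: {p_value:.3f}")
```

The p-value comes out at roughly 0.12, so 62% heads in 50 flips is unremarkable for a fair coin at the usual 0.05 level: a fair coin does not "have to be 50."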

Cooper, L. L., & Shore, F. S. (2008). Students’ misconceptions in interpreting center and variability of data represented via histograms and stem-and-leaf plots. *Journal of Statistics Education*, *16*(2). https://ww2.amstat.org/publications/jse/V16n2/cooper.pdf

A few questions and interviews about histograms and stem-and-leaf plots; they note (though the question was not directly designed to catch this misconception, and the interpretation is based on a small set of interviews):

The more troubling finding is that 50% of the students judged variability by focusing on the varying heights of the bars, implying variability in frequencies, rather than data values.

Many also “incorrectly interpreted the median to be the middle of the horizontal axis” on a histogram.

Kaplan, J. J., Gabrosek, J. G., Curtiss, P., & Malone, C. (2014). Investigating student understanding of histograms. *Journal of Statistics Education*, *22*(2). http://www.amstat.org/publications/jse/v22n2/kaplan.pdf

Notes four misconceptions previously observed:

- Students do not understand the distinction between a bar chart and a histogram, and why this distinction is important.
- Students use the frequency (y-axis) instead of the data values (x-axis) when reporting on the center of the distribution and the modal group of values.
- Students believe that a flatter histogram equates to less variability in the data.
- For data that has an implied (though unobserved) time component, students read the histogram as a time plot believing (incorrectly) that values on the left side of the graph took place earlier in time.

Reports on a pre/post test used to test these misconceptions in a college intro course, finding that these misconceptions are highly persistent through the course: “nearly 25% of the students still chose a bumpier histogram as having high variability at the post-test”, after 50% did on the pre-test.
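The bar-height confusion reported in both histogram papers can be illustrated with two small made-up data sets on the same 1 to 9 scale (neither comes from the studies). One produces a perfectly flat histogram, the other a tall central peak; the sketch compares the spread of the bar heights with the spread of the data values.

```python
import statistics

# Hypothetical data sets, 18 observations each.
flat = [v for v in range(1, 10) for _ in range(2)]   # every value twice
peaked = [3] + [4] * 3 + [5] * 10 + [6] * 3 + [7]    # tall middle bar

def bar_heights(data):
    """Frequency of each value 1..9, i.e. the histogram's bar heights."""
    return [data.count(v) for v in range(1, 10)]

for name, data in (("flat", flat), ("peaked", peaked)):
    heights = bar_heights(data)
    print(f"{name:>6}: bar heights {heights}",
          f"| SD of bar heights {statistics.pstdev(heights):.2f}",
          f"| SD of data values {statistics.pstdev(data):.2f}")
```

The flat histogram has bars of identical height but the larger standard deviation of the data, while the "bumpy" one has wildly varying bars and tightly clustered values. Judging variability from bar heights gets the comparison backwards.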