Keyness (corpus linguistics)

Alex Reinhart – Updated December 19, 2025 notebooks · refsmmat.com

In corpus linguistics, keyness describes how key a word is in a corpus. (Hey, at least they’re better at naming things than statisticians are.) A key word is a word that is particularly prominent in the corpus compared to some reference corpus. For example, “error” might be a key word in statistical texts when compared to, say, children’s novels.

Keyness is typically evaluated on the basis of word counts and proportions. The most common measure is “log-likelihood”, which is really just the test statistic arising from a likelihood ratio test. Given a target corpus and a reference corpus, we count the occurrences of the target word in each. Assume n_\text{target} \sim \text{Binomial}(n_T, p_T) and n_\text{reference} \sim \text{Binomial}(n_R, p_R), where n_T and n_R are the total numbers of words in the target and reference corpora, and test H_0: p_T = p_R.
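To make the test concrete, here is a minimal sketch of the likelihood ratio statistic for the two-binomial setup above. The function name, argument names, and the example counts are all made up for illustration. Note that this is the full two-binomial statistic, which includes the "word did not occur" terms; some keyness tools may compute a slightly different variant.

```python
import math


def log_likelihood_keyness(k_target, n_target, k_reference, n_reference):
    """Likelihood ratio statistic (G^2) for H_0: p_T = p_R.

    k_* is the count of the word in a corpus; n_* is the corpus size in words.
    """

    def binom_loglik(k, n, p):
        # Binomial log-likelihood without the binomial coefficient, which
        # cancels in the ratio. Terms with a zero count contribute nothing,
        # which also avoids evaluating log(0).
        ll = 0.0
        if k > 0:
            ll += k * math.log(p)
        if n - k > 0:
            ll += (n - k) * math.log(1 - p)
        return ll

    # Maximum likelihood estimates under the alternative (separate rates)
    # and under the null (one pooled rate).
    p_target = k_target / n_target
    p_reference = k_reference / n_reference
    p_pooled = (k_target + k_reference) / (n_target + n_reference)

    loglik_alternative = (binom_loglik(k_target, n_target, p_target)
                          + binom_loglik(k_reference, n_reference, p_reference))
    loglik_null = (binom_loglik(k_target, n_target, p_pooled)
                   + binom_loglik(k_reference, n_reference, p_pooled))
    return 2 * (loglik_alternative - loglik_null)


# Hypothetical counts: "error" appears 120 times in a 50,000-word target
# corpus and 15 times in a 200,000-word reference corpus.
g2 = log_likelihood_keyness(120, 50_000, 15, 200_000)
print(f"G^2 = {g2:.1f}")
```

The statistic is zero when the two observed proportions are equal and grows as they diverge. It is also symmetric in the two corpora, so it does not say which corpus uses the word more; for that you have to compare the proportions directly.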

This has several drawbacks:

  1. Ordering by the test statistic means small differences in frequent words rank just as high as large differences in rare words.
  2. Nobody would ever expect H_0 to be true, so the p-values aren’t useful; but nobody seems to use them anyway.
  3. A word might be key because one document in the target corpus uses it a lot, not because it’s common across the corpus. (See also Dispersion.)
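
The first drawback is easy to see numerically. The sketch below uses made-up counts for two hypothetical words in two corpora of one million words each, and a compact version of the two-binomial likelihood ratio statistic (the helper name g2 is my own, not from any library).

```python
import math


def g2(k_t, n_t, k_r, n_r):
    """Likelihood ratio statistic for H_0: p_T = p_R (two binomials)."""
    def loglik(k, n, p):
        # Skip zero-count terms so we never evaluate log(0).
        return ((k * math.log(p) if k else 0.0)
                + ((n - k) * math.log(1 - p) if n - k else 0.0))

    p_0 = (k_t + k_r) / (n_t + n_r)  # pooled rate under H_0
    return 2 * (loglik(k_t, n_t, k_t / n_t) + loglik(k_r, n_r, k_r / n_r)
                - loglik(k_t, n_t, p_0) - loglik(k_r, n_r, p_0))


n = 1_000_000  # hypothetical size of each corpus, in words

# (count in target, count in reference); both words are made up.
words = {
    "frequent word, 10% more common in target": (11_000, 10_000),
    "rare word, 6x more common in target": (30, 5),
}

scores = {label: g2(k_t, n, k_r, n) for label, (k_t, k_r) in words.items()}

# Print the words ranked by the statistic, largest first.
for label, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(f"{score:8.1f}  {label}")
```

With these counts the frequent word gets the larger statistic, even though its relative difference is far smaller: the statistic mixes the size of the difference with the amount of evidence for it.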

So naturally there are proposals for various ways to improve keyness.