In corpus linguistics, keyness describes how key a word is in a corpus. (Hey, at least they’re better at naming things than statisticians are.) A key word is a word that is particularly prominent in the corpus compared to some reference corpus. For example, “error” might be a key word in statistical texts when compared to, say, children’s novels.
Keyness is typically evaluated on the basis of word counts and proportions. The most common measure is “log-likelihood”, which is really just the test statistic arising from a likelihood ratio test. We have a target corpus and a reference corpus, and count the occurrences of the target word in each. Assume $n_\text{target} \sim \text{Binomial}(n_T, p_T)$ and $n_\text{reference} \sim \text{Binomial}(n_R, p_R)$, where $n_T$ and $n_R$ are the total token counts of the two corpora, and test $H_0: p_T = p_R$.
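As a rough illustration of the test just described, here is a minimal Python sketch of the likelihood ratio statistic for two binomial counts. The function names and the example counts are made up for illustration, and the p-value relies on the usual large-sample chi-squared approximation with one degree of freedom, which is an assumption and may be poor for very rare words.

```python
import math


def binomial_loglik(k, n, p):
    """Binomial log-likelihood of k hits in n tokens at rate p.

    The constant binomial coefficient is dropped (it cancels in the
    ratio), and 0 * log(0) is treated as 0.
    """
    ll = 0.0
    if k > 0:
        ll += k * math.log(p)
    if n - k > 0:
        ll += (n - k) * math.log(1 - p)
    return ll


def log_likelihood_keyness(k_target, n_target, k_reference, n_reference):
    """Return (G2, p_value) for H0: p_T = p_R (hypothetical helper)."""
    p_t = k_target / n_target
    p_r = k_reference / n_reference
    # Under H0 both corpora share one rate, estimated from the pooled counts.
    p_pooled = (k_target + k_reference) / (n_target + n_reference)

    ll_alt = (binomial_loglik(k_target, n_target, p_t)
              + binomial_loglik(k_reference, n_reference, p_r))
    ll_null = (binomial_loglik(k_target, n_target, p_pooled)
               + binomial_loglik(k_reference, n_reference, p_pooled))

    # Guard against tiny negative values from floating-point rounding.
    g2 = max(0.0, 2 * (ll_alt - ll_null))
    # Upper tail of a chi-squared distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(g2 / 2))
    return g2, p_value


# Made-up counts: 100 hits in a 10,000-token target corpus versus
# 50 hits in a 10,000-token reference corpus.
g2, p_value = log_likelihood_keyness(100, 10_000, 50, 10_000)
print(f"G2 = {g2:.2f}, p = {p_value:.2g}")
```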
This has several drawbacks:
So naturally there are proposals for various ways to improve keyness.
Egbert, J., & Biber, D. (2019). Incorporating text dispersion into keyword analyses. Corpora, 14(1), 77–104. doi:10.3366/cor.2019.0162
Proposes a method based on comparing the number of texts the word occurs in, rather than the counts of its occurrences. That is, compares the fraction of texts containing the word in the target corpus to the fraction containing it in the reference corpus. This ignores the sizes of the documents or the occurrence rates, which seems suspicious, but at least ensures key words occur in a lot of documents.
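To make the idea concrete, here is a small Python sketch that computes only the two fractions being compared. The helper name `text_dispersion` and the toy corpora are invented for illustration; this is not the authors' implementation, and how the two fractions are then turned into a keyness score is left out.

```python
def text_dispersion(word, texts):
    """Fraction of texts that contain `word` at least once.

    `texts` is a list of tokenized texts (each a list of tokens).
    Text length and within-text frequency are deliberately ignored.
    """
    if not texts:
        raise ValueError("corpus must contain at least one text")
    containing = sum(1 for tokens in texts if word in tokens)
    return containing / len(texts)


# Toy corpora, made up for illustration.
target_texts = [
    ["the", "error", "is", "large"],
    ["error", "bars", "overlap"],
    ["no", "problem", "here"],
]
reference_texts = [
    ["the", "cat", "sat"],
    ["a", "dog", "made", "an", "error"],
    ["the", "end"],
    ["once", "upon", "a", "time"],
]

target_fraction = text_dispersion("error", target_texts)
reference_fraction = text_dispersion("error", reference_texts)
print(target_fraction, reference_fraction)
```

Note that a word repeated a hundred times in one text counts the same as a word used once, which is exactly the property questioned above.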
Gries, S. T. (2025). Not just frequency: Keyness should integrate frequency, association, and dispersion. In A. Pawłowski, S. Embleton, J. Mačutek, & A. Xanthos (Eds.), Mathematical modeling in linguistics and text analysis: Theory and applications. John Benjamins Publishing Company. doi:10.1075/cilt.370.02gri
Argues that keyness should be a combination of frequency, association, and dispersion. Proposes some measures that, basically, smash measures of these three together in different combinations.