In corpus linguistics, dispersion describes how a word’s usage is spread across the documents of a corpus. An evenly dispersed word is used at a similar rate in every document; an unevenly dispersed word is used often in some documents and rarely in others.
(Annoyingly, this is the opposite of how statisticians would use the word “dispersion”. To a statistician, ordinary dispersion would mean the rate varies about as much as you’d expect from, say, a binomial; overdispersion would mean the word’s usage rate varies a lot from document to document. Terminology, alas, is not evenly dispersed across fields.)
In what follows, let d be the number of documents in the corpus, each containing n_i words, for a total of N = \sum_i n_i words in the corpus. The target word (whose dispersion we are measuring) occurs y_i times in document i. Its absolute frequency is \text{AF} = \sum_i y_i and its relative frequency is \text{RF} = \text{AF}/N.
Gries, S. T. (2008). Dispersions and adjusted frequencies in corpora. International Journal of Corpus Linguistics, 13(4), 403–437. doi:10.1075/ijcl.13.4.02gri
Proposes the Deviation of Proportions (DP) metric, which is \text{DP} = \frac{1}{2} \sum_{i=1}^d \left| \frac{y_i}{\text{AF}} - \frac{n_i}{N} \right|.
This compares the fraction of the word’s occurrences that fall in each document (y_i / \text{AF}) to the fraction of all text that is in that document (n_i / N). It is 0 when the word is perfectly evenly distributed and approaches 1 when it is maximally uneven (all occurrences concentrated in a single small document). However, it is correlated with relative frequency: DP is almost always close to 0 for words with high relative frequency, regardless of how they are distributed.
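To make DP concrete, here is a minimal Python sketch (the toy corpus and counts are invented for illustration):

```python
import numpy as np

def dp(y, n):
    """Gries's Deviation of Proportions: 0 = perfectly even, near 1 = maximally uneven."""
    y, n = np.asarray(y, float), np.asarray(n, float)
    return 0.5 * np.abs(y / y.sum() - n / n.sum()).sum()

n = np.array([1000, 2000, 3000])   # document sizes n_i
print(dp([10, 20, 30], n))         # 0.0: counts proportional to document sizes
print(dp([60, 0, 0], n))           # 0.833...: every occurrence in the smallest document
```

The second call hits the maximum for this corpus, 1 - n_1/N = 5/6, which is why DP only approaches 1.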
Burch, B., Egbert, J., & Biber, D. (2016). Measuring and interpreting lexical dispersion in corpus linguistics. Journal of Research Design and Statistics in Linguistics and Communication Science, 3(2), 189–216. doi:10.1558/jrds.33066
Proposes a dispersion measure based on pairwise comparisons between all documents in a corpus, which appears to be equivalent to the Gini coefficient of the relative frequencies y_i / n_i across documents. The Gini coefficient would be G = \frac{\sum_{i=1}^d \sum_{j=1}^d \left| \frac{y_i}{n_i} - \frac{y_j}{n_j} \right|}{2 d \sum_{i=1}^d \frac{y_i}{n_i}}, half the relative mean absolute difference between paired documents. To simplify, let r_i = \frac{y_i / n_i}{\sum_{j=1}^d y_j / n_j} be the normalized proportion for document i, so the proportions sum to 1. Then let D_A = 1 - \frac{\sum_{i=1}^{d-1} \sum_{j=i+1}^d \left|r_i - r_j\right|}{d-1}, the Gini flipped so that 1 is maximum equality (even dispersion). (The factor of 2 goes away because the limits of the sums now cover only one triangle of the distance matrix. I do not see why the denominator is d - 1 instead of d, but Sönning’s tlda package offers a version with d, so perhaps d - 1 is a mistake.)
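Here is a direct O(d^2) sketch of D_A as defined above, keeping the published d - 1 denominator (same invented toy data):

```python
import numpy as np

def d_a_pairwise(y, n):
    """Burch et al.'s D_A from pairwise differences of normalized proportions:
    1 = perfectly even, 0 = all occurrences in one document."""
    y, n = np.asarray(y, float), np.asarray(n, float)
    r = (y / n) / (y / n).sum()    # normalized proportions r_i, summing to 1
    d = len(r)
    triangle = sum(abs(r[i] - r[j]) for i in range(d - 1) for j in range(i + 1, d))
    return 1 - triangle / (d - 1)

n = np.array([1000, 2000, 3000])
print(d_a_pairwise([10, 20, 30], n))  # 1.0: identical rates in every document
print(d_a_pairwise([60, 0, 0], n))    # 0.0: maximally concentrated
```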
Calculating this is O(d^2), though it can be done in O(d) after an O(d \log d) sort of the relative proportions. If r_{(1)} \ge r_{(2)} \ge \dots \ge r_{(d)} are the proportions in descending order, then \sum_{i<j} \left|r_{(i)} - r_{(j)}\right| = (d+1) - 2 \sum_{i=1}^d i \, r_{(i)} (each r_{(i)} appears in d - i pairs with a positive sign and i - 1 pairs with a negative sign, and the proportions sum to 1), so D_A = \frac{2 \left( \sum_{i=1}^d i \, r_{(i)} - 1 \right)}{d - 1}. (Wilcox (1973) gives the formula with the factor of 2 in a different place, which does not seem to give an equivalent answer.)
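And a sketch of the sort-based shortcut, with a brute-force check that it agrees with the pairwise version:

```python
import numpy as np

def d_a_sorted(y, n):
    """D_A via the sorted-proportions identity: O(d log d) instead of O(d^2)."""
    y, n = np.asarray(y, float), np.asarray(n, float)
    r = np.sort((y / n) / (y / n).sum())[::-1]   # descending: r_(1) >= ... >= r_(d)
    ranks = np.arange(1, len(r) + 1)
    return 2 * ((ranks * r).sum() - 1) / (len(r) - 1)

rng = np.random.default_rng(0)
n = rng.integers(500, 5000, size=30)   # random document sizes
y = rng.integers(0, 50, size=30)       # random word counts
r = (y / n) / (y / n).sum()
triangle = sum(abs(r[i] - r[j]) for i in range(29) for j in range(i + 1, 30))
assert np.isclose(d_a_sorted(y, n), 1 - triangle / 29)
```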
Gries, S. T. (2021). A new approach to (key) keywords analysis: Using frequency, and now also dispersion. Research in Corpus Linguistics, 9(2), 1–33. doi:10.32714/ricl.09.02.02
Proposes a dispersion measure based on Kullback–Leibler divergence: D_\mathrm{KL} = \sum_{i=1}^d \frac{y_i}{\text{AF}} \log_2 \left(\frac{y_i / \text{AF}}{n_i / N} \right). This is the divergence between the distribution of the word’s occurrences across documents and the distribution of document sizes, so it is 0 if the word is perfectly evenly distributed. Gries argues this is less associated with relative frequency, though empirically it still appears to be.
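A minimal sketch of D_\mathrm{KL}, treating the 0 \log 0 terms for documents without the word as 0:

```python
import numpy as np

def d_kl(y, n):
    """KL divergence (in bits) of the word's occurrence distribution
    from the document-size distribution; 0 = perfectly even."""
    y, n = np.asarray(y, float), np.asarray(n, float)
    p = y / y.sum()    # y_i / AF
    q = n / n.sum()    # n_i / N
    keep = p > 0       # 0 * log 0 terms contribute nothing
    return (p[keep] * np.log2(p[keep] / q[keep])).sum()

n = np.array([1000, 2000, 3000])
print(d_kl([10, 20, 30], n))  # 0.0: perfectly even
print(d_kl([60, 0, 0], n))    # log2(6) ≈ 2.585: maximally concentrated
```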
Gries, S. T. (2022). What do (most of) our dispersion measures measure (most)? Dispersion? Journal of Second Language Studies, 5(2), 171–205. doi:10.1075/jsls.21029.gri
Acknowledges the frequency association of DP and D_\mathrm{KL} and proposes a new version of DP whose value is rescaled based on the maximum and minimum possible DP values for a word of the same frequency. This, basically, regresses out the frequency dependence. Also tries to ground dispersion measures empirically by testing how well they predict human reaction times in an auditory lexical decision task, where listeners decide whether a recording is a word or a non-word. The theory, I take it, is that frequency should predict reaction time strongly, and if a dispersion measure is meaningful it will provide additional information beyond frequency.
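I assume the rescaling takes the usual min–max form, something like \text{DP}_\text{norm} = \frac{\text{DP} - \text{DP}_{\min}}{\text{DP}_{\max} - \text{DP}_{\min}}, where \text{DP}_{\min} and \text{DP}_{\max} are the smallest and largest DP values attainable by a word with this AF in this corpus, so the score runs from 0 (as even as possible at that frequency) to 1 (as uneven as possible).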
Sönning, L. (2025). Advancing our understanding of dispersion measures in corpus research. Corpora, 20(1), 3–35. doi:10.3366/cor.2025.0326
Applies a bunch of dispersion measures to the Brown corpus to understand their properties, through exercises like deleting individual documents and measuring the change in dispersion. Also uses the negative binomial distribution to simulate realistic word counts in corpora, though I’m not sure why it doesn’t propose estimating the negative binomial overdispersion parameter as a measure of corpus dispersion – that seems like a reasonable step.
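For what it’s worth, here is a minimal sketch of that idea: simulate per-document counts from a negative binomial (holding the mean fixed across documents, which ignores varying document lengths) and recover the overdispersion parameter by the method of moments. The parameter values are invented, and this is not Sönning’s code:

```python
import numpy as np

rng = np.random.default_rng(1)
mu, alpha = 5.0, 2.0   # mean count and overdispersion: var = mu + alpha * mu^2
size = 1 / alpha       # negative binomial "successes" parameter
p = size / (size + mu)
y = rng.negative_binomial(size, p, 500)   # counts for 500 simulated documents

# method-of-moments estimate: solve var = mu + alpha * mu^2 for alpha
m, v = y.mean(), y.var(ddof=1)
alpha_hat = (v - m) / m**2
print(alpha_hat)   # near 2; larger alpha = burstier, i.e. less evenly dispersed
```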