In corpus linguistics, dispersion describes a word’s usage across many documents in a corpus. An evenly dispersed word is used at a similar rate across all documents; an unevenly dispersed word is used often in some documents and rarely in others.
(Annoyingly, this is the opposite of how statisticians would use the word “dispersion”. To a statistician, normal dispersion would mean the rate varies about as much as you’d expect from, say, a binomial, while overdispersion would mean the word’s usage rate varies a lot from document to document; to a corpus linguist, that overdispersed word is one that is poorly dispersed.)
In what follows, let d be the number of documents in the corpus, each with n_i words, totaling N = \sum_i n_i words in the corpus. The target word (whose dispersion we are measuring) occurs y_i times in document i. Its absolute frequency is \text{AF} = \sum_i y_i and its relative frequency is \text{RF} = \text{AF}/N.
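To make the notation concrete, here is a tiny made-up example (three documents, invented counts):

```python
# Toy example of the notation above; the corpus and counts are invented.
n = [1000, 2000, 1000]   # n_i: words in each of the d documents
y = [10, 5, 0]           # y_i: occurrences of the target word in each document

d = len(n)               # 3 documents
N = sum(n)               # 4000 words in the corpus
AF = sum(y)              # absolute frequency: 15
RF = AF / N              # relative frequency: 0.00375
```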
Gries, S. T. (2008). Dispersions and adjusted frequencies in corpora. International Journal of Corpus Linguistics, 13(4), 403–437. doi:10.1075/ijcl.13.4.02gri
Proposes the Deviation of Proportions (DP) metric, which is \text{DP} = \frac{1}{2} \sum_{i=1}^d \left| \frac{y_i}{\text{AF}} - \frac{n_i}{N} \right|.
This compares the fraction of occurrences that are in one document (y_i / \text{AF}) to the fraction of all text that is in this document (n_i / N). It is 0 when the word is perfectly evenly distributed and 1 when it is maximally uneven. However, it is correlated with relative frequency: DP is almost always close to 0 for words with high relative frequency, regardless of their distribution.
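A minimal sketch of the DP calculation in Python (my own code, not from the paper), using the y_i and n_i notation above:

```python
import numpy as np

def deviation_of_proportions(y, n):
    """Gries (2008) Deviation of Proportions (DP).

    y: occurrences of the target word in each document (y_i)
    n: total words in each document (n_i)
    Returns 0 for a perfectly evenly distributed word, approaching 1
    for a maximally uneven one.
    """
    y = np.asarray(y, dtype=float)
    n = np.asarray(n, dtype=float)
    occurrence_share = y / y.sum()    # y_i / AF
    text_share = n / n.sum()          # n_i / N
    return 0.5 * np.abs(occurrence_share - text_share).sum()

# A word concentrated in one document vs. one spread proportionally:
print(deviation_of_proportions([10, 0, 0], [1000, 1000, 1000]))  # ~0.667
print(deviation_of_proportions([5, 5, 5], [1000, 1000, 1000]))   # 0.0
```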
Burch, B., Egbert, J., & Biber, D. (2016). Measuring and interpreting lexical dispersion in corpus linguistics. Journal of Research Design and Statistics in Linguistics and Communication Science, 3(2), 189–216. doi:10.1558/jrds.33066
Proposes a dispersion measure based on pairwise comparisons between all documents in a corpus, which appears to be equivalent to the Gini coefficient of the per-document relative frequencies. The Gini coefficient would be G = \frac{\sum_{i=1}^d \sum_{j=1}^d \left| \frac{y_i}{n_i} - \frac{y_j}{n_j} \right|}{2d \sum_{i=1}^d \frac{y_i}{n_i}}, half the relative mean absolute pairwise difference. To simplify, let r_i = \frac{y_i / n_i}{\sum_{j=1}^d y_j / n_j} be the normalized proportion for document i, so the proportions sum to 1. Then let D_A = 1 - \frac{\sum_{i=1}^{d-1} \sum_{j=i+1}^d \left|r_i - r_j\right|}{d-1}, the Gini flipped so that 1 is maximum equality (even dispersion). (The factor of 2 goes away because the limits of the sums are adjusted to cover only one triangle of the distance matrix. I do not see why the denominator is d-1 rather than d, but Sönning’s tlda package offers a version with d, so perhaps d-1 is a mistake.)
Calculating this directly is O(d^2), but it can be reduced to an O(d) pass after an O(d \log d) sort of the relative proportions. If r_{(1)}, r_{(2)}, \dots, r_{(d)} are the proportions in descending order, then D_A = \frac{2 \left( -1 + \sum_{i=1}^d i\, r_{(i)}\right)}{d - 1}. (When the proportions are in descending order, the quantity inside the absolute value is always nonnegative, so the terms can be grouped algebraically.)
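A sketch of both versions in Python (my own code, using the d-1 denominator as written above), mainly to check that the sorted form matches the pairwise one:

```python
import numpy as np

def d_a_pairwise(y, n):
    """D_A from all pairwise differences of normalized proportions; O(d^2)."""
    rates = np.asarray(y, dtype=float) / np.asarray(n, dtype=float)
    r = rates / rates.sum()               # normalized proportions r_i, summing to 1
    d = len(r)
    upper_triangle = sum(abs(r[i] - r[j])
                         for i in range(d - 1) for j in range(i + 1, d))
    return 1 - upper_triangle / (d - 1)

def d_a_sorted(y, n):
    """Equivalent form using proportions sorted in descending order; O(d log d)."""
    rates = np.asarray(y, dtype=float) / np.asarray(n, dtype=float)
    r = np.sort(rates / rates.sum())[::-1]  # r_(1) >= r_(2) >= ... >= r_(d)
    ranks = np.arange(1, len(r) + 1)
    return 2 * (np.sum(ranks * r) - 1) / (len(r) - 1)

y = [3, 0, 7, 1, 4]
n = [1000, 800, 1200, 900, 1100]
print(d_a_pairwise(y, n), d_a_sorted(y, n))   # the two should agree
```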
Egbert, J., Burch, B., & Biber, D. (2020). Lexical dispersion and corpus design. International Journal of Corpus Linguistics, 25(1), 89–115. doi:10.1075/ijcl.18010.egb
Uses D_A to argue that dispersion should be measured across “linguistically meaningful” corpus units, i.e., not just by arbitrarily dividing up the corpus into 500-word chunks (or whatever). (Just another Modifiable Areal Unit Problem.) Also extensively compares DP and D_A on the Brown corpus.
Gries, S. T. (2021). A new approach to (key) keywords analysis: Using frequency, and now also dispersion. Research in Corpus Linguistics, 9(2), 1–33. doi:10.32714/ricl.09.02.02
Proposes a dispersion measure based on Kullback–Leibler divergence: D_\mathrm{KL} = \sum_{i=1}^d \frac{y_i}{\text{AF}} \log_2 \left(\frac{y_i / \text{AF}}{n_i / N} \right). This is the divergence between the distribution of word occurrences across documents and the distribution of document sizes, so it is 0 if the word is distributed perfectly evenly. Gries argues this is less associated with relative frequency, though empirically it still appears to be.
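A minimal sketch in Python (my own code; terms with y_i = 0 contribute 0, following the usual 0 \log 0 = 0 convention):

```python
import numpy as np

def d_kl(y, n):
    """Gries (2021) KL-divergence dispersion, in bits.

    Divergence of the distribution of word occurrences across documents
    (y_i / AF) from the distribution of document sizes (n_i / N).
    0 means the word is distributed perfectly evenly.
    """
    y = np.asarray(y, dtype=float)
    n = np.asarray(n, dtype=float)
    p = y / y.sum()               # y_i / AF
    q = n / n.sum()               # n_i / N
    nz = p > 0                    # 0 * log(0) terms contribute nothing
    return np.sum(p[nz] * np.log2(p[nz] / q[nz]))

print(d_kl([5, 5, 5], [1000, 1000, 1000]))    # 0.0
print(d_kl([15, 0, 0], [1000, 1000, 1000]))   # log2(3), about 1.58
```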
Gries, S. T. (2022). What do (most of) our dispersion measures measure (most)? Dispersion? Journal of Second Language Studies, 5(2), 171–205. doi:10.1075/jsls.21029.gri
Acknowledges the frequency association of DP and D_\mathrm{KL} and proposes a new version of DP in which the value is rescaled using the maximum and minimum possible DP values for a word of the same frequency; this basically regresses out the frequency dependence. Also tries to empirically ground dispersion measures by testing how well they predict human reaction times in an auditory lexical decision task (deciding whether an audio recording is of a word or a non-word). I guess the theory is that frequency should predict this strongly, and that if a dispersion measure is meaningful it will provide additional information beyond frequency.
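I have not reimplemented the rescaling, but presumably it takes the usual min-max form \text{DP}_{\text{rescaled}} = \frac{\text{DP} - \text{DP}_{\min}}{\text{DP}_{\max} - \text{DP}_{\min}}, where \text{DP}_{\min} and \text{DP}_{\max} are the smallest and largest DP values achievable by a word of that frequency in that corpus; the exact construction of those bounds is in the paper.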
Gries, S. T. (2022). Toward more careful corpus statistics: Uncertainty estimates for frequencies, dispersions, association measures, and more. Research Methods in Applied Linguistics, 1(1), 100002. doi:10.1016/j.rmal.2021.100002
Argues for presenting dispersion (and other corpus measures) with uncertainty estimates, via bootstrapping, and for constructing joint confidence regions for multiple corpus statistics simultaneously. For uncertainty, argues that a simple bag-of-words approach (e.g., for frequency, just doing a binomial CI) is inappropriate because the sampling unit is documents, not words, so uncertainty should be calculated by bootstrapping documents. For plots of dispersion vs. frequency, he fits ellipses to the bootstrap distribution to form a joint confidence region. Applies the method to frequency, dispersion, and association measures.
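A minimal sketch of the document-level bootstrap (my own code, not Gries’s), here applied to DP as defined above; the percentile interval is just one reasonable choice:

```python
import numpy as np

def deviation_of_proportions(y, n):
    """Gries (2008) DP; see the definition earlier in these notes."""
    y, n = np.asarray(y, dtype=float), np.asarray(n, dtype=float)
    return 0.5 * np.abs(y / y.sum() - n / n.sum()).sum()

def bootstrap_ci(y, n, stat=deviation_of_proportions, reps=2000, level=0.95, seed=0):
    """Percentile CI obtained by resampling documents (not words) with replacement."""
    rng = np.random.default_rng(seed)
    y, n = np.asarray(y), np.asarray(n)
    d = len(y)
    stats = []
    for _ in range(reps):
        idx = rng.integers(0, d, size=d)   # resample whole documents
        if y[idx].sum() == 0:
            continue                        # skip degenerate resamples with no occurrences
        stats.append(stat(y[idx], n[idx]))
    alpha = (1 - level) / 2
    return np.quantile(stats, [alpha, 1 - alpha])

y = [3, 0, 7, 1, 4, 0, 2, 5]
n = [1000, 800, 1200, 900, 1100, 950, 1000, 1050]
print(bootstrap_ci(y, n))
```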
Nelson, R. N. (2023). Too noisy at the bottom: Why Gries’ (2008, 2020) dispersion measures cannot identify unbiased distributions of words. Journal of Quantitative Linguistics, 30(2), 153–166. doi:10.1080/09296174.2023.2172711
Gries, S. T. (2024). Corrections to Nelson (2023): DPnorm and DKLnorm are not wrong on pi at all. Journal of Quantitative Linguistics, 31(1), 43–53. doi:10.1080/09296174.2024.2324616
Nelson, R. (2024). Author’s response. Journal of Quantitative Linguistics, 31(2), 161–164. doi:10.1080/09296174.2024.2344944
Nelson points out that DP is just \text{DP} = \frac{1}{2} \left\|\frac{y_i}{\text{AF}} - \frac{n_i}{N} \right\|_1, half the Manhattan (L_1) distance between the two proportion vectors. Since the proportions must sum to 1, they live on a simplex; as d \to \infty, the volume near the surface of the simplex (the extreme values) grows faster than the volume of the interior, pushing DP upward. He has similar complaints about D_\mathrm{KL}.
Gries replies, claiming that some of the example calculations are wrong and that Nelson isn’t using the normalized versions of each measure.
I’m not sure either is fully correct: normalizing doesn’t fix the underlying problems, but Nelson’s argument works only if you assume the rates are distributed uniformly on the simplex, rather than converging toward a limiting point. His simulations also increase d while keeping N fixed, which increases the sampling variance of the individual proportion estimates and hence makes them deviate more from uniform. The right comparison would be to increase d while holding the number of tokens per part fixed.
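A quick simulation sketch of that comparison (mine, not from either paper): DP for a word that genuinely occurs at a constant rate everywhere, once with tokens per document held fixed as d grows, and once with the total corpus size held fixed.

```python
import numpy as np

rng = np.random.default_rng(1)
RATE = 0.001   # the word's true per-token occurrence rate, identical in every document

def dp(y, n):
    y, n = np.asarray(y, dtype=float), np.asarray(n, dtype=float)
    return 0.5 * np.abs(y / y.sum() - n / n.sum()).sum()

for d in (10, 100, 1000):
    # Tokens per document fixed: the corpus grows with d.
    n_fixed_doc = np.full(d, 2000)
    # Total corpus size fixed: documents shrink as d grows (Nelson's setup).
    n_fixed_total = np.full(d, 2_000_000 // d)
    print(d,
          dp(rng.binomial(n_fixed_doc, RATE), n_fixed_doc),
          dp(rng.binomial(n_fixed_total, RATE), n_fixed_total))
```

In runs like this, DP stays roughly flat in the fixed-per-document condition but climbs in the fixed-N condition as the documents shrink, which is the distinction the paragraph above is pointing at.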
Sönning, L. (2025). Advancing our understanding of dispersion measures in corpus research. Corpora, 20(1), 3–35. doi:10.3366/cor.2025.0326
Applies a bunch of dispersion measures to the Brown corpus to understand their properties, through exercises like deleting individual documents and measuring the change in dispersion. Also uses the negative binomial distribution to simulate realistic word counts in corpora, though I’m not sure why it doesn’t propose estimating the negative binomial overdispersion parameter as a measure of corpus dispersion – that seems like a reasonable step.
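A sketch of what that might look like (my own code, not Sönning’s): simulate negative-binomial counts via the gamma-Poisson mixture and recover the shape parameter by the method of moments, treating a larger shape as more even dispersion. For simplicity this assumes equal-sized documents.

```python
import numpy as np

rng = np.random.default_rng(2)

def simulate_nb_counts(d, mean, shape):
    """Negative binomial counts via the gamma-Poisson mixture:
    lambda_i ~ Gamma(shape, scale=mean/shape), y_i ~ Poisson(lambda_i).
    Smaller shape means more clumping (variance = mean + mean^2 / shape)."""
    lam = rng.gamma(shape, mean / shape, size=d)
    return rng.poisson(lam)

def mom_shape(y):
    """Method-of-moments estimate of the negative binomial shape parameter."""
    m, v = np.mean(y), np.var(y, ddof=1)
    return np.inf if v <= m else m ** 2 / (v - m)

even_counts = simulate_nb_counts(500, mean=5, shape=50)     # nearly Poisson
bursty_counts = simulate_nb_counts(500, mean=5, shape=0.5)  # heavily clumped
print(mom_shape(even_counts), mom_shape(bursty_counts))
```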
Nelson, R. N. (2025). Groundhog Day is not a good model for corpus dispersion. Journal of Quantitative Linguistics, 32(2), 103–127. doi:10.1080/09296174.2024.2423415
Notes that perfect evenness (identical counts in every document) is not the ideal case for dispersion, because we always expect some random variation even from a completely unbiased word. Instead, proposes a measure based on the Poisson distribution, comparing the observed distribution of the word across corpus parts with what a Poisson model of random occurrence would predict.
I’m concerned by the idea of ignoring the original corpus divisions, and the measure also throws away the actual counts, using only the number of parts that do or do not contain the word. It’s also not obvious why the model should be Poisson rather than binomial. But the idea that we should allow for some random variation is clearly correct, so some suitable probabilistic model should be possible.
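I have not reproduced Nelson’s formula here, but a sketch of the general idea as I read it (my own illustration, not his measure): under a Poisson model in which each document’s expected count is proportional to its size, the expected number of documents containing the word at least once is \sum_i \left(1 - e^{-\text{RF} \cdot n_i}\right), and the observed number of word-containing documents can be compared against that expectation instead of against perfect evenness.

```python
import numpy as np

def poisson_presence(y, n):
    """Observed vs. Poisson-expected number of documents containing the word.
    An illustration of the 'allow random variation' idea, not Nelson's formula."""
    y, n = np.asarray(y, dtype=float), np.asarray(n, dtype=float)
    rf = y.sum() / n.sum()                     # corpus-wide relative frequency
    expected = np.sum(1 - np.exp(-rf * n))     # P(count >= 1) summed over documents
    observed = np.count_nonzero(y)
    return observed, expected

y = [3, 0, 7, 1, 4, 0, 2, 5]
n = [1000, 800, 1200, 900, 1100, 950, 1000, 1050]
print(poisson_presence(y, n))
```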