Dispersion (corpus linguistics)

Alex Reinhart – Updated December 4, 2025

In corpus linguistics, dispersion describes how evenly a word’s usage is spread across the documents in a corpus. An evenly dispersed word is used at a similar rate across all documents; an unevenly dispersed word is used often in some documents and rarely in others.

(Annoyingly, this is the opposite of how statisticians would use the word “dispersion”. Normal dispersion would mean the rate varies about as much as you’d expect from, say, a binomial; overdispersion would mean the word’s usage rate varies a lot from document to document. But in the corpus-linguistics sense, a word like that is unevenly dispersed, not well dispersed.)
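To make the statistical sense concrete, here is a rough sketch that compares the observed variance of a word's per-document counts to the variance a binomial would give. It assumes, purely for simplicity, that every document has the same length n; the function name `variance_ratio` and the toy counts are hypothetical, and the ratio is only an illustrative check, not a dispersion measure from the corpus linguistics literature.

```python
from statistics import pvariance

def variance_ratio(counts, n):
    """Observed variance of counts divided by the binomial variance.

    Assumes every document has the same length n (a simplification).
    Values near 1 suggest binomial-like variation; values far above 1
    suggest overdispersion in the statistician's sense.
    """
    p = sum(counts) / (n * len(counts))   # pooled usage rate
    binomial_var = n * p * (1 - p)        # variance of a Binomial(n, p) count
    return pvariance(counts) / binomial_var

# Made-up counts for six documents of 1000 words each; both words occur 60 times.
steady = [7, 13, 10, 6, 14, 10]   # similar rate everywhere
bursty = [0, 0, 58, 0, 2, 0]      # concentrated in one document

steady_ratio = variance_ratio(steady, 1000)
bursty_ratio = variance_ratio(bursty, 1000)
```

The bursty word has a far larger ratio: it is overdispersed to a statistician, yet it is the word a corpus linguist would call unevenly dispersed.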

Below, let d be the number of documents in the corpus, where document i has n_i words, totaling N = \sum_i n_i words in the corpus. The target word (whose dispersion we are measuring) occurs y_i times in document i. Its absolute frequency is \text{AF} = \sum_i y_i and its relative frequency is \text{RF} = \text{AF}/N.
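As a minimal sketch of this notation, the quantities can be computed directly from per-document counts. The toy corpus below is made up for illustration, and the variable names simply mirror the symbols in the text.

```python
# Hypothetical toy corpus: four documents, numbers chosen for illustration.
doc_lengths = [1000, 2000, 1500, 500]   # n_i: words in document i
word_counts = [10, 20, 15, 5]           # y_i: occurrences of the target word

d = len(doc_lengths)    # number of documents
N = sum(doc_lengths)    # total words in the corpus
AF = sum(word_counts)   # absolute frequency
RF = AF / N             # relative frequency

# Per-document usage rates y_i / n_i.
rates = [y / n for y, n in zip(word_counts, doc_lengths)]
```

In this example every per-document rate equals RF, so the word is used at the same rate in every document, which is the evenly dispersed extreme.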