Analyze Word Frequency
Analyze word frequency in any text. Count occurrences, visualize distribution, filter stop words, and export results with this free online tool.
About
Word frequency analysis quantifies how often each distinct token appears in a corpus. It is foundational to computational linguistics, information retrieval, and natural language processing: without frequency data, tasks like TF-IDF weighting, keyword extraction, and readability scoring cannot function. This tool tokenizes input text with a Unicode-aware regular expression, builds a hash map of lowercased tokens in a single O(n) pass, and reports the frequency f_i of each word w_i. Lexical density is computed as the ratio of unique words to total words, D = V / N, where V is vocabulary size and N is total token count. A density below 0.4 typically indicates repetitive or formulaic prose.
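The tokenize-count-density pipeline described above can be sketched in a few lines of Python. This is an illustrative implementation, not the tool's actual source; the regex and function names are assumptions.

```python
import re
from collections import Counter

def word_frequencies(text: str) -> Counter:
    """Tokenize with a Unicode-aware regex and count lowercased tokens.

    In Python 3, \\w matches letters across scripts by default; the
    trailing (?:'\\w+)* keeps contractions like "don't" as one token.
    """
    tokens = re.findall(r"\w+(?:'\w+)*", text.lower())
    return Counter(tokens)  # single O(n) pass over the tokens

def lexical_density(freqs: Counter) -> float:
    """D = V / N: unique words over total words."""
    n = sum(freqs.values())
    return len(freqs) / n if n else 0.0

freqs = word_frequencies("The cat sat on the mat. The cat slept.")
# 9 tokens, 6 distinct -> D = 6/9 ~ 0.67
```

`Counter` orders nothing by itself; a frequency table would typically be rendered from `freqs.most_common()`.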
The tool includes a curated English stop-word list of 175 function words. Enabling the stop-word filter removes these function words, leaving only content-carrying terms. Note: tokenization assumes whitespace-delimited languages and does not handle CJK scripts or agglutinative morphology. Contractions like "don't" are treated as single tokens. For best results, supply plain text without HTML markup or code fragments.
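Stop-word filtering is a set-membership test over the frequency map. The miniature list below is a hypothetical stand-in for the tool's 175-word list:

```python
# Hypothetical miniature stop-word list; the real list has 175 entries.
STOP_WORDS = {"the", "is", "at", "of", "a", "an", "and", "on", "to", "in"}

def content_words(freqs: dict) -> dict:
    """Drop function words so only content-carrying terms remain."""
    return {w: c for w, c in freqs.items() if w not in STOP_WORDS}

content_words({"the": 3, "cat": 2, "on": 1, "mat": 1})
# -> {"cat": 2, "mat": 1}
```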
Formulas
The absolute frequency of word w_i is the count of its occurrences in the tokenized text:

f_i = Σ_{j=1}^{N} [t_j = w_i]

The relative frequency expresses each word as a percentage of the total corpus:

f_rel(w_i) = (f_i / N) × 100%

Lexical density measures vocabulary richness:

D = V / N

Shannon entropy quantifies information content per word:

H = −Σ_{i=1}^{V} p_i log₂ p_i

Where p_i = f_i / N is the probability of word w_i, N = total token count, V = vocabulary size (unique words), t_j = the j-th token in the text, and [⋅] is the Iverson bracket, returning 1 if the condition is true and 0 otherwise.
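The relative-frequency and entropy formulas translate directly to code. A minimal sketch, assuming the input is a word-to-count mapping (function names are illustrative):

```python
import math

def relative_frequencies(freqs: dict) -> dict:
    """f_rel(w_i) = f_i / N * 100, as a percentage."""
    n = sum(freqs.values())
    return {w: c / n * 100 for w, c in freqs.items()}

def shannon_entropy(freqs: dict) -> float:
    """H = -sum over the V distinct words of p_i * log2(p_i)."""
    n = sum(freqs.values())
    return -sum((c / n) * math.log2(c / n) for c in freqs.values())

shannon_entropy({"a": 1, "b": 1, "c": 1, "d": 1})
# uniform distribution over 4 words -> exactly 2.0 bits
```

Entropy is maximal (log₂ V) when all words are equally frequent and drops toward 0 as one word dominates.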
Reference Data
| Metric | Symbol | Definition | Typical Range |
|---|---|---|---|
| Total Words | N | Count of all tokens after tokenization | 1 - ∞ |
| Unique Words (Vocabulary) | V | Count of distinct lowercased tokens | 1 - N |
| Lexical Density | D | V / N | 0.40 - 0.60 (prose) |
| Average Word Length | - | Mean character count per token | 4 - 6 (English) |
| Hapax Legomena | V_1 | Words appearing exactly once | 40 - 60% of V |
| Frequency Rank | r | Ordinal position by descending count | 1 - V |
| Relative Frequency | f_rel | (f_i / N) × 100% | 0.01 - 7% |
| Zipf's Exponent | s | Power-law exponent: f ∝ 1 / r^s | ≈ 1.0 (natural text) |
| Type-Token Ratio (TTR) | TTR | Same as lexical density for unlemmatized text | 0.40 - 0.60 |
| Stop Words | - | Function words (the, is, at, of) filtered out | 100 - 200 terms |
| Content Words | - | Nouns, verbs, adjectives, adverbs remaining after stop-word removal | 50 - 70% of V |
| Most Frequent Word (English) | - | "the" in general corpora | ≈ 7% of N |
| Heaps' Law | V = K·N^β | Vocabulary growth rate | K ≈ 10 - 100, β ≈ 0.4 - 0.6 |
| Shannon Entropy | H | −Σ_{i=1}^{V} p_i log₂ p_i | 4 - 12 bits |
| Yule's K | K | Vocabulary richness measure independent of text length | 80 - 200 (typical prose) |
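Several of the table's richness metrics fall out of the same frequency map. A sketch, assuming the common simplified form of Yule's characteristic, K = 10⁴ · (Σ f² − N) / N² (the helper name and return shape are illustrative):

```python
def richness_metrics(freqs: dict) -> dict:
    """Type-token ratio, hapax share, and Yule's K from word counts.

    Assumes the simplified Yule's K: 1e4 * (sum(f^2) - N) / N^2.
    """
    n = sum(freqs.values())          # N: total tokens
    v = len(freqs)                   # V: vocabulary size
    hapax = sum(1 for c in freqs.values() if c == 1)  # V_1
    m2 = sum(c * c for c in freqs.values())
    return {
        "ttr": v / n,
        "hapax_share": hapax / v,    # fraction of V appearing once
        "yules_k": 1e4 * (m2 - n) / (n * n),
    }

richness_metrics({"the": 3, "cat": 2, "sat": 1, "on": 1, "mat": 1, "slept": 1})
# ttr = 6/9, hapax_share = 4/6
```

Unlike TTR, Yule's K is designed to stay roughly stable as text length grows, which is why the table lists it as length-independent.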