About

Word frequency analysis quantifies how often each distinct token appears in a corpus. It is foundational to computational linguistics, information retrieval, and natural language processing. Without frequency data, tasks like TF-IDF weighting, keyword extraction, and readability scoring cannot function. This tool tokenizes input text using a Unicode-aware regular expression, constructs an O(n) hash map of lowercased tokens, and reports the frequency f_i for each word w_i. Lexical density is computed as the ratio of unique words to total words, D = V / N, where V is vocabulary size and N is total token count. A density below 0.4 typically indicates repetitive or formulaic prose.
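
The pipeline described above can be sketched in a few lines of Python. The tool's exact regular expression is not published, so the pattern below is an assumption: Latin letters (including the accented range) plus in-word apostrophes.

```python
import re
from collections import Counter

# Assumed Unicode-aware pattern (not the tool's actual source): Latin
# letters including U+00C0-U+024F, with in-word apostrophes so that
# "don't" stays a single token.
TOKEN_RE = re.compile(r"[A-Za-z\u00C0-\u024F]+(?:'[A-Za-z\u00C0-\u024F]+)*")

def analyze(text):
    tokens = [t.lower() for t in TOKEN_RE.findall(text)]  # single O(n) pass
    freqs = Counter(tokens)        # hash map: word w_i -> absolute frequency f_i
    n, v = len(tokens), len(freqs)
    density = v / n if n else 0.0  # D = V / N
    return freqs, n, v, density

freqs, n, v, d = analyze("The cat sat on the mat. The cat slept.")
# N = 9 tokens, V = 6 unique words, D = 6/9
```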

The tool includes a curated English stop-word list of 175 function words. Toggling the stop-word filter on reveals only content-carrying terms. Note: tokenization assumes whitespace-delimited languages and does not handle CJK scripts or agglutinative morphology. Contractions like "don't" are treated as single tokens. For best results, supply plain text without HTML markup or code fragments.
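
The stop-word toggle can be sketched as a simple set-membership filter. The tool's 175-entry list is not reproduced here, so the tiny set below is illustrative only.

```python
from collections import Counter

# Tiny illustrative subset; the real list ships roughly 175 entries
# drawn from standard NLP stop-word corpora.
STOP_WORDS = {"the", "is", "and", "a", "of", "on", "at", "to", "in"}

def content_frequencies(freqs, remove_stop_words=True):
    # With the toggle on, drop grammatical glue and keep content words.
    if not remove_stop_words:
        return dict(freqs)
    return {w: f for w, f in freqs.items() if w not in STOP_WORDS}

freqs = Counter({"the": 3, "cat": 2, "sat": 1, "on": 1, "mat": 1})
print(content_frequencies(freqs))  # {'cat': 2, 'sat': 1, 'mat': 1}
```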


Formulas

The absolute frequency of word wi is the count of its occurrences in the tokenized text.

f_i = Σ_{j=1}^{N} [t_j = w_i]

The relative frequency expresses each word as a percentage of the total corpus.

f_rel = (f_i / N) × 100%

Lexical density measures vocabulary richness.

D = V / N

Shannon entropy quantifies information content per word.

H = -Σ_{i=1}^{V} p_i log2 p_i

Where p_i = f_i / N is the probability of word w_i.

Where N = total token count, V = vocabulary size (unique words), t_j = the j-th token in the text, and [·] is the Iverson bracket, which returns 1 if the condition is true and 0 otherwise.
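
The four formulas can be checked numerically on a small six-token example:

```python
import math
from collections import Counter

tokens = ["the", "cat", "sat", "on", "the", "mat"]
N = len(tokens)                    # total token count
freqs = Counter(tokens)            # counts every f_i in one pass
V = len(freqs)                     # vocabulary size

f_the = sum(1 for t in tokens if t == "the")  # f_i = sum_j [t_j = w_i]
f_rel_the = f_the / N * 100                   # relative frequency, percent
D = V / N                                     # lexical density
H = -sum((f / N) * math.log2(f / N) for f in freqs.values())  # entropy, bits

# N = 6, V = 5, f_the = 2, f_rel_the = 33.3%, D = 0.833
```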

Reference Data

Metric | Symbol | Definition | Typical Range
Total Words | N | Count of all tokens after tokenization | ≥ 1
Unique Words (Vocabulary) | V | Count of distinct lowercased tokens | 1 - N
Lexical Density | D | V / N | 0.40 - 0.60 (prose)
Average Word Length | L | Mean character count per token | 4 - 6 (English)
Hapax Legomena | V1 | Words appearing exactly once | 40 - 60% of V
Frequency Rank | r | Ordinal position by descending count | 1 - V
Relative Frequency | f_rel | (f_i / N) × 100% | 0.01 - 7%
Zipf's Exponent | s | Power-law exponent: f ∝ 1/r^s | ≈ 1.0 (natural text)
Type-Token Ratio (TTR) | TTR | Same as lexical density for unlemmatized text | 0.40 - 0.60
Stop Words | - | Function words (the, is, at, of) filtered out | 100 - 200 terms
Content Words | - | Nouns, verbs, adjectives, adverbs remaining after stop-word removal | 50 - 70% of V
Most Frequent Word (English) | - | "the" in general corpora | ≈ 7% of N
Heaps' Law | V = K·N^β | Vocabulary growth rate | K ≈ 10 - 100, β ≈ 0.4 - 0.6
Shannon Entropy | H | -Σ p_i log2 p_i | 4 - 12 bits
Yule's K | K | Vocabulary richness measure independent of text length | 80 - 200 (typical prose)
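
A few of the table's rows (hapax legomena V1, average word length L, and TTR) can be computed on a toy token list; this is a sketch, not the tool's code:

```python
from collections import Counter

tokens = ["the", "cat", "sat", "on", "the", "mat"]
freqs = Counter(tokens)

hapax = [w for w, f in freqs.items() if f == 1]      # V1: words occurring once
avg_len = sum(len(t) for t in tokens) / len(tokens)  # L: mean chars per token
ttr = len(freqs) / len(tokens)                       # TTR = V / N

# hapax = ['cat', 'sat', 'on', 'mat'], avg_len = 2.83, ttr = 0.83
```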

Frequently Asked Questions

Why does "don't" count as a single word?
The tokenizer treats apostrophes as valid word characters. "don't" counts as one token, and "John's" also counts as one token. This prevents artificial inflation of the word count. If you need to separate contractions, expand them before pasting (e.g., "do not" instead of "don't").

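
The apostrophe behavior can be demonstrated with a regex of the assumed shape (the tool's exact pattern is not published):

```python
import re

# Assumed pattern with in-word apostrophes; "don't" and "John's"
# each match as one token rather than splitting at the apostrophe.
TOKEN_RE = re.compile(r"[A-Za-z\u00C0-\u024F]+(?:'[A-Za-z\u00C0-\u024F]+)*")

print(TOKEN_RE.findall("don't touch John's book"))
# ["don't", 'touch', "John's", 'book']
```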
Why does lexical density decrease as my text gets longer?
This is described by Heaps' Law: vocabulary V grows sublinearly with corpus size N, approximately as V = K·N^β where β < 1. Longer texts reuse common words more, naturally reducing the ratio V / N. Compare texts of similar length for meaningful density comparisons.

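
This effect is easy to see empirically. The snippet below draws tokens from a synthetic Zipf-like distribution (an assumption standing in for real text) and shows the type-token ratio falling as the sample grows:

```python
import random

random.seed(0)
# Zipf-ish toy corpus: low-rank words are drawn far more often than
# high-rank ones, so vocabulary grows sublinearly with sample size.
vocab = [f"w{i}" for i in range(2000)]
weights = [1 / (r + 1) for r in range(2000)]

def ttr(n):
    tokens = random.choices(vocab, weights=weights, k=n)
    return len(set(tokens)) / n

short, long_ = ttr(500), ttr(20000)
# Expect short > long_: the longer sample reuses common words more.
```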
Which words does the stop-word filter remove?
The built-in list contains approximately 175 English function words sourced from standard NLP stop-word corpora (the NLTK and spaCy defaults). Filtering them removes grammatical glue ("the", "is", "and") and surfaces content words (nouns, verbs, adjectives). However, in some domains stop words carry meaning; in legal text, for example, the distinction between "shall" and "may" is critical. Disable the filter when function-word distribution matters.

What shape should the frequency distribution have?
Zipf's Law predicts that the r-th most frequent word has frequency proportional to 1/r^s with s ≈ 1. In the bar chart, you should see a steep initial drop-off followed by a long tail. If your text deviates significantly, it may indicate a restricted vocabulary (e.g., technical manuals) or highly repetitive content.

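
A rough check of the exponent compares the two top-ranked counts. The helper below is hypothetical and runs on synthetic counts that follow f(r) = 100/r exactly:

```python
import math
from collections import Counter

# From f(r) ∝ 1/r^s, the ratio of rank-1 to rank-2 frequencies gives
# s = log(f1/f2) / log(2).
def zipf_exponent(freqs):
    (_, f1), (_, f2) = Counter(freqs).most_common(2)
    return math.log(f1 / f2) / math.log(2)

counts = {f"w{r}": round(100 / r) for r in range(1, 11)}
s = zipf_exponent(counts)  # = 1.0 for this exact-Zipf data
```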
Does the tool handle accented or non-Latin characters?
Yes, for Latin scripts. The tokenizer regex matches Unicode Latin Extended characters (U+00C0 through U+024F), covering accented letters in French, Spanish, German, Portuguese, and other European languages. However, it does not tokenize CJK (Chinese, Japanese, Korean) scripts, Arabic, Hebrew, or Devanagari, which require language-specific segmentation.

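
The stated character range is easy to verify against a regex of the assumed shape: accented Latin words match, CJK text does not.

```python
import re

# Assumed pattern covering Latin Extended (U+00C0-U+024F); CJK
# characters fall outside this range and are simply not matched.
TOKEN_RE = re.compile(r"[A-Za-z\u00C0-\u024F]+(?:'[A-Za-z\u00C0-\u024F]+)*")

print(TOKEN_RE.findall("café naïve Straße São"))  # all four words match
print(TOKEN_RE.findall("日本語テキスト"))          # [] - no Latin tokens
```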
How should I interpret the entropy value?
Shannon entropy H measures the average information per word in bits. A text where every word is equally likely yields the maximum entropy, log2 V. Repetitive text yields low entropy. Typical English prose ranges from 7 to 10 bits. Values below 5 suggest extreme repetition; values above 11 suggest high vocabulary diversity or very short text.

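
The two extremes, uniform and repetitive, bracket the behavior described above:

```python
import math
from collections import Counter

def entropy(tokens):
    n = len(tokens)
    return -sum((f / n) * math.log2(f / n) for f in Counter(tokens).values())

uniform = ["a", "b", "c", "d"]    # every word equally likely
repetitive = ["a"] * 7 + ["b"]    # heavily skewed toward one word

h_max = entropy(uniform)     # log2(V) = log2(4) = 2.0 bits
h_low = entropy(repetitive)  # well under 1 bit
```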
Is the count case-sensitive?
By default, all tokens are lowercased before counting, so "The" and "the" merge into one entry. This is standard practice for frequency analysis. If case-sensitive counting is needed (e.g., distinguishing proper nouns from common nouns), preprocess your text to tag proper nouns before analysis.
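
Case folding before counting can be illustrated directly:

```python
from collections import Counter

tokens = ["The", "the", "THE", "Cat"]
case_insensitive = Counter(t.lower() for t in tokens)  # default behavior
case_sensitive = Counter(tokens)                       # each form kept apart

print(case_insensitive)  # Counter({'the': 3, 'cat': 1})
```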