About

Word frequency analysis quantifies how often each distinct token appears in a corpus. It is foundational to computational linguistics, information retrieval, and natural language processing. Without frequency data, tasks like TF-IDF weighting, keyword extraction, and readability scoring cannot function. This tool tokenizes input text using a Unicode-aware regular expression, constructs an O(n) hash map of lowercased tokens, and reports the frequency f_i for each word w_i. Lexical density is computed as the ratio of unique words to total words, D = V / N, where V is vocabulary size and N is total token count. A density below 0.4 typically indicates repetitive or formulaic prose.
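
The pipeline described above can be sketched in a few lines of Python. The tool's exact regular expression is not published, so the pattern below is an assumption: Latin letters (including the accented range) plus in-word apostrophes.

```python
import re
from collections import Counter

# Assumed Unicode-aware pattern (not the tool's actual source): Latin
# letters including U+00C0-U+024F, with in-word apostrophes so that
# "don't" stays a single token.
TOKEN_RE = re.compile(r"[A-Za-z\u00C0-\u024F]+(?:'[A-Za-z\u00C0-\u024F]+)*")

def analyze(text):
    tokens = [t.lower() for t in TOKEN_RE.findall(text)]  # single O(n) pass
    freqs = Counter(tokens)        # hash map: word w_i -> absolute frequency f_i
    n, v = len(tokens), len(freqs)
    density = v / n if n else 0.0  # D = V / N
    return freqs, n, v, density

freqs, n, v, d = analyze("The cat sat on the mat. The cat slept.")
# N = 9 tokens, V = 6 unique words, D = 6/9
```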

The tool includes a curated English stop-word list of 175 function words. Toggling the stop-word filter on reveals only content-carrying terms. Note: tokenization assumes whitespace-delimited languages and does not handle CJK scripts or agglutinative morphology. Contractions like "don't" are treated as single tokens. For best results, supply plain text without HTML markup or code fragments.
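
The stop-word toggle can be sketched as a simple set-membership filter. The tool's 175-entry list is not reproduced here, so the tiny set below is illustrative only.

```python
from collections import Counter

# Tiny illustrative subset; the real list ships roughly 175 entries
# drawn from standard NLP stop-word corpora.
STOP_WORDS = {"the", "is", "and", "a", "of", "on", "at", "to", "in"}

def content_frequencies(freqs, remove_stop_words=True):
    # With the toggle on, drop grammatical glue and keep content words.
    if not remove_stop_words:
        return dict(freqs)
    return {w: f for w, f in freqs.items() if w not in STOP_WORDS}

freqs = Counter({"the": 3, "cat": 2, "sat": 1, "on": 1, "mat": 1})
print(content_frequencies(freqs))  # {'cat': 2, 'sat': 1, 'mat': 1}
```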


Formulas

The absolute frequency of word wi is the count of its occurrences in the tokenized text.

f_i = Σ_{j=1}^{N} [t_j = w_i]

The relative frequency expresses each word as a percentage of the total corpus.

f_rel = (f_i / N) × 100%

Lexical density measures vocabulary richness.

D = V / N

Shannon entropy quantifies information content per word.

H = -Σ_{i=1}^{V} p_i log2 p_i

Where p_i = f_i / N is the probability of word w_i.

Where N = total token count, V = vocabulary size (unique words), t_j = the j-th token in the text, and [·] is the Iverson bracket, which returns 1 if the condition is true and 0 otherwise.
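
The four formulas can be checked numerically on a small six-token example:

```python
import math
from collections import Counter

tokens = ["the", "cat", "sat", "on", "the", "mat"]
N = len(tokens)                    # total token count
freqs = Counter(tokens)            # counts every f_i in one pass
V = len(freqs)                     # vocabulary size

f_the = sum(1 for t in tokens if t == "the")  # f_i = sum_j [t_j = w_i]
f_rel_the = f_the / N * 100                   # relative frequency, percent
D = V / N                                     # lexical density
H = -sum((f / N) * math.log2(f / N) for f in freqs.values())  # entropy, bits

# N = 6, V = 5, f_the = 2, f_rel_the = 33.3%, D = 0.833
```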

Reference Data

Metric | Symbol | Definition | Typical Range
Total Words | N | Count of all tokens after tokenization | ≥ 1
Unique Words (Vocabulary) | V | Count of distinct lowercased tokens | 1 - N
Lexical Density | D | V / N | 0.40 - 0.60 (prose)
Average Word Length | L | Mean character count per token | 4 - 6 (English)
Hapax Legomena | V1 | Words appearing exactly once | 40 - 60% of V
Frequency Rank | r | Ordinal position by descending count | 1 - V
Relative Frequency | f_rel | (f_i / N) × 100% | 0.01 - 7%
Zipf's Exponent | s | Power-law exponent: f ∝ 1/r^s | ≈ 1.0 (natural text)
Type-Token Ratio (TTR) | TTR | Same as lexical density for unlemmatized text | 0.40 - 0.60
Stop Words | - | Function words (the, is, at, of) filtered out | 100 - 200 terms
Content Words | - | Nouns, verbs, adjectives, adverbs remaining after stop-word removal | 50 - 70% of V
Most Frequent Word (English) | - | "the" in general corpora | ≈ 7% of N
Heaps' Law | V = K·N^β | Vocabulary growth rate | K ≈ 10 - 100, β ≈ 0.4 - 0.6
Shannon Entropy | H | -Σ p_i log2 p_i | 4 - 12 bits
Yule's K | K | Vocabulary richness measure independent of text length | 80 - 200 (typical prose)
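
A few of the table's rows (hapax legomena V1, average word length L, and TTR) can be computed on a toy token list; this is a sketch, not the tool's code:

```python
from collections import Counter

tokens = ["the", "cat", "sat", "on", "the", "mat"]
freqs = Counter(tokens)

hapax = [w for w, f in freqs.items() if f == 1]      # V1: words occurring once
avg_len = sum(len(t) for t in tokens) / len(tokens)  # L: mean chars per token
ttr = len(freqs) / len(tokens)                       # TTR = V / N

# hapax = ['cat', 'sat', 'on', 'mat'], avg_len = 2.83, ttr = 0.83
```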

Frequently Asked Questions

Why does "don't" count as a single word?
The tokenizer treats apostrophes as valid word characters. "don't" counts as one token, and "John's" also counts as one token. This prevents artificial inflation of the word count. If you need to separate contractions, expand them before pasting (e.g., "do not" instead of "don't").

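
The apostrophe behavior can be demonstrated with a regex of the assumed shape (the tool's exact pattern is not published):

```python
import re

# Assumed pattern with in-word apostrophes; "don't" and "John's"
# each match as one token rather than splitting at the apostrophe.
TOKEN_RE = re.compile(r"[A-Za-z\u00C0-\u024F]+(?:'[A-Za-z\u00C0-\u024F]+)*")

print(TOKEN_RE.findall("don't touch John's book"))
# ["don't", 'touch', "John's", 'book']
```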
Why does lexical density decrease as my text gets longer?
This is described by Heaps' Law: vocabulary V grows sublinearly with corpus size N, approximately as V = K·N^β where β < 1. Longer texts reuse common words more, naturally reducing the ratio V / N. Compare texts of similar length for meaningful density comparisons.

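
This effect is easy to see empirically. The snippet below draws tokens from a synthetic Zipf-like distribution (an assumption standing in for real text) and shows the type-token ratio falling as the sample grows:

```python
import random

random.seed(0)
# Zipf-ish toy corpus: low-rank words are drawn far more often than
# high-rank ones, so vocabulary grows sublinearly with sample size.
vocab = [f"w{i}" for i in range(2000)]
weights = [1 / (r + 1) for r in range(2000)]

def ttr(n):
    tokens = random.choices(vocab, weights=weights, k=n)
    return len(set(tokens)) / n

short, long_ = ttr(500), ttr(20000)
# Expect short > long_: the longer sample reuses common words more.
```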
Which words does the stop-word filter remove?
The built-in list contains approximately 175 English function words sourced from standard NLP stop-word corpora (the NLTK and spaCy defaults). Filtering them removes grammatical glue ("the", "is", "and") and surfaces content words (nouns, verbs, adjectives). However, in some domains stop words carry meaning; in legal text, for example, the distinction between "shall" and "may" is critical. Disable the filter when function-word distribution matters.

What shape should the frequency distribution have?
Zipf's Law predicts that the r-th most frequent word has frequency proportional to 1/r^s with s ≈ 1. In the bar chart, you should see a steep initial drop-off followed by a long tail. If your text deviates significantly, it may indicate a restricted vocabulary (e.g., technical manuals) or highly repetitive content.

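
A rough check of the exponent compares the two top-ranked counts. The helper below is hypothetical and runs on synthetic counts that follow f(r) = 100/r exactly:

```python
import math
from collections import Counter

# From f(r) ∝ 1/r^s, the ratio of rank-1 to rank-2 frequencies gives
# s = log(f1/f2) / log(2).
def zipf_exponent(freqs):
    (_, f1), (_, f2) = Counter(freqs).most_common(2)
    return math.log(f1 / f2) / math.log(2)

counts = {f"w{r}": round(100 / r) for r in range(1, 11)}
s = zipf_exponent(counts)  # = 1.0 for this exact-Zipf data
```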
Does the tool handle accented or non-Latin characters?
Yes, for Latin scripts. The tokenizer regex matches Unicode Latin Extended characters (U+00C0 through U+024F), covering accented letters in French, Spanish, German, Portuguese, and other European languages. However, it does not tokenize CJK (Chinese, Japanese, Korean) scripts, Arabic, Hebrew, or Devanagari, which require language-specific segmentation.

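
The stated character range is easy to verify against a regex of the assumed shape: accented Latin words match, CJK text does not.

```python
import re

# Assumed pattern covering Latin Extended (U+00C0-U+024F); CJK
# characters fall outside this range and are simply not matched.
TOKEN_RE = re.compile(r"[A-Za-z\u00C0-\u024F]+(?:'[A-Za-z\u00C0-\u024F]+)*")

print(TOKEN_RE.findall("café naïve Straße São"))  # all four words match
print(TOKEN_RE.findall("日本語テキスト"))          # [] - no Latin tokens
```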
How should I interpret the entropy value?
Shannon entropy H measures the average information per word in bits. A text where every word is equally likely yields the maximum entropy, log2 V. Repetitive text yields low entropy. Typical English prose ranges from 7 to 10 bits. Values below 5 suggest extreme repetition; values above 11 suggest high vocabulary diversity or very short text.

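
The two extremes, uniform and repetitive, bracket the behavior described above:

```python
import math
from collections import Counter

def entropy(tokens):
    n = len(tokens)
    return -sum((f / n) * math.log2(f / n) for f in Counter(tokens).values())

uniform = ["a", "b", "c", "d"]    # every word equally likely
repetitive = ["a"] * 7 + ["b"]    # heavily skewed toward one word

h_max = entropy(uniform)     # log2(V) = log2(4) = 2.0 bits
h_low = entropy(repetitive)  # well under 1 bit
```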
Is the count case-sensitive?
By default, all tokens are lowercased before counting, so "The" and "the" merge into one entry. This is standard practice for frequency analysis. If case-sensitive counting is needed (e.g., distinguishing proper nouns from common nouns), preprocess your text to tag proper nouns before analysis.
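
Case folding before counting can be illustrated directly:

```python
from collections import Counter

tokens = ["The", "the", "THE", "Cat"]
case_insensitive = Counter(t.lower() for t in tokens)  # default behavior
case_sensitive = Counter(tokens)                       # each form kept apart

print(case_insensitive)  # Counter({'the': 3, 'cat': 1})
```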