
About

In information retrieval and natural language processing, the frequency of lexical tokens obeys Zipf's Law, which states that the frequency of any word is inversely proportional to its rank in the frequency table. For SEO specialists and data linguists, relying solely on single-word counts is insufficient. Search engines analyze semantic context through N-Grams (contiguous sequences of n items from a given sample of text).

This tool surpasses standard counters by implementing a multi-layer analysis engine. It calculates Type-Token Ratio (TTR) to assess lexical diversity, processes Bigrams (n=2) and Trigrams (n=3) to detect recurring phrases, and estimates readability metrics. The local database includes an expanded library of over 800 stop words, including archaic forms and SEO filtering terms, ensuring that the noise-to-signal ratio is minimized during analysis.
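
To make the n-gram layer concrete, here is a minimal Python sketch of bigram and trigram extraction. The tool's actual engine is not published, so the tokenization rule (lowercased \w+ matching) and the sample text are assumptions for illustration only.

```python
import re
from collections import Counter

def ngrams(tokens, n):
    """Return all contiguous n-grams from a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Assumed tokenization: lowercase, word characters only.
text = "Best credit card offers: compare the best credit card rates"
tokens = re.findall(r"\w+", text.lower())

bigram_counts = Counter(ngrams(tokens, 2))   # n=2 phrase patterns
trigram_counts = Counter(ngrams(tokens, 3))  # n=3 phrase patterns

print(bigram_counts.most_common(2))
# [(('best', 'credit'), 2), (('credit', 'card'), 2)]
```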

Formulas

To identify the significance of a term beyond its raw count, we consider the frequency f of a term t relative to the total word count N. To analyze phrase patterns, however, we utilize the N-Gram conditional probability approximation:

P(w_n | w_{n-1}) ≈ Count(w_{n-1}, w_n) / Count(w_{n-1})
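
A direct translation of this estimate into code might look like the following Python sketch; the sample tokens are invented for illustration.

```python
from collections import Counter

def bigram_probability(tokens, w_prev, w):
    """Estimate P(w | w_prev) as Count(w_prev, w) / Count(w_prev)."""
    pair_counts = Counter(zip(tokens, tokens[1:]))
    unigram_counts = Counter(tokens)
    if unigram_counts[w_prev] == 0:
        return 0.0  # history word never observed
    return pair_counts[(w_prev, w)] / unigram_counts[w_prev]

tokens = "the cat sat on the mat near the cat".split()
print(bigram_probability(tokens, "the", "cat"))  # 2 / 3 ≈ 0.667
```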

The Lexical Diversity, or Type-Token Ratio (TTR), serves as an indicator of writing quality. A low TTR suggests repetitive, simple text, while a high TTR indicates a varied vocabulary.

TTR = V / N

Where V is the size of the vocabulary (unique words) and N is the total number of tokens.
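
In code the ratio is a one-liner; guarding against empty input is our assumption about how the edge case should be handled.

```python
def type_token_ratio(tokens):
    """Lexical diversity: unique types V over total tokens N."""
    return len(set(tokens)) / len(tokens) if tokens else 0.0

tokens = "the quick brown fox jumps over the lazy dog".split()
print(round(type_token_ratio(tokens), 2))  # V=8, N=9 -> 0.89
```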

Reference Data

Metric | Definition | Formula / Representation | Typical Value (SEO)
Keyword Density | Relative frequency of a term. | (count / Total Words) × 100 | 1% - 2.5%
Lexical Diversity (TTR) | Variety of vocabulary used. | Unique Types / Total Tokens | > 0.45
Bigram | Two consecutive words. | (w_i, w_{i+1}) | Context dependent
Readability (Auto) | Sentence complexity proxy. | Avg(Words / Sentence) | 15 - 20 words
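
As a worked example of the table's first row, keyword density can be computed as follows; the toy sentence is deliberately repetitive, so its density lands far above the 1% - 2.5% SEO target.

```python
def keyword_density(tokens, term):
    """Relative frequency of a term as a percentage of total words."""
    return tokens.count(term) / len(tokens) * 100 if tokens else 0.0

tokens = ("credit card rewards are great but credit card fees "
          "can offset credit card rewards").split()
print(f"{keyword_density(tokens, 'credit'):.1f}%")  # 3 of 14 words -> 21.4%
```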

Frequently Asked Questions

Why analyze N-grams instead of single keywords?
Search engines moved beyond single keywords years ago. They now look for "Long Tail Keywords" and user intent, which are often phrases. A Bigram (e.g., "credit card") or Trigram (e.g., "best credit card") reveals the specific topic much better than the single word "card". Analyzing these sequences helps you align your content with how users actually search.

What does the "Strict" stop word mode do?
The standard filter removes common function words like "the", "and", or "is". The "Strict" mode activates an expanded database of over 800 words, including adverbs, generic qualifiers (e.g., "actually", "various"), and archaic terms. This is useful when you want to distill a text down to its absolute core nouns and verbs for hard-data analysis.
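
As an illustration of the two filter levels, here is a hedged Python sketch; STANDARD_STOP_WORDS and STRICT_EXTRA are tiny hypothetical stand-ins for the tool's internal lists, which are not reproduced here.

```python
# Hypothetical stand-ins for the tool's internal lists; the real
# "Strict" database ships roughly 800 entries.
STANDARD_STOP_WORDS = {"the", "and", "is", "a", "of"}
STRICT_EXTRA = {"actually", "various", "thou", "hath"}

def filter_tokens(tokens, strict=False):
    """Drop function words; strict mode also drops the expanded list."""
    stop = STANDARD_STOP_WORDS | (STRICT_EXTRA if strict else set())
    return [t for t in tokens if t not in stop]

tokens = "the report is actually quite thorough and covers various cases".split()
print(filter_tokens(tokens))               # keeps "actually" and "various"
print(filter_tokens(tokens, strict=True))  # strict mode removes them too
```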

How are reading and speaking times estimated?
We utilize the industry-standard average reading speed of 238 words per minute (WPM) for silent reading on a screen. The formula is Total Words / 238. For "Speaking Time", we use a slower rate of 130 words per minute, which is typical for presentations or podcasts.
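
In code, both estimates are simple divisions; the rates below are the ones stated above, and the word count is an invented example.

```python
READING_WPM = 238   # silent on-screen reading
SPEAKING_WPM = 130  # presentations and podcasts

def estimated_minutes(total_words, wpm):
    """Time estimate in minutes: Total Words / WPM."""
    return total_words / wpm

words = 1200
print(f"Reading:  {estimated_minutes(words, READING_WPM):.1f} min")   # ~5.0
print(f"Speaking: {estimated_minutes(words, SPEAKING_WPM):.1f} min")  # ~9.2
```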

Does the tool work with non-English text?
Yes. The tokenization engine uses Unicode-aware Regular Expressions. It can correctly count and process Cyrillic, Greek, and accented Latin characters. However, the stemming and stop word databases are currently optimized for English morphology.
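
The pattern below is a plausible minimal version of such a tokenizer, not the tool's actual implementation; in Python 3, \w matches Unicode word characters by default, which is the behavior the answer describes.

```python
import re

# \w is Unicode-aware in Python 3, so this matches Cyrillic, Greek,
# and accented Latin word characters as well as ASCII.
TOKEN_RE = re.compile(r"\w+")

for sample in ["Привет мир", "Καλημέρα κόσμε", "déjà vu"]:
    print(TOKEN_RE.findall(sample.lower()))
# ['привет', 'мир'] / ['καλημέρα', 'κόσμε'] / ['déjà', 'vu']
```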