Text to Keywords Tokenizer
Tokenize any text into keywords with stemming, stopword removal, and n-gram generation. Extract unigrams, bigrams, and trigrams with frequency scoring.
About
Text tokenization is the foundational step in natural language processing. Getting it wrong means downstream keyword analysis, search indexing, and content optimization all inherit compounded errors. This tool implements the full Porter Stemmer algorithm to reduce inflected words to their root form (fathers → father, running → run), removes English stopwords using a 175+ word list, generates n-grams up to order 3 from the filtered token stream, and scores each token by term frequency (TF). The tool approximates keyword relevance assuming monolingual English input with standard Latin characters; non-ASCII and mixed-language texts may produce degraded stemming accuracy.
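A minimal sketch of that pipeline in Python, assuming NLTK's PorterStemmer as a stand-in for the tool's built-in stemmer; `extract_keywords` and `STOPWORDS` are illustrative names, and the stopword set is abbreviated here (the tool ships 175+):

```python
import re
from collections import Counter
from nltk.stem import PorterStemmer  # stand-in for the tool's built-in Porter implementation

# Abbreviated for illustration only; the tool's filter contains 175+ stopwords.
STOPWORDS = {"a", "an", "the", "and", "or", "is", "are", "was", "to", "of",
             "in", "on", "for", "with", "it", "this", "that", "be", "by", "at"}

def extract_keywords(text: str, max_n: int = 3) -> dict[str, float]:
    """Tokenize, drop stopwords, stem, build n-grams up to max_n, score by TF."""
    stemmer = PorterStemmer()
    words = re.findall(r"[a-z']+", text.lower())             # Latin-alphabet tokenization
    stems = [stemmer.stem(w) for w in words if w not in STOPWORDS]

    grams: list[str] = []
    for n in range(1, max_n + 1):                            # unigrams, bigrams, trigrams
        grams += [" ".join(stems[i:i + n]) for i in range(len(stems) - n + 1)]

    counts = Counter(grams)
    total = len(stems) or 1                                   # avoid division by zero
    return {g: c / total for g, c in counts.items()}          # term-frequency scores
```

Sorting the returned mapping by score approximates the tool's frequency ranking; the tool itself additionally keeps the original (unstemmed) form of each n-gram.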
Pro tip: Short texts under 50 words produce sparse n-gram sets. For meaningful bigram and trigram extraction, input at least 200 words. The stemmer follows the original 1980 Porter algorithm. It does not handle irregular forms (e.g., went will not stem to go). For production NLP pipelines, validate output against your domain vocabulary.
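For example, using NLTK's PorterStemmer as a reference implementation (an assumption; the tool bundles its own), regular suffixes are stripped while irregular forms pass through unchanged:

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
print(stemmer.stem("running"))   # "run"    (ing removed, double consonant cleaned up)
print(stemmer.stem("fathers"))   # "father" (plural s removed)
print(stemmer.stem("went"))      # "went"   (irregular past tense is not mapped to "go")
```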
Formulas
The term frequency for each token is calculated as the ratio of its occurrences to the total number of tokens in the processed stream:

$$\mathrm{tf}(t, d) = \frac{\mathrm{count}(t, d)}{N_d}$$
Where t is a token (word or n-gram), d is the document (input text), and N_d is the total token count. The Porter Stemmer uses a "measure" m, defined as the number of vowel-consonant sequences in a word; every word can be written in the form:

$$[C](VC)^{m}[V]$$
Where C represents a consonant sequence, V a vowel sequence, and the bracketed parts are optional. Suffix removal rules apply only when the remaining stem has measure m > 0 (or m > 1 for aggressive stripping in Step 4). N-grams of order n are generated by sliding a window of size n across the filtered token array:

$$g_i = (w_i, w_{i+1}, \ldots, w_{i+n-1}), \qquad 1 \le i \le L - n + 1$$
Where L is the length of the filtered token list. Both the original and stemmed forms of each n-gram are produced and deduplicated to maximize keyword coverage.
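A short sketch of these two formulas in Python; `porter_measure` and `ngrams` are illustrative names rather than part of the tool's API, and the measure computation follows the 1980 definition (y counts as a vowel only when it follows a consonant):

```python
def porter_measure(word: str) -> int:
    """Count vowel-consonant (VC) sequences: the Porter measure m in [C](VC)^m[V]."""
    pattern = []
    for i, ch in enumerate(word.lower()):
        # 'y' is treated as a vowel only when preceded by a consonant (1980 rule).
        is_vowel = ch in "aeiou" or (ch == "y" and i > 0 and pattern[-1] == "C")
        pattern.append("V" if is_vowel else "C")
    return "".join(pattern).count("VC")

def ngrams(tokens: list[str], n: int) -> list[str]:
    """Slide a window of size n across the token list, yielding L - n + 1 grams."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

print(porter_measure("tree"))     # m = 0 -> rules requiring m > 0 do not fire
print(porter_measure("trouble"))  # m = 1
print(porter_measure("oaten"))    # m = 2
print(ngrams(["keyword", "extract", "tool"], 2))  # ['keyword extract', 'extract tool']
```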
Reference Data
| Stopword Category | Examples | Count | Impact on Token Reduction |
|---|---|---|---|
| Articles | a, an, the | 3 | ~5% of English text |
| Prepositions | in, on, at, by, for, with, from, to | 22 | ~8% |
| Conjunctions | and, but, or, nor, yet, so | 7 | ~3% |
| Pronouns | I, you, he, she, it, we, they, me | 28 | ~7% |
| Auxiliary Verbs | is, am, are, was, were, be, been, being | 24 | ~10% |
| Common Adverbs | very, really, just, also, too, quite | 15 | ~3% |
| Demonstratives | this, that, these, those | 4 | ~2% |
| Quantifiers | some, any, many, few, much, all | 12 | ~2% |
| Interrogatives | what, which, who, whom, where, when | 8 | ~1% |
| Modal Verbs | can, could, may, might, will, would, shall | 10 | ~2% |
| Negations | not, no, never, neither, nor | 5 | ~1% |
| Miscellaneous | then, than, own, other, such, only | 37 | ~4% |
| Total Stopwords in Filter | | 175+ | ~40-60% of typical text |
Porter Stemmer Step Reference

| Step | Target | Example Rules |
|---|---|---|
| Step 1a | Plurals | sses → ss, ies → i, s → ∅ |
| Step 1b | Past tense & gerunds | eed → ee, ed/ing removal + cleanup |
| Step 1c | Terminal y | y → i (after consonant) |
| Step 2 | Double suffixes | ational → ate, ousness → ous, etc. (20 rules) |
| Step 3 | Final suffixes | icate → ic, alize → al, etc. (7 rules) |
| Step 4 | Long suffixes | ement, tion, ence, etc. removed if m > 1 |
| Step 5 | Final cleanup | Trailing e removal and ll → l, based on measure |
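For reference, a few classic example words run through NLTK's PorterStemmer (used here purely as a stand-in; NLTK adds minor extensions to the 1980 rules, but these cases match the steps above):

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
print(stemmer.stem("caresses"))    # "caress"  -> Step 1a: sses -> ss
print(stemmer.stem("plastered"))   # "plaster" -> Step 1b: ed removed (m > 0)
print(stemmer.stem("happy"))       # "happi"   -> Step 1c: terminal y -> i
print(stemmer.stem("adjustment"))  # "adjust"  -> Step 4: ment removed (m > 1)
```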
N-gram Types

| N-gram | Description | Characteristics |
|---|---|---|
| Unigram (n=1) | Single keyword | Highest recall, lowest specificity |
| Bigram (n=2) | Two-word phrase | Good balance of precision and recall |
| Trigram (n=3) | Three-word phrase | High precision, lower frequency |