About

Text tokenization is the foundational step in natural language processing. Getting it wrong means downstream keyword analysis, search indexing, and content optimization all inherit compounded errors. This tool implements the full Porter stemming algorithm to reduce inflected words to their root form (fathers → father, running → run), removes English stopwords using a 175+ word list, generates n-grams up to order 3 from the filtered token stream, and scores each token by term frequency (TF). The tool approximates keyword relevance assuming monolingual English input with standard Latin characters. Non-ASCII and mixed-language texts may produce degraded stemming accuracy.
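
For a rough sense of that order of operations, here is a minimal sketch in TypeScript. The stopword set and the porterStem helper are simplified stand-ins for the tool's 175+ word list and full Porter implementation, not its actual source.

```typescript
// Illustrative pipeline only: STOPWORDS and porterStem are simplified
// stand-ins for the tool's 175+ word list and full Porter stemmer.
const STOPWORDS = new Set(["a", "an", "the", "and", "of", "in", "to", "is", "are"]);

// Lowercase and split on runs of non-letters (assumes Latin-script input).
function tokenize(text: string): string[] {
  return text.toLowerCase().split(/[^a-z]+/).filter(Boolean);
}

// Crude suffix stripping; the real Step 1 rules also clean up doubled
// consonants and restore a trailing "e" where needed.
function porterStem(word: string): string {
  if (word.endsWith("sses")) return word.slice(0, -2);
  if (word.endsWith("ies")) return word.slice(0, -2);
  if (word.endsWith("ing") && word.length > 5) return word.slice(0, -3);
  if (word.endsWith("ed") && word.length > 4) return word.slice(0, -2);
  if (word.endsWith("s")) return word.slice(0, -1);
  return word;
}

// Order of operations: tokenize -> drop stopwords -> stem each survivor,
// keeping both the original and the stemmed form.
function preprocess(text: string): { original: string[]; stemmed: string[] } {
  const filtered = tokenize(text).filter((t) => !STOPWORDS.has(t));
  return { original: filtered, stemmed: filtered.map(porterStem) };
}

// preprocess("The fathers are running in the park")
// -> original: ["fathers", "running", "park"]   (stopwords dropped)
// -> stemmed:  ["father", "runn", "park"]       (the full algorithm's cleanup step yields "run")
```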

Pro tip: Short texts under 50 words produce sparse n-gram sets. For meaningful bigram and trigram extraction, input at least 200 words. The stemmer follows the original 1980 Porter algorithm. It does not handle irregular forms (e.g., went will not stem to go). For production NLP pipelines, validate output against your domain vocabulary.

Tags: text tokenizer, keyword extractor, n-gram generator, stemming, NLP tool, text analysis, stopword removal

Formulas

The term frequency for each token is calculated as the ratio of its occurrences to the total number of tokens in the processed stream:

TF(t, d) = count(t in d) / (total tokens in d)

Where t is a token (word or n-gram) and d is the document (input text). The Porter Stemmer uses a "measure" m defined as the number of vowel-consonant sequences in a word:

m = number of (VC) repetitions in the pattern [C](VC)^m [V]

Where C represents a consonant sequence and V a vowel sequence. Suffix removal rules apply only when the remaining stem has measure m > 0 (or m > 1 for aggressive stripping in Step 4). N-grams of order n are generated by sliding a window of size n across the filtered token array:

ngrams_n = { tokens[i .. i+n−1] for i = 0 to L − n }

Where L is the length of the filtered token list. Both the original and stemmed forms of each n-gram are produced and deduplicated to maximize keyword coverage.
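
To make the windowing and per-class scoring concrete, here is a minimal sketch; the function names are illustrative, not the tool's internals.

```typescript
// Build n-grams of a given order by sliding a window over the filtered tokens.
function ngrams(tokens: string[], n: number): string[] {
  const out: string[] = [];
  for (let i = 0; i + n <= tokens.length; i++) {
    out.push(tokens.slice(i, i + n).join(" "));
  }
  return out;
}

// Score each distinct n-gram by its frequency within its own order:
// TF = count / total n-grams of that order.
function termFrequencies(tokens: string[], n: number): Map<string, number> {
  const grams = ngrams(tokens, n);
  const counts = new Map<string, number>();
  for (const g of grams) counts.set(g, (counts.get(g) ?? 0) + 1);
  const tf = new Map<string, number>();
  for (const [g, c] of counts) tf.set(g, c / grams.length);
  return tf;
}

// Example: termFrequencies(["king", "north", "king", "castle"], 1)
// -> king: 0.5, north: 0.25, castle: 0.25
```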

Reference Data

| Stopword Category | Examples | Count | Impact on Token Reduction |
| --- | --- | --- | --- |
| Articles | a, an, the | 3 | ~5% of English text |
| Prepositions | in, on, at, by, for, with, from, to | 22 | ~8% |
| Conjunctions | and, but, or, nor, yet, so | 7 | ~3% |
| Pronouns | I, you, he, she, it, we, they, me | 28 | ~7% |
| Auxiliary Verbs | is, am, are, was, were, be, been, being | 24 | ~10% |
| Common Adverbs | very, really, just, also, too, quite | 15 | ~3% |
| Demonstratives | this, that, these, those | 4 | ~2% |
| Quantifiers | some, any, many, few, much, all | 12 | ~2% |
| Interrogatives | what, which, who, whom, where, when | 8 | ~1% |
| Modal Verbs | can, could, may, might, will, would, shall | 10 | ~2% |
| Negations | not, no, never, neither, nor | 5 | ~1% |
| Miscellaneous | then, than, own, other, such, only | 37 | ~4% |
| Total Stopwords in Filter | | 175+ | ~40-60% of typical text |

Porter Stemmer Step Reference

| Step | Handles | Example Rules |
| --- | --- | --- |
| Step 1a | Plurals & past participles | sses → ss, ies → i, s → ∅ |
| Step 1b | Past tense & gerunds | eed → ee, ed/ing removal + cleanup |
| Step 1c | Terminal y | y → i (after consonant) |
| Step 2 | Double suffixes | ational → ate, ousness → ous, etc. (20 rules) |
| Step 3 | Final suffixes | icate → ic, alize → al, etc. (7 rules) |
| Step 4 | Long suffixes | ement, tion, ence, etc. removed if m > 1 |
| Step 5 | Final cleanup | Trailing e/ll removal based on measure |
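
For illustration, the measure m used by the steps above can be computed roughly as follows. This sketch always treats y as a consonant, whereas the real algorithm treats y as a vowel when it follows a consonant.

```typescript
// Rough sketch of the Porter "measure" m: collapse the word into alternating
// vowel/consonant runs, then count the VC pairs in [C](VC)^m[V].
function measure(word: string): number {
  const isVowel = (ch: string) => "aeiou".includes(ch);
  let pattern = "";
  for (const ch of word.toLowerCase()) {
    const kind = isVowel(ch) ? "V" : "C";
    // Only record a symbol when the run type changes, e.g. "trouble" -> "CVCV".
    if (pattern[pattern.length - 1] !== kind) pattern += kind;
  }
  return (pattern.match(/VC/g) ?? []).length;
}

// measure("tree") === 0, measure("trouble") === 1,
// measure("oaten") === 2, measure("private") === 2
```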

N-gram Types

| Type | Description | Characteristics |
| --- | --- | --- |
| Unigram (n=1) | Single keyword | Highest recall, lowest specificity |
| Bigram (n=2) | Two-word phrase | Good balance of precision and recall |
| Trigram (n=3) | Three-word phrase | High precision, lower frequency |

Frequently Asked Questions

Why doesn't "went" stem to "go"?
The Porter Stemmer is a rule-based suffix-stripping algorithm. It does not use a dictionary lookup, so irregular forms like "went" will not map to "go", and "better" will not reduce to "good". It handles regular morphological patterns only (plurals, gerunds, past tense, derivational suffixes). For irregular forms, a lemmatizer with a full morphological dictionary would be required, which is beyond the scope of client-side tokenization.

Why does the output include both the original word and its stem?
The tokenizer intentionally outputs both forms to maximize keyword coverage. For example, "fathers" produces both "fathers" (original) and "father" (stemmed). When used for SEO or search indexing, having both variants ensures you capture exact-match and root-match queries. Duplicates within the same n-gram class are removed, but cross-form duplicates are preserved.
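
A minimal sketch of that deduplication behaviour, with illustrative names:

```typescript
// Dedupe within the original list and within the stemmed list, but never
// collapse an original form into its stem.
function dualKeywords(original: string[], stemmed: string[]) {
  return {
    original: [...new Set(original)], // e.g. "fathers" stays
    stemmed: [...new Set(stemmed)],   // e.g. "father" is listed alongside it
  };
}

// dualKeywords(["fathers", "fathers"], ["father", "father"])
// -> { original: ["fathers"], stemmed: ["father"] }
```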

How much text do I need for useful n-grams?
After stopword removal, a typical English text retains approximately 40-60% of its original word count. For trigrams to be statistically useful, you need at least 3 content words in sequence. In practice, input texts under 50 words will often produce fewer than 5 trigrams. For robust n-gram analysis, 200+ words is recommended. The tool will still function on shorter texts but will display a notice about limited n-gram output.

Are stopwords removed before n-grams are generated?
Yes. Stopwords are removed before n-gram generation, not after. This means a phrase like "the king of the north" becomes ["king", "north"], producing only the bigram "king north" rather than "the king", "king of", "of the", "the north". This is by design: content-bearing n-grams are more useful for keyword extraction. If you need stopword-inclusive phrases, disable the stopword filter using the toggle.
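
For instance, under the default filter the phrase from this answer behaves roughly like this (illustrative snippet, not the tool's code):

```typescript
// Stopwords are removed before the window slides, so function words never
// appear inside any generated n-gram.
const STOPWORDS = new Set(["the", "of"]);
const tokens = "the king of the north"
  .split(" ")
  .filter((t) => !STOPWORDS.has(t));        // ["king", "north"]

const bigrams = tokens
  .slice(0, -1)
  .map((t, i) => `${t} ${tokens[i + 1]}`);  // ["king north"]
```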

How is term frequency calculated when unigrams, bigrams, and trigrams are shown together?
TF is calculated independently within each n-gram class. A unigram's frequency is its count divided by total unigrams. A bigram's frequency is its count divided by total bigrams. This prevents unigrams (which naturally have higher counts) from dominating the frequency ranking when all token types are displayed together. The percentage shown on each chip reflects its relative frequency within its own n-gram class.

Does the tool work on non-English text?
The tokenizer will split and count tokens in any Latin-script language, but the Porter Stemmer and stopword list are English-only. For languages like French or German, stemming will produce incorrect roots (e.g., French plural "-s" removal works coincidentally, but verb conjugations will not be handled). For non-Latin scripts (Chinese, Arabic, Cyrillic), word boundary detection based on whitespace will fail for languages without spaces between words.