Text to Keywords Tokenizer
Tokenize any text into keywords with stemming, stopword removal, and n-gram generation. Extract unigrams, bigrams, and trigrams with frequency scoring.
About
Text tokenization is the foundational step in natural language processing. Getting it wrong means downstream keyword analysis, search indexing, and content optimization all inherit compounded errors. This tool implements the full Porter Stemmer algorithm to reduce inflected words to their root form (fathers → father, running → run), removes English stopwords using a 175+ word list, generates n-grams up to order 3 from the filtered token stream, and scores each token by term frequency (TF). The tool approximates keyword relevance assuming monolingual English input with standard Latin characters; non-ASCII and mixed-language texts may produce degraded stemming accuracy.
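A minimal sketch of that pipeline in Python, assuming NLTK's PorterStemmer as a stand-in for the tool's built-in stemmer; `extract_keywords` and `STOPWORDS` are illustrative names, and the stopword set is abbreviated here (the tool ships 175+):

```python
import re
from collections import Counter
from nltk.stem import PorterStemmer  # stand-in for the tool's built-in Porter implementation

# Abbreviated for illustration only; the tool's filter contains 175+ stopwords.
STOPWORDS = {"a", "an", "the", "and", "or", "is", "are", "was", "to", "of",
             "in", "on", "for", "with", "it", "this", "that", "be", "by", "at"}

def extract_keywords(text: str, max_n: int = 3) -> dict[str, float]:
    """Tokenize, drop stopwords, stem, build n-grams up to max_n, score by TF."""
    stemmer = PorterStemmer()
    words = re.findall(r"[a-z']+", text.lower())             # Latin-alphabet tokenization
    stems = [stemmer.stem(w) for w in words if w not in STOPWORDS]

    grams: list[str] = []
    for n in range(1, max_n + 1):                            # unigrams, bigrams, trigrams
        grams += [" ".join(stems[i:i + n]) for i in range(len(stems) - n + 1)]

    counts = Counter(grams)
    total = len(stems) or 1                                   # avoid division by zero
    return {g: c / total for g, c in counts.items()}          # term-frequency scores
```

Sorting the returned mapping by score approximates the tool's frequency ranking; the tool itself additionally keeps the original (unstemmed) form of each n-gram.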
Pro tip: Short texts under 50 words produce sparse n-gram sets. For meaningful bigram and trigram extraction, input at least 200 words. The stemmer follows the original 1980 Porter algorithm. It does not handle irregular forms (e.g., went will not stem to go). For production NLP pipelines, validate output against your domain vocabulary.
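For example, using NLTK's PorterStemmer as a reference implementation (an assumption; the tool bundles its own), regular suffixes are stripped while irregular forms pass through unchanged:

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
print(stemmer.stem("running"))   # "run"    (ing removed, double consonant cleaned up)
print(stemmer.stem("fathers"))   # "father" (plural s removed)
print(stemmer.stem("went"))      # "went"   (irregular past tense is not mapped to "go")
```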
Formulas
The term frequency for each token is calculated as the ratio of its occurrences to the total number of tokens in the processed stream:

$$\mathrm{tf}(t, d) = \frac{\mathrm{count}(t, d)}{N_d}$$
Where t is a token (word or n-gram), d is the document (input text), and N_d is the total token count. The Porter Stemmer uses a "measure" m, defined as the number of vowel-consonant sequences in a word; every word can be written in the form:

$$[C](VC)^{m}[V]$$
Where C represents a consonant sequence, V a vowel sequence, and the bracketed parts are optional. Suffix removal rules apply only when the remaining stem has measure m > 0 (or m > 1 for aggressive stripping in Step 4). N-grams of order n are generated by sliding a window of size n across the filtered token array:

$$g_i = (w_i, w_{i+1}, \ldots, w_{i+n-1}), \qquad 1 \le i \le L - n + 1$$
Where L is the length of the filtered token list. Both the original and stemmed forms of each n-gram are produced and deduplicated to maximize keyword coverage.
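A short sketch of these two formulas in Python; `porter_measure` and `ngrams` are illustrative names rather than part of the tool's API, and the measure computation follows the 1980 definition (y counts as a vowel only when it follows a consonant):

```python
def porter_measure(word: str) -> int:
    """Count vowel-consonant (VC) sequences: the Porter measure m in [C](VC)^m[V]."""
    pattern = []
    for i, ch in enumerate(word.lower()):
        # 'y' is treated as a vowel only when preceded by a consonant (1980 rule).
        is_vowel = ch in "aeiou" or (ch == "y" and i > 0 and pattern[-1] == "C")
        pattern.append("V" if is_vowel else "C")
    return "".join(pattern).count("VC")

def ngrams(tokens: list[str], n: int) -> list[str]:
    """Slide a window of size n across the token list, yielding L - n + 1 grams."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

print(porter_measure("tree"))     # m = 0 -> rules requiring m > 0 do not fire
print(porter_measure("trouble"))  # m = 1
print(porter_measure("oaten"))    # m = 2
print(ngrams(["keyword", "extract", "tool"], 2))  # ['keyword extract', 'extract tool']
```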
Reference Data
| Stopword Category | Examples | Count | Impact on Token Reduction |
|---|---|---|---|
| Articles | a, an, the | 3 | ~5% of English text |
| Prepositions | in, on, at, by, for, with, from, to | 22 | ~8% |
| Conjunctions | and, but, or, nor, yet, so | 7 | ~3% |
| Pronouns | I, you, he, she, it, we, they, me | 28 | ~7% |
| Auxiliary Verbs | is, am, are, was, were, be, been, being | 24 | ~10% |
| Common Adverbs | very, really, just, also, too, quite | 15 | ~3% |
| Demonstratives | this, that, these, those | 4 | ~2% |
| Quantifiers | some, any, many, few, much, all | 12 | ~2% |
| Interrogatives | what, which, who, whom, where, when | 8 | ~1% |
| Modal Verbs | can, could, may, might, will, would, shall | 10 | ~2% |
| Negations | not, no, never, neither, nor | 5 | ~1% |
| Miscellaneous | then, than, own, other, such, only | 37 | ~4% |
| Total Stopwords in Filter | | 175+ | ~40-60% of typical text |
Porter Stemmer Step Reference

| Step | Target | Example Rules |
|---|---|---|
| Step 1a | Plurals | sses → ss, ies → i, s → ∅ |
| Step 1b | Past tense & gerunds | eed → ee, ed/ing removal + cleanup |
| Step 1c | Terminal y | y → i (after consonant) |
| Step 2 | Double suffixes | ational → ate, ousness → ous, etc. (20 rules) |
| Step 3 | Final suffixes | icate → ic, alize → al, etc. (7 rules) |
| Step 4 | Long suffixes | ement, tion, ence, etc. removed if m > 1 |
| Step 5 | Final cleanup | Trailing e removal and ll → l, based on measure |
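For reference, a few classic example words run through NLTK's PorterStemmer (used here purely as a stand-in; NLTK adds minor extensions to the 1980 rules, but these cases match the steps above):

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
print(stemmer.stem("caresses"))    # "caress"  -> Step 1a: sses -> ss
print(stemmer.stem("plastered"))   # "plaster" -> Step 1b: ed removed (m > 0)
print(stemmer.stem("happy"))       # "happi"   -> Step 1c: terminal y -> i
print(stemmer.stem("adjustment"))  # "adjust"  -> Step 4: ment removed (m > 1)
```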
N-gram Types

| N-gram | Description | Characteristics |
|---|---|---|
| Unigram (n=1) | Single keyword | Highest recall, lowest specificity |
| Bigram (n=2) | Two-word phrase | Good balance of precision and recall |
| Trigram (n=3) | Three-word phrase | High precision, lower frequency |