About

Text fuzziness is the deliberate introduction of controlled noise into a string. Applications range from adversarial testing of NLP pipelines and spam-filter evasion analysis to data augmentation for training robust OCR and text-classification models. A misapplied fuzziness level can render test data useless or produce artifacts that don't reflect real-world corruption patterns. This tool applies four distinct distortion algorithms: homoglyph substitution using Unicode confusables from the Latin, Cyrillic, and Greek blocks; stochastic typo injection based on a QWERTY adjacency matrix; Zalgo diacritical stacking via combining characters in the range U+0300 - U+036F; and weighted leetspeak mapping. Each is governed by a fuzziness intensity f ∈ [0, 1]. Note: homoglyph output may appear identical to the original in certain fonts but differs at the codepoint level. Results depend on the rendering engine and typeface.

Formulas

Each character in the input string is independently subjected to a distortion decision. The probability of any character being altered is governed by the fuzziness parameter:

P(alter) = f × w_mode

where f ∈ [0, 1] is the fuzziness intensity from the slider, and w_mode is the mode-specific weight (homoglyph: 0.9, typo: 0.4, Zalgo: 0.7, leet: 0.8). For each character c_i, a uniform random value r ∈ [0, 1) is drawn:

c′_i = distort(c_i)   if r < P(alter)
c′_i = c_i            otherwise

where c′_i is the output character at position i.
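The per-character decision described above can be sketched in a few lines of Python. This is a minimal illustration, not the tool's own code: the names `MODE_WEIGHTS`, `alter_probability`, and `fuzzify` are hypothetical, the `distort` callback stands in for whichever mode-specific transformation applies, and only the weight values are taken from the text.

```python
import random

# Mode weights as listed in the text; the dictionary name is hypothetical.
MODE_WEIGHTS = {"homoglyph": 0.9, "typo": 0.4, "zalgo": 0.7, "leet": 0.8}

def alter_probability(f, mode):
    """Return P(alter) = f * w_mode for an intensity f in [0, 1]."""
    if not 0.0 <= f <= 1.0:
        raise ValueError("f must lie in [0, 1]")
    return f * MODE_WEIGHTS[mode]

def fuzzify(text, f, mode, distort, rng=None):
    """Apply distort(c) to each character independently with probability P(alter)."""
    if rng is None:
        rng = random.Random()
    p = alter_probability(f, mode)
    # rng.random() draws r uniformly from [0, 1); the character is altered when r < p.
    return "".join(distort(c) if rng.random() < p else c for c in text)

# Example: upper-casing stands in for a real distortion function.
print(fuzzify("controlled noise", 0.5, "leet", str.upper, random.Random(42)))
```

Passing a seeded `random.Random` instance, as in the last line, makes a run repeatable while experimenting with the rule.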

For Zalgo mode, the number of combining marks n stacked on each affected character scales linearly with intensity:

n_marks = floor(f × 15) + 1

At maximum intensity (f = 1) this yields 16 combining characters per affected base glyph. For typo mode, the error type is selected uniformly from the set {swap, duplicate, omit, neighbor}. Neighbor-key substitution uses a precomputed QWERTY adjacency lookup covering the 26 letter keys, with about 4.5 neighbors per key on average.
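As a rough illustration of the Zalgo formula, the following Python sketch computes n_marks and appends that many marks drawn from U+0300 - U+036F. Drawing marks uniformly at random from the whole range is an assumption, since the text does not say how individual marks are chosen, and the function names are hypothetical.

```python
import math
import random

# All codepoints in the Combining Diacritical Marks block, U+0300..U+036F.
COMBINING_MARKS = [chr(cp) for cp in range(0x0300, 0x0370)]

def zalgo_mark_count(f):
    """n_marks = floor(f * 15) + 1, as given in the text."""
    return math.floor(f * 15) + 1

def zalgo_char(c, f, rng):
    """Return c followed by n_marks randomly chosen combining marks (assumed uniform)."""
    marks = (rng.choice(COMBINING_MARKS) for _ in range(zalgo_mark_count(f)))
    return c + "".join(marks)

rng = random.Random(7)
for f in (0.0, 0.5, 1.0):
    stacked = zalgo_char("a", f, rng)
    # len() counts codepoints: one base character plus the stacked marks.
    print(f, zalgo_mark_count(f), len(stacked))
```

The mark count depends only on f, so every affected character in a run carries the same number of marks; only the choice of marks varies.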

Reference Data

Mode	Technique	Unicode Range / Mechanism	Detectability	Use Case
Homoglyph	Visual lookalike substitution	Cyrillic (U+0400 - U+04FF), Greek (U+0370 - U+03FF)	Low (visually identical)	Phishing research, filter bypass testing
Typo	Adjacent-key swap, duplication, omission	QWERTY keyboard adjacency map	Medium (human-readable errors)	NLP robustness testing, data augmentation
Zalgo	Combining diacritical marks stacking	U+0300 - U+036F (Combining Diacriticals)	High (visually chaotic)	Artistic text, stress testing renderers
Leetspeak	Alpha → numeric/symbol replacement	ASCII substitution dictionary	Medium (recognizable pattern)	Gaming culture, basic obfuscation
Mixed	Weighted blend of all four modes	All of the above	Variable	Comprehensive fuzzing
Common Homoglyph Pairs (Latin → Cyrillic/Greek)
a	а (Cyrillic Small A)	U+0430	Visually identical	Confusable substitution
e	е (Cyrillic Small Ie)	U+0435	Visually identical	Confusable substitution
o	ο (Greek Small Omicron)	U+03BF	Visually identical	Confusable substitution
p	р (Cyrillic Small Er)	U+0440	Visually identical	Confusable substitution
c	с (Cyrillic Small Es)	U+0441	Visually identical	Confusable substitution
x	х (Cyrillic Small Kha)	U+0445	Visually identical	Confusable substitution
y	у (Cyrillic Small U)	U+0443	Visually identical	Confusable substitution
s	ѕ (Cyrillic Small Dze)	U+0455	Visually identical	Confusable substitution
i	і (Cyrillic Small Byelorussian-Ukrainian I)	U+0456	Visually identical	Confusable substitution
H	Н (Cyrillic Capital En)	U+041D	Visually identical	Confusable substitution
T	Т (Cyrillic Capital Te)	U+0422	Visually identical	Confusable substitution
B	В (Cyrillic Capital Ve)	U+0412	Visually identical	Confusable substitution
Combining Diacritical Marks (Zalgo)
Above	U+0300 - U+0315	Grave, Acute, Circumflex, Tilde, etc.	22 marks	Stack above glyphs
Below	U+0316 - U+0333	Grave below, Cedilla, etc.	30 marks	Stack below glyphs
Overlay	U+0334 - U+0338	Tilde overlay, Stroke, etc.	5 marks	Strike-through effect
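
The homoglyph pairs in the table translate directly into a lookup dictionary. The Python sketch below substitutes every character that has an entry and ignores the probability step; `HOMOGLYPHS` and `substitute_all` are illustrative names, and the tool's actual confusables dictionary may contain more entries than the twelve listed here.

```python
# Latin -> Cyrillic/Greek lookalikes, copied from the table above.
HOMOGLYPHS = {
    "a": "\u0430", "e": "\u0435", "o": "\u03bf", "p": "\u0440",
    "c": "\u0441", "x": "\u0445", "y": "\u0443", "s": "\u0455",
    "i": "\u0456", "H": "\u041d", "T": "\u0422", "B": "\u0412",
}

def substitute_all(text):
    """Replace every character that has a table entry; leave the rest untouched."""
    return "".join(HOMOGLYPHS.get(c, c) for c in text)

sample = "Tax scope"
fuzzed = substitute_all(sample)
# Print codepoints side by side to show where the strings diverge.
for before, after in zip(sample, fuzzed):
    print(f"U+{ord(before):04X} -> U+{ord(after):04X}")
```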

Frequently Asked Questions

If homoglyph output looks identical to the input, what actually changes?

Homoglyph substitution replaces ASCII characters (1 byte each in UTF-8) with visually identical Unicode codepoints from the Cyrillic or Greek blocks, each of which encodes as 2 bytes in UTF-8. The string appears unchanged to the human eye, but exact string comparison, hashing, and regex matching against the original will no longer succeed. This is the core mechanism behind internationalized domain name (IDN) homograph attacks studied in security research.
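
A short Python check makes the codepoint-level difference concrete. The example string is arbitrary and the helper name `sha256_hex` is hypothetical; only standard-library calls are used.

```python
import hashlib
import re

original = "paypal"
spoofed = "p\u0430yp\u0430l"  # both Latin 'a' replaced by Cyrillic U+0430

def sha256_hex(s):
    """Hash the UTF-8 encoding of s."""
    return hashlib.sha256(s.encode("utf-8")).hexdigest()

print(original == spoofed)                           # False
print(len(original), len(spoofed))                   # 6 6
print(len(original.encode("utf-8")),
      len(spoofed.encode("utf-8")))                  # 6 8
print(sha256_hex(original) == sha256_hex(spoofed))   # False
print(re.fullmatch(r"paypal", spoofed))              # None
```

Both strings have six codepoints, but the two substituted characters take two bytes each, so the encoded length grows from 6 to 8 bytes.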

How many characters are altered at a given intensity?

At intensity f = 0.5, each eligible character is altered with probability 0.5 × w, where w is the mode weight. For a 1000-character corpus, expect roughly 350-450 altered characters in homoglyph mode (w = 0.9, so 0.45 per eligible character, with the lower end reflecting characters that have no confusable) versus ~200 in typo mode (w = 0.4, so 0.2 per character). For statistically meaningful augmentation, generate multiple variants at different intensity levels and compare model accuracy degradation curves.
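
The arithmetic behind these figures can be checked with a small Python sketch. It assumes every one of the 1000 characters is eligible and that decisions are independent, so the altered count follows a binomial distribution; the function name is hypothetical.

```python
import math

def altered_count_stats(n_eligible, f, w):
    """Mean and standard deviation of the number of altered characters."""
    p = f * w                                   # P(alter) per eligible character
    mean = n_eligible * p
    std = math.sqrt(n_eligible * p * (1 - p))   # binomial standard deviation
    return mean, std

for mode, w in (("homoglyph", 0.9), ("typo", 0.4)):
    mean, std = altered_count_stats(1000, 0.5, w)
    print(f"{mode}: mean {mean:.0f}, std {std:.1f}")
# homoglyph: mean 450, std 15.7
# typo: mean 200, std 12.6
```

With fewer eligible characters, which is typical for homoglyph mode, the mean drops proportionally.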

Can Zalgo text break rendering or page layout?

Yes. At maximum intensity, each affected character carries up to 16 combining diacritical marks, which expands vertical glyph bounds dramatically. Some older browsers and terminal emulators may clip, overlap, or crash when rendering deeply stacked combiners. Modern browsers handle them gracefully, but line-height calculations become unreliable. Avoid pasting high-intensity Zalgo into fixed-height UI elements in production.

How realistic are the generated typos?

The typo algorithm uses a QWERTY physical adjacency map, so substituted characters are spatially near the original key. This produces errors consistent with motor-control slip patterns observed in human typing studies. However, it does not model phonetic confusion (e.g., "their"/"there") or autocorrect artifacts. Phonetic error simulation would require a phoneme-mapping layer, which this tool does not implement.
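
The four typo operations and the adjacency lookup can be sketched in Python as follows. The adjacency map here is derived from a simplified staggered three-row layout rather than copied from the tool, whose precomputed table is not shown, so neighbor sets and their average size may differ. All names are hypothetical.

```python
import random

ROWS = ["qwertyuiop", "asdfghjkl", "zxcvbnm"]
TYPO_KINDS = ("swap", "duplicate", "omit", "neighbor")

def build_adjacency(rows=ROWS):
    """Assumed geometry: a key touches its row neighbours and, in the row
    below, the keys at columns i-1 and i (a rough model of the stagger)."""
    adj = {ch: set() for row in rows for ch in row}
    for r, row in enumerate(rows):
        for i, ch in enumerate(row):
            for j in (i - 1, i + 1):
                if 0 <= j < len(row):
                    adj[ch].add(row[j])
            if r + 1 < len(rows):
                for j in (i - 1, i):
                    if 0 <= j < len(rows[r + 1]):
                        other = rows[r + 1][j]
                        adj[ch].add(other)
                        adj[other].add(ch)  # keep the relation symmetric
    return {ch: sorted(ns) for ch, ns in adj.items()}

ADJACENCY = build_adjacency()

def apply_typo(chars, i, kind, rng):
    """Apply one typo of the given kind at index i and return a new list."""
    c = chars[i]
    if kind == "swap" and i + 1 < len(chars):
        return chars[:i] + [chars[i + 1], c] + chars[i + 2:]
    if kind == "duplicate":
        return chars[:i] + [c, c] + chars[i + 1:]
    if kind == "omit":
        return chars[:i] + chars[i + 1:]
    if kind == "neighbor" and c.lower() in ADJACENCY:
        n = rng.choice(ADJACENCY[c.lower()])
        return chars[:i] + [n.upper() if c.isupper() else n] + chars[i + 1:]
    return list(chars)  # nothing applicable (e.g. swap at the last index)

def random_typo(text, i, rng):
    """Pick the error type uniformly from the four kinds, as the text describes."""
    return "".join(apply_typo(list(text), i, rng.choice(TYPO_KINDS), rng))

print(ADJACENCY["g"])  # ['b', 'f', 'h', 't', 'v', 'y']
print(random_typo("fuzziness", 2, random.Random(3)))
```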

Is the output reproducible?

By default, the tool uses Math.random(), which is non-deterministic and cannot be seeded. Each "Fuzzify" click therefore produces a different output even with identical input and settings. If you need stable data for test suites, run the tool several times, select the variant that best represents your target error distribution, and store that output as a fixture rather than regenerating it. The state (input text, mode, intensity) is persisted to localStorage for session continuity.
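
For comparison only: if the decision step is reimplemented in a script outside the tool, a seeded generator makes the set of altered positions repeatable. The Python sketch below uses the standard `random.Random` class; it does not describe the tool's behavior, and `fuzz_mask` is a hypothetical helper.

```python
import random

def fuzz_mask(n_chars, p, seed):
    """Return one True/False alteration decision per character for a given seed."""
    rng = random.Random(seed)  # same seed -> same sequence of draws
    return [rng.random() < p for _ in range(n_chars)]

first = fuzz_mask(20, 0.45, seed=123)
second = fuzz_mask(20, 0.45, seed=123)
print(first == second)  # True
```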

Which characters are left unchanged?

Whitespace characters (space, tab, newline), digits in non-leet modes, and punctuation marks without known homoglyphs are preserved. In homoglyph mode, only characters with entries in the confusables dictionary are eligible for substitution. Characters outside the Basic Latin block (U+0000 - U+007F) pass through unaltered in all modes to avoid double-encoding.