Combining Character Remover
Remove combining Unicode characters, diacritical marks, and Zalgo text artifacts from any string. Clean text instantly in your browser.
About
Combining characters (Unicode General Category M) are non-spacing, spacing, or enclosing marks that attach to the preceding base character to form a composed glyph. Diacritical marks such as the acute accent (U+0301), cedilla (U+0327), and tilde (U+0303) fall into this class. When stacked excessively, they produce "Zalgo" text that can overflow layouts, defeat search indexing, and confuse screen readers. This tool applies NFD decomposition via normalize("NFD") to split base characters from their marks, then strips every code point matching the Unicode property \p{M}. The result is text without combining marks, which is easier to store in databases, use in filenames, and display consistently across platforms.
Limitations: removal of combining marks from scripts where they carry semantic meaning (e.g., Arabic vowel marks, Devanagari matras) will alter pronunciation and meaning. The tool flags spacing combining marks (Mc) separately so you can preserve them if needed. Processing is entirely client-side. No text leaves your browser.
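The option to preserve spacing marks could look roughly like the sketch below. The function name stripMarks and the option name preserveSpacing are hypothetical and chosen for illustration only; they are not taken from the tool's actual source.

```javascript
// Hypothetical helper: strip combining marks, optionally keeping spacing marks (Mc).
// `preserveSpacing` is an illustrative option name, not the tool's real setting.
function stripMarks(text, { preserveSpacing = false } = {}) {
  // With preserveSpacing, only non-spacing (Mn) and enclosing (Me) marks are removed.
  const pattern = preserveSpacing ? /[\p{Mn}\p{Me}]/gu : /\p{M}/gu;
  return text.normalize("NFD").replace(pattern, "");
}

// Devanagari KA (U+0915) + vowel sign I (U+093F, Mc) + anusvara (U+0902, Mn)
const sample = "\u0915\u093F\u0902";

console.log(stripMarks(sample) === "\u0915");                                  // true
console.log(stripMarks(sample, { preserveSpacing: true }) === "\u0915\u093F"); // true
```

Keeping Mc marks retains the Indic vowel sign in this sample while still dropping the non-spacing anusvara, so the text is still altered; the option reduces the damage rather than eliminating it.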
Formulas
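The removal pipeline applies two Unicode operations in sequence. First, canonical decomposition (NFD) separates each composite character into a base character followed by its combining mark sequence; for example, é (U+00E9) decomposes to e (U+0065) followed by the combining acute accent (U+0301).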
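A minimal sketch of the decomposition step, using the standard String.prototype.normalize method. The helper codePoints is a hypothetical name introduced here only to make the result visible.

```javascript
// Hypothetical helper: list the code points of a string as U+XXXX labels.
function codePoints(text) {
  return Array.from(text, (ch) =>
    "U+" + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, "0")
  );
}

const composed = "\u00E9";                     // é as one precomposed code point
const decomposed = composed.normalize("NFD");  // canonical decomposition

console.log(codePoints(composed).join(" "));   // U+00E9
console.log(codePoints(decomposed).join(" ")); // U+0065 U+0301
```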
After decomposition, a regular expression targeting all Unicode Mark categories, \p{M} used with the Unicode (u) flag, strips the combining code points.
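Put together, the two steps might be sketched as follows. The function name removeCombiningMarks is an assumption made for this example; only normalize("NFD") and \p{M} come from the description above.

```javascript
// Sketch of the two-step pipeline: decompose, then strip every Mark code point.
// The function name is hypothetical.
function removeCombiningMarks(text) {
  return text
    .normalize("NFD")        // split base characters from their marks
    .replace(/\p{M}/gu, ""); // remove Mn, Mc, and Me code points
}

console.log(removeCombiningMarks("Cr\u00E8me br\u00FBl\u00E9e"));       // Creme brulee
console.log(removeCombiningMarks("Z\u0351\u034B\u0300a\u0301\u0302")); // Za
```

One design point worth considering: NFD also decomposes Hangul syllables into jamo, which are letters rather than marks and therefore stay decomposed. A caller that needs composed output could call normalize("NFC") on the result.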
The Unicode General Category M (Mark) is the union of three sub-categories, which allow granular control: M = Mn ∪ Mc ∪ Me.
Here Mn = non-spacing marks (diacritics, tone marks), Mc = spacing combining marks (vowel signs in Indic scripts), and Me = enclosing marks (circles, squares around base characters). The Zalgo intensity Z(b) of a base character b is the number of consecutive combining characters (code points in category M) that immediately follow b.
Text is flagged as Zalgo when Z(b) > 3 for any base character b.
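The detection rule can be sketched as below. Two assumptions are made that the text does not state: intensity is measured on the NFD form of the input, and a run of marks at the very start of the string (with no base character) is counted like any other run. The names maxZalgoIntensity, isZalgo, and ZALGO_THRESHOLD are hypothetical.

```javascript
const ZALGO_THRESHOLD = 3; // from the rule Z(b) > 3

// Return the longest run of consecutive Mark code points in the NFD form.
// Assumption: intensity is measured after decomposition.
function maxZalgoIntensity(text) {
  let max = 0;
  let run = 0;
  for (const ch of text.normalize("NFD")) { // iterates by code point
    if (/\p{M}/u.test(ch)) {
      run += 1;
      if (run > max) max = run;
    } else {
      run = 0; // a non-mark character starts a new run
    }
  }
  return max;
}

function isZalgo(text) {
  return maxZalgoIntensity(text) > ZALGO_THRESHOLD;
}

console.log(isZalgo("\u1EBF"));                    // false: e + 2 marks after NFD
console.log(isZalgo("a\u0300\u0301\u0302\u0303")); // true: 4 marks on one base
```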
Reference Data
| Unicode Block | Range | Category | Script | Example | Description |
|---|---|---|---|---|---|
| Combining Diacritical Marks | U+0300 - U+036F | Mn | Common | U+0301 (acute) | Accents, tilde, macron, hook, horn |
| Combining Diacritical Extended | U+1AB0 - U+1AFF | Mn | Common | U+1AB0 | Medieval superscript letters |
| Combining Diacritical Supplement | U+1DC0 - U+1DFF | Mn | Common | U+1DC0 | Additional accents for IPA, UPA |
| Combining Half Marks | U+FE20 - U+FE2F | Mn | Common | U+FE20 | Ligature half marks, Cyrillic titlo |
| Combining for Symbols | U+20D0 - U+20FF | Mn/Me | Common | U+20D0 | Arrows, enclosing circles for symbols |
| Cyrillic Combining | U+0483 - U+0489 | Mn/Me | Cyrillic | U+0483 | Titlo, palatalization, dasia pneumata |
| Hebrew Combining | U+0591 - U+05C7 | Mn | Hebrew | U+05B0 (sheva) | Cantillation marks, vowel points |
| Arabic Combining | U+0610 - U+061A, U+064B - U+065F | Mn | Arabic | U+064E (fatha) | Vowel signs, shadda, sukun |
| Devanagari Combining | U+0900 - U+0903 | Mn/Mc | Devanagari | U+0902 (anusvara) | Nasalization, visarga, candrabindu |
| Thai Combining | U+0E31, U+0E34 - U+0E3A, U+0E47 - U+0E4E | Mn | Thai | U+0E31 | Vowel signs, tone marks |
| Tibetan Combining | U+0F71 - U+0F84 | Mn/Mc | Tibetan | U+0F71 | Vowel signs, virama |
| Myanmar Combining | U+1039 - U+103A | Mn | Myanmar | U+1039 | Virama, asat |
| Musical Symbols Combining | U+1D165 - U+1D169 | Mc/Mn | Common | - | Stem, flag, tremolo combining marks |
| Enclosing Marks (CJK) | U+20DD - U+20E4 | Me | Common | U+20DD | Enclosing circle, square, diamond |
| Variation Selectors | U+FE00 - U+FE0F | Mn | Common | U+FE0F | Emoji vs text presentation selectors |