
About

Combining characters (Unicode General Category M) are non-spacing, spacing, or enclosing marks that attach to a preceding base character to form a composed glyph. Diacritical marks such as the acute accent (U+0301), cedilla (U+0327), and tilde (U+0303) fall into this class. When stacked excessively, they produce "Zalgo" artifacts that can corrupt layouts, break search indexing, and confuse screen readers. This tool applies NFD decomposition via normalize("NFD") to split base characters from their marks, then strips every code point matching the Unicode property \p{M}. The result is plain text without combining marks, suitable for databases, filenames, and cross-platform display.

Limitations: removal of combining marks from scripts where they carry semantic meaning (e.g., Arabic vowel marks, Devanagari matras) will alter pronunciation and meaning. The tool flags spacing combining marks (Mc) separately so you can preserve them if needed. Processing is entirely client-side. No text leaves your browser.
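
To illustrate why the Mc sub-category is worth handling separately, here is a minimal sketch that strips only non-spacing (Mn) and enclosing (Me) marks and leaves spacing marks in place. The function name stripNonSpacingMarks is hypothetical and is not taken from the tool's source.

```javascript
// Hypothetical helper: remove Mn and Me marks but keep Mc (spacing) marks.
function stripNonSpacingMarks(text) {
  return text.normalize("NFD").replace(/[\p{Mn}\p{Me}]/gu, "");
}

// Devanagari KA (U+0915) + VOWEL SIGN II (U+0940, category Mc).
const ki = "\u0915\u0940";
console.log(stripNonSpacingMarks(ki) === ki); // true: the Mc vowel sign is kept

// Stripping all of \p{M} drops the vowel sign and leaves the bare consonant.
console.log(ki.normalize("NFD").replace(/\p{M}/gu, "") === "\u0915"); // true
```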


Formulas

The removal pipeline applies two Unicode operations in sequence. First, canonical decomposition separates composite characters into a base character and its combining mark sequence:

S_decomposed = normalize(S, "NFD")

After decomposition, a regular expression targeting all Unicode Mark categories strips the combining code points:

S_clean = replace(S_decomposed, /\p{M}/gu, '')
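
The two steps above can be sketched in JavaScript as follows. This is a minimal illustration rather than the tool's actual code, and removeCombiningMarks is an assumed name.

```javascript
// Minimal sketch of the two-step pipeline; the function name is illustrative.
function removeCombiningMarks(text) {
  const decomposed = text.normalize("NFD"); // step 1: canonical decomposition
  return decomposed.replace(/\p{M}/gu, ""); // step 2: strip every Mark (Mn, Mc, Me)
}

console.log(removeCombiningMarks("Cr\u00E8me br\u00FBl\u00E9e")); // "Creme brulee"
console.log(removeCombiningMarks("Z\u0301\u0302\u0303algo"));     // "Zalgo"
```

The u flag is required for the \p{...} property escape, and the g flag makes replace remove every match rather than only the first.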

The Unicode General Category M (Mark) encompasses three sub-categories used for granular control:

M = Mn ∪ Mc ∪ Me

Where Mn = Non-spacing marks (diacritics, tone marks), Mc = Spacing combining marks (vowel signs in Indic scripts), Me = Enclosing marks (circles, squares around base characters). Zalgo intensity Z per base character b is measured as the count of consecutive combining characters following it:

Z(b) = |{ c ∈ M : c follows b contiguously }|

Text is flagged as Zalgo when Z(b) > 3 for any base character b.
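
A rough sketch of this rule is shown below. It assumes the count is taken after NFD decomposition, and the names THRESHOLD, maxZalgoIntensity, and isZalgo are illustrative, not the tool's own.

```javascript
// Illustrative threshold matching the "Z(b) > 3" rule.
const THRESHOLD = 3;

// Returns the largest Z(b): the longest run of combining marks that
// directly follows any single base character.
function maxZalgoIntensity(text) {
  let max = 0;
  // Each match is one non-mark code point followed by its contiguous marks.
  for (const match of text.normalize("NFD").matchAll(/\P{M}(\p{M}+)/gu)) {
    // Spread into code points so astral marks are not counted twice.
    max = Math.max(max, [...match[1]].length);
  }
  return max;
}

function isZalgo(text) {
  return maxZalgoIntensity(text) > THRESHOLD;
}

console.log(isZalgo("caf\u00E9"));                  // false (one mark on "e")
console.log(isZalgo("a\u0301\u0302\u0303\u0304"));  // true (four marks on "a")
```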

Reference Data

| Unicode Block | Range | Category | Script | Example | Description |
|---|---|---|---|---|---|
| Combining Diacritical Marks | U+0300 - U+036F | Mn | Common | U+0301 (acute) | Accents, tilde, macron, hook, horn |
| Combining Diacritical Marks Extended | U+1AB0 - U+1AFF | Mn | Common | U+1AB0 | Medieval superscript letters |
| Combining Diacritical Marks Supplement | U+1DC0 - U+1DFF | Mn | Common | U+1DC0 | Additional accents for IPA, UPA |
| Combining Half Marks | U+FE20 - U+FE2F | Mn | Common | U+FE20 | Ligature half marks, Cyrillic titlo |
| Combining Diacritical Marks for Symbols | U+20D0 - U+20FF | Mn/Me | Common | U+20D2 | Arrows, enclosing circles for symbols |
| Cyrillic Combining | U+0483 - U+0489 | Mn/Me | Cyrillic | U+0483 | Titlo, palatalization, dasia pneumata |
| Hebrew Combining | U+0591 - U+05C7 | Mn | Hebrew | U+05B0 (sheva) | Cantillation marks, vowel points |
| Arabic Combining | U+0610 - U+065F | Mn | Arabic | U+064E (fatha) | Vowel signs, shadda, sukun |
| Devanagari Combining | U+0900 - U+0903 | Mn/Mc | Devanagari | U+0902 (anusvara) | Nasalization, visarga, candrabindu |
| Thai Combining | U+0E31 - U+0E3A | Mn | Thai | U+0E31 | Vowel signs, tone marks |
| Tibetan Combining | U+0F71 - U+0F84 | Mn | Tibetan | U+0F71 | Vowel signs, virama |
| Myanmar Combining | U+1039 - U+103A | Mn | Myanmar | U+1039 | Virama, asat |
| Musical Symbols Combining | U+1D165 - U+1D169 | Mc/Mn | Common | - | Stem, flag, tremolo combining marks |
| Enclosing Marks (CJK) | U+20DD - U+20E4 | Me | Common | U+20DD | Enclosing circle, square, diamond |
| Variation Selectors | U+FE00 - U+FE0F | Mn | Common | U+FE0F | Emoji vs text presentation selectors |

Frequently Asked Questions

Yes, removing combining marks affects Arabic and Hebrew text. Arabic vowel marks (harakat) like fatha (U+064E), damma (U+064F), and kasra (U+0650) are combining characters of category Mn. Removing them strips vowelization, making the text consonantal-only. Similarly, Hebrew nikkud (vowel points) will be lost. Use the "Diacritics only (Mn)" mode and review results carefully when processing Semitic scripts. For Arabic, consonantal text is the standard written form, so removal may be acceptable. For fully vocalized religious texts, it is not.
Some combined characters exist as single precomposed code points (NFC form). For example, "é" can be U+00E9 (single code point) or U+0065 + U+0301 (base "e" + combining acute). Without NFD decomposition first, the regex \p{M} will not match the precomposed U+00E9 because it is not a combining mark; it is a letter. NFD decomposition breaks it into the two-code-point sequence, making the accent removable. Always decompose before filtering.
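
The difference is easy to check in a console. The snippet below is only a demonstration of the point above; the helper name strip is illustrative.

```javascript
const precomposed = "\u00E9"; // "é" as a single code point (NFC form)
const decomposed = "e\u0301"; // "e" followed by U+0301 COMBINING ACUTE ACCENT

// Without decomposition there is no mark to match, so the accent survives.
console.log(precomposed.replace(/\p{M}/gu, "") === "\u00E9"); // true

// After NFD, both spellings reduce to the same plain "e".
const strip = (s) => s.normalize("NFD").replace(/\p{M}/gu, "");
console.log(strip(precomposed) === "e", strip(decomposed) === "e"); // true true
```
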
Zalgo text stacks dozens of combining characters (typically from U+0300 - U+036F and U+0489) on single base characters. The tool detects Zalgo by counting consecutive combining marks per base character. Any base character followed by more than 3 combining marks is flagged. The statistics panel reports total Zalgo-affected characters. All combining marks are removed uniformly - there is no "partial Zalgo cleanup" because distinguishing intentional diacritics from Zalgo stacking is ambiguous.
Variation selectors (U+FE00 - U+FE0F) are category Mn and will be stripped. U+FE0F is the emoji presentation selector; removing it may cause some emoji to render as text glyphs instead of colored pictographs (e.g., ☺ vs ☺️). Skin tone modifiers (U+1F3FB - U+1F3FF) are not combining characters (they are category Sk), so they are preserved. Enclosing combining marks (Me category, like U+20E3 for keycap sequences) will be removed if the Me filter is active, which can break keycap emoji like 1️⃣.
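
If emoji sequences need to survive, one possible workaround is to exempt the two code points named above from removal. The following is a hypothetical variant, not a mode the tool is known to offer, and stripMarksKeepEmoji is an assumed name.

```javascript
// Hypothetical variant: strip marks but keep U+FE0F (emoji presentation
// selector) and U+20E3 (combining enclosing keycap).
function stripMarksKeepEmoji(text) {
  // The negative lookahead skips the two exempted code points.
  return text.normalize("NFD").replace(/(?![\uFE0F\u20E3])\p{M}/gu, "");
}

const keycap = "1\uFE0F\u20E3"; // keycap digit one
console.log(stripMarksKeepEmoji(keycap) === keycap);                   // true
console.log(keycap.normalize("NFD").replace(/\p{M}/gu, "") === "1");   // true
```

The trade-off is that a Zalgo string built from U+20E3 would also pass through untouched, so an exemption like this is only reasonable for input that is expected to contain emoji.
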
The "Common Diacritics Only" mode restricts removal to the U+0300 - U+036F block, which covers Latin/Greek/Cyrillic diacritical marks. This preserves Arabic harakat, Hebrew nikkud, Devanagari matras, and Thai tone marks. It is the safest mode for multilingual text where you only want to strip Western accents (e.g., converting "café" to "cafe").
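
A block-restricted filter of this kind might look like the sketch below. It is an approximation based on the description above, and stripCommonDiacritics is an assumed name.

```javascript
// Sketch: remove only marks in the Combining Diacritical Marks block.
function stripCommonDiacritics(text) {
  return text.normalize("NFD").replace(/[\u0300-\u036F]/g, "");
}

console.log(stripCommonDiacritics("caf\u00E9")); // "cafe"

// Arabic BEH (U+0628) + FATHA (U+064E) lies outside the block and is kept.
const ba = "\u0628\u064E";
console.log(stripCommonDiacritics(ba) === ba); // true
```

Note that the output stays in NFD, so precomposed characters from other scripts remain decomposed. Appending .normalize("NFC") would recompose them; whether the tool does so is not stated.
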
Texts exceeding 50,000 characters are processed in chunks of 10,000 characters using asynchronous iteration (setTimeout batching). A progress bar displays completion percentage. The regex engine's \p{M} Unicode property escape is executed by the browser's native regex compiler, which is highly optimized. Typical processing speed is approximately 1 million characters per second on modern hardware. The bottleneck is DOM rendering of the output, not the regex itself.
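
The batching described above could be approximated as follows. This is a sketch under stated assumptions: the names splitIntoChunks and stripInBatches, the onProgress callback, and the surrogate-pair guard are all illustrative and are not taken from the tool.

```javascript
// Hypothetical chunker: pieces of at most `size` UTF-16 code units,
// never cutting a surrogate pair in half.
function splitIntoChunks(text, size = 10000) {
  const chunks = [];
  let start = 0;
  while (start < text.length) {
    let end = Math.min(start + size, text.length);
    // If the cut lands right after a high surrogate, move it back by one.
    const prev = text.charCodeAt(end - 1);
    if (end < text.length && prev >= 0xd800 && prev <= 0xdbff) end -= 1;
    // Guard for tiny sizes: always make progress.
    if (end === start) end = Math.min(start + 2, text.length);
    chunks.push(text.slice(start, end));
    start = end;
  }
  return chunks;
}

// Process one chunk per timer tick so the browser can repaint in between.
function stripInBatches(text, onProgress = () => {}) {
  const chunks = splitIntoChunks(text);
  const out = [];
  return new Promise((resolve) => {
    const step = (i) => {
      if (i === chunks.length) return resolve(out.join(""));
      out.push(chunks[i].normalize("NFD").replace(/\p{M}/gu, ""));
      onProgress((i + 1) / chunks.length); // fraction completed, 0..1
      setTimeout(() => step(i + 1), 0);
    };
    step(0);
  });
}
```

A caller would use it as stripInBatches(input, (p) => updateBar(p)).then(show), where updateBar and show stand in for whatever UI code applies. A chunk boundary that falls between a base character and its marks should not change the result here, since the marks are removed on both sides of the cut.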