Input Text: Supports any Unicode text. Combining marks are highlighted in the preview.

Removal Mode: All Marks (Mn + Mc + Me), Non-spacing Only (Mn), or Common Diacritics (U+0300–U+036F)

Option: Apply NFD decomposition first

About

Combining characters (Unicode category M) are non-spacing, spacing, or enclosing marks that attach to a preceding base character to form a composed glyph. Diacritical marks such as the acute (U+0301), cedilla (U+0327), and tilde (U+0303) fall into this class. When stacked excessively, they produce "Zalgo" artifacts that can corrupt layouts, break search indexing, and cause accessibility failures in screen readers. This tool applies NFD decomposition via normalize("NFD") to split base characters from their marks, then strips marks matching the Unicode property \p{M}. The result is plain, unadorned text that is safer for databases, filenames, and cross-platform display.

Limitations: removal of combining marks from scripts where they carry semantic meaning (e.g., Arabic vowel marks, Devanagari matras) will alter pronunciation and meaning. The tool flags spacing combining marks (Mc) separately so you can preserve them if needed. Processing is entirely client-side. No text leaves your browser.

Formulas

The removal pipeline applies two Unicode operations in sequence. First, canonical decomposition separates composite characters into a base character and its combining mark sequence:

S_decomposed = normalize(S, "NFD")

After decomposition, a regular expression targeting all Unicode Mark categories strips the combining code points:

S_clean = replace(S_decomposed, /\p{M}/gu, '')
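
The two steps above can be sketched in a few lines of JavaScript. This is a minimal illustration rather than the tool's actual source; the function name stripCombiningMarks is an assumption made for the example.

```javascript
// Minimal sketch of the two-step pipeline: decompose, then strip marks.
// stripCombiningMarks is a hypothetical name, not taken from the tool.
function stripCombiningMarks(text) {
  // Step 1: canonical decomposition (NFD) splits precomposed characters
  // into a base character followed by its combining marks.
  const decomposed = text.normalize("NFD");
  // Step 2: remove every code point in General Category M (Mn, Mc, Me).
  return decomposed.replace(/\p{M}/gu, "");
}

console.log(stripCombiningMarks("caf\u00E9"));       // "cafe"
console.log(stripCombiningMarks("S\u00E3o Paulo"));  // "Sao Paulo"
console.log(stripCombiningMarks("e\u0301\u0327"));   // "e"
```

Note that the output stays in decomposed form for anything that is not a mark. If the text is stored or compared afterwards, recomposing with normalize("NFC") may be worth considering.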

The Unicode General Category M (Mark) encompasses three sub-categories used for granular control:

M = Mn ∪ Mc ∪ Me

where Mn = non-spacing marks (diacritics, tone marks), Mc = spacing combining marks (vowel signs in Indic scripts), and Me = enclosing marks (circles, squares around base characters).

Zalgo intensity Z per base character b is measured as the count of consecutive combining characters following it:

Z(b) = |{ c ∈ M : c follows b contiguously }|

Text is flagged as Zalgo when Z(b) > 3 for any base character b.
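
As a rough sketch of how Z(b) could be computed, the following walks the text code point by code point and counts the marks that follow each base character. The function names are hypothetical, and the sketch assumes the count is taken on the text as given, without normalizing it first.

```javascript
// Sketch of the Zalgo intensity measure Z(b). Names are illustrative only.
const IS_MARK = /^\p{M}$/u;
const ZALGO_THRESHOLD = 3; // from the text: flagged when Z(b) > 3

// Returns one entry per base character with its count of trailing marks.
function zalgoIntensities(text) {
  const result = [];
  let current = null;
  for (const ch of text) {           // for...of iterates by code point
    if (IS_MARK.test(ch)) {
      if (current) current.z += 1;   // marks with no preceding base are ignored
    } else {
      current = { base: ch, z: 0 };
      result.push(current);
    }
  }
  return result;
}

function isZalgo(text) {
  return zalgoIntensities(text).some(({ z }) => z > ZALGO_THRESHOLD);
}

console.log(isZalgo("e\u0301"));                    // false (Z = 1)
console.log(isZalgo("e\u0301\u0302\u0303"));        // false (Z = 3, not > 3)
console.log(isZalgo("e\u0301\u0302\u0303\u0304"));  // true  (Z = 4)
```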

Reference Data

Unicode Block	Range	Category	Script	Example	Description
Combining Diacritical Marks	U+0300 - U+036F	Mn	Common	́ (acute)	Accents, tilde, macron, hook, horn
Combining Diacritical Marks Extended	U+1AB0 - U+1AFF	Mn/Me	Common	᪰	Diacritics for German and Scots dialectology
Combining Diacritical Marks Supplement	U+1DC0 - U+1DFF	Mn	Common	᷀	Additional accents for IPA, UPA; medieval superscript letters
Combining Half Marks	U+FE20 - U+FE2F	Mn	Common	︠	Ligature half marks, Cyrillic titlo
Combining Diacritical Marks for Symbols	U+20D0 - U+20FF	Mn/Me	Common	⃒	Arrows, enclosing circles for symbols
Cyrillic Combining	U+0483 - U+0489	Mn/Me	Cyrillic	҃	Titlo, palatalization, dasia pneumata
Hebrew Combining	U+0591 - U+05C7	Mn	Hebrew	ְ (sheva)	Cantillation marks, vowel points
Arabic Combining	U+0610 - U+061A, U+064B - U+065F	Mn	Arabic	َ (fatha)	Vowel signs, shadda, sukun
Devanagari Combining	U+0900 - U+0903	Mn/Mc	Devanagari	ं (anusvara)	Nasalization, visarga, candrabindu
Thai Combining	U+0E31, U+0E34 - U+0E3A, U+0E47 - U+0E4E	Mn	Thai	ั	Vowel signs, tone marks
Tibetan Combining	U+0F71 - U+0F84	Mn	Tibetan	ཱ	Vowel signs, virama
Myanmar Combining	U+1039 - U+103A	Mn	Myanmar	္	Virama, asat
Musical Symbols Combining	U+1D165 - U+1D169	Mc/Mn	Common	-	Stem and tremolo combining marks
Enclosing Marks (symbols)	U+20DD - U+20E4	Me (U+20E1 is Mn)	Common	⃝	Enclosing circle, square, diamond
Variation Selectors	U+FE00 - U+FE0F	Mn	Common	️	Emoji vs text presentation selectors

Frequently Asked Questions

Does removing combining marks change Arabic or Hebrew text?

Yes. Arabic vowel marks (harakat) such as fatha (U+064E), damma (U+064F), and kasra (U+0650) are combining characters of category Mn. Removing them strips the vowelization and leaves consonantal-only text. Hebrew nikkud (vowel points) is lost in the same way. Because these marks are Mn, the "Non-spacing Only (Mn)" mode removes them too; to keep them, use the "Common Diacritics (U+0300–036F)" mode, and review results carefully when processing Semitic scripts. Everyday Arabic is normally written unvocalized, so removal may be acceptable there. For fully vocalized religious texts, it is not.

Why apply NFD decomposition first?

Some accented characters exist as single precomposed code points (NFC form). For example, "é" can be U+00E9 (one code point) or U+0065 + U+0301 (base "e" plus combining acute). Without NFD decomposition, the regex \p{M} does not match the precomposed U+00E9, because it is a letter, not a combining mark. NFD decomposition breaks it into the two-code-point sequence, which makes the accent removable. Always decompose before filtering.
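
The difference is easy to check directly. The snippet below is a small illustration, with variable names chosen for the example, that compares the precomposed and decomposed forms of "é":

```javascript
// Precomposed vs. decomposed "é" under the same mark-stripping regex.
const MARKS = /\p{M}/gu;
const precomposed = "\u00E9";   // é as a single code point (a letter)
const decomposed = "e\u0301";   // e followed by combining acute accent

// Without NFD, the precomposed form contains nothing for \p{M} to match.
console.log(precomposed.replace(MARKS, "") === precomposed);       // true

// The decomposed form loses its accent.
console.log(decomposed.replace(MARKS, ""));                        // "e"

// NFD turns the precomposed form into the decomposed one...
console.log(precomposed.normalize("NFD") === decomposed);          // true

// ...after which the accent can be stripped.
console.log(precomposed.normalize("NFD").replace(MARKS, ""));      // "e"
```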

How is Zalgo text detected?

Zalgo text stacks dozens of combining characters (typically from U+0300–U+036F and U+0489) on single base characters. The tool detects Zalgo by counting consecutive combining marks per base character. Any base character followed by more than 3 combining marks is flagged. The statistics panel reports the total number of Zalgo-affected characters. All combining marks are removed uniformly. There is no "partial Zalgo cleanup", because intentional diacritics cannot be reliably distinguished from Zalgo stacking.

What happens to emoji and variation selectors?

Variation selectors (U+FE00–U+FE0F) are category Mn and are stripped in the "All Marks" and "Non-spacing Only (Mn)" modes. U+FE0F is the emoji presentation selector, so removing it may cause some emoji to render as text glyphs instead of colored pictographs (e.g., ☺ vs ☺️). Skin tone modifiers (U+1F3FB–U+1F3FF) are not combining characters (they are category Sk), so they are preserved. Enclosing combining marks (category Me, such as U+20E3 in keycap sequences) are removed in the "All Marks" mode, which can break keycap emoji like 1️⃣.
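
The effect on emoji sequences can be seen with a few test strings. The second regex below is not part of the tool as described; it is a hypothetical variant, shown only as one way a caller might keep variation selectors and the keycap mark while stripping other marks.

```javascript
// How \p{M} stripping affects emoji sequences (illustrative names).
const stripAll = (s) => s.replace(/\p{M}/gu, "");

const smiley = "\u263A\uFE0F";         // U+263A + emoji presentation selector
const keycap = "1\uFE0F\u20E3";        // digit + U+FE0F + enclosing keycap
const thumbs = "\u{1F44D}\u{1F3FD}";   // thumbs up + skin tone modifier (Sk)

console.log(stripAll(smiley) === "\u263A");  // true: selector removed
console.log(stripAll(keycap) === "1");       // true: keycap sequence broken
console.log(stripAll(thumbs) === thumbs);    // true: modifier is not a mark

// Hypothetical emoji-preserving variant: skip U+FE0E, U+FE0F, and U+20E3.
const stripKeepEmoji = (s) =>
  s.replace(/(?![\uFE0E\uFE0F\u20E3])\p{M}/gu, "");

console.log(stripKeepEmoji(keycap + "e\u0301") === keycap + "e"); // true
```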

What does the "Common Diacritics" mode do?

The "Common Diacritics (U+0300–036F)" mode restricts removal to the U+0300–U+036F block, which holds the general-purpose diacritical marks used with Latin, Greek, and Cyrillic letters. This preserves Arabic harakat, Hebrew nikkud, Devanagari matras, and Thai tone marks. It is the safest mode for multilingual text where you only want to strip Western accents (e.g., converting "café" to "cafe").
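
One way to picture the three removal modes is as three regular expressions applied after the optional NFD step. The mode keys and the removeMarks function below are assumptions made for this sketch, not the tool's real identifiers.

```javascript
// Sketch: one regex per removal mode. Mode keys are hypothetical.
const MODE_PATTERNS = {
  all: /\p{M}/gu,                        // All Marks (Mn + Mc + Me)
  nonSpacing: /\p{Mn}/gu,                // Non-spacing Only (Mn)
  commonDiacritics: /[\u0300-\u036F]/gu, // Common Diacritics block only
};

function removeMarks(text, mode = "all", decomposeFirst = true) {
  const pattern = MODE_PATTERNS[mode];
  if (!pattern) throw new RangeError(`Unknown mode: ${mode}`);
  const input = decomposeFirst ? text.normalize("NFD") : text;
  return input.replace(pattern, "");
}

const arabic = "\u0643\u064E\u062A\u064E\u0628\u064E"; // kaf, teh, beh with fatha
const hindi = "\u0915\u093E";                          // ka + vowel sign aa (Mc)

console.log(removeMarks("caf\u00E9", "commonDiacritics"));      // "cafe"
console.log(removeMarks(arabic, "commonDiacritics") === arabic); // true: harakat kept
console.log(removeMarks(arabic, "nonSpacing").length);           // 3: harakat removed
console.log(removeMarks(hindi, "nonSpacing") === hindi);         // true: Mc kept
console.log(removeMarks(hindi, "all").length);                   // 1: Mc removed
```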

How are large inputs handled?

Texts exceeding 50,000 characters are processed in chunks of 10,000 characters, with setTimeout batching between chunks so the page stays responsive. A progress bar displays the completion percentage. The \p{M} Unicode property escape is executed by the browser's native regex engine, which is highly optimized. Typical processing speed is approximately 1 million characters per second on modern hardware. The bottleneck is DOM rendering of the output, not the regex itself.
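
The batching described above might look roughly like the sketch below. The function names and the onProgress callback are assumptions; the threshold and chunk size come from the text. The sketch also avoids cutting a surrogate pair in half at a chunk boundary, which is a precaution added here rather than something the page states.

```javascript
// Sketch of chunked processing with setTimeout batching (names are illustrative).
const CHUNK_THRESHOLD = 50000; // from the text: chunk only above this length
const CHUNK_SIZE = 10000;      // from the text

const strip = (s) => s.normalize("NFD").replace(/\p{M}/gu, "");

// Yields slices of roughly `size` UTF-16 code units.
function* chunks(text, size = CHUNK_SIZE) {
  let start = 0;
  while (start < text.length) {
    let end = Math.min(start + size, text.length);
    const last = text.charCodeAt(end - 1);
    // If the slice would end on a high surrogate, take its low surrogate too.
    if (end < text.length && last >= 0xd800 && last <= 0xdbff) end += 1;
    yield text.slice(start, end);
    start = end;
  }
}

async function stripInChunks(text, onProgress = () => {}) {
  if (text.length <= CHUNK_THRESHOLD) return strip(text);
  const parts = [];
  let done = 0;
  for (const chunk of chunks(text)) {
    parts.push(strip(chunk));
    done += chunk.length;
    onProgress(done / text.length); // fraction complete, 0..1
    // Yield to the event loop so the page can repaint between chunks.
    await new Promise((resolve) => setTimeout(resolve, 0));
  }
  return parts.join("");
}

stripInChunks("e\u0301".repeat(30000), (p) => {
  if (p === 1) console.log("done");
}).then((out) => console.log(out.length)); // 30000
```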