About

Accented characters cause silent failures in systems that expect ASCII input. Database collation mismatches, broken URL slugs, failed search queries, and CSV import errors all trace back to unhandled diacritical marks. This tool applies Unicode NFD (Normalization Form Canonical Decomposition) to decompose composite characters like é into base letter e plus combining acute accent U+0301, then strips the combining marks. It also handles non-decomposable special letters - ø, đ, ł, ß - via direct mapping tables, which NFD alone cannot resolve.

The tool processes text in O(n) time, where n is the string length. It leaves whitespace, punctuation, numbers, and non-Latin scripts (CJK, Arabic, Cyrillic) untouched. Limitation: transliteration of full non-Latin alphabets (e.g., Greek α → a) is not performed; only combining diacritical marks in the Unicode range U+0300-U+036F are removed. Pro tip: always test your output against your target system's character whitelist before bulk processing.
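As a quick illustration of the decomposition step described above, here is a minimal JavaScript sketch using the standard `String.prototype.normalize` method (this is an illustrative snippet, not the tool's own source):

```javascript
// NFD splits a precomposed character into base letter + combining mark
const nfd = 'é'.normalize('NFD');
console.log(nfd.length);                      // 2: 'e' followed by U+0301
console.log(nfd.codePointAt(1).toString(16)); // "301"

// Non-Latin scripts carry no marks in U+0300-U+036F,
// so the mark-stripping regex leaves them unchanged
console.log('日本語'.replace(/[\u0300-\u036f]/g, '')); // 日本語
```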


Formulas

The accent removal process follows a two-stage algorithm. Stage 1 applies Unicode Normalization Form D. Stage 2 strips combining diacritical marks and maps special characters.

removeAccents(s) = stripMarks(mapSpecial(NFD(s)))

Where NFD(s) decomposes each composite character into its base character plus combining marks:

NFD(é) → e + U+0301

The stripMarks function applies a regular expression to remove all Unicode combining diacritical marks:

stripMarks(s) = s.replace(/[\u0300-\u036f]/g, '')

The mapSpecial function handles characters that NFD cannot decompose - ligatures and modified letters with no combining mark equivalent:

mapSpecial: { ø → o, ł → l, đ → d, ß → ss, æ → ae, œ → oe, ð → d, þ → th }

Where s is the input string. The algorithm runs in O(n) time complexity where n is the character count. The Unicode combining diacritical marks block spans codepoints U+0300 through U+036F, covering 112 combining marks including accents, cedillas, ogonek, horn, and various other modifications.
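The two-stage pipeline above can be sketched in JavaScript. The mapping table here is an illustrative subset; the tool's full table (roughly 30 entries, per the FAQ) is not reproduced on this page:

```javascript
// Illustrative subset of mapSpecial, including uppercase variants
const SPECIAL = {
  'ø': 'o', 'Ø': 'O', 'ł': 'l', 'Ł': 'L', 'đ': 'd', 'Đ': 'D',
  'ß': 'ss', 'æ': 'ae', 'Æ': 'AE', 'œ': 'oe', 'Œ': 'OE',
  'ð': 'd', 'Ð': 'D', 'þ': 'th', 'Þ': 'Th',
};

function removeAccents(s) {
  return s
    .normalize('NFD')                    // stage 1: canonical decomposition
    .replace(/[\u0300-\u036f]/g, '')     // stage 2a: strip combining marks
    .replace(/[øØłŁđĐßæÆœŒðÐþÞ]/g,       // stage 2b: map non-decomposables
             ch => SPECIAL[ch]);
}

console.log(removeAccents('Łódź'));   // Lodz
console.log(removeAccents('straße')); // strasse
```

Both passes are single linear scans over the string, which is where the O(n) bound comes from.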

Reference Data

Accented Character | Unicode Codepoint | NFD Decomposition | Result After Stripping | Language Origin
é | U+00E9 | e + U+0301 (acute) | e | French, Portuguese, Spanish
ñ | U+00F1 | n + U+0303 (tilde) | n | Spanish
ü | U+00FC | u + U+0308 (diaeresis) | u | German, Turkish
ç | U+00E7 | c + U+0327 (cedilla) | c | French, Portuguese, Turkish
ö | U+00F6 | o + U+0308 (diaeresis) | o | German, Swedish, Finnish
à | U+00E0 | a + U+0300 (grave) | a | French, Italian, Portuguese
â | U+00E2 | a + U+0302 (circumflex) | a | French, Romanian
ž | U+017E | z + U+030C (caron) | z | Czech, Slovak, Slovenian
ø | U+00F8 | Non-decomposable (special map) | o | Danish, Norwegian
ł | U+0142 | Non-decomposable (special map) | l | Polish
đ | U+0111 | Non-decomposable (special map) | d | Croatian, Vietnamese
ß | U+00DF | Non-decomposable (special map) | ss | German
å | U+00E5 | a + U+030A (ring above) | a | Swedish, Norwegian, Danish
ă | U+0103 | a + U+0306 (breve) | a | Romanian, Vietnamese
ț | U+021B | t + U+0326 (comma below) | t | Romanian
ś | U+015B | s + U+0301 (acute) | s | Polish
ī | U+012B | i + U+0304 (macron) | i | Latvian, Latin transliteration
ğ | U+011F | g + U+0306 (breve) | g | Turkish
ê | U+00EA | e + U+0302 (circumflex) | e | French, Portuguese
ï | U+00EF | i + U+0308 (diaeresis) | i | French, Catalan
ô | U+00F4 | o + U+0302 (circumflex) | o | French, Portuguese, Slovak
ū | U+016B | u + U+0304 (macron) | u | Latvian, Japanese romaji
ý | U+00FD | y + U+0301 (acute) | y | Czech, Icelandic
ð | U+00F0 | Non-decomposable (special map) | d | Icelandic, Old English
þ | U+00FE | Non-decomposable (special map) | th | Icelandic, Old English
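A few rows of the table can be spot-checked against the runtime's own Unicode data; a sketch using the standard `normalize` method:

```javascript
// [precomposed character, expected NFD form] pairs from the table above
const rows = [
  ['\u00e9', 'e\u0301'],  // é → e + acute
  ['\u00f1', 'n\u0303'],  // ñ → n + tilde
  ['\u00e7', 'c\u0327'],  // ç → c + cedilla
  ['\u00f8', '\u00f8'],   // ø: no canonical decomposition
];
for (const [ch, expected] of rows) {
  console.log(ch, ch.normalize('NFD') === expected); // all true
}
```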

Frequently Asked Questions

Why can't NFD decompose characters like ø, ł, and ß?

These characters have no canonical decomposition in the Unicode standard. The letter ø (U+00F8) is classified as a distinct letter in the Danish and Norwegian alphabets, not as "o with a mark." Similarly, ß (U+00DF) is the German Eszett, a ligature historically derived from long s + z, not a modified "s". NFD only decomposes characters that have a defined canonical decomposition mapping. This tool supplements NFD with a manual mapping table of approximately 30 such special characters to ensure complete coverage.
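This is easy to confirm at runtime: NFD returns these letters unchanged, which is why a mapping table is needed at all (illustrative sketch):

```javascript
// No canonical decomposition: NFD is a no-op for ø and ß
console.log('ø'.normalize('NFD') === 'ø'); // true
console.log('ß'.normalize('NFD') === 'ß'); // true
// Contrast with é, which does decompose into two code points
console.log('é'.normalize('NFD').length);  // 2
```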
Does the tool affect non-Latin scripts such as CJK, Arabic, or Cyrillic?

No. The stripping regex targets only the Unicode block U+0300-U+036F (Combining Diacritical Marks), which covers combining marks used with Latin, Greek, and Cyrillic. CJK ideographs, Arabic letters, and other script blocks pass through unchanged. However, if Cyrillic text contains combining marks (e.g., й decomposed via NFD to и + U+0306), the breve mark would be stripped. If you need to preserve Cyrillic combining marks, disable real-time processing and review the output character by character.
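The Cyrillic caveat can be demonstrated directly; й has the canonical decomposition и + U+0306 (sketch):

```javascript
// й decomposes to и + combining breve (U+0306), which falls inside
// U+0300-U+036F and is therefore stripped
const stripped = 'й'.normalize('NFD').replace(/[\u0300-\u036f]/g, '');
console.log(stripped); // и

// Plain Cyrillic with no combining marks passes through unchanged
console.log('москва'.replace(/[\u0300-\u036f]/g, '')); // москва
```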
Is letter case preserved?

Yes. The NFD decomposition and combining mark removal process is case-preserving. É (U+00C9) decomposes to E + U+0301 via NFD, and the combining acute accent is stripped, leaving E. The special character mapping table includes both lowercase and uppercase variants; for example, Ø maps to O, Ł maps to L, and Đ maps to D. Case is never altered during processing.
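A quick sketch showing that stripping only removes the mark and leaves case intact:

```javascript
// Uppercase letters decompose the same way - only the mark is removed
console.log('É'.normalize('NFD').replace(/[\u0300-\u036f]/g, '')); // E
console.log('Ñ'.normalize('NFD').replace(/[\u0300-\u036f]/g, '')); // N
```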
What is the difference between NFD, NFC, NFKD, and NFKC?

NFD (Canonical Decomposition) breaks composite characters into base + combining marks. NFC (Canonical Composition) does the opposite: it recombines them. NFKD (Compatibility Decomposition) goes further, decomposing compatibility characters such as the fi ligature into "fi" and ² into "2". NFKC recomposes after compatibility decomposition. This tool uses NFD because it separates marks from base letters without altering compatibility characters, giving precise control over which marks to strip.
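The four forms are easiest to compare side by side (sketch using escaped code points; U+FB01 is the fi ligature):

```javascript
const e = '\u00e9'; // é, precomposed
console.log(e.normalize('NFD').length); // 2: e + U+0301
console.log(e.normalize('NFC').length); // 1: recomposed

const fi = '\ufb01'; // fi ligature, a compatibility character
console.log(fi.normalize('NFD') === fi); // true: NFD leaves it alone
console.log(fi.normalize('NFKD'));       // "fi": compatibility decomposition
console.log('\u00b2'.normalize('NFKD')); // "2"
```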
Can removing accents change the meaning of words?

In many languages, yes. In Spanish, año (year) becomes ano (anus). In Turkish, removing the dot from İ (capital I with dot) produces a semantically different letter. In Vietnamese, tonal marks are essential: removing them renders words ambiguous or meaningless. This tool is designed for technical normalization (URL slugs, filenames, database keys, search indexing), where ASCII compatibility matters more than linguistic accuracy. Do not use the output as a substitute for properly localized text.
Does the tool remove Zalgo text?

Yes. Zalgo text works by stacking dozens of combining diacritical marks on single base characters. Since the regex strips all characters in the U+0300-U+036F range, it effectively removes all standard Zalgo modifications. Extended Zalgo that uses marks from other combining blocks (U+0489, U+1DC0-U+1DFF, U+20D0-U+20FF, U+FE20-U+FE2F) will not be caught by this regex. The tool targets the primary Combining Diacritical Marks block, which handles the vast majority of real-world use cases.
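A sketch of de-Zalgoing with the same regex, using marks drawn only from the primary block:

```javascript
// Five base letters buried under stacked combining marks, all of
// which fall inside the primary U+0300-U+036F block
const zalgo = 'h\u0336\u0301e\u0352\u030cl\u0334l\u030ao\u0303\u033f';
console.log(zalgo.replace(/[\u0300-\u036f]/g, '')); // hello
```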