About

Incorrect character encoding causes mojibake: garbled text that erodes user trust and can break search engine indexing. Any non-ASCII character (codepoint above U+007F) risks corruption when it passes through systems that assume 7-bit ASCII or a mismatched code page. This tool converts every non-ASCII character in your text to a numeric or named representation: HTML decimal (&#228;), HTML hexadecimal (&#xE4;), named HTML (&auml;), CSS hex (\00E4), JavaScript Unicode escape (\u00E4), or URL percent-encoding (%C3%A4). It handles the full Unicode range, including astral-plane characters above U+FFFF, by iterating over codepoints rather than UTF-16 code units.

The converter compares each codepoint against the ASCII boundary at 0x7F. In "non-Latin only" mode, characters at or below this threshold pass through unchanged. The tool assumes UTF-8 source text and does not handle legacy multi-byte encodings such as Shift_JIS or Big5; convert those to UTF-8 first. Pro tip: named HTML entities (like &auml;) improve source readability, but only a small subset of characters has one. HTML 4 defined about 250 named entities, and the current HTML standard defines more than 2,000 names, which is still a tiny fraction of Unicode. Numeric entities work for any Unicode character.

Formulas

Each character in the input string is examined by its Unicode codepoint cp, obtained via codePointAt. The ASCII boundary is defined at codepoint 0x7F (127 decimal). In "non-Latin only" mode, characters satisfying cp ≤ 0x7F pass through unchanged.

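$$
\mathrm{output}(cp) =
\begin{cases}
\text{char (unchanged)} & \text{if } cp \le \mathtt{0x7F} \text{ and } \mathrm{mode} = \mathtt{NON\_LATIN} \\
\mathrm{encode}(cp, \mathrm{format}) & \text{otherwise}
\end{cases}
$$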
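The rule above can be sketched in a few lines of JavaScript. This is a minimal illustration, not the tool's actual source: the names `convert`, `htmlDecimal`, and the mode strings `"NON_LATIN"` and `"ALL"` are assumptions made for the example.

```javascript
// Hypothetical names throughout; only the 0x7F boundary comes from the text above.
const ASCII_MAX = 0x7f;

function convert(text, encodeFn, mode = "NON_LATIN") {
  let out = "";
  // for...of walks the string by codepoint, so a surrogate pair arrives as one item.
  for (const ch of text) {
    const cp = ch.codePointAt(0);
    if (mode === "NON_LATIN" && cp <= ASCII_MAX) {
      out += ch; // ASCII passes through unchanged
    } else {
      out += encodeFn(cp);
    }
  }
  return out;
}

// One possible encoder: HTML decimal.
const htmlDecimal = (cp) => `&#${cp};`;

console.log(convert("K\u00E4se", htmlDecimal));  // K&#228;se
console.log(convert("A", htmlDecimal, "ALL"));   // &#65;
```

Passing the encoder in as a function keeps the boundary check separate from the output format, so the same loop serves every format.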

The encoding functions per format are:

HTML Decimal: output = &# + cp + ;

HTML Hex: output = &#x + toHex(cp) + ;

CSS Hex: output = \ + padHex(cp, 4)

JS Escape: output = \u + padHex(cp, 4) or \u{ + toHex(cp) + } if cp > 0xFFFF

URL Encoding: output = encodeURIComponent(char)

Where cp = Unicode codepoint (integer). toHex(cp) converts to uppercase hexadecimal string. padHex(cp, n) zero-pads the hex to at least n digits. For HTML Named entities, a dictionary lookup maps cp → entityName. If no named entity exists, the converter falls back to HTML Hex format.
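As a rough sketch, the per-format formulas might translate to JavaScript as follows. `toHex` and `padHex` mirror the helper names used above. The `encoders` object and the three-entry `NAMED` map are hypothetical stand-ins (a real dictionary would hold every supported entity name), and the trailing space that the converter appends after CSS escapes is left out here.

```javascript
// toHex / padHex follow the definitions given in the text.
const toHex = (cp) => cp.toString(16).toUpperCase();
const padHex = (cp, n) => toHex(cp).padStart(n, "0");

// Illustrative subset only; not the converter's real dictionary.
const NAMED = new Map([[0xe4, "auml"], [0xa9, "copy"], [0x20ac, "euro"]]);

const encoders = {
  htmlDecimal: (cp) => `&#${cp};`,
  htmlHex: (cp) => `&#x${toHex(cp)};`,
  // Falls back to HTML hex when no named entity is known.
  htmlNamed: (cp) => (NAMED.has(cp) ? `&${NAMED.get(cp)};` : `&#x${toHex(cp)};`),
  cssHex: (cp) => "\\" + padHex(cp, 4),
  // Codepoints above 0xFFFF need the braced \u{...} form.
  jsEscape: (cp) => (cp > 0xffff ? "\\u{" + toHex(cp) + "}" : "\\u" + padHex(cp, 4)),
  url: (cp) => encodeURIComponent(String.fromCodePoint(cp)),
};

for (const [name, fn] of Object.entries(encoders)) {
  console.log(name, fn(0xe4));
}
```

For U+00E4 this prints `&#228;`, `&#xE4;`, `&auml;`, `\00E4`, `\u00E4`, and `%C3%A4`, matching the first row of the reference table.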

Reference Data

Character	Name	Codepoint	HTML Decimal	HTML Hex	HTML Named	CSS Hex	JS Escape
ä	Latin Small A with Diaeresis	U+00E4	&#228;	&#xE4;	&auml;	\00E4	\u00E4
ö	Latin Small O with Diaeresis	U+00F6	&#246;	&#xF6;	&ouml;	\00F6	\u00F6
ü	Latin Small U with Diaeresis	U+00FC	&#252;	&#xFC;	&uuml;	\00FC	\u00FC
ß	Latin Small Sharp S	U+00DF	&#223;	&#xDF;	&szlig;	\00DF	\u00DF
é	Latin Small E with Acute	U+00E9	&#233;	&#xE9;	&eacute;	\00E9	\u00E9
ñ	Latin Small N with Tilde	U+00F1	&#241;	&#xF1;	&ntilde;	\00F1	\u00F1
©	Copyright Sign	U+00A9	&#169;	&#xA9;	&copy;	\00A9	\u00A9
€	Euro Sign	U+20AC	&#8364;	&#x20AC;	&euro;	\20AC	\u20AC
£	Pound Sign	U+00A3	&#163;	&#xA3;	&pound;	\00A3	\u00A3
¥	Yen Sign	U+00A5	&#165;	&#xA5;	&yen;	\00A5	\u00A5
中	CJK Unified - Middle	U+4E2D	&#20013;	&#x4E2D;	-	\4E2D	\u4E2D
日	CJK Unified - Sun/Day	U+65E5	&#26085;	&#x65E5;	-	\65E5	\u65E5
Д	Cyrillic Capital De	U+0414	&#1044;	&#x414;	-	\0414	\u0414
я	Cyrillic Small Ya	U+044F	&#1103;	&#x44F;	-	\044F	\u044F
α	Greek Small Alpha	U+03B1	&#945;	&#x3B1;	&alpha;	\03B1	\u03B1
π	Greek Small Pi	U+03C0	&#960;	&#x3C0;	&pi;	\03C0	\u03C0
→	Rightwards Arrow	U+2192	&#8594;	&#x2192;	&rarr;	\2192	\u2192
∞	Infinity	U+221E	&#8734;	&#x221E;	&infin;	\221E	\u221E
♠	Black Spade Suit	U+2660	&#9824;	&#x2660;	&spades;	\2660	\u2660
😀	Grinning Face	U+1F600	&#128512;	&#x1F600;	-	\1F600	\u{1F600}
™	Trade Mark Sign	U+2122	&#8482;	&#x2122;	&trade;	\2122	\u2122
®	Registered Sign	U+00AE	&#174;	&#xAE;	&reg;	\00AE	\u00AE
°	Degree Sign	U+00B0	&#176;	&#xB0;	&deg;	\00B0	\u00B0
µ	Micro Sign	U+00B5	&#181;	&#xB5;	&micro;	\00B5	\u00B5
½	Vulgar Fraction One Half	U+00BD	&#189;	&#xBD;	&frac12;	\00BD	\u00BD

Frequently Asked Questions

Characters above codepoint U+FFFF occupy two UTF-16 code units (surrogate pairs). This converter uses codePointAt() which correctly reads the full codepoint. For JS escape format, these produce \u{1F600} syntax (ES6 extended escape). For CSS, the hex code is emitted directly (e.g., \1F600). HTML decimal and hex entities handle them natively since they accept any integer codepoint value.
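A short sketch of why this matters: `charCodeAt` sees only half of a surrogate pair, while `codePointAt` returns the whole codepoint. The helper name `codePoints` is made up for this example, and the index-advancing loop is one possible way to iterate, not necessarily the converter's own.

```javascript
const s = "\u{1F600}"; // grinning face, written as an escape so the source stays ASCII

console.log(s.length);                       // 2 (two UTF-16 code units)
console.log(s.charCodeAt(0).toString(16));   // d83d (high surrogate only)
console.log(s.codePointAt(0).toString(16));  // 1f600 (full codepoint)

// Index-based walk: step 2 code units after an astral codepoint, 1 otherwise.
function codePoints(text) {
  const cps = [];
  for (let i = 0; i < text.length; ) {
    const cp = text.codePointAt(i);
    cps.push(cp);
    i += cp > 0xffff ? 2 : 1;
  }
  return cps;
}

console.log(codePoints("a" + s + "b").join(", ")); // 97, 128512, 98
```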

Named entities like &auml; are human-readable in source code, making maintenance easier. However, only a small subset of characters has a named entity: HTML 4 defined about 250, and the current HTML standard defines more than 2,000 names, still a tiny fraction of Unicode. Numeric entities (decimal or hex) work for any Unicode character. If your text contains CJK, Cyrillic, or emoji, the converter has no named entity for most of those characters and falls back to hex encoding automatically.

A CSS hex escape (e.g., \00E4) is a backslash followed by one to six hex digits; it ends at the first non-hex character or after the sixth digit. If the next character in your string happens to be a hex digit (0-9, A-F, a-f), the browser absorbs it into the escape and produces the wrong character. Appending a single space after the escape is the standard solution, because the parser consumes that space as a terminator instead of rendering it. This converter adds a trailing space after each CSS hex escape to prevent the ambiguity.
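The ambiguity is easy to see with a small sketch. `cssEscape` is a hypothetical helper written for this example, and its `trailingSpace` flag exists only to show the two outputs side by side.

```javascript
// Hypothetical helper: builds a CSS hex escape, optionally followed by a space.
function cssEscape(cp, trailingSpace = true) {
  const hex = cp.toString(16).toUpperCase().padStart(4, "0");
  return "\\" + hex + (trailingSpace ? " " : "");
}

// Encoding U+00E4 followed by the letter "b":
// without the space, "b" is a hex digit and gets absorbed,
// so a CSS parser would read \00E4b as the single codepoint U+0E4B.
console.log(cssEscape(0xe4, false) + "b"); // \00E4b
console.log(cssEscape(0xe4) + "b");        // \00E4 b
```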

Characters like < (U+003C), > (U+003E), and & (U+0026) fall within the ASCII range (at or below U+007F). In "non-Latin only" mode, they pass through unchanged. In "convert all" mode, they are encoded like any other character; for example, & becomes &#38; in HTML decimal format. If you need to escape HTML markup characters, use "convert all" mode with HTML decimal output, keeping in mind that this mode also encodes every other ASCII character.

URL encoding (percent-encoding) operates on UTF-8 byte sequences, not codepoints directly. The character ü (U+00FC) encodes to %C3%BC because its UTF-8 representation is two bytes: 0xC3 and 0xBC. This differs from HTML/CSS/JS formats which reference the Unicode codepoint number. URL encoding is required for query parameters, path segments, and form data in HTTP. It is defined in RFC 3986.
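To make the byte-level behaviour concrete, here is a minimal sketch that percent-encodes from the UTF-8 bytes directly. `percentEncode` is a hypothetical helper; `TextEncoder` is a standard global in browsers and Node.js. Unlike `encodeURIComponent`, this helper encodes every byte, including ASCII letters, so it is meant for single non-ASCII characters only.

```javascript
// Hypothetical helper: percent-encode each UTF-8 byte of the input.
function percentEncode(ch) {
  const bytes = new TextEncoder().encode(ch); // Uint8Array of UTF-8 bytes
  return Array.from(
    bytes,
    (b) => "%" + b.toString(16).toUpperCase().padStart(2, "0")
  ).join("");
}

console.log(percentEncode("\u00FC"));      // %C3%BC (two bytes: 0xC3, 0xBC)
console.log(encodeURIComponent("\u00FC")); // %C3%BC (same result)
console.log(percentEncode("\u{1F600}"));   // %F0%9F%98%80 (four bytes)
```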

Mixed-script input needs no special handling. The converter iterates character by character using codePointAt(), advancing the index by two code units for surrogate pairs, and evaluates each character independently against the ASCII threshold. ASCII characters pass through (in "non-Latin only" mode), while Japanese (U+3000 - U+9FFF), Arabic (U+0600 - U+06FF), Cyrillic (U+0400 - U+04FF), and characters from any other script are converted to the selected format. No script-specific logic is needed because Unicode codepoints are script-agnostic integers.