About

Incorrect character encoding causes mojibake: garbled text that erodes user trust and can break search engine indexing. Any non-ASCII character (codepoint above U+007F) risks corruption when it passes through systems that assume 7-bit ASCII or a mismatched code page. This tool converts every non-ASCII character in your text to a numeric or named escape: HTML decimal (&#228;), HTML hexadecimal (&#xE4;), named HTML (&auml;), CSS hex (\00E4), JavaScript Unicode escape (\u00E4), or URL percent-encoding (%C3%A4). It handles the full Unicode range, including astral-plane characters above U+FFFF, by iterating over codepoints rather than UTF-16 code units.

The converter compares each codepoint against the ASCII boundary at 0x7F. Characters at or below this threshold pass through unchanged in "non-Latin only" mode. The tool assumes UTF-8 source text and does not handle legacy multi-byte encodings such as Shift_JIS or Big5; convert those to UTF-8 first. Pro tip: named HTML entities (like &auml;) improve source readability but exist only for a fixed list of characters: roughly 250 in HTML 4, extended to more than 2,000 names in HTML5, which is still a small fraction of Unicode. Numeric entities cover every Unicode character.


Formulas

Each character in the input string is examined by its Unicode codepoint cp, obtained via codePointAt. The ASCII boundary is defined at codepoint 0x7F (127 decimal). In "non-Latin only" mode, characters satisfying cp ≤ 0x7F pass through unchanged.

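\[
\text{output}(\text{char}) =
\begin{cases}
\text{char} & \text{if } cp \le \text{0x7F} \text{ and } \text{mode} = \text{NON\_LATIN} \\
\text{encode}(cp, \text{format}) & \text{otherwise}
\end{cases}
\]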

The encoding functions per format are:

HTML Decimal: output = "&#" + cp + ";"
HTML Hex: output = "&#x" + toHex(cp) + ";"
CSS Hex: output = "\" + padHex(cp, 4)
JS Escape: output = "\u" + padHex(cp, 4), or "\u{" + toHex(cp) + "}" if cp > 0xFFFF
URL Encoding: output = encodeURIComponent(char)

Where cp is the Unicode codepoint (an integer), toHex(cp) converts it to an uppercase hexadecimal string, and padHex(cp, n) zero-pads that hex string to at least n digits. For HTML Named entities, a dictionary lookup maps cp → entityName. If no named entity exists, the converter falls back to HTML Hex format.
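
The formulas above can be sketched in a few lines of JavaScript. This is a minimal illustration, not the tool's actual source: the names `encoders`, `convert`, and the `nonLatinOnly` flag are assumptions introduced here, and only `toHex`, `padHex`, `codePointAt`, and `encodeURIComponent` come from the text.

```javascript
// Helpers named after the formulas; everything else is hypothetical.
const toHex = (cp) => cp.toString(16).toUpperCase();
const padHex = (cp, n) => toHex(cp).padStart(n, "0");

// One encoder per output format (format keys are illustrative).
const encoders = {
  htmlDecimal: (cp) => `&#${cp};`,
  htmlHex: (cp) => `&#x${toHex(cp)};`,
  cssHex: (cp) => `\\${padHex(cp, 4)}`,
  jsEscape: (cp) => (cp > 0xffff ? `\\u{${toHex(cp)}}` : `\\u${padHex(cp, 4)}`),
  // Note: encodeURIComponent throws a URIError on a lone surrogate.
  url: (cp) => encodeURIComponent(String.fromCodePoint(cp)),
};

function convert(text, format, nonLatinOnly = true) {
  const encode = encoders[format];
  let out = "";
  // for...of walks the string by code point, so a surrogate pair
  // arrives as a single two-unit string.
  for (const ch of text) {
    const cp = ch.codePointAt(0);
    out += nonLatinOnly && cp <= 0x7f ? ch : encode(cp);
  }
  return out;
}

console.log(convert("Käse 😀", "htmlDecimal")); // K&#228;se &#128512;
console.log(convert("Käse 😀", "jsEscape"));    // K\u00E4se \u{1F600}
```

Passing `false` as the third argument corresponds to "convert all" mode, in which ASCII characters are encoded as well.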

Reference Data

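| Character | Name | Codepoint | HTML Decimal | HTML Hex | HTML Named | CSS Hex | JS Escape |
|---|---|---|---|---|---|---|---|
| ä | Latin Small A with Diaeresis | U+00E4 | &#228; | &#xE4; | &auml; | \00E4 | \u00E4 |
| ö | Latin Small O with Diaeresis | U+00F6 | &#246; | &#xF6; | &ouml; | \00F6 | \u00F6 |
| ü | Latin Small U with Diaeresis | U+00FC | &#252; | &#xFC; | &uuml; | \00FC | \u00FC |
| ß | Latin Small Sharp S | U+00DF | &#223; | &#xDF; | &szlig; | \00DF | \u00DF |
| é | Latin Small E with Acute | U+00E9 | &#233; | &#xE9; | &eacute; | \00E9 | \u00E9 |
| ñ | Latin Small N with Tilde | U+00F1 | &#241; | &#xF1; | &ntilde; | \00F1 | \u00F1 |
| © | Copyright Sign | U+00A9 | &#169; | &#xA9; | &copy; | \00A9 | \u00A9 |
| € | Euro Sign | U+20AC | &#8364; | &#x20AC; | &euro; | \20AC | \u20AC |
| £ | Pound Sign | U+00A3 | &#163; | &#xA3; | &pound; | \00A3 | \u00A3 |
| ¥ | Yen Sign | U+00A5 | &#165; | &#xA5; | &yen; | \00A5 | \u00A5 |
| 中 | CJK Unified - Middle | U+4E2D | &#20013; | &#x4E2D; | - | \4E2D | \u4E2D |
| 日 | CJK Unified - Sun/Day | U+65E5 | &#26085; | &#x65E5; | - | \65E5 | \u65E5 |
| Д | Cyrillic Capital De | U+0414 | &#1044; | &#x414; | - | \0414 | \u0414 |
| я | Cyrillic Small Ya | U+044F | &#1103; | &#x44F; | - | \044F | \u044F |
| α | Greek Small Alpha | U+03B1 | &#945; | &#x3B1; | &alpha; | \03B1 | \u03B1 |
| π | Greek Small Pi | U+03C0 | &#960; | &#x3C0; | &pi; | \03C0 | \u03C0 |
| → | Rightwards Arrow | U+2192 | &#8594; | &#x2192; | &rarr; | \2192 | \u2192 |
| ∞ | Infinity | U+221E | &#8734; | &#x221E; | &infin; | \221E | \u221E |
| ♠ | Black Spade Suit | U+2660 | &#9824; | &#x2660; | &spades; | \2660 | \u2660 |
| 😀 | Grinning Face | U+1F600 | &#128512; | &#x1F600; | - | \1F600 | \u{1F600} |
| ™ | Trade Mark Sign | U+2122 | &#8482; | &#x2122; | &trade; | \2122 | \u2122 |
| ® | Registered Sign | U+00AE | &#174; | &#xAE; | &reg; | \00AE | \u00AE |
| ° | Degree Sign | U+00B0 | &#176; | &#xB0; | &deg; | \00B0 | \u00B0 |
| µ | Micro Sign | U+00B5 | &#181; | &#xB5; | &micro; | \00B5 | \u00B5 |
| ½ | Vulgar Fraction One Half | U+00BD | &#189; | &#xBD; | &frac12; | \00BD | \u00BD |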

Frequently Asked Questions

Characters above codepoint U+FFFF occupy two UTF-16 code units (surrogate pairs). This converter uses codePointAt() which correctly reads the full codepoint. For JS escape format, these produce \u{1F600} syntax (ES6 extended escape). For CSS, the hex code is emitted directly (e.g., \1F600). HTML decimal and hex entities handle them natively since they accept any integer codepoint value.
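
To make the surrogate-pair behaviour concrete, here is a small sketch comparing `charCodeAt` with `codePointAt` and showing one way to advance the index past a pair. The helper name `codePoints` is an assumption made for illustration, not a function taken from the converter.

```javascript
const s = "😀"; // U+1F600, outside the Basic Multilingual Plane

console.log(s.length);                      // 2 (two UTF-16 code units)
console.log(s.charCodeAt(0).toString(16));  // d83d (high surrogate only)
console.log(s.codePointAt(0).toString(16)); // 1f600 (full codepoint)

// Index-based iteration: step by 2 whenever the codepoint
// needed a surrogate pair, otherwise by 1.
function codePoints(text) {
  const result = [];
  for (let i = 0; i < text.length; ) {
    const cp = text.codePointAt(i);
    result.push(cp);
    i += cp > 0xffff ? 2 : 1;
  }
  return result;
}

console.log(codePoints("a😀b")); // [ 97, 128512, 98 ]
```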
Named entities like &auml; are human-readable in source code, making maintenance easier. However, only about 250 characters have named entities in HTML 4, and even the larger HTML5 list of more than 2,000 names covers a small fraction of Unicode. Numeric entities (decimal or hex) cover every Unicode character. If your text contains CJK, Cyrillic, or emoji, most characters will have no named entity available and the converter falls back to hex encoding automatically.
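
The lookup-with-fallback behaviour might look roughly like the sketch below. The `NAMED` map holds only a handful of entries for illustration, and both it and the function name `htmlNamed` are hypothetical; the converter's real dictionary is not shown in this document.

```javascript
// Tiny illustrative subset of a codepoint -> entity-name dictionary.
const NAMED = new Map([
  [0xe4, "auml"],
  [0xa9, "copy"],
  [0x20ac, "euro"],
  [0x3b1, "alpha"],
]);

function htmlNamed(cp) {
  const name = NAMED.get(cp);
  return name !== undefined
    ? `&${name};`
    : `&#x${cp.toString(16).toUpperCase()};`; // no name: fall back to hex
}

console.log(htmlNamed(0xe4));   // &auml;
console.log(htmlNamed(0x4e2d)); // &#x4E2D;
```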
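A trailing space after a CSS hex escape matters. Per the CSS specification, a hex escape (e.g., \00E4) consists of one to six hex digits and ends at the first non-hex character or after the sixth digit. If the next character in your string happens to be a valid hex digit (0-9, a-f, A-F), the browser reads it as part of the escape and decodes the wrong character. Appending a single space after the escape is the standard solution, because one whitespace character directly after an escape is consumed as its terminator. This converter adds a trailing space after each CSS hex escape to prevent ambiguity.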
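
A short sketch of the trailing-space rule, assuming a hypothetical helper `cssEscape` that pads to four digits as in the CSS Hex formula. Without the space, "äb" would serialize as \00E4b, which a CSS parser reads as the single codepoint U+0E4B instead of ä followed by b.

```javascript
// Emit a CSS hex escape followed by the terminating space.
function cssEscape(cp) {
  const hex = cp.toString(16).toUpperCase().padStart(4, "0");
  return `\\${hex} `;
}

// Escape only non-ASCII characters, leaving ASCII untouched.
function toCss(text) {
  let out = "";
  for (const ch of text) {
    const cp = ch.codePointAt(0);
    out += cp > 0x7f ? cssEscape(cp) : ch;
  }
  return out;
}

console.log(JSON.stringify(toCss("äb"))); // "\\00E4 b"
```

`JSON.stringify` is used in the last line only to make the trailing space visible in the console output.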
Characters like < (U+003C), > (U+003E), and & (U+0026) fall within the ASCII range (at or below U+007F). In "non-Latin only" mode they pass through unchanged, since that mode only tests the codepoint against the ASCII boundary. In "convert all" mode they are encoded like any other character; for example, & becomes &#38; in HTML decimal format. If you need to escape HTML markup characters, use "convert all" mode with HTML decimal output, keeping in mind that it encodes every other character too.
URL encoding (percent-encoding) operates on UTF-8 byte sequences, not codepoints directly. The character ü (U+00FC) encodes to %C3%BC because its UTF-8 representation is two bytes: 0xC3 and 0xBC. This differs from HTML/CSS/JS formats which reference the Unicode codepoint number. URL encoding is required for query parameters, path segments, and form data in HTTP. It is defined in RFC 3986.
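
The byte-level nature of percent-encoding can be checked with the standard `TextEncoder` API. The function `percentEncode` below is an illustrative helper, not part of the converter; unlike `encodeURIComponent`, it encodes every byte, including unreserved ASCII characters.

```javascript
// Encode a string as UTF-8, then write each byte as %XX.
function percentEncode(text) {
  const bytes = new TextEncoder().encode(text); // Uint8Array of UTF-8 bytes
  return Array.from(
    bytes,
    (b) => "%" + b.toString(16).toUpperCase().padStart(2, "0")
  ).join("");
}

console.log(percentEncode("ü"));                              // %C3%BC
console.log(percentEncode("ü") === encodeURIComponent("ü")); // true
console.log(percentEncode("😀"));                             // %F0%9F%98%80
```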
Mixed-script text needs no special handling. The converter iterates character by character using codePointAt() with proper index advancement for surrogate pairs, and evaluates each character independently against the ASCII threshold. ASCII characters pass through (in "non-Latin only" mode) while Japanese (U+3000 - U+9FFF), Arabic (U+0600 - U+06FF), Cyrillic (U+0400 - U+04FF), and characters from any other script are converted to the selected entity format. No script-specific logic is needed because Unicode codepoints are script-agnostic integers.