About

Character encoding errors corrupt data silently. A single misidentified code point can break JSON parsing, damage database fields, or render text as mojibake across systems that disagree on encoding. This tool converts between raw text and its Unicode code point representation using the U+XXXX notation of the Unicode Standard (whose code points are aligned with ISO/IEC 10646). It handles the complete Basic Multilingual Plane (BMP, U+0000 to U+FFFF) and the supplementary planes (up to U+10FFFF) via surrogate-aware parsing. Output formats include hexadecimal, decimal, octal, binary, and HTML entities. The reverse direction accepts mixed notation: U+XXXX, \uXXXX, &#xHH;, &#DDD;, and 0xHHHH. Limitations: Unicode character names are provided for common blocks only, and normalization (NFC/NFD) is not applied.


Formulas

Each character in a string maps to an integer code point. The conversion from character c to its hexadecimal Unicode representation follows:

U+hex(codePoint(c)) padded to 4 - 6 digits

Where codePoint(c) extracts the scalar value via JavaScript's codePointAt(0), which correctly handles astral plane characters (code points > 0xFFFF) encoded as surrogate pairs in UTF-16. The padding rule: BMP characters (≤ U+FFFF) use 4 hex digits; supplementary characters use 5 or 6.
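The padding rule above can be sketched in a few lines of JavaScript. This is a minimal illustration, not the tool's actual source; the function names toUPlus and textToCodePoints are hypothetical.

```javascript
// Hypothetical helper: format one code point in U+ notation.
function toUPlus(codePoint) {
  // padStart pads BMP values to 4 hex digits; values that already
  // need 5 or 6 digits are left at their natural length.
  return "U+" + codePoint.toString(16).toUpperCase().padStart(4, "0");
}

// Hypothetical helper: one U+ entry per code point in the input.
function textToCodePoints(text) {
  // Spreading a string iterates by code point, so a surrogate pair
  // is kept together as a single element.
  return [...text].map((ch) => toUPlus(ch.codePointAt(0)));
}

// "A", the euro sign, and the grinning face emoji (written as escapes).
console.log(textToCodePoints("A\u20AC\u{1F600}").join(" "));
// prints: U+0041 U+20AC U+1F600
```

Iterating with the spread operator rather than an index loop is what keeps the emoji from being reported as two separate surrogate values.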

For the reverse (Unicode → text), the parser matches multiple notations via a union pattern:

pattern = U+XXXX | \uXXXX | &#xHH; | &#DDD; | 0xHHHH

Each matched token is parsed to an integer n, validated against 0 ≤ n ≤ 0x10FFFF (excluding surrogates 0xD800 - 0xDFFF), then reconstituted via String.fromCodePoint(n).
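A decoder along these lines might look like the sketch below. The regular expression, the digit limits in each alternative, and the names TOKEN, isScalarValue, and decodeTokens are assumptions made for illustration; the tool's real pattern may differ.

```javascript
// One alternative per notation; exactly one capture group is set per match.
// Digit counts (1 to 6 hex digits, 1 to 7 decimal digits) are assumptions.
const TOKEN =
  /U\+([0-9A-Fa-f]{1,6})|\\u([0-9A-Fa-f]{4})|&#x([0-9A-Fa-f]{1,6});|&#([0-9]{1,7});|0x([0-9A-Fa-f]{1,6})/g;

// True for Unicode scalar values: 0..0x10FFFF minus the surrogate range.
function isScalarValue(n) {
  return n >= 0 && n <= 0x10FFFF && !(n >= 0xD800 && n <= 0xDFFF);
}

// Hypothetical decoder: returns the decoded text plus any rejected tokens.
function decodeTokens(input) {
  let text = "";
  const invalid = [];
  for (const m of input.matchAll(TOKEN)) {
    // Group 4 is the decimal entity; every other group is hexadecimal.
    const n =
      m[4] !== undefined
        ? parseInt(m[4], 10)
        : parseInt(m[1] ?? m[2] ?? m[3] ?? m[5], 16);
    if (isScalarValue(n)) {
      text += String.fromCodePoint(n);
    } else {
      invalid.push(m[0]);
    }
  }
  return { text, invalid };
}

const result = decodeTokens("U+0048 \\u0069 &#x21; &#63; U+D800");
console.log(result.text);    // prints: Hi!?
console.log(result.invalid); // prints: [ 'U+D800' ]
```

Note that this sketch treats each \uXXXX escape independently, so a JSON-style surrogate pair such as \uD83D\uDE00 would be rejected rather than combined; handling that case would need an extra pairing step.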

Alternative output bases use standard positional notation:

Decimal: n₁₀ = codePoint
Octal: n₈ = codePoint.toString(8)
Binary: n₂ = codePoint.toString(2), padded to 8/16/21 bits
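As a rough sketch of these conversions, the snippet below uses Number.prototype.toString with a radix. The text only says binary is padded to 8/16/21 bits; the thresholds chosen here (8 bits up to 0xFF, 16 bits up to 0xFFFF, 21 bits otherwise) are an assumption, and the names binaryWidth and toBases are hypothetical.

```javascript
// Assumed width rule: narrowest of 8, 16, or 21 bits that fits the value.
function binaryWidth(n) {
  if (n <= 0xFF) return 8;
  if (n <= 0xFFFF) return 16;
  return 21;
}

// Hypothetical helper: the same code point in three positional notations.
function toBases(n) {
  return {
    decimal: n.toString(10),
    octal: n.toString(8),
    binary: n.toString(2).padStart(binaryWidth(n), "0"),
  };
}

console.log(toBases(0x41));
// prints: { decimal: '65', octal: '101', binary: '01000001' }
console.log(toBases(0x1F600).binary);
// prints: 000011111011000000000
```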

Reference Data

| Block Name | Range | Characters | Common Usage |
| --- | --- | --- | --- |
| Basic Latin (ASCII) | U+0000–U+007F | 128 | English letters, digits, punctuation |
| Latin-1 Supplement | U+0080–U+00FF | 128 | Accented letters (é, ü, ñ) |
| Latin Extended-A | U+0100–U+017F | 128 | Central/Eastern European scripts |
| Greek and Coptic | U+0370–U+03FF | 135 | α, β, γ, Δ, Σ math symbols |
| Cyrillic | U+0400–U+04FF | 256 | Russian, Ukrainian, Bulgarian |
| Arabic | U+0600–U+06FF | 256 | Arabic script languages |
| Devanagari | U+0900–U+097F | 128 | Hindi, Sanskrit, Marathi |
| CJK Unified Ideographs | U+4E00–U+9FFF | 20,992 | Chinese, Japanese Kanji, Korean Hanja |
| Hiragana | U+3040–U+309F | 93 | Japanese phonetic script |
| Katakana | U+30A0–U+30FF | 96 | Japanese foreign loanwords |
| Hangul Syllables | U+AC00–U+D7AF | 11,172 | Korean syllable blocks |
| General Punctuation | U+2000–U+206F | 111 | Em dash, ellipsis, non-breaking spaces |
| Currency Symbols | U+20A0–U+20CF | 33 | €, £, ¥, ₹, ₿ |
| Mathematical Operators | U+2200–U+22FF | 256 | ∀, ∃, ∞, ∫, ∑, ∏ |
| Arrows | U+2190–U+21FF | 112 | →, ←, ↑, ↓, ↔ |
| Box Drawing | U+2500–U+257F | 128 | Terminal/console UI borders |
| Emoticons | U+1F600–U+1F64F | 80 | 😀, 😂 emoji faces |
| Misc Symbols & Pictographs | U+1F300–U+1F5FF | 768 | 🎉, 🔥 common emoji |
| Private Use Area | U+E000–U+F8FF | 6,400 | Custom font glyphs (icon fonts) |
| Surrogates (reserved) | U+D800–U+DFFF | 2,048 | UTF-16 encoding pairs (not characters) |

Frequently Asked Questions

ASCII (American Standard Code for Information Interchange) defines 128 characters in the range U+0000 to U+007F, covering English letters, digits, and control characters. Unicode is a superset that assigns code points up to U+10FFFF, encoding over 149,000 characters across 161 scripts. Every valid ASCII character has the same code point in Unicode. This tool handles the full Unicode range, not just the 7-bit ASCII subset.
Characters above U+FFFF (such as the emoji 😀 at U+1F600) are stored as surrogate pairs in JavaScript's UTF-16 string representation. This converter uses codePointAt() instead of charCodeAt(), which correctly reads the full 21-bit code point rather than splitting it into two 16-bit surrogates. The output will show a single code point like U+1F600, not the surrogate pair D83D+DE00.
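The difference between the two methods is easy to see in a short snippet. The helper combineSurrogates is a hypothetical name, included only to show the standard UTF-16 arithmetic that codePointAt() performs internally.

```javascript
const face = "\u{1F600}"; // the grinning face emoji

console.log(face.length);                      // prints: 2 (UTF-16 code units)
console.log(face.charCodeAt(0).toString(16));  // prints: d83d (high surrogate)
console.log(face.charCodeAt(1).toString(16));  // prints: de00 (low surrogate)
console.log(face.codePointAt(0).toString(16)); // prints: 1f600

// Standard UTF-16 decoding of a high/low surrogate pair.
function combineSurrogates(high, low) {
  return (high - 0xD800) * 0x400 + (low - 0xDC00) + 0x10000;
}

console.log(combineSurrogates(0xD83D, 0xDE00).toString(16)); // prints: 1f600
```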
The decoder accepts five formats: U+XXXX (standard Unicode notation), \uXXXX (JavaScript/JSON escape), &#xHH; (HTML hex entity), &#DDD; (HTML decimal entity), and 0xHHHH (hexadecimal literal). Formats can be mixed in a single input. Tokens are separated by spaces, commas, or detected automatically by their prefix pattern.
Code points in the surrogate range (U+D800 to U+DFFF) are reserved for UTF-16 encoding mechanics and do not represent characters. Any value above U+10FFFF exceeds the defined Unicode space. The converter rejects these values and marks them as invalid in the output, conforming to the Unicode Standard Chapter 3 conformance requirements.
Hex (U+0041) is the standard in Unicode charts and specifications. Decimal (65) is used in programming contexts and ASCII tables. HTML hex entities (&#x41;) and decimal entities (&#65;) are used in HTML/XML documents to represent characters that may not be directly typeable. Binary output is useful for understanding the bit-level storage structure, particularly when debugging encoding at the byte level.
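For the two HTML entity forms, a minimal sketch could look like this. The function name toHtmlEntities is hypothetical, and the uppercase hex digits are a stylistic assumption (HTML accepts either case).

```javascript
// Hypothetical helper: numeric character references for one code point.
function toHtmlEntities(n) {
  return {
    hex: "&#x" + n.toString(16).toUpperCase() + ";",
    decimal: "&#" + n.toString(10) + ";",
  };
}

console.log(toHtmlEntities(0x41));
// prints: { hex: '&#x41;', decimal: '&#65;' }
console.log(toHtmlEntities(0x1F600).hex);
// prints: &#x1F600;
```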
The tool does not apply Unicode normalization. It converts characters to code points and back as-is. A character like é can exist as a single precomposed code point (U+00E9, the NFC form) or as two code points: e (U+0065) followed by a combining acute accent (U+0301), which is the NFD form. Both representations are shown exactly as entered. If normalization is required, apply it to your text before or after using this converter.
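If you need to normalize text yourself, JavaScript's built-in String.prototype.normalize() covers both forms. The snippet below is a small illustration using the é example; the variable names are arbitrary.

```javascript
const composed = "\u00E9";    // precomposed e with acute accent (NFC form)
const decomposed = "e\u0301"; // "e" plus combining acute accent (NFD form)

// The two strings render alike but are different code point sequences.
console.log(composed === decomposed);                  // prints: false
console.log([...composed].length);                     // prints: 1
console.log([...decomposed].length);                   // prints: 2

// Normalizing to a common form makes them compare equal.
console.log(composed.normalize("NFD") === decomposed); // prints: true
console.log(decomposed.normalize("NFC") === composed); // prints: true
```

Normalizing both inputs to the same form before comparing or storing them avoids mismatches between visually identical strings.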