About

Character encoding errors corrupt data silently. A single misidentified code point can break JSON parsing, corrupt database fields, or render text as mojibake across systems that disagree on encoding. This tool converts between raw text and its Unicode code point representation using the U+XXXX notation of the Unicode Standard (aligned with ISO/IEC 10646). It handles the complete Basic Multilingual Plane (BMP, U+0000 to U+FFFF) and the supplementary planes (up to U+10FFFF) via surrogate-aware parsing. Output formats include hexadecimal, decimal, octal, binary, and HTML entities. The reverse direction accepts mixed notation: U+XXXX, \uXXXX, &#xHH;, &#DDD;, and 0xHHHH. Limitations: Unicode character names are provided for common blocks only, and normalization (NFC/NFD) is not applied.

Formulas

Each character in a string maps to an integer code point. The conversion from character c to its hexadecimal Unicode representation follows:

U+hex(codePoint(c)) padded to 4 - 6 digits

Where codePoint(c) extracts the scalar value via JavaScript's codePointAt(0), which correctly handles astral plane characters (code points > 0xFFFF) encoded as surrogate pairs in UTF-16. The padding rule: BMP characters (≤ FFFF) use 4 hex digits; supplementary characters use 5 or 6.
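The padding rule can be sketched in a few lines of JavaScript. This is a minimal illustration, not the tool's actual source; the function name toUPlus is hypothetical.

```javascript
// Hypothetical helper: format each code point of a string as U+XXXX.
function toUPlus(text) {
  // Spreading a string iterates by code point, so surrogate pairs stay together.
  return [...text].map((ch) => {
    const cp = ch.codePointAt(0);
    // padStart(4) yields 4 digits for the BMP; larger values keep their 5 or 6 digits.
    return "U+" + cp.toString(16).toUpperCase().padStart(4, "0");
  });
}

console.log(toUPlus("A\u00E9\u{1F600}").join(" ")); // U+0041 U+00E9 U+1F600
```

Iterating with the spread operator rather than indexing by position avoids visiting the second half of a surrogate pair as if it were a separate character.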

For the reverse (Unicode → text), the parser matches multiple notations via a union pattern:

pattern = U+XXXX | \uXXXX | &#xHH; | &#DDD; | 0xHHHH

Each matched token is parsed to an integer n, validated against 0 ≤ n ≤ 0x10FFFF (excluding surrogates 0xD800 - 0xDFFF), then reconstituted via String.fromCodePoint(n).
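A decoder along these lines might look like the sketch below. The regular expression, the function name decodeTokens, and the shape of the return value are assumptions made for illustration; the tool's real parser may differ.

```javascript
// Hypothetical union pattern: one capture group per supported notation.
const TOKEN =
  /U\+([0-9A-Fa-f]{1,6})|\\u([0-9A-Fa-f]{4})|&#x([0-9A-Fa-f]+);|&#([0-9]+);|0x([0-9A-Fa-f]+)/g;

function decodeTokens(input) {
  let text = "";
  const invalid = [];
  for (const m of input.matchAll(TOKEN)) {
    // Group 4 (&#DDD;) is the only decimal notation; the others are hexadecimal.
    const n =
      m[4] !== undefined
        ? parseInt(m[4], 10)
        : parseInt(m[1] ?? m[2] ?? m[3] ?? m[5], 16);
    const isSurrogate = n >= 0xd800 && n <= 0xdfff;
    if (n > 0x10ffff || isSurrogate) {
      invalid.push(m[0]); // report the token instead of emitting a broken character
    } else {
      text += String.fromCodePoint(n);
    }
  }
  return { text, invalid };
}

const result = decodeTokens("U+0041 \\u00E9 &#x1F600; &#66; 0x43 U+D800");
console.log(result.text);    // Aé😀BC
console.log(result.invalid); // [ 'U+D800' ]
```

Because each token is validated on its own, this sketch also rejects a pair such as \uD83D\uDE00; joining escaped surrogate pairs into one code point would need an extra step that is not shown here.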

Alternative output bases use standard positional notation:

Decimal: n₁₀ = codePoint

Octal: n₈ = toString(8)

Binary: n₂ = toString(2) padded to 8/16/21 bits
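The alternative bases can be sketched as follows. The function name toBases is hypothetical, and the thresholds for choosing 8, 16, or 21 bits are an assumption; the text above only states the three widths.

```javascript
// Hypothetical helper: render one character's code point in other bases.
function toBases(ch) {
  const cp = ch.codePointAt(0);
  // Assumed rule: 8 bits up to U+00FF, 16 bits up to U+FFFF, 21 bits beyond.
  const width = cp <= 0xff ? 8 : cp <= 0xffff ? 16 : 21;
  return {
    decimal: String(cp),
    octal: cp.toString(8),
    binary: cp.toString(2).padStart(width, "0"),
  };
}

console.log(toBases("A"));
// { decimal: '65', octal: '101', binary: '01000001' }
console.log(toBases("\u{1F600}").binary); // 000011111011000000000
```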

Reference Data

Block Name	Range	Characters	Common Usage
Basic Latin (ASCII)	U+0000 - 007F	128	English letters, digits, punctuation
Latin-1 Supplement	U+0080 - 00FF	128	Accented letters (é, ü, ñ)
Latin Extended-A	U+0100 - 017F	128	Central/Eastern European scripts
Greek and Coptic	U+0370 - 03FF	135	α, β, γ, Δ, Σ math symbols
Cyrillic	U+0400 - 04FF	256	Russian, Ukrainian, Bulgarian
Arabic	U+0600 - 06FF	256	Arabic script languages
Devanagari	U+0900 - 097F	128	Hindi, Sanskrit, Marathi
CJK Unified Ideographs	U+4E00 - 9FFF	20,992	Chinese, Japanese Kanji, Korean Hanja
Hiragana	U+3040 - 309F	93	Japanese phonetic script
Katakana	U+30A0 - 30FF	96	Japanese foreign loanwords
Hangul Syllables	U+AC00 - D7AF	11,172	Korean syllable blocks
General Punctuation	U+2000 - 206F	111	Em dash, ellipsis, non-breaking spaces
Currency Symbols	U+20A0 - 20CF	33	€, ₩, ₹, ₽, ₿
Mathematical Operators	U+2200 - 22FF	256	∀, ∃, ∞, ∫, ∑, ∏
Arrows	U+2190 - 21FF	112	→, ←, ↑, ↓, ↔
Box Drawing	U+2500 - 257F	128	Terminal/console UI borders
Emoticons	U+1F600 - 1F64F	80	😀😂😍 emoji faces
Misc Symbols & Pictographs	U+1F300 - 1F5FF	768	🌍🎉🔥 common emoji
Private Use Area	U+E000 - F8FF	6,400	Custom font glyphs (icon fonts)
Surrogates (reserved)	U+D800 - DFFF	2,048	UTF-16 encoding pairs (not characters)
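A block lookup over ranges like those in the table can be sketched as a simple range search. The BLOCKS array below copies only a few rows for brevity, and the names BLOCKS and blockOf are hypothetical.

```javascript
// A few rows from the table above as [name, first, last]; not a complete list.
const BLOCKS = [
  ["Basic Latin (ASCII)", 0x0000, 0x007f],
  ["Latin-1 Supplement", 0x0080, 0x00ff],
  ["Greek and Coptic", 0x0370, 0x03ff],
  ["Cyrillic", 0x0400, 0x04ff],
  ["CJK Unified Ideographs", 0x4e00, 0x9fff],
  ["Surrogates (reserved)", 0xd800, 0xdfff],
  ["Emoticons", 0x1f600, 0x1f64f],
];

// Hypothetical helper: return the block name for a code point, or "Unknown".
function blockOf(cp) {
  const hit = BLOCKS.find(([, first, last]) => cp >= first && cp <= last);
  return hit ? hit[0] : "Unknown";
}

console.log(blockOf("A".codePointAt(0)));         // Basic Latin (ASCII)
console.log(blockOf("\u{1F600}".codePointAt(0))); // Emoticons
```

A linear scan is enough for a table this small; a full block list sorted by start value could use binary search instead.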

Frequently Asked Questions

ASCII (American Standard Code for Information Interchange) defines 128 characters in the range U+0000 to U+007F, covering English letters, digits, punctuation, and control characters. Unicode is a superset that assigns code points up to U+10FFFF and, as of Unicode 15, encodes over 149,000 characters across 161 scripts. Every valid ASCII character has the same code point in Unicode. This tool handles the full Unicode range, not just the 7-bit ASCII subset.

Characters above U+FFFF (such as emoji 😀 at U+1F600) are stored as surrogate pairs in JavaScript's UTF-16 string representation. This converter uses codePointAt() instead of charCodeAt(), which correctly reads the full 21-bit code point rather than splitting it into two 16-bit surrogates. The output will show a single code point like U+1F600, not the surrogate pair D83D+DE00.
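The difference between the two methods is easy to see in a short sketch. The helper toSurrogatePair is a hypothetical name; it applies the standard UTF-16 split for code points above U+FFFF.

```javascript
const face = "\u{1F600}"; // 😀

console.log(face.length);                      // 2 (UTF-16 code units)
console.log(face.charCodeAt(0).toString(16));  // d83d (high surrogate only)
console.log(face.codePointAt(0).toString(16)); // 1f600 (full code point)

// Hypothetical helper: split a supplementary code point into its surrogate pair.
function toSurrogatePair(cp) {
  const offset = cp - 0x10000;            // 20-bit value
  return [
    0xd800 + (offset >> 10),              // high surrogate: top 10 bits
    0xdc00 + (offset & 0x3ff),            // low surrogate: bottom 10 bits
  ];
}

console.log(toSurrogatePair(0x1f600).map((u) => u.toString(16))); // [ 'd83d', 'de00' ]
```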

The decoder accepts five formats: U+XXXX (standard Unicode notation), \uXXXX (JavaScript/JSON escape), &#xHHHH; (HTML hex entity), &#DDD; (HTML decimal entity), and 0xHHHH (hexadecimal literal). Formats can be mixed in a single input. Tokens are separated by spaces or commas, or detected automatically by their prefix pattern.

Code points in the surrogate range (U+D800 to U+DFFF) are reserved for UTF-16 encoding mechanics and do not represent characters. Any value above U+10FFFF exceeds the defined Unicode space. The converter rejects these values and marks them as invalid in the output, conforming to the Unicode Standard Chapter 3 conformance requirements.

Hex (U+0041) is the standard in Unicode charts and specifications. Decimal (65) is used in programming contexts and ASCII tables. HTML hex entities (&#x41;) and decimal entities (&#65;) are used in HTML/XML documents to represent characters that may not be directly typeable. Binary output is useful for understanding the bit-level storage structure, particularly when debugging encoding at the byte level.
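As a rough illustration, both HTML entity forms can be produced from a single code point. The function name toHtmlEntities is hypothetical and the sketch is not taken from the tool itself.

```javascript
// Hypothetical helper: build the numeric HTML entities for one character.
function toHtmlEntities(ch) {
  const cp = ch.codePointAt(0);
  return {
    hex: "&#x" + cp.toString(16).toUpperCase() + ";",
    decimal: "&#" + cp + ";",
  };
}

console.log(toHtmlEntities("A"));         // { hex: '&#x41;', decimal: '&#65;' }
console.log(toHtmlEntities("\u20AC").hex); // &#x20AC;
```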

The tool does not apply Unicode normalization. It converts characters to code points and back exactly as they appear in the input. A character like é can exist as a single precomposed code point (U+00E9, the NFC form) or as two code points: e (U+0065) followed by the combining acute accent (U+0301), the NFD form. Both representations are shown as-is. If normalization is required, apply it to your text before or after using this converter.
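For text handled in JavaScript, the built-in String.prototype.normalize method covers this step. The sketch below only demonstrates the two forms of é; the variable names are illustrative.

```javascript
const precomposed = "\u00E9"; // é as one code point (NFC form)
const decomposed = "e\u0301"; // e + combining acute accent (NFD form)

console.log(precomposed === decomposed);                  // false
console.log([...decomposed].length);                      // 2 code points
console.log(decomposed.normalize("NFC") === precomposed); // true
console.log(precomposed.normalize("NFD") === decomposed); // true
```

Normalizing both sides to the same form before comparing or storing text avoids mismatches between strings that look identical on screen.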