About

Unicode assigns a unique numerical identifier - a code point - to every character across all writing systems, technical symbols, and emoji. The full range spans from U+0000 to U+10FFFF, covering 1,114,112 possible positions across 17 planes. Misinterpreting a code point format or ignoring supplementary plane characters (anything above U+FFFF) leads to corrupted output, replacement characters (U+FFFD), or silent data loss in databases and APIs. This tool parses six common code point notations - U+XXXX, 0xXXXX, decimal integers, &#xHHHH;, &#DDD;, and \uXXXX - validates each against the legal Unicode range, rejects lone surrogates (U+D800 - U+DFFF), and reconstructs the original text using String.fromCodePoint. Correct rendering assumes your browser and OS have the required fonts installed; missing glyphs display as placeholder boxes and are not conversion errors.

Formulas

Each input token is matched against format-specific regular expressions. The extracted hexadecimal or decimal string is parsed to an integer code point value cp. The conversion rule is:

cp = parseInt(hexStr, 16) for hex formats
cp = parseInt(decStr, 10) for decimal formats

Validation requires:

0 ≤ cp ≤ 1,114,111 (0x10FFFF)
cp ∉ [0xD800, 0xDFFF] (surrogate range is illegal)
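
The parse-and-validate step can be sketched as follows, assuming the hex or decimal digits have already been extracted from the token. The names parseCodePoint and isUnicodeScalar are illustrative and are not taken from the tool itself.

```javascript
// Hypothetical helper: applies the two validation rules listed above.
function isUnicodeScalar(cp) {
  return Number.isInteger(cp) &&
    cp >= 0 && cp <= 0x10FFFF &&       // legal Unicode range
    !(cp >= 0xD800 && cp <= 0xDFFF);   // surrogate range is excluded
}

// Hypothetical helper: `digits` is the extracted string, `base` is 16 or 10.
// Returns the code point as an integer, or null if it is malformed or invalid.
function parseCodePoint(digits, base) {
  const pattern = base === 16 ? /^[0-9A-Fa-f]+$/ : /^[0-9]+$/;
  // parseInt alone would silently accept trailing junk such as "41zz".
  if (!pattern.test(digits)) return null;
  const cp = parseInt(digits, base);
  return isUnicodeScalar(cp) ? cp : null;
}

console.log(parseCodePoint("1F600", 16));  // 128512
console.log(parseCodePoint("65", 10));     // 65
console.log(parseCodePoint("D800", 16));   // null (surrogate)
console.log(parseCodePoint("110000", 16)); // null (above U+10FFFF)
```

Checking the digit string before calling parseInt is a design choice in this sketch: parseInt stops at the first invalid character rather than failing, so the explicit pattern test keeps malformed tokens from being partially accepted.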

Valid code points are converted to characters via String.fromCodePoint(cp). For supplementary plane characters (cp > 0xFFFF), this function internally creates a UTF-16 surrogate pair:

cp′ = cp − 0x10000
hi = 0xD800 + (cp′ >> 10)
lo = 0xDC00 + (cp′ & 0x3FF)

Where hi is the high surrogate and lo is the low surrogate. UTF-8 byte count per code point follows the encoding scheme:

1 byte if cp ≤ 0x7F
2 bytes if cp ≤ 0x7FF
3 bytes if cp ≤ 0xFFFF
4 bytes if cp ≤ 0x10FFFF
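
As a rough illustration, the surrogate-pair arithmetic and the UTF-8 length rule can be written out by hand. The function names toSurrogatePair and utf8Length are assumptions made for this sketch; String.fromCodePoint does the surrogate step for you in practice.

```javascript
// Hypothetical helper: splits a supplementary code point into UTF-16 surrogates.
function toSurrogatePair(cp) {
  if (cp < 0x10000 || cp > 0x10FFFF) {
    throw new RangeError("not a supplementary code point: " + cp);
  }
  const offset = cp - 0x10000;            // cp′ in the formulas above
  const hi = 0xD800 + (offset >> 10);     // top 10 bits
  const lo = 0xDC00 + (offset & 0x3FF);   // bottom 10 bits
  return [hi, lo];
}

// Hypothetical helper: UTF-8 length of an already validated code point.
function utf8Length(cp) {
  if (cp <= 0x7F) return 1;
  if (cp <= 0x7FF) return 2;
  if (cp <= 0xFFFF) return 3;
  return 4; // assumes cp <= 0x10FFFF was checked earlier
}

const [hiUnit, loUnit] = toSurrogatePair(0x1F600);
console.log(hiUnit.toString(16), loUnit.toString(16)); // d83d de00
console.log(String.fromCharCode(hiUnit, loUnit) === String.fromCodePoint(0x1F600)); // true
console.log(utf8Length(0x20AC)); // 3
```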

Reference Data

Unicode Plane	Range	Name	Code Points	Common Content
0	U+0000 - U+FFFF	Basic Multilingual Plane (BMP)	65,536	Latin, Cyrillic, Greek, CJK, common symbols
1	U+10000 - U+1FFFF	Supplementary Multilingual Plane	65,536	Emoji, historic scripts, musical symbols
2	U+20000 - U+2FFFF	Supplementary Ideographic Plane	65,536	CJK Unified Ideographs Extension B
3	U+30000 - U+3FFFF	Tertiary Ideographic Plane	65,536	CJK Extension G, H
4-13	U+40000 - U+DFFFF	Unassigned	655,360	Reserved for future use
14	U+E0000 - U+EFFFF	Supplementary Special-purpose Plane	65,536	Tag characters, variation selectors
15	U+F0000 - U+FFFFF	Supplementary Private Use Area-A	65,536	Private-use characters
16	U+100000 - U+10FFFF	Supplementary Private Use Area-B	65,536	Private-use characters
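
Because every plane holds exactly 0x10000 code points, the plane number is simply the code point shifted right by 16 bits. The sketch below assumes an already validated code point; planeOf and planeRange are illustrative names.

```javascript
// Hypothetical helper: plane index (0-16) of a valid code point.
function planeOf(cp) {
  return cp >> 16;
}

// Hypothetical helper: formats a plane's range in the same style as the table.
function planeRange(plane) {
  const fmt = (n) => "U+" + n.toString(16).toUpperCase().padStart(4, "0");
  const start = plane << 16;
  return fmt(start) + " - " + fmt(start + 0xFFFF);
}

console.log(planeOf(0x0041));   // 0
console.log(planeOf(0x1F600));  // 1
console.log(planeOf(0x10FFFF)); // 16
console.log(planeRange(1));     // U+10000 - U+1FFFF
```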

Input Format	Example	Regex Pattern	Base	Notes
U+XXXX	U+0041	U\+[0-9A-Fa-f]{1,6}	Hexadecimal	Most common Unicode notation
0xXXXX	0x0041	0x[0-9A-Fa-f]{1,6}	Hexadecimal	Programming hex literal
Decimal	65	[0-9]+	Decimal	Raw integer code point value
&#xHHHH;	&#x41;	&#x[0-9A-Fa-f]+;	Hexadecimal	HTML hex character reference
&#DDD;	&#65;	&#[0-9]+;	Decimal	HTML decimal character reference
\uXXXX	\u0041	\\u[0-9A-Fa-f]{4}	Hexadecimal	JavaScript/Java escape (BMP only)
\u{XXXXX}	\u{1F600}	\\u\{[0-9A-Fa-f]{1,6}\}	Hexadecimal	ES6+ extended escape (all planes)

Code Point	Character	Name	Block	UTF-8 Bytes
U+0041	A	Latin Capital Letter A	Basic Latin	1
U+00E9	é	Latin Small Letter E with Acute	Latin-1 Supplement	2
U+4E16	世	CJK Unified Ideograph	CJK Unified Ideographs	3
U+0410	А	Cyrillic Capital Letter A	Cyrillic	2
U+2603	☃	Snowman	Miscellaneous Symbols	3
U+1F600	😀	Grinning Face	Emoticons (Plane 1)	4
U+1F4A9	💩	Pile of Poo	Miscellaneous Symbols and Pictographs (Plane 1)	4
U+0000	NUL	Null Character	Basic Latin (C0 Controls)	1
U+FEFF	BOM	Zero Width No-Break Space (Byte Order Mark)	Arabic Presentation Forms-B	3
U+FFFD	�	Replacement Character	Specials	3
U+200B	(invisible)	Zero Width Space	General Punctuation	3
U+20AC	€	Euro Sign	Currency Symbols	3
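
The UTF-8 Bytes column can be cross-checked in a browser or Node.js with the standard TextEncoder, which always encodes to UTF-8. This is only a verification sketch; utf8ByteCount is an illustrative name.

```javascript
const encoder = new TextEncoder(); // TextEncoder always produces UTF-8

// Hypothetical helper: encodes one code point and counts the resulting bytes.
function utf8ByteCount(cp) {
  return encoder.encode(String.fromCodePoint(cp)).length;
}

for (const cp of [0x41, 0xE9, 0x4E16, 0x1F600]) {
  const label = "U+" + cp.toString(16).toUpperCase().padStart(4, "0");
  console.log(label, utf8ByteCount(cp));
}
// U+0041 1
// U+00E9 2
// U+4E16 3
// U+1F600 4
```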

Frequently Asked Questions

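Code points U+D800 - U+DFFF are reserved for UTF-16 surrogate pairs. They are not valid Unicode scalar values and cannot represent standalone characters. String.fromCodePoint does not reject them: given a surrogate value it returns a string holding a lone surrogate, which is ill-formed UTF-16 and is typically turned into U+FFFD when the text is later encoded as UTF-8. (A RangeError is thrown only for arguments that are negative, non-integer, or above 0x10FFFF.) This tool detects and flags surrogates before conversion to prevent such malformed output.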

Characters above U+FFFF (such as 😀 at U+1F600) reside on supplementary planes 1-16. JavaScript internally represents them as two UTF-16 code units (a surrogate pair), but String.fromCodePoint handles this transparently. The tool correctly converts any code point up to U+10FFFF regardless of plane. Rendering depends on font support in your browser and operating system.
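
A short demonstration of the difference between UTF-16 code units and code points for a supplementary character. toCodePoints is an illustrative helper name, not part of the tool.

```javascript
const face = String.fromCodePoint(0x1F600);

console.log(face.length);                      // 2 (UTF-16 code units)
console.log([...face].length);                 // 1 (string iteration is by code point)
console.log(face.codePointAt(0).toString(16)); // 1f600
console.log(face.charCodeAt(0).toString(16));  // d83d (high surrogate only)

// Hypothetical helper: the reverse direction, text back to code points.
function toCodePoints(text) {
  return Array.from(text, (ch) => ch.codePointAt(0));
}

console.log(toCodePoints("A" + face)); // [ 65, 128512 ]
```

Iterating with Array.from or the spread operator avoids splitting surrogate pairs, which indexing by position or using charCodeAt would do.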

Yes, when the format selector is set to Auto-Detect, the parser independently identifies each token's format using regex matching. You can freely mix U+0041, 0x42, 67, &#x44;, &#69;, and \u0046 in the same input. Each token is parsed according to its detected format. Ambiguous tokens (e.g., 65 could be decimal or a bare hex value) are resolved by the auto-detect priority: HTML entities first, then U+ prefix, then 0x prefix, then \u escapes, and finally bare numbers as decimal.
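
One way the priority order could be implemented is an ordered list of anchored patterns, tried top to bottom. This is a sketch under assumptions: the patterns are adapted from the format table, the names FORMATS and detectToken are invented here, and range validation is left out for brevity.

```javascript
// Ordered by the stated priority: HTML entities, U+, 0x, \u escapes, bare decimal.
const FORMATS = [
  { name: "html-hex",  re: /^&#x([0-9A-Fa-f]+);$/,         base: 16 },
  { name: "html-dec",  re: /^&#([0-9]+);$/,                base: 10 },
  { name: "u-plus",    re: /^U\+([0-9A-Fa-f]{1,6})$/,      base: 16 },
  { name: "0x",        re: /^0x([0-9A-Fa-f]{1,6})$/,       base: 16 },
  { name: "js-braced", re: /^\\u\{([0-9A-Fa-f]{1,6})\}$/,  base: 16 },
  { name: "js-escape", re: /^\\u([0-9A-Fa-f]{4})$/,        base: 16 },
  { name: "decimal",   re: /^([0-9]+)$/,                   base: 10 },
];

// Hypothetical helper: returns the first matching format and the parsed value.
function detectToken(token) {
  for (const { name, re, base } of FORMATS) {
    const m = re.exec(token);
    if (m) return { format: name, cp: parseInt(m[1], base) };
  }
  return null; // unrecognized token
}

const tokens = ["U+0041", "0x42", "67", "&#x44;", "&#69;", "\\u0046"];
const decoded = tokens.map((t) => String.fromCodePoint(detectToken(t).cp)).join("");
console.log(decoded); // ABCDEF
```

Placing the bare-decimal pattern last means a token such as 65 is read as decimal (the letter A) rather than as hex 0x65.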

Control characters in the range U+0000 - U+001F and U+007F - U+009F are valid Unicode code points and will convert successfully. However, they are non-printable. The output will contain the character but it may appear invisible or cause formatting side-effects (e.g., line breaks for U+000A). The detail table marks these as control characters and displays their Unicode name instead of attempting to render a glyph.
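
A minimal sketch of how control characters might be separated from printable ones before display. The helper names and the placeholder label format are assumptions; the tool itself is described as showing the Unicode name, which this sketch does not look up.

```javascript
// Hypothetical helper: true for the C0 and C1 control ranges named above
// (U+0000 - U+001F and U+007F - U+009F).
function isControl(cp) {
  return (cp >= 0x00 && cp <= 0x1F) || (cp >= 0x7F && cp <= 0x9F);
}

// Hypothetical helper: shows a label for controls instead of the raw character.
function displayForm(cp) {
  if (isControl(cp)) {
    return "<control U+" + cp.toString(16).toUpperCase().padStart(4, "0") + ">";
  }
  return String.fromCodePoint(cp);
}

console.log(displayForm(0x0A)); // <control U+000A>
console.log(displayForm(0x41)); // A
```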

Yes. Unicode combining sequences are order-dependent. An accented letter like é can be represented as a single code point (U+00E9) or as two: U+0065 (e) followed by U+0301 (combining acute accent). Flag emoji require specific Regional Indicator pairs (e.g., U+1F1FA U+1F1F8 for 🇺🇸). If the order is wrong or pairs are incomplete, the characters render individually rather than as a combined glyph.
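
The effect of ordering can be checked with the standard String.prototype.normalize method. The variable names below are illustrative.

```javascript
const composed = "\u00E9";         // é as a single code point
const decomposed = "\u0065\u0301"; // e followed by combining acute accent
const wrongOrder = "\u0301\u0065"; // accent placed before the base letter

console.log(composed === decomposed);                   // false
console.log(decomposed.normalize("NFC") === composed);  // true
console.log(wrongOrder.normalize("NFC") === composed);  // false

// A flag is two Regional Indicator code points, each a surrogate pair in UTF-16.
const flag = String.fromCodePoint(0x1F1FA, 0x1F1F8);
console.log([...flag].length); // 2 (code points)
console.log(flag.length);      // 4 (UTF-16 code units)
```

Comparing strings after NFC normalization is a common way to treat the precomposed and decomposed spellings as equal; the converter itself does not need to normalize to produce correct output.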

The conversion itself succeeded - the code point is valid. A box (□ or ▯) indicates that your system lacks a font containing a glyph for that code point; some renderers show a question mark diamond (�) instead, although that symbol more often signals a decoding error elsewhere in the pipeline. Missing glyphs are common for rare CJK extensions (Plane 2/3), historic scripts, and newly added emoji. Installing a comprehensive Unicode font like Noto Sans or adjusting your OS fallback font chain resolves most cases. The detail table still shows the correct code point value and Unicode name.