Accepted formats: U+XXXX, 0xHH, decimal, &#xHH;, &#DD;, 0b..., 0o..., separated by spaces, commas, or newlines.

About

Unicode code points are the canonical numeric identifiers for every character in the Unicode Standard (version 15.1 covers 149,813 assigned characters). A code point is expressed as U+XXXX, where XXXX is a hexadecimal value in the range 0 to 10FFFF. Misinterpreting a code point during serialization (for instance, confusing UTF-8 byte length with UTF-16 code unit count) produces mojibake, data corruption, or security vulnerabilities such as overlong encoding exploits. This tool parses one or more code points in multiple input notations and outputs the resolved character, its UTF-8 byte sequence, its UTF-16 code units (including surrogate pair decomposition for supplementary plane characters above U+FFFF), and the assigned Unicode block name.

Limitations: unassigned code points within valid ranges will resolve to a replacement glyph (□ or �) depending on font support. Surrogate code points (D800 - DFFF) are flagged as invalid because they are reserved for the UTF-16 encoding mechanism and do not represent characters. The tool assumes well-formed input; ambiguous bare numbers default to hexadecimal interpretation unless prefixed.

Formulas

UTF-8 encodes a code point U into a variable-length byte sequence based on its numeric range. The encoding rules follow RFC 3629:

1 byte: 0xxxxxxx if U ≤ 7F
2 bytes: 110xxxxx 10xxxxxx if U ≤ 7FF
3 bytes: 1110xxxx 10xxxxxx 10xxxxxx if U ≤ FFFF
4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx if U ≤ 10FFFF
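The range rules above translate fairly directly into bit operations. The following is a minimal Python sketch, not the converter's actual implementation; the function name encode_utf8 is hypothetical, and each result is cross-checked against Python's built-in encoder.

```python
def encode_utf8(cp: int) -> bytes:
    """Encode one code point by hand, following the RFC 3629 ranges."""
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError(f"not a Unicode scalar value: {cp:#x}")
    if cp <= 0x7F:       # 0xxxxxxx
        return bytes([cp])
    if cp <= 0x7FF:      # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp <= 0xFFFF:     # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])


for cp in (0x41, 0xE9, 0x4E16, 0x1F600):
    # The hand-rolled result should agree with the standard library.
    assert encode_utf8(cp) == chr(cp).encode("utf-8")
    print(f"U+{cp:04X} -> {encode_utf8(cp).hex(' ').upper()}")
```

For U+1F600 this prints F0 9F 98 80, the four-byte form from the last rule.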

For UTF-16, code points in the Basic Multilingual Plane (U ≤ FFFF) map directly to a single 16-bit code unit. Supplementary plane code points (U ≥ 10000) require a surrogate pair:

U′ = U − 10000
H = D800 + ⌊U′ / 400⌋ (high surrogate)
L = DC00 + (U′ mod 400) (low surrogate)

Where U = the code point value (hexadecimal), U′ = the offset from 10000, H = high surrogate (D800 - DBFF), L = low surrogate (DC00 - DFFF). Division by 400₁₆ (1024₁₀) partitions the 20-bit offset into two 10-bit halves.
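The surrogate formulas can be checked with a few lines of code. This is an illustrative Python sketch (the name to_utf16_units is hypothetical); the shift by 10 bits is the same operation as dividing by 400 hexadecimal, and the mask 3FF is the same as taking the remainder.

```python
def to_utf16_units(cp: int) -> list:
    """Return the UTF-16 code units for one code point."""
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError(f"not a Unicode scalar value: {cp:#x}")
    if cp <= 0xFFFF:
        return [cp]                    # BMP: a single code unit
    offset = cp - 0x10000              # U' in the formulas above
    high = 0xD800 + (offset >> 10)     # floor(U' / 0x400)
    low = 0xDC00 + (offset & 0x3FF)    # U' mod 0x400
    return [high, low]


units = to_utf16_units(0x1F600)
print(" ".join(f"{u:04X}" for u in units))  # D83D DE00
```

The first and last supplementary code points, U+10000 and U+10FFFF, map to D800 DC00 and DBFF DFFF, which are the corners of the two surrogate ranges.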

Reference Data

Block Range	Block Name	Characters	UTF-8 Bytes	Example
0000 - 007F	Basic Latin	128	1	A (U+0041)
0080 - 00FF	Latin-1 Supplement	128	2	é (U+00E9)
0100 - 017F	Latin Extended-A	128	2	š (U+0161)
0370 - 03FF	Greek and Coptic	135	2	α (U+03B1)
0400 - 04FF	Cyrillic	256	2	Д (U+0414)
0600 - 06FF	Arabic	256	2	ع (U+0639)
0900 - 097F	Devanagari	128	3	अ (U+0905)
2000 - 206F	General Punctuation	112	3	EM DASH (U+2014)
2100 - 214F	Letterlike Symbols	80	3	™ (U+2122)
2190 - 21FF	Arrows	112	3	→ (U+2192)
2200 - 22FF	Mathematical Operators	256	3	≠ (U+2260)
2500 - 257F	Box Drawing	128	3	│ (U+2502)
2600 - 26FF	Miscellaneous Symbols	256	3	★ (U+2605)
3000 - 303F	CJK Symbols and Punctuation	64	3	、 (U+3001)
4E00 - 9FFF	CJK Unified Ideographs	20,992	3	世 (U+4E16)
AC00 - D7AF	Hangul Syllables	11,184	3	한 (U+D55C)
D800 - DFFF	Surrogates (INVALID)	2,048	-	Reserved for UTF-16
E000 - F8FF	Private Use Area	6,400	3	Vendor-specific
FB00 - FB06	Alphabetic Presentation Forms (Latin ligatures only)	7	3	ﬁ (U+FB01)
FE00 - FE0F	Variation Selectors	16	3	VS1 - VS16
FEFF	BOM / Zero Width No-Break Space	1	3	BOM marker
FF00 - FFEF	Halfwidth and Fullwidth Forms	240	3	Ａ (U+FF21)
FFFD	Replacement Character	1	3	� (U+FFFD)
10000 - 1007F	Linear B Syllabary	88	4	Ancient script
1D400 - 1D7FF	Mathematical Alphanumeric Symbols	996	4	𝐀 (U+1D400)
1F300 - 1F5FF	Miscellaneous Symbols and Pictographs	768	4	🌟 (U+1F31F)
1F600 - 1F64F	Emoticons	80	4	😀 (U+1F600)
1F680 - 1F6FF	Transport and Map Symbols	128	4	🚀 (U+1F680)
1F900 - 1F9FF	Supplemental Symbols and Pictographs	256	4	🤔 (U+1F914)
E0001 - E007F	Tags	97	4	Language tags
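
A block name can be resolved from a sorted range table with a binary search. The sketch below is an assumption about how such a lookup might work, not the tool's code; BLOCKS holds only a handful of rows from the table above, and block_name is a hypothetical helper.

```python
from bisect import bisect_right

# (start, end, name), sorted by start; a small excerpt of the table above.
BLOCKS = [
    (0x0000, 0x007F, "Basic Latin"),
    (0x0080, 0x00FF, "Latin-1 Supplement"),
    (0x0370, 0x03FF, "Greek and Coptic"),
    (0x4E00, 0x9FFF, "CJK Unified Ideographs"),
    (0x1F600, 0x1F64F, "Emoticons"),
]
STARTS = [start for start, _, _ in BLOCKS]


def block_name(cp: int):
    """Return the block containing cp, or None if it is not in BLOCKS."""
    i = bisect_right(STARTS, cp) - 1   # last block starting at or before cp
    if i >= 0:
        start, end, name = BLOCKS[i]
        if cp <= end:
            return name
    return None


print(block_name(0x03B1))   # Greek and Coptic
print(block_name(0x1F600))  # Emoticons
```

A code point that falls between listed ranges, such as U+0100 in this excerpt, returns None rather than the nearest block.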

Frequently Asked Questions

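Surrogate code points are reserved exclusively for the UTF-16 encoding mechanism. They do not represent characters and cannot appear in well-formed Unicode text. The converter flags these as invalid and does not attempt to render a character. Note that JavaScript's String.fromCodePoint() does not reject them: it throws a RangeError only for values that are negative, non-integer, or above 10FFFF. Given a surrogate value, it returns a string holding a lone surrogate code unit, so surrogates have to be checked for explicitly.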
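
The same distinction shows up in other languages. In Python, used here purely for illustration, a lone surrogate can be built as a string but cannot be serialized to UTF-8. The helper name is_surrogate is hypothetical.

```python
def is_surrogate(cp: int) -> bool:
    """True for code points reserved for UTF-16 surrogate pairs."""
    return 0xD800 <= cp <= 0xDFFF


lone = chr(0xD800)            # building the string is allowed...
try:
    lone.encode("utf-8")      # ...but strict UTF-8 encoding refuses it
    encodable = True
except UnicodeEncodeError:
    encodable = False

print(is_surrogate(0xD800), encodable)  # True False
```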

The parser uses prefix detection: U+XXXX, 0xXXXX, and &#xHHHH; are treated as hexadecimal. &#DDD; is treated as decimal. Bare numbers without a prefix default to hexadecimal to align with Unicode convention. To force decimal interpretation, use the &#DDD; HTML entity notation. Binary (0b...) and octal (0o...) prefixes are also supported.
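
The prefix rules above could be implemented along these lines. This is a sketch under stated assumptions, not the converter's parser: parse_code_point is a hypothetical name, and the order in which prefixes are tested (so that 0b1 is read as binary rather than as the bare hexadecimal number B1) is a guess.

```python
import re


def parse_code_point(token: str) -> int:
    """Parse one token into a code point using prefix detection."""
    t = token.strip()
    m = re.fullmatch(r"&#[xX]([0-9A-Fa-f]+);", t)   # &#xHHHH; -> hex
    if m:
        return int(m.group(1), 16)
    m = re.fullmatch(r"&#([0-9]+);", t)             # &#DDD; -> decimal
    if m:
        return int(m.group(1), 10)
    m = re.fullmatch(r"[uU]\+([0-9A-Fa-f]+)", t)    # U+XXXX -> hex
    if m:
        return int(m.group(1), 16)
    low = t.lower()
    if low.startswith("0x"):
        return int(low[2:], 16)
    if low.startswith("0b"):
        return int(low[2:], 2)
    if low.startswith("0o"):
        return int(low[2:], 8)
    return int(t, 16)   # bare number: hexadecimal by default


for tok in ("U+0041", "0x41", "&#x41;", "&#65;", "0b1000001", "0o101", "41"):
    print(tok, "->", f"U+{parse_code_point(tok):04X}")
```

Every token in the loop resolves to U+0041. A token that matches none of the patterns and is not valid hexadecimal raises ValueError from int().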

A valid code point does not guarantee a visible glyph. The character may be unassigned in the current Unicode version, may be a control character (U+0000 - U+001F), or may simply lack font support in your browser. The converter reports the correct encoding bytes regardless of rendering. Supplementary plane characters (above U+FFFF) often require system fonts like Segoe UI Emoji or Noto to render.
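
One way to tell these cases apart is the Unicode general category. The Python sketch below is illustrative only: visibility_hint and LABELS are hypothetical names, and the answer for a given code point depends on the Unicode version bundled with the Python build, which may differ from the version the converter uses.

```python
import unicodedata

# General categories that usually explain a missing glyph.
LABELS = {
    "Cn": "unassigned or noncharacter",
    "Cc": "control",
    "Cf": "format (usually invisible)",
    "Co": "private use",
    "Cs": "surrogate",
}


def visibility_hint(cp: int) -> str:
    """Explain why a code point may not show a visible glyph."""
    category = unicodedata.category(chr(cp))
    return LABELS.get(category, "assigned, rendering depends on fonts")


for cp in (0x0041, 0x0000, 0xFEFF, 0xE000, 0x10FFFF):
    print(f"U+{cp:04X}: {visibility_hint(cp)}")
```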

The Unicode Standard defines code points from U+0000 to U+10FFFF, yielding a theoretical space of 1,114,112 code points. This upper bound is constrained by the UTF-16 encoding scheme: 20 bits of payload from surrogate pairs plus the 16-bit BMP yields exactly 17 planes of 65,536 code points each. Any value above 10FFFF is rejected by this converter.
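
The arithmetic behind the upper bound is easy to verify. The following Python lines are a small check rather than part of the tool; in_unicode_range is a hypothetical helper.

```python
PLANES = 17
PLANE_SIZE = 0x10000          # 65,536 code points per plane
TOTAL = PLANES * PLANE_SIZE

# 20 bits addressed through surrogate pairs, plus the 16-bit BMP.
assert TOTAL == 2**20 + 2**16 == 0x10FFFF + 1 == 1_114_112


def in_unicode_range(cp: int) -> bool:
    """True if cp lies in the Unicode code space 0..10FFFF."""
    return 0 <= cp <= 0x10FFFF


print(in_unicode_range(0x10FFFF), in_unicode_range(0x110000))  # True False
```

Python's own chr() enforces the same limit and raises ValueError for 0x110000.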

Code points U+0000 - U+007F require 1 byte (ASCII-compatible). U+0080 - U+07FF require 2 bytes. U+0800 - U+FFFF require 3 bytes. U+10000 - U+10FFFF require 4 bytes. This is not arbitrary: the payload grows from 7 bits (1 byte) to 11, 16, and 21 bits (2, 3, and 4 bytes), because each continuation byte carries 6 payload bits while the lead byte's length prefix grows. A common error is assuming all Unicode characters are 2 bytes (the "UTF-16 = Unicode" myth); in UTF-8, a single emoji like U+1F600 occupies 4 bytes.
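
The difference between UTF-8 byte length and UTF-16 code unit count can be seen by encoding the same character both ways. This Python sketch uses the standard codecs; encoded_sizes is a hypothetical helper name.

```python
def encoded_sizes(cp: int) -> tuple:
    """Return (UTF-8 byte count, UTF-16 code unit count) for a code point."""
    ch = chr(cp)
    utf8_bytes = len(ch.encode("utf-8"))
    utf16_units = len(ch.encode("utf-16-le")) // 2   # 2 bytes per code unit
    return utf8_bytes, utf16_units


for cp in (0x41, 0xE9, 0x4E16, 0x1F600):
    b, u = encoded_sizes(cp)
    print(f"U+{cp:04X}: {b} UTF-8 byte(s), {u} UTF-16 unit(s)")
```

U+4E16 takes 3 bytes in UTF-8 but a single UTF-16 code unit, while U+1F600 takes 4 bytes and 2 code units, so neither count can be derived from the other by a fixed factor.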

Multiple code points can be converted at once. Separate them with spaces, commas, or newlines; the converter parses each token independently. For example, entering "U+0048 U+0065 U+006C U+006C U+006F" produces the string "Hello". Mixed formats are supported: "0x48, &#101; U+6C &#108; 0x6F" also resolves to "Hello".
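
Splitting the input into tokens is the first step. The Python sketch below is deliberately simplified and is an assumption about the approach, not the converter's code: decode_tokens is a hypothetical name, and it handles only U+XXXX, 0xXXXX, and bare hexadecimal tokens, not the HTML entity, binary, or octal forms.

```python
import re


def decode_tokens(text: str) -> str:
    """Split on whitespace and commas, then decode each hex token."""
    tokens = [t for t in re.split(r"[\s,]+", text) if t]
    chars = []
    for t in tokens:
        digits = t[2:] if t.upper().startswith("U+") else t
        chars.append(chr(int(digits, 16)))   # int() also accepts a 0x prefix
    return "".join(chars)


print(decode_tokens("U+0048 U+0065 U+006C U+006C U+006F"))  # Hello
```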

The converter processes each code point individually. Variation selectors (U+FE00 - U+FE0F) and combining marks (U+0300 - U+036F) are valid code points and will be converted correctly. However, their visual rendering depends on the preceding base character. Entering a combining mark alone will show it attached to nothing, which is technically correct but visually confusing. The combined output string will render them in sequence.
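
The effect of a combining mark on its base character can be inspected with Python's unicodedata module. This is an illustrative sketch only; the converter itself is described above as working per code point and is not stated to perform any normalization.

```python
import unicodedata

base_plus_mark = "e" + chr(0x0301)     # e followed by COMBINING ACUTE ACCENT
composed = unicodedata.normalize("NFC", base_plus_mark)

# Two code points before normalization, one precomposed character after.
print([f"U+{ord(c):04X}" for c in base_plus_mark])  # ['U+0065', 'U+0301']
print([f"U+{ord(c):04X}" for c in composed])        # ['U+00E9']

# A nonzero combining class marks a character that attaches to a base.
print(unicodedata.combining(chr(0x0301)))           # 230
```

Both strings render as the same accented letter, which is why the combined output can look shorter than the list of code points that produced it.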