Code Points to Unicode Converter
Convert Unicode code points (U+XXXX, hex, decimal) to characters with UTF-8/UTF-16 byte sequences, block names, and detailed encoding info.
About
Unicode code points are the canonical numeric identifiers for every character in the Unicode Standard (currently version 15.1, covering 149,813 assigned characters). A code point is expressed as U+XXXX where XXXX is a hexadecimal value in the range 0 to 10FFFF. Misinterpreting a code point during serialization - confusing UTF-8 byte length with UTF-16 code unit count, for instance - produces mojibake, data corruption, or security vulnerabilities such as overlong encoding exploits. This tool parses one or more code points in multiple input notations and outputs the resolved character, its UTF-8 byte sequence, UTF-16 code units (including surrogate pair decomposition for supplementary plane characters above U+FFFF), and the assigned Unicode block name.
Limitations: unassigned code points within valid ranges will resolve to a replacement glyph (β‘ or οΏ½) depending on font support. Surrogate code points (D800 - DFFF) are flagged as invalid because they are reserved for the UTF-16 encoding mechanism and do not represent characters. The tool assumes well-formed input; ambiguous bare numbers default to hexadecimal interpretation unless prefixed.
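The parsing and validation behavior described above can be sketched roughly as follows. This is an illustrative sketch, not the tool's actual implementation: the function name `parse_code_point` is hypothetical, and the `#` prefix for decimal input is an assumption made for the example, since the text does not say which prefix the tool accepts.

```python
import re

MAX_CODE_POINT = 0x10FFFF


def parse_code_point(token: str) -> int:
    """Parse one token such as 'U+1F600', '0x41', '41', or '#65' into an integer."""
    text = token.strip()
    if re.fullmatch(r"[Uu]\+[0-9A-Fa-f]{1,6}", text):
        value = int(text[2:], 16)
    elif re.fullmatch(r"0[xX][0-9A-Fa-f]+", text):
        value = int(text[2:], 16)
    elif re.fullmatch(r"#[0-9]+", text):
        # Assumed decimal prefix; the real tool's syntax may differ.
        value = int(text[1:], 10)
    elif re.fullmatch(r"[0-9A-Fa-f]+", text):
        # Bare numbers default to hexadecimal, as stated in the limitations.
        value = int(text, 16)
    else:
        raise ValueError(f"unrecognized code point notation: {token!r}")

    if value > MAX_CODE_POINT:
        raise ValueError(f"U+{value:X} is above U+10FFFF")
    if 0xD800 <= value <= 0xDFFF:
        # Surrogates are reserved for UTF-16 and do not represent characters.
        raise ValueError(f"U+{value:04X} is a surrogate code point")
    return value
```

Under these assumptions, `parse_code_point("41")` and `parse_code_point("#65")` both return 65, while `parse_code_point("D800")` raises `ValueError`.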
Formulas
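UTF-8 encodes a code point U into a variable-length sequence of 1 to 4 bytes, determined by its numeric range. The encoding rules follow RFC 3629:
| Code Point Range | Bytes | Bit Pattern |
|---|---|---|
| 0000 - 007F | 1 | 0xxxxxxx |
| 0080 - 07FF | 2 | 110xxxxx 10xxxxxx |
| 0800 - FFFF | 3 | 1110xxxx 10xxxxxx 10xxxxxx |
| 10000 - 10FFFF | 4 | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx |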
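As a rough illustration of the range-based RFC 3629 rules, the sketch below encodes a single code point by hand. The function name `utf8_encode` is hypothetical. In practice Python's built-in `chr(cp).encode("utf-8")` does the same job; the manual version only makes the bit layout visible.

```python
def utf8_encode(cp: int) -> bytes:
    """Encode one code point as UTF-8 using the RFC 3629 range rules."""
    if cp < 0 or cp > 0x10FFFF:
        raise ValueError("code point out of range 0..10FFFF")
    if 0xD800 <= cp <= 0xDFFF:
        raise ValueError("surrogate code points cannot be encoded in UTF-8")

    if cp <= 0x7F:
        # 1 byte: 0xxxxxxx
        return bytes([cp])
    if cp <= 0x7FF:
        # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp <= 0xFFFF:
        # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([
            0xE0 | (cp >> 12),
            0x80 | ((cp >> 6) & 0x3F),
            0x80 | (cp & 0x3F),
        ])
    # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([
        0xF0 | (cp >> 18),
        0x80 | ((cp >> 12) & 0x3F),
        0x80 | ((cp >> 6) & 0x3F),
        0x80 | (cp & 0x3F),
    ])
```

For example, `utf8_encode(0x00E9)` yields the two bytes C3 A9, and `utf8_encode(0x1F600)` yields F0 9F 98 80.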
For UTF-16, code points in the Basic Multilingual Plane (U ≤ FFFF) map directly to a single 16-bit code unit. Supplementary plane code points (U ≥ 10000) require a surrogate pair:
U′ = U − 10000
H = D800 + ⌊U′ / 400⌋ (high surrogate)
L = DC00 + (U′ mod 400) (low surrogate)
Where U = the code point value (hexadecimal), U′ = the offset from 10000, H = high surrogate (D800 - DBFF), L = low surrogate (DC00 - DFFF). Division by 400₁₆ (1024₁₀) partitions the 20-bit offset into two 10-bit halves.
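The surrogate pair arithmetic can be checked with a short sketch. The function name `utf16_code_units` is hypothetical, and this is only one straightforward way to write the calculation. For U+1F600, the offset from 10000 is F600; integer division by 400 gives 3D and the remainder is 200, so the pair is D83D DE00.

```python
def utf16_code_units(cp: int) -> list[int]:
    """Return the UTF-16 code units (one or two 16-bit values) for a code point."""
    if cp < 0 or cp > 0x10FFFF:
        raise ValueError("code point out of range 0..10FFFF")
    if 0xD800 <= cp <= 0xDFFF:
        raise ValueError("surrogate code points do not represent characters")

    if cp <= 0xFFFF:
        # Basic Multilingual Plane: a single code unit equal to the code point.
        return [cp]

    offset = cp - 0x10000               # 20-bit offset from 10000
    high = 0xD800 + (offset // 0x400)   # top 10 bits -> high surrogate
    low = 0xDC00 + (offset % 0x400)     # bottom 10 bits -> low surrogate
    return [high, low]
```

`utf16_code_units(0x1F600)` returns `[0xD83D, 0xDE00]`, matching the worked example, while `utf16_code_units(0x4E16)` returns the single unit `[0x4E16]`.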
Reference Data
| Block Range | Block Name | Characters | UTF-8 Bytes | Example |
|---|---|---|---|---|
| 0000 - 007F | Basic Latin | 128 | 1 | A (U+0041) |
| 0080 - 00FF | Latin-1 Supplement | 128 | 2 | é (U+00E9) |
| 0100 - 017F | Latin Extended-A | 128 | 2 | š (U+0161) |
| 0370 - 03FF | Greek and Coptic | 135 | 2 | α (U+03B1) |
| 0400 - 04FF | Cyrillic | 256 | 2 | Д (U+0414) |
| 0600 - 06FF | Arabic | 256 | 2 | ع (U+0639) |
| 0900 - 097F | Devanagari | 128 | 3 | अ (U+0905) |
| 2000 - 206F | General Punctuation | 112 | 3 | — (U+2014) |
| 2100 - 214F | Letterlike Symbols | 80 | 3 | ™ (U+2122) |
| 2190 - 21FF | Arrows | 112 | 3 | → (U+2192) |
| 2200 - 22FF | Mathematical Operators | 256 | 3 | ≠ (U+2260) |
| 2500 - 257F | Box Drawing | 128 | 3 | │ (U+2502) |
| 2600 - 26FF | Miscellaneous Symbols | 256 | 3 | ★ (U+2605) |
| 3000 - 303F | CJK Symbols and Punctuation | 64 | 3 | 、 (U+3001) |
| 4E00 - 9FFF | CJK Unified Ideographs | 20,992 | 3 | 世 (U+4E16) |
| AC00 - D7AF | Hangul Syllables | 11,184 | 3 | 한 (U+D55C) |
| D800 - DFFF | Surrogates (INVALID) | 2,048 | - | Reserved for UTF-16 |
| E000 - F8FF | Private Use Area | 6,400 | 3 | Vendor-specific |
| FB00 - FB06 | Alphabetic Presentation Forms | 7 | 3 | ﬁ (U+FB01) |
| FE00 - FE0F | Variation Selectors | 16 | 3 | VS1 - VS16 |
| FF00 - FFEF | Halfwidth and Fullwidth Forms | 240 | 3 | Ａ (U+FF21) |
| FEFF | BOM / Zero Width No-Break Space | 1 | 3 | BOM marker |
| FFFD | Replacement Character | 1 | 3 | οΏ½ (U+FFFD) |
| 10000 - 1007F | Linear B Syllabary | 88 | 4 | Ancient script |
| 1D400 - 1D7FF | Mathematical Alphanumeric Symbols | 996 | 4 | 𝐀 (U+1D400) |
| 1F300 - 1F5FF | Miscellaneous Symbols and Pictographs | 768 | 4 | 🌟 (U+1F31F) |
| 1F600 - 1F64F | Emoticons | 80 | 4 | 😀 (U+1F600) |
| 1F680 - 1F6FF | Transport and Map Symbols | 128 | 4 | 🚀 (U+1F680) |
| 1F900 - 1F9FF | Supplemental Symbols and Pictographs | 256 | 4 | 🤔 (U+1F914) |
| E0001 - E007F | Tags | 97 | 4 | Language tags |
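
A block-name lookup over a table like the one above might be sketched with a binary search on the range start values. The names `BLOCKS`, `STARTS`, and `block_name` are hypothetical, and only a handful of rows are included here; a complete tool would presumably load every block, for instance from the `Blocks.txt` file of the Unicode Character Database.

```python
from bisect import bisect_right

# Excerpt of the reference table: (start, end, block name), sorted by start.
BLOCKS = [
    (0x0000, 0x007F, "Basic Latin"),
    (0x0080, 0x00FF, "Latin-1 Supplement"),
    (0x0370, 0x03FF, "Greek and Coptic"),
    (0x4E00, 0x9FFF, "CJK Unified Ideographs"),
    (0x1F600, 0x1F64F, "Emoticons"),
]
STARTS = [start for start, _, _ in BLOCKS]


def block_name(cp: int):
    """Return the block name containing cp, or None if no listed block matches."""
    # Find the last block whose start is <= cp, then check its upper bound.
    index = bisect_right(STARTS, cp) - 1
    if index >= 0:
        _, end, name = BLOCKS[index]
        if cp <= end:
            return name
    return None
```

With this excerpt, `block_name(0x03B1)` returns `"Greek and Coptic"`, and `block_name(0x0100)` returns `None` because Latin Extended-A is not among the sample rows.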