Accepted formats: U+XXXX, 0xHH, decimal, &#xHH;, &#DD;, 0b..., 0o..., separated by spaces, commas, or newlines.

About

Unicode code points are the canonical numeric identifiers for every character in the Unicode Standard (149,813 assigned characters as of version 15.1). A code point is written as U+XXXX, where XXXX is a hexadecimal value in the range 0 to 10FFFF. Misinterpreting a code point during serialization (for instance, confusing UTF-8 byte length with UTF-16 code unit count) produces mojibake, data corruption, or security vulnerabilities such as overlong encoding exploits. This tool parses one or more code points in multiple input notations and outputs the resolved character, its UTF-8 byte sequence, its UTF-16 code units (including surrogate pair decomposition for supplementary plane characters above U+FFFF), and the assigned Unicode block name.

Limitations: unassigned code points within valid ranges will resolve to a replacement glyph (β–‘ or οΏ½) depending on font support. Surrogate code points (D800 - DFFF) are flagged as invalid because they are reserved for the UTF-16 encoding mechanism and do not represent characters. The tool assumes well-formed input; ambiguous bare numbers default to hexadecimal interpretation unless prefixed.

Formulas

UTF-8 encodes a code point U into a variable-length byte sequence based on its numeric range. The encoding rules follow RFC 3629:

1 byte:  0xxxxxxx                             if U ≤ 7F
2 bytes: 110xxxxx 10xxxxxx                    if U ≤ 7FF
3 bytes: 1110xxxx 10xxxxxx 10xxxxxx           if U ≤ FFFF
4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx  if U ≤ 10FFFF
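
The bit layouts above can be sketched in Python as follows. This is a minimal illustration, not the tool's own implementation; the function name `utf8_encode` is hypothetical, and the surrogate check reflects the limitation stated in the About section.

```python
def utf8_encode(cp: int) -> bytes:
    """Encode one code point as UTF-8 using the RFC 3629 bit layouts."""
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError(f"code point out of range: {cp:#x}")
    if 0xD800 <= cp <= 0xDFFF:
        raise ValueError(f"surrogate code point: U+{cp:04X}")
    if cp <= 0x7F:
        # 0xxxxxxx
        return bytes([cp])
    if cp <= 0x7FF:
        # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp <= 0xFFFF:
        # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([
            0xE0 | (cp >> 12),
            0x80 | ((cp >> 6) & 0x3F),
            0x80 | (cp & 0x3F),
        ])
    # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([
        0xF0 | (cp >> 18),
        0x80 | ((cp >> 12) & 0x3F),
        0x80 | ((cp >> 6) & 0x3F),
        0x80 | (cp & 0x3F),
    ])


print(utf8_encode(0x1F600).hex(" "))  # f0 9f 98 80
```

For any non-surrogate code point the result should match Python's built-in `chr(cp).encode("utf-8")`, which is a convenient cross-check.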

For UTF-16, code points in the Basic Multilingual Plane (U ≤ FFFF) map directly to a single 16-bit code unit. Supplementary plane code points (U ≥ 10000) require a surrogate pair:

U′ = U − 10000
H = D800 + ⌊U′ / 400⌋ (high surrogate)
L = DC00 + (U′ mod 400) (low surrogate)

Where U = the code point value (hexadecimal), U′ = the offset from 10000, H = high surrogate (D800 - DBFF), L = low surrogate (DC00 - DFFF). All constants are hexadecimal. Division by 400 hex (1024 decimal) partitions the 20-bit offset into two 10-bit halves.
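
As a rough sketch of the surrogate pair arithmetic, the following Python function returns the UTF-16 code units for a code point. The name `utf16_units` is hypothetical; the shift and mask are equivalent to dividing by and taking the remainder modulo 400 hex.

```python
def utf16_units(cp: int) -> list[int]:
    """Return the UTF-16 code units (one, or a surrogate pair) for a code point."""
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError(f"code point out of range: {cp:#x}")
    if 0xD800 <= cp <= 0xDFFF:
        raise ValueError(f"surrogate code point: U+{cp:04X}")
    if cp <= 0xFFFF:
        return [cp]                       # BMP: a single 16-bit unit
    offset = cp - 0x10000                 # 20-bit offset U'
    high = 0xD800 + (offset >> 10)        # top 10 bits: floor(U' / 0x400)
    low = 0xDC00 + (offset & 0x3FF)       # bottom 10 bits: U' mod 0x400
    return [high, low]


print([f"{u:04X}" for u in utf16_units(0x1F600)])  # ['D83D', 'DE00']
```

Worked by hand for U+1F600: U′ = F600, F600 / 400 = 3D (remainder 200), so H = D800 + 3D = D83D and L = DC00 + 200 = DE00.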

Reference Data

Block RangeBlock NameCharactersUTF-8 BytesExample
0000 - 007FBasic Latin1281A (U+0041)
0080 - 00FFLatin-1 Supplement1282Γ© (U+00E9)
0100 - 017FLatin Extended-A1282Ε‘ (U+0161)
0370 - 03FFGreek and Coptic1352Ξ± (U+03B1)
0400 - 04FFCyrillic2562Π” (U+0414)
0600 - 06FFArabic2562ΨΉ (U+0639)
0900 - 097FDevanagari1283ΰ€… (U+0905)
2000 - 206FGeneral Punctuation1123- (U+2014)
2100 - 214FLetterlike Symbols803β„’ (U+2122)
2190 - 21FFArrows1123β†’ (U+2192)
2200 - 22FFMathematical Operators2563β‰  (U+2260)
2500 - 257FBox Drawing1283β”‚ (U+2502)
2600 - 26FFMiscellaneous Symbols2563β˜… (U+2605)
3000 - 303FCJK Symbols and Punctuation643、 (U+3001)
4E00 - 9FFFCJK Unified Ideographs20,9923δΈ– (U+4E16)
AC00 - D7AFHangul Syllables11,1843ν•œ (U+D55C)
D800 - DFFFSurrogates (INVALID)2,048 - Reserved for UTF-16
E000 - F8FFPrivate Use Area6,4003Vendor-specific
FB00 - FB06Alphabetic Presentation Forms73fi (U+FB01)
FE00 - FE0FVariation Selectors163VS1 - VS16
FF00 - FFEFHalfwidth and Fullwidth Forms2403οΌ‘ (U+FF21)
FEFFBOM / Zero Width No-Break Space13BOM marker
FFFDReplacement Character13οΏ½ (U+FFFD)
10000 - 1007FLinear B Syllabary884Ancient script
1D400 - 1D7FFMathematical Alphanumeric Symbols9964𝐀 (U+1D400)
1F300 - 1F5FFMiscellaneous Symbols and Pictographs7684🌟 (U+1F31F)
1F600 - 1F64FEmoticons804πŸ˜€ (U+1F600)
1F680 - 1F6FFTransport and Map Symbols1284πŸš€ (U+1F680)
1F900 - 1F9FFSupplemental Symbols and Pictographs2564πŸ€” (U+1F914)
E0001 - E007FTags974Language tags

Frequently Asked Questions

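Surrogate code points are reserved exclusively for the UTF-16 encoding mechanism. They do not represent characters and cannot appear in well-formed Unicode text. The converter flags these as invalid and does not attempt to render a character. Note that JavaScript's String.fromCodePoint() does not reject them: it accepts a surrogate value and returns a string containing a lone surrogate, and it throws a RangeError only for values that are negative, non-integer, or above 0x10FFFF. A converter therefore has to check the D800 - DFFF range explicitly.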
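
To illustrate why surrogates cannot be serialized, here is a small Python sketch. The helper names `is_surrogate` and `try_utf8` are hypothetical. Python's `chr()` will build a string holding a lone surrogate, but its strict UTF-8 codec refuses to encode it.

```python
def is_surrogate(cp: int) -> bool:
    """True for code points reserved for UTF-16 surrogate pairs."""
    return 0xD800 <= cp <= 0xDFFF


def try_utf8(cp: int):
    """Return the UTF-8 bytes for cp, or None if the strict codec rejects it."""
    try:
        return chr(cp).encode("utf-8")
    except UnicodeEncodeError:
        return None


for cp in (0xD7FF, 0xD800, 0xDFFF, 0xE000):
    print(f"U+{cp:04X} surrogate={is_surrogate(cp)} utf8={try_utf8(cp)}")
```

The code points immediately below and above the range (U+D7FF and U+E000) encode normally, which shows that the rejection is specific to D800 - DFFF.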
The parser uses prefix detection: U+XXXX, 0xXXXX, and &#xHHHH; are treated as hexadecimal. &#DDD; is treated as decimal. Bare numbers without a prefix default to hexadecimal to align with Unicode convention. To force decimal interpretation, use the &#DDD; HTML entity notation. Binary (0b...) and octal (0o...) prefixes are also supported.
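
The prefix rules can be approximated with a short Python sketch. This is an assumption about how such a parser might look, not the tool's actual code; `parse_code_point` and `parse_input` are hypothetical names, and the entity forms follow the `&#xHH;` and `&#DD;` notations listed under accepted formats.

```python
import re


def parse_code_point(token: str) -> int:
    """Parse one token into a code point using prefix detection."""
    t = token.strip()
    low = t.lower()
    if low.startswith("u+"):
        value = int(t[2:], 16)
    elif low.startswith("&#x") and t.endswith(";"):
        value = int(t[3:-1], 16)          # hexadecimal HTML entity
    elif low.startswith("&#") and t.endswith(";"):
        value = int(t[2:-1], 10)          # decimal HTML entity
    elif low.startswith("0x"):
        value = int(t[2:], 16)
    elif low.startswith("0b"):
        value = int(t[2:], 2)
    elif low.startswith("0o"):
        value = int(t[2:], 8)
    else:
        value = int(t, 16)                # bare numbers default to hexadecimal
    if not 0 <= value <= 0x10FFFF:
        raise ValueError(f"out of range: {token!r}")
    if 0xD800 <= value <= 0xDFFF:
        raise ValueError(f"surrogate: {token!r}")
    return value


def parse_input(text: str) -> list[int]:
    """Split on whitespace or commas and parse each token independently."""
    return [parse_code_point(tok) for tok in re.split(r"[\s,]+", text.strip()) if tok]


print("".join(map(chr, parse_input("0x48, &#101; U+6C &#108; 0x6F"))))  # Hello
```

Under these rules a bare `108` parses as hexadecimal (U+0108), not as decimal 108, which is why the decimal entity form is needed to get the letter "l".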
A valid code point does not guarantee a visible glyph. The character may be unassigned in the current Unicode version, may be a control character (U+0000 - U+001F), or may simply lack font support in your browser. The converter reports the correct encoding bytes regardless of rendering. Supplementary plane characters (above U+FFFF) often require system fonts like Segoe UI Emoji or Noto to render.
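
One way to tell these cases apart programmatically is the Unicode general category. The sketch below uses Python's standard `unicodedata` module; the function name `describe` and its labels are hypothetical, and the results depend on the Unicode version bundled with the Python interpreter, which may differ from the version the converter uses.

```python
import unicodedata


def describe(cp: int) -> str:
    """Classify a code point by its Unicode general category."""
    category = unicodedata.category(chr(cp))
    labels = {
        "Cn": "unassigned or noncharacter",
        "Cc": "control character",
        "Cf": "format character (normally invisible)",
        "Co": "private use",
        "Cs": "surrogate",
    }
    return labels.get(category, f"assigned ({category})")


for cp in (0x0000, 0x0041, 0xE000, 0xFEFF, 0xFFFF):
    print(f"U+{cp:04X}: {describe(cp)}")
```

A code point reported as assigned can still fail to render if no installed font covers it; the category only rules out the "no character exists here" explanations.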
The Unicode Standard defines code points from U+0000 to U+10FFFF, yielding a theoretical space of 1,114,112 code points. This upper bound is constrained by the UTF-16 encoding scheme: 20 bits of payload from surrogate pairs plus the 16-bit BMP yields exactly 17 planes of 65,536 code points each. Any value above 10FFFF is rejected by this converter.
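
The arithmetic behind the 10FFFF limit can be checked in a few lines of Python. The variable names are illustrative only.

```python
PLANE_SIZE = 0x10000                       # 65,536 code points per plane
HIGH_SURROGATES = 0xDBFF - 0xD800 + 1      # 1,024 high surrogates
LOW_SURROGATES = 0xDFFF - 0xDC00 + 1       # 1,024 low surrogates

# Every (high, low) pair addresses one supplementary code point: 2**20 in total.
supplementary = HIGH_SURROGATES * LOW_SURROGATES

total = PLANE_SIZE + supplementary         # BMP plus 16 supplementary planes
planes = total // PLANE_SIZE
max_code_point = total - 1

# Excluding the 2,048 surrogates leaves the encodable scalar values.
scalar_values = total - (HIGH_SURROGATES + LOW_SURROGATES)

print(total, planes, hex(max_code_point), scalar_values)
# 1114112 17 0x10ffff 1112064
```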
Code points U+0000 - U+007F require 1 byte (ASCII-compatible). U+0080 - U+07FF require 2 bytes. U+0800 - U+FFFF require 3 bytes. U+10000 - U+10FFFF require 4 bytes. This is not arbitrary: the payload capacity grows from 7 bits (1 byte) to 11, 16, and 21 bits (2, 3, and 4 bytes), because each continuation byte carries 6 payload bits while the lead byte gives up bits to signal the sequence length. A common error is assuming all Unicode characters are 2 bytes (the "UTF-16 = Unicode" myth); in UTF-8, a single emoji like U+1F600 occupies 4 bytes.
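
The difference between the two encodings is easy to see by measuring a few characters with Python's built-in codecs. The helper name `encoded_sizes` is hypothetical.

```python
def encoded_sizes(cp: int) -> tuple[int, int]:
    """Return (UTF-8 byte count, UTF-16 byte count) for one code point."""
    ch = chr(cp)
    # "utf-16-le" is used so that no byte order mark is added to the count.
    return len(ch.encode("utf-8")), len(ch.encode("utf-16-le"))


for cp in (0x41, 0xE9, 0x4E16, 0x1F600):
    utf8_len, utf16_len = encoded_sizes(cp)
    print(f"U+{cp:04X}: UTF-8 {utf8_len} bytes, UTF-16 {utf16_len} bytes")
```

Neither encoding is fixed-width: UTF-8 ranges from 1 to 4 bytes and UTF-16 uses 2 or 4 bytes, so "one character equals two bytes" holds only for part of the BMP in UTF-16.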
Yes, multiple code points can be converted at once. Separate them with spaces, commas, or newlines. The converter parses each token independently. For example, entering "U+0048 U+0065 U+006C U+006C U+006F" produces the string "Hello". Mixed formats are supported: "0x48, &#101; U+6C &#108; 0x6F" also resolves to "Hello". Note that the decimal values must use the &#DDD; form, because a bare 108 would be read as hexadecimal.
The converter processes each code point individually. Variation selectors (U+FE00 - U+FE0F) and combining marks (U+0300 - U+036F) are valid code points and will be converted correctly. However, their visual rendering depends on the preceding base character. Entering a combining mark alone will show it attached to nothing, which is technically correct but visually confusing. The combined output string will render them in sequence.
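
The behavior of combining marks and variation selectors can be explored with Python's standard `unicodedata` module. This is only an illustration of the underlying Unicode properties, not of the converter's internals.

```python
import unicodedata

# A base letter followed by U+0301 COMBINING ACUTE ACCENT: two code points.
decomposed = chr(0x0065) + chr(0x0301)
print(len(decomposed), [f"U+{ord(c):04X}" for c in decomposed])

# A non-zero combining class marks a character that attaches to a base.
print(unicodedata.combining(chr(0x0301)))   # 230 (positioned above the base)

# NFC normalization composes the pair into the single code point U+00E9.
composed = unicodedata.normalize("NFC", decomposed)
print(len(composed), f"U+{ord(composed):04X}")

# A variation selector has combining class 0 and is never composed away;
# it stays a separate code point that modifies how the base is displayed.
heart = chr(0x2764) + chr(0xFE0F)
print(len(unicodedata.normalize("NFC", heart)))
```

Because the converter handles each code point individually, it would report the decomposed sequence as two entries even though a renderer draws a single accented letter.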