Code Points to UTF-8 Converter
Convert Unicode code points (U+XXXX) to UTF-8 byte sequences and back. Supports hex, decimal, and batch input with full RFC 3629 encoding.
Accepted: U+XXXX, 0xXXXX, or decimal. Separate with spaces, commas, or newlines.
| Code Point | Character | UTF-8 Bytes | Binary | Byte Count |
|---|---|---|---|---|
About
UTF-8 encodes each Unicode code point into a variable-length sequence of 1 to 4 bytes. Getting the byte boundaries wrong corrupts entire text streams. A single misplaced continuation byte (0x80 - 0xBF) can cascade into mojibake across thousands of characters. This tool implements the RFC 3629 encoding algorithm directly: it reads your code point U, determines the byte count from the scalar value, applies the correct bit masks, and outputs the exact hex byte sequence. It handles surrogate validation (rejecting U+D800 - U+DFFF) and enforces the ceiling at U+10FFFF. Batch input is supported for processing entire character sets at once.
Note: this tool operates on Unicode scalar values only. It does not process UTF-16 surrogate pairs as input. If you need to debug a UTF-16 stream, decode the surrogate pair to its code point first using the formula cp = 0x10000 + (H − 0xD800) × 0x400 + (L − 0xDC00), then feed the result here. Pro tip: when debugging network protocols, remember that the BOM (byte order mark, U+FEFF) encodes to EF BB BF in UTF-8. Its unexpected presence is a common cause of invisible parsing failures in JSON and XML feeds.
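The surrogate-pair formula above can be sketched in Python. This is an illustrative helper, not part of the tool; the function name and error handling are assumptions.

```python
def surrogate_pair_to_code_point(high: int, low: int) -> int:
    """Decode a UTF-16 surrogate pair (H, L) to a Unicode code point,
    using cp = 0x10000 + (H - 0xD800) * 0x400 + (L - 0xDC00)."""
    if not (0xD800 <= high <= 0xDBFF):
        raise ValueError(f"not a high surrogate: {high:#06x}")
    if not (0xDC00 <= low <= 0xDFFF):
        raise ValueError(f"not a low surrogate: {low:#06x}")
    return 0x10000 + (high - 0xD800) * 0x400 + (low - 0xDC00)

# U+1F600 is represented in UTF-16 as the surrogate pair D83D DE00:
print(hex(surrogate_pair_to_code_point(0xD83D, 0xDE00)))  # 0x1f600
```

The decoded value can then be pasted into the converter as a single code point.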
Formulas
The UTF-8 encoding algorithm maps a Unicode scalar value U to a byte sequence of length n (1 ≤ n ≤ 4). The byte count is determined by the magnitude of U: n = 1 for U ≤ 0x7F, n = 2 for U ≤ 0x7FF, n = 3 for U ≤ 0xFFFF, and n = 4 for U ≤ 0x10FFFF.
For a 1-byte encoding, the output byte equals U directly: byte = U.
For multi-byte sequences, the leading byte carries a prefix of n one-bits followed by a zero, plus the top bits of U. Each continuation byte has the prefix 10 and carries 6 bits of payload. The encoding for n = 2:
byte1 = 0xC0 | (U >> 6)
byte2 = 0x80 | (U & 0x3F)
For n = 3:
byte1 = 0xE0 | (U >> 12)
byte2 = 0x80 | ((U >> 6) & 0x3F)
byte3 = 0x80 | (U & 0x3F)
For n = 4:
byte1 = 0xF0 | (U >> 18)
byte2 = 0x80 | ((U >> 12) & 0x3F)
byte3 = 0x80 | ((U >> 6) & 0x3F)
byte4 = 0x80 | (U & 0x3F)
Where U = Unicode scalar value (integer), | = bitwise OR, >> = right shift, & = bitwise AND. The surrogate range 0xD800 - 0xDFFF is excluded as these are not valid scalar values per the Unicode Standard.
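Putting the range checks and the per-length formulas together, the full algorithm can be sketched in Python (a minimal illustration; in production, Python's built-in chr(u).encode("utf-8") does the same work):

```python
def encode_utf8(u: int) -> bytes:
    """Encode a Unicode scalar value U as UTF-8 bytes per RFC 3629."""
    if u < 0 or u > 0x10FFFF:
        raise ValueError("code point above U+10FFFF is not encodable")
    if 0xD800 <= u <= 0xDFFF:
        raise ValueError("surrogates are not scalar values")
    if u <= 0x7F:                               # n = 1: 0xxxxxxx
        return bytes([u])
    if u <= 0x7FF:                              # n = 2: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (u >> 6),
                      0x80 | (u & 0x3F)])
    if u <= 0xFFFF:                             # n = 3: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (u >> 12),
                      0x80 | ((u >> 6) & 0x3F),
                      0x80 | (u & 0x3F)])
    return bytes([0xF0 | (u >> 18),             # n = 4: 11110xxx 10xxxxxx ...
                  0x80 | ((u >> 12) & 0x3F),
                  0x80 | ((u >> 6) & 0x3F),
                  0x80 | (u & 0x3F)])

print(encode_utf8(0x20AC).hex(" ").upper())  # prints "E2 82 AC"
```

Each branch applies exactly the bit masks listed above; the leading-byte constants 0xC0, 0xE0, and 0xF0 are the 110, 1110, and 11110 prefixes.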
Reference Data
| Code Point Range | UTF-8 Bytes | Byte 1 | Byte 2 | Byte 3 | Byte 4 | Example | Encoded |
|---|---|---|---|---|---|---|---|
| U+0000 - U+007F | 1 | 0xxxxxxx | - | - | - | U+0041 (A) | 41 |
| U+0080 - U+07FF | 2 | 110xxxxx | 10xxxxxx | - | - | U+00E9 (é) | C3 A9 |
| U+0800 - U+FFFF | 3 | 1110xxxx | 10xxxxxx | 10xxxxxx | - | U+4E16 (世) | E4 B8 96 |
| U+10000 - U+10FFFF | 4 | 11110xxx | 10xxxxxx | 10xxxxxx | 10xxxxxx | U+1F600 (😀) | F0 9F 98 80 |
| U+0000 | 1 | 00 | - | - | - | NULL | 00 |
| U+007F | 1 | 7F | - | - | - | DEL | 7F |
| U+0080 | 2 | C2 | 80 | - | - | PAD | C2 80 |
| U+07FF | 2 | DF | BF | - | - | Max 2-byte | DF BF |
| U+0800 | 3 | E0 | A0 | 80 | - | Min 3-byte | E0 A0 80 |
| U+FEFF | 3 | EF | BB | BF | - | BOM | EF BB BF |
| U+FFFD | 3 | EF | BF | BD | - | � (Replacement) | EF BF BD |
| U+FFFF | 3 | EF | BF | BF | - | Noncharacter | EF BF BF |
| U+10000 | 4 | F0 | 90 | 80 | 80 | Linear B Syllable B008A | F0 90 80 80 |
| U+10FFFF | 4 | F4 | 8F | BF | BF | Max code point | F4 8F BF BF |
| U+D800 - U+DFFF | 0 | - | - | - | - | Surrogate range (not encodable) | Rejected |
| U+0024 | 1 | 24 | - | - | - | $ | 24 |
| U+00A3 | 2 | C2 | A3 | - | - | Β£ | C2 A3 |
| U+20AC | 3 | E2 | 82 | AC | - | € | E2 82 AC |
| U+1F4A9 | 4 | F0 | 9F | 92 | A9 | 💩 | F0 9F 92 A9 |
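The reference rows above can be cross-checked against any conforming UTF-8 encoder. A quick sketch using Python's built-in encoder as the reference (the selection of rows is illustrative):

```python
# Spot-check a few reference-table rows against Python's built-in UTF-8 encoder.
rows = {
    0x0041: "41",           # A
    0x00E9: "C3 A9",        # é
    0x20AC: "E2 82 AC",     # €
    0xFEFF: "EF BB BF",     # BOM
    0x1F4A9: "F0 9F 92 A9", # 💩
}
for cp, expected in rows.items():
    got = chr(cp).encode("utf-8").hex(" ").upper()
    assert got == expected, (hex(cp), got, expected)
print("all rows match")  # prints "all rows match"
```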