
Accepted: U+XXXX, 0xXXXX, or decimal. Separate with spaces, commas, or newlines.


About

UTF-8 encodes each Unicode code point into a variable-length sequence of 1 to 4 bytes. Getting the byte boundaries wrong corrupts entire text streams. A single misplaced continuation byte (0x80 - 0xBF) can cascade into mojibake across thousands of characters. This tool implements the RFC 3629 encoding algorithm directly: it reads your code point U, determines the byte count from the scalar value, applies the correct bit masks, and outputs the exact hex byte sequence. It handles surrogate validation (rejecting U+D800 - U+DFFF) and enforces the ceiling at U+10FFFF. Batch input is supported for processing entire character sets at once.

Note: this tool operates on Unicode scalar values only. It does not accept UTF-16 surrogate pairs as input. If you need to debug a UTF-16 stream, decode the surrogate pair to its code point first using the formula cp = 0x10000 + (H - 0xD800) × 0x400 + (L - 0xDC00), then feed the result here. Pro tip: when debugging network protocols, remember that the BOM (byte order mark, U+FEFF) encodes to EF BB BF in UTF-8. Its unexpected presence is a common cause of invisible parsing failures in JSON and XML feeds.
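The surrogate-pair formula above can be sketched in Python as follows (the function name is illustrative, not part of the tool):

```python
def surrogate_pair_to_code_point(high: int, low: int) -> int:
    """Combine a UTF-16 surrogate pair into a Unicode scalar value."""
    assert 0xD800 <= high <= 0xDBFF, "expected a high surrogate"
    assert 0xDC00 <= low <= 0xDFFF, "expected a low surrogate"
    return 0x10000 + (high - 0xD800) * 0x400 + (low - 0xDC00)

# U+1F600 (grinning face) is the surrogate pair D83D DE00 in UTF-16:
print(hex(surrogate_pair_to_code_point(0xD83D, 0xDE00)))  # 0x1f600
```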


Formulas

The UTF-8 encoding algorithm maps a Unicode scalar value U to a byte sequence of length n (1 ≀ n ≀ 4). The byte count is determined by the magnitude of U:

n = 1 if 0x0000 ≤ U ≤ 0x7F
n = 2 if 0x0080 ≤ U ≤ 0x7FF
n = 3 if 0x0800 ≤ U ≤ 0xFFFF
n = 4 if 0x10000 ≤ U ≤ 0x10FFFF
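The byte-count rule can be written directly as a range check. A minimal Python sketch (the function name is illustrative):

```python
def utf8_byte_count(u: int) -> int:
    """Return the UTF-8 sequence length n for scalar value u."""
    if 0xD800 <= u <= 0xDFFF:
        raise ValueError("surrogate code points are not scalar values")
    if u < 0 or u > 0x10FFFF:
        raise ValueError("code point out of Unicode range")
    if u <= 0x7F:
        return 1
    if u <= 0x7FF:
        return 2
    if u <= 0xFFFF:
        return 3
    return 4
```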

For a 1-byte encoding, the output byte equals U directly: byte = U.

For multi-byte sequences, the leading byte carries a prefix of n one-bits followed by a zero, plus the top bits of U. Each continuation byte has the prefix 10 and carries 6 bits of payload. The encoding for n = 2:

byte1 = 0xC0 | (U >> 6)
byte2 = 0x80 | (U & 0x3F)

For n = 3:

byte1 = 0xE0 | (U >> 12)
byte2 = 0x80 | ((U >> 6) & 0x3F)
byte3 = 0x80 | (U & 0x3F)

For n = 4:

byte1 = 0xF0 | (U >> 18)
byte2 = 0x80 | ((U >> 12) & 0x3F)
byte3 = 0x80 | ((U >> 6) & 0x3F)
byte4 = 0x80 | (U & 0x3F)

Where U = Unicode scalar value (integer), | = bitwise OR, >> = right shift, & = bitwise AND. The surrogate range 0xD800 - 0xDFFF is excluded as these are not valid scalar values per the Unicode Standard.
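Putting the masks together, the full encoder might be sketched in Python like this (a hypothetical helper, not the tool's actual source):

```python
def utf8_encode(u: int) -> bytes:
    """Encode a Unicode scalar value U into its UTF-8 byte sequence."""
    if 0xD800 <= u <= 0xDFFF or u < 0 or u > 0x10FFFF:
        raise ValueError(f"U+{u:04X} is not a Unicode scalar value")
    if u <= 0x7F:        # n = 1: byte = U
        return bytes([u])
    if u <= 0x7FF:       # n = 2: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (u >> 6),
                      0x80 | (u & 0x3F)])
    if u <= 0xFFFF:      # n = 3: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (u >> 12),
                      0x80 | ((u >> 6) & 0x3F),
                      0x80 | (u & 0x3F)])
    # n = 4: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (u >> 18),
                  0x80 | ((u >> 12) & 0x3F),
                  0x80 | ((u >> 6) & 0x3F),
                  0x80 | (u & 0x3F)])

print(utf8_encode(0x20AC).hex(' ').upper())  # E2 82 AC
```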

Reference Data

Code Point Range    | UTF-8 Bytes | Byte 1   | Byte 2   | Byte 3   | Byte 4   | Example                   | Encoded
U+0000 - U+007F     | 1           | 0xxxxxxx | -        | -        | -        | U+0041 (A)                | 41
U+0080 - U+07FF     | 2           | 110xxxxx | 10xxxxxx | -        | -        | U+00E9 (é)                | C3 A9
U+0800 - U+FFFF     | 3           | 1110xxxx | 10xxxxxx | 10xxxxxx | -        | U+4E16 (世)               | E4 B8 96
U+10000 - U+10FFFF  | 4           | 11110xxx | 10xxxxxx | 10xxxxxx | 10xxxxxx | U+1F600 (😀)              | F0 9F 98 80
U+0000              | 1           | 00       | -        | -        | -        | NULL                      | 00
U+007F              | 1           | 7F       | -        | -        | -        | DEL                       | 7F
U+0080              | 2           | C2       | 80       | -        | -        | PAD                       | C2 80
U+07FF              | 2           | DF       | BF       | -        | -        | ߿                         | DF BF
U+0800              | 3           | E0       | A0       | 80       | -        | ࠀ                         | E0 A0 80
U+FEFF              | 3           | EF       | BB       | BF       | -        | BOM                       | EF BB BF
U+FFFD              | 3           | EF       | BF       | BD       | -        | � (replacement)           | EF BF BD
U+FFFF              | 3           | EF       | BF       | BF       | -        | Noncharacter              | EF BF BF
U+10000             | 4           | F0       | 90       | 80       | 80       | Linear B Syllable B008 A  | F0 90 80 80
U+10FFFF            | 4           | F4       | 8F       | BF       | BF       | Max code point            | F4 8F BF BF
U+D800 - U+DFFF     | INVALID     | -        | -        | -        | -        | Surrogate range, not encodable in UTF-8 | Rejected
U+0024              | 1           | 24       | -        | -        | -        | $                         | 24
U+00A3              | 2           | C2       | A3       | -        | -        | £                         | C2 A3
U+20AC              | 3           | E2       | 82       | AC       | -        | €                         | E2 82 AC
U+1F4A9             | 4           | F0       | 9F       | 92       | A9       | 💩                        | F0 9F 92 A9

Frequently Asked Questions

Why are code points U+D800 - U+DFFF rejected?

Code points U+D800 through U+DFFF are UTF-16 surrogate halves. The Unicode Standard defines them as non-scalar values, meaning they do not represent characters and must never appear in a well-formed UTF-8 stream. RFC 3629 explicitly prohibits their encoding. If you have a surrogate pair from a UTF-16 source, compute the actual code point first: cp = 0x10000 + (H - 0xD800) × 0x400 + (L - 0xDC00), then encode the result.
What is an overlong encoding, and why is it rejected?

An overlong encoding uses more bytes than necessary to represent a code point. For example, encoding U+002F (the slash character) as C0 AF (2 bytes) instead of 2F (1 byte). This is strictly forbidden by RFC 3629 because it creates security vulnerabilities - attackers have historically used overlong encodings to bypass input filters. This tool always produces the shortest possible encoding and rejects overlong sequences during decoding.
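As an illustration, a strict UTF-8 decoder such as Python's built-in one rejects the overlong form C0 AF outright (0xC0 can never be a valid leading byte, since any 2-byte sequence it starts would be overlong):

```python
overlong = b"\xc0\xaf"  # overlong 2-byte encoding of U+002F '/'
try:
    overlong.decode("utf-8")
    print("accepted (non-conformant decoder!)")
except UnicodeDecodeError as exc:
    print("rejected:", exc.reason)

# The shortest form is the only valid one:
print(b"\x2f".decode("utf-8"))  # /
```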
What happens when I decode invalid UTF-8 bytes?

When decoding UTF-8 bytes back to code points, the tool validates every byte. A valid leading byte must match a specific bit pattern: 0xxxxxxx, 110xxxxx, 1110xxxx, or 11110xxx. Continuation bytes must match 10xxxxxx. If a continuation byte is missing or a leading byte appears where a continuation is expected, the tool marks that position as an error and reports the specific invalid byte offset. It does not silently skip or substitute.
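The bit-pattern check for a single byte can be sketched like this (an illustrative classifier, not the tool's exact error reporter; note that 0xC0/0xC1 match the 110xxxxx pattern but can only begin overlong sequences, so a full validator rejects them in a later step):

```python
def classify_byte(b: int) -> str:
    """Classify a byte by its structural role in a UTF-8 stream."""
    if b <= 0x7F:
        return "lead-1"        # 0xxxxxxx: 1-byte sequence
    if 0x80 <= b <= 0xBF:
        return "continuation"  # 10xxxxxx
    if 0xC0 <= b <= 0xDF:
        return "lead-2"        # 110xxxxx
    if 0xE0 <= b <= 0xEF:
        return "lead-3"        # 1110xxxx
    if 0xF0 <= b <= 0xF7:
        return "lead-4"        # 11110xxx
    return "invalid"           # F8-FF never appear in UTF-8
```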
Can I mix input formats in one batch?

Yes. The tool accepts three formats: U+XXXX (standard Unicode notation), 0xXXXX (hexadecimal with prefix), and plain decimal integers (e.g., 65 for the letter A). You can mix formats in the same input, separated by spaces, commas, or newlines. Each token is parsed independently.
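A parser for that input grammar might look like the following sketch (illustrative only; the tool's actual parsing may differ):

```python
import re

def parse_code_points(text: str) -> list[int]:
    """Parse U+XXXX, 0xXXXX, or decimal tokens split on spaces/commas/newlines."""
    points = []
    for token in re.split(r"[\s,]+", text.strip()):
        if not token:
            continue
        if token.upper().startswith("U+"):
            points.append(int(token[2:], 16))
        elif token.lower().startswith("0x"):
            points.append(int(token, 16))
        else:
            points.append(int(token))
    return points

print(parse_code_points("U+0041, 0x20AC\n65"))  # [65, 8364, 65]
```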
Why is U+10FFFF the maximum code point?

The Unicode Standard defines U+10FFFF as the absolute maximum code point. While the original UTF-8 design by Ken Thompson could theoretically encode values up to U+7FFFFFFF using up to 6 bytes, RFC 3629 restricted the range to match UTF-16's capacity. This tool rejects any input above U+10FFFF and reports it as out of range. The maximum valid 4-byte UTF-8 sequence is F4 8F BF BF.
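Python enforces the same ceiling, which makes the boundary easy to demonstrate:

```python
# U+10FFFF is the last encodable code point; its UTF-8 form is F4 8F BF BF.
print(chr(0x10FFFF).encode("utf-8").hex(" ").upper())  # f4 8f bf bf, uppercased

# One past the ceiling is rejected before encoding is even attempted:
try:
    chr(0x110000)
except ValueError as exc:
    print("rejected:", exc)
```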
Why doesn't a string's length match its UTF-8 byte count?

In JavaScript, String.length returns the number of UTF-16 code units, not characters or bytes. A code point above U+FFFF (like emoji U+1F600) occupies 2 UTF-16 code units but 4 UTF-8 bytes. In byte-oriented contexts (network buffers, file I/O), you must calculate the UTF-8 byte length. This tool displays both the code point count and the total byte count so you can plan buffer sizes accurately.
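The three counts can be compared side by side. A small Python demonstration (UTF-16 code units computed from the little-endian encoding, which is what JavaScript's String.length reports):

```python
s = "\N{GRINNING FACE}"  # U+1F600

code_points = len(s)                           # 1: Python counts code points
utf16_units = len(s.encode("utf-16-le")) // 2  # 2: JS String.length would say this
utf8_bytes = len(s.encode("utf-8"))            # 4: bytes needed in a buffer or file

print(code_points, utf16_units, utf8_bytes)    # 1 2 4
```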