About

Raw byte sequences carry no meaning until they are decoded against a character encoding. In UTF-8, a single misidentified byte breaks the multi-byte sequence it belongs to, and a decoder typically substitutes the replacement character U+FFFD for the damaged character. Because lead bytes and continuation bytes use distinct bit patterns, decoding can resynchronize at the next valid lead byte, so the damage stays local. This tool decodes byte arrays - expressed as decimal (0 - 255), hexadecimal (00 - FF), binary (00000000 - 11111111), or octal (0 - 377) - into their corresponding Unicode string using proper UTF-8 multi-byte reassembly. It also reverses the operation: any plaintext string converts back to its constituent byte values in your chosen base.
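
The round trip described above can be sketched in a few lines of Python. This is only an illustration, not the tool's own code: the function names `bytes_to_string` and `string_to_bytes` are hypothetical, and the sketch assumes whitespace-separated input values.

```python
def bytes_to_string(text: str, base: int) -> str:
    """Decode whitespace-separated byte values in the given base as UTF-8."""
    values = [int(token, base) for token in text.split()]
    if any(not 0 <= v <= 255 for v in values):
        raise ValueError("every value must be a byte in the range 0-255")
    # Strict decoding: malformed UTF-8 raises UnicodeDecodeError.
    return bytes(values).decode("utf-8")


def string_to_bytes(s: str, base: int) -> str:
    """Encode a string as UTF-8 and render each byte in the given base."""
    # Hex is padded to 2 digits and binary to 8; octal is left at 1-3 digits.
    fmt = {10: "d", 16: "02X", 2: "08b", 8: "o"}[base]
    return " ".join(format(b, fmt) for b in s.encode("utf-8"))


print(bytes_to_string("72 105", 10))  # decimal input
print(string_to_bytes("é", 16))       # hexadecimal output
```

The same two helpers cover all four bases because only the parsing and formatting step depends on the base; the UTF-8 step in the middle is identical.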

The converter handles all four UTF-8 byte lengths: single-byte ASCII (U+0000 - U+007F), two-byte sequences up to U+07FF, three-byte sequences covering the Basic Multilingual Plane up to U+FFFF, and four-byte sequences for supplementary planes up to U+10FFFF. Note: this tool assumes valid UTF-8 encoding. Malformed continuation bytes or overlong encodings are flagged as errors rather than silently accepted. Pro tip: when debugging network protocols or binary file headers, paste raw hex dumps directly - the delimiter auto-detection handles spaces, commas, 0x prefixes, and contiguous streams.
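
As a quick check of the four length classes and of strict validation, the following Python sketch encodes the highest codepoint in each class and tests a few byte sequences. It relies on Python's built-in strict UTF-8 decoder as a stand-in for the tool's validator; the helper name `is_valid_utf8` is an assumption made for this example.

```python
# Highest codepoint in each UTF-8 length class.
boundaries = (0x7F, 0x7FF, 0xFFFF, 0x10FFFF)
lengths = {cp: len(chr(cp).encode("utf-8")) for cp in boundaries}


def is_valid_utf8(data: bytes) -> bool:
    """Return True only if data decodes as strict UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


print(lengths)
print(is_valid_utf8(bytes([0xE2, 0x82, 0xAC])))  # well-formed Euro sign
print(is_valid_utf8(bytes([0xC0, 0x80])))        # overlong encoding of U+0000
print(is_valid_utf8(bytes([0xE2, 0x28, 0xA1])))  # malformed continuation byte
```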

Formulas

UTF-8 encodes Unicode codepoints into variable-length byte sequences. The encoding rule maps a codepoint U to 1 - 4 bytes based on its magnitude:

0xxxxxxx if U ≤ 0x7F (1 byte, 7 bits)
110xxxxx 10xxxxxx if U ≤ 0x7FF (2 bytes, 11 bits)
1110xxxx 10xxxxxx 10xxxxxx if U ≤ 0xFFFF (3 bytes, 16 bits)
11110xxx 10xxxxxx 10xxxxxx 10xxxxxx if U ≤ 0x10FFFF (4 bytes, 21 bits)
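
The encoding rule can be written out by hand as a minimal Python sketch. The function name `encode_codepoint` is hypothetical, and rejecting surrogate codepoints (U+D800 - U+DFFF) is an added assumption that follows RFC 3629 rather than anything stated on this page.

```python
def encode_codepoint(u: int) -> bytes:
    """Encode one Unicode codepoint as UTF-8 using the bit patterns above."""
    if u < 0 or u > 0x10FFFF or 0xD800 <= u <= 0xDFFF:
        raise ValueError(f"not an encodable codepoint: {u:#x}")
    if u <= 0x7F:      # 0xxxxxxx
        return bytes([u])
    if u <= 0x7FF:     # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (u >> 6), 0x80 | (u & 0x3F)])
    if u <= 0xFFFF:    # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (u >> 12),
                      0x80 | ((u >> 6) & 0x3F),
                      0x80 | (u & 0x3F)])
    # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (u >> 18),
                  0x80 | ((u >> 12) & 0x3F),
                  0x80 | ((u >> 6) & 0x3F),
                  0x80 | (u & 0x3F)])


print(encode_codepoint(0x20AC).hex(" "))   # Euro sign
print(encode_codepoint(0x1F600).hex(" "))  # Grinning Face emoji
```

Each branch shifts the codepoint right to isolate a 6-bit group, masks it with 0x3F, and sets the 10 continuation prefix with 0x80; the lead byte carries whatever bits remain.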

To decode bytes back to a codepoint, extract the payload bits from each byte and concatenate. For a 2-byte sequence with lead byte b₀ and continuation byte b₁:

U = ((b₀ ∧ 0x1F) << 6) + (b₁ ∧ 0x3F)
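
A worked instance of the 2-byte case, as a Python sketch. The name `decode_two_bytes` is hypothetical, and the prefix and overlong checks are assumptions added for safety rather than part of the formula itself. Note that the shift is parenthesized: in Python, as in C and JavaScript, `+` binds more tightly than `<<`.

```python
def decode_two_bytes(b0: int, b1: int) -> int:
    """Reassemble a codepoint from a 2-byte UTF-8 sequence."""
    if b0 & 0xE0 != 0xC0:
        raise ValueError("b0 is not a 2-byte lead byte (110xxxxx)")
    if b1 & 0xC0 != 0x80:
        raise ValueError("b1 is not a continuation byte (10xxxxxx)")
    u = ((b0 & 0x1F) << 6) + (b1 & 0x3F)
    if u < 0x80:
        raise ValueError("overlong encoding")
    return u


# C3 A9: (0xC3 & 0x1F) = 0x03, shifted left by 6 = 0xC0;
# (0xA9 & 0x3F) = 0x29; 0xC0 + 0x29 = 0xE9, the codepoint of é.
u = decode_two_bytes(0xC3, 0xA9)
print(f"U+{u:04X}", chr(u))
```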

Converting between bases changes only how a byte value is written, not the value itself. For a single byte B:

B_dec = B_hex = B_oct = B_bin, where 0 ≤ B ≤ 255; for example, 65 (decimal) = 41 (hex) = 101 (octal) = 01000001 (binary)

Here U is the Unicode codepoint value, bₙ is the n-th byte of the UTF-8 sequence (counting from 0), B is a single byte value in the range 0 - 255, ∧ is bitwise AND, and << is a left bit-shift.

Reference Data

Byte Range (Dec)	Hex Range	Binary Pattern	UTF-8 Bytes	Unicode Range	Description
0 - 127	00 - 7F	0xxxxxxx	1	U+0000 - U+007F	ASCII (Latin letters, digits, punctuation)
194 - 223	C2 - DF	110xxxxx	2	U+0080 - U+07FF	Latin Extended, Greek, Cyrillic, Arabic, Hebrew
224 - 239	E0 - EF	1110xxxx	3	U+0800 - U+FFFF	CJK, Devanagari, Thai, BMP symbols
240 - 244	F0 - F4	11110xxx	4	U+10000 - U+10FFFF	Emoji, historic scripts, musical symbols
128 - 191	80 - BF	10xxxxxx	-	-	Continuation byte (never starts a character)
192 - 193	C0 - C1	1100000x	-	-	Overlong encoding (invalid in UTF-8)
245 - 255	F5 - FF	11110101 - 11111111	-	-	Invalid UTF-8 lead bytes (RFC 3629)
32	20	00100000	1	U+0020	Space character
10	0A	00001010	1	U+000A	Line Feed (LF / \n)
13	0D	00001101	1	U+000D	Carriage Return (CR / \r)
9	09	00001001	1	U+0009	Horizontal Tab (\t)
0	00	00000000	1	U+0000	Null character (NUL)
239 187 191	EF BB BF	-	3	U+FEFF	UTF-8 BOM (Byte Order Mark)
239 191 189	EF BF BD	-	3	U+FFFD	Replacement Character (decoding error)
48 - 57	30 - 39	0011xxxx	1	U+0030 - U+0039	Digits 0-9
65 - 90	41 - 5A	010xxxxx	1	U+0041 - U+005A	Uppercase A - Z
97 - 122	61 - 7A	011xxxxx	1	U+0061 - U+007A	Lowercase a - z
240 159 152 128	F0 9F 98 80	-	4	U+1F600	Grinning Face emoji 😀
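
The byte ranges in the table lend themselves to a small classifier, sketched here in Python. The function name `classify_byte` and its return labels are hypothetical. The sketch follows RFC 3629, under which the bytes C0, C1 and F5 - FF never appear in valid UTF-8.

```python
def classify_byte(b: int) -> str:
    """Classify a single byte by its role in a UTF-8 stream."""
    if not 0 <= b <= 0xFF:
        raise ValueError("not a byte value")
    if b <= 0x7F:
        return "ascii"          # 0xxxxxxx
    if b <= 0xBF:
        return "continuation"   # 10xxxxxx
    if b <= 0xC1:
        return "invalid"        # C0, C1: could only start an overlong form
    if b <= 0xDF:
        return "lead-2"         # 110xxxxx
    if b <= 0xEF:
        return "lead-3"         # 1110xxxx
    if b <= 0xF4:
        return "lead-4"         # 11110xxx, limited to U+10FFFF
    return "invalid"            # F5 - FF


# The Grinning Face emoji: one 4-byte lead followed by three continuations.
print([classify_byte(b) for b in (0xF0, 0x9F, 0x98, 0x80)])
```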

Frequently Asked Questions

If you map each byte individually via Latin-1 or ASCII, any codepoint above U+007F produces wrong characters. UTF-8 uses lead bytes (with prefixes 110, 1110, 11110) to signal that 1 - 3 continuation bytes (prefix 10) follow. The converter reassembles these into a single codepoint. For example, the Euro sign € is three bytes: E2 82 AC. Treating them as three Latin-1 characters yields â, an invisible control character (U+0082), and ¬, which displays as â¬ instead.
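
The Euro sign example can be reproduced with Python's built-in codecs, shown here as a small illustration rather than the converter's own logic.

```python
euro = bytes([0xE2, 0x82, 0xAC])

as_utf8 = euro.decode("utf-8")      # one codepoint, U+20AC
as_latin1 = euro.decode("latin-1")  # three codepoints, one per byte

print(repr(as_utf8), len(as_utf8))
print([f"U+{ord(c):04X}" for c in as_latin1])
```

Latin-1 maps every byte to the codepoint with the same number, so it never fails to decode; it simply produces the wrong text when the bytes were written as UTF-8.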

When the input contains an invalid byte sequence, the converter flags it and inserts the Unicode replacement character U+FFFD (�). Bytes in the range 0x80 - 0xBF are continuation bytes and must never appear at the start of a character. Similarly, the bytes 0xC0 - 0xC1 and 0xF5 - 0xFF never appear in valid UTF-8 per RFC 3629.
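
Python's `errors="replace"` handler behaves in a comparable way and makes a convenient stand-in for experimenting. This is a sketch under that assumption; the converter's exact replacement policy is not shown on this page, and the name `decode_lenient` is hypothetical.

```python
def decode_lenient(data: bytes) -> str:
    """Decode UTF-8, substituting U+FFFD for each invalid sequence."""
    return data.decode("utf-8", errors="replace")


# A stray continuation byte between two ASCII letters.
stray = decode_lenient(bytes([0x41, 0x80, 0x42]))

# A truncated 3-byte sequence followed by valid text: decoding picks up
# again at the next valid lead byte, so "A€" survives intact.
truncated = decode_lenient(bytes([0xE2, 0x82, 0x41, 0xE2, 0x82, 0xAC]))

print(repr(stray))
print(repr(truncated))
```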

Emoji and other supplementary-plane characters are supported. Emoji like 😀 (Grinning Face) occupy codepoint U+1F600, which requires a 4-byte UTF-8 sequence: F0 9F 98 80. The converter correctly reassembles all four bytes into the single emoji character. In the reverse direction (string to bytes), JavaScript's surrogate pairs are resolved to the true codepoint before encoding.
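
The surrogate-pair step can be illustrated with the standard UTF-16 arithmetic. The sketch below is in Python, which stores full codepoints natively, so the pair is supplied as two integers; `combine_surrogates` is a hypothetical helper name.

```python
def combine_surrogates(high: int, low: int) -> int:
    """Combine a UTF-16 surrogate pair into one supplementary codepoint."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid high/low surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


# JavaScript stores 😀 as the two UTF-16 code units D83D DE00.
codepoint = combine_surrogates(0xD83D, 0xDE00)
encoded = chr(codepoint).encode("utf-8")

print(f"U+{codepoint:X}", encoded.hex(" "))
```

Encoding each surrogate separately would produce two 3-byte sequences that are not valid UTF-8, which is why the pair has to be combined first.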

Hex input can fail to parse when a segment has an odd number of digits. The auto-delimiter detection looks for spaces, commas, 0x prefixes, or contiguous pairs. If your hex string has odd-length segments (e.g., A 4F 3), the parser cannot form valid byte pairs. Ensure each hex value is exactly two characters (pad with a leading zero: 0A instead of A). When using the 0x prefix format, each value must be 0x00 - 0xFF.
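
One way such delimiter handling could work is sketched below in Python. This is an assumption about the general approach, not the converter's actual parser; `parse_hex_bytes` is a hypothetical name, and the sketch rejects every odd-length segment, including after a 0x prefix.

```python
import re


def parse_hex_bytes(text: str) -> bytes:
    """Parse hex bytes separated by spaces or commas, with optional 0x
    prefixes, or written as one contiguous stream of digit pairs."""
    tokens = [t for t in re.split(r"[\s,]+", text.strip()) if t]
    values = []
    for token in tokens:
        digits = token[2:] if token.lower().startswith("0x") else token
        if re.fullmatch(r"[0-9A-Fa-f]+", digits) is None:
            raise ValueError(f"not a hex value: {token!r}")
        if len(digits) % 2 != 0:
            raise ValueError(f"odd-length hex segment: {token!r}")
        # A longer run of digits is treated as contiguous two-digit pairs.
        values.extend(int(digits[i:i + 2], 16)
                      for i in range(0, len(digits), 2))
    return bytes(values)


print(parse_hex_bytes("48 65 6C 6C 6F"))
print(parse_hex_bytes("0x48,0x69"))
print(parse_hex_bytes("48656C6C6F"))
```

With this sketch, `A 4F 3` raises a `ValueError` on the first segment, while the padded form `0A 4F 03` parses cleanly.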

In Decimal mode, each byte is output as a base-10 integer (0 - 255). In Hex mode, bytes appear as two-character hexadecimal values. Binary mode produces 8-digit binary strings. Octal mode outputs 1-3 digit octal values (range 0 - 377). All modes represent the same underlying UTF-8 byte sequence; only the numeric base changes.

The UTF-8 BOM is the three-byte sequence EF BB BF (codepoint U+FEFF). This converter does not strip it automatically - it decodes to the zero-width no-break space character. If your source data begins with these bytes, the output string will contain an invisible character at position 0. Remove the first three bytes manually if unwanted.
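
If you process the bytes yourself, the BOM can also be dropped programmatically. The Python sketch below shows both behaviours: the plain `utf-8` codec keeps U+FEFF, as this converter does, while the standard `utf-8-sig` codec removes a leading BOM if one is present.

```python
import codecs

# EF BB BF followed by the ASCII text "hi".
data = codecs.BOM_UTF8 + "hi".encode("utf-8")

kept = data.decode("utf-8")          # BOM survives as U+FEFF at position 0
stripped = data.decode("utf-8-sig")  # leading BOM is removed

print(repr(kept), len(kept))
print(repr(stripped), len(stripped))
```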