Bytes to UTF-8 Converter
Convert raw bytes (hex, decimal, binary, octal) to UTF-8 text and back. Decode byte sequences, inspect code points, and analyze UTF-8 encoding structure.
About
UTF-8 encodes all 1,112,064 valid Unicode code points using a variable-width scheme of 1 to 4 bytes per character. A lenient decoder replaces each malformed byte sequence with the replacement character U+FFFD, which can silently corrupt downstream text processing, database storage, and API responses. This tool performs real decoding via the TextDecoder API with strict error reporting. It parses raw byte input in hexadecimal, decimal, binary, or octal notation, validates each byte against the 0-255 range, and reconstructs the original UTF-8 string. Reverse mode encodes any Unicode string back to its constituent byte sequence.
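The parse, validate, and decode steps described above can be sketched in JavaScript roughly as follows. This is a minimal sketch, not the tool's actual source: the helper names `parseBytes`, `decodeStrict`, and `encodeToHex` are hypothetical, and the sketch assumes byte tokens are separated by whitespace or commas. `TextDecoder` (with `fatal: true`) and `TextEncoder` are the standard Encoding API.

```javascript
// Sketch only: helper names are hypothetical, not the tool's actual source.
// Assumption: byte tokens are separated by whitespace or commas.
const DIGITS = {
  2: /^[01]+$/,
  8: /^[0-7]+$/,
  10: /^[0-9]+$/,
  16: /^[0-9a-fA-F]+$/,
};

// Parse a string of byte tokens in the given radix into a Uint8Array,
// rejecting malformed tokens and values outside 0-255.
function parseBytes(input, radix) {
  const digits = DIGITS[radix];
  if (!digits) throw new RangeError(`Unsupported radix: ${radix}`);
  const tokens = input.trim().split(/[\s,]+/).filter(Boolean);
  return Uint8Array.from(tokens, (token) => {
    if (!digits.test(token)) {
      throw new SyntaxError(`Not a base-${radix} number: "${token}"`);
    }
    const value = parseInt(token, radix);
    if (value > 255) throw new RangeError(`Byte outside 0-255: "${token}"`);
    return value;
  });
}

// fatal: true makes decode() throw a TypeError on malformed UTF-8
// instead of silently substituting U+FFFD.
function decodeStrict(bytes) {
  return new TextDecoder("utf-8", { fatal: true }).decode(bytes);
}

// Reverse mode: encode a string to UTF-8 and print each byte as two hex digits.
function encodeToHex(text) {
  return Array.from(new TextEncoder().encode(text), (byte) =>
    byte.toString(16).toUpperCase().padStart(2, "0")
  ).join(" ");
}
```

With these helpers, `decodeStrict(parseBytes("E2 82 AC", 16))` yields the euro sign, `encodeToHex("\u00E9")` yields `C3 A9`, and `decodeStrict(Uint8Array.of(0xC3))` throws because the sequence is truncated.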
The tool exposes the internal bit structure of each code point: the leading-byte prefixes (0, 110, 1110, 11110) that signal the byte count, and the 10 prefix that marks each continuation byte. The remaining bits of every byte carry the payload data. This matters when debugging malformed sequences, diagnosing mojibake from charset mismatches, or verifying that a system correctly handles multi-byte characters such as CJK ideographs or emoji. Note: this tool assumes valid byte boundaries. It cannot recover data from streams split mid-sequence without the surrounding bytes.
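To illustrate the marker/payload split, here is a small sketch that classifies a single byte by its prefix bits. The name `describeByte` and the role labels are hypothetical and chosen for illustration. The classification is purely structural: it does not check whether a lead byte is actually permitted by RFC 3629.

```javascript
// Marker prefixes for each byte type. They are mutually exclusive,
// so at most one of them matches any given byte.
const MARKERS = [
  { prefix: "0", role: "single (ASCII)" },
  { prefix: "10", role: "continuation" },
  { prefix: "110", role: "lead (2-byte)" },
  { prefix: "1110", role: "lead (3-byte)" },
  { prefix: "11110", role: "lead (4-byte)" },
];

// Split one byte (0-255) into its marker bits and payload bits.
// Hypothetical helper: structural only, forbidden leads are not rejected here.
function describeByte(byte) {
  const bits = byte.toString(2).padStart(8, "0");
  const match = MARKERS.find((m) => bits.startsWith(m.prefix));
  if (!match) {
    // Only 11111xxx reaches this branch.
    return { bits, marker: bits, payload: "", role: "invalid" };
  }
  return {
    bits,
    marker: match.prefix,
    payload: bits.slice(match.prefix.length),
    role: match.role,
  };
}
```

For the euro sign bytes `E2 82 AC`, the payloads are `0010`, `000010`, and `101100`. Concatenated, they form the 16-bit value 0x20AC, which is the code point U+20AC.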
Formulas
UTF-8 is a variable-length encoding defined in RFC 3629. The first byte determines the total byte count n for that code point: a leading 0 bit means n = 1, and otherwise n equals the number of leading 1 bits (2, 3, or 4). Each subsequent continuation byte begins with 10 and carries 6 payload bits.
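Encoding rule for a code point U: choose the byte count n from the range of U, then distribute the bits of U over the payload positions, most significant bits first.

$$
n =
\begin{cases}
1 & \text{if } U \le \mathrm{0x7F} \\
2 & \text{if } \mathrm{0x80} \le U \le \mathrm{0x7FF} \\
3 & \text{if } \mathrm{0x800} \le U \le \mathrm{0xFFFF} \\
4 & \text{if } \mathrm{0x10000} \le U \le \mathrm{0x10FFFF}
\end{cases}
$$

Decoding extracts the payload bits x from each byte and reconstructs the code point:

$$
U = (b_0 \mathbin{\&} \mathrm{mask}_0) \cdot 2^{6(n-1)} + \sum_{i=1}^{n-1} (b_i \mathbin{\&} \mathrm{0x3F}) \cdot 2^{6(n-1-i)}
$$

Where $b_0$ is the leading byte, $b_i$ are the continuation bytes, $n$ is the total byte count, and $\mathrm{mask}_0$ extracts the data bits from the leading byte (0x7F for 1-byte, 0x1F for 2-byte, 0x0F for 3-byte, 0x07 for 4-byte).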
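The encoding and decoding rules can be worked through by hand in code. The sketch below is illustrative only: `byteCount`, `encodeCodePoint`, and `decodeSequence` are hypothetical names, and `decodeSequence` assumes its input is exactly one well-formed sequence (it performs no validation). In practice `TextEncoder` and `TextDecoder` do this work.

```javascript
// Data-bit masks for the leading byte, indexed by total byte count n.
const LEAD_MASK = { 1: 0x7f, 2: 0x1f, 3: 0x0f, 4: 0x07 };
// Marker bits OR-ed into the leading byte, indexed by n.
const LEAD_MARKER = { 1: 0x00, 2: 0xc0, 3: 0xe0, 4: 0xf0 };

// Number of UTF-8 bytes needed for a code point.
function byteCount(codePoint) {
  if (codePoint <= 0x7f) return 1;
  if (codePoint <= 0x7ff) return 2;
  if (codePoint <= 0xffff) return 3;
  return 4;
}

// Encode one code point into an array of byte values.
function encodeCodePoint(codePoint) {
  const isSurrogate = codePoint >= 0xd800 && codePoint <= 0xdfff;
  if (!Number.isInteger(codePoint) || codePoint < 0 || codePoint > 0x10ffff || isSurrogate) {
    throw new RangeError(`Not an encodable code point: ${codePoint}`);
  }
  const n = byteCount(codePoint);
  const bytes = new Array(n);
  let rest = codePoint;
  // Fill continuation bytes from the end, 6 payload bits each.
  for (let i = n - 1; i > 0; i--) {
    bytes[i] = 0x80 | (rest & 0x3f);
    rest >>= 6;
  }
  // The remaining high bits go into the leading byte.
  bytes[0] = LEAD_MARKER[n] | rest;
  return bytes;
}

// Decode one well-formed sequence (assumed, not checked) back to a code point.
function decodeSequence(bytes) {
  const n = bytes.length;
  let codePoint = bytes[0] & LEAD_MASK[n];
  for (let i = 1; i < n; i++) {
    codePoint = (codePoint << 6) | (bytes[i] & 0x3f);
  }
  return codePoint;
}
```

For example, `encodeCodePoint(0x20AC)` gives the byte values E2 82 AC, and `decodeSequence([0xF0, 0x9F, 0x98, 0x80])` gives 0x1F600.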
Reference Data
| Byte Count | Leading Bits | Code Point Range | Hex Range | Payload Bits | Example Character | Byte Sequence (Hex) | Description |
|---|---|---|---|---|---|---|---|
| 1 | 0xxxxxxx | U+0000 - U+007F | 00 - 7F | 7 | A | 41 | ASCII compatible range |
| 2 | 110xxxxx 10xxxxxx | U+0080 - U+07FF | C2 80 - DF BF | 11 | é | C3 A9 | Latin extended, Greek, Cyrillic, Arabic, Hebrew |
| 3 | 1110xxxx 10xxxxxx 10xxxxxx | U+0800 - U+FFFF | E0 A0 80 - EF BF BF | 16 | 世 | E4 B8 96 | CJK ideographs, BMP symbols |
| 4 | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx | U+10000 - U+10FFFF | F0 90 80 80 - F4 8F BF BF | 21 | 😀 | F0 9F 98 80 | Emoji, historic scripts, math symbols |
| Common Characters & Their Byte Representations | | | | | | | |
| 1 | 00100000 | U+0020 | 20 | 7 | (space) | 20 | Space character |
| 1 | 00001010 | U+000A | 0A | 7 | \n | 0A | Line feed (newline) |
| 2 | 11000010 10101100 | U+00AC | C2 AC | 11 | Β¬ | C2 AC | Not sign |
| 3 | 11100010 10000010 10101100 | U+20AC | E2 82 AC | 16 | β¬ | E2 82 AC | Euro sign |
| 3 | 11101111 10111111 10111101 | U+FFFD | EF BF BD | 16 | οΏ½ | EF BF BD | Replacement character (invalid sequence marker) |
| 3 | 11100010 10011100 10100000 | U+2720 | E2 9C A0 | 16 | β | E2 9C A0 | Maltese cross (Dingbat) |
| 4 | 11110000 10011111 10100100 10101001 | U+1F929 | F0 9F A4 A9 | 21 | π€© | F0 9F A4 A9 | Star-struck emoji |
| 2 | 11010000 10000001 | U+0401 | D0 81 | 11 | Π | D0 81 | Cyrillic capital IO |
| 3 | 11100011 10000001 10000010 | U+3042 | E3 81 82 | 16 | γ | E3 81 82 | Hiragana letter A |
| Invalid / Forbidden Byte Values | |||||||
| - | 11111xxx | - | F8 - FF | - | - | - | Never valid in UTF-8 (would imply 5+ byte sequences) |
| - | 1100000x | - | C0 - C1 | - | - | - | Overlong encoding of ASCII (forbidden by RFC 3629) |
| - | 10xxxxxx | - | 80 - BF | - | - | - | Continuation byte without leading byte (orphan) |
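
The three forbidden categories in the last rows of the table can be detected with a simple scan. The sketch below is an assumption-laden illustration, not a full validator: `findForbiddenBytes` and its `kind` labels are hypothetical, and it deliberately checks only the three table rows. It does not flag truncated sequences, overlong 3- or 4-byte forms, surrogates, or lead bytes F5 - F7 (which RFC 3629 also excludes because they would encode values above U+10FFFF).

```javascript
// Report bytes that fall into the three forbidden categories from the table.
// Hypothetical helper covering only those rows; it is not a complete UTF-8 validator.
function findForbiddenBytes(bytes) {
  const problems = [];
  let pending = 0; // continuation bytes still expected after the last lead byte

  bytes.forEach((byte, index) => {
    if (byte >= 0xf8) {
      // 11111xxx: would imply a 5+ byte sequence.
      problems.push({ index, byte, kind: "never-valid" });
      pending = 0;
    } else if (byte === 0xc0 || byte === 0xc1) {
      // 1100000x: could only start an overlong encoding of ASCII.
      problems.push({ index, byte, kind: "overlong-lead" });
      pending = 0;
    } else if (byte >= 0x80 && byte <= 0xbf) {
      // 10xxxxxx: fine after a lead byte, an orphan otherwise.
      if (pending > 0) {
        pending--;
      } else {
        problems.push({ index, byte, kind: "orphan-continuation" });
      }
    } else {
      // ASCII or a lead byte: record how many continuation bytes should follow.
      pending = byte >= 0xf0 ? 3 : byte >= 0xe0 ? 2 : byte >= 0xc0 ? 1 : 0;
    }
  });

  return problems;
}
```

Scanning `[0x41, 0x80, 0xC0, 0xFF, 0xE2, 0x82, 0xAC]` reports the bytes at indexes 1, 2, and 3, while the trailing euro sign sequence passes.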