About

UTF-8 encodes 1,112,064 valid Unicode code points using a variable-width scheme of 1 to 4 bytes per character. A single misinterpreted byte produces the replacement character U+FFFD, corrupting downstream text processing, database storage, and API responses. This tool performs real decoding via the TextDecoder API with strict error reporting. It parses raw byte input in hexadecimal, decimal, binary, or octal notation, validates each byte against the 0 - 255 range, and reconstructs the original UTF-8 string. Reverse mode encodes any Unicode string back to its constituent byte sequence.
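The parse, validate, and decode steps described above can be sketched as follows. This is a minimal illustration, not the tool's actual source: `parseBytes` is a hypothetical helper name, and the assumption that tokens are separated by whitespace or commas is ours.

```javascript
// Hypothetical helper (not the tool's own code): parse byte tokens written
// in base 2, 8, 10, or 16 and validate each one against the 0-255 range.
function parseBytes(input, radix) {
  const digitPatterns = {
    2: /^[01]+$/,
    8: /^[0-7]+$/,
    10: /^[0-9]+$/,
    16: /^[0-9a-fA-F]+$/,
  };
  const pattern = digitPatterns[radix];
  if (!pattern) throw new RangeError(`Unsupported radix: ${radix}`);

  // Assumption: tokens are separated by whitespace and/or commas.
  const tokens = input.trim().split(/[\s,]+/).filter(Boolean);
  const values = tokens.map((token, index) => {
    if (!pattern.test(token)) {
      throw new SyntaxError(`Token ${index} ("${token}") is not base ${radix}`);
    }
    const value = parseInt(token, radix);
    if (value > 255) {
      throw new RangeError(`Token ${index} ("${token}") is outside 0-255`);
    }
    return value;
  });
  return Uint8Array.from(values);
}

// Decode the parsed bytes with the standard TextDecoder API.
const euroBytes = parseBytes("E2 82 AC", 16);
const euroText = new TextDecoder("utf-8").decode(euroBytes);
console.log(euroText); // "€"
```

Checking the digit pattern before calling `parseInt` matters because `parseInt("1Z", 16)` would otherwise silently return 1.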

The tool exposes the internal bit structure of each code point: the leading-byte prefixes (0, 110, 1110, 11110) that signal the byte count, and the 10 prefix that marks each continuation byte, whose remaining 6 bits carry payload data. This matters when debugging malformed sequences, diagnosing mojibake from charset mismatches, or verifying that a system correctly handles multi-byte characters like CJK ideographs or emoji. Note: this tool assumes valid byte boundaries. It cannot recover data from streams split mid-sequence without context.


Formulas

UTF-8 is a variable-length encoding defined in RFC 3629. For multi-byte sequences, the number of leading 1 bits in the first byte equals the total byte count n for that code point; a first byte that starts with 0 is a complete 1-byte (ASCII) character. Each subsequent continuation byte begins with 10 and carries 6 payload bits.

Encoding rule for a code point U:

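$$
U \mapsto
\begin{cases}
\texttt{0xxxxxxx} & \text{1 byte, 7 bits, } U \le \texttt{0x7F} \\
\texttt{110xxxxx 10xxxxxx} & \text{2 bytes, 11 bits, } U \le \texttt{0x7FF} \\
\texttt{1110xxxx 10xxxxxx 10xxxxxx} & \text{3 bytes, 16 bits, } U \le \texttt{0xFFFF} \\
\texttt{11110xxx 10xxxxxx 10xxxxxx 10xxxxxx} & \text{4 bytes, 21 bits, } U \le \texttt{0x10FFFF}
\end{cases}
$$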
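The encoding rule can be written out by hand as a short function. This is a sketch for illustration; `encodeCodePoint` is a hypothetical name, and in practice the built-in `TextEncoder` does the same job. The surrogate check reflects RFC 3629, which excludes U+D800 to U+DFFF.

```javascript
// Hypothetical helper: encode one Unicode code point into UTF-8 bytes
// by following the 1- to 4-byte layout shown above.
function encodeCodePoint(cp) {
  if (!Number.isInteger(cp) || cp < 0 || cp > 0x10FFFF) {
    throw new RangeError("Not a Unicode code point");
  }
  if (cp >= 0xD800 && cp <= 0xDFFF) {
    throw new RangeError("Surrogates cannot be encoded in UTF-8");
  }
  if (cp <= 0x7F) {
    return [cp]; // 0xxxxxxx
  }
  if (cp <= 0x7FF) {
    return [0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)]; // 110xxxxx 10xxxxxx
  }
  if (cp <= 0xFFFF) {
    return [
      0xE0 | (cp >> 12), // 1110xxxx
      0x80 | ((cp >> 6) & 0x3F),
      0x80 | (cp & 0x3F),
    ];
  }
  return [
    0xF0 | (cp >> 18), // 11110xxx
    0x80 | ((cp >> 12) & 0x3F),
    0x80 | ((cp >> 6) & 0x3F),
    0x80 | (cp & 0x3F),
  ];
}

const toHex = (bytes) =>
  bytes.map((b) => b.toString(16).toUpperCase().padStart(2, "0")).join(" ");

console.log(toHex(encodeCodePoint(0x20AC)));  // "E2 82 AC"
console.log(toHex(encodeCodePoint(0x1F600))); // "F0 9F 98 80"
```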

Decoding extracts the payload bits x from each byte and reconstructs the code point:

$$
U = (b_0 \wedge \mathrm{mask}_0) \times 2^{6(n-1)} + \sum_{i=1}^{n-1} (b_i \wedge \texttt{0x3F}) \times 2^{6(n-1-i)}
$$

Where b0 is the leading byte, bi are the continuation bytes, n is the total byte count, ∧ denotes bitwise AND, and mask0 extracts the data bits from the leading byte (0x7F for 1-byte, 0x1F for 2-byte, 0x0F for 3-byte, 0x07 for 4-byte).
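The decoding formula translates directly into a loop. The sketch below accumulates the payload by shifting left 6 bits per continuation byte, which is equivalent to multiplying each term by the corresponding power of 2. `decodeSequence` is a hypothetical name, and the function deliberately performs no validation (no check for the 10 prefix, overlong forms, or surrogates); it only illustrates the arithmetic.

```javascript
// Hypothetical helper: rebuild a code point from the bytes of ONE sequence.
// Assumes the input is a well-formed sequence of 1 to 4 bytes.
function decodeSequence(bytes) {
  const n = bytes.length;
  const leadMasks = { 1: 0x7F, 2: 0x1F, 3: 0x0F, 4: 0x07 };
  if (!(n in leadMasks)) throw new RangeError("Expected 1 to 4 bytes");

  // Data bits of the leading byte.
  let codePoint = bytes[0] & leadMasks[n];

  // Each continuation byte contributes its low 6 bits.
  for (let i = 1; i < n; i++) {
    codePoint = (codePoint << 6) | (bytes[i] & 0x3F);
  }
  return codePoint;
}

console.log(decodeSequence([0xE4, 0xB8, 0x96]).toString(16));       // "4e16"
console.log(decodeSequence([0xF0, 0x9F, 0x98, 0x80]).toString(16)); // "1f600"
```

For E4 B8 96 the steps are 0x4, then 0x138, then 0x4E16, matching the formula term by term.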

Reference Data

| Byte Count | Leading Bits | Code Point Range | Hex Range | Payload Bits | Example Character | Byte Sequence (Hex) | Description |
|---|---|---|---|---|---|---|---|
| 1 | 0xxxxxxx | U+0000 - U+007F | 00 - 7F | 7 | A | 41 | ASCII compatible range |
| 2 | 110xxxxx 10xxxxxx | U+0080 - U+07FF | C2 80 - DF BF | 11 | é | C3 A9 | Latin extended, Greek, Cyrillic, Arabic, Hebrew |
| 3 | 1110xxxx 10xxxxxx 10xxxxxx | U+0800 - U+FFFF | E0 A0 80 - EF BF BF | 16 | 世 | E4 B8 96 | CJK ideographs, BMP symbols |
| 4 | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx | U+10000 - U+10FFFF | F0 90 80 80 - F4 8F BF BF | 21 | 😀 | F0 9F 98 80 | Emoji, historic scripts, math symbols |

Common Characters & Their Byte Representations

| Byte Count | Bit Pattern | Code Point | Hex Range | Payload Bits | Character | Byte Sequence (Hex) | Description |
|---|---|---|---|---|---|---|---|
| 1 | 00100000 | U+0020 | 20 | 7 | (space) | 20 | Space character |
| 1 | 00001010 | U+000A | 0A | 7 | \n | 0A | Line feed (newline) |
| 2 | 11000010 10101100 | U+00AC | C2 AC | 11 | ¬ | C2 AC | Not sign |
| 3 | 11100010 10000010 10101100 | U+20AC | E2 82 AC | 16 | € | E2 82 AC | Euro sign |
| 3 | 11101111 10111111 10111101 | U+FFFD | EF BF BD | 16 | � | EF BF BD | Replacement character (invalid sequence marker) |
| 3 | 11100010 10011100 10100000 | U+2720 | E2 9C A0 | 16 | ✠ | E2 9C A0 | Maltese cross (Dingbat) |
| 4 | 11110000 10011111 10100100 10101001 | U+1F929 | F0 9F A4 A9 | 21 | 🤩 | F0 9F A4 A9 | Star-struck emoji |
| 2 | 11010000 10000001 | U+0401 | D0 81 | 11 | Ё | D0 81 | Cyrillic capital IO |
| 3 | 11100011 10000001 10000010 | U+3042 | E3 81 82 | 16 | あ | E3 81 82 | Hiragana letter A |

Invalid / Forbidden Byte Values

| Bit Pattern | Hex Range | Description |
|---|---|---|
| 11111xxx | F8 - FF | Never valid in UTF-8 (would imply 5+ byte sequences) |
| 1100000x | C0 - C1 | Overlong encoding of ASCII (forbidden by RFC 3629) |
| 10xxxxxx | 80 - BF | Continuation byte without leading byte (orphan) |
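The byte ranges in the reference data can be turned into a small classifier for inspecting a single byte. This is a sketch with a hypothetical function name and label strings. One addition beyond the table: bytes F5 - F7 are also reported as invalid here, because they could only start sequences above U+10FFFF, which RFC 3629 excludes.

```javascript
// Hypothetical helper: classify one byte (0-255) by its role in UTF-8.
function classifyByte(b) {
  if (b <= 0x7F) return "ascii";        // 0xxxxxxx
  if (b <= 0xBF) return "continuation"; // 10xxxxxx
  if (b <= 0xC1) return "invalid";      // C0, C1: overlong encodings of ASCII
  if (b <= 0xDF) return "lead-2";       // 110xxxxx
  if (b <= 0xEF) return "lead-3";       // 1110xxxx
  if (b <= 0xF4) return "lead-4";       // 11110xxx, up to U+10FFFF
  return "invalid";                     // F5-FF
}

console.log([0x41, 0xE2, 0x82, 0xAC, 0xC0, 0xFF].map(classifyByte));
// [ 'ascii', 'lead-3', 'continuation', 'continuation', 'invalid', 'invalid' ]
```

A classifier like this only looks at one byte at a time; whether a sequence as a whole is well-formed still depends on the bytes that follow the leading byte.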

Frequently Asked Questions

The Unicode Standard (Chapter 3, definition D93b and the accompanying U+FFFD substitution guidance) recommends that a decoder replace each maximal subpart of an ill-formed subsequence with a single replacement character U+FFFD. For example, the sequence C0 AF is an overlong encoding (forbidden by RFC 3629). Because C0 can never begin a valid sequence, it is replaced on its own, and the following AF is then an orphan continuation byte, so the pair produces two U+FFFD characters. An orphan continuation byte like 80 appearing without a leading byte produces one U+FFFD. This tool uses the browser's TextDecoder with fatal mode off, which follows this replacement strategy, and reports each invalid position.
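The replacement behaviour can be observed directly with a non-fatal `TextDecoder`. The snippet below is a small demonstration under the assumption of a WHATWG-conformant runtime (modern browsers, Node.js); the variable names are illustrative. The counts in the comments follow the WHATWG decoding algorithm, where each invalid byte or truncated prefix yields one U+FFFD.

```javascript
// fatal defaults to false, so malformed input becomes U+FFFD instead of throwing.
const lenientDecoder = new TextDecoder("utf-8");

// C0 is never a valid lead byte and AF is then an orphan continuation byte.
const overlongResult = lenientDecoder.decode(new Uint8Array([0xC0, 0xAF]));

// A lone continuation byte between two ASCII letters.
const orphanResult = lenientDecoder.decode(new Uint8Array([0x41, 0x80, 0x42]));

// E2 82 is a valid prefix of a 3-byte sequence, cut short by ASCII 'A'.
const truncatedResult = lenientDecoder.decode(new Uint8Array([0xE2, 0x82, 0x41]));

console.log(overlongResult.length);           // 2 (two U+FFFD)
console.log(orphanResult === "A\uFFFDB");     // true
console.log(truncatedResult === "\uFFFDA");   // true (prefix replaced once)
```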
Overlong encoding uses more bytes than necessary for a code point. For example, encoding U+002F (the / character, normally 1 byte: 2F) as the 2-byte sequence C0 AF. While the bit layout is mechanically decodable, RFC 3629 explicitly forbids this because it creates security vulnerabilities: attackers could bypass string filters using alternate representations. Bytes C0 and C1 can never begin a valid UTF-8 sequence. Bytes F8 - FF are structurally invalid, as they would imply sequences of 5 or more bytes that UTF-8 does not define. Bytes F5 - F7 are likewise never valid, because they could only encode values above U+10FFFF.
UTF-8 has a detectable structure: every byte above 7F must fit the leading/continuation bit pattern. The bytes E2 82 AC decode to "€", which is valid UTF-8. Read as Windows-1252 (what text labeled ISO-8859-1 usually is in practice), the same bytes E2, 82, AC become three separate characters: â, ‚, ¬. In strict ISO-8859-1, 82 is a C1 control character rather than ‚. A practical test: decode as UTF-8 with the fatal flag set to true. If it throws, the data is not valid UTF-8. This tool provides that validation, and any malformed sequence is flagged with its byte position.
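The fatal-flag test mentioned above fits in a few lines. `isValidUtf8` is a hypothetical helper name; the only API used is the standard `TextDecoder` constructor option `fatal`, which makes `decode` throw a `TypeError` on malformed input.

```javascript
// Hypothetical helper: true if the bytes are well-formed UTF-8.
function isValidUtf8(bytes) {
  try {
    new TextDecoder("utf-8", { fatal: true }).decode(bytes);
    return true;
  } catch (err) {
    // Malformed input surfaces as a TypeError; rethrow anything else.
    if (err instanceof TypeError) return false;
    throw err;
  }
}

console.log(isValidUtf8(new Uint8Array([0xE2, 0x82, 0xAC]))); // true
console.log(isValidUtf8(new Uint8Array([0xE2, 0x82])));       // false (truncated)
console.log(isValidUtf8(new Uint8Array([0xC0, 0xAF])));       // false (overlong)
```

Note that passing this test only shows the bytes are structurally valid UTF-8. Pure ASCII input, for instance, is valid in many encodings at once.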
Most emoji sit in the supplementary planes, for example the Emoticons block starting at U+1F600, well beyond the Basic Multilingual Plane (BMP) limit of U+FFFF. The BMP covers the code points representable in at most 3 UTF-8 bytes (up to 16 payload bits). Code points from U+10000 to U+10FFFF need up to 21 payload bits, which demands the 4-byte format: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx. For example, 😀 (U+1F600) encodes as F0 9F 98 80. A few older emoji, such as ☺ (U+263A), are in the BMP and take only 3 bytes. In JavaScript, supplementary-plane characters are represented as surrogate pairs in UTF-16, which is why "😀".length returns 2, not 1.
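The difference between UTF-16 code units, code points, and UTF-8 bytes can be checked in any JavaScript console. The snippet uses only standard string methods and `TextEncoder`; the variable names are illustrative.

```javascript
const grin = "\u{1F600}"; // the grinning face emoji, U+1F600

const utf16Units = grin.length;            // UTF-16 code units (surrogate pair)
const codePoints = Array.from(grin).length; // string iteration is per code point
const codePointHex = grin.codePointAt(0).toString(16);
const utf8Hex = Array.from(new TextEncoder().encode(grin), (b) =>
  b.toString(16).toUpperCase().padStart(2, "0")
).join(" ");

console.log(utf16Units);   // 2
console.log(codePoints);   // 1
console.log(codePointHex); // "1f600"
console.log(utf8Hex);      // "F0 9F 98 80"
```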
The byte sequence EF BB BF is the UTF-8 encoding of U+FEFF (Zero Width No-Break Space), used as a BOM. While the Unicode Standard permits it in UTF-8, it is generally discouraged because UTF-8 has no byte-order ambiguity. This tool decodes it as a normal character and displays it in the output. It is visually zero-width, but the code point breakdown table will show its presence at U+FEFF, alerting you to unexpected BOMs that can cause issues in file concatenation or CSV parsing.
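A BOM can be detected on the raw bytes before decoding. The sketch below also shows a detail of the standard `TextDecoder` API that is easy to trip over: by default the decoder strips a leading BOM from its output, and the `ignoreBOM: true` option is what keeps U+FEFF in the result. `hasUtf8Bom` is a hypothetical helper name, and how this tool itself configures its decoder is not something the snippet can confirm.

```javascript
// Hypothetical helper: does the byte array start with EF BB BF?
function hasUtf8Bom(bytes) {
  return bytes.length >= 3 &&
    bytes[0] === 0xEF && bytes[1] === 0xBB && bytes[2] === 0xBF;
}

const bomSample = new Uint8Array([0xEF, 0xBB, 0xBF, 0x41]); // BOM + "A"

// Default: the BOM is consumed and does not appear in the string.
const bomStripped = new TextDecoder("utf-8").decode(bomSample);

// ignoreBOM: true means "do not treat the BOM specially", so U+FEFF is kept.
const bomKept = new TextDecoder("utf-8", { ignoreBOM: true }).decode(bomSample);

console.log(hasUtf8Bom(bomSample));   // true
console.log(bomStripped.length);      // 1
console.log(bomKept.codePointAt(0).toString(16)); // "feff"
```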
If the input ends with a leading byte (e.g., E4) but is missing its expected continuation bytes, the decoder treats it as an incomplete sequence and emits U+FFFD. This commonly occurs when reading fixed-size buffers from a stream where the buffer boundary splits a multi-byte character. This tool flags such cases with the specific byte index and expected vs. actual byte count. In production systems, you should buffer incomplete trailing bytes and prepend them to the next chunk.
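In JavaScript, the buffering described above does not have to be written by hand: the standard `TextDecoder.decode` method accepts a `stream: true` option that holds back an incomplete trailing sequence until the next call. The chunk contents below are made up for illustration and split the 3-byte sequence E4 B8 96 (世) across two buffers.

```javascript
const chunkA = new Uint8Array([0x41, 0xE4, 0xB8]); // "A" + first 2 bytes of 世
const chunkB = new Uint8Array([0x96, 0x42]);       // last byte of 世 + "B"

// Decoding each chunk independently corrupts the split character.
const naive =
  new TextDecoder("utf-8").decode(chunkA) +
  new TextDecoder("utf-8").decode(chunkB);

// A single decoder in streaming mode carries the partial sequence over.
const streamDecoder = new TextDecoder("utf-8");
const streamed =
  streamDecoder.decode(chunkA, { stream: true }) +
  streamDecoder.decode(chunkB, { stream: true }) +
  streamDecoder.decode(); // final call flushes any leftover bytes

console.log(naive === "A\uFFFD\uFFFDB"); // true (two replacement characters)
console.log(streamed === "A\u4E16B");    // true
```

The final argument-free `decode()` call matters: if the stream really does end mid-sequence, that is where the trailing U+FFFD is emitted.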