About

UTF-8 encodes all 1,112,064 Unicode scalar values (the 1,114,112 code points minus the 2,048 surrogates) using a variable-width scheme of 1 to 4 bytes per character. A single misinterpreted byte can decode to the replacement character U+FFFD, corrupting downstream text processing, database storage, and API responses. This tool decodes with the TextDecoder API and reports errors strictly. It parses raw byte input in hexadecimal, decimal, binary, or octal notation, validates each byte against the 0 - 255 range, and reconstructs the original string. Reverse mode encodes any Unicode string back to its constituent byte sequence.
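The parse, validate, and decode steps described above can be sketched as follows. This is a minimal illustration, not the tool's actual source: `parseBytes`, `decodeUtf8Strict`, and `encodeUtf8` are hypothetical helper names, and the token separators (whitespace or commas) are an assumption.

```javascript
// Hypothetical helper: parse byte tokens written in the given radix (2, 8, 10, or 16).
function parseBytes(input, radix) {
  const digits = "0123456789abcdef".slice(0, radix);
  const tokens = input.trim().split(/[\s,]+/).filter(Boolean);
  const values = tokens.map((token, index) => {
    const lower = token.toLowerCase();
    // parseInt silently stops at the first bad digit, so check every character first.
    if (![...lower].every((ch) => digits.includes(ch))) {
      throw new RangeError(`Token ${index} ("${token}") is not valid base-${radix}`);
    }
    const value = parseInt(lower, radix);
    if (value > 255) {
      throw new RangeError(`Token ${index} ("${token}") is outside 0-255`);
    }
    return value;
  });
  return Uint8Array.from(values);
}

// fatal: true makes decode() throw a TypeError on any ill-formed sequence.
function decodeUtf8Strict(bytes) {
  return new TextDecoder("utf-8", { fatal: true }).decode(bytes);
}

// Reverse mode: string to its UTF-8 byte values.
function encodeUtf8(text) {
  return Array.from(new TextEncoder().encode(text));
}

console.log(decodeUtf8Strict(parseBytes("E2 82 AC", 16))); // €
console.log(encodeUtf8("é")); // [ 195, 169 ]
```

Checking each character before calling `parseInt` matters because `parseInt("1G", 16)` would otherwise return 1 instead of failing.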

The tool exposes the internal bit structure of each code point: the leading-byte prefixes (0, 110, 1110, 11110) that signal the byte count, and the continuation bytes (prefix 10) that carry the remaining payload bits. This matters when debugging malformed sequences, diagnosing mojibake from charset mismatches, or verifying that a system correctly handles multi-byte characters like CJK ideographs or emoji. Note: this tool assumes valid byte boundaries. It cannot recover data from streams split mid-sequence without context.

Formulas

UTF-8 is a variable-length encoding defined in RFC 3629. A first byte that starts with 0 is a complete 1-byte sequence; otherwise, the number of leading 1 bits in the first byte gives the total byte count n for that code point (2, 3, or 4). Each subsequent continuation byte begins with 10 and carries 6 payload bits.

Encoding rule for a code point U:

1 byte: 0xxxxxxx → 7 bits → U ≤ 0x7F
2 bytes: 110xxxxx 10xxxxxx → 11 bits → U ≤ 0x7FF
3 bytes: 1110xxxx 10xxxxxx 10xxxxxx → 16 bits → U ≤ 0xFFFF
4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx → 21 bits → U ≤ 0x10FFFF
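The encoding rule can be written out directly with shifts and masks. The sketch below is illustrative only; `encodeCodePoint` is a hypothetical name, and rejecting surrogates (U+D800 - U+DFFF) is taken from RFC 3629 rather than from the rule as listed here.

```javascript
// Hypothetical helper: encode one Unicode scalar value into its UTF-8 bytes.
function encodeCodePoint(cp) {
  const isSurrogate = cp >= 0xd800 && cp <= 0xdfff;
  if (!Number.isInteger(cp) || cp < 0 || cp > 0x10ffff || isSurrogate) {
    throw new RangeError("Not a Unicode scalar value");
  }
  if (cp <= 0x7f) {
    return [cp]; // 0xxxxxxx
  }
  if (cp <= 0x7ff) {
    return [0xc0 | (cp >> 6), 0x80 | (cp & 0x3f)]; // 110xxxxx 10xxxxxx
  }
  if (cp <= 0xffff) {
    // 1110xxxx 10xxxxxx 10xxxxxx
    return [0xe0 | (cp >> 12), 0x80 | ((cp >> 6) & 0x3f), 0x80 | (cp & 0x3f)];
  }
  // 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
  return [
    0xf0 | (cp >> 18),
    0x80 | ((cp >> 12) & 0x3f),
    0x80 | ((cp >> 6) & 0x3f),
    0x80 | (cp & 0x3f),
  ];
}

const hex = (bytes) => bytes.map((b) => b.toString(16).toUpperCase().padStart(2, "0")).join(" ");
console.log(hex(encodeCodePoint(0x20ac)));  // E2 82 AC
console.log(hex(encodeCodePoint(0x1f600))); // F0 9F 98 80
```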

Decoding extracts the payload bits x from each byte and reconstructs the code point:

$$U = (b_0 \wedge \text{mask}_0) \times 2^{6(n-1)} + \sum_{i=1}^{n-1} (b_i \wedge \mathtt{0x3F}) \times 2^{6(n-1-i)}$$

Where b₀ is the leading byte, b_i are continuation bytes, n is total byte count, ∧ denotes bitwise AND, and mask₀ extracts the data bits from the leading byte (0x7F for 1-byte, 0x1F for 2-byte, 0x0F for 3-byte, 0x07 for 4-byte). For n = 1 the sum is empty and U is simply b₀ ∧ 0x7F.
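As a worked instance of the decoding formula, the following sketch evaluates it term by term. `decodeSequence` is a hypothetical name, and the function deliberately performs no validation (no prefix, overlong, or range checks): it only shows the arithmetic, assuming the input is exactly one well-formed sequence.

```javascript
// Hypothetical helper: apply the decoding formula to one well-formed sequence.
function decodeSequence(bytes) {
  const n = bytes.length;
  const leadMasks = { 1: 0x7f, 2: 0x1f, 3: 0x0f, 4: 0x07 };
  if (!(n in leadMasks)) {
    throw new RangeError("A UTF-8 sequence has 1 to 4 bytes");
  }
  // Leading byte term: (b0 AND mask0) * 2^(6(n-1))
  let codePoint = (bytes[0] & leadMasks[n]) * 2 ** (6 * (n - 1));
  // Continuation terms: (bi AND 0x3F) * 2^(6(n-1-i)) for i = 1 .. n-1
  for (let i = 1; i < n; i++) {
    codePoint += (bytes[i] & 0x3f) * 2 ** (6 * (n - 1 - i));
  }
  return codePoint;
}

console.log(decodeSequence([0xe2, 0x82, 0xac]).toString(16)); // 20ac
```

For E2 82 AC the terms are (0xE2 ∧ 0x0F) × 2¹² = 0x2000, (0x82 ∧ 0x3F) × 2⁶ = 0x80, and (0xAC ∧ 0x3F) = 0x2C, which sum to 0x20AC, the euro sign.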

Reference Data

Byte Count	Bit Pattern	Code Point Range	Hex Range	Payload Bits	Example Character	Byte Sequence (Hex)	Description
1	0xxxxxxx	U+0000 - U+007F	00 - 7F	7	A	41	ASCII compatible range
2	110xxxxx 10xxxxxx	U+0080 - U+07FF	C2 80 - DF BF	11	é	C3 A9	Latin extended, Greek, Cyrillic, Arabic, Hebrew
3	1110xxxx 10xxxxxx 10xxxxxx	U+0800 - U+FFFF	E0 A0 80 - EF BF BF	16	世	E4 B8 96	CJK ideographs, BMP symbols
4	11110xxx 10xxxxxx 10xxxxxx 10xxxxxx	U+10000 - U+10FFFF	F0 90 80 80 - F4 8F BF BF	21	😀	F0 9F 98 80	Emoji, historic scripts, math symbols
Common Characters & Their Byte Representations
1	00100000	U+0020	20	7	(space)	20	Space character
1	00001010	U+000A	0A	7	\n	0A	Line feed (newline)
2	11000010 10101100	U+00AC	C2 AC	11	¬	C2 AC	Not sign
3	11100010 10000010 10101100	U+20AC	E2 82 AC	16	€	E2 82 AC	Euro sign
3	11101111 10111111 10111101	U+FFFD	EF BF BD	16	�	EF BF BD	Replacement character (invalid sequence marker)
3	11100010 10011100 10100000	U+2720	E2 9C A0	16	✠	E2 9C A0	Maltese cross (Dingbat)
4	11110000 10011111 10100100 10101001	U+1F929	F0 9F A4 A9	21	🤩	F0 9F A4 A9	Star-struck emoji
2	11010000 10000001	U+0401	D0 81	11	Ё	D0 81	Cyrillic capital IO
3	11100011 10000001 10000010	U+3042	E3 81 82	16	あ	E3 81 82	Hiragana letter A
Invalid / Forbidden Byte Values
-	11111xxx	-	F8 - FF	-	-	-	Never valid in UTF-8 (would imply 5+ byte sequences)
-	1100000x	-	C0 - C1	-	-	-	Overlong encoding of ASCII (forbidden by RFC 3629)
-	10xxxxxx	-	80 - BF	-	-	-	Continuation byte without leading byte (orphan)
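The byte classes in the table can be told apart with a handful of range checks. The sketch below uses a hypothetical `classifyByte` function with made-up label strings. It also marks F5 - F7 as invalid, which goes slightly beyond the rows above: those bytes match the 11110xxx pattern but could only encode values above U+10FFFF, which RFC 3629 excludes.

```javascript
// Hypothetical helper: classify a single byte by its role in UTF-8.
function classifyByte(b) {
  if (!Number.isInteger(b) || b < 0 || b > 0xff) {
    throw new RangeError("Expected a byte value in 0-255");
  }
  if (b <= 0x7f) return "ascii";        // 0xxxxxxx
  if (b <= 0xbf) return "continuation"; // 10xxxxxx
  if (b <= 0xc1) return "invalid";      // C0, C1: could only start overlong forms
  if (b <= 0xdf) return "lead-2";       // 110xxxxx
  if (b <= 0xef) return "lead-3";       // 1110xxxx
  if (b <= 0xf4) return "lead-4";       // 11110xxx, limited to U+10FFFF
  return "invalid";                     // F5 - FF
}

console.log([0x41, 0xc3, 0xa9, 0xc0, 0xf8].map(classifyByte));
// [ 'ascii', 'lead-2', 'continuation', 'invalid', 'invalid' ]
```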

Frequently Asked Questions

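The Unicode Standard (Chapter 3, "U+FFFD Substitution of Maximal Subparts", definition D93b) recommends that a decoder replace each maximal subpart of an ill-formed subsequence with a single replacement character U+FFFD; the WHATWG Encoding Standard, which TextDecoder implements, follows this practice. For example, an orphan continuation byte like 80 appearing without a leading byte produces one U+FFFD, and a truncated sequence such as E2 82 followed by an ASCII byte also collapses to one U+FFFD. The overlong sequence C0 AF (forbidden by RFC 3629) produces two: C0 can never start a valid sequence, so it is replaced on its own, and AF is then an orphan continuation byte. This tool uses the browser's TextDecoder with fatal mode off, which follows this replacement strategy, and reports each invalid position.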
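The replacement behavior is easy to observe with a non-fatal TextDecoder. The sketch below counts U+FFFD characters in the decoded output; `countReplacements` is a hypothetical helper, and the count would also include any genuine U+FFFD (EF BF BD) present in the input. Under the WHATWG decoding rules that TextDecoder follows, C0 AF yields two replacement characters, because C0 is rejected on its own and AF is then an orphan continuation byte.

```javascript
// fatal defaults to false, so ill-formed input is replaced instead of throwing.
const lenientDecoder = new TextDecoder("utf-8");

// Hypothetical helper: how many U+FFFD characters does decoding produce?
function countReplacements(byteValues) {
  const text = lenientDecoder.decode(Uint8Array.from(byteValues));
  return [...text].filter((ch) => ch === "\uFFFD").length;
}

console.log(countReplacements([0x80]));             // 1: orphan continuation byte
console.log(countReplacements([0xe2, 0x82, 0x41])); // 1: truncated 3-byte sequence, then "A"
console.log(countReplacements([0xc0, 0xaf]));       // 2: invalid lead byte, then orphan
```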

Overlong encoding uses more bytes than necessary for a code point, for example encoding U+002F (the / character, normally 1 byte: 2F) as the 2-byte sequence C0 AF. A naive decoder that only strips the prefix bits would accept it, but RFC 3629 explicitly forbids this because it creates security vulnerabilities: attackers could bypass string filters using alternate representations. Bytes C0 and C1 can never begin a valid UTF-8 sequence. Bytes F5 - F7 are invalid because they could only encode values above U+10FFFF, and bytes F8 - FF are structurally invalid as they would imply 5-to-8-byte sequences that UTF-8 does not define.

UTF-8 has a detectable structure: any byte above 7F must follow the leading/continuation bit pattern. If you see bytes like E2 82 AC decoding to "€", that is valid UTF-8. In Windows-1252 (the encoding browsers actually apply to content labeled ISO-8859-1), the same bytes E2, 82, AC would be three separate characters: â, ‚, ¬. In strict ISO-8859-1, 82 is an invisible C1 control character rather than ‚. A practical test: decode as UTF-8 with the fatal flag set to true. If it throws, the data is not valid UTF-8. This tool provides that validation: any malformed sequence is flagged with its byte position.
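The fatal-flag test mentioned above might look like this. `isValidUtf8` is a hypothetical helper name; note that it only answers yes or no and does not report the byte position of the failure, which TextDecoder itself does not expose.

```javascript
// Hypothetical helper: true if the bytes form well-formed UTF-8.
function isValidUtf8(bytes) {
  try {
    new TextDecoder("utf-8", { fatal: true }).decode(bytes);
    return true;
  } catch (err) {
    // A fatal decoder signals ill-formed input with a TypeError.
    if (err instanceof TypeError) return false;
    throw err;
  }
}

console.log(isValidUtf8(Uint8Array.of(0xe2, 0x82, 0xac))); // true
console.log(isValidUtf8(Uint8Array.of(0xe9)));             // false
```

The second call fails because E9 is "é" in a single-byte Latin encoding, but in UTF-8 it is a 3-byte lead byte with no continuation bytes following it.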

Most emoji live in the supplementary planes: the Emoticons block, for example, starts at U+1F600, well beyond the Basic Multilingual Plane (BMP) limit of U+FFFF. The BMP covers code points representable in at most 3 UTF-8 bytes (up to 16 payload bits). Supplementary code points run from U+10000 to U+10FFFF and need up to 21 payload bits, which demands the 4-byte format: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx. For example, 😀 (U+1F600) encodes as F0 9F 98 80. JavaScript strings are UTF-16, so these code points are stored as surrogate pairs, which is why "😀".length returns 2, not 1.
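The three different "lengths" of a supplementary-plane character can be compared side by side. In this sketch, `measure` is a hypothetical helper that reports UTF-16 code units, code points, and UTF-8 bytes for a string.

```javascript
// Hypothetical helper: compare the three common ways to measure a string.
function measure(text) {
  return {
    utf16Units: text.length,                           // what String.prototype.length counts
    codePoints: [...text].length,                      // string iteration is per code point
    utf8Bytes: new TextEncoder().encode(text).length,  // size on the wire in UTF-8
  };
}

console.log(measure("😀")); // { utf16Units: 2, codePoints: 1, utf8Bytes: 4 }
console.log(measure("€"));  // { utf16Units: 1, codePoints: 1, utf8Bytes: 3 }
console.log("😀".codePointAt(0).toString(16)); // 1f600
```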

The byte sequence EF BB BF is the UTF-8 encoding of U+FEFF (Zero Width No-Break Space), used as a BOM. While the Unicode Standard permits it in UTF-8, it is generally discouraged because UTF-8 has no byte-order ambiguity. This tool decodes it as a normal character and displays it in the output. It is visually zero-width, but the code point breakdown table will show its presence at U+FEFF, alerting you to unexpected BOMs that can cause issues in file concatenation or CSV parsing.
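One detail worth knowing when reproducing this with TextDecoder directly: by default the decoder silently strips a leading BOM, and the `ignoreBOM: true` option is what keeps U+FEFF in the output. Whether this tool sets that option is an assumption here; the sketch only shows the two behaviors, plus a hypothetical `hasUtf8Bom` byte-level check.

```javascript
// Hypothetical helper: does the buffer start with the UTF-8 BOM bytes EF BB BF?
function hasUtf8Bom(bytes) {
  return bytes.length >= 3 && bytes[0] === 0xef && bytes[1] === 0xbb && bytes[2] === 0xbf;
}

const withBom = Uint8Array.of(0xef, 0xbb, 0xbf, 0x41); // BOM followed by "A"

// Default: the BOM is consumed and does not appear in the result.
const strippedText = new TextDecoder("utf-8").decode(withBom);
// ignoreBOM: true keeps U+FEFF as an ordinary character.
const keptText = new TextDecoder("utf-8", { ignoreBOM: true }).decode(withBom);

console.log(hasUtf8Bom(withBom));  // true
console.log(strippedText.length);  // 1
console.log(keptText.length);      // 2
console.log(keptText.codePointAt(0).toString(16)); // feff
```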

If the input ends with a leading byte (e.g., E4) but is missing its expected continuation bytes, the decoder treats it as an incomplete sequence and emits U+FFFD. This commonly occurs when reading fixed-size buffers from a stream where the buffer boundary splits a multi-byte character. This tool flags such cases with the specific byte index and expected vs. actual byte count. In production systems, you should buffer incomplete trailing bytes and prepend them to the next chunk.
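With TextDecoder, the buffering advice above does not have to be implemented by hand: passing `{ stream: true }` makes the decoder hold back an incomplete trailing sequence and finish it with the next chunk. A minimal sketch, with `decodeChunks` as a hypothetical helper name:

```javascript
// Hypothetical helper: decode a list of Uint8Array chunks as one UTF-8 stream.
function decodeChunks(chunks) {
  const decoder = new TextDecoder("utf-8");
  let text = "";
  for (const chunk of chunks) {
    // stream: true keeps an incomplete trailing sequence buffered inside the decoder.
    text += decoder.decode(chunk, { stream: true });
  }
  // Final call without stream flushes; a sequence still incomplete becomes U+FFFD.
  text += decoder.decode();
  return text;
}

// E4 B8 96 is "世", split across two chunks.
const firstChunk = Uint8Array.of(0xe4, 0xb8);
const secondChunk = Uint8Array.of(0x96);

console.log(decodeChunks([firstChunk, secondChunk])); // 世

// Decoding each chunk independently corrupts the character instead.
const naive = new TextDecoder().decode(firstChunk) + new TextDecoder().decode(secondChunk);
console.log(naive === "\uFFFD\uFFFD"); // true
```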