About

Incorrect byte interpretation corrupts data silently. A single misidentified encoding transforms valid text into mojibake - garbled characters that propagate through databases, APIs, and file systems. This tool decodes raw byte sequences (hex, decimal, binary, octal) into Unicode characters using the actual TextDecoder API with support for UTF-8, UTF-16LE, UTF-16BE, ISO-8859-1, Windows-1252, and ASCII. It parses multi-byte sequences correctly, handles surrogate pairs in UTF-16, and reports each character's codepoint (U+XXXX), byte length, and category. The tool assumes well-formed input per the selected encoding. Malformed sequences produce the replacement character U+FFFD rather than failing silently.

Pro tip: if you receive bytes from a network protocol or binary file and the output looks wrong, try ISO-8859-1 first - it maps every single byte (0x00 - 0xFF) to a valid codepoint, confirming your raw data before re-decoding with the correct encoding. Remember that UTF-8 uses 1 - 4 bytes per character. Feeding a UTF-16 stream to a UTF-8 decoder will produce nonsense, not an error.

Formulas

Byte-to-character decoding follows the encoding's mapping rules. For UTF-8, the byte length of a character is determined by the leading byte pattern:

{

1 byte: 0xxxxxxx → U+0000 - U+007F2 bytes: 110xxxxx 10xxxxxx → U+0080 - U+07FF3 bytes: 1110xxxx 10xxxxxx 10xxxxxx → U+0800 - U+FFFF4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx → U+10000 - U+10FFFF

The codepoint cp is extracted by masking the payload bits and shifting. For a 2-byte sequence with lead byte b₀ and continuation byte b₁:

cp = (b₀ ∧ 0x1F) << 6 + (b₁ ∧ 0x3F)

For UTF-16, codepoints above U+FFFF use surrogate pairs. High surrogate H and low surrogate L reconstruct the codepoint:

cp = (H − 0xD800) × 0x400 + (L − 0xDC00) + 0x10000

Where H ∈ [0xD800, 0xDBFF] and L ∈ [0xDC00, 0xDFFF]. Single-byte encodings (ISO-8859-1, Windows-1252) use a direct 1:1 lookup table mapping each byte value to its codepoint.

Reference Data

Encoding	Byte Range	Max Bytes/Char	BOM	Coverage	Common Use
ASCII	0x00 - 0x7F	1	None	128 characters	Legacy protocols, plain English text
UTF-8	0x00 - 0xF4 (lead)	4	0xEF 0xBB 0xBF	All Unicode (1,114,112 codepoints)	Web (HTML, JSON, XML), Linux, modern APIs
UTF-16LE	0x00 - 0xFF per byte	4	0xFF 0xFE	All Unicode	Windows internals, Java strings, .NET
UTF-16BE	0x00 - 0xFF per byte	4	0xFE 0xFF	All Unicode	Network protocols (big-endian default)
ISO-8859-1	0x00 - 0xFF	1	None	256 characters	Western European, HTTP default fallback
Windows-1252	0x00 - 0xFF	1	None	251 characters	Legacy Windows apps, email, old Word docs
UTF-32LE	4 bytes fixed	4	0xFF 0xFE 0x00 0x00	All Unicode	Internal processing, Python 3 (CPython)
UTF-32BE	4 bytes fixed	4	0x00 0x00 0xFE 0xFF	All Unicode	Rare, academic use
KOI8-R	0x00 - 0xFF	1	None	Russian Cyrillic	Unix/Linux Russian locale, older email
Shift_JIS	0x00 - 0xFF	2	None	JIS X 0208 + ASCII	Japanese Windows, legacy web pages
EUC-KR	0x00 - 0xFF	2	None	KS X 1001 + ASCII	Korean legacy systems
GB2312 / GBK	0x00 - 0xFF	2	None	Simplified Chinese	Chinese websites, legacy systems
Big5	0x00 - 0xFF	2	None	Traditional Chinese	Taiwan/HK legacy systems
ISO-8859-5	0x00 - 0xFF	1	None	Cyrillic subset	Eastern European legacy
ISO-8859-15	0x00 - 0xFF	1	None	256 characters (€ sign added)	Updated Western European

Frequently Asked Questions

The TextDecoder API inserts the Unicode Replacement Character U+FFFD (�) at each position where a malformed or incomplete sequence is detected. This is the standard behavior defined by the Unicode specification. If you see clusters of � in the output, your bytes likely belong to a different encoding than the one selected. Try ISO-8859-1 first to see the raw byte-to-codepoint mapping, then switch to the correct encoding.

If your input begins with BOM bytes (e.g., 0xEF 0xBB 0xBF for UTF-8, or 0xFF 0xFE for UTF-16LE), the TextDecoder will consume them silently and not include the BOM character in the output. This matches real-world decoding behavior. If you explicitly want to see the BOM character U+FEFF, select ISO-8859-1 encoding which treats every byte independently.

Each encoding defines its own mapping table from byte values to Unicode codepoints. For example, byte 0xE4 maps to U+00E4 (ä) in ISO-8859-1 but is a UTF-8 lead byte indicating a 3-byte sequence starting a CJK character. In Windows-1252, byte 0x80 maps to the Euro sign € (U+20AC), while ISO-8859-1 defines it as a control character. The encoding is not metadata attached to the bytes - it is an external contract between sender and receiver.

Yes. Emoji such as 😀 (Grinning Face) are encoded as 4 bytes in UTF-8: 0xF0 0x9F 0x98 0x80. Enter these bytes in any supported format (hex, decimal, binary) and select UTF-8 encoding. The tool will correctly reconstruct codepoint U+1F600 and display the emoji. In UTF-16, the same character requires a surrogate pair: 0xD83D 0xDE00 (4 bytes total, 2 code units).

The converter accepts four formats, auto-detected by prefix: hexadecimal (0xFF or plain FF), decimal (255), octal (0o377), and binary (0b11111111). Values can be separated by spaces, commas, or newlines. Mixed formats are supported in a single input. Each token must resolve to a value in the range 0 - 255 (0x00 - 0xFF). Values outside this range are flagged as errors.

Use the Reverse Mode toggle. Paste or type Unicode text, select the target encoding, and the tool outputs the corresponding byte sequence in your chosen format (hex, decimal, binary, octal). This is useful for verifying round-trip consistency: convert bytes → text, then text → bytes, and confirm the byte sequences match. Note that some characters cannot be represented in single-byte encodings. For example, the character æ (U+00E6) exists in ISO-8859-1 but not in ASCII, which will produce an encoding error.