
About

Binary-to-Unicode conversion is not a trivial bit-flip. A single misaligned byte in a UTF-8 stream corrupts every subsequent character. UTF-8 uses variable-width encoding: an ASCII character fits in a single byte, but a Chinese character like "漢" requires 3 bytes (24 bits), and an emoji like "🚀" demands 4 bytes (32 bits). This tool implements the full decoding pipelines for RFC 3629 (UTF-8), RFC 2781 (UTF-16, with surrogate pair handling for code points above U+FFFF), and UTF-32. It validates continuation byte patterns, rejects overlong encodings, and flags orphaned surrogates instead of silently producing garbage.
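These variable widths are easy to confirm in any language with Unicode support; a quick Python illustration (not the tool's own code):

```python
# Byte lengths of the same characters under the three encodings.
for ch in ("A", "漢", "🚀"):
    print(ch,
          len(ch.encode("utf-8")),      # A: 1, 漢: 3, 🚀: 4 bytes
          len(ch.encode("utf-16-be")),  # A: 2, 漢: 2, 🚀: 4 (surrogate pair)
          len(ch.encode("utf-32-be")))  # always 4 bytes
```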

Manual conversion with a calculator is error-prone because you must track byte boundaries, endianness, and multi-byte state across potentially thousands of bits. A miscount of even one bit shifts the entire frame. This converter auto-detects delimiters, validates binary input structure against the selected encoding, and reports the exact position of malformed sequences. It also operates in reverse: paste any Unicode text to get its precise binary representation in your chosen encoding. Limitations: this tool assumes big-endian byte order for UTF-16 and UTF-32. It does not handle Byte Order Marks (BOM) in input.
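The reverse direction (text to binary) amounts to encoding the text and printing each byte as an 8-bit group. A minimal sketch, where `text_to_binary` is a hypothetical helper and not part of this tool:

```python
def text_to_binary(text: str, encoding: str = "utf-8") -> str:
    """Render text as space-separated 8-bit groups."""
    return " ".join(f"{byte:08b}" for byte in text.encode(encoding))

print(text_to_binary("Hi"))  # 01001000 01101001
```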


Formulas

UTF-8 encodes code points into 1 to 4 bytes. The leading byte determines the sequence length. Continuation bytes always match the pattern 10xxxxxx.

0xxxxxxx                             if U ≤ 007F    (1 byte)
110xxxxx 10xxxxxx                    if U ≤ 07FF    (2 bytes)
1110xxxx 10xxxxxx 10xxxxxx           if U ≤ FFFF    (3 bytes)
11110xxx 10xxxxxx 10xxxxxx 10xxxxxx  if U ≤ 10FFFF  (4 bytes)
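The template translates almost mechanically into code. A sketch of the encoder side (it fills the x bits most significant first but, for brevity, omits the surrogate-range rejection a full RFC 3629 implementation requires):

```python
def utf8_encode(cp: int) -> bytes:
    """Encode a code point per the UTF-8 byte templates."""
    if cp <= 0x7F:
        return bytes([cp])                    # 0xxxxxxx
    if cp <= 0x7FF:
        return bytes([0xC0 | cp >> 6,         # 110xxxxx
                      0x80 | cp & 0x3F])      # 10xxxxxx
    if cp <= 0xFFFF:
        return bytes([0xE0 | cp >> 12,
                      0x80 | cp >> 6 & 0x3F,
                      0x80 | cp & 0x3F])
    if cp <= 0x10FFFF:
        return bytes([0xF0 | cp >> 18,
                      0x80 | cp >> 12 & 0x3F,
                      0x80 | cp >> 6 & 0x3F,
                      0x80 | cp & 0x3F])
    raise ValueError("code point out of range")

print(utf8_encode(0x6F22).hex())  # e6bca2 — the three bytes of 漢
```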

For UTF-16 surrogate pairs (code points above U+FFFF):

Uβ€² = U βˆ’ 0x10000
High Surrogate = 0xD800 + (Uβ€² >> 10)
Low Surrogate = 0xDC00 + (Uβ€² & 0x3FF)
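The same arithmetic in Python (a hypothetical helper, shown for illustration):

```python
def to_surrogate_pair(cp: int) -> tuple[int, int]:
    """Split a code point above U+FFFF into UTF-16 high/low surrogates."""
    u = cp - 0x10000
    return 0xD800 + (u >> 10), 0xDC00 + (u & 0x3FF)

high, low = to_surrogate_pair(0x1F680)  # 🚀
print(hex(high), hex(low))              # 0xd83d 0xde80
```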

Where U is the Unicode code point (hexadecimal). The x bits in the UTF-8 template are filled with the binary representation of the code point, most significant bit first. UTF-32 is a direct 32-bit representation of the code point with zero-padding.

Reference Data

Character | Code Point | UTF-8 Binary | UTF-8 Bytes | UTF-16 Binary | UTF-16 Units | UTF-32 Binary
--------- | ---------- | ------------ | ----------- | ------------- | ------------ | -------------
A | U+0041 | 01000001 | 1 | 00000000 01000001 | 1 | 00000000 00000000 00000000 01000001
é | U+00E9 | 11000011 10101001 | 2 | 00000000 11101001 | 1 | 00000000 00000000 00000000 11101001
€ | U+20AC | 11100010 10000010 10101100 | 3 | 00100000 10101100 | 1 | 00000000 00000000 00100000 10101100
漢 | U+6F22 | 11100110 10111100 10100010 | 3 | 01101111 00100010 | 1 | 00000000 00000000 01101111 00100010
𐍈 | U+10348 | 11110000 10010000 10001101 10001000 | 4 | 11011000 00000000 11011111 01001000 | 2 (surrogate pair) | 00000000 00000001 00000011 01001000
🚀 | U+1F680 | 11110000 10011111 10011010 10000000 | 4 | 11011000 00111101 11011110 10000000 | 2 (surrogate pair) | 00000000 00000001 11110110 10000000
♠ | U+2660 | 11100010 10011001 10100000 | 3 | 00100110 01100000 | 1 | 00000000 00000000 00100110 01100000
Ω | U+03A9 | 11001110 10101001 | 2 | 00000011 10101001 | 1 | 00000000 00000000 00000011 10101001
→ | U+2192 | 11100010 10000110 10010010 | 3 | 00100001 10010010 | 1 | 00000000 00000000 00100001 10010010
∞ | U+221E | 11100010 10001000 10011110 | 3 | 00100010 00011110 | 1 | 00000000 00000000 00100010 00011110
NUL | U+0000 | 00000000 | 1 | 00000000 00000000 | 1 | 00000000 00000000 00000000 00000000
DEL | U+007F | 01111111 | 1 | 00000000 01111111 | 1 | 00000000 00000000 00000000 01111111
Space | U+0020 | 00100000 | 1 | 00000000 00100000 | 1 | 00000000 00000000 00000000 00100000
© | U+00A9 | 11000010 10101001 | 2 | 00000000 10101001 | 1 | 00000000 00000000 00000000 10101001
中 | U+4E2D | 11100100 10111000 10101101 | 3 | 01001110 00101101 | 1 | 00000000 00000000 01001110 00101101
🎵 | U+1F3B5 | 11110000 10011111 10001110 10110101 | 4 | 11011000 00111100 11011111 10110101 | 2 (surrogate pair) | 00000000 00000001 11110011 10110101
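Rows like these can be regenerated for any character; a short Python sketch (the `binary_groups` helper is illustrative, not part of the tool):

```python
def binary_groups(data: bytes) -> str:
    """Format bytes as space-separated 8-bit binary groups."""
    return " ".join(f"{b:08b}" for b in data)

ch = "€"
print(f"U+{ord(ch):04X}")                     # U+20AC
print(binary_groups(ch.encode("utf-8")))      # 11100010 10000010 10101100
print(binary_groups(ch.encode("utf-16-be")))  # 00100000 10101100
print(binary_groups(ch.encode("utf-32-be")))
```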

Frequently Asked Questions

What happens if I select the wrong encoding?

UTF-8 uses variable-width encoding. If your binary was generated using UTF-16 (16-bit units) or UTF-32 (32-bit units), selecting UTF-8 will misinterpret byte boundaries. A UTF-16 encoded "A" is 00000000 01000001 (16 bits), but UTF-8 would read those as two separate bytes: the first (00000000) as a NUL character. Always match the encoding to how the binary was originally produced.
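The NUL example is easy to reproduce; in Python, for instance:

```python
data = "A".encode("utf-16-be")  # b'\x00A' — 00000000 01000001
text = data.decode("utf-8")     # UTF-8 reads the same bytes one at a time
print([hex(ord(c)) for c in text])  # ['0x0', '0x41'] — a NUL, then 'A'
```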
How are emoji and other characters above U+FFFF handled?

Characters above U+FFFF (such as emoji, historic scripts, and mathematical symbols) require 4 bytes in UTF-8 and a surrogate pair (two 16-bit units) in UTF-16. The converter fully implements surrogate pair encoding and decoding per RFC 2781. In UTF-32, these code points are simply zero-padded to 32 bits. If you see a replacement character (U+FFFD), it means the binary sequence was malformed for that encoding.
What is an overlong encoding, and why is it rejected?

An overlong encoding represents a code point using more bytes than necessary: for example, encoding U+002F (/) as C0 AF (2 bytes) instead of the correct single byte 2F. This is a security vulnerability (used in directory traversal attacks). This converter rejects overlong encodings and flags them as invalid, per RFC 3629 Section 3.
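Most strict decoders behave the same way; Python's built-in UTF-8 codec, for example, refuses the overlong form:

```python
overlong = b"\xC0\xAF"  # overlong encoding of U+002F ('/')
try:
    overlong.decode("utf-8")
except UnicodeDecodeError as err:
    print("rejected:", err.reason)  # 0xC0 is never a valid UTF-8 start byte
```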
Can I convert binary input that has no delimiters?

Yes. Select the "None (fixed-width)" delimiter option. The converter will split the input into chunks based on the encoding: 8-bit chunks for UTF-8, 16-bit for UTF-16, and 32-bit for UTF-32. The total number of bits must be evenly divisible by the chunk size, or the converter will report an error with the exact count mismatch.
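Fixed-width splitting is simple to sketch; `split_bits` below is a hypothetical helper mirroring the behavior described, not the tool's own code:

```python
def split_bits(bits: str, width: int) -> list[str]:
    """Split a bit string into fixed-width groups, rejecting ragged input."""
    if len(bits) % width:
        raise ValueError(f"{len(bits)} bits is not a multiple of {width}")
    return [bits[i:i + width] for i in range(0, len(bits), width)]

print(split_bits("0100100001101001", 8))  # ['01001000', '01101001'] -> "Hi"
```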
Why are code points U+D800 through U+DFFF invalid on their own?

The range U+D800 to U+DFFF is permanently reserved for UTF-16 surrogate pairs. These are not valid Unicode scalar values and must never appear as standalone code points. If a UTF-32 binary input decodes to a value in this range, or a UTF-8 stream encodes one of these values, the converter flags it as an error. This range contains exactly 2048 code points that are structurally excluded from the Unicode character set.
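Strict encoders enforce this exclusion; Python, for example, refuses to UTF-8-encode a lone surrogate:

```python
lone = chr(0xD800)  # a lone high surrogate
try:
    lone.encode("utf-8")
except UnicodeEncodeError as err:
    print("rejected:", err.reason)  # surrogates are not scalar values

print(0xDFFF - 0xD800 + 1)  # 2048 reserved code points
```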
Does the tool support little-endian input?

This tool assumes big-endian byte order for UTF-16 and UTF-32 inputs. In big-endian, the most significant byte comes first. If your binary data was produced in little-endian format (common on x86 systems), you need to reverse the byte order within each 16-bit or 32-bit unit before conversion. UTF-8 is byte-order independent because it is defined as a stream of individual bytes.
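For a single 16-bit unit, the little-endian fix is just a byte swap, as a quick Python check shows:

```python
be = "€".encode("utf-16-be")  # b'\x20\xac' — most significant byte first
le = "€".encode("utf-16-le")  # b'\xac\x20' — least significant byte first
assert le == bytes(reversed(be))  # for one unit, swapping == full reversal
print(be.hex(), le.hex())  # 20ac ac20
```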