About

ASCII encodes 128 characters in 7 bits. UTF-8 extends this to 1,112,064 valid codepoints using a variable-width scheme of 1 to 4 bytes per character. Every valid ASCII string is already valid UTF-8 because UTF-8 preserves the ASCII range (U+0000 to U+007F) as single-byte sequences. The real complexity arises when text contains characters beyond codepoint 127. A misidentified encoding produces mojibake: garbled output caused by interpreting bytes under the wrong scheme. This tool performs real encoding via the browser's native TextEncoder API, exposes the raw byte structure, and flags every non-ASCII character so you can diagnose encoding issues before they corrupt a database or break a data pipeline.
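The encoding step described above can be sketched in a few lines using the standard TextEncoder API (available in browsers and Node.js). The helper name `toUtf8Hex` is illustrative, not the tool's actual code:

```javascript
// TextEncoder always encodes to UTF-8.
const encoder = new TextEncoder();

function toUtf8Hex(text) {
  // Returns the UTF-8 bytes of `text` as uppercase hex strings.
  const bytes = encoder.encode(text); // Uint8Array
  return Array.from(bytes, b => b.toString(16).toUpperCase().padStart(2, "0"));
}

console.log(toUtf8Hex("A"));  // ["41"]                 1 byte, pure ASCII
console.log(toUtf8Hex("€"));  // ["E2", "82", "AC"]     3 bytes
console.log(toUtf8Hex("😀")); // ["F0", "9F", "98", "80"] 4 bytes
```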

Limitations: this tool operates on valid Unicode strings as represented by JavaScript's internal UTF-16. Lone surrogates and byte sequences that do not form valid UTF-8 will be replaced with the replacement character U+FFFD. If you are debugging raw binary files, examine the hex dump output rather than the decoded text. Pro tip: CSV imports fail silently on encoding mismatch. Validate your source encoding here before bulk inserts.
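The replacement behavior is easy to observe directly. This snippet (an illustration of the limitation above, not the tool's internals) encodes a lone high surrogate, which TextEncoder substitutes with U+FFFD:

```javascript
// "\uD800" is a lone high surrogate with no trailing pair member, so it
// cannot be encoded as UTF-8. TextEncoder emits U+FFFD instead,
// whose UTF-8 encoding is EF BF BD.
const bytes = new TextEncoder().encode("\uD800");
console.log([...bytes].map(b => b.toString(16).toUpperCase())); // ["EF", "BF", "BD"]
```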


Formulas

UTF-8 encodes Unicode codepoints into variable-length byte sequences. The number of bytes depends on the codepoint range:

1 byte:  0xxxxxxx                            → U+0000 to U+007F  (128 chars, pure ASCII)
2 bytes: 110xxxxx 10xxxxxx                   → U+0080 to U+07FF
3 bytes: 1110xxxx 10xxxxxx 10xxxxxx          → U+0800 to U+FFFF
4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx → U+10000 to U+10FFFF
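The four ranges above can be sketched as a small lookup function (`utf8ByteCount` is a hypothetical name for illustration; it assumes `cp` is a valid scalar value, not a lone surrogate):

```javascript
// UTF-8 byte count by codepoint range.
function utf8ByteCount(cp) {
  if (cp <= 0x7F)   return 1; // 0xxxxxxx
  if (cp <= 0x7FF)  return 2; // 110xxxxx 10xxxxxx
  if (cp <= 0xFFFF) return 3; // 1110xxxx 10xxxxxx 10xxxxxx
  return 4;                   // 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
}

console.log(utf8ByteCount("€".codePointAt(0))); // 3
```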

The encoding formula extracts the codepoint value cp and distributes its bits into the template above. For a 2-byte character:

Byte 1 = 0xC0 ∨ (cp >> 6)
Byte 2 = 0x80 ∨ (cp ∧ 0x3F)

Where cp = Unicode codepoint (integer), ∨ is bitwise OR, the >> operator performs a bitwise right shift, and ∧ (bitwise AND with 0x3F) masks off the lower 6 bits. ASCII characters (codepoint ≤ 127) pass through unchanged because their single-byte UTF-8 representation is identical to their ASCII value.
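The 2-byte formula can be written out and checked against the browser's own encoder. `encode2Byte` is an illustrative helper, valid only for codepoints in U+0080 to U+07FF:

```javascript
// Byte 1 = 0xC0 | (cp >> 6): top 5 payload bits under the 110 prefix.
// Byte 2 = 0x80 | (cp & 0x3F): low 6 payload bits under the 10 prefix.
function encode2Byte(cp) {
  return [0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)];
}

const cp = "¢".codePointAt(0); // U+00A2
console.log(encode2Byte(cp).map(b => b.toString(16).toUpperCase())); // ["C2", "A2"]

// Matches the native encoder's output for the same character:
console.log([...new TextEncoder().encode("¢")]); // [194, 162] = C2 A2
```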

Reference Data

| Character | ASCII Dec | Unicode Codepoint | UTF-8 Bytes (Hex) | UTF-8 Byte Count | Category |
|-----------|-----------|-------------------|-------------------|------------------|----------|
| A | 65 | U+0041 | 41 | 1 | Latin uppercase |
| z | 122 | U+007A | 7A | 1 | Latin lowercase |
| 0 | 48 | U+0030 | 30 | 1 | Digit |
| Space | 32 | U+0020 | 20 | 1 | Whitespace |
| ~ | 126 | U+007E | 7E | 1 | Printable (last ASCII) |
| ¢ | - | U+00A2 | C2 A2 | 2 | Currency symbol |
| £ | - | U+00A3 | C2 A3 | 2 | Currency symbol |
| € | - | U+20AC | E2 82 AC | 3 | Currency symbol |
| © | - | U+00A9 | C2 A9 | 2 | Miscellaneous symbol |
| ° | - | U+00B0 | C2 B0 | 2 | Miscellaneous symbol |
| ü | - | U+00FC | C3 BC | 2 | Latin extended |
| ñ | - | U+00F1 | C3 B1 | 2 | Latin extended |
| α | - | U+03B1 | CE B1 | 2 | Greek |
| Ω | - | U+03A9 | CE A9 | 2 | Greek |
| 世 | - | U+4E16 | E4 B8 96 | 3 | CJK Ideograph |
| А | - | U+0410 | D0 90 | 2 | Cyrillic |
| ☃ | - | U+2603 | E2 98 83 | 3 | Miscellaneous symbol |
| 😀 | - | U+1F600 | F0 9F 98 80 | 4 | Emoji (supplementary) |
| 💩 | - | U+1F4A9 | F0 9F 92 A9 | 4 | Emoji (supplementary) |
| NUL | 0 | U+0000 | 00 | 1 | Control character |
| TAB | 9 | U+0009 | 09 | 1 | Control character |
| LF | 10 | U+000A | 0A | 1 | Control character |
| CR | 13 | U+000D | 0D | 1 | Control character |
| DEL | 127 | U+007F | 7F | 1 | Control character |
| BOM | - | U+FEFF | EF BB BF | 3 | Byte Order Mark |

Frequently Asked Questions

Is my text really ASCII?

ASCII defines only codepoints U+0000 through U+007F (0 to 127). Any character with a codepoint above 127 is not ASCII. This tool flags such characters and encodes them using their proper UTF-8 multi-byte representation. If your source claims to be ASCII but contains bytes above 127, the file is likely encoded in ISO-8859-1, Windows-1252, or already UTF-8. Misidentifying the source encoding is the primary cause of mojibake in data pipelines.
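A minimal sketch of that rule (`isAscii` is a hypothetical helper, not the tool's actual check):

```javascript
// ASCII means every codepoint is <= 0x7F (127).
function isAscii(text) {
  for (const ch of text) {
    if (ch.codePointAt(0) > 0x7F) return false;
  }
  return true;
}

console.log(isAscii("hello")); // true
console.log(isAscii("naïve")); // false (ï is U+00EF)
```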
How many bytes does each character use in UTF-8?

UTF-8 is a variable-width encoding. Characters in the ASCII range (U+0000 to U+007F) use 1 byte. Latin extended characters and common diacritics (U+0080 to U+07FF) use 2 bytes. CJK ideographs, most symbols, and the rest of the Basic Multilingual Plane (U+0800 to U+FFFF) use 3 bytes. Supplementary characters, including emoji (U+10000 to U+10FFFF), use 4 bytes. The leading bits of each byte signal the total length, enabling decoders to read the stream without external length metadata.
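The length signaling works exactly as described: the lead byte alone determines the sequence length. A sketch of how a decoder reads it (illustrative only; it does not validate continuation bytes):

```javascript
// Classify a lead byte by its high-bit prefix.
function seqLengthFromLeadByte(b) {
  if ((b & 0x80) === 0x00) return 1; // 0xxxxxxx - ASCII
  if ((b & 0xE0) === 0xC0) return 2; // 110xxxxx
  if ((b & 0xF0) === 0xE0) return 3; // 1110xxxx
  if ((b & 0xF8) === 0xF0) return 4; // 11110xxx
  return 0;                          // 10xxxxxx continuation, or invalid
}

console.log(seqLengthFromLeadByte(0xE2)); // 3 (e.g. first byte of €)
```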
What is the byte order mark, and should I remove it?

The Byte Order Mark (U+FEFF) encodes to EF BB BF in UTF-8. Unlike UTF-16, UTF-8 has no byte-order ambiguity, so the BOM is unnecessary. However, some Windows applications (notably Notepad) prepend it. This can break Unix shell scripts (the shebang line no longer starts the file), CSV parsers (the first column name gains a hidden prefix), and JSON parsers (RFC 8259 forbids a BOM). Always strip it before feeding data into strict parsers.
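Stripping the BOM before a strict parser sees the bytes can be sketched as follows (`stripBom` is an illustrative helper operating on a Uint8Array):

```javascript
// Drop a leading EF BB BF (UTF-8 BOM) if present; otherwise return unchanged.
function stripBom(bytes) {
  if (bytes.length >= 3 && bytes[0] === 0xEF && bytes[1] === 0xBB && bytes[2] === 0xBF) {
    return bytes.subarray(3);
  }
  return bytes;
}

const withBom = new Uint8Array([0xEF, 0xBB, 0xBF, 0x41]);
console.log([...stripBom(withBom)]); // [65] - just "A"
```

Note that TextDecoder with default options already removes a leading UTF-8 BOM when decoding to a string; explicit stripping matters when you pass the raw bytes onward to another consumer.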
Can this tool repair mojibake?

The tool operates on the text as your browser interprets it. If you paste text that was decoded under the wrong encoding before reaching the browser, the damage is already done. The byte analysis view helps you identify suspicious patterns. For example, the sequence C3 A9 represents é in UTF-8, but if you see C3 and A9 as two separate visible characters (é), your source was UTF-8 yet was decoded as ISO-8859-1. This mis-decoding pattern (which becomes double encoding once the garbled text is re-encoded as UTF-8) is the most common encoding bug in web applications.
What is the maximum number of bytes per UTF-8 character?

The maximum is 4 bytes, covering codepoints up to U+10FFFF. The original UTF-8 specification allowed up to 6 bytes (covering 31-bit values), but RFC 3629 restricted it to match Unicode's actual range. Any sequence claiming to be 5 or 6 bytes is invalid under current standards and will be rejected by conformant decoders.
How are emoji and other supplementary characters handled?

JavaScript strings use UTF-16 internally. Characters outside the Basic Multilingual Plane (codepoint > 0xFFFF) are represented as surrogate pairs: two 16-bit code units. When this tool encodes such a character to UTF-8, it first resolves the surrogate pair to the true codepoint using codePointAt, then encodes that codepoint into 4 UTF-8 bytes. The surrogate values themselves (U+D800 to U+DFFF) are never valid in UTF-8.
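The surrogate-pair behavior is easy to see in a console:

```javascript
const s = "😀"; // U+1F600, outside the BMP

console.log(s.length);                      // 2 - UTF-16 code units, not characters
console.log(s.charCodeAt(0).toString(16));  // "d83d" - the high surrogate only
console.log(s.codePointAt(0).toString(16)); // "1f600" - the resolved codepoint
console.log(new TextEncoder().encode(s).length); // 4 - UTF-8 bytes
```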