About

ASCII encodes 128 characters in 7 bits. UTF-8 extends this to 1,112,064 valid codepoints using a variable-width scheme of 1 to 4 bytes per character. Every valid ASCII string is already valid UTF-8 because UTF-8 preserves the ASCII range (U+0000 to U+007F) as single-byte sequences. The real complexity arises when text contains characters beyond codepoint 127. A misidentified encoding produces mojibake: garbled output caused by interpreting bytes under the wrong scheme. This tool performs real encoding via the browser's native TextEncoder API, exposes the raw byte structure, and flags every non-ASCII character so you can diagnose encoding issues before they corrupt a database or break a data pipeline.
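The encoding step described above can be sketched in a few lines using the standard TextEncoder API (available in browsers and Node.js). The helper name `toUtf8Hex` is illustrative, not the tool's actual code:

```javascript
// TextEncoder always encodes to UTF-8.
const encoder = new TextEncoder();

function toUtf8Hex(text) {
  // Returns the UTF-8 bytes of `text` as uppercase hex strings.
  const bytes = encoder.encode(text); // Uint8Array
  return Array.from(bytes, b => b.toString(16).toUpperCase().padStart(2, "0"));
}

console.log(toUtf8Hex("A"));  // ["41"]                 1 byte, pure ASCII
console.log(toUtf8Hex("€"));  // ["E2", "82", "AC"]     3 bytes
console.log(toUtf8Hex("😀")); // ["F0", "9F", "98", "80"] 4 bytes
```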

Limitations: this tool operates on valid Unicode strings as represented by JavaScript's internal UTF-16. Lone surrogates and byte sequences that do not form valid UTF-8 will be replaced with the replacement character U+FFFD. If you are debugging raw binary files, examine the hex dump output rather than the decoded text. Pro tip: CSV imports fail silently on encoding mismatch. Validate your source encoding here before bulk inserts.
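The replacement behavior is easy to observe directly. This snippet (an illustration of the limitation above, not the tool's internals) encodes a lone high surrogate, which TextEncoder substitutes with U+FFFD:

```javascript
// "\uD800" is a lone high surrogate with no trailing pair member, so it
// cannot be encoded as UTF-8. TextEncoder emits U+FFFD instead,
// whose UTF-8 encoding is EF BF BD.
const bytes = new TextEncoder().encode("\uD800");
console.log([...bytes].map(b => b.toString(16).toUpperCase())); // ["EF", "BF", "BD"]
```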


Formulas

UTF-8 encodes Unicode codepoints into variable-length byte sequences. The number of bytes depends on the codepoint range:

1 byte:  0xxxxxxx                            → U+0000 to U+007F  (128 chars, pure ASCII)
2 bytes: 110xxxxx 10xxxxxx                   → U+0080 to U+07FF
3 bytes: 1110xxxx 10xxxxxx 10xxxxxx          → U+0800 to U+FFFF
4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx → U+10000 to U+10FFFF
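The four ranges above can be sketched as a small lookup function (`utf8ByteCount` is a hypothetical name for illustration; it assumes `cp` is a valid scalar value, not a lone surrogate):

```javascript
// UTF-8 byte count by codepoint range.
function utf8ByteCount(cp) {
  if (cp <= 0x7F)   return 1; // 0xxxxxxx
  if (cp <= 0x7FF)  return 2; // 110xxxxx 10xxxxxx
  if (cp <= 0xFFFF) return 3; // 1110xxxx 10xxxxxx 10xxxxxx
  return 4;                   // 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
}

console.log(utf8ByteCount("€".codePointAt(0))); // 3
```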

The encoding formula extracts the codepoint value cp and distributes its bits into the template above. For a 2-byte character:

Byte 1 = 0xC0 ∨ (cp >> 6)
Byte 2 = 0x80 ∨ (cp ∧ 0x3F)

Where cp = Unicode codepoint (integer), ∨ is bitwise OR, the >> operator performs a bitwise right shift, and ∧ (bitwise AND with 0x3F) masks off the lower 6 bits. ASCII characters (codepoint ≤ 127) pass through unchanged because their single-byte UTF-8 representation is identical to their ASCII value.
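The 2-byte formula can be written out and checked against the browser's own encoder. `encode2Byte` is an illustrative helper, valid only for codepoints in U+0080 to U+07FF:

```javascript
// Byte 1 = 0xC0 | (cp >> 6): top 5 payload bits under the 110 prefix.
// Byte 2 = 0x80 | (cp & 0x3F): low 6 payload bits under the 10 prefix.
function encode2Byte(cp) {
  return [0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)];
}

const cp = "¢".codePointAt(0); // U+00A2
console.log(encode2Byte(cp).map(b => b.toString(16).toUpperCase())); // ["C2", "A2"]

// Matches the native encoder's output for the same character:
console.log([...new TextEncoder().encode("¢")]); // [194, 162] = C2 A2
```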

Reference Data

| Character | ASCII Dec | Unicode Codepoint | UTF-8 Bytes (Hex) | UTF-8 Byte Count | Category |
|-----------|-----------|-------------------|-------------------|------------------|----------|
| A | 65 | U+0041 | 41 | 1 | Latin uppercase |
| z | 122 | U+007A | 7A | 1 | Latin lowercase |
| 0 | 48 | U+0030 | 30 | 1 | Digit |
| Space | 32 | U+0020 | 20 | 1 | Whitespace |
| ~ | 126 | U+007E | 7E | 1 | Printable (last ASCII) |
| ¢ | - | U+00A2 | C2 A2 | 2 | Currency symbol |
| £ | - | U+00A3 | C2 A3 | 2 | Currency symbol |
| € | - | U+20AC | E2 82 AC | 3 | Currency symbol |
| © | - | U+00A9 | C2 A9 | 2 | Miscellaneous symbol |
| ° | - | U+00B0 | C2 B0 | 2 | Miscellaneous symbol |
| ü | - | U+00FC | C3 BC | 2 | Latin extended |
| ñ | - | U+00F1 | C3 B1 | 2 | Latin extended |
| α | - | U+03B1 | CE B1 | 2 | Greek |
| Ω | - | U+03A9 | CE A9 | 2 | Greek |
| 世 | - | U+4E16 | E4 B8 96 | 3 | CJK Ideograph |
| А | - | U+0410 | D0 90 | 2 | Cyrillic |
| ☃ | - | U+2603 | E2 98 83 | 3 | Miscellaneous symbol |
| 😀 | - | U+1F600 | F0 9F 98 80 | 4 | Emoji (supplementary) |
| 💩 | - | U+1F4A9 | F0 9F 92 A9 | 4 | Emoji (supplementary) |
| NUL | 0 | U+0000 | 00 | 1 | Control character |
| TAB | 9 | U+0009 | 09 | 1 | Control character |
| LF | 10 | U+000A | 0A | 1 | Control character |
| CR | 13 | U+000D | 0D | 1 | Control character |
| DEL | 127 | U+007F | 7F | 1 | Control character |
| BOM | - | U+FEFF | EF BB BF | 3 | Byte Order Mark |

Frequently Asked Questions

Is my text really ASCII?

ASCII defines only codepoints U+0000 through U+007F (0 to 127). Any character with a codepoint above 127 is not ASCII. This tool flags such characters and encodes them using their proper UTF-8 multi-byte representation. If your source claims to be ASCII but contains bytes above 127, the file is likely encoded in ISO-8859-1, Windows-1252, or already UTF-8. Misidentifying the source encoding is the primary cause of mojibake in data pipelines.
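A minimal sketch of that rule (`isAscii` is a hypothetical helper, not the tool's actual check):

```javascript
// ASCII means every codepoint is <= 0x7F (127).
function isAscii(text) {
  for (const ch of text) {
    if (ch.codePointAt(0) > 0x7F) return false;
  }
  return true;
}

console.log(isAscii("hello")); // true
console.log(isAscii("naïve")); // false (ï is U+00EF)
```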
How many bytes does each character use in UTF-8?

UTF-8 is a variable-width encoding. Characters in the ASCII range (U+0000 to U+007F) use 1 byte. Latin extended characters and common diacritics (U+0080 to U+07FF) use 2 bytes. CJK ideographs, most symbols, and the rest of the Basic Multilingual Plane (U+0800 to U+FFFF) use 3 bytes. Supplementary characters, including emoji (U+10000 to U+10FFFF), use 4 bytes. The leading bits of each byte signal the total length, enabling decoders to read the stream without external length metadata.
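The length signaling works exactly as described: the lead byte alone determines the sequence length. A sketch of how a decoder reads it (illustrative only; it does not validate continuation bytes):

```javascript
// Classify a lead byte by its high-bit prefix.
function seqLengthFromLeadByte(b) {
  if ((b & 0x80) === 0x00) return 1; // 0xxxxxxx - ASCII
  if ((b & 0xE0) === 0xC0) return 2; // 110xxxxx
  if ((b & 0xF0) === 0xE0) return 3; // 1110xxxx
  if ((b & 0xF8) === 0xF0) return 4; // 11110xxx
  return 0;                          // 10xxxxxx continuation, or invalid
}

console.log(seqLengthFromLeadByte(0xE2)); // 3 (e.g. first byte of €)
```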
What is the byte order mark, and should I remove it?

The Byte Order Mark (U+FEFF) encodes to EF BB BF in UTF-8. Unlike UTF-16, UTF-8 has no byte-order ambiguity, so the BOM is unnecessary. However, some Windows applications (notably Notepad) prepend it. This can break Unix shell scripts (the shebang line no longer starts the file), CSV parsers (the first column name gains a hidden prefix), and JSON parsers (RFC 8259 forbids a BOM). Always strip it before feeding data into strict parsers.
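Stripping the BOM before a strict parser sees the bytes can be sketched as follows (`stripBom` is an illustrative helper operating on a Uint8Array):

```javascript
// Drop a leading EF BB BF (UTF-8 BOM) if present; otherwise return unchanged.
function stripBom(bytes) {
  if (bytes.length >= 3 && bytes[0] === 0xEF && bytes[1] === 0xBB && bytes[2] === 0xBF) {
    return bytes.subarray(3);
  }
  return bytes;
}

const withBom = new Uint8Array([0xEF, 0xBB, 0xBF, 0x41]);
console.log([...stripBom(withBom)]); // [65] - just "A"
```

Note that TextDecoder with default options already removes a leading UTF-8 BOM when decoding to a string; explicit stripping matters when you pass the raw bytes onward to another consumer.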
Can this tool repair mojibake?

The tool operates on the text as your browser interprets it. If you paste text that was decoded under the wrong encoding before reaching the browser, the damage is already done. The byte analysis view helps you identify suspicious patterns. For example, the sequence C3 A9 represents é in UTF-8, but if you see C3 and A9 as two separate visible characters (é), your source was UTF-8 yet was decoded as ISO-8859-1. This mis-decoding pattern (which becomes double encoding once the garbled text is re-encoded as UTF-8) is the most common encoding bug in web applications.
What is the maximum number of bytes per UTF-8 character?

The maximum is 4 bytes, covering codepoints up to U+10FFFF. The original UTF-8 specification allowed up to 6 bytes (covering 31-bit values), but RFC 3629 restricted it to match Unicode's actual range. Any sequence claiming to be 5 or 6 bytes is invalid under current standards and will be rejected by conformant decoders.
How are emoji and other supplementary characters handled?

JavaScript strings use UTF-16 internally. Characters outside the Basic Multilingual Plane (codepoint > 0xFFFF) are represented as surrogate pairs: two 16-bit code units. When this tool encodes such a character to UTF-8, it first resolves the surrogate pair to the true codepoint using codePointAt, then encodes that codepoint into 4 UTF-8 bytes. The surrogate values themselves (U+D800 to U+DFFF) are never valid in UTF-8.
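The surrogate-pair behavior is easy to see in a console:

```javascript
const s = "😀"; // U+1F600, outside the BMP

console.log(s.length);                      // 2 - UTF-16 code units, not characters
console.log(s.charCodeAt(0).toString(16));  // "d83d" - the high surrogate only
console.log(s.codePointAt(0).toString(16)); // "1f600" - the resolved codepoint
console.log(new TextEncoder().encode(s).length); // 4 - UTF-8 bytes
```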