Analyze UTF-8
Analyze UTF-8 text: inspect byte sequences, codepoints, character categories, encoding details, entropy, and visual byte maps in real time.
About
UTF-8 encodes each Unicode codepoint into 1 to 4 bytes using a variable-width scheme. A single misinterpreted byte - a truncated multibyte sequence, a stray continuation byte (0x80 - 0xBF), or an invisible BOM (U+FEFF) - can corrupt data pipelines, break parsers, or silently alter string lengths. This tool decodes every character in your input to its raw byte sequence, Unicode codepoint (U+XXXX), General Category, and block name. It also computes Shannon entropy H across the byte distribution, flags non-ASCII content, detects surrogate artifacts, and renders a visual byte-density map. The analysis is performed entirely in-browser using the native TextEncoder API with manual bitmask verification - no server round-trips, no data leaves your machine.
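The per-character breakdown described above can be approximated outside the browser. The following is a minimal Python sketch, not the tool's actual implementation: the function name `describe` and the record layout are hypothetical, and it relies on the standard `unicodedata` module, which reports General Category but has no block-name lookup, so block names are omitted.

```python
import unicodedata

def describe(text: str) -> list[dict]:
    """Return one record per codepoint: U+XXXX label, UTF-8 hex bytes, General Category."""
    rows = []
    for ch in text:
        # surrogatepass lets a lone surrogate be displayed instead of raising;
        # such a sequence is not valid UTF-8 and should be flagged by the caller.
        encoded = ch.encode("utf-8", errors="surrogatepass")
        rows.append({
            "char": ch,
            "codepoint": f"U+{ord(ch):04X}",
            "bytes": " ".join(f"{b:02X}" for b in encoded),
            "category": unicodedata.category(ch),
        })
    return rows

# Sample input: A, ñ (U+00F1), 中 (U+4E2D), 😀 (U+1F600)
for row in describe("A\u00f1\u4e2d\U0001F600"):
    print(row["codepoint"], row["bytes"], row["category"])
```

Each record covers one codepoint, so a 4-byte emoji appears as a single row with four hex bytes rather than as two surrogate halves.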
Precision matters. JavaScript's length counts UTF-16 code units, not characters: the string "a😀b😀c" reports length = 7 but contains only 5 codepoints (each emoji occupies a surrogate pair) encoded in 11 UTF-8 bytes. This tool exposes that discrepancy. It handles edge cases including astral plane characters (emoji, CJK Extension B), zero-width joiners, combining diacritical marks, and right-to-left overrides. Limitation: General Category labels are derived from a built-in subset covering the most common ranges, so characters from less common scripts may show as "Unknown Category".
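To see how the three length measures diverge, here is a small Python sketch. The helper name `length_report` is hypothetical; the UTF-16 code-unit count is computed by encoding to UTF-16 and halving the byte count, which is assumed to match what JavaScript's `length` property would report for the same string.

```python
def length_report(text: str) -> dict:
    """Compare UTF-16 code units, codepoints, and UTF-8 bytes for one string."""
    return {
        # Two bytes per UTF-16 code unit; astral characters take two units.
        "utf16_units": len(text.encode("utf-16-le")) // 2,
        # Python's len() counts codepoints directly.
        "codepoints": len(text),
        "utf8_bytes": len(text.encode("utf-8")),
    }

# Three ASCII letters separated by two emoji (U+1F600).
print(length_report("a\U0001F600b\U0001F600c"))
# {'utf16_units': 7, 'codepoints': 5, 'utf8_bytes': 11}
```

The code-unit count can never be smaller than the codepoint count, because every codepoint needs at least one UTF-16 unit.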
Formulas
UTF-8 encodes a codepoint U into a byte sequence by partitioning the codepoint's binary representation across template bytes. The lead byte signals sequence length via high-bit prefix; continuation bytes carry 6 payload bits each, masked as 10xxxxxx.
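The bit partitioning can be made concrete with a hand-written encoder. This is an illustrative Python sketch (the function name `encode_codepoint` is hypothetical), intended only to show the masks and shifts; in practice the built-in `str.encode("utf-8")` does the same job.

```python
def encode_codepoint(cp: int) -> bytes:
    """Encode one Unicode codepoint to UTF-8 using explicit shifts and masks."""
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError(f"not an encodable scalar value: {cp:#x}")
    if cp <= 0x7F:                      # 0xxxxxxx
        return bytes([cp])
    if cp <= 0x7FF:                     # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6),
                      0x80 | (cp & 0x3F)])
    if cp <= 0xFFFF:                    # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    return bytes([0xF0 | (cp >> 18),    # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])

# Cross-check the manual encoder against Python's built-in codec.
for cp in (0x41, 0xF1, 0x4E2D, 0x1F600):
    assert encode_codepoint(cp) == chr(cp).encode("utf-8")
    print(f"U+{cp:04X} ->", encode_codepoint(cp).hex(" ").upper())
```

Each continuation byte is built by masking 6 bits with `0x3F` and setting the `10` prefix with `0x80`, which is exactly the `10xxxxxx` template.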
Shannon entropy of the byte stream measures information density:
$$H = -\sum_{i=0}^{255} p(i) \log_2 p(i)$$
Where p(i) = frequency of byte value i divided by total byte count; byte values that never occur (p(i) = 0) contribute nothing to the sum. Maximum entropy for a byte stream is 8 bits per byte (perfectly uniform distribution over all 256 values). Plain English text typically yields H ≈ 4.0 - 5.0 bits per byte. Compressed or encrypted data approaches 8.0.
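As a rough illustration of the entropy calculation, the sketch below assumes the standard definition H = -Σ p(i) log2 p(i) taken over the byte values that actually occur. The function name `byte_entropy` is hypothetical and not part of the tool.

```python
import math
from collections import Counter

def byte_entropy(data: bytes) -> float:
    """Shannon entropy of a byte string, in bits per byte (0.0 for empty input)."""
    if not data:
        return 0.0
    total = len(data)
    counts = Counter(data)  # maps byte value -> number of occurrences
    # Only byte values that occur are summed, so log2(0) never arises.
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

print(byte_entropy(b"aaaa"))              # a single repeated byte carries no information
print(byte_entropy(bytes(range(256))))    # every byte value exactly once: the maximum
print(byte_entropy("The quick brown fox jumps over the lazy dog".encode("utf-8")))
```

Note that very short inputs give misleadingly low values, since a string of n bytes can never exceed log2(n) bits per byte.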
Reference Data
| Byte Pattern (Binary) | Byte Count | Codepoint Range | Bits for Codepoint | Example Char | Hex Bytes |
|---|---|---|---|---|---|
| 0xxxxxxx | 1 | U+0000 - U+007F | 7 | A | 0x41 |
| 110xxxxx 10xxxxxx | 2 | U+0080 - U+07FF | 11 | ñ | 0xC3 0xB1 |
| 1110xxxx 10xxxxxx 10xxxxxx | 3 | U+0800 - U+FFFF (excluding surrogates U+D800 - U+DFFF) | 16 | 中 | 0xE4 0xB8 0xAD |
| 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx | 4 | U+10000 - U+10FFFF | 21 | 😀 | 0xF0 0x9F 0x98 0x80 |

Common Unicode General Categories

| Category | Description (Examples) |
|---|---|
| Lu | Uppercase Letter (A, B, Ñ, Ω) |
| Ll | Lowercase Letter (a, b, ñ, ω) |
| Lt | Titlecase Letter (Dž, Lj, Nj) |
| Lm | Modifier Letter (ʰ, ˡ, ˢ) |
| Lo | Other Letter (Chinese, Arabic, Hebrew chars) |
| Mn | Nonspacing Mark (combining diacritics: ̈ ̃ ̂) |
| Nd | Decimal Digit Number (0-9, ٠-٩, ०-९) |
| Nl | Letter Number (Roman numerals Ⅰ, Ⅱ, Ⅲ) |
| Pc | Connector Punctuation (underscore _) |
| Pd | Dash Punctuation (hyphen-minus U+002D, en dash U+2013, em dash U+2014) |
| Ps | Open Punctuation ( (, [, { ) |
| Pe | Close Punctuation ( ), ], } ) |
| Sm | Math Symbol (+, =, <, >, ~) |
| Sc | Currency Symbol ($, €, £, ¥) |
| Sk | Modifier Symbol (^, `, ¨, ¯) |
| So | Other Symbol (©, ®, emoji) |
| Zs | Space Separator (space, non-breaking space) |
| Zl | Line Separator (U+2028) |
| Zp | Paragraph Separator (U+2029) |
| Cc | Control Character (NULL, TAB, LF, CR) |
| Cf | Format Character (BOM, ZWJ, ZWNJ, RLO) |

Notable Invisible / Hazardous Characters

| Codepoint | Name and Risk |
|---|---|
| U+FEFF | Byte Order Mark (BOM) - usually invisible; a leading BOM can break CSV and JSON parsers |
| U+200B | Zero Width Space - invisible word break |
| U+200C | Zero Width Non-Joiner - affects ligature rendering |
| U+200D | Zero Width Joiner - used in emoji sequences |
| U+202A - U+202E | Bidi embedding and override controls - can disguise file extensions (security risk) |
| U+FFFD | Replacement Character - indicates an earlier decoding failure |
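
A scan for the characters listed above might look like the following Python sketch. The `HAZARDOUS` mapping and the `find_hazards` helper are hypothetical names introduced for illustration; the list covers only the codepoints in this table and is not an exhaustive catalogue of invisible characters.

```python
import unicodedata

# Codepoints taken from the table above; labels are informal descriptions.
HAZARDOUS = {
    0xFEFF: "byte order mark",
    0x200B: "zero width space",
    0x200C: "zero width non-joiner",
    0x200D: "zero width joiner",
    0xFFFD: "replacement character",
}
# U+202A through U+202E inclusive: bidi embedding and override controls.
HAZARDOUS.update({cp: "bidi control" for cp in range(0x202A, 0x202F)})

def find_hazards(text: str) -> list[tuple[int, str, str, str]]:
    """Return (index, U+XXXX, General Category, label) for each flagged character."""
    hits = []
    for index, ch in enumerate(text):
        label = HAZARDOUS.get(ord(ch))
        if label is not None:
            hits.append((index, f"U+{ord(ch):04X}", unicodedata.category(ch), label))
    return hits

# A leading BOM plus a right-to-left override hiding the real extension.
for hit in find_hazards("\ufeffinvoice\u202egpj.exe"):
    print(hit)
```

Indices here are codepoint offsets into the Python string, not byte offsets or UTF-16 code-unit offsets, so they may differ from positions reported by JavaScript tooling when astral characters precede the hit.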