Supports all Unicode: Latin, CJK, Emoji, Arabic, Cyrillic, and more.

About

UTF-8 encodes each Unicode codepoint into 1 to 4 bytes using a variable-width scheme. A single misinterpreted byte - a truncated multibyte sequence, a stray continuation byte (0x80 - 0xBF), or an invisible BOM (U+FEFF) - can corrupt data pipelines, break parsers, or silently alter string lengths. This tool decodes every character in your input to its raw byte sequence, Unicode codepoint (U+XXXX), General Category, and block name. It also computes Shannon entropy H across the byte distribution, flags non-ASCII content, detects surrogate artifacts, and renders a visual byte-density map. The analysis is performed entirely in-browser using the native TextEncoder API with manual bitmask verification - no server round-trips, no data leaves your machine.

Precision matters. JavaScript's length property counts UTF-16 code units, not codepoints, so a string that reports length = 7 may contain only 5 codepoints (two of them stored as surrogate pairs) encoded in 17 UTF-8 bytes. This tool exposes that discrepancy. It handles edge cases including astral plane characters (emoji, CJK Extension B), zero-width joiners, combining diacritical marks, and right-to-left overrides. Limitation: General Category labels are derived from a built-in subset covering the most common ranges, so exotic scripts may show as "Unknown Category."
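
The gap between the three "lengths" of a string can be sketched in a few lines. This is a minimal illustration, not the tool's actual source; the function name measure is hypothetical, and it assumes an environment where TextEncoder is a global (modern browsers, recent Node.js).

```javascript
// Report the three different "lengths" of one string.
function measure(str) {
  return {
    codeUnits: str.length,                            // UTF-16 code units
    codepoints: [...str].length,                      // Unicode codepoints
    utf8Bytes: new TextEncoder().encode(str).length,  // UTF-8 bytes
  };
}

// "a" is 1 byte, the emoji is 4 bytes (and 2 code units), the CJK character is 3 bytes.
console.log(measure("a😀中")); // codeUnits 4, codepoints 3, utf8Bytes 8
```

Note that a lone surrogate in the input is encoded by TextEncoder as U+FFFD (3 bytes), so the byte count then describes the replacement character, not the original code unit.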

Formulas

UTF-8 encodes a codepoint U into a byte sequence by partitioning the codepoint's binary representation across template bytes. The lead byte signals sequence length via high-bit prefix; continuation bytes carry 6 payload bits each, masked as 10xxxxxx.

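$$
\begin{cases}
0x_6x_5\ldots x_0 & \text{if } U \le \mathtt{0x7F} \quad \text{(1 byte, 7 bits)} \\
110x_{10}\ldots x_6 \;\; 10x_5\ldots x_0 & \text{if } U \le \mathtt{0x7FF} \quad \text{(2 bytes, 11 bits)} \\
1110x_{15}\ldots x_{12} \;\; 10x_{11}\ldots x_6 \;\; 10x_5\ldots x_0 & \text{if } U \le \mathtt{0xFFFF} \quad \text{(3 bytes, 16 bits)} \\
11110x_{20}\ldots x_{18} \;\; 10x_{17}\ldots x_{12} \;\; 10x_{11}\ldots x_6 \;\; 10x_5\ldots x_0 & \text{if } U \le \mathtt{0x10FFFF} \quad \text{(4 bytes, 21 bits)}
\end{cases}
$$

Here x_k denotes bit k of U, and the first matching row applies.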
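
The bit templates translate directly into shifts and masks. The sketch below is one possible hand-rolled encoder, written for illustration only (the name encodeCodepoint is hypothetical and not taken from the tool); it can be cross-checked against the native TextEncoder.

```javascript
// Encode a single codepoint into UTF-8 bytes by filling the template bytes.
function encodeCodepoint(cp) {
  // Surrogates (U+D800..U+DFFF) and values above U+10FFFF are not encodable.
  if (!Number.isInteger(cp) || cp < 0 || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
    throw new RangeError("not a Unicode scalar value: " + cp);
  }
  if (cp <= 0x7F) return [cp];                                   // 0xxxxxxx
  if (cp <= 0x7FF) {
    return [0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)];               // 110xxxxx 10xxxxxx
  }
  if (cp <= 0xFFFF) {
    return [0xE0 | (cp >> 12),                                   // 1110xxxx
            0x80 | ((cp >> 6) & 0x3F),                           // 10xxxxxx
            0x80 | (cp & 0x3F)];                                 // 10xxxxxx
  }
  return [0xF0 | (cp >> 18),                                     // 11110xxx
          0x80 | ((cp >> 12) & 0x3F),
          0x80 | ((cp >> 6) & 0x3F),
          0x80 | (cp & 0x3F)];
}

// U+4E2D encodes to e4 b8 ad
console.log(encodeCodepoint(0x4E2D).map((b) => b.toString(16)));
```

Comparing the result with Array.from(new TextEncoder().encode(String.fromCodePoint(cp))) is a simple way to verify the bitmask logic for any codepoint.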

Shannon entropy of the byte stream measures information density:

$$
H = -\sum_{i=0}^{255} p(i) \log_2 p(i)
$$

Here p(i) is the number of occurrences of byte value i divided by the total byte count; byte values that never occur contribute nothing to the sum. Maximum entropy for a byte stream is 8 bits per byte (a perfectly uniform distribution). Plain English text typically yields H ≈ 4.0 to 5.0 bits per byte. Compressed or encrypted data approaches 8.0.
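
As a rough sketch of the calculation (entropy is the negative sum of p(i) log2 p(i) over all byte values), the function below counts byte frequencies over the UTF-8 encoding of a string. The name byteEntropy is illustrative, and returning 0 for empty input is an assumption of this sketch.

```javascript
// Shannon entropy, in bits per byte, of the UTF-8 encoding of a string.
function byteEntropy(str) {
  const bytes = new TextEncoder().encode(str);
  if (bytes.length === 0) return 0; // assumption: empty input has zero entropy

  // Histogram of the 256 possible byte values.
  const counts = new Array(256).fill(0);
  for (const b of bytes) counts[b]++;

  let h = 0;
  for (const c of counts) {
    if (c === 0) continue;          // unused byte values contribute nothing
    const p = c / bytes.length;
    h -= p * Math.log2(p);
  }
  return h;
}

console.log(byteEntropy("aaaaaaa")); // 0: a single repeated byte
console.log(byteEntropy("abcd"));    // 2: four equally likely byte values
```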

Reference Data

| Byte Pattern (Binary) | Byte Count | Codepoint Range | Bits for Codepoint | Example Char | Hex Bytes |
|---|---|---|---|---|---|
| 0xxxxxxx | 1 | U+0000 - U+007F | 7 | A | 0x41 |
| 110xxxxx 10xxxxxx | 2 | U+0080 - U+07FF | 11 | ñ | 0xC3 0xB1 |
| 1110xxxx 10xxxxxx 10xxxxxx | 3 | U+0800 - U+FFFF | 16 | 中 | 0xE4 0xB8 0xAD |
| 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx | 4 | U+10000 - U+10FFFF | 21 | 😀 | 0xF0 0x9F 0x98 0x80 |
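
The byte patterns in the table can be recognised with a handful of masks. The following is an illustrative classifier (classifyByte and its return labels are hypothetical names) that looks only at the high-bit prefix of a single byte.

```javascript
// Classify one byte by its high-bit prefix.
function classifyByte(b) {
  if ((b & 0x80) === 0x00) return "ascii";        // 0xxxxxxx
  if ((b & 0xC0) === 0x80) return "continuation"; // 10xxxxxx
  if ((b & 0xE0) === 0xC0) return "lead-2";       // 110xxxxx
  if ((b & 0xF0) === 0xE0) return "lead-3";       // 1110xxxx
  if ((b & 0xF8) === 0xF0) return "lead-4";       // 11110xxx
  return "invalid";                               // 0xF8..0xFF never appear in UTF-8
}

// Bytes of "ñ" (0xC3 0xB1): a 2-byte lead followed by one continuation byte.
console.log(Array.from(new TextEncoder().encode("ñ"), classifyByte));
```

Prefix matching alone is not full validation: 0xC0, 0xC1, and 0xF5 to 0xF7 match a lead pattern but can never occur in well-formed UTF-8, so a stricter checker would reject them as well.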

Common Unicode General Categories

| Code | Category | Examples |
|---|---|---|
| Lu | Uppercase Letter | A, B, Ñ, Ω |
| Ll | Lowercase Letter | a, b, ñ, ω |
| Lt | Titlecase Letter | Dž, Lj, Nj |
| Lm | Modifier Letter | ʰ, ˡ, ˢ |
| Lo | Other Letter | Chinese, Arabic, Hebrew characters |
| Mn | Nonspacing Mark | combining diacritics such as U+0308, U+0303, U+0302 |
| Nd | Decimal Digit Number | 0-9, ٠-٩, ०-९ |
| Nl | Letter Number | Roman numerals Ⅰ, Ⅱ, Ⅲ |
| Pc | Connector Punctuation | underscore _ |
| Pd | Dash Punctuation | hyphen-minus U+002D, en dash U+2013, em dash U+2014 |
| Ps | Open Punctuation | (, [, { |
| Pe | Close Punctuation | ), ], } |
| Sm | Math Symbol | +, =, <, >, ~ |
| Sc | Currency Symbol | $, €, £, ¥ |
| Sk | Modifier Symbol | ^, `, ¨, ¯ |
| So | Other Symbol | ©, ®, emoji |
| Zs | Space Separator | space, non-breaking space |
| Zl | Line Separator | U+2028 |
| Zp | Paragraph Separator | U+2029 |
| Cc | Control Character | NULL, TAB, LF, CR |
| Cf | Format Character | BOM, ZWJ, ZWNJ, RLO |
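
For comparison with the embedded lookup table this tool describes, the categories listed above can also be queried through the JavaScript engine's own Unicode data using regular-expression property escapes. This is an alternative sketch, not the tool's method; generalCategory and the CATEGORY_CODES list are illustrative names, and only the codes from the table are checked.

```javascript
// Category codes from the table above; anything else falls through to "Unknown Category".
const CATEGORY_CODES = ["Lu", "Ll", "Lt", "Lm", "Lo", "Mn", "Nd", "Nl", "Pc", "Pd",
                        "Ps", "Pe", "Sm", "Sc", "Sk", "So", "Zs", "Zl", "Zp", "Cc", "Cf"];

// One anchored regex per category, e.g. /^\p{Lu}$/u. The "u" flag makes the
// pattern match whole codepoints, so astral characters work too.
const CATEGORY_MATCHERS = CATEGORY_CODES.map((gc) => [gc, new RegExp(`^\\p{${gc}}$`, "u")]);

// Return the General Category code of a single-codepoint string.
function generalCategory(ch) {
  for (const [gc, re] of CATEGORY_MATCHERS) {
    if (re.test(ch)) return gc;
  }
  return "Unknown Category";
}

console.log(generalCategory("Ñ"));      // Lu
console.log(generalCategory("\u0301")); // Mn (combining acute accent)
console.log(generalCategory("\uFEFF")); // Cf (BOM)
```

The results depend on the Unicode version shipped with the engine, so very recently assigned characters may not be recognised in older runtimes.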

Notable Invisible / Hazardous Characters

| Codepoint | Name | Why it matters |
|---|---|---|
| U+FEFF | BOM (Byte Order Mark) | often invisible, breaks CSV/JSON |
| U+200B | Zero Width Space | invisible word break |
| U+200C | Zero Width Non-Joiner | affects ligature rendering |
| U+200D | Zero Width Joiner | used in emoji sequences |
| U+202A - U+202E | Bidi embeddings and overrides | can disguise file extensions (security risk) |
| U+FFFD | Replacement Character | indicates decoding failure |
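
A scan for the characters listed above could look like the sketch below. The HAZARDS table simply restates those codepoints; findHazards and the shape of its result objects are hypothetical, chosen for illustration.

```javascript
// Codepoint ranges taken from the list of invisible / hazardous characters.
const HAZARDS = [
  { from: 0xFEFF, to: 0xFEFF, name: "BOM" },
  { from: 0x200B, to: 0x200B, name: "Zero Width Space" },
  { from: 0x200C, to: 0x200C, name: "Zero Width Non-Joiner" },
  { from: 0x200D, to: 0x200D, name: "Zero Width Joiner" },
  { from: 0x202A, to: 0x202E, name: "Bidi control" },
  { from: 0xFFFD, to: 0xFFFD, name: "Replacement Character" },
];

// Return one entry per hazardous codepoint, with its codepoint index.
function findHazards(str) {
  const hits = [];
  let index = 0; // counts codepoints, not UTF-16 code units
  for (const ch of str) {
    const cp = ch.codePointAt(0);
    const hazard = HAZARDS.find((h) => cp >= h.from && cp <= h.to);
    if (hazard) {
      const label = "U+" + cp.toString(16).toUpperCase().padStart(4, "0");
      hits.push({ index, codepoint: label, name: hazard.name });
    }
    index++;
  }
  return hits;
}

// A right-to-left override hidden in a file name is reported at codepoint index 7.
console.log(findHazards("invoice\u202Efdp.exe"));
```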

Frequently Asked Questions

JavaScript strings use UTF-16 internally. Characters outside the Basic Multilingual Plane (above U+FFFF) - such as emoji or rare CJK - are stored as surrogate pairs, each consuming 2 UTF-16 code units. So "😀".length returns 2, not 1. This tool counts actual codepoints using the spread operator [...str].length and shows you where surrogate pairs occur alongside the true UTF-8 byte count.
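
To see where surrogate pairs sit in a string, one option is to walk the UTF-16 code units directly. The sketch below is illustrative (scanSurrogates is a hypothetical name); it also reports unpaired surrogates, which usually point to a string that was cut in the middle of a pair.

```javascript
// Locate surrogate pairs and lone surrogates by UTF-16 code unit index.
function scanSurrogates(str) {
  const pairs = []; // index of the high half of each well-formed pair
  const lone = [];  // index of each unpaired surrogate
  for (let i = 0; i < str.length; i++) {
    const unit = str.charCodeAt(i);
    const isHigh = unit >= 0xD800 && unit <= 0xDBFF;
    const isLow = unit >= 0xDC00 && unit <= 0xDFFF;
    if (isHigh && i + 1 < str.length) {
      const next = str.charCodeAt(i + 1);
      if (next >= 0xDC00 && next <= 0xDFFF) {
        pairs.push(i);
        i++; // skip the low half of the pair
        continue;
      }
    }
    if (isHigh || isLow) lone.push(i);
  }
  return { pairs, lone };
}

console.log(scanSurrogates("a😀b"));     // one pair starting at index 1, no lone surrogates
console.log(scanSurrogates("x\uD83D")); // no pairs, one lone surrogate at index 1
```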
The BOM (U+FEFF, encoded as 0xEF 0xBB 0xBF in UTF-8) is a zero-width no-break space placed at the start of a file. While optional in UTF-8, many Windows editors insert it by default. It can corrupt JSON parsing (invalid first character), break shell scripts (unexpected bytes before #!), and cause CSV column-header mismatches. This analyzer flags BOM presence in the first 3 bytes and warns you.
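
A BOM check on the first three bytes, and a way to drop the BOM before parsing, might look like this. It is a minimal sketch with hypothetical helper names (hasUtf8Bom, stripBom), not the analyzer's own code.

```javascript
// True if the byte array starts with the UTF-8 BOM: EF BB BF.
function hasUtf8Bom(bytes) {
  return bytes.length >= 3 && bytes[0] === 0xEF && bytes[1] === 0xBB && bytes[2] === 0xBF;
}

// Remove a leading U+FEFF from an already decoded string.
function stripBom(str) {
  return str.charCodeAt(0) === 0xFEFF ? str.slice(1) : str;
}

const withBom = "\uFEFF{\"ok\":true}";
console.log(hasUtf8Bom(new TextEncoder().encode(withBom))); // true
console.log(JSON.parse(stripBom(withBom)).ok);              // true
```

Calling JSON.parse on the string without stripping the BOM first throws a SyntaxError, because U+FEFF is not valid JSON whitespace.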
Shannon entropy H measures bits of information per byte. English prose typically scores 4.0-5.0 bits. Highly repetitive text (e.g., 'aaaaaaa') approaches 0. Encrypted or compressed data approaches the theoretical maximum of 8.0 bits. An unexpectedly high entropy in supposedly plain text could indicate embedded binary data or encoding corruption. An unexpectedly low value may suggest excessive padding or placeholder content.
The TextEncoder API used by this tool always produces valid, shortest-form UTF-8 by specification. However, if you paste text that originally contained overlong sequences (e.g., encoding U+002F as 0xC0 0xAF instead of 0x2F), those were already decoded by the browser into the correct codepoint before reaching this tool. The analyzer will show the correct shortest encoding. For raw byte-level inspection of potentially malformed data, you would need a hex editor operating on the original file bytes.
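
If the original bytes are still available (for example as a Uint8Array read from a file), a strict decoder is one way to test them programmatically. The sketch below uses the standard TextDecoder with the fatal option, which throws on malformed input instead of substituting U+FFFD; the helper name isValidUtf8 is illustrative.

```javascript
// Return true only if the bytes are well-formed UTF-8.
function isValidUtf8(bytes) {
  try {
    // fatal: true makes decode() throw a TypeError on any malformed sequence,
    // including overlong forms, truncated sequences, and stray continuation bytes.
    new TextDecoder("utf-8", { fatal: true }).decode(bytes);
    return true;
  } catch (err) {
    return false;
  }
}

console.log(isValidUtf8(Uint8Array.of(0x2F)));       // true: shortest form of U+002F
console.log(isValidUtf8(Uint8Array.of(0xC0, 0xAF))); // false: overlong encoding of U+002F
console.log(isValidUtf8(Uint8Array.of(0xE4, 0xB8))); // false: truncated 3-byte sequence
```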
A single visual character (grapheme cluster) can consist of multiple codepoints - for example, "é" can be U+0065 (e) + U+0301 (combining acute accent), totaling 3 UTF-8 bytes across 2 codepoints but appearing as 1 glyph. This tool analyzes at the codepoint level, so it will show both constituent codepoints separately, along with their individual byte sequences and General Category labels (Ll for the base letter, Mn for the combining mark).
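
The decomposed "é" example can be reproduced as follows. This is an illustrative sketch: describeCodepoints and graphemeCount are hypothetical names, and graphemeCount assumes a runtime that provides Intl.Segmenter.

```javascript
// List each codepoint of a string with its UTF-8 byte length.
function describeCodepoints(str) {
  const encoder = new TextEncoder();
  return [...str].map((ch) => ({
    codepoint: "U+" + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, "0"),
    bytes: encoder.encode(ch).length,
  }));
}

// Count user-perceived characters (grapheme clusters).
function graphemeCount(str) {
  const segmenter = new Intl.Segmenter(undefined, { granularity: "grapheme" });
  return [...segmenter.segment(str)].length;
}

const decomposed = "e\u0301";                 // e + combining acute accent
const composed = decomposed.normalize("NFC"); // single precomposed U+00E9

console.log(describeCodepoints(decomposed)); // U+0065 (1 byte), U+0301 (2 bytes)
console.log(describeCodepoints(composed));   // U+00E9 (2 bytes)
console.log(graphemeCount(decomposed));      // 1
```

Normalizing to NFC before analysis makes the two spellings of "é" compare equal, at the cost of hiding which form the input originally used.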
This tool embeds a curated subset of Unicode General Category ranges covering Latin, Greek, Cyrillic, Arabic, CJK, Hangul, Devanagari, common symbols, and emoji. Characters from extremely rare scripts (e.g., Cuneiform, Egyptian Hieroglyphs, Tangut) may fall outside the embedded lookup table and display as "Unknown Category". The byte-level analysis and codepoint identification remain fully accurate regardless.