Category Text Analysis

About

UTF-8 encodes each Unicode codepoint into 1 to 4 bytes using a variable-width scheme. A single misinterpreted byte - a truncated multibyte sequence, a stray continuation byte (0x80 - 0xBF), or an invisible BOM (U+FEFF) - can corrupt data pipelines, break parsers, or silently alter string lengths. This tool decodes every character in your input to its raw byte sequence, Unicode codepoint (U+XXXX), General Category, and block name. It also computes Shannon entropy H across the byte distribution, flags non-ASCII content, detects surrogate artifacts, and renders a visual byte-density map. The analysis is performed entirely in-browser using the native TextEncoder API with manual bitmask verification - no server round-trips, no data leaves your machine.

Precision matters. JavaScript's length counts UTF-16 code units, not characters: "日本語😀😀" reports length = 7 but contains only 5 codepoints (each emoji is stored as a surrogate pair), encoded in 17 UTF-8 bytes. This tool exposes that discrepancy. It handles edge cases including astral plane characters (emoji, CJK Extension B), zero-width joiners, combining diacritical marks, and right-to-left overrides. Limitation: General Category labels are derived from a built-in subset covering the most common ranges - exotic scripts may show as "Unknown Category."
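The three counts can be compared side by side. The sketch below is only an illustration, not the tool's actual source; the helper name measureString is hypothetical.

```javascript
// Hypothetical helper: reports the three different "lengths" of one string.
function measureString(str) {
  return {
    utf16Units: str.length,                          // what String.prototype.length counts
    codepoints: [...str].length,                     // string iteration walks codepoints
    utf8Bytes: new TextEncoder().encode(str).length, // size once encoded as UTF-8
  };
}

console.log(measureString("日本語😀😀"));
// { utf16Units: 7, codepoints: 5, utf8Bytes: 17 }
```

Each of the three CJK characters takes 1 code unit and 3 bytes; each emoji takes 2 code units and 4 bytes.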

Formulas

UTF-8 encodes a codepoint U into a byte sequence by partitioning the codepoint's binary representation across template bytes. The lead byte signals sequence length via high-bit prefix; continuation bytes carry 6 payload bits each, masked as 10xxxxxx.

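$$
\begin{cases}
\mathtt{0}\,x_{6} \ldots x_{0} & \text{if } U \le \mathtt{0x7F} \quad \text{(1 byte, 7 bits)} \\
\mathtt{110}\,x_{10} \ldots x_{6} \;\; \mathtt{10}\,x_{5} \ldots x_{0} & \text{if } U \le \mathtt{0x7FF} \quad \text{(2 bytes, 11 bits)} \\
\mathtt{1110}\,x_{15} \ldots x_{12} \;\; \mathtt{10}\,x_{11} \ldots x_{6} \;\; \mathtt{10}\,x_{5} \ldots x_{0} & \text{if } U \le \mathtt{0xFFFF} \quad \text{(3 bytes, 16 bits)} \\
\mathtt{11110}\,x_{20} \ldots x_{18} \;\; \mathtt{10}\,x_{17} \ldots x_{12} \;\; \mathtt{10}\,x_{11} \ldots x_{6} \;\; \mathtt{10}\,x_{5} \ldots x_{0} & \text{if } U \le \mathtt{0x10FFFF} \quad \text{(4 bytes, 21 bits)}
\end{cases}
$$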
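The bit templates translate almost line for line into shifts and masks. The following is a minimal sketch, assuming the input is a single codepoint number; encodeCodePoint is a hypothetical name, and the surrogate check is an addition that the templates themselves do not spell out.

```javascript
// Hypothetical helper: encodes one codepoint using the UTF-8 bit templates.
function encodeCodePoint(cp) {
  const isSurrogate = cp >= 0xD800 && cp <= 0xDFFF;
  if (!Number.isInteger(cp) || cp < 0 || cp > 0x10FFFF || isSurrogate) {
    throw new RangeError("not a Unicode scalar value: " + cp);
  }
  if (cp <= 0x7F) return [cp];                                   // 0xxxxxxx
  if (cp <= 0x7FF) {
    return [0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)];               // 110xxxxx 10xxxxxx
  }
  if (cp <= 0xFFFF) {
    return [0xE0 | (cp >> 12),                                   // 1110xxxx
            0x80 | ((cp >> 6) & 0x3F),                           // 10xxxxxx
            0x80 | (cp & 0x3F)];                                 // 10xxxxxx
  }
  return [0xF0 | (cp >> 18),                                     // 11110xxx
          0x80 | ((cp >> 12) & 0x3F),
          0x80 | ((cp >> 6) & 0x3F),
          0x80 | (cp & 0x3F)];
}

// Cross-check the manual result against the native encoder.
const manual = encodeCodePoint(0x4E2D);                           // 中
const native = Array.from(new TextEncoder().encode("中"));
console.log(manual.map(b => b.toString(16)), native.map(b => b.toString(16)));
// [ 'e4', 'b8', 'ad' ] [ 'e4', 'b8', 'ad' ]
```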

Shannon entropy of the byte stream measures information density:

$$
H = -\sum_{i=0}^{255} p(i) \cdot \log_2 p(i)
$$

Where p(i) = frequency of byte value i divided by total byte count. Byte values that never occur (p(i) = 0) contribute nothing to the sum. Maximum entropy for a byte stream is 8 bits per byte (perfectly uniform distribution). Plain English text typically yields H ≈ 4.0 - 5.0 bits. Compressed or encrypted data approaches 8.0.
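The entropy formula can be computed with a 256-slot histogram. This is a sketch under the assumption that the input is a Uint8Array (or any iterable of byte values with a length); shannonEntropy is a hypothetical name, and returning 0 for empty input is a convention chosen here.

```javascript
// Hypothetical helper: Shannon entropy in bits per byte.
function shannonEntropy(bytes) {
  if (bytes.length === 0) return 0;            // convention chosen for empty input
  const counts = new Array(256).fill(0);
  for (const b of bytes) counts[b]++;          // histogram of byte values

  let h = 0;
  for (const c of counts) {
    if (c === 0) continue;                     // p(i) = 0 contributes nothing
    const p = c / bytes.length;
    h -= p * Math.log2(p);
  }
  return h;
}

const enc = new TextEncoder();
console.log(shannonEntropy(enc.encode("aaaaaaa"))); // 0
console.log(shannonEntropy(enc.encode("abcd")));    // 2
```

Four equally frequent byte values give exactly 2 bits; a stream with all 256 values equally frequent reaches the maximum of 8.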

Reference Data

Byte Pattern (Binary)	Byte Count	Codepoint Range	Bits for Codepoint	Example Char	Hex Bytes
0xxxxxxx	1	U+0000 - U+007F	7	A	0x41
110xxxxx 10xxxxxx	2	U+0080 - U+07FF	11	ñ	0xC3 0xB1
1110xxxx 10xxxxxx 10xxxxxx	3	U+0800 - U+FFFF	16	中	0xE4 0xB8 0xAD
11110xxx 10xxxxxx 10xxxxxx 10xxxxxx	4	U+10000 - U+10FFFF	21	😀	0xF0 0x9F 0x98 0x80
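Reading the table in the other direction, the high bits of a byte reveal its role in a sequence. The sketch below uses a hypothetical sequenceLength helper; it checks only the prefix pattern and, as an acknowledged simplification, does not reject lead bytes that can never appear in valid UTF-8 (0xC0, 0xC1, 0xF5 and above).

```javascript
// Hypothetical helper: sequence length announced by a lead byte,
// or 0 when the byte is a continuation byte or has no valid prefix.
function sequenceLength(byte) {
  if ((byte & 0x80) === 0x00) return 1; // 0xxxxxxx
  if ((byte & 0xE0) === 0xC0) return 2; // 110xxxxx
  if ((byte & 0xF0) === 0xE0) return 3; // 1110xxxx
  if ((byte & 0xF8) === 0xF0) return 4; // 11110xxx
  return 0;                             // 10xxxxxx or 11111xxx
}

for (const b of new TextEncoder().encode("Añ中😀")) {
  console.log("0x" + b.toString(16).toUpperCase(), sequenceLength(b));
}
```

Each lead byte prints the length of its sequence; every continuation byte (0x80 - 0xBF) prints 0.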
Common Unicode General Categories
Lu		Uppercase Letter (A, B, Ñ, Ω)
Ll		Lowercase Letter (a, b, ñ, ω)
Lt		Titlecase Letter (ǅ, ǈ, ǋ)
Lm		Modifier Letter (ʰ, ˡ, ˢ)
Lo		Other Letter (Chinese, Arabic, Hebrew chars)
Mn		Nonspacing Mark (combining diacritics: ̈ ̃ ̂)
Nd		Decimal Digit Number (0-9, ٠-٩, ०-९)
Nl		Letter Number (Roman numerals Ⅰ, Ⅱ, Ⅲ)
Pc		Connector Punctuation (underscore _)
Pd		Dash Punctuation (hyphen-minus U+002D, en dash U+2013, em dash U+2014)
Ps		Open Punctuation ( (, [, { )
Pe		Close Punctuation ( ), ], } )
Sm		Math Symbol (+, =, <, >, ~)
Sc		Currency Symbol ($, €, £, ¥)
Sk		Modifier Symbol (^, `, ¨, ¯)
So		Other Symbol (©, ®, emoji)
Zs		Space Separator (space, non-breaking space)
Zl		Line Separator (U+2028)
Zp		Paragraph Separator (U+2029)
Cc		Control Character (NULL, TAB, LF, CR)
Cf		Format Character (BOM, ZWJ, ZWNJ, RLO)
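For comparison, JavaScript regular expressions can test General Category membership directly through Unicode property escapes (\p{...} with the u flag). The sketch below is an alternative to an embedded lookup table, not a description of how this tool works; generalCategory is a hypothetical name, and it only knows the categories listed in the table above.

```javascript
// Categories from the table above; anything else is reported as unknown.
const CATEGORIES = ["Lu", "Ll", "Lt", "Lm", "Lo", "Mn", "Nd", "Nl", "Pc", "Pd", "Ps",
                    "Pe", "Sm", "Sc", "Sk", "So", "Zs", "Zl", "Zp", "Cc", "Cf"];

// One anchored matcher per category, e.g. /^\p{Lu}$/u.
const MATCHERS = CATEGORIES.map(name => [name, new RegExp(`^\\p{${name}}$`, "u")]);

// Hypothetical helper: expects a string holding exactly one codepoint.
function generalCategory(ch) {
  for (const [name, re] of MATCHERS) {
    if (re.test(ch)) return name;
  }
  return "Unknown Category";
}

console.log(generalCategory("Ñ"));      // Lu
console.log(generalCategory("\u200D")); // Cf (zero width joiner)
console.log(generalCategory("!"));      // Unknown Category (Po is not in the list)
```

The results depend on the Unicode version shipped with the JavaScript engine, so very recently assigned characters may not match any category.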
Notable Invisible / Hazardous Characters
U+FEFF		BOM (Byte Order Mark) - often invisible, breaks CSV/JSON
U+200B		Zero Width Space - invisible word break
U+200C		Zero Width Non-Joiner - affects ligature rendering
U+200D		Zero Width Joiner - used in emoji sequences
U+202A - U+202E		Bidi embedding and override controls - U+202E (RLO) can disguise file extensions (security risk)
U+FFFD		Replacement Character - indicates decoding failure
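A scan for the characters in this table takes only a few lines. This is a minimal sketch; findHazards and the label strings are hypothetical, and the reported index is a UTF-16 offset into the string.

```javascript
// Single codepoints from the table above, with illustrative labels.
const HAZARDS = new Map([
  [0xFEFF, "BOM / zero width no-break space"],
  [0x200B, "zero width space"],
  [0x200C, "zero width non-joiner"],
  [0x200D, "zero width joiner"],
  [0xFFFD, "replacement character"],
]);

// Hypothetical helper: lists every hazardous codepoint with its position.
function findHazards(str) {
  const hits = [];
  let index = 0;                                   // UTF-16 offset of the current codepoint
  for (const ch of str) {
    const cp = ch.codePointAt(0);
    const isBidi = cp >= 0x202A && cp <= 0x202E;   // embedding and override controls
    const label = HAZARDS.get(cp) ?? (isBidi ? "bidi control" : undefined);
    if (label) {
      const codepoint = "U+" + cp.toString(16).toUpperCase().padStart(4, "0");
      hits.push({ index, codepoint, label });
    }
    index += ch.length;                            // 2 for astral characters
  }
  return hits;
}

console.log(findHazards("invoice\u202Efdp.exe"));
// [ { index: 7, codepoint: 'U+202E', label: 'bidi control' } ]
```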

Frequently Asked Questions

JavaScript strings use UTF-16 internally. Characters outside the Basic Multilingual Plane (above U+FFFF) - such as emoji or rare CJK - are stored as surrogate pairs, and each pair occupies 2 UTF-16 code units. So "😀".length returns 2, not 1. This tool counts actual codepoints using the spread operator [...str].length and shows you where surrogate pairs occur alongside the true UTF-8 byte count.
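Locating surrogate pairs means walking the string one UTF-16 code unit at a time. The sketch below uses a hypothetical findSurrogates helper that also reports lone surrogates, which usually indicate a string that was cut in the middle of a pair.

```javascript
// Hypothetical helper: UTF-16 offsets of well-formed pairs and of lone surrogates.
function findSurrogates(str) {
  const pairs = [];
  const lone = [];
  for (let i = 0; i < str.length; i++) {
    const unit = str.charCodeAt(i);
    const isHigh = unit >= 0xD800 && unit <= 0xDBFF;
    const isLow = unit >= 0xDC00 && unit <= 0xDFFF;
    if (isHigh && i + 1 < str.length) {
      const next = str.charCodeAt(i + 1);
      if (next >= 0xDC00 && next <= 0xDFFF) {
        pairs.push(i);   // high surrogate followed by low surrogate
        i++;             // skip the low half
        continue;
      }
    }
    if (isHigh || isLow) lone.push(i);
  }
  return { pairs, lone };
}

console.log("😀".length, [..."😀"].length); // 2 1
console.log(findSurrogates("a😀b"));        // { pairs: [ 1 ], lone: [] }
console.log(findSurrogates("a\uD83D"));     // { pairs: [], lone: [ 1 ] }
```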

The BOM (U+FEFF, encoded as 0xEF 0xBB 0xBF in UTF-8) is a zero-width no-break space placed at the start of a file. While optional in UTF-8, many Windows editors insert it by default. It can corrupt JSON parsing (invalid first character), break shell scripts (unexpected bytes before #!), and cause CSV column-header mismatches. This analyzer flags BOM presence in the first 3 bytes and warns you.
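The check on the first three bytes, and the JSON failure it guards against, can be reproduced as follows. This is a sketch only; hasUtf8Bom and stripBom are hypothetical names.

```javascript
// Hypothetical helper: true when the byte stream starts with EF BB BF.
function hasUtf8Bom(bytes) {
  return bytes.length >= 3 &&
         bytes[0] === 0xEF && bytes[1] === 0xBB && bytes[2] === 0xBF;
}

// Hypothetical helper: drops a leading U+FEFF from an already decoded string.
function stripBom(str) {
  return str.charCodeAt(0) === 0xFEFF ? str.slice(1) : str;
}

const raw = "\uFEFF" + '{"ok":true}';
console.log(hasUtf8Bom(new TextEncoder().encode(raw))); // true

try {
  JSON.parse(raw);                       // U+FEFF is not JSON whitespace
} catch (err) {
  console.log(err.name);                 // SyntaxError
}
console.log(JSON.parse(stripBom(raw)).ok); // true
```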

Shannon entropy H measures bits of information per byte. English prose typically scores 4.0-5.0 bits. Highly repetitive text (e.g., 'aaaaaaa') approaches 0. Encrypted or compressed data approaches the theoretical maximum of 8.0 bits. An unexpectedly high entropy in supposedly plain text could indicate embedded binary data or encoding corruption. An unexpectedly low value may suggest excessive padding or placeholder content.

The TextEncoder API used by this tool always produces valid, shortest-form UTF-8 by specification. However, if you paste text that originally contained overlong sequences (e.g., encoding U+002F as 0xC0 0xAF instead of 0x2F), those were already decoded by the browser into the correct codepoint before reaching this tool. The analyzer will show the correct shortest encoding. For raw byte-level inspection of potentially malformed data, you would need a hex editor operating on the original file bytes.
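When the original bytes are still available (for example from a file read as an ArrayBuffer, which is an assumption outside what this tool does), strict decoding shows whether they were valid UTF-8 before any repair took place. The sketch relies on the standard TextDecoder option fatal: true; isValidUtf8 is a hypothetical name.

```javascript
// Hypothetical helper: true when the bytes decode as UTF-8 without any error.
function isValidUtf8(bytes) {
  try {
    new TextDecoder("utf-8", { fatal: true }).decode(bytes);
    return true;
  } catch {
    return false;          // fatal mode throws on malformed input
  }
}

console.log(isValidUtf8(Uint8Array.of(0x2F)));       // true  (shortest form of U+002F)
console.log(isValidUtf8(Uint8Array.of(0xC0, 0xAF))); // false (overlong form)
console.log(isValidUtf8(Uint8Array.of(0xE4, 0xB8))); // false (truncated 3-byte sequence)

// The default, non-fatal decoder substitutes U+FFFD instead of throwing,
// which is why pasted text never carries the malformed bytes themselves.
const repaired = new TextDecoder().decode(Uint8Array.of(0xC0, 0xAF));
console.log(repaired.includes("\uFFFD"));            // true
```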

A single visual character (grapheme cluster) can consist of multiple codepoints - for example, "é" can be U+0065 (e) + U+0301 (combining acute accent), totaling 3 UTF-8 bytes across 2 codepoints but appearing as 1 glyph. This tool analyzes at the codepoint level, so it will show both constituent codepoints separately, along with their individual byte sequences and General Category labels (Ll for the base letter, Mn for the combining mark).
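The codepoint-level view of a combining sequence can be sketched like this. describeCodepoints is a hypothetical helper, and the normalize("NFC") call is included only to contrast the decomposed form with its precomposed equivalent.

```javascript
// Hypothetical helper: one entry per codepoint with its UTF-8 bytes in hex.
function describeCodepoints(str) {
  const encoder = new TextEncoder();
  const hex = (n, width) => n.toString(16).toUpperCase().padStart(width, "0");
  return [...str].map(ch => ({
    codepoint: "U+" + hex(ch.codePointAt(0), 4),
    bytes: Array.from(encoder.encode(ch), b => hex(b, 2)).join(" "),
  }));
}

const decomposed = "e\u0301";                  // e + combining acute accent
console.log(describeCodepoints(decomposed));
// [ { codepoint: 'U+0065', bytes: '65' }, { codepoint: 'U+0301', bytes: 'CC 81' } ]

const composed = decomposed.normalize("NFC");  // single codepoint U+00E9
console.log(describeCodepoints(composed));
// [ { codepoint: 'U+00E9', bytes: 'C3 A9' } ]
```

Both strings render as the same glyph, yet the decomposed form is 2 codepoints and 3 bytes while the composed form is 1 codepoint and 2 bytes.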

This tool embeds a curated subset of Unicode General Category ranges covering Latin, Greek, Cyrillic, Arabic, CJK, Hangul, Devanagari, common symbols, and emoji. Characters from extremely rare scripts (e.g., Cuneiform, Egyptian Hieroglyphs, Tangut) may fall outside the embedded lookup table and display as "Unknown Category". The byte-level analysis and codepoint identification remain fully accurate regardless.