Category Text Analysis

Paste or type text to analyze. Supports all Unicode planes, including emoji, combining marks, and control characters.

Character Breakdown

#	Char	Code Point	Name	Block	Category	Script	UTF-8	UTF-16	Flags

About

Unicode encodes 149,813 characters across 161 scripts as of version 15.1. A single visible glyph can consist of multiple code points (a base character, combining marks, variation selectors, and zero-width joiners), which makes raw string length unreliable. Confusable characters (homoglyphs), invisible characters (U+200B Zero Width Space, U+FEFF Byte Order Mark), and RTL overrides create security vulnerabilities in URLs, usernames, and source code. This tool decomposes any input into its constituent code points and reports each character's Unicode block, general category, script assignment, and exact byte representation in both UTF-8 and UTF-16. It flags invisible and potentially dangerous characters. Normalization comparison (NFC, NFD, NFKC, NFKD) reveals whether two visually identical strings differ at the binary level, a common source of bugs in string comparison, database lookups, and authentication systems.

Formulas

UTF-8 encodes a code point U into a variable number of bytes based on its magnitude:

1 byte: 0xxxxxxx if U ≤ 0x7F (127)
2 bytes: 110xxxxx 10xxxxxx if U ≤ 0x7FF (2,047)
3 bytes: 1110xxxx 10xxxxxx 10xxxxxx if U ≤ 0xFFFF (65,535)
4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx if U ≤ 0x10FFFF (1,114,111)

Where U is the Unicode code point value. UTF-16 uses 2 bytes for code points in the Basic Multilingual Plane (U ≤ 0xFFFF) and 4 bytes (a surrogate pair) for supplementary planes. The surrogate pair is computed as:

U′ = U − 0x10000
high = 0xD800 + (U′ >> 10)
low = 0xDC00 + (U′ & 0x3FF)

Where high ∈ [0xD800, 0xDBFF] and low ∈ [0xDC00, 0xDFFF].
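The two encodings above can be sketched in a few lines of JavaScript. This is a minimal illustration, not the tool's actual source: the function names utf8Bytes and utf16Units and the hex helper are hypothetical, and the input is assumed to be a single code point given as a number.

```javascript
// Hypothetical helpers for illustration; not taken from the tool's source.

// Reject values that are not Unicode scalar values (out of range or surrogates).
function assertScalar(u) {
  if (!Number.isInteger(u) || u < 0 || u > 0x10FFFF || (u >= 0xD800 && u <= 0xDFFF)) {
    throw new RangeError("not a Unicode scalar value: " + u);
  }
}

// Encode one code point as UTF-8 bytes, following the four ranges listed above.
function utf8Bytes(u) {
  assertScalar(u);
  if (u <= 0x7F) return [u];
  if (u <= 0x7FF) return [0xC0 | (u >> 6), 0x80 | (u & 0x3F)];
  if (u <= 0xFFFF) {
    return [0xE0 | (u >> 12), 0x80 | ((u >> 6) & 0x3F), 0x80 | (u & 0x3F)];
  }
  return [
    0xF0 | (u >> 18),
    0x80 | ((u >> 12) & 0x3F),
    0x80 | ((u >> 6) & 0x3F),
    0x80 | (u & 0x3F),
  ];
}

// Encode one code point as UTF-16 code units: one unit in the BMP,
// otherwise a high/low surrogate pair.
function utf16Units(u) {
  assertScalar(u);
  if (u <= 0xFFFF) return [u];
  const uPrime = u - 0x10000;
  return [0xD800 + (uPrime >> 10), 0xDC00 + (uPrime & 0x3FF)];
}

// Format numbers as space-separated uppercase hex.
const hex = (values) => values.map((v) => v.toString(16).toUpperCase()).join(" ");

console.log(hex(utf8Bytes(0x1F389)));  // F0 9F 8E 89
console.log(hex(utf16Units(0x1F389))); // D83C DF89
```

In practice the built-in TextEncoder produces the same UTF-8 bytes; the manual version is shown only to make the bit layout visible.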

Reference Data

Unicode Block	Range	Characters	Script	Common Use
Basic Latin	U+0000 - U+007F	128	Latin / Common	ASCII: English letters, digits, basic punctuation
Latin-1 Supplement	U+0080 - U+00FF	128	Latin	Western European accented characters (é, ñ, ü)
General Punctuation	U+2000 - U+206F	111	Common	En/em dashes, ellipsis, invisible formatters
Cyrillic	U+0400 - U+04FF	256	Cyrillic	Russian, Ukrainian, Bulgarian alphabets
Arabic	U+0600 - U+06FF	256	Arabic	Arabic script, RTL text
CJK Unified Ideographs	U+4E00 - U+9FFF	20,992	Han	Chinese, Japanese Kanji, Korean Hanja
Hangul Syllables	U+AC00 - U+D7AF	11,172	Hangul	Korean precomposed syllables
Devanagari	U+0900 - U+097F	128	Devanagari	Hindi, Sanskrit, Marathi
Emoticons	U+1F600 - U+1F64F	80	Common	Smiley faces, gesture emoji
Mathematical Operators	U+2200 - U+22FF	256	Common	∀, ∃, ∈, ∑, ∏, √, ∞
Box Drawing	U+2500 - U+257F	128	Common	Terminal UI borders: ─, │, ┌, ┐
Greek and Coptic	U+0370 - U+03FF	135	Greek	α, β, γ, δ - math and science notation
Currency Symbols	U+20A0 - U+20CF	33	Common	₹, ₿, €, ₽ - monetary symbols
Combining Diacritical Marks	U+0300 - U+036F	112	Inherited	Accents, tildes, and marks that combine with preceding char
Specials	U+FFF0 - U+FFFF	5	Common	Replacement char U+FFFD, object replacement U+FFFC, interlinear annotation, noncharacters
Private Use Area	U+E000 - U+F8FF	6,400	Unknown	Custom glyphs (icons, ligatures in fonts)
Miscellaneous Symbols	U+2600 - U+26FF	256	Common	☀, ☎, ♠, ⚠ - common symbols
Supplementary Private Use A	U+F0000 - U+FFFFF	65,534	Unknown	Plane 15 private use
Tags	U+E0000 - U+E007F	97	Common	Emoji flag tag sequences (subdivision flags)
Halfwidth and Fullwidth Forms	U+FF00 - U+FFEF	225	Common	Ａ, ０ - CJK compatibility full-width variants

Frequently Asked Questions

JavaScript strings use UTF-16 encoding internally. Characters outside the Basic Multilingual Plane (code points above U+FFFF, such as emoji 🎉 at U+1F389) require a surrogate pair - two 16-bit code units. The .length property counts code units, not code points. Use [...str].length or the for...of iterator to count actual code points. Even this may not match visible glyphs, since combining marks (e.g., é = e + U+0301) and ZWJ sequences (e.g., 👨‍👩‍👧 = 3 emoji + 2 ZWJ) form single grapheme clusters from multiple code points.
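The three ways of counting described above can be compared side by side. The sketch below assumes a runtime that provides Intl.Segmenter (recent browsers and Node.js 16 or later); countText is a hypothetical helper name, and grapheme results can vary slightly with the runtime's Unicode data.

```javascript
// Count a string three ways: UTF-16 code units, code points, grapheme clusters.
// Assumes Intl.Segmenter is available in the runtime.
function countText(str) {
  const segmenter = new Intl.Segmenter(undefined, { granularity: "grapheme" });
  return {
    codeUnits: str.length,                         // what .length reports
    codePoints: [...str].length,                   // string iterator walks code points
    graphemes: [...segmenter.segment(str)].length, // user-perceived characters
  };
}

// Party popper emoji, U+1F389: one code point stored as a surrogate pair.
console.log(countText("\u{1F389}")); // { codeUnits: 2, codePoints: 1, graphemes: 1 }

// "e" followed by U+0301 COMBINING ACUTE ACCENT.
console.log(countText("e\u0301"));   // { codeUnits: 2, codePoints: 2, graphemes: 1 }

// Family emoji: man, ZWJ, woman, ZWJ, girl.
console.log(countText("\u{1F468}\u200D\u{1F469}\u200D\u{1F467}"));
// { codeUnits: 8, codePoints: 5, graphemes: 1 }
```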

Invisible and formatting characters such as Zero Width Space (U+200B), Zero Width Non-Joiner (U+200C), Right-to-Left Override (U+202E), and the BOM (U+FEFF) can be injected into usernames, URLs, filenames, and source code. The RTL override reverses the displayed order of the characters that follow it, so a file actually named "report" + U+202E + "cod.exe" is rendered as "reportexe.doc", hiding its real .exe extension. Homoglyph attacks substitute Latin "a" (U+0061) with Cyrillic "а" (U+0430) to create spoofed domain names. This tool flags all invisible and potentially dangerous characters with explicit warnings.
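A rough sketch of this kind of flagging is shown below. The range list is illustrative and deliberately incomplete, the names SUSPICIOUS_RANGES, findSuspicious, and mixesLatinAndCyrillic are hypothetical, and the mixed-script check is only a heuristic that assumes a runtime with Unicode property escapes in regular expressions. It is not the tool's actual detection logic.

```javascript
// Illustrative, incomplete list of invisible or direction-changing code points.
const SUSPICIOUS_RANGES = [
  [0x00AD, 0x00AD, "soft hyphen"],
  [0x200B, 0x200F, "zero-width character or directional mark"],
  [0x202A, 0x202E, "bidi embedding or override"],
  [0x2066, 0x2069, "bidi isolate"],
  [0xFEFF, 0xFEFF, "BOM / zero width no-break space"],
];

// Return one entry per suspicious code point, with its UTF-16 offset.
function findSuspicious(str) {
  const hits = [];
  let index = 0; // offset in UTF-16 code units
  for (const ch of str) {
    const cp = ch.codePointAt(0);
    for (const [lo, hi, label] of SUSPICIOUS_RANGES) {
      if (cp >= lo && cp <= hi) {
        const codePoint = "U+" + cp.toString(16).toUpperCase().padStart(4, "0");
        hits.push({ index, codePoint, label });
      }
    }
    index += ch.length;
  }
  return hits;
}

// Heuristic homoglyph check: does the string mix Latin and Cyrillic letters?
function mixesLatinAndCyrillic(str) {
  return /\p{Script=Latin}/u.test(str) && /\p{Script=Cyrillic}/u.test(str);
}

console.log(findSuspicious("report\u202Ecod.exe"));
console.log(mixesLatinAndCyrillic("p\u0430ypal")); // true: second letter is U+0430
```

A production check would more likely rely on the confusables data and restriction levels from the Unicode security reports than on a hand-written list.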

Unicode allows the same visual character to be represented differently. The letter "é" can be a single code point U+00E9 (NFC - precomposed) or two code points U+0065 + U+0301 (NFD - decomposed). These render identically but fail a strict equality comparison (===) because the underlying code units differ. Database lookups, password hashing, filename matching, and API comparisons can silently break. NFC is recommended for web content (per W3C guidance). NFKC additionally normalizes compatibility equivalents: the ligature "ﬁ" (U+FB01) becomes "fi". Always normalize before comparing or hashing.
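The difference is easy to reproduce with the built-in String.prototype.normalize method. In this small sketch, equalNormalized is a hypothetical helper name and NFC is chosen as the default only because the paragraph above recommends it.

```javascript
const precomposed = "\u00E9"; // é as a single code point
const decomposed = "e\u0301"; // e + U+0301 COMBINING ACUTE ACCENT

console.log(precomposed === decomposed); // false
console.log(precomposed.normalize("NFC") === decomposed.normalize("NFC")); // true
console.log(precomposed.normalize("NFD").length); // 2

// NFKC also folds compatibility characters such as the "fi" ligature U+FB01.
console.log("\uFB01".normalize("NFKC")); // fi

// Hypothetical helper: compare two strings after normalizing both sides.
function equalNormalized(a, b, form = "NFC") {
  return a.normalize(form) === b.normalize(form);
}

console.log(equalNormalized("caf\u00E9", "cafe\u0301")); // true
```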

The tool iterates using code-point-aware logic (for...of) and displays each individual code point. An emoji like 👨‍👩‍👧‍👦 consists of 7 code points: 4 person/family emoji connected by 3 Zero Width Joiners (U+200D). Each code point is shown separately with its role labeled. Skin tone modifiers (U+1F3FB - U+1F3FF) and variation selectors (U+FE0E text, U+FE0F emoji) are also individually identified and flagged.
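As a sketch of that per-code-point walk, the function below labels joiners, skin tone modifiers, and variation selectors using only the ranges named above. The name describeCodePoints and the role strings are hypothetical; the tool's real labels may differ.

```javascript
// Walk a string by code point and label the role of each one.
function describeCodePoints(str) {
  const rows = [];
  for (const ch of str) { // for...of iterates by code point, not code unit
    const cp = ch.codePointAt(0);
    let role = "base";
    if (cp === 0x200D) role = "zero width joiner";
    else if (cp >= 0x1F3FB && cp <= 0x1F3FF) role = "skin tone modifier";
    else if (cp === 0xFE0E) role = "variation selector (text)";
    else if (cp === 0xFE0F) role = "variation selector (emoji)";
    const codePoint = "U+" + cp.toString(16).toUpperCase().padStart(4, "0");
    rows.push({ codePoint, role });
  }
  return rows;
}

// Family emoji: man, ZWJ, woman, ZWJ, girl, ZWJ, boy (7 code points).
const family = "\u{1F468}\u200D\u{1F469}\u200D\u{1F467}\u200D\u{1F466}";
for (const row of describeCodePoints(family)) {
  console.log(row.codePoint, row.role);
}
```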

Unicode divides the entire code space (U+0000 to U+10FFFF, totaling 1,114,112 code points) into 17 planes of 65,536 code points each. Plane 0 (U+0000 - U+FFFF) is the Basic Multilingual Plane (BMP) containing most common scripts. Plane 1 is the Supplementary Multilingual Plane (SMP) containing emoji, historic scripts, and musical notation. Within each plane, code points are grouped into named blocks (e.g., "Greek and Coptic" at U+0370 - U+03FF). Blocks are contiguous ranges; a character's block indicates its neighborhood but not necessarily its script or category.
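Because each plane holds 65,536 (2^16) code points, the plane number is simply the code point shifted right by 16 bits. The sketch below shows that calculation plus a block lookup over a small hand-picked sample of the reference table; planeOf, blockOf, and SAMPLE_BLOCKS are hypothetical names, and a real implementation would load the full block list from the Unicode Character Database.

```javascript
// Plane number: 0 for the BMP, 1 for the SMP, up to 16.
function planeOf(cp) {
  if (!Number.isInteger(cp) || cp < 0 || cp > 0x10FFFF) {
    throw new RangeError("code point out of range");
  }
  return cp >> 16;
}

// Small sample of blocks taken from the reference table; not a complete list.
const SAMPLE_BLOCKS = [
  [0x0000, 0x007F, "Basic Latin"],
  [0x0370, 0x03FF, "Greek and Coptic"],
  [0x0400, 0x04FF, "Cyrillic"],
  [0x4E00, 0x9FFF, "CJK Unified Ideographs"],
  [0x1F600, 0x1F64F, "Emoticons"],
];

// Return the block name, or null if the code point is outside the sample.
function blockOf(cp) {
  for (const [start, end, name] of SAMPLE_BLOCKS) {
    if (cp >= start && cp <= end) return name;
  }
  return null;
}

console.log(planeOf(0x0041));  // 0
console.log(planeOf(0x1F600)); // 1
console.log(blockOf(0x03B1));  // Greek and Coptic
console.log(blockOf(0x1F600)); // Emoticons
```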

Missing glyphs occur when the user's system fonts do not contain a rendering for that code point. The tool correctly identifies the code point and provides its Unicode name and properties regardless of rendering. Private Use Area characters (U+E000 - U+F8FF) are intentionally undefined by Unicode and rely on custom fonts. Unassigned code points in reserved blocks will also show as missing. The tool marks these with their official status.
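One way to tell these cases apart without any font information is to test the code point's general category with Unicode property escapes. This is only a sketch under stated assumptions: assignmentStatus is a hypothetical name, the result depends on the Unicode version built into the JavaScript engine, and the Cn category covers both reserved code points and noncharacters.

```javascript
// Classify a code point by general category, independent of font coverage.
// Co = private use, Cn = unassigned (reserved code points and noncharacters).
function assignmentStatus(cp) {
  const ch = String.fromCodePoint(cp);
  if (/^\p{Co}$/u.test(ch)) return "private use";
  if (/^\p{Cn}$/u.test(ch)) return "unassigned or noncharacter";
  return "assigned";
}

console.log(assignmentStatus(0x0041));  // assigned
console.log(assignmentStatus(0xE000));  // private use
console.log(assignmentStatus(0xF0000)); // private use
console.log(assignmentStatus(0xFFFF));  // unassigned or noncharacter
```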