Analyze Unicode
Analyze Unicode text character by character: view code points, UTF-8 and UTF-16 encodings, blocks, and scripts; detect invisible characters; and compare normalization forms.
About
Unicode 15.1 encodes 149,813 characters across 161 scripts. A single visible glyph can consist of multiple code points (a base character, combining marks, variation selectors, and zero-width joiners), which makes raw string length unreliable. Confusable characters (homoglyphs), invisible format characters (U+200B Zero Width Space, U+FEFF Byte Order Mark), and right-to-left overrides such as U+202E create security vulnerabilities in URLs, usernames, and source code. This tool decomposes any input into its constituent code points and reports each character's Unicode block, general category, script assignment, and exact byte representation in both UTF-8 and UTF-16. It flags invisible and potentially dangerous characters. Normalization comparison (NFC, NFD, NFKC, NFKD) reveals whether two visually identical strings differ at the binary level, a common source of bugs in string comparison, database lookups, and authentication systems.
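
The decomposition and normalization comparison described above can be sketched with Python's standard `unicodedata` module. This is a minimal illustration, not the tool's actual implementation: the function names `analyze` and `normalization_forms` and the `WATCH_LIST` set are hypothetical, the watch list is deliberately incomplete, and the input is assumed to contain no lone surrogates. The standard library does not expose block or script properties, so those columns are omitted here.

```python
import unicodedata

# Hypothetical, incomplete watch list: zero-width characters, BOM, bidi controls.
WATCH_LIST = {0x200B, 0x200C, 0x200D, 0x2060, 0xFEFF,
              0x202A, 0x202B, 0x202C, 0x202D, 0x202E}


def analyze(text):
    """Return one dict per code point in text (not per visible glyph)."""
    rows = []
    for ch in text:
        cp = ord(ch)
        category = unicodedata.category(ch)
        rows.append({
            "code_point": f"U+{cp:04X}",
            "name": unicodedata.name(ch, "<unnamed>"),
            "category": category,
            "utf8": ch.encode("utf-8").hex(" ").upper(),
            "utf16": ch.encode("utf-16-be").hex(" ").upper(),
            # Flag format characters (Cf) and anything on the watch list.
            "flagged": category == "Cf" or cp in WATCH_LIST,
        })
    return rows


def normalization_forms(text):
    """Return the text in each of the four normalization forms."""
    return {form: unicodedata.normalize(form, text)
            for form in ("NFC", "NFD", "NFKC", "NFKD")}


if __name__ == "__main__":
    for row in analyze("e\u0301\u200b"):
        print(row)
    # Precomposed and decomposed spellings differ until both are normalized.
    a, b = "\u00e9", "e\u0301"
    print(a == b, normalization_forms(a)["NFC"] == normalization_forms(b)["NFC"])
```

Comparing strings only after normalizing both sides to the same form is the usual way to avoid the lookup and authentication mismatches mentioned above.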
Formulas
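UTF-8 encodes a code point U into 1 to 4 bytes depending on its magnitude:
bytes(U) = 1 if U ≤ 0x7F
bytes(U) = 2 if 0x80 ≤ U ≤ 0x7FF
bytes(U) = 3 if 0x800 ≤ U ≤ 0xFFFF
bytes(U) = 4 if 0x10000 ≤ U ≤ 0x10FFFF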
Where U is the Unicode code point value. UTF-16 uses 2 bytes for code points in the Basic Multilingual Plane (U ≤ 0xFFFF) and 4 bytes (a surrogate pair) for supplementary planes. With U′ = U - 0x10000, the surrogate pair is computed as:
high = 0xD800 + (U′ >> 10)
low = 0xDC00 + (U′ & 0x3FF)
Where high ∈ [0xD800, 0xDBFF] and low ∈ [0xDC00, 0xDFFF].
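
As a worked illustration of these formulas, the following sketch packs a code point into UTF-8 bytes by hand and computes a UTF-16 surrogate pair, taking U′ = U - 0x10000 as the offset into the supplementary planes. The function names `utf8_encode` and `surrogate_pair` are hypothetical, chosen for this example; in practice `str.encode` does the same work.

```python
def utf8_encode(cp: int) -> bytes:
    """Encode one code point as UTF-8 using the standard bit layout."""
    if not 0 <= cp <= 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError(f"not a Unicode scalar value: {cp:#x}")
    if cp <= 0x7F:        # 1 byte:  0xxxxxxx
        return bytes([cp])
    if cp <= 0x7FF:       # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp <= 0xFFFF:      # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])


def surrogate_pair(cp: int) -> tuple[int, int]:
    """Return (high, low) UTF-16 surrogates for a supplementary code point."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("only supplementary-plane code points need a pair")
    offset = cp - 0x10000  # U' in the formulas above
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)


if __name__ == "__main__":
    print(utf8_encode(0x20AC).hex(" "))              # euro sign
    high, low = surrogate_pair(0x1F600)              # grinning face emoji
    print(f"{high:04X} {low:04X}")
```

For U+1F600, U′ is 0xF600, so the high surrogate is 0xD800 + 0x3D = 0xD83D and the low surrogate is 0xDC00 + 0x200 = 0xDE00.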
Reference Data
| Unicode Block | Range | Characters | Script | Common Use |
|---|---|---|---|---|
| Basic Latin | U+0000 - U+007F | 128 | Latin / Common | ASCII: English letters, digits, basic punctuation |
| Latin-1 Supplement | U+0080 - U+00FF | 128 | Latin | Western European accented characters (é, ñ, ü) |
| General Punctuation | U+2000 - U+206F | 112 | Common | En/em dashes, ellipsis, invisible formatters |
| Cyrillic | U+0400 - U+04FF | 256 | Cyrillic | Russian, Ukrainian, Bulgarian alphabets |
| Arabic | U+0600 - U+06FF | 256 | Arabic | Arabic script, RTL text |
| CJK Unified Ideographs | U+4E00 - U+9FFF | 20,992 | Han | Chinese, Japanese Kanji, Korean Hanja |
| Hangul Syllables | U+AC00 - U+D7AF | 11,184 | Hangul | Korean precomposed syllables |
| Devanagari | U+0900 - U+097F | 128 | Devanagari | Hindi, Sanskrit, Marathi |
| Emoticons | U+1F600 - U+1F64F | 80 | Common | Smiley faces, gesture emoji |
| Mathematical Operators | U+2200 - U+22FF | 256 | Common | ∀, ∃, ∈, ∑, ∏, √, ∞ |
| Box Drawing | U+2500 - U+257F | 128 | Common | Terminal UI borders: ─, │, ┌, └ |
| Greek and Coptic | U+0370 - U+03FF | 135 | Greek | α, β, γ, δ - math and science notation |
| Currency Symbols | U+20A0 - U+20CF | 48 | Common | ₹, ₿, €, ₽ - monetary symbols |
| Combining Diacritical Marks | U+0300 - U+036F | 112 | Inherited | Accents, tildes, and marks that combine with preceding char |
| Specials | U+FFF0 - U+FFFF | 5 | Common | Replacement character U+FFFD, object replacement character U+FFFC, interlinear annotation controls; also holds noncharacters U+FFFE and U+FFFF |
| Private Use Area | U+E000 - U+F8FF | 6,400 | Unknown | Custom glyphs (icons, ligatures in fonts) |
| Miscellaneous Symbols | U+2600 - U+26FF | 256 | Common | ☀, ★, ♠, ☎ - common symbols |
| Supplementary Private Use A | U+F0000 - U+FFFFF | 65,534 | Unknown | Plane 15 private use |
| Tags | U+E0000 - U+E007F | 97 | Common | Emoji flag tag sequences (subdivision flags) |
| Halfwidth and Fullwidth Forms | U+FF00 - U+FFEF | 225 | Common | Ａ, １ - CJK compatibility full-width variants |
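
A block lookup over ranges like those in the table can be sketched as a binary search on sorted start points. Python's `unicodedata` module does not expose the block property, so this example hard-codes a few rows from the table above; the names `BLOCKS` and `block_of` are hypothetical, and a complete tool would load every range from the Unicode Character Database's `Blocks.txt` instead.

```python
from bisect import bisect_right

# (start, end, name), sorted by start. Only an excerpt of the table above.
BLOCKS = [
    (0x0000, 0x007F, "Basic Latin"),
    (0x0080, 0x00FF, "Latin-1 Supplement"),
    (0x0300, 0x036F, "Combining Diacritical Marks"),
    (0x0370, 0x03FF, "Greek and Coptic"),
    (0x0400, 0x04FF, "Cyrillic"),
    (0x2000, 0x206F, "General Punctuation"),
    (0x4E00, 0x9FFF, "CJK Unified Ideographs"),
    (0x1F600, 0x1F64F, "Emoticons"),
]
STARTS = [start for start, _, _ in BLOCKS]


def block_of(cp: int):
    """Return the block name for a code point, or None if not in the excerpt."""
    i = bisect_right(STARTS, cp) - 1  # last block starting at or before cp
    if i >= 0 and cp <= BLOCKS[i][1]:
        return BLOCKS[i][2]
    return None


if __name__ == "__main__":
    for ch in "A\u00e9\u200b\U0001F600":
        print(f"U+{ord(ch):04X}", block_of(ord(ch)))
```

Because blocks never overlap, checking only the nearest preceding start point is sufficient; code points that fall in a gap between listed ranges return `None`.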