Supports all Unicode planes including emoji, combining marks, and control characters.
About

Unicode encodes 149,813 characters across 161 scripts as of version 15.1. A single visible glyph can consist of multiple code points (a base character, combining marks, variation selectors, and zero-width joiners), which makes raw string length unreliable. Confusable characters (homoglyphs), invisible characters (U+200B Zero Width Space, U+FEFF BOM), and RTL overrides create security vulnerabilities in URLs, usernames, and source code. This tool decomposes any input into its constituent code points and reports each character's Unicode block, general category, script assignment, and exact byte representation in both UTF-8 and UTF-16. It flags invisible and potentially dangerous characters. Normalization comparison (NFC, NFD, NFKC, NFKD) reveals whether two visually identical strings differ at the binary level, a common source of bugs in string comparison, database lookups, and authentication systems.


Formulas

UTF-8 encodes a code point U into a variable number of bytes based on its magnitude:

1 byte:  0xxxxxxx                             if U ≤ 0x7F (127)
2 bytes: 110xxxxx 10xxxxxx                    if U ≤ 0x7FF (2,047)
3 bytes: 1110xxxx 10xxxxxx 10xxxxxx           if U ≤ 0xFFFF (65,535)
4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx  if U ≤ 0x10FFFF (1,114,111)
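The byte layouts above can be sketched in JavaScript as follows. This is a minimal illustration, not the tool's actual implementation: the names `utf8Bytes` and `hex` are hypothetical, and rejecting surrogate code points is an assumption based on UTF-8 only encoding scalar values.

```javascript
// Hypothetical helper: encode one Unicode code point into its UTF-8 bytes.
function utf8Bytes(cp) {
  if (!Number.isInteger(cp) || cp < 0 || cp > 0x10FFFF) {
    throw new RangeError("code point out of range");
  }
  // Assumption: surrogates (U+D800 to U+DFFF) are rejected, since they are
  // not valid scalar values and have no well-formed UTF-8 encoding.
  if (cp >= 0xD800 && cp <= 0xDFFF) {
    throw new RangeError("surrogate code points cannot be encoded");
  }
  if (cp <= 0x7F) return [cp];                                  // 0xxxxxxx
  if (cp <= 0x7FF) {
    return [0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)];              // 110xxxxx 10xxxxxx
  }
  if (cp <= 0xFFFF) {
    return [
      0xE0 | (cp >> 12),                                        // 1110xxxx
      0x80 | ((cp >> 6) & 0x3F),                                // 10xxxxxx
      0x80 | (cp & 0x3F),                                       // 10xxxxxx
    ];
  }
  return [
    0xF0 | (cp >> 18),                                          // 11110xxx
    0x80 | ((cp >> 12) & 0x3F),                                 // 10xxxxxx
    0x80 | ((cp >> 6) & 0x3F),                                  // 10xxxxxx
    0x80 | (cp & 0x3F),                                         // 10xxxxxx
  ];
}

// Hypothetical formatter: bytes as space-separated uppercase hex.
const hex = (bytes) =>
  bytes.map((b) => b.toString(16).toUpperCase().padStart(2, "0")).join(" ");

console.log(hex(utf8Bytes(0x41)));    // "41"
console.log(hex(utf8Bytes(0xE9)));    // "C3 A9"
console.log(hex(utf8Bytes(0x20AC)));  // "E2 82 AC"
console.log(hex(utf8Bytes(0x1F389))); // "F0 9F 8E 89"
```

In practice the built-in `TextEncoder` produces the same bytes for a whole string; the manual version is shown only to make the bit layout visible.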

Where U is the Unicode code point value. UTF-16 uses 2 bytes for code points in the Basic Multilingual Plane (U ≤ 0xFFFF) and 4 bytes (a surrogate pair) for supplementary planes. The surrogate pair is computed as:

U′ = U − 0x10000
high = 0xD800 + (U′ >> 10)
low = 0xDC00 + (U′ & 0x3FF)

Where high ∈ [0xD800, 0xDBFF] and low ∈ [0xDC00, 0xDFFF].
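As a rough sketch of the surrogate pair computation, the formulas translate directly into JavaScript. The function name `toSurrogatePair` is hypothetical and only covers supplementary code points; BMP code points need no pair.

```javascript
// Hypothetical helper: split a supplementary code point into its
// UTF-16 high and low surrogates.
function toSurrogatePair(cp) {
  if (!Number.isInteger(cp) || cp < 0x10000 || cp > 0x10FFFF) {
    throw new RangeError("expected a code point in U+10000..U+10FFFF");
  }
  const offset = cp - 0x10000;             // U' in the formulas
  const high = 0xD800 + (offset >> 10);    // top 10 bits
  const low = 0xDC00 + (offset & 0x3FF);   // bottom 10 bits
  return [high, low];
}

const [high, low] = toSurrogatePair(0x1F389);
console.log(high.toString(16).toUpperCase()); // "D83C"
console.log(low.toString(16).toUpperCase());  // "DF89"
```

The result can be cross-checked against the engine itself: `String.fromCodePoint(0x1F389).charCodeAt(0)` and `.charCodeAt(1)` return the same two code units.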

Reference Data

| Unicode Block | Range | Characters | Script | Common Use |
|---|---|---|---|---|
| Basic Latin | U+0000 - U+007F | 128 | Latin / Common | ASCII: English letters, digits, basic punctuation |
| Latin-1 Supplement | U+0080 - U+00FF | 128 | Latin | Western European accented characters (é, ñ, ü) |
| General Punctuation | U+2000 - U+206F | 112 | Common | En/em dashes, ellipsis, invisible formatters |
| Cyrillic | U+0400 - U+04FF | 256 | Cyrillic | Russian, Ukrainian, Bulgarian alphabets |
| Arabic | U+0600 - U+06FF | 256 | Arabic | Arabic script, RTL text |
| CJK Unified Ideographs | U+4E00 - U+9FFF | 20,992 | Han | Chinese, Japanese Kanji, Korean Hanja |
| Hangul Syllables | U+AC00 - U+D7AF | 11,184 | Hangul | Korean precomposed syllables |
| Devanagari | U+0900 - U+097F | 128 | Devanagari | Hindi, Sanskrit, Marathi |
| Emoticons | U+1F600 - U+1F64F | 80 | Common | Smiley faces, gesture emoji |
| Mathematical Operators | U+2200 - U+22FF | 256 | Common | ∀, ∃, ∈, ∑, ∏, √, ∞ |
| Box Drawing | U+2500 - U+257F | 128 | Common | Terminal UI borders: ─, │, ┌, ┐ |
| Greek and Coptic | U+0370 - U+03FF | 135 | Greek | α, β, γ, δ - math and science notation |
| Currency Symbols | U+20A0 - U+20CF | 48 | Common | ₹, ₿, €, ₽ - monetary symbols |
| Combining Diacritical Marks | U+0300 - U+036F | 112 | Inherited | Accents, tildes, and marks that combine with the preceding character |
| Specials | U+FFF0 - U+FFFF | 5 | Common | Replacement character U+FFFD, object replacement character U+FFFC, noncharacters U+FFFE and U+FFFF |
| Private Use Area | U+E000 - U+F8FF | 6,400 | Unknown | Custom glyphs (icons, ligatures in fonts) |
| Miscellaneous Symbols | U+2600 - U+26FF | 256 | Common | ☀, ☎, ♠, ⚠ - common symbols |
| Supplementary Private Use Area-A | U+F0000 - U+FFFFF | 65,534 | Unknown | Plane 15 private use |
| Tags | U+E0000 - U+E007F | 97 | Common | Emoji flag tag sequences (subdivision flags) |
| Halfwidth and Fullwidth Forms | U+FF00 - U+FFEF | 225 | Common | １, ０ - CJK compatibility full-width variants |

Frequently Asked Questions

JavaScript strings use UTF-16 encoding internally. Characters outside the Basic Multilingual Plane (code points above U+FFFF, such as emoji 🎉 at U+1F389) require a surrogate pair: two 16-bit code units. The .length property counts code units, not code points. Use [...str].length or the for...of iterator to count actual code points. Even this may not match visible glyphs, since combining marks (e.g., é = e + U+0301) and ZWJ sequences (e.g., 👨‍👩‍👧 = 3 emoji + 2 ZWJ) form single grapheme clusters from multiple code points.
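The three notions of length described here can be compared side by side. The sketch below uses a hypothetical `measure` helper and assumes the runtime provides `Intl.Segmenter` (available in current browsers and recent Node.js releases) for grapheme segmentation.

```javascript
// Hypothetical helper: report code units, code points, and grapheme clusters.
// Assumption: Intl.Segmenter exists in the runtime.
function measure(str) {
  const segmenter = new Intl.Segmenter(undefined, { granularity: "grapheme" });
  return {
    codeUnits: str.length,                          // UTF-16 code units
    codePoints: [...str].length,                    // string iterator yields code points
    graphemes: [...segmenter.segment(str)].length,  // user-perceived characters
  };
}

const party = "\u{1F389}";                                      // one emoji
const accented = "e\u0301";                                     // e + combining acute
const family = "\u{1F468}\u200D\u{1F469}\u200D\u{1F467}";       // 3 emoji + 2 ZWJ

console.log(measure(party));    // { codeUnits: 2, codePoints: 1, graphemes: 1 }
console.log(measure(accented)); // { codeUnits: 2, codePoints: 2, graphemes: 1 }
console.log(measure(family));   // { codeUnits: 8, codePoints: 5, graphemes: 1 }
```

Which count is the right one depends on the task: storage limits usually care about code units or bytes, while cursor movement and "characters remaining" counters usually care about graphemes.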
Invisible characters such as Zero Width Space (U+200B), Zero Width Non-Joiner (U+200C), Right-to-Left Override (U+202E), and the BOM (U+FEFF) can be injected into usernames, URLs, filenames, and source code. The RTL override can make a file whose real name ends in ".exe" appear to end in ".doc": for example, "document" + U+202E + "cod.exe" is displayed as "documentexe.doc". Homoglyph attacks substitute Latin "a" (U+0061) with Cyrillic "а" (U+0430) to create spoofed domain names. This tool flags all invisible and potentially dangerous characters with explicit warnings.
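A simplified version of this kind of check might look like the following. It is only a sketch: `SUSPICIOUS`, `findSuspicious`, and `mixesLatinAndCyrillic` are hypothetical names, the list covers just the four characters named above, and the mixed-script test is a crude stand-in for real confusable detection.

```javascript
// Assumption: a deliberately short list, limited to the characters named above.
const SUSPICIOUS = new Map([
  [0x200B, "ZERO WIDTH SPACE"],
  [0x200C, "ZERO WIDTH NON-JOINER"],
  [0x202E, "RIGHT-TO-LEFT OVERRIDE"],
  [0xFEFF, "ZERO WIDTH NO-BREAK SPACE (BOM)"],
]);

// Hypothetical helper: list suspicious code points with their code unit offset.
function findSuspicious(text) {
  const hits = [];
  let index = 0;
  for (const ch of text) {
    const cp = ch.codePointAt(0);
    if (SUSPICIOUS.has(cp)) {
      hits.push({
        index,
        codePoint: "U+" + cp.toString(16).toUpperCase().padStart(4, "0"),
        name: SUSPICIOUS.get(cp),
      });
    }
    index += ch.length; // advance by 1 or 2 code units
  }
  return hits;
}

// Hypothetical helper: true when a string contains both Latin and Cyrillic letters,
// which is a common (but not conclusive) sign of a homoglyph substitution.
function mixesLatinAndCyrillic(text) {
  return /\p{Script=Latin}/u.test(text) && /\p{Script=Cyrillic}/u.test(text);
}

console.log(findSuspicious("ad\u200Bmin"));
console.log(mixesLatinAndCyrillic("p\u0430ypal")); // Cyrillic U+0430 in place of "a"
```

A mixed-script result is a hint rather than proof, since legitimate text can mix scripts and a spoof can be written entirely in one script.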
Unicode allows the same visual character to be represented differently. The letter "é" can be a single code point U+00E9 (the precomposed form, as in NFC) or two code points U+0065 + U+0301 (the decomposed form, as in NFD). These render identically but fail strict equality comparison (===). Database lookups, password hashing, filename matching, and API comparisons can silently break. NFC is the form the W3C recommends for web content. NFKC additionally normalizes compatibility equivalents: the ligature "ﬁ" (U+FB01) becomes the two letters "fi". Always normalize before comparing or hashing.
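The effect is easy to reproduce with the built-in `String.prototype.normalize`. In this sketch, `equalsNormalized` is a hypothetical helper name, and NFC as the default form is an assumption that follows the recommendation above.

```javascript
const composed = "\u00E9";    // é as one code point
const decomposed = "e\u0301"; // e followed by COMBINING ACUTE ACCENT

console.log(composed === decomposed);                                   // false
console.log(composed.normalize("NFC") === decomposed.normalize("NFC")); // true
console.log(composed.normalize("NFD").length);                          // 2

// NFC leaves the ligature alone; NFKC replaces it with its compatibility mapping.
console.log("\uFB01".normalize("NFC") === "\uFB01"); // true
console.log("\uFB01".normalize("NFKC") === "fi");    // true

// Hypothetical helper: compare two strings after normalizing both to one form.
function equalsNormalized(a, b, form = "NFC") {
  return a.normalize(form) === b.normalize(form);
}

console.log(equalsNormalized("caf\u00E9", "cafe\u0301")); // true
```

Whichever form is chosen, it should be applied consistently at every boundary (input, storage, comparison), since mixing forms reintroduces the mismatch.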
The tool iterates using code-point-aware logic (for...of) and displays each individual code point. An emoji like 👨‍👩‍👧‍👦 consists of 7 code points: 4 person/family emoji connected by 3 Zero Width Joiners (U+200D). Each code point is shown separately with its role labeled. Skin tone modifiers (U+1F3FB - U+1F3FF) and variation selectors (U+FE0E text, U+FE0F emoji) are also individually identified and flagged.
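A decomposition along these lines could be sketched as below. The function name `describeCodePoints` and the role labels are illustrative assumptions, not the tool's actual output format.

```javascript
// Hypothetical helper: list each code point of a string with a coarse role label.
function describeCodePoints(text) {
  // Array.from iterates strings by code point, like for...of.
  return Array.from(text, (ch) => {
    const cp = ch.codePointAt(0);
    let role = "base";
    if (cp === 0x200D) role = "zero width joiner";
    else if (cp >= 0x1F3FB && cp <= 0x1F3FF) role = "skin tone modifier";
    else if (cp === 0xFE0E) role = "variation selector (text)";
    else if (cp === 0xFE0F) role = "variation selector (emoji)";
    return {
      codePoint: "U+" + cp.toString(16).toUpperCase().padStart(4, "0"),
      role,
    };
  });
}

// Family emoji: man, ZWJ, woman, ZWJ, girl, ZWJ, boy.
const familyOfFour =
  "\u{1F468}\u200D\u{1F469}\u200D\u{1F467}\u200D\u{1F466}";

const parts = describeCodePoints(familyOfFour);
console.log(parts.length);                                              // 7
console.log(parts.filter((p) => p.role === "zero width joiner").length); // 3
console.log(parts.map((p) => p.codePoint).join(" "));
```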
Unicode divides the entire code space (U+0000 to U+10FFFF, totaling 1,114,112 code points) into 17 planes of 65,536 code points each. Plane 0 (U+0000 - U+FFFF) is the Basic Multilingual Plane (BMP) containing most common scripts. Plane 1 is the Supplementary Multilingual Plane (SMP) containing emoji, historic scripts, and musical notation. Within each plane, code points are grouped into named blocks (e.g., "Greek and Coptic" at U+0370 - U+03FF). Blocks are contiguous ranges; a character's block indicates its neighborhood but not necessarily its script or category.
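Because every plane holds exactly 65,536 (0x10000) code points, the plane number is simply the code point divided by that size. A minimal sketch, with the hypothetical name `planeOf` and only the two plane names mentioned above filled in:

```javascript
const PLANE_SIZE = 0x10000; // 65,536 code points per plane
const PLANE_COUNT = 17;     // planes 0 through 16

// Assumption: only the plane names given in the text are listed here.
const KNOWN_PLANES = {
  0: "Basic Multilingual Plane (BMP)",
  1: "Supplementary Multilingual Plane (SMP)",
};

// Hypothetical helper: return the plane number and a display name for a code point.
function planeOf(cp) {
  if (!Number.isInteger(cp) || cp < 0 || cp >= PLANE_SIZE * PLANE_COUNT) {
    throw new RangeError("code point out of range");
  }
  const plane = Math.floor(cp / PLANE_SIZE);
  return { plane, name: KNOWN_PLANES[plane] || "Plane " + plane };
}

console.log(PLANE_SIZE * PLANE_COUNT); // 1114112
console.log(planeOf(0x03B1));          // Greek alpha, plane 0
console.log(planeOf(0x1F389));         // party popper emoji, plane 1
console.log(planeOf(0x10FFFF).plane);  // 16
```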
Missing glyphs occur when the user's system fonts do not contain a rendering for that code point. The tool correctly identifies the code point and provides its Unicode name and properties regardless of rendering. Private Use Area characters (U+E000 - U+F8FF) are intentionally undefined by Unicode and rely on custom fonts. Unassigned code points in reserved blocks will also show as missing. The tool marks these with their official status.