About

Every character in digital text belongs to a specific Unicode version. A document mixing characters from Unicode 1.0 (1991) and Unicode 16.0 (2024) may render correctly on modern systems but fail on older terminals, embedded devices, or legacy databases. This tool extracts each codepoint U+XXXX from your input and maps it to the exact Unicode standard version that introduced it. It identifies the assigned block (e.g., "Latin Extended-B", "CJK Unified Ideographs") and general category (L for Letter, N for Number, S for Symbol). Use it to audit compatibility before deploying multilingual content or embedding special symbols in systems with constrained font support.

Limitation: this tool covers assigned codepoint ranges per version. Private Use Area characters (U+E000 - U+F8FF) are reported as version 1.1 per original allocation but carry no standard glyph. Unassigned codepoints return "Unassigned". Surrogate codepoints (U+D800 - U+DFFF) are UTF-16 encoding artifacts, not valid characters.

Formulas

Each character in the input string is decomposed into its Unicode codepoint using JavaScript's codePointAt method, which correctly handles surrogate pairs for astral plane characters (codepoints above U+FFFF).

cp = input.codePointAt(i)
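
The decomposition step can be sketched as follows. This is a minimal illustration, not the tool's actual source: the function name toCodePoints is hypothetical, and the loop advances by two UTF-16 units whenever codePointAt returns a value above U+FFFF, so a surrogate pair is visited once.

```javascript
// Hypothetical helper: collect the code points of a string in input order.
function toCodePoints(text) {
  const result = [];
  for (let i = 0; i < text.length; ) {
    const cp = text.codePointAt(i);
    result.push(cp);
    // Code points above U+FFFF occupy two UTF-16 code units.
    i += cp > 0xffff ? 2 : 1;
  }
  return result;
}

// "A" followed by U+1F600 (an astral plane character).
console.log(toCodePoints("A\u{1F600}")); // 65 and 128512
```

Iterating with for...of or Array.from would give the same code points, since both walk a string by code point rather than by UTF-16 unit.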

The codepoint cp is then looked up against a sorted array of Unicode version assignment ranges. Each range is a tuple:

[start, end, version]

A character belongs to Unicode version v if:

start ≤ cp ≤ end ∧ version = v

Where cp is the decimal codepoint value, start and end define the inclusive range boundary, and version is a string like "6.0". The hex representation displayed is computed via:

hex = cp.toString(16).toUpperCase().padStart(4, "0")
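
As a quick check of the formatting expression above, the sketch below wraps it in a hypothetical helper (formatCodePoint is an illustrative name) and prepends the "U+" prefix used throughout this page. padStart only pads, so codepoints above U+FFFF keep all five or six hex digits.

```javascript
// Hypothetical helper: format a code point in U+XXXX notation.
function formatCodePoint(cp) {
  const hex = cp.toString(16).toUpperCase().padStart(4, "0");
  return "U+" + hex;
}

console.log(formatCodePoint(0x41));     // U+0041
console.log(formatCodePoint(0x20b9));   // U+20B9
console.log(formatCodePoint(0x1f600));  // U+1F600
```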

The general category is determined by testing the codepoint against category ranges (e.g., L for letters matching alphabetic ranges, N for numbers matching digit/numeric ranges). Block membership is resolved similarly against the ~330 named Unicode blocks.
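
The lookup over sorted [start, end, label] tuples can be sketched with a binary search. Everything named here is an assumption for illustration: SAMPLE_BLOCKS is a tiny hand-picked subset of block ranges, not the tool's table, and findRange and majorCategory are hypothetical names. For the category, the sketch swaps in the JavaScript engine's Unicode property escapes instead of hand-maintained category ranges, so its answers follow whatever Unicode version the engine ships with.

```javascript
// Hypothetical sample of [start, end, label] tuples, sorted by start and
// non-overlapping. A real table would cover every range.
const SAMPLE_BLOCKS = [
  [0x0000, 0x007f, "Basic Latin"],
  [0x0080, 0x00ff, "Latin-1 Supplement"],
  [0x0100, 0x017f, "Latin Extended-A"],
  [0x0180, 0x024f, "Latin Extended-B"],
  [0x4e00, 0x9fff, "CJK Unified Ideographs"],
  [0x1f600, 0x1f64f, "Emoticons"],
];

// Binary search for the tuple with start <= cp <= end; null if none matches.
function findRange(ranges, cp) {
  let lo = 0;
  let hi = ranges.length - 1;
  while (lo <= hi) {
    const mid = (lo + hi) >> 1;
    const [start, end, label] = ranges[mid];
    if (cp < start) hi = mid - 1;
    else if (cp > end) lo = mid + 1;
    else return label;
  }
  return null;
}

// Major general category via property escapes (an engine-provided
// alternative to explicit category ranges).
const MAJOR_CATEGORIES = [
  ["L", /\p{L}/u], ["M", /\p{M}/u], ["N", /\p{N}/u], ["P", /\p{P}/u],
  ["S", /\p{S}/u], ["Z", /\p{Z}/u], ["C", /\p{C}/u],
];

function majorCategory(cp) {
  const ch = String.fromCodePoint(cp);
  for (const [name, pattern] of MAJOR_CATEGORIES) {
    if (pattern.test(ch)) return name;
  }
  return null;
}

console.log(findRange(SAMPLE_BLOCKS, 0x01c0)); // Latin Extended-B
console.log(findRange(SAMPLE_BLOCKS, 0x0370)); // null (not in this sample)
console.log(majorCategory(0x41));              // L
```

The same findRange function would work unchanged for version tuples, since only the label differs.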

Reference Data

| Unicode Version | Release Year | Total Characters Added | Notable Additions |
|---|---|---|---|
| 1.0 | 1991 | ~7,129 | Basic Latin, Greek, Cyrillic, CJK core |
| 1.1 | 1993 | ~28,327 | CJK Unified Ideographs (20,902), Tibetan |
| 2.0 | 1996 | ~6,516 | Surrogate mechanism, Hangul Syllables rewrite |
| 3.0 | 1999 | ~10,307 | Cherokee, Ethiopic, Khmer, Mongolian, Myanmar, Sinhala |
| 3.1 | 2001 | ~44,946 | CJK Extension B (42,711), Deseret, Gothic, Old Italic, Musical Symbols |
| 3.2 | 2002 | ~1,016 | Philippine scripts (Buhid, Hanunoo, Tagalog, Tagbanwa) |
| 4.0 | 2003 | ~1,226 | Cypriot, Limbu, Linear B, Osmanya, Shavian, Tai Le, Ugaritic |
| 4.1 | 2005 | ~1,273 | Buginese, Coptic, Glagolitic, New Tai Lue, Old Persian |
| 5.0 | 2006 | ~1,369 | N'Ko, Phags-pa, Phoenician, currency symbols |
| 5.1 | 2008 | ~1,624 | Carian, Lycian, Lydian, Vai, Sundanese |
| 5.2 | 2009 | ~6,648 | CJK Extension C, Egyptian Hieroglyphs, Tai Tham |
| 6.0 | 2010 | ~2,088 | Emoji first batch, Indian Rupee sign ₹, Mandaic, Batak |
| 6.1 | 2012 | ~732 | Chakma, Miao, Sharada, Sora Sompeng, Takri |
| 6.2 | 2012 | 1 | Turkish Lira sign ₺ |
| 6.3 | 2013 | 5 | Bidirectional formatting characters |
| 7.0 | 2014 | ~2,834 | Ruble sign ₽, Bassa Vah, Duployan, Grantha, Khojki, Pau Cin Hau |
| 8.0 | 2015 | ~7,716 | CJK Extension E, Cherokee lowercase, Emoji skin tones (Fitzpatrick) |
| 9.0 | 2016 | ~7,500 | Adlam, Bhaiksuki, Tangut (6,881 chars), 72 new emoji |
| 10.0 | 2017 | ~8,518 | CJK Extension F, Bitcoin sign ₿, Zanabazar Square, Soyombo, 56 emoji |
| 11.0 | 2018 | ~684 | Dogra, Georgian Mtavruli, Hanifi Rohingya, 66 emoji |
| 12.0 | 2019 | ~554 | Elymaic, Nandinagari, Wancho, 61 emoji |
| 12.1 | 2019 | 1 | Reiwa era square character (Japanese) |
| 13.0 | 2020 | ~5,930 | CJK Extension G, Chorasmian, Dives Akuru, Yezidi, 55 emoji |
| 14.0 | 2021 | ~838 | Toto, Cypro-Minoan, Vithkuqi, Old Uyghur, Tangsa, 37 emoji |
| 15.0 | 2022 | ~4,489 | CJK Extension H, Kawi, Nag Mundari, 31 emoji |
| 15.1 | 2023 | 627 | CJK Extension I (622 ideographs), 5 emoji |
| 16.0 | 2024 | ~5,185 | Egyptian Hieroglyphs Extended-A, Garay, Gurung Khema, 7 emoji |

Frequently Asked Questions

Many modern emoji are composed of multiple codepoints joined by Zero Width Joiner (U+200D). For example, the family emoji is four person codepoints joined by ZWJ. Each constituent codepoint may originate from a different Unicode version. This tool decomposes the full sequence and reports each codepoint individually. Variation selectors (U+FE0E and U+FE0F) also appear as separate entries.
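
The decomposition of a ZWJ sequence can be illustrated with the family emoji mentioned above. This is a sketch under stated assumptions: describeSequence and its role labels are hypothetical, and the example string is man, woman, girl, boy joined by ZWJ.

```javascript
const ZWJ = 0x200d;
const VARIATION_SELECTORS = new Set([0xfe0e, 0xfe0f]);

// Hypothetical helper: one entry per code point, tagged with a rough role.
function describeSequence(text) {
  return Array.from(text, (ch) => {
    const cp = ch.codePointAt(0);
    let role = "character";
    if (cp === ZWJ) role = "zero width joiner";
    else if (VARIATION_SELECTORS.has(cp)) role = "variation selector";
    return { hex: "U+" + cp.toString(16).toUpperCase().padStart(4, "0"), role };
  });
}

// Man, woman, girl, boy joined by three ZWJ code points.
const family = "\u{1F468}\u200D\u{1F469}\u200D\u{1F467}\u200D\u{1F466}";
const parts = describeSequence(family);
console.log(parts.map((p) => p.hex).join(" "));
// U+1F468 U+200D U+1F469 U+200D U+1F467 U+200D U+1F466
console.log(parts.length); // 7
```

A single visible glyph therefore produces seven rows: four person codepoints and three joiners.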
As of Unicode 16.0, CJK Unified Ideographs span the base block plus Extensions A through I, each assigned to a specific Unicode version. The base CJK block (U+4E00 - U+9FFF) was partially allocated in version 1.1 and extended in later versions. This tool maps the major extension blocks accurately: Extension A to 3.0, Extension B to 3.1, Extension C to 5.2, and so on through Extension I in 15.1 (Extension J is not part of Unicode 16.0; it was added in 17.0). Individual codepoint additions within existing blocks across minor versions are approximated to the block's primary version.
The Basic Multilingual Plane Private Use Area (U+E000 - U+F8FF) was allocated in Unicode 1.1. The Supplementary Private Use Areas (U+F0000 - U+FFFFD and U+100000 - U+10FFFD) were allocated in 2.0. These codepoints carry no standard glyph or name. The tool reports their version as the version that allocated the range. The actual glyphs depend entirely on installed fonts.
This tool reports which Unicode version introduced each character. To determine system compatibility, compare the reported version against your target platform's Unicode support level. For example, Android 10 supports Unicode 12.0, so any character from version 13.0 or later may not render. Windows 10 (May 2020) supports up to 13.0. Font coverage is a separate concern: a system may support a Unicode version but lack glyphs in installed fonts for specific blocks.
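
Comparing a reported version against a platform's support level needs a numeric comparison, because version strings sort incorrectly as plain text ("6.0" compares greater than "13.0"). The sketch below is one way to do it; compareVersions and newerThan are hypothetical names, and the platform level "12.0" is simply the example figure quoted above.

```javascript
// Compare dotted version strings numerically, component by component.
// Returns -1, 0, or 1.
function compareVersions(a, b) {
  const pa = a.split(".").map(Number);
  const pb = b.split(".").map(Number);
  const len = Math.max(pa.length, pb.length);
  for (let i = 0; i < len; i++) {
    const diff = (pa[i] || 0) - (pb[i] || 0);
    if (diff !== 0) return Math.sign(diff);
  }
  return 0;
}

// Keep only the versions that are newer than the platform's support level.
function newerThan(versions, platformVersion) {
  return versions.filter((v) => compareVersions(v, platformVersion) > 0);
}

console.log("6.0" > "13.0");                   // true (string comparison is misleading)
console.log(compareVersions("6.0", "13.0"));   // -1
console.log(newerThan(["1.1", "6.0", "13.0", "15.1"], "12.0")); // 13.0 and 15.1
```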
Control characters (U+0000 - U+001F and U+007F - U+009F) were inherited from ASCII and ISO 8859-1, included in Unicode 1.0/1.1. They have no visual representation. The tool displays them with their Unicode name (e.g., NULL, LINE FEED) and marks them as category Cc (Control). They are legitimate codepoints and will appear in the analysis.
Yes. Results are listed in strict input order, codepoint by codepoint. For astral plane characters encoded as surrogate pairs in JavaScript's internal UTF-16, the tool yields a single codepoint per pair rather than two surrogate halves. However, combining character sequences (e.g., a base letter followed by combining diacritical marks) appear as separate rows, one per codepoint. The position column indicates the zero-based character index in the original string.
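
The row ordering can be sketched as below. The text above does not say whether the position counts codepoints or UTF-16 code units, so this hypothetical buildRows helper records both; treat the field names as illustrative rather than as the tool's actual output format.

```javascript
// Hypothetical helper: one row per code point, in input order.
function buildRows(text) {
  const rows = [];
  let index = 0;       // zero-based code point index
  let utf16Offset = 0; // zero-based offset in UTF-16 code units
  for (const ch of text) {
    rows.push({ index, utf16Offset, codePoint: ch.codePointAt(0) });
    utf16Offset += ch.length; // 2 for surrogate pairs, otherwise 1
    index += 1;
  }
  return rows;
}

// "e" + COMBINING ACUTE ACCENT, then U+1F600, then "!".
const rows = buildRows("e\u0301\u{1F600}!");
console.log(rows.length);         // 4
console.log(rows[3].index);       // 3
console.log(rows[3].utf16Offset); // 4
```

The accented letter occupies two rows (base and combining mark), while the astral plane character occupies one row even though it spans two UTF-16 units.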