About

Counting Chinese characters is not as simple as reading string.length. JavaScript's native .length property counts UTF-16 code units, not code points. A single CJK Extension B character (U+20000 - U+2A6DF) occupies two code units as a surrogate pair, so .length reports 2 instead of 1. This tool iterates by code point using Array.from(), then classifies each code point against 9 Unicode block ranges covering CJK Unified Ideographs, Extensions A and B, Compatibility Ideographs, Radicals, Symbols, Punctuation, Bopomofo, and Fullwidth Forms. Miscounts matter on character-limited platforms: Weibo enforces 2,000 Chinese characters per post, and many SMS gateways cap CJK messages at 70 characters per segment. Getting the number wrong means truncated messages or rejected submissions.
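
A minimal sketch of the difference between the two counts. The sample string and variable names are illustrative and are not taken from the tool's source:

```javascript
// U+20000 is the first code point in CJK Extension B; it lies outside the BMP.
// The sample is one Extension B ideograph followed by U+5C71 (mountain).
const sample = "\u{20000}\u5C71";

// .length counts UTF-16 code units: 2 for the surrogate pair + 1 for U+5C71.
const codeUnits = sample.length;

// Array.from() uses the string iterator, which yields whole code points.
const codePoints = Array.from(sample).length;

console.log(codeUnits, codePoints); // prints "3 2"
```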

The frequency analysis sorts every distinct character by occurrence count. This is useful for vocabulary profiling, identifying high-frequency hanzi for spaced-repetition study decks, or verifying that a text meets HSK level constraints. The tool assumes input is plain text. It does not resolve variant forms (e.g., traditional 繁 vs. simplified 简 are counted as separate characters). Punctuation classification covers CJK-specific marks (、 。 「 」) separately from ASCII equivalents.
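
One way such a frequency analysis could be written is sketched below. The function names `charFrequency` and `topCharacters` and the `limit` parameter are hypothetical; the tool's actual implementation is not shown in this page.

```javascript
// Build a frequency map keyed by code point (not by UTF-16 code unit).
function charFrequency(text) {
  const freq = new Map();
  for (const ch of text) {
    // for...of iterates by code point, so surrogate pairs stay intact
    freq.set(ch, (freq.get(ch) || 0) + 1);
  }
  return freq;
}

// Return the `limit` most frequent characters as [char, count] pairs.
// Array.prototype.sort is stable, so ties keep first-appearance order.
function topCharacters(text, limit = 30) {
  return [...charFrequency(text)]
    .sort((a, b) => b[1] - a[1])
    .slice(0, limit);
}

// U+56FD (simplified) and U+570B (traditional) are ranked as separate entries.
console.log(topCharacters("\u56FD\u570B\u56FD\u5C71"));
```

Because the map is keyed by the character itself, no variant folding takes place: merging simplified and traditional forms would need an external mapping table, which this sketch deliberately leaves out.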


Formulas

Each input character is classified by testing its Unicode code point against defined block ranges. The primary classification function operates as follows:

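\[
\mathrm{isCJK}(cp) =
\begin{cases}
\text{TRUE} & \text{if } \texttt{0x4E00} \le cp \le \texttt{0x9FFF} \\
\text{TRUE} & \text{if } \texttt{0x3400} \le cp \le \texttt{0x4DBF} \\
\text{TRUE} & \text{if } \texttt{0x20000} \le cp \le \texttt{0x2A6DF} \\
\text{TRUE} & \text{if } \texttt{0xF900} \le cp \le \texttt{0xFAFF} \\
\text{FALSE} & \text{otherwise}
\end{cases}
\]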

Where cp is the Unicode code point obtained via codePointAt(0). Total character count uses code-point-aware iteration:

\[
N_{\text{total}} = \texttt{Array.from(text).length}
\]
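
The piecewise definition above translates almost directly into code. The following is a sketch only: `CJK_RANGES` and `countCJK` are names introduced here for illustration.

```javascript
// Inclusive [start, end] code point ranges from the piecewise definition.
const CJK_RANGES = [
  [0x4E00, 0x9FFF],   // CJK Unified Ideographs
  [0x3400, 0x4DBF],   // CJK Extension A
  [0x20000, 0x2A6DF], // CJK Extension B
  [0xF900, 0xFAFF],   // CJK Compatibility Ideographs
];

function isCJK(cp) {
  return CJK_RANGES.some(([start, end]) => cp >= start && cp <= end);
}

// Count total code points and CJK ideographs in a single pass.
function countCJK(text) {
  let total = 0;
  let cjk = 0;
  for (const ch of text) {
    total += 1;
    if (isCJK(ch.codePointAt(0))) cjk += 1;
  }
  return { total, cjk };
}

// U+5C71, space, "is", space, U+20000: six code points, two of them CJK.
console.log(countCJK("\u5C71 is \u{20000}")); // { total: 6, cjk: 2 }
```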

Character frequency is computed as a map $F : C \to \mathbb{N}$, where $C$ is the set of distinct characters and $F(c)$ is the occurrence count of character $c$. The percentage of CJK content relative to non-whitespace characters is:

\[
\text{CJK\%} = \frac{N_{\text{CJK}}}{N_{\text{total}} - N_{\text{whitespace}}} \times 100
\]
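
A sketch of the density formula in code. Two assumptions are made here that the text does not state: whitespace is detected with the `\s` character class (which also matches the ideographic space U+3000), and input with no non-whitespace characters reports 0 instead of dividing by zero. The name `cjkDensity` is hypothetical.

```javascript
// Same four ranges as the isCJK definition, repeated so this snippet runs on its own.
function isCJK(cp) {
  return (
    (cp >= 0x4E00 && cp <= 0x9FFF) ||
    (cp >= 0x3400 && cp <= 0x4DBF) ||
    (cp >= 0x20000 && cp <= 0x2A6DF) ||
    (cp >= 0xF900 && cp <= 0xFAFF)
  );
}

function cjkDensity(text) {
  let total = 0;
  let whitespace = 0;
  let cjk = 0;
  for (const ch of text) {
    total += 1;
    if (/\s/u.test(ch)) whitespace += 1;      // assumption: \s defines whitespace
    else if (isCJK(ch.codePointAt(0))) cjk += 1;
  }
  const denominator = total - whitespace;
  // Assumption: empty or all-whitespace input yields 0 rather than NaN.
  return denominator === 0 ? 0 : (cjk / denominator) * 100;
}

// Two ideographs, one space, two Latin letters: 2 / (5 - 1) * 100
console.log(cjkDensity("\u5C71\u5C71 ab")); // 50
```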

Reference Data

| Unicode Block | Range | Characters | Description |
| --- | --- | --- | --- |
| CJK Unified Ideographs | U+4E00 - U+9FFF | 20,992 | Core set of common Chinese, Japanese, Korean ideographs |
| CJK Extension A | U+3400 - U+4DBF | 6,592 | Rare and historic ideographs |
| CJK Extension B | U+20000 - U+2A6DF | 42,720 | Very rare ideographs (surrogate pairs in UTF-16) |
| CJK Compatibility Ideographs | U+F900 - U+FAFF | 512 | Duplicates for round-trip compatibility with legacy encodings |
| CJK Radicals Supplement | U+2E80 - U+2EFF | 128 | CJK radical forms used in dictionaries |
| Kangxi Radicals | U+2F00 - U+2FDF | 224 | 214 Kangxi radicals for indexing |
| CJK Symbols & Punctuation | U+3000 - U+303F | 64 | Ideographic comma, period, brackets, marks |
| Halfwidth & Fullwidth Forms | U+FF00 - U+FFEF | 240 | Fullwidth Latin, Katakana, Hangul variants |
| Bopomofo | U+3100 - U+312F | 48 | Phonetic symbols for Mandarin Chinese |
| Bopomofo Extended | U+31A0 - U+31BF | 32 | Additional Bopomofo for Minnan and Hakka |

| Benchmark | Characters | Description |
| --- | --- | --- |
| HSK Level 1 | 174 | Beginner Mandarin vocabulary requirement |
| HSK Level 4 | 1,064 | Intermediate Mandarin proficiency |
| HSK Level 6 | 2,663 | Advanced Mandarin proficiency |
| Common Usage (PRC Standard) | 3,500 | Level 1 (frequently used characters) of the Table of General Standard Chinese Characters |
| Newspaper Literacy | ~3,000 | Sufficient to read ~99% of modern Chinese text |
| Weibo Post Limit | 2,000 | Maximum characters per Weibo post |
| WeChat Article Title | 64 | Maximum title length for WeChat public articles |
| SMS (CJK) | 70 | Single SMS segment limit for UCS-2 encoded messages |
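
The block ranges in the reference table can drive a simple lookup. The sketch below assumes a linear scan over a `BLOCKS` array and an "Other" fallback; both are illustrative choices, not the tool's confirmed design.

```javascript
// Block ranges copied from the reference table (inclusive code point bounds).
const BLOCKS = [
  { name: "CJK Unified Ideographs", start: 0x4E00, end: 0x9FFF },
  { name: "CJK Extension A", start: 0x3400, end: 0x4DBF },
  { name: "CJK Extension B", start: 0x20000, end: 0x2A6DF },
  { name: "CJK Compatibility Ideographs", start: 0xF900, end: 0xFAFF },
  { name: "CJK Radicals Supplement", start: 0x2E80, end: 0x2EFF },
  { name: "Kangxi Radicals", start: 0x2F00, end: 0x2FDF },
  { name: "CJK Symbols & Punctuation", start: 0x3000, end: 0x303F },
  { name: "Halfwidth & Fullwidth Forms", start: 0xFF00, end: 0xFFEF },
  { name: "Bopomofo", start: 0x3100, end: 0x312F },
  { name: "Bopomofo Extended", start: 0x31A0, end: 0x31BF },
];

// Name of the block containing the first code point of `ch`, or "Other".
function blockOf(ch) {
  const cp = ch.codePointAt(0);
  const block = BLOCKS.find((b) => cp >= b.start && cp <= b.end);
  return block ? block.name : "Other";
}

// Tally code points per block for a whole string.
function blockBreakdown(text) {
  const counts = {};
  for (const ch of text) {
    const name = blockOf(ch);
    counts[name] = (counts[name] || 0) + 1;
  }
  return counts;
}

// U+5C71 (ideograph), U+3001 (ideographic comma), U+FF0C (fullwidth comma), "a"
console.log(blockBreakdown("\u5C71\u3001\uFF0Ca"));
```

The ranges do not overlap, so the order of entries does not affect the result; a sorted array with binary search would only matter for much larger tables.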

Frequently Asked Questions

JavaScript strings use UTF-16 encoding. Characters in the Basic Multilingual Plane (U+0000 - U+FFFF) occupy one code unit and report a length of 1. Characters in CJK Extension B (U+20000 - U+2A6DF) and beyond require a surrogate pair (two code units), so .length reports 2 for a single character. This tool uses Array.from(text) which iterates by code point, producing the correct count of 1 per character regardless of the Unicode plane.
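
The surrogate pair for a supplementary-plane character follows from standard UTF-16 arithmetic. The helper name `toSurrogatePair` is introduced here purely for illustration:

```javascript
// Split a supplementary-plane code point (>= 0x10000) into its UTF-16 surrogate pair.
function toSurrogatePair(cp) {
  const offset = cp - 0x10000;           // 20-bit value
  const high = 0xD800 + (offset >> 10);  // top 10 bits
  const low = 0xDC00 + (offset & 0x3FF); // bottom 10 bits
  return [high, low];
}

const [high, low] = toSurrogatePair(0x20000);
console.log(high.toString(16), low.toString(16)); // prints "d840 dc00"

// The engine stores exactly these two code units for U+20000.
const extB = String.fromCodePoint(0x20000);
console.log(extB.length, Array.from(extB).length); // prints "2 1"
console.log(extB.charCodeAt(0) === high, extB.charCodeAt(1) === low); // prints "true true"
```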
The tool does not merge Simplified (简) and Traditional (繁) variants. They are separate code points within the CJK Unified Ideographs block: both are counted as CJK ideographs, but each is treated as a distinct character in the frequency analysis. For example, 国 (U+56FD, simplified) and 國 (U+570B, traditional) are counted independently.
CJK-specific punctuation marks reside in the CJK Symbols & Punctuation block (U+3000 - U+303F). These include the ideographic comma 、 (U+3001), ideographic period 。 (U+3002), and corner brackets 「」. This tool counts them in a separate "CJK Punctuation" category, distinct from both CJK ideographs and ASCII punctuation. Fullwidth punctuation like ， (U+FF0C) falls under Fullwidth Forms.
Weibo allows a maximum of 2,000 Chinese characters per post. WeChat public article titles are capped at 64 characters. SMS messages encoded in UCS-2 (triggered by any non-GSM character including CJK) are limited to 70 characters per segment. Exceeding these limits causes truncation or multi-segment billing. This tool's accurate CJK count helps verify compliance before posting.
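
A pre-posting check against these limits might look like the sketch below. It rests on assumptions the text does not confirm: that Weibo and WeChat count one unit per code point, and that the SMS segment limit is measured in 16-bit units, so a surrogate pair costs two. `LIMITS` and `checkLimits` are hypothetical names.

```javascript
// Limits quoted in the text; kept in one place so they can be adjusted.
const LIMITS = {
  weiboPost: 2000,    // characters per post
  wechatTitle: 64,    // characters per article title
  smsUcs2Segment: 70, // 16-bit units in a single UCS-2 segment
};

function checkLimits(text) {
  const codePoints = Array.from(text).length; // user-visible character count
  const codeUnits = text.length;              // UTF-16 code units
  return {
    codePoints,
    codeUnits,
    // Assumption: these platforms count one unit per code point.
    fitsWeibo: codePoints <= LIMITS.weiboPost,
    fitsWechatTitle: codePoints <= LIMITS.wechatTitle,
    // Assumption: the SMS limit counts 16-bit units, not code points.
    fitsSingleSms: codeUnits <= LIMITS.smsUcs2Segment,
  };
}

// 36 Extension B ideographs are 36 code points but 72 code units.
console.log(checkLimits("\u{20000}".repeat(36)));
```

Under these assumptions a message of rare ideographs can pass a code point check and still overflow a single SMS segment, which is why the sketch reports both counts.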
The tool also works on Japanese and Korean text. CJK Unified Ideographs is a shared block covering Chinese Hanzi, Japanese Kanji, and Korean Hanja via Unicode's Han Unification principle, and the tool counts all characters in this block as CJK ideographs. It cannot distinguish language intent: the character 山 (mountain) has the same code point U+5C71 regardless of whether it appears in Chinese, Japanese, or Korean text.
Every distinct character is included in the frequency map with a minimum count of 1. The display shows the top characters sorted by descending frequency, so single-occurrence characters (hapax legomena) fall at the bottom of the frequency list. In typical Chinese text, approximately 40-50% of distinct characters are hapax. Coverage is heavily skewed toward common characters: in large modern corpora, roughly the 1,000 most frequent characters account for about 90% of all occurrences.