About

Counting Chinese characters is not as simple as calling string.length. JavaScript's native .length property counts UTF-16 code units, not code points. A single CJK Extension B character (U+20000 - U+2A6DF) occupies two code units (a surrogate pair), so .length reports 2 instead of 1. This tool iterates by code point using Array.from(), then classifies each code point against 9 Unicode block ranges covering CJK Unified Ideographs, Extensions A and B, Compatibility Ideographs, Radicals, Symbols and Punctuation, Bopomofo, and Fullwidth Forms. Miscounts matter on character-limited platforms: Weibo enforces a limit of 2,000 Chinese characters per post, and many SMS gateways cap CJK messages at 70 characters per segment. Getting the number wrong means truncated messages or rejected submissions.
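The difference between code units and code points can be seen in a few lines. This is a minimal sketch, not the tool's source; the variable names are illustrative.

```javascript
// U+20000 is the first code point of CJK Extension B; it lies outside the BMP.
const extB = "\u{20000}";
const bmp = "\u4E2D"; // 中, inside the BMP

// .length counts UTF-16 code units, so the surrogate pair counts twice.
const unitsExtB = extB.length;
// Array.from() uses the string iterator, which steps by code point.
const pointsExtB = Array.from(extB).length;

const mixed = bmp + extB;
console.log(unitsExtB, pointsExtB);                  // 2 1
console.log(mixed.length, Array.from(mixed).length); // 3 2
```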

The frequency analysis sorts every distinct character by occurrence count. This is useful for vocabulary profiling, identifying high-frequency hanzi for spaced-repetition study decks, or verifying that a text meets HSK level constraints. The tool assumes input is plain text. It does not resolve variant forms (e.g., traditional 繁 vs. simplified 简 are counted as separate characters). Punctuation classification covers CJK-specific marks (、。「」) separately from ASCII equivalents.

Formulas

Each input character is classified by testing its Unicode code point against defined block ranges. The primary classification function operates as follows:

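$$
\text{isCJK}(cp) =
\begin{cases}
\text{TRUE} & \text{if } \texttt{0x4E00} \le cp \le \texttt{0x9FFF} \\
\text{TRUE} & \text{if } \texttt{0x3400} \le cp \le \texttt{0x4DBF} \\
\text{TRUE} & \text{if } \texttt{0x20000} \le cp \le \texttt{0x2A6DF} \\
\text{TRUE} & \text{if } \texttt{0xF900} \le cp \le \texttt{0xFAFF} \\
\text{FALSE} & \text{otherwise}
\end{cases}
$$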

Where cp is the Unicode code point obtained via codePointAt(0). Total character count uses code-point-aware iteration:

N_total = Array.from(text).length
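The classification and the code-point-aware count might be implemented along these lines. This is a sketch under the four ranges listed in the isCJK definition; CJK_RANGES and countCJK are hypothetical names, not taken from the tool.

```javascript
// Inclusive ranges from the isCJK definition.
const CJK_RANGES = [
  [0x4E00, 0x9FFF],   // CJK Unified Ideographs
  [0x3400, 0x4DBF],   // CJK Extension A
  [0x20000, 0x2A6DF], // CJK Extension B
  [0xF900, 0xFAFF],   // CJK Compatibility Ideographs
];

function isCJK(cp) {
  return CJK_RANGES.some(([lo, hi]) => cp >= lo && cp <= hi);
}

// Hypothetical helper: total code points and CJK ideographs in a string.
function countCJK(text) {
  const chars = Array.from(text); // one array element per code point
  const cjk = chars.filter((ch) => isCJK(ch.codePointAt(0))).length;
  return { total: chars.length, cjk };
}

console.log(countCJK("汉字 abc \u{20000}")); // { total: 8, cjk: 3 }
```

Because Array.from() yields whole code points, codePointAt(0) on each element returns the full value even for Extension B characters.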

Character frequency is computed as a map F : C → N, where C is the set of distinct characters and F(c) is the occurrence count. The percentage of CJK content relative to non-whitespace characters:

$$
\text{CJK\%} = \frac{N_{\text{CJK}}}{N_{\text{total}} - N_{\text{whitespace}}} \times 100
$$
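A sketch of the frequency map and the CJK percentage follows. It assumes whitespace is detected with the regular expression \s (which also matches the ideographic space U+3000) and that whitespace is left out of the frequency map; neither detail is confirmed by the tool, and frequencyMap and cjkPercent are hypothetical names.

```javascript
function isCJK(cp) {
  return (cp >= 0x4E00 && cp <= 0x9FFF) ||
         (cp >= 0x3400 && cp <= 0x4DBF) ||
         (cp >= 0x20000 && cp <= 0x2A6DF) ||
         (cp >= 0xF900 && cp <= 0xFAFF);
}

// F : C -> N as an array of [character, count], sorted by descending count.
function frequencyMap(text) {
  const freq = new Map();
  for (const ch of text) {        // for...of iterates by code point
    if (/\s/u.test(ch)) continue; // assumption: whitespace is not tallied
    freq.set(ch, (freq.get(ch) ?? 0) + 1);
  }
  return [...freq.entries()].sort((a, b) => b[1] - a[1]);
}

// CJK% = N_CJK / (N_total - N_whitespace) * 100
function cjkPercent(text) {
  const chars = Array.from(text);
  const whitespace = chars.filter((ch) => /\s/u.test(ch)).length;
  const cjk = chars.filter((ch) => isCJK(ch.codePointAt(0))).length;
  const denominator = chars.length - whitespace;
  return denominator === 0 ? 0 : (cjk / denominator) * 100;
}

const sample = "国国國 a";
console.log(frequencyMap(sample)); // [ [ '国', 2 ], [ '國', 1 ], [ 'a', 1 ] ]
console.log(cjkPercent(sample));   // 75
```

The guard on a zero denominator covers empty or whitespace-only input, where the formula is otherwise undefined.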

Reference Data

Unicode Block	Range	Characters	Description
CJK Unified Ideographs	U+4E00 - U+9FFF	20,992	Core set of common Chinese, Japanese, Korean ideographs
CJK Extension A	U+3400 - U+4DBF	6,592	Rare and historic ideographs
CJK Extension B	U+20000 - U+2A6DF	42,720	Very rare ideographs (surrogate pairs in UTF-16)
CJK Compatibility Ideographs	U+F900 - U+FAFF	512	Duplicates for round-trip compatibility with legacy encodings
CJK Radicals Supplement	U+2E80 - U+2EFF	128	CJK radical forms used in dictionaries
Kangxi Radicals	U+2F00 - U+2FDF	224	214 Kangxi radicals for indexing
CJK Symbols & Punctuation	U+3000 - U+303F	64	Ideographic comma, period, brackets, marks
Halfwidth & Fullwidth Forms	U+FF00 - U+FFEF	240	Fullwidth Latin and punctuation, halfwidth Katakana and Hangul variants
Bopomofo	U+3100 - U+312F	48	Phonetic symbols for Mandarin Chinese
Bopomofo Extended	U+31A0 - U+31BF	32	Additional Bopomofo for Minnan and Hakka
HSK Level 1	-	174 characters	Beginner Mandarin vocabulary requirement
HSK Level 4	-	1,064 characters	Intermediate Mandarin proficiency
HSK Level 6	-	2,663 characters	Advanced Mandarin proficiency
Common Usage (PRC Standard)	-	3,500 characters	Frequently used characters (Tier 1 of the 2013 Table of General Standard Chinese Characters); GB 2312 Level 1 contains 3,755
Newspaper Literacy	-	~3,000 characters	Sufficient to read ~99% of modern Chinese text
Weibo Post Limit	-	2,000 characters	Maximum characters per Weibo post
WeChat Article Title	-	64 characters	Maximum title length for WeChat public articles
SMS (CJK)	-	70 characters	Single SMS segment limit for UCS-2 encoded messages

Frequently Asked Questions

JavaScript strings use UTF-16 encoding. Characters in the Basic Multilingual Plane (U+0000 - U+FFFF) occupy one code unit and report a length of 1. Characters in CJK Extension B (U+20000 - U+2A6DF) and beyond require a surrogate pair (two code units), so .length reports 2 for a single character. This tool uses Array.from(text), which iterates by code point and produces the correct count of 1 per character regardless of the Unicode plane.

Simplified and Traditional forms are not merged. Simplified (简) and Traditional (繁) variants are separate code points within the CJK Unified Ideographs block. Both are counted as CJK ideographs, but each is treated as a distinct character in the frequency analysis. For example, 国 (U+56FD, simplified) and 國 (U+570B, traditional) are counted independently.

CJK-specific punctuation marks reside in the CJK Symbols & Punctuation block (U+3000 - U+303F). These include the ideographic comma 、 (U+3001), the ideographic period 。 (U+3002), and the corner brackets 「」 (U+300C, U+300D). This tool counts them in a separate "CJK Punctuation" category, distinct from both CJK ideographs and ASCII punctuation. Fullwidth punctuation such as ， (U+FF0C) falls under the Halfwidth & Fullwidth Forms block (U+FF00 - U+FFEF) instead.
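Sorting characters into categories by block might look like the sketch below. Only the "CJK Punctuation" category name comes from this page; the other category names, the BLOCKS table, and the function names are assumptions made for illustration, and only three categories are shown.

```javascript
// Block ranges from the reference table; category names other than
// "CJK Punctuation" are illustrative.
const BLOCKS = [
  { category: "CJK Ideograph", lo: 0x4E00, hi: 0x9FFF },
  { category: "CJK Ideograph", lo: 0x3400, hi: 0x4DBF },
  { category: "CJK Ideograph", lo: 0x20000, hi: 0x2A6DF },
  { category: "CJK Ideograph", lo: 0xF900, hi: 0xFAFF },
  { category: "CJK Punctuation", lo: 0x3000, hi: 0x303F },
  { category: "Fullwidth Forms", lo: 0xFF00, hi: 0xFFEF },
];

function classify(ch) {
  const cp = ch.codePointAt(0);
  const block = BLOCKS.find((b) => cp >= b.lo && cp <= b.hi);
  return block ? block.category : "Other";
}

function countByCategory(text) {
  const counts = {};
  for (const ch of text) {
    const category = classify(ch);
    counts[category] = (counts[category] ?? 0) + 1;
  }
  return counts;
}

// \uFF0C is the fullwidth comma; \u3002 is the ideographic period;
// \u300C and \u300D are the corner brackets.
const counts = countByCategory("你好\uFF0C世界\u3002\u300CA\u300D");
console.log(counts);
// { 'CJK Ideograph': 4, 'Fullwidth Forms': 1, 'CJK Punctuation': 3, Other: 1 }
```

Note that the ideographic space U+3000 also lies in the U+3000 - U+303F range, so a range-only test like this one would file it under punctuation unless whitespace is handled first.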

Weibo allows a maximum of 2,000 Chinese characters per post. WeChat public article titles are capped at 64 characters. SMS messages encoded in UCS-2 (triggered by any non-GSM character including CJK) are limited to 70 characters per segment. Exceeding these limits causes truncation or multi-segment billing. This tool's accurate CJK count helps verify compliance before posting.
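A pre-posting check against these limits could be sketched as follows. The limit values are the ones stated on this page and may change; LIMITS and withinLimit are hypothetical names. The byCodeUnit flag reflects an assumption worth verifying with your gateway: UCS-2 SMS segments are measured in 16-bit units, so a character outside the BMP may consume two of the 70.

```javascript
// Limits as quoted on this page; platforms may revise them.
const LIMITS = { weiboPost: 2000, wechatTitle: 64, smsSegment: 70 };

// Counts by code point by default, or by UTF-16 code unit when requested.
function withinLimit(text, limit, byCodeUnit = false) {
  const count = byCodeUnit ? text.length : Array.from(text).length;
  return { count, limit, ok: count <= limit };
}

const title = "字".repeat(65);
console.log(withinLimit(title, LIMITS.wechatTitle));
// { count: 65, limit: 64, ok: false }

const sms = "\u{20000}".repeat(36); // 36 Extension B characters
console.log(withinLimit(sms, LIMITS.smsSegment));       // count 36, ok: true
console.log(withinLimit(sms, LIMITS.smsSegment, true)); // count 72, ok: false
```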

The tool also works on Japanese and Korean text. CJK Unified Ideographs is a shared block covering Chinese Hanzi, Japanese Kanji, and Korean Hanja under Unicode's Han unification principle, and the tool counts every character in this block as a CJK ideograph. It cannot distinguish language intent: the character 山 (mountain) has the same code point, U+5C71, whether it appears in Chinese, Japanese, or Korean text.

Every distinct character is included in the frequency map with a minimum count of 1. The display lists characters in descending order of frequency, so single-occurrence characters (hapax legomena) appear at the bottom of the list. In typical Chinese text, approximately 40-50% of distinct characters are hapax, although the share depends on text length. In large corpora, roughly the 1,000 most frequent characters account for about 90% of all occurrences.
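To measure these proportions on your own text rather than rely on rules of thumb, the hapax share and top-N coverage can be computed directly. This is a sketch; frequencyStats is a hypothetical helper, and it assumes whitespace is excluded from the tally.

```javascript
function frequencyStats(text, topN) {
  const freq = new Map();
  let total = 0;
  for (const ch of text) {        // iterate by code point
    if (/\s/u.test(ch)) continue; // assumption: whitespace is not tallied
    freq.set(ch, (freq.get(ch) ?? 0) + 1);
    total += 1;
  }
  const counts = [...freq.values()].sort((a, b) => b - a);
  const hapax = counts.filter((n) => n === 1).length;
  const covered = counts.slice(0, topN).reduce((sum, n) => sum + n, 0);
  return {
    distinct: counts.length,
    // Fraction of distinct characters that occur exactly once.
    hapaxShare: counts.length ? hapax / counts.length : 0,
    // Fraction of all occurrences covered by the topN most frequent characters.
    topNCoverage: total ? covered / total : 0,
  };
}

// 的 x3, 是 x2, 我 x1: one hapax among three distinct characters,
// and the top two characters cover 5 of 6 occurrences.
const stats = frequencyStats("的的的是是我", 2);
console.log(stats.distinct, stats.hapaxShare, stats.topNCoverage);
```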