Chinese Character Counter
Count Chinese characters (CJK ideographs) and punctuation, and analyze character frequency in any text. Supports all Unicode CJK blocks.
About
Counting Chinese characters is not as simple as reading a string's .length. JavaScript's native .length property counts UTF-16 code units, not code points. A single CJK Extension B character (U+20000 - U+2A6DF) occupies two code units as a surrogate pair, so .length reports 2 instead of 1. This tool iterates by code point using Array.from(), then classifies each code point against 9 Unicode block ranges covering CJK Unified Ideographs, Extensions A and B, Compatibility Ideographs, Radicals, Symbols, Punctuation, Bopomofo, and Fullwidth Forms. Miscounts matter on character-limited platforms: Weibo allows 2,000 characters per post, and many SMS gateways cap CJK messages at 70 characters per segment. Getting the number wrong means truncated messages or rejected submissions.
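The difference between code units and code points is easy to see in a few lines of JavaScript. This is a minimal sketch rather than the tool's actual source; the helper name countCodePoints is made up for illustration.

```javascript
// Count code points rather than UTF-16 code units.
// countCodePoints is a hypothetical helper name, not part of any library.
function countCodePoints(text) {
  // Array.from uses the string iterator, which yields one item per code point.
  return Array.from(text).length;
}

// Two BMP ideographs followed by U+20000, the first CJK Extension B code point.
const sample = "你好\u{20000}";

console.log(sample.length);           // 4: U+20000 is stored as a surrogate pair
console.log(countCodePoints(sample)); // 3: one per code point
```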
The frequency analysis sorts every distinct character by occurrence count. This is useful for vocabulary profiling, identifying high-frequency hanzi for spaced-repetition study decks, or verifying that a text meets HSK level constraints. The tool assumes input is plain text. It does not resolve variant forms (e.g., traditional 繁 vs. simplified 简 are counted as separate characters). Punctuation classification covers CJK-specific marks (、 。 「 」) separately from ASCII equivalents.
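A frequency table of this kind might be built as follows. This is a sketch under two assumptions that the page does not confirm: whitespace is left out of the profile, and ties keep first-seen order. The function name characterFrequency is hypothetical.

```javascript
// Build a list of [character, count] pairs, most frequent first.
// characterFrequency is a hypothetical name used for illustration only.
function characterFrequency(text) {
  const counts = new Map();
  for (const ch of text) {          // for...of iterates by code point
    if (/\s/u.test(ch)) continue;   // assumption: whitespace is not profiled
    counts.set(ch, (counts.get(ch) ?? 0) + 1);
  }
  // Sort by descending count. Array.prototype.sort is stable,
  // so characters with equal counts keep their first-seen order.
  return [...counts.entries()].sort((a, b) => b[1] - a[1]);
}

const table = characterFrequency("好好学习，天天向上");
console.log(table.slice(0, 2)); // [ [ '好', 2 ], [ '天', 2 ] ]
```

Because no variant resolution is applied, a traditional form and its simplified counterpart would appear as two separate entries in this table.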
Formulas
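Each input character is classified by testing its Unicode code point against defined block ranges. With cp denoting the code point obtained via codePointAt(0), a character is assigned to a block when cp lies between that block's first and last code point, inclusive; the ranges are listed under Reference Data.
The total character count uses code-point-aware iteration, so a character encoded as a surrogate pair is counted once.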
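The range test can be sketched in JavaScript as shown here. The ranges are copied from the Reference Data table; the names BLOCKS and classify, and the choice to return null for unmatched characters, are assumptions made for illustration and may not match how the tool groups its categories.

```javascript
// Block ranges taken from the Reference Data table: [name, first, last].
// BLOCKS and classify are hypothetical names used for illustration only.
const BLOCKS = [
  ["CJK Unified Ideographs",       0x4E00,  0x9FFF],
  ["CJK Extension A",              0x3400,  0x4DBF],
  ["CJK Extension B",              0x20000, 0x2A6DF],
  ["CJK Compatibility Ideographs", 0xF900,  0xFAFF],
  ["CJK Radicals Supplement",      0x2E80,  0x2EFF],
  ["Kangxi Radicals",              0x2F00,  0x2FDF],
  ["CJK Symbols & Punctuation",    0x3000,  0x303F],
  ["Halfwidth & Fullwidth Forms",  0xFF00,  0xFFEF],
  ["Bopomofo",                     0x3100,  0x312F],
  ["Bopomofo Extended",            0x31A0,  0x31BF],
];

// Return the name of the block containing the character,
// or null when it falls outside every listed range.
function classify(ch) {
  const cp = ch.codePointAt(0);
  const hit = BLOCKS.find(([, first, last]) => cp >= first && cp <= last);
  return hit ? hit[0] : null;
}

console.log(classify("中"));        // CJK Unified Ideographs
console.log(classify("\u{20000}")); // CJK Extension B
console.log(classify("。"));        // CJK Symbols & Punctuation
console.log(classify("A"));         // null
```

Note that codePointAt(0) only returns the full code point when the string was split by code point first; indexing a string with text[i] would hand classify half of a surrogate pair.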
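Character frequency is computed as a map F : C → N, where C is the set of distinct characters and F(c) is the occurrence count of c. The percentage of CJK content is taken relative to non-whitespace characters: the number of CJK characters divided by the number of non-whitespace characters, multiplied by 100.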
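The percentage can be computed along these lines. This sketch assumes that "CJK content" means the four ideograph blocks from the Reference Data table (punctuation, radicals, and Bopomofo excluded) and that empty input yields 0; both are guesses, and the names IDEOGRAPH_RANGES, isIdeograph, and cjkPercentage are hypothetical.

```javascript
// Assumption: only the ideograph blocks count as "CJK content".
const IDEOGRAPH_RANGES = [
  [0x4E00, 0x9FFF],   // CJK Unified Ideographs
  [0x3400, 0x4DBF],   // CJK Extension A
  [0x20000, 0x2A6DF], // CJK Extension B
  [0xF900, 0xFAFF],   // CJK Compatibility Ideographs
];

function isIdeograph(ch) {
  const cp = ch.codePointAt(0);
  return IDEOGRAPH_RANGES.some(([first, last]) => cp >= first && cp <= last);
}

// Percentage of ideographs among non-whitespace characters.
// Returns 0 for input with no non-whitespace characters (an assumed convention).
function cjkPercentage(text) {
  const chars = Array.from(text).filter((ch) => !/\s/u.test(ch));
  if (chars.length === 0) return 0;
  const cjk = chars.filter(isIdeograph).length;
  return (cjk / chars.length) * 100;
}

console.log(cjkPercentage("中文 ab")); // 50
```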
Reference Data
| Block or Benchmark | Code Point Range | Size | Description |
|---|---|---|---|
| CJK Unified Ideographs | U+4E00 - U+9FFF | 20,992 | Core set of common Chinese, Japanese, Korean ideographs |
| CJK Extension A | U+3400 - U+4DBF | 6,592 | Rare and historic ideographs |
| CJK Extension B | U+20000 - U+2A6DF | 42,720 | Very rare ideographs (surrogate pairs in UTF-16) |
| CJK Compatibility Ideographs | U+F900 - U+FAFF | 512 | Duplicates for round-trip compatibility with legacy encodings |
| CJK Radicals Supplement | U+2E80 - U+2EFF | 128 | CJK radical forms used in dictionaries |
| Kangxi Radicals | U+2F00 - U+2FDF | 224 | 214 Kangxi radicals for indexing |
| CJK Symbols & Punctuation | U+3000 - U+303F | 64 | Ideographic comma, period, brackets, marks |
| Halfwidth & Fullwidth Forms | U+FF00 - U+FFEF | 240 | Fullwidth Latin, Katakana, Hangul variants |
| Bopomofo | U+3100 - U+312F | 48 | Phonetic symbols for Mandarin Chinese |
| Bopomofo Extended | U+31A0 - U+31BF | 32 | Additional Bopomofo for Minnan and Hakka |
| HSK Level 1 | - | 174 characters | Beginner Mandarin vocabulary requirement |
| HSK Level 4 | - | 1,064 characters | Intermediate Mandarin proficiency |
| HSK Level 6 | - | 2,663 characters | Advanced Mandarin proficiency |
| Common Usage (PRC Standard) | - | 3,500 characters | Level 1 (frequently used characters) of the Table of General Standard Chinese Characters |
| Newspaper Literacy | - | ~3,000 characters | Sufficient to read ~99% of modern Chinese text |
| Weibo Post Limit | - | 2,000 characters | Maximum characters per Weibo post |
| WeChat Article Title | - | 64 characters | Maximum title length for WeChat public articles |
| SMS (CJK) | - | 70 characters | Single SMS segment limit for UCS-2 encoded messages |
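
The platform limits in the table can be checked against a code-point count as in this sketch. It assumes every limit is measured in code points, which may not hold on every platform (a UCS-2 SMS segment, for instance, is measured in 16-bit units, so a supplementary-plane character uses two of the 70). The names LIMITS and checkLimit are hypothetical.

```javascript
// Limits copied from the Reference Data table; the keys are illustrative names.
const LIMITS = { weiboPost: 2000, wechatTitle: 64, smsSegment: 70 };

// Compare a text's code-point count against one of the limits above.
function checkLimit(text, limitName) {
  const limit = LIMITS[limitName];
  if (limit === undefined) throw new RangeError(`unknown limit: ${limitName}`);
  const count = Array.from(text).length; // code points, not UTF-16 code units
  return { count, limit, fits: count <= limit, overBy: Math.max(0, count - limit) };
}

console.log(checkLimit("中".repeat(71), "smsSegment"));
// { count: 71, limit: 70, fits: false, overBy: 1 }
```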