Category Text Analysis

About

A JavaScript string's .length property returns the number of UTF-16 code units, not visible characters. A single emoji like 👨‍👩‍👧‍👦 reports 11 from .length but renders as 1 grapheme cluster. Database column limits, SMS segments (160 GSM-7 characters or 70 UCS-2 characters), and API payload caps each use a different definition of length. Confusing code units with bytes or graphemes causes silent data truncation, broken emoji rendering, and failed validation. This tool computes five string metrics at once: UTF-16 code units, Unicode code points, grapheme clusters, UTF-8 byte length, and word/line counts.

Grapheme segmentation uses the Intl.Segmenter API (UAX #29) when available, which correctly handles zero-width joiners, combining marks, and regional indicator sequences. The UTF-8 byte count is derived via the Blob constructor, matching what a server receives when the text is sent as UTF-8 over HTTP. Note: results assume well-formed input. Lone surrogates (unpaired UTF-16 halves) have no UTF-8 encoding; the Blob constructor replaces each one with U+FFFD, so it counts as 3 bytes rather than reflecting the original code unit.
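The well-formedness caveat can be checked before measuring. The sketch below is illustrative only: hasLoneSurrogate is a hypothetical helper name, and TextEncoder is swapped in for the Blob constructor because it performs the same UTF-8 encoding and also runs outside the browser.

```javascript
// Hypothetical helper: true if the string contains an unpaired surrogate.
function hasLoneSurrogate(s) {
  // String iteration yields whole code points. A surrogate that is not part
  // of a valid pair comes through on its own, in the range U+D800..U+DFFF.
  for (const ch of s) {
    const cp = ch.codePointAt(0);
    if (cp >= 0xd800 && cp <= 0xdfff) return true;
  }
  return false;
}

const encoder = new TextEncoder();
const wellFormed = "\u{1F600}"; // a valid surrogate pair (grinning face)
const broken = "\uD83D";        // high surrogate with no partner

console.log(hasLoneSurrogate(wellFormed), encoder.encode(wellFormed).length); // false 4
console.log(hasLoneSurrogate(broken), encoder.encode(broken).length);         // true 3
```

With TextEncoder, the unpaired surrogate is encoded as U+FFFD, which is why the second line reports 3 bytes.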

Formulas

Five distinct length metrics are computed from an input string S:

L_utf16 = S.length

Returns the number of UTF-16 code units. Surrogate pairs (code points above U+FFFF) count as 2.

L_cp = [...S].length

The spread operator iterates by Unicode code point, correctly counting astral plane characters (emoji, rare CJK) as 1 each.

L_grapheme = [...new Intl.Segmenter(undefined, {granularity: "grapheme"}).segment(S)].length

UAX #29 segmentation. A grapheme cluster is the smallest user-perceived character unit. Emoji sequences joined with Zero-Width Joiners (U+200D) and characters with combining marks each count as 1.

L_bytes = new Blob([S]).size

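UTF-8 encoded byte length. ASCII characters (U+0000 to U+007F) use 1 byte. Accented Latin, Greek, Cyrillic, Hebrew, and Arabic letters (U+0080 to U+07FF) use 2. The rest of the Basic Multilingual Plane, including CJK and symbols such as €, uses 3. Astral characters, including most emoji, use 4.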

L_words = S.split(/\s+/).filter(Boolean).length

Where L_utf16 = UTF-16 code unit count, L_cp = Unicode code point count, L_grapheme = grapheme cluster count (visual characters), L_bytes = UTF-8 byte size, L_words = whitespace-delimited token count.
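The five formulas can be combined into one function. This is a minimal sketch, not the tool's actual source: measure and its property names are hypothetical, and TextEncoder stands in for new Blob([S]).size so the snippet also runs outside the browser (both produce the UTF-8 byte length).

```javascript
// Hypothetical helper that applies the five formulas to one string.
function measure(s) {
  const segmenter = new Intl.Segmenter(undefined, { granularity: "grapheme" });
  return {
    utf16: s.length,                                   // L_utf16
    codePoints: [...s].length,                         // L_cp
    graphemes: [...segmenter.segment(s)].length,       // L_grapheme
    bytes: new TextEncoder().encode(s).length,         // L_bytes
    words: s.split(/\s+/).filter(Boolean).length,      // L_words
  };
}

// "héllo wörld" with precomposed é and ö, followed by U+1F600.
console.log(measure("h\u00E9llo w\u00F6rld \u{1F600}"));
// { utf16: 14, codePoints: 13, graphemes: 13, bytes: 18, words: 3 }
```

The emoji accounts for the gap between 14 code units and 13 code points, and the two accented letters plus the emoji account for the 18 bytes.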

Reference Data

Character / Sequence	Visual	UTF-16 Units	Code Points	Graphemes	UTF-8 Bytes
Latin letter A	A	1	1	1	1
Euro sign	€	1	1	1	3
CJK ideograph (water)	水	1	1	1	3
Emoji (grinning face)	😀	2	1	1	4
Emoji (flag US)	🇺🇸	4	2	1	8
Family emoji (ZWJ)	👨‍👩‍👧‍👦	11	7	1	25
e + combining acute	é (2 CP)	2	2	1	3
Precomposed é	é (1 CP)	1	1	1	2
Hindi syllable क्षि	क्षि	4	4	1	12
Korean syllable 한	한	1	1	1	3
Musical symbol 𝄞	𝄞	2	1	1	4
Newline (LF)	\n	1	1	1	1
Tab	\t	1	1	1	1
Empty string	(none)	0	0	0	0
Space	(space)	1	1	1	1
Skin-toned emoji 👋🏽	👋🏽	4	2	1	8
Keycap sequence 1️⃣	1️⃣	3	3	1	7
Zalgo text (10 combiners)	ẕ̴̢̛̫̤̈́̑̋̃̈	11	11	1	25
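Several rows of the table can be reproduced by building each sequence from explicit code points, which makes its composition visible. The helper name row is hypothetical, TextEncoder is used in place of Blob, and the grapheme column assumes a runtime whose Intl.Segmenter recognizes these sequences.

```javascript
const segmenter = new Intl.Segmenter(undefined, { granularity: "grapheme" });
const encoder = new TextEncoder();

// Returns [UTF-16 units, code points, graphemes, UTF-8 bytes]
// for the string made of the given code points.
function row(...codePoints) {
  const s = String.fromCodePoint(...codePoints);
  return [
    s.length,
    [...s].length,
    [...segmenter.segment(s)].length,
    encoder.encode(s).length,
  ];
}

// Flag US: two regional indicators.
console.log(row(0x1f1fa, 0x1f1f8)); // [ 4, 2, 1, 8 ]
// Family: four person emoji joined by three U+200D.
console.log(row(0x1f468, 0x200d, 0x1f469, 0x200d, 0x1f467, 0x200d, 0x1f466)); // [ 11, 7, 1, 25 ]
// Waving hand + medium skin tone modifier.
console.log(row(0x1f44b, 0x1f3fd)); // [ 4, 2, 1, 8 ]
// Keycap: digit 1 + variation selector-16 + combining enclosing keycap.
console.log(row(0x31, 0xfe0f, 0x20e3)); // [ 3, 3, 1, 7 ]
```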

Frequently Asked Questions

JavaScript strings use UTF-16 encoding internally. Characters with code points above U+FFFF (that is, outside the Basic Multilingual Plane) require a surrogate pair: two 16-bit code units. The .length property counts code units, not code points. For example, 😀 is U+1F600, which encodes as the pair U+D83D U+DE00, giving a .length of 2. Use [...str].length or the code point count from this tool to get the correct count of 1.
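The surrogate pair arithmetic can be worked through in a few lines. The function name toSurrogatePair is hypothetical; the calculation itself is the standard UTF-16 encoding rule for code points above U+FFFF.

```javascript
// Splits an astral code point (above U+FFFF) into its UTF-16 surrogate pair.
function toSurrogatePair(codePoint) {
  const offset = codePoint - 0x10000;      // 20-bit value
  const high = 0xd800 + (offset >> 10);    // top 10 bits
  const low = 0xdc00 + (offset & 0x3ff);   // bottom 10 bits
  return [high, low];
}

const s = "\u{1F600}";
console.log(toSurrogatePair(0x1f600).map((u) => u.toString(16))); // [ 'd83d', 'de00' ]
console.log(s.length);                       // 2 (code units)
console.log(s.charCodeAt(0).toString(16));   // d83d
console.log(s.charCodeAt(1).toString(16));   // de00
console.log(s.codePointAt(0).toString(16));  // 1f600
console.log([...s].length);                  // 1 (code point)
```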

A code point is a single entry in the Unicode table (e.g., U+0041 for 'A'). A grapheme cluster is what a human perceives as one character. The family emoji 👨‍👩‍👧‍👦 consists of 7 code points (4 person emoji + 3 zero-width joiners) but renders as 1 grapheme cluster. Similarly, "é" can be composed of 2 code points (e + combining acute accent) but is 1 grapheme. The Intl.Segmenter API implements UAX #29 to segment correctly.
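To see which code points make up each grapheme cluster, the segments can be listed one by one. This is a sketch under the assumption that Intl.Segmenter is available; describeGraphemes is a hypothetical helper name.

```javascript
// Hypothetical helper: lists each grapheme cluster with its code points.
function describeGraphemes(s) {
  const segmenter = new Intl.Segmenter(undefined, { granularity: "grapheme" });
  return [...segmenter.segment(s)].map(({ segment }) => ({
    grapheme: segment,
    codePoints: [...segment].map(
      (ch) => "U+" + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, "0")
    ),
  }));
}

// e + combining acute accent: one grapheme, two code points.
console.log(describeGraphemes("e\u0301")[0].codePoints); // [ 'U+0065', 'U+0301' ]

// Family emoji: one grapheme, seven code points.
const family = "\u{1F468}\u200D\u{1F469}\u200D\u{1F467}\u200D\u{1F466}";
console.log(describeGraphemes(family)[0].codePoints.length); // 7
```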

Which length applies to a database field depends on the column type. MySQL's VARCHAR(n) in utf8mb4 counts characters (code points). PostgreSQL's VARCHAR(n) also counts characters. However, MySQL's VARBINARY(n) counts bytes. For HTTP Content-Length headers and file size calculations, use the UTF-8 byte count. For SMS messages, a segment holds 160 characters when every character fits the GSM-7 alphabet; if any character requires UCS-2 encoding, the limit drops to 70 UTF-16 code units.
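When a limit is expressed in bytes, as with VARBINARY(n), cutting the string at an arbitrary byte or code unit can split an emoji in half. One possible approach, sketched here with a hypothetical truncateToBytes helper and assuming Intl.Segmenter is available, is to add whole grapheme clusters until the next one would exceed the budget.

```javascript
// Hypothetical helper: keeps as many whole grapheme clusters as fit
// within maxBytes of UTF-8, never splitting a cluster.
function truncateToBytes(s, maxBytes) {
  const segmenter = new Intl.Segmenter(undefined, { granularity: "grapheme" });
  const encoder = new TextEncoder();
  let out = "";
  let used = 0;
  for (const { segment } of segmenter.segment(s)) {
    const size = encoder.encode(segment).length;
    if (used + size > maxBytes) break;
    out += segment;
    used += size;
  }
  return out;
}

const text = "a\u{1F600}b"; // 1 + 4 + 1 = 6 bytes in UTF-8
console.log(truncateToBytes(text, 4) === "a");          // true: the emoji does not fit
console.log(truncateToBytes(text, 5) === "a\u{1F600}"); // true
console.log(truncateToBytes(text, 6) === text);         // true
```

The same loop works for a code point limit such as VARCHAR(n) by replacing the byte size with [...segment].length.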

Intl.Segmenter follows Unicode Standard Annex #29, but the supported Unicode version varies by browser engine. Newer emoji sequences (e.g., added in Unicode 15.1) may not be recognized as single grapheme clusters by older browsers, causing them to be split into multiple segments. This tool falls back to a regex-based approximation if Intl.Segmenter is unavailable, which handles most ZWJ sequences and regional indicators but may miss edge cases in the newest emoji.
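The feature detection and fallback might look roughly like the sketch below. The regular expression is an assumption written for illustration, not the pattern this tool actually uses: it groups a base character with following combining marks, ZWJ-joined characters, and skin tone modifiers, and pairs up regional indicators. It does not handle Hangul jamo sequences, Indic conjuncts, or CRLF.

```javascript
// Rough approximation of grapheme clusters (illustrative only).
const APPROX_GRAPHEME =
  /\p{Regional_Indicator}\p{Regional_Indicator}|\P{M}(?:\p{M}|\u200D\P{M}|[\u{1F3FB}-\u{1F3FF}])*|\p{M}+/gu;

// Hypothetical fallback: counts matches of the approximation above.
function countGraphemesFallback(s) {
  const matches = s.match(APPROX_GRAPHEME);
  return matches ? matches.length : 0;
}

// Prefers Intl.Segmenter and falls back to the regex when it is missing.
function countGraphemes(s) {
  if (typeof Intl !== "undefined" && typeof Intl.Segmenter === "function") {
    const segmenter = new Intl.Segmenter(undefined, { granularity: "grapheme" });
    return [...segmenter.segment(s)].length;
  }
  return countGraphemesFallback(s);
}

const family = "\u{1F468}\u200D\u{1F469}\u200D\u{1F467}\u200D\u{1F466}";
console.log(countGraphemes(family));                         // 1
console.log(countGraphemesFallback(family));                 // 1
console.log(countGraphemesFallback("\u{1F1FA}\u{1F1F8}"));   // 1
console.log(countGraphemesFallback("e\u0301"));              // 1
```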

Unicode defines multiple ways to encode the same visual character. "é" in NFC (composed) is 1 code point U+00E9. In NFD (decomposed) it is 2 code points: U+0065 (e) + U+0301 (combining acute). This means the same visual text can have different .length values depending on normalization form. This tool measures the string exactly as provided. Use str.normalize('NFC') before measuring if you need canonical comparison. The grapheme cluster count is the same for the NFC and NFD forms of a string.
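The effect of normalization on each metric can be seen with the two encodings of "é". The variable names are illustrative; String.prototype.normalize is the standard method mentioned above.

```javascript
const composed = "\u00E9";    // é as one code point (NFC)
const decomposed = "e\u0301"; // e + combining acute accent (NFD)

console.log(composed === decomposed);                  // false
console.log(composed.length, decomposed.length);       // 1 2
console.log(decomposed.normalize("NFC") === composed); // true
console.log(composed.normalize("NFD").length);         // 2

// The grapheme count is 1 in both forms.
const segmenter = new Intl.Segmenter(undefined, { granularity: "grapheme" });
const graphemes = (s) => [...segmenter.segment(s)].length;
console.log(graphemes(composed), graphemes(decomposed)); // 1 1
```

Normalizing both sides to the same form before comparing or storing avoids treating visually identical strings as different.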

Both LF and CRLF line endings are handled. Lines are counted by splitting on the newline character \n. A CRLF sequence (\r\n) contains one \n, so it counts as one line break. Note that \r\n contributes 2 to UTF-16 length and 2 bytes in UTF-8, while a bare \n contributes 1 to each. The word count treats both \r and \n as whitespace, so word counts are unaffected by line ending style.
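The difference between the two line ending styles can be checked directly. In this sketch, countLines and countWords are hypothetical helpers; countLines assumes the line count is the number of pieces produced by splitting on \n, and how the tool reports an empty input is not stated here.

```javascript
// Assumed rule: the number of lines is the number of pieces after splitting on "\n".
function countLines(s) {
  return s.split("\n").length;
}

// Whitespace-delimited tokens; \r and \n both match \s.
function countWords(s) {
  return s.split(/\s+/).filter(Boolean).length;
}

const lf = "one two\nthree";
const crlf = "one two\r\nthree";

console.log(countLines(lf), countLines(crlf)); // 2 2
console.log(countWords(lf), countWords(crlf)); // 3 3
console.log(lf.length, crlf.length);           // 13 14
```

Only the UTF-16 length (and the byte count) changes with the line ending style; the line and word counts stay the same.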