About

Letter frequency analysis is foundational to cryptanalysis, computational linguistics, and information theory. Against monoalphabetic substitution ciphers it is the primary attack: if you cannot identify that "e" occurs at roughly 12.7% in English text, decryption becomes guesswork. This tool computes absolute counts and relative frequencies for every letter in your input, normalized against the total number of alphabetic characters. Results can be sorted by rank or alphabetically and are visualized as a proportional bar chart.

Accuracy depends on sample size. Texts shorter than 200 alphabetic characters produce unstable distributions that deviate significantly from language norms. The tool applies no encoding transformation; it processes the raw UTF-16 code units delivered by the browser. For case-insensitive analysis, keep the fold option enabled so that uppercase and lowercase merge into a single bucket. Pro tip: compare your output against the known language profiles (English, French, German, and others) in the reference table below to identify the source language of an unknown text.

Formulas

The relative frequency of each letter is computed by dividing its count by the total number of alphabetic characters in the sample.

$$f_i = \frac{c_i}{N} \times 100$$

Where f_i is the percentage frequency of letter i, c_i is the raw count of occurrences of letter i, and N is the total count of all alphabetic characters in the input: $N = \sum_{i=1}^{26} c_i$.
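The calculation above can be sketched in a few lines of Python. This is an illustrative sketch, not the tool's own code: the function name `letter_frequencies` and its `fold` parameter are hypothetical, and it assumes only the ASCII letters a-z count as alphabetic.

```python
from collections import Counter
import string


def letter_frequencies(text: str, fold: bool = True) -> dict[str, float]:
    """Return the percentage frequency f_i for each ASCII letter found in text."""
    # Keep only ASCII letters so that N counts alphabetic characters alone.
    letters = [ch for ch in text if ch in string.ascii_letters]
    if fold:
        # Merge uppercase and lowercase into a single bucket.
        letters = [ch.lower() for ch in letters]
    n = len(letters)
    if n == 0:
        return {}  # no letters, so no frequencies to report
    counts = Counter(letters)
    # f_i = c_i / N * 100 (multiplying first keeps simple ratios exact)
    return {letter: count * 100 / n for letter, count in counts.items()}


freqs = letter_frequencies("Hello, World!")
for letter, pct in sorted(freqs.items(), key=lambda item: -item[1]):
    print(f"{letter}: {pct:.1f}%")
```

"Hello, World!" contains 10 letters, three of them "l", so "l" comes out at 30%. Punctuation and whitespace never enter the denominator.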

The Index of Coincidence (IC) measures how likely two randomly chosen letters from the text are identical. For monolingual English text, the expected IC ≈ 0.0667. Random uniform text yields IC ≈ 0.0385.

$$IC = \frac{\sum_{i=1}^{26} c_i (c_i - 1)}{N (N - 1)}$$

Where c_i is the count for the i-th letter and N is total letter count. An IC significantly below 0.06 suggests polyalphabetic encryption or a non-natural text source.
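As a rough illustration of the IC formula, the following Python sketch computes it for an arbitrary string. The function name `index_of_coincidence` is hypothetical, and the sketch assumes case folding and ASCII letters only.

```python
from collections import Counter
import string


def index_of_coincidence(text: str) -> float:
    """Probability that two letters drawn without replacement are identical."""
    # Fold case and discard everything that is not an ASCII letter.
    letters = [ch.lower() for ch in text if ch in string.ascii_letters]
    n = len(letters)
    if n < 2:
        raise ValueError("at least two letters are needed to compute the IC")
    counts = Counter(letters)
    # Sum of c_i * (c_i - 1) over all letters, divided by N * (N - 1).
    return sum(c * (c - 1) for c in counts.values()) / (n * (n - 1))


print(index_of_coincidence("aabb"))  # 4 / 12
```

A text made of a single repeated letter gives an IC of 1, and a text in which no letter repeats gives 0. Natural-language samples fall between those extremes.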

Reference Data

Letter	English %	French %	German %	Spanish %	Italian %	Portuguese %
a	8.167	7.636	6.516	12.53	11.74	14.63
b	1.492	0.901	1.886	1.42	0.92	1.04
c	2.782	3.260	2.732	4.68	4.50	3.88
d	4.253	3.669	5.076	5.86	3.73	4.99
e	12.702	14.715	16.396	13.68	11.79	12.57
f	2.228	1.066	1.656	0.69	0.95	1.02
g	2.015	0.866	3.009	1.01	1.64	1.30
h	6.094	0.737	4.577	0.70	1.54	1.28
i	6.966	7.529	6.550	6.25	11.28	6.18
j	0.153	0.545	0.268	0.44	0.00	0.40
k	0.772	0.049	1.417	0.01	0.00	0.02
l	4.025	5.456	3.437	4.97	6.51	2.78
m	2.406	2.968	2.534	3.15	2.51	4.74
n	6.749	7.095	9.776	6.71	6.88	5.05
o	7.507	5.378	2.594	8.68	9.83	10.73
p	1.929	3.021	0.670	2.51	3.05	2.52
q	0.095	1.362	0.018	0.88	0.51	1.20
r	5.987	6.553	7.003	6.87	6.37	6.53
s	6.327	7.948	7.270	7.98	4.98	7.81
t	9.056	7.244	6.154	4.63	5.62	4.34
u	2.758	6.311	4.166	3.93	3.01	4.63
v	0.978	1.628	0.846	0.90	2.10	1.67
w	2.360	0.114	1.921	0.02	0.00	0.01
x	0.150	0.387	0.034	0.22	0.00	0.21
y	1.974	0.308	0.039	0.90	0.00	0.01
z	0.074	0.136	1.134	0.52	0.49	0.47

Frequently Asked Questions

A minimum of 200 alphabetic characters produces a rough profile. For statistically stable results that converge within ±0.5% of true language frequencies, you need approximately 2,000 to 5,000 characters. Shorter texts exhibit high variance: a 50-character sample could show "z" at 4% purely by chance.
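One way to get a feel for these thresholds is the standard error of a proportion, sqrt(p(1 − p)/N). The sketch below assumes each letter is an independent draw (a simplification, since real text has strong letter-to-letter dependencies), and the function name `standard_error_pct` is hypothetical.

```python
import math


def standard_error_pct(p_pct: float, n: int) -> float:
    """Standard error, in percentage points, of an observed letter frequency.

    Assumes letters are independent draws, which real text only approximates.
    """
    p = p_pct / 100
    return math.sqrt(p * (1 - p) / n) * 100


# English "e" at 12.702%, taken from the reference table.
for n in (50, 200, 2000, 5000):
    print(f"N={n}: ±{standard_error_pct(12.702, n):.2f} points")
```

Under that assumption, the uncertainty on "e" is more than 2 percentage points at 200 letters and drops just below 0.5 points at around 5,000 letters, in line with the ranges quoted above.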

The Index of Coincidence (IC) measures the probability that two randomly selected letters from the text are identical. English text has an expected IC ≈ 0.0667. A value near 0.0385 indicates uniform random distribution (or strong polyalphabetic cipher). Values between these extremes suggest partial encryption or mixed-language content.

By default, this tool folds uppercase to lowercase so that "A" and "a" are counted as the same letter. If you disable case folding, each case variant is tracked independently, which is useful for analyzing text formatting patterns or programming source code where casing carries semantic meaning.

Non-alphabetic characters are excluded from the letter frequency percentage calculation (the denominator N counts only a-z). However, total character count, digit count, whitespace count, and punctuation count are reported separately in the statistics summary. This ensures the percentage distribution is directly comparable to standard language frequency tables.

Yes. Compare the output frequency profile against the reference table provided. English is characterized by a dominant "e" at ~12.7% and a high "t" at ~9.1%. German shows "e" at ~16.4%. French exhibits a high "s" at ~7.9% and "e" at ~14.7%. A chi-squared test between your observed frequencies and each language profile yields a quantitative match score, where a lower score means a closer match. The tool computes this automatically.
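The chi-squared comparison can be sketched as follows. This is an illustrative Python version and not necessarily how the tool implements it: the names `PROFILES`, `chi_squared`, and `best_match` are hypothetical, only three profiles are included, and their values are copied from the reference table.

```python
from collections import Counter
import string

# Percentages for a..z, copied from the reference table.
PROFILES = {
    "English": [8.167, 1.492, 2.782, 4.253, 12.702, 2.228, 2.015, 6.094, 6.966,
                0.153, 0.772, 4.025, 2.406, 6.749, 7.507, 1.929, 0.095, 5.987,
                6.327, 9.056, 2.758, 0.978, 2.360, 0.150, 1.974, 0.074],
    "French": [7.636, 0.901, 3.260, 3.669, 14.715, 1.066, 0.866, 0.737, 7.529,
               0.545, 0.049, 5.456, 2.968, 7.095, 5.378, 3.021, 1.362, 6.553,
               7.948, 7.244, 6.311, 1.628, 0.114, 0.387, 0.308, 0.136],
    "German": [6.516, 1.886, 2.732, 5.076, 16.396, 1.656, 3.009, 4.577, 6.550,
               0.268, 1.417, 3.437, 2.534, 9.776, 2.594, 0.670, 0.018, 7.003,
               7.270, 6.154, 4.166, 0.846, 1.921, 0.034, 0.039, 1.134],
}


def chi_squared(text: str, profile: list[float]) -> float:
    """Chi-squared statistic between observed letter counts and a profile."""
    letters = [ch.lower() for ch in text if ch in string.ascii_letters]
    n = len(letters)
    if n == 0:
        raise ValueError("text contains no letters")
    counts = Counter(letters)
    total = sum(profile)  # a column may not sum to exactly 100
    score = 0.0
    for letter, pct in zip(string.ascii_lowercase, profile):
        expected = n * pct / total  # expected count of this letter
        if expected > 0:  # letters a profile lists at 0% are skipped
            score += (counts[letter] - expected) ** 2 / expected
    return score


def best_match(text: str) -> str:
    """Name of the profile with the lowest chi-squared score."""
    return min(PROFILES, key=lambda lang: chi_squared(text, PROFILES[lang]))
```

Calling `best_match(sample)` returns the name of the closest of the three profiles. On short inputs the scores for related languages can sit close together, so the ranking is best treated as a hint.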

Published tables (like those from Robert Lewand or Peter Norvig's Google corpus analysis) are derived from millions of words. Your input is a finite sample subject to topic bias. Technical writing overrepresents "x" and "z". Dialogue-heavy fiction skews toward personal pronouns, inflating "i" and "y". The deviation itself is analytically informative: it hints at the genre and register of your text.