About

CSV remains the dominant interchange format for tabular data, yet its apparent simplicity hides parsing traps: unescaped delimiters, inconsistent quoting, mixed encodings, and ambiguous column types. Feeding malformed CSV into a downstream pipeline without validation risks silent data corruption such as truncated rows, shifted columns, or numeric fields miscast as strings. This tool implements a strict RFC 4180 state-machine parser that correctly handles quoted fields, embedded newlines, and escaped double quotes. It auto-detects column types and computes descriptive statistics for every numeric column: x̄ (mean), σ (standard deviation), median, quartiles Q₁ and Q₃, skewness, and kurtosis. Categorical columns receive frequency counts and unique-value tallies.

The analyzer supports files up to 50 MB and uses virtual scrolling for datasets exceeding 500 rows. Large files are parsed in a Web Worker to keep the UI responsive. Type detection is heuristic: a column in which more than 80% of values are numeric is classified as numeric. Edge cases such as mixed-type columns or locale-specific decimal separators (comma vs. period) may require manual review. Pro tip: if your CSV uses semicolons (common in European Excel exports), delimiter auto-detection handles it, and you can also override the delimiter manually.

Formulas

Descriptive statistics computed for each numeric column:

Arithmetic mean:

$$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$$

Population standard deviation:

$$\sigma = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n}}$$

Skewness (Fisher's):

$$\gamma_1 = \frac{1}{n} \sum_{i=1}^{n} \frac{(x_i - \bar{x})^3}{\sigma^3}$$

Excess kurtosis:

$$\kappa_{\text{excess}} = \frac{1}{n} \sum_{i=1}^{n} \frac{(x_i - \bar{x})^4}{\sigma^4} - 3$$

Where x_i represents each observation, n is the count of non-empty values, x̄ is the arithmetic mean, and σ is the population standard deviation. The sample standard deviation s uses n − 1 (Bessel's correction) in the denominator. Median is computed via sorted-array indexing: for odd n, the middle element; for even n, the average of the two central elements. Quartiles use inclusive interpolation (Method 1, same as Excel QUARTILE.INC).
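The formulas above can be sketched in a few lines of Python. This is an illustrative sketch, not the tool's own code: the function names `describe`, `median`, and `quartile_inc` are hypothetical, and the quartile function assumes that inclusive interpolation means linear interpolation at position (n − 1)·q in the sorted array, which is how Excel documents QUARTILE.INC.

```python
import math

def describe(values):
    """Population statistics for a non-empty list of numbers."""
    n = len(values)
    mean = sum(values) / n
    # Central moments, each divided by n (population form).
    m2 = sum((x - mean) ** 2 for x in values) / n
    m3 = sum((x - mean) ** 3 for x in values) / n
    m4 = sum((x - mean) ** 4 for x in values) / n
    sigma = math.sqrt(m2)
    nan = float("nan")
    return {
        "mean": mean,
        "sigma": sigma,
        # Bessel's correction: divide the squared deviations by n - 1.
        "sample_sd": math.sqrt(m2 * n / (n - 1)) if n > 1 else nan,
        # Skewness and kurtosis are undefined for a constant column.
        "skewness": m3 / sigma ** 3 if sigma > 0 else nan,
        "excess_kurtosis": m4 / sigma ** 4 - 3 if sigma > 0 else nan,
    }

def median(values):
    """Middle element for odd n, mean of the two central elements for even n."""
    s = sorted(values)
    mid = len(s) // 2
    return s[mid] if len(s) % 2 else (s[mid - 1] + s[mid]) / 2

def quartile_inc(values, q):
    """Inclusive quantile: interpolate linearly at position (n - 1) * q."""
    s = sorted(values)
    pos = (len(s) - 1) * q
    lo = math.floor(pos)
    hi = min(lo + 1, len(s) - 1)
    return s[lo] + (s[hi] - s[lo]) * (pos - lo)
```

For the values 2, 4, 4, 4, 5, 5, 7, 9 this gives a mean of 5, σ = 2, skewness 0.65625, excess kurtosis −0.21875, median 4.5, Q₁ = 4, and Q₃ = 5.5.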

Reference Data

Statistic	Symbol	Description	Applicable To
Count	n	Total non-empty values in column	All types
Unique	n_u	Distinct value count	All types
Null/Empty	n_∅	Missing or empty cell count	All types
Mean	x̄	Arithmetic average	Numeric
Median	x̃	Middle value when sorted	Numeric
Mode	Mo	Most frequent value	All types
Standard Deviation	σ	Spread around the mean (population)	Numeric
Sample Std Dev	s	Spread using Bessel's correction (n − 1)	Numeric
Minimum	min	Smallest value	Numeric
Maximum	max	Largest value	Numeric
Sum	Σ	Total of all values	Numeric
Range	R	max − min	Numeric
Q1 (25th percentile)	Q₁	Lower quartile boundary	Numeric
Q3 (75th percentile)	Q₃	Upper quartile boundary	Numeric
IQR	Q₃ − Q₁	Interquartile range, robust spread measure	Numeric
Skewness	γ₁	Asymmetry of distribution. 0 = symmetric	Numeric
Kurtosis	κ	Tail heaviness. 3 = normal (excess = 0)	Numeric
Coefficient of Variation	CV	σ ÷ x̄ expressed as %	Numeric
Top Frequency	f_max	Count of the most common value	Categorical
Avg. String Length	L	Mean character count of text values	String
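
The statistics that apply to all column types can be derived from a single frequency count. The sketch below is one possible way to do it; the function name `summarize_column` and the choice to treat whitespace-only cells as empty are assumptions, not documented behavior of the tool.

```python
from collections import Counter

def summarize_column(cells):
    """Count, unique, empty, mode, top frequency and average length of a column."""
    # Assumption: a cell containing only whitespace counts as empty.
    non_empty = [c for c in cells if c.strip() != ""]
    counts = Counter(non_empty)
    mode, top_frequency = counts.most_common(1)[0] if counts else (None, 0)
    return {
        "count": len(non_empty),
        "unique": len(counts),
        "empty": len(cells) - len(non_empty),
        "mode": mode,
        "top_frequency": top_frequency,
        "avg_length": (
            sum(len(c) for c in non_empty) / len(non_empty) if non_empty else 0.0
        ),
    }
```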

Frequently Asked Questions

How does delimiter auto-detection work?

The parser samples the first 10 lines of the file and counts occurrences of four candidate delimiters: comma, semicolon, tab, and pipe. The delimiter with the most consistent count across all sampled lines (lowest variance) wins. If your file uses an unusual delimiter, you can override the selection manually before parsing.
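
A minimal sketch of this heuristic follows. The name `detect_delimiter` is hypothetical, and two details are assumptions added to make the rule well defined: candidates that never appear are skipped (otherwise an absent delimiter would trivially have zero variance), and ties are broken in favor of the higher average count. The sketch also counts characters naively, so delimiters inside quoted fields are included in the tally.

```python
from statistics import pvariance

CANDIDATES = [",", ";", "\t", "|"]

def detect_delimiter(text, sample_lines=10):
    """Pick the candidate whose per-line count varies least across the sample."""
    lines = [ln for ln in text.splitlines() if ln][:sample_lines]
    best, best_key = ",", None  # fall back to comma when nothing matches
    for delim in CANDIDATES:
        counts = [ln.count(delim) for ln in lines]
        if sum(counts) == 0:
            continue  # assumption: ignore candidates that never occur
        # Lower variance wins; on a tie, prefer the higher mean count.
        key = (pvariance(counts), -sum(counts) / len(counts))
        if best_key is None or key < best_key:
            best, best_key = delim, key
    return best
```

In a semicolon-delimited export with decimal commas, the semicolon count is identical on every line while the comma count fluctuates, so the semicolon is selected.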

How are quoted fields and embedded newlines handled?

The parser implements a full RFC 4180 state machine. Fields wrapped in double quotes can contain the delimiter character, newline characters (CR, LF, or CRLF), and literal double quotes (escaped as two consecutive double quotes ""). The state machine tracks whether it is inside or outside a quoted context and handles these cases correctly.
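
The inside/outside-quotes logic can be illustrated with a compact two-state parser. This is a simplified sketch under stated assumptions, not the tool's implementation: `parse_csv` is a hypothetical name, a quote only opens a quoted field when it is the first character of that field, and malformed input (such as text after a closing quote) is accepted leniently rather than rejected.

```python
def parse_csv(text, delimiter=","):
    """Parse CSV text into a list of rows, each a list of field strings."""
    rows, row, field = [], [], []
    in_quotes = False
    record_open = False  # True once the current record has any content
    i, n = 0, len(text)
    while i < n:
        ch = text[i]
        if in_quotes:
            if ch != '"':
                field.append(ch)  # delimiters and newlines are literal here
            elif i + 1 < n and text[i + 1] == '"':
                field.append('"')  # "" inside quotes is one literal quote
                i += 1
            else:
                in_quotes = False  # closing quote
        elif ch == '"' and not field:
            in_quotes = True
            record_open = True
        elif ch == delimiter:
            row.append("".join(field))
            field = []
            record_open = True
        elif ch in "\r\n":
            if ch == "\r" and i + 1 < n and text[i + 1] == "\n":
                i += 1  # treat CRLF as a single record terminator
            row.append("".join(field))
            rows.append(row)
            row, field, record_open = [], [], False
        else:
            field.append(ch)
            record_open = True
        i += 1
    if record_open:  # final record without a trailing newline
        row.append("".join(field))
        rows.append(row)
    return rows
```

Given the two-line input `name,note` followed by `"Smith, J.","said ""hi""`, a line break, and `then left"`, the sketch returns two rows, with the second row holding `Smith, J.` and a note field that contains both the literal quotes and the embedded newline.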

How are column types detected?

Type detection uses heuristic sampling. If more than 80% of non-empty values in a column parse as valid numbers (integers or floats, including negative and scientific notation), the column is classified as numeric. Date detection checks ISO 8601 patterns and common formats (MM/DD/YYYY, DD.MM.YYYY). Mixed-type columns default to string. Edge case: a ZIP code column like "02134" will be detected as numeric and lose the leading zero in statistics. Review the detected types in the column header badges.
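
The threshold rule might look like the sketch below. The regular expressions and the name `detect_type` are illustrative assumptions; in particular, the text does not say whether dates use the same 80% threshold, so applying it to dates here is a guess, and only the date-only form of ISO 8601 is matched.

```python
import re

# Assumed patterns: signed integers/floats with optional exponent, and
# three date layouts (YYYY-MM-DD, MM/DD/YYYY, DD.MM.YYYY).
NUMBER_RE = re.compile(r"[+-]?(\d+\.?\d*|\.\d+)([eE][+-]?\d+)?")
DATE_RE = re.compile(r"\d{4}-\d{2}-\d{2}|\d{2}/\d{2}/\d{4}|\d{2}\.\d{2}\.\d{4}")

def detect_type(cells, threshold=0.8):
    """Classify a column as 'numeric', 'date' or 'string'."""
    values = [c.strip() for c in cells if c.strip()]
    if not values:
        return "string"
    numeric = sum(1 for v in values if NUMBER_RE.fullmatch(v))
    if numeric / len(values) > threshold:  # strictly more than 80%
        return "numeric"
    dates = sum(1 for v in values if DATE_RE.fullmatch(v))
    if dates / len(values) > threshold:  # assumption: same threshold
        return "date"
    return "string"
```

With this rule, four numbers out of five values (exactly 80%) stay a string column, while five out of six become numeric. A ZIP code column such as "02134", "10001", "94105" is classified as numeric, and `float("02134")` yields 2134.0, which is where the leading zero is lost.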

Why do skewness and kurtosis differ from Excel or R?

Several formulas exist for skewness and kurtosis. This tool computes Fisher's skewness and excess kurtosis using population formulas (dividing by n). Excel's SKEW function and R's default use sample-adjusted formulas with correction factors involving n − 1 and n − 2. For large n, the difference is negligible. For small samples (fewer than 30 rows), expect discrepancies.
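
To see the size of the discrepancy, the sketch below compares the population skewness with the adjusted Fisher-Pearson coefficient, which is the sample form documented for Excel's SKEW. The function names are hypothetical, and the sample version requires at least three values.

```python
import math

def population_skewness(xs):
    """Third central moment divided by sigma cubed, both using n."""
    n = len(xs)
    mean = sum(xs) / n
    m2 = sum((x - mean) ** 2 for x in xs) / n
    m3 = sum((x - mean) ** 3 for x in xs) / n
    return m3 / m2 ** 1.5

def sample_skewness(xs):
    """Adjusted Fisher-Pearson coefficient: g1 * sqrt(n(n-1)) / (n-2)."""
    n = len(xs)
    return population_skewness(xs) * math.sqrt(n * (n - 1)) / (n - 2)
```

The correction factor sqrt(n(n − 1)) / (n − 2) is about 1.19 for n = 10 and about 1.0015 for n = 1000, which is why the two results converge as the column grows.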

How large a file can the tool handle?

The tool accepts files up to 50 MB. Files exceeding 1 MB are parsed inside a Web Worker to prevent UI freezing. The data table uses virtual scrolling: only the visible rows (plus a small buffer) are rendered in the DOM. This means even a 500,000-row dataset scrolls smoothly. Statistics are computed in a single pass where possible (Welford's online algorithm for variance) to minimize memory overhead.
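
Welford's online algorithm updates the mean and the sum of squared deviations one value at a time, so the column never has to be held in memory twice. The class below is a generic sketch of that algorithm; the name `RunningStats` and its method names are illustrative and not taken from the tool.

```python
import math

class RunningStats:
    """Single-pass mean and variance using Welford's update."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations from the mean

    def push(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        # Uses the deviation from the old mean times that from the new mean.
        self.m2 += delta * (x - self.mean)

    def population_sd(self):
        return math.sqrt(self.m2 / self.n) if self.n else float("nan")

    def sample_sd(self):
        return math.sqrt(self.m2 / (self.n - 1)) if self.n > 1 else float("nan")
```

Compared with accumulating the sum and the sum of squares, this update avoids subtracting two large, nearly equal numbers, which makes it more robust for columns with a large mean and a small spread.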

Can I sort by multiple columns?

Currently, sorting is single-column. Click a column header to sort ascending; click again for descending; a third click resets to the original order. The sort algorithm is a stable merge sort, so you can achieve a pseudo multi-column sort by sorting on the secondary column first and then on the primary column: stability preserves the secondary order among rows with equal primary values.
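
The two-pass trick works with any stable sort. The sketch below uses Python's built-in `sorted`, which is guaranteed stable (it is Timsort rather than a plain merge sort, but the stability property is the same); the sample rows and the name `sort_two_pass` are made up for illustration.

```python
rows = [
    {"city": "Oslo", "name": "Berit"},
    {"city": "Bergen", "name": "Anders"},
    {"city": "Oslo", "name": "Arne"},
    {"city": "Bergen", "name": "Cecilie"},
]

def sort_two_pass(data, primary, secondary):
    """Sort by the secondary key first, then by the primary key."""
    by_secondary = sorted(data, key=lambda r: r[secondary])
    # Stability keeps the secondary order among rows with equal primary keys.
    return sorted(by_secondary, key=lambda r: r[primary])

result = sort_two_pass(rows, "city", "name")
```

Here `result` lists the Bergen rows before the Oslo rows, with the names in alphabetical order inside each city, which matches a true two-key sort.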

How does search work?

Search is case-insensitive and matches substrings across all cells. Special regex characters are escaped, so searching for "price (USD)" works literally. The search is debounced by 250 ms to avoid excessive re-renders. For datasets with more than 100,000 cells, a progress indicator appears during filtering.
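
The matching rule can be sketched as follows, assuming the query is turned into an escaped, case-insensitive pattern and a row is kept when any of its cells matches. `filter_rows` is a hypothetical name, and the debouncing and progress indicator are UI concerns that are left out.

```python
import re

def filter_rows(rows, query):
    """Keep rows in which at least one cell contains the query as a substring."""
    if not query:
        return rows
    # re.escape makes characters such as ( ) . * match literally.
    pattern = re.compile(re.escape(query), re.IGNORECASE)
    return [row for row in rows if any(pattern.search(cell) for cell in row)]
```

Without the escaping step, the parentheses in "price (USD)" would be read as a regex group, and the pattern would match "price USD" instead of the literal text.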