Supports .csv and .tsv files up to 50 MB.

About

CSV remains the dominant interchange format for tabular data, yet its apparent simplicity hides parsing traps: unescaped delimiters, inconsistent quoting, mixed encodings, and ambiguous column types. Feeding malformed CSV into a downstream pipeline without validation risks silent data corruption: truncated rows, shifted columns, or numeric fields miscast as strings. This tool implements a strict RFC 4180 state-machine parser that handles quoted fields, embedded newlines, and escaped double quotes correctly. It auto-detects column types and computes descriptive statistics for every numeric column: x̄ (mean), σ (standard deviation), median, quartiles Q1 and Q3, skewness, and kurtosis. Categorical columns receive frequency counts and unique-value tallies.

The analyzer supports files up to 50 MB, with virtual scrolling for datasets exceeding 500 rows. Parsing of large files runs in a Web Worker to keep the UI responsive. Note that type detection is heuristic: a column in which more than 80% of values are numeric is classified as numeric. Edge cases such as mixed-type columns or locale-specific decimal separators (comma vs. period) may require manual review. If your CSV uses semicolons (common in European Excel exports), delimiter auto-detection handles it, and you can also override the delimiter manually.

Formulas

Descriptive statistics computed for each numeric column:

Arithmetic mean:

\[ \bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i \]

Population standard deviation:

\[ \sigma = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n}} \]

Skewness (Fisher's):

\[ \gamma_1 = \frac{1}{n} \sum_{i=1}^{n} \frac{(x_i - \bar{x})^3}{\sigma^3} \]

Excess kurtosis:

\[ \kappa_{\text{excess}} = \frac{1}{n} \sum_{i=1}^{n} \frac{(x_i - \bar{x})^4}{\sigma^4} - 3 \]

Where $x_i$ represents each observation, $n$ is the count of non-empty values, $\bar{x}$ is the arithmetic mean, and $\sigma$ is the population standard deviation. The sample standard deviation $s$ uses $n - 1$ (Bessel's correction) in the denominator. Median is computed via sorted-array indexing: for odd $n$, the middle element; for even $n$, the average of the two central elements. Quartiles use inclusive interpolation (Method 1, same as Excel QUARTILE.INC).
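
The formulas above can be sketched in a few lines of Python. This is an illustrative sketch, not the tool's actual source: the function names `describe` and `quantile_inc` are hypothetical, and the quartile helper assumes that "inclusive interpolation" means linear interpolation at position p · (n − 1) in the sorted data, which is how Excel's QUARTILE.INC behaves.

```python
import math

def describe(values):
    """Population-based statistics for a non-empty list of numbers."""
    n = len(values)
    mean = sum(values) / n
    # Central moments, each divided by n (population form).
    m2 = sum((x - mean) ** 2 for x in values) / n
    m3 = sum((x - mean) ** 3 for x in values) / n
    m4 = sum((x - mean) ** 4 for x in values) / n
    sigma = math.sqrt(m2)
    nan = float("nan")
    return {
        "mean": mean,
        "sigma": sigma,
        # Bessel's correction: rescale the population variance by n / (n - 1).
        "sample_sd": math.sqrt(m2 * n / (n - 1)) if n > 1 else nan,
        # Skewness and kurtosis are undefined for a constant column.
        "skewness": m3 / sigma ** 3 if sigma > 0 else nan,
        "excess_kurtosis": m4 / sigma ** 4 - 3 if sigma > 0 else nan,
    }

def quantile_inc(values, p):
    """Linear interpolation at position p * (n - 1) of the sorted data."""
    s = sorted(values)
    pos = p * (len(s) - 1)
    lo = math.floor(pos)
    hi = min(lo + 1, len(s) - 1)
    return s[lo] + (pos - lo) * (s[hi] - s[lo])

data = [2, 4, 4, 4, 5, 5, 7, 9]
stats = describe(data)
median = quantile_inc(data, 0.5)   # average of the two central elements
q1, q3 = quantile_inc(data, 0.25), quantile_inc(data, 0.75)
```

For this sample the mean is 5 and the population standard deviation is 2, which makes the remaining values easy to check by hand.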

Reference Data

| Statistic | Symbol | Description | Applicable To |
|---|---|---|---|
| Count | n | Total non-empty values in column | All types |
| Unique | n_u | Distinct value count | All types |
| Null/Empty | n_∅ | Missing or empty cell count | All types |
| Mean | x̄ | Arithmetic average | Numeric |
| Median | x̃ | Middle value when sorted | Numeric |
| Mode | Mo | Most frequent value | All types |
| Standard Deviation | σ | Spread around the mean (population) | Numeric |
| Sample Std Dev | s | Spread using Bessel's correction (n − 1) | Numeric |
| Minimum | min | Smallest value | Numeric |
| Maximum | max | Largest value | Numeric |
| Sum | Σ | Total of all values | Numeric |
| Range | R | max − min | Numeric |
| Q1 (25th percentile) | Q1 | Lower quartile boundary | Numeric |
| Q3 (75th percentile) | Q3 | Upper quartile boundary | Numeric |
| IQR | Q3 − Q1 | Interquartile range, robust spread measure | Numeric |
| Skewness | γ₁ | Asymmetry of distribution. 0 = symmetric | Numeric |
| Kurtosis | κ | Tail heaviness. 3 = normal (excess = 0) | Numeric |
| Coefficient of Variation | CV | σ ÷ x̄ expressed as % | Numeric |
| Top Frequency | f_max | Count of the most common value | Categorical |
| Avg. String Length | L | Mean character count of text values | String |

Frequently Asked Questions

The parser samples the first 10 lines of the file and counts occurrences of four candidate delimiters: comma, semicolon, tab, and pipe. The delimiter with the most consistent count across all sampled lines (lowest variance) wins. If your file uses an unusual delimiter, you can override the selection manually before parsing.
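
A minimal sketch of this detection heuristic follows. It is an approximation under stated assumptions, not the tool's code: the name `detect_delimiter` is hypothetical, candidates that never appear are skipped (otherwise an absent delimiter would trivially have zero variance), ties are broken in favour of the higher average count, and delimiters inside quoted fields are counted like any other.

```python
from statistics import pvariance

CANDIDATES = [",", ";", "\t", "|"]

def detect_delimiter(text, sample_lines=10):
    """Pick the candidate whose per-line count is most consistent."""
    lines = [ln for ln in text.splitlines() if ln][:sample_lines]
    best, best_key = ",", None          # fall back to comma (assumption)
    for d in CANDIDATES:
        counts = [ln.count(d) for ln in lines]
        if not counts or max(counts) == 0:
            continue                    # candidate never appears in the sample
        # Lower variance wins; a higher mean count breaks ties (assumption).
        key = (pvariance(counts), -sum(counts) / len(counts))
        if best_key is None or key < best_key:
            best, best_key = d, key
    return best

# Semicolon-delimited sample with decimal commas, as in many European exports.
sample = "a;b;c\n1,5;2;3\n4;5,5;6\n"
detected = detect_delimiter(sample)
```

In the sample, the semicolon appears exactly twice on every line, while the comma count varies from line to line, so the semicolon is selected.
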
The parser implements a full RFC 4180 state machine. Fields wrapped in double quotes can contain the delimiter character, newline characters (CR, LF, or CRLF), and literal double quotes (escaped as two consecutive double quotes ""). The state machine tracks whether it is inside or outside a quoted context and handles these cases correctly.
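
The inside/outside-quotes state tracking can be illustrated with a compact character-by-character parser. This is a simplified sketch, not the tool's implementation: `parse_csv` is a hypothetical name, and unlike a strict parser it tolerates a stray quote in the middle of an unquoted field instead of reporting an error.

```python
def parse_csv(text, delimiter=","):
    """Parse CSV text into a list of rows (each a list of strings)."""
    rows, row, field = [], [], []
    in_quotes = False
    pending = False                 # True once the current record has any input
    i, n = 0, len(text)
    while i < n:
        ch = text[i]
        if in_quotes:
            if ch == '"':
                if i + 1 < n and text[i + 1] == '"':
                    field.append('"')       # "" inside quotes is a literal quote
                    i += 1
                else:
                    in_quotes = False       # closing quote
            else:
                field.append(ch)            # delimiters and newlines are literal
        elif ch == '"':
            in_quotes = True
            pending = True
        elif ch == delimiter:
            row.append("".join(field))
            field = []
            pending = True
        elif ch == "\r" or ch == "\n":
            if ch == "\r" and text[i + 1:i + 2] == "\n":
                i += 1                      # treat CRLF as one line break
            row.append("".join(field))
            rows.append(row)
            row, field = [], []
            pending = False
        else:
            field.append(ch)
            pending = True
        i += 1
    if pending:                             # last record without trailing newline
        row.append("".join(field))
        rows.append(row)
    return rows

parsed = parse_csv('name,note\r\n"Smith, J","said ""hi""\nbye"\r\n')
```

Here the second record keeps its embedded comma, its embedded newline, and the escaped quotes around `hi` as part of the field values.
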
Type detection uses heuristic sampling. If more than 80% of non-empty values in a column parse as valid numbers (integers or floats, including negative and scientific notation), the column is classified as numeric. Date detection checks ISO 8601 patterns and common formats (MM/DD/YYYY, DD.MM.YYYY). Mixed-type columns default to string. Edge case: a ZIP code column like "02134" will be detected as numeric and lose the leading zero in statistics. Review the detected types in the column header badges.
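
The 80% rule and the ZIP code pitfall can be reproduced with a short sketch. The function name `classify_column` is hypothetical, date detection is left out, and Python's `float()` stands in for the tool's number parser (it is slightly more permissive, for example accepting underscores between digits).

```python
import math

def is_number(text):
    """True if text parses as a finite integer or float."""
    try:
        return math.isfinite(float(text))
    except ValueError:
        return False

def classify_column(values, threshold=0.8):
    """Return 'numeric' if more than `threshold` of non-empty values are numbers."""
    non_empty = [v.strip() for v in values if v.strip()]
    if not non_empty:
        return "string"
    numeric = sum(is_number(v) for v in non_empty)
    return "numeric" if numeric / len(non_empty) > threshold else "string"

mostly_numbers = classify_column(["1", "2", "3", "4", "5", "n/a"])   # 5 of 6 numeric
mixed = classify_column(["1", "2.5", "-3e4", "", "x"])               # 3 of 4 numeric
zip_codes = classify_column(["02134", "10001"])                      # parses as 2134, 10001
```

The ZIP code column is classified as numeric because `"02134"` parses as the number 2134, which is exactly how the leading zero gets lost.
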
Multiple formulas exist for skewness and kurtosis. This tool computes Fisher's skewness and excess kurtosis using population formulas (dividing by n). Excel's SKEW function and some R packages use sample-adjusted formulas with correction factors involving n − 1 and n − 2. For large n, the difference is negligible. For small samples (< 30 rows), expect discrepancies.
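
To give a sense of the size of the discrepancy for skewness, the sample-adjusted value used by Excel's SKEW can be written as a rescaling of the population value. The symbol $G_1$ for the adjusted statistic is introduced here only for illustration:

```latex
G_1 = \frac{n}{(n-1)(n-2)} \sum_{i=1}^{n} \left( \frac{x_i - \bar{x}}{s} \right)^{3}
    = \frac{\sqrt{n(n-1)}}{n-2} \, \gamma_1

% Scaling factor \sqrt{n(n-1)} / (n-2) for a few sample sizes:
%   n = 10   : \sqrt{90} / 8        \approx 1.186
%   n = 30   : \sqrt{870} / 28      \approx 1.053
%   n = 1000 : \sqrt{999000} / 998  \approx 1.0015
```

So with 10 rows the two conventions differ by roughly 19%, while with 1,000 rows they differ by about 0.15%.
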
The tool accepts files up to 50 MB. Files exceeding 1 MB are parsed inside a Web Worker to prevent UI freezing. The data table uses virtual scrolling: only the visible rows (plus a small buffer) are rendered in the DOM. This means even a 500,000-row dataset scrolls smoothly. Statistics are computed in a single pass where possible (Welford's online algorithm for variance) to minimize memory overhead.
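
Welford's online algorithm updates the mean and the sum of squared deviations one value at a time, so the column never has to be held in memory twice. The class below is a generic sketch of the algorithm; the name `RunningStats` and its methods are illustrative and not taken from the tool.

```python
class RunningStats:
    """Single-pass mean and variance using Welford's update."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0        # running sum of squared deviations from the mean

    def push(self, x):
        self.n += 1
        delta = x - self.mean            # deviation from the old mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)   # uses old and new mean

    def variance(self, sample=False):
        denom = self.n - 1 if sample else self.n
        return self.m2 / denom if denom > 0 else float("nan")

rs = RunningStats()
for value in [2, 4, 4, 4, 5, 5, 7, 9]:
    rs.push(value)
```

After the loop, `rs.mean` is 5 and `rs.variance()` is 4, matching the two-pass population formulas, while `rs.variance(sample=True)` applies Bessel's correction.
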
Currently, sorting is single-column. Click a column header to sort ascending; click again for descending; a third click resets to the original order. The sort algorithm is a stable merge sort, so you can achieve a pseudo multi-column sort by sorting the secondary column first, then the primary column: stability preserves the secondary order within equal primary values.
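
The stability trick can be demonstrated with any stable sort. The sketch below uses Python's built-in `sorted`, which is stable (it is Timsort rather than a plain merge sort, but the guarantee that matters here is the same); the helper `multi_sort` and the sample rows are made up for illustration.

```python
def multi_sort(rows, keys):
    """Sort by several columns using repeated stable single-column sorts."""
    result = list(rows)
    for key in reversed(keys):          # least significant column first
        result = sorted(result, key=lambda r: r[key])
    return result

rows = [
    {"city": "Oslo", "name": "Dahl"},
    {"city": "Bergen", "name": "Moe"},
    {"city": "Oslo", "name": "Berg"},
    {"city": "Bergen", "name": "Aas"},
]

# Equivalent to clicking "name" first, then "city".
by_city_then_name = multi_sort(rows, ["city", "name"])
names_in_order = [r["name"] for r in by_city_then_name]
```

The result lists the Bergen rows before the Oslo rows, and within each city the names stay in the alphabetical order produced by the first sort.
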
Search is case-insensitive and matches substrings across all cells. Special regex characters are escaped, so searching for "price (USD)" works literally. The search is debounced by 250 ms to avoid excessive re-renders. For datasets with more than 100,000 cells, a progress indicator appears during filtering.
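
Escaping the query before building a pattern is what makes characters such as parentheses match literally. The following is a small sketch of that idea only; `filter_rows` is a hypothetical name, and the debouncing and progress indicator are UI concerns not shown here.

```python
import re

def filter_rows(rows, query):
    """Keep rows in which any cell contains the query, ignoring case."""
    if not query:
        return list(rows)
    # re.escape turns "(" and ")" into literals instead of a regex group.
    pattern = re.compile(re.escape(query), re.IGNORECASE)
    return [row for row in rows if any(pattern.search(cell) for cell in row)]

table = [
    ["Widget", "price (USD)"],
    ["Gadget", "Price (usd) 10"],
    ["Gizmo", "price USD"],
]
matches = filter_rows(table, "price (USD)")
```

Only the first two rows match: the third lacks the parentheses, which would have been silently ignored if the query had been treated as a regular expression.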