
About

Misaligned columns in CSV data cause silent failures in ETL pipelines, database imports, and analytics dashboards. A single unescaped delimiter inside a field shifts every subsequent column, producing corrupt records that pass validation but yield wrong results. This tool parses your CSV using a strict RFC 4180 state machine that correctly handles quoted fields, escaped double-quotes (""), and embedded newlines. It reports the column count per row, flags inconsistencies where row i yields n_i columns with n_i ≠ n_1, and auto-detects the delimiter from comma, semicolon, tab, and pipe characters.
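The quote-aware counting described above can be sketched as a small state machine. This is an illustration, not the tool's published source; `countColumns` is a hypothetical helper name, and it assumes any embedded newlines are already part of the record string.

```javascript
// Count columns in one CSV record, RFC 4180 style: delimiters inside
// quoted fields do not separate columns, and "" escapes a literal quote.
function countColumns(record, delimiter = ",") {
  let inQuotes = false;
  let columns = 1; // a record always has at least one column
  for (let i = 0; i < record.length; i++) {
    const ch = record[i];
    if (inQuotes) {
      if (ch === '"') {
        if (record[i + 1] === '"') i++; // "" = escaped literal quote
        else inQuotes = false;          // closing quote ends the field
      }
    } else if (ch === '"') {
      inQuotes = true;                  // opening quote starts quoted mode
    } else if (ch === delimiter) {
      columns++;                        // only unquoted delimiters count
    }
  }
  return columns;
}

console.log(countColumns('"San Francisco, CA",94105')); // 2: quoted comma ignored
```

This is why "San Francisco, CA" stays one column even with a comma delimiter: the comma is consumed while the parser is in its quoted state.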

The tool approximates header presence by checking whether the first row contains exclusively non-numeric strings while subsequent rows contain mixed or numeric data. Limitation: auto-detection fails on files where every field is text or where multiple candidate delimiters appear with equal frequency. In such cases, select the delimiter manually. Files up to 50 MB are supported client-side with no server upload.


Formulas

Column counting follows a finite-state parser. For each row, the parser transitions between states based on the current character and the active state. The column count for row i is:

cols_i = delimiters_i + 1

where delimiters_i counts only unquoted delimiter characters in row i. The delimiter auto-detection score for candidate d is computed as:

score(d) = consistency(d) × frequency(d)

where consistency measures what fraction of sampled rows produce the same column count, and frequency is the mean count of d per row. The candidate with the highest score is selected. Inconsistency is flagged when:

cols_i ≠ mode(cols_1, cols_2, …, cols_N)

where N is the total row count, mode returns the most frequent column count, cols_i is the column count for row i, and rows deviating from the mode are reported as inconsistent.
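The scoring and flagging formulas above can be sketched directly. Function names (`scoreDelimiter`, `flagInconsistentRows`, `mode`) are illustrative, and this sketch counts raw delimiter occurrences per row without quote handling, which the real parser would apply first.

```javascript
// Most frequent value in an array (the mode of the column counts).
function mode(values) {
  const tally = new Map();
  for (const v of values) tally.set(v, (tally.get(v) || 0) + 1);
  return [...tally.entries()].sort((a, b) => b[1] - a[1])[0][0];
}

// score(d) = consistency(d) × frequency(d) over the sampled rows.
function scoreDelimiter(rows, d) {
  const counts = rows.map(r => r.split(d).length - 1);            // d per row
  const freq = counts.reduce((a, b) => a + b, 0) / rows.length;   // frequency(d)
  const colCounts = counts.map(c => c + 1);                       // cols_i = delimiters_i + 1
  const m = mode(colCounts);
  const consistency = colCounts.filter(c => c === m).length / rows.length;
  return consistency * freq;
}

// Report row indices where cols_i ≠ mode(cols_1, …, cols_N).
function flagInconsistentRows(colCounts) {
  const m = mode(colCounts);
  return colCounts.flatMap((c, i) => (c !== m ? [i] : []));
}
```

Multiplying by frequency breaks ties in favor of delimiters that actually appear: a candidate that never occurs has perfectly consistent counts (always 1 column) but a score of zero.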

Reference Data

Delimiter  Name            Common Use                        Unicode  Standard      Risk Factor
,          Comma           International CSV default         U+002C   RFC 4180      Breaks on European decimals (3,14)
;          Semicolon       European CSV (Excel EU locale)    U+003B   Non-standard  Rare in field data
\t         Tab             TSV files, database exports       U+0009   IANA TSV      Invisible character, hard to debug
|          Pipe            Legacy systems, HL7 medical data  U+007C   Non-standard  Conflicts with shell piping
\x1F       Unit Separator  ASCII control character           U+001F   Non-standard  Not human-readable
Common Column Count Expectations

File Type                     Typical Columns      Typical Fields
Standard Address File         5-8                  Name, Street, City, State, Zip, Country
Bank Transaction Export       6-12                 Date, Description, Debit, Credit, Balance, Reference
Web Analytics (GA Export)     10-30                Session, Source, Medium, Page, Bounce Rate, etc.
eCommerce Product Feed        15-50                SKU, Title, Description, Price, Images, Variants
Scientific Dataset (Tidy)     3-20                 Observation, Variable, Value (per tidy data principles)
US Census PUMS                200+                 Microdata with coded variables
Apache Log (CSV-converted)    7-9                  IP, Timestamp, Method, URL, Status, Size, Referrer
CRM Contact Export            20-40                Name, Email, Phone, Company, Tags, Custom Fields
IoT Sensor Readings           4-15                 Timestamp, Sensor ID, Value, Unit, Status
Genomics VCF (tab-delimited)  8 fixed + n samples  CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO

Frequently Asked Questions

How does the parser handle quoted fields?

The parser implements a 4-state finite automaton per RFC 4180. When a double-quote opens a field, all characters including delimiters and line breaks are treated as field content until a closing quote is found. A literal quote inside a quoted field must be escaped as two consecutive double-quotes (""). This means a field like "San Francisco, CA" counts as one column, not two, even with a comma delimiter.

Why might delimiter auto-detection fail?

Auto-detection samples the first 20 rows and scores each candidate delimiter by consistency (same column count across rows) multiplied by frequency. If your file uses an uncommon delimiter or has irregular structure in the first 20 rows, the heuristic may fail. In that case, manually select the correct delimiter from the dropdown. Files with only one column, or where every field is quoted with embedded delimiters, are inherently ambiguous.

What causes inconsistent column counts?

Common causes include: unescaped delimiters inside field values (e.g., commas in addresses without quoting), missing trailing delimiters on some rows, extra blank fields appended by spreadsheet software, or corrupted lines from truncated writes. The tool flags each row that deviates from the mode column count so you can inspect the specific problematic lines.

How are character encodings handled?

The FileReader API reads the file as UTF-8 text by default. A UTF-8 BOM (byte order mark, 0xEF 0xBB 0xBF) is stripped automatically before parsing. If your file uses Latin-1 or Windows-1252 encoding, special characters may display incorrectly, but delimiter detection and column counting remain accurate since delimiter characters fall within the ASCII range (U+0000 to U+007F).
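BOM stripping is a one-line check once the file has been decoded to a string. This is a sketch with a hypothetical helper name, not the tool's actual API.

```javascript
// If the decoder preserves the UTF-8 BOM (bytes 0xEF 0xBB 0xBF), it shows
// up as the single code point U+FEFF at position 0 of the decoded string.
// Dropping it prevents the BOM from being glued onto the first header name.
function stripBom(text) {
  return text.charCodeAt(0) === 0xfeff ? text.slice(1) : text;
}

console.log(stripBom("\uFEFFname,age")); // "name,age"
```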

How are large files handled?

Files up to 50 MB are accepted. For files exceeding 1 MB, parsing is chunked using setTimeout batches of 100,000 characters to prevent the browser main thread from freezing. A progress indicator shows parsing advancement. Memory is released after analysis completes. For files larger than 50 MB, consider splitting with a command-line tool like GNU split before analysis.
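The chunking scheme described above can be sketched as follows. Names (`CHUNK_SIZE`, `parseInChunks`) are illustrative, not the tool's actual API.

```javascript
const CHUNK_SIZE = 100_000; // characters per batch

// Process `text` in batches, yielding to the event loop between batches
// via setTimeout(…, 0) so the main thread (and any progress indicator)
// stays responsive. Calls done() after the final batch.
function parseInChunks(text, handleChunk, done) {
  let offset = 0;
  function step() {
    const end = Math.min(offset + CHUNK_SIZE, text.length);
    handleChunk(text.slice(offset, end), offset);
    offset = end;
    if (offset < text.length) setTimeout(step, 0); // yield between batches
    else done();
  }
  step();
}
```

A real implementation would also cut batches on row boundaries, or carry the trailing partial line into the next batch, so a record is never split mid-field.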

How does header detection work?

The heuristic checks whether the first row consists entirely of non-numeric, non-empty strings while at least 30% of cells in rows 2 through 6 contain numeric or date-like values. This is a probabilistic guess. If your data has text-only columns or numeric headers (e.g., year codes), the detection may be incorrect. You can override it manually using the header toggle.
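The heuristic translates to a short predicate over the already-split rows. This is a sketch: `looksLikeHeader` is a hypothetical name, and the date-like test is simplified to ISO-style dates rather than whatever pattern set the tool actually uses.

```javascript
// Guess header presence: first row all non-empty, non-numeric; at least
// 30% of cells in rows 2-6 numeric or date-like. Rows are arrays of
// string cells (already split on the detected delimiter).
function looksLikeHeader(rows) {
  const isNumeric = s => s !== "" && !Number.isNaN(Number(s));
  const isDateLike = s => /^\d{4}[-/]\d{1,2}[-/]\d{1,2}/.test(s); // simplified
  const first = rows[0];
  if (!first || !first.every(c => c !== "" && !isNumeric(c))) return false;
  const body = rows.slice(1, 6).flat(); // sample cells from rows 2-6
  if (body.length === 0) return false;
  const hits = body.filter(c => isNumeric(c) || isDateLike(c)).length;
  return hits / body.length >= 0.3;
}
```

Note the failure modes the answer above warns about: a year-code header like "2023,2024" is numeric and defeats the first check, while an all-text body never reaches the 30% threshold.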