About

Malformed CSV files silently corrupt data pipelines. A single unescaped quote character in row 4,207 can shift every subsequent column, producing garbage in your database without triggering an import error. This tool parses your CSV using a finite-state machine compliant with RFC 4180, checking structural integrity (uniform column count n across all m rows), quote balancing, delimiter consistency, data type coherence per column, and duplicate row detection. It does not guess. It reports exact row and column coordinates for every anomaly found.

The validator distinguishes errors (file will break on import) from warnings (file imports but data may be wrong). Common failure modes include: fields containing the delimiter without quoting, line breaks embedded in unquoted fields, trailing commas creating phantom columns, and mixed line endings (CRLF vs LF). Spreadsheet exports from Excel are particularly prone to locale-dependent delimiter substitution (commas become semicolons in European locales). This tool auto-detects the delimiter and flags inconsistencies that a naive split-on-comma approach would miss entirely.

Formulas

The CSV parser operates as a deterministic finite automaton (DFA) with three states:

S ∈ { FIELD_START, UNQUOTED, QUOTED }

Transition function: given current state S and input character c, the parser transitions as follows.

In FIELD_START: if c = " then transition to QUOTED; if c = delim then emit an empty field and remain in FIELD_START; otherwise transition to UNQUOTED and accumulate c.

In QUOTED: if c = " then peek at the next character. If it is also ", emit a literal quote and advance past it; otherwise emit the field and transition to FIELD_START. All other characters, including delimiters and newlines, accumulate.

In UNQUOTED: if c = delim or c = newline, emit the field and transition to FIELD_START; otherwise accumulate c.
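The three-state machine can be sketched in JavaScript as follows. This is a minimal illustration, not the tool's actual source: the function name parseCsv and the returned object shape are assumptions. It also departs from the description in one respect: after a closing quote it moves to UNQUOTED rather than FIELD_START, so that the delimiter that follows ends the field exactly once instead of producing an extra empty field.

```javascript
// Hypothetical helper: parseCsv is an illustrative name, not the tool's API.
function parseCsv(text, delim = ",", quote = '"') {
  const rows = [];
  let row = [];
  let field = "";
  let state = "FIELD_START";

  for (let i = 0; i < text.length; i++) {
    const c = text[i];
    if (state === "QUOTED") {
      if (c !== quote) {
        field += c; // delimiters and newlines are plain data inside quotes
      } else if (text[i + 1] === quote) {
        field += quote; // "" inside a quoted field is a literal quote
        i++;
      } else {
        state = "UNQUOTED"; // closing quote (assumption: see lead-in)
      }
    } else if (c === delim) {
      row.push(field);
      field = "";
      state = "FIELD_START";
    } else if (c === "\n" || c === "\r") {
      if (c === "\r" && text[i + 1] === "\n") i++; // treat CRLF as one terminator
      row.push(field);
      rows.push(row);
      row = [];
      field = "";
      state = "FIELD_START";
    } else if (state === "FIELD_START" && c === quote) {
      state = "QUOTED";
    } else {
      field += c;
      state = "UNQUOTED";
    }
  }

  // Flush the last record when the input does not end with a newline.
  if (state !== "FIELD_START" || field !== "" || row.length > 0) {
    row.push(field);
    rows.push(row);
  }
  // Ending in QUOTED means an opening quote was never closed.
  return { rows, unbalancedQuote: state === "QUOTED" };
}
```

For example, parseCsv('a,"b') returns one row and sets unbalancedQuote to true, which is the condition the validator would report as an unbalanced-quote error.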

Column count consistency check:

valid(row_i) = ( len(row_i) = n_expected )

where n_expected = len(row₀) (header or first data row).
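As a rough sketch of this check, assuming rows have already been parsed into arrays of fields, the comparison against the first row could look like the following. The function name and the 1-based row numbering in the result are illustrative choices, not taken from the tool.

```javascript
// Hypothetical helper: reports every row whose field count differs from row 0.
function findColumnCountMismatches(rows) {
  if (rows.length === 0) return [];
  const expected = rows[0].length; // n_expected = len(row 0)
  const mismatches = [];
  rows.forEach((row, i) => {
    if (row.length !== expected) {
      // Row numbers are reported 1-based (an assumption for readability).
      mismatches.push({ row: i + 1, expected, actual: row.length });
    }
  });
  return mismatches;
}
```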

Auto-delimiter detection scores each candidate delimiter d from its occurrence counts across the first 10 lines, where count(d) is the mean number of occurrences per line and σ(d) is the standard deviation of those per-line counts:

score(d) = count(d) / (σ(d) + 1)

The delimiter with the highest score (high mean count, low variance) is selected. If all candidates score 0, comma is used as default per RFC 4180.
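The scoring rule can be sketched as below. Several details are assumptions made for illustration: the function name detectDelimiter, the candidate set (comma, semicolon, tab, pipe), the use of the population standard deviation, and the simplification that occurrences are counted without regard to quoting.

```javascript
// Hypothetical helper: picks the candidate with the highest mean / (sigma + 1).
function detectDelimiter(text, candidates = [",", ";", "\t", "|"]) {
  const lines = text
    .split(/\r\n|\n|\r/)
    .filter((line) => line.length > 0)
    .slice(0, 10); // only the first 10 non-empty lines are sampled

  let best = ","; // default when every candidate scores 0
  let bestScore = 0;
  for (const d of candidates) {
    const counts = lines.map((line) => line.split(d).length - 1);
    const n = counts.length || 1; // avoid dividing by zero on empty input
    const mean = counts.reduce((a, b) => a + b, 0) / n;
    const variance = counts.reduce((a, c) => a + (c - mean) ** 2, 0) / n;
    const score = mean / (Math.sqrt(variance) + 1);
    if (score > bestScore) {
      best = d;
      bestScore = score;
    }
  }
  return best;
}
```

With the input "a;b;c\n1,5;2;3\n4;5;6", the semicolon occurs twice on every line (mean 2, σ 0, score 2), while the comma occurs on only one line and scores well below 1, so the semicolon is selected.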

Reference Data

Validation Rule	Severity	RFC 4180 Section	Description
Inconsistent Column Count	Error	§2.4	Row has more or fewer fields than the header row
Unbalanced Quotes	Error	§2.5-2.7	Opening double-quote without matching close
Unescaped Quote in Field	Error	§2.7	Double-quote inside quoted field not escaped as ""
Delimiter in Unquoted Field	Error	§2.6	Field contains the delimiter but is not enclosed in quotes
Newline in Unquoted Field	Error	§2.6	Line break inside a field that lacks quote enclosure
Empty File	Error	-	File contains zero parseable rows
Data Type Inconsistency	Warning	-	Column contains mixed types (e.g., integers and strings)
Duplicate Rows	Warning	-	Two or more rows with identical content across all fields
Leading/Trailing Whitespace	Warning	-	Fields with spaces outside quotes that may cause match failures
Empty Fields	Info	-	Cells with no content (may be intentional)
Mixed Line Endings	Warning	§2.1	File uses both CRLF and LF line terminators
Trailing Delimiter	Warning	-	Row ends with delimiter, creating an empty trailing field
BOM Detected	Info	-	UTF-8 Byte Order Mark found at file start
Single Column Detected	Warning	-	Only one column found; delimiter may be incorrect
Header Duplicates	Warning	-	Two or more header columns share the same name
Excessively Long Field	Warning	-	Field exceeds 10,000 characters; possible parse error
Non-UTF8 Characters	Warning	-	Field contains the replacement character U+FFFD, indicating bytes that are not valid UTF-8
Completely Empty Row	Warning	-	Row contains only delimiters with no data

Frequently Asked Questions

Per RFC 4180 §2.6, fields containing line breaks must be enclosed in double quotes. The parser's QUOTED state accumulates characters including newlines until a closing quote is found. If a newline appears in an unquoted field, it is flagged as a structural error because the parser cannot distinguish it from a row terminator, which would shift all subsequent column alignments.

The auto-detection algorithm samples the first 10 lines and counts occurrences of four candidate delimiters: comma, semicolon, tab, and pipe. It selects the delimiter with the highest consistency score (high average count per line, low standard deviation). European Excel exports typically use semicolons due to the comma serving as a decimal separator in those locales. You can also manually override the detected delimiter in the settings panel.

No. The validator infers the dominant type per column by classifying every non-empty value as integer, float, boolean, date (ISO 8601 pattern), email, URL, or string. If more than 10% of values in a column deviate from the dominant type, a warning is issued. This threshold tolerates occasional placeholder values such as "N/A". Columns detected as all-string receive no type warnings.
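A minimal sketch of the dominant-type check follows. The regular expressions are simplified assumptions chosen for illustration (the tool's actual patterns are not given here), and the names classify and checkColumnTypes are hypothetical.

```javascript
// Hypothetical classifier: the patterns below are illustrative, not the tool's.
function classify(value) {
  const v = value.trim();
  if (/^[+-]?\d+$/.test(v)) return "integer";
  if (/^[+-]?(\d+\.\d*|\.\d+)([eE][+-]?\d+)?$/.test(v)) return "float";
  if (/^(true|false)$/i.test(v)) return "boolean";
  if (/^\d{4}-\d{2}-\d{2}([T ]\d{2}:\d{2}(:\d{2}(\.\d+)?)?(Z|[+-]\d{2}:?\d{2})?)?$/.test(v)) return "date";
  if (/^[^\s@]+@[^\s@]+\.[^\s@]+$/.test(v)) return "email";
  if (/^https?:\/\/\S+$/i.test(v)) return "url";
  return "string";
}

// Warns when more than `threshold` of the non-empty values deviate
// from the most common type in the column.
function checkColumnTypes(values, threshold = 0.1) {
  const nonEmpty = values.filter((v) => v.trim() !== "");
  const tally = new Map();
  for (const v of nonEmpty) {
    const t = classify(v);
    tally.set(t, (tally.get(t) || 0) + 1);
  }
  let dominant = "string";
  let max = 0;
  for (const [type, n] of tally) {
    if (n > max) {
      dominant = type;
      max = n;
    }
  }
  const deviating = nonEmpty.length - max;
  const warn = nonEmpty.length > 0 && deviating / nonEmpty.length > threshold;
  return { dominant, deviating, warn };
}
```

Under these assumptions, a column of nine integers and one "N/A" sits exactly at the 10% threshold and produces no warning, while an all-string column has zero deviating values and is never flagged.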

Each row is serialized by joining its fields with a null character separator (U+0000, which cannot appear in valid CSV data), then stored in a JavaScript Set. If the Set's add operation does not increase its size, the row is a duplicate. This provides O(1) average lookup per row and O(n) total time complexity. For files exceeding 50,000 rows, only exact full-row duplicates are checked; partial/fuzzy matching is not performed.
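The Set-based approach described above might look like this in outline. The function name and the 1-based row numbers in the result are illustrative assumptions.

```javascript
// Hypothetical helper: returns the 1-based numbers of rows that repeat an earlier row.
function findDuplicateRows(rows) {
  const seen = new Set();
  const duplicates = [];
  rows.forEach((row, i) => {
    const sizeBefore = seen.size;
    // U+0000 as separator keeps ["ab", ""] distinct from ["a", "b"].
    seen.add(row.join("\u0000"));
    if (seen.size === sizeBefore) duplicates.push(i + 1);
  });
  return duplicates;
}
```

Joining with a separator that cannot occur inside a field matters: with a plain comma, the one-field row ["a,b"] and the two-field row ["a", "b"] would serialize to the same key.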

The validator detects the UTF-8 BOM (byte sequence EF BB BF, which appears as the character U+FEFF) at position 0 of the file content and strips it before parsing, logging an informational note. Mixed encoding (e.g., Latin-1 characters in a UTF-8 file) produces replacement characters (U+FFFD) during FileReader decoding; the validator flags any field containing U+FFFD as a non-UTF8 character warning with the exact row and column location.
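Both checks can be sketched on already-decoded text. The helper names stripBom and findReplacementChars, the result shapes, and the 1-based coordinates are assumptions for illustration.

```javascript
// Hypothetical helper: removes a leading U+FEFF and reports whether one was present.
function stripBom(text) {
  return text.charCodeAt(0) === 0xfeff
    ? { text: text.slice(1), bomDetected: true }
    : { text, bomDetected: false };
}

// Hypothetical helper: lists 1-based coordinates of fields containing U+FFFD.
function findReplacementChars(rows) {
  const hits = [];
  rows.forEach((row, r) => {
    row.forEach((field, c) => {
      if (field.includes("\uFFFD")) hits.push({ row: r + 1, column: c + 1 });
    });
  });
  return hits;
}
```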

The tool runs entirely in the browser. Practical limits depend on available RAM. Files up to approximately 50 MB parse reliably on modern devices with 4+ GB RAM. Processing is chunked into batches of 5,000 lines with yielding via setTimeout to prevent UI freezing. A progress bar displays completion percentage. For files exceeding 100 MB, consider splitting them with a command-line tool like GNU split before validation.
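The batching pattern can be illustrated with a small generator plus a setTimeout-driven loop. This is a sketch under assumptions: the names batches and processInBatches, the callback signatures, and the rounding of the progress percentage are all hypothetical.

```javascript
// Splits an array of lines into consecutive batches of at most `size` items.
function* batches(lines, size = 5000) {
  for (let start = 0; start < lines.length; start += size) {
    yield lines.slice(start, start + size);
  }
}

// Processes one batch per timer tick so the UI thread can repaint in between.
// Resolves with the number of lines handled.
function processInBatches(lines, handleLine, onProgress = () => {}, size = 5000) {
  return new Promise((resolve, reject) => {
    const iterator = batches(lines, size);
    let done = 0;
    const step = () => {
      try {
        const next = iterator.next();
        if (next.done) {
          resolve(done);
          return;
        }
        next.value.forEach((line, i) => handleLine(line, done + i));
        done += next.value.length;
        onProgress(Math.round((done / lines.length) * 100));
        setTimeout(step, 0); // yield to the event loop before the next batch
      } catch (err) {
        reject(err);
      }
    };
    step();
  });
}
```

A caller would typically pass a per-line validation callback as handleLine and update a progress bar from onProgress.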

A trailing delimiter creates an additional empty field, so that row has n+1 columns. Strictly, this is a column count mismatch (an error). However, many export tools consistently add a trailing delimiter to every row, including the header. If every row ends with a trailing delimiter, structural consistency is maintained and the extra empty column is flagged as a warning instead. If only some rows have trailing delimiters, the finding escalates to an error because the column counts are inconsistent.
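The escalation rule can be expressed compactly. The sketch below works on raw lines and assumes no quoted field spans multiple lines; the function name and the "none" / "warning" / "error" return values are illustrative.

```javascript
// Hypothetical helper: classifies trailing-delimiter usage across all non-empty lines.
function classifyTrailingDelimiters(lines, delim = ",") {
  const nonEmpty = lines.filter((line) => line.length > 0);
  const trailing = nonEmpty.filter((line) => line.endsWith(delim)).length;
  if (trailing === 0) return "none";
  // Consistent on every row: warning. Present on only some rows: error.
  return trailing === nonEmpty.length ? "warning" : "error";
}
```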