CSV Validator
Validate CSV files against RFC 4180 standards. Detect structural errors, quote mismatches, type inconsistencies, and duplicates instantly.
About
Malformed CSV files silently corrupt data pipelines. A single unescaped quote character in row 4,207 can shift every subsequent column, producing garbage in your database without triggering an import error. This tool parses your CSV using a finite-state machine compliant with RFC 4180, checking structural integrity (uniform column count n across all m rows), quote balancing, delimiter consistency, data type coherence per column, and duplicate row detection. It does not guess. It reports exact row and column coordinates for every anomaly found.
The validator distinguishes errors (file will break on import) from warnings (file imports but data may be wrong). Common failure modes include: fields containing the delimiter without quoting, line breaks embedded in unquoted fields, trailing commas creating phantom columns, and mixed line endings (CRLF vs LF). Spreadsheet exports from Excel are particularly prone to locale-dependent delimiter substitution (commas become semicolons in European locales). This tool auto-detects the delimiter and flags inconsistencies that a naive split-on-comma approach would miss entirely.
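To illustrate why a naive split-on-comma approach miscounts columns, the short sketch below compares it with a quote-aware parse. It uses Python's standard `csv` module as a stand-in for the tool's own parser, and the sample data is invented for the example.

```python
import csv
import io

# Invented sample: the second row holds a quoted field containing a comma.
sample = 'id,name,city\n1,"Smith, John",Berlin\n'

# Naive approach: split every line on the delimiter, ignoring quotes.
naive_rows = [line.split(",") for line in sample.splitlines()]

# Quote-aware approach: csv.reader honours RFC 4180 style quoting.
parsed_rows = list(csv.reader(io.StringIO(sample)))

naive_counts = [len(r) for r in naive_rows]
parsed_counts = [len(r) for r in parsed_rows]

print(naive_counts)   # the quoted comma is counted as a separator
print(parsed_counts)  # both rows have the same number of fields
```

The naive split reports four fields on the data row because it treats the comma inside the quotes as a separator, while the quote-aware parse keeps `Smith, John` as one field.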
Formulas
The CSV parser operates as a deterministic finite automaton (DFA) with three states: FIELD_START, QUOTED, and UNQUOTED.
Transition function: given current state S and input character c, the parser transitions as follows.
In FIELD_START: if c = " then transition to QUOTED; if c = delim then emit an empty field and remain in FIELD_START; otherwise transition to UNQUOTED and accumulate c.
In QUOTED: if c = " then peek at the next character. If it is also ", emit a literal quote and advance past both; otherwise emit the field and transition to FIELD_START, where the delimiter or newline that follows the closing quote ends that field rather than starting a new empty one. All other characters, including delimiters and newlines, accumulate.
In UNQUOTED: if c = delim or c = newline, emit the field and transition to FIELD_START; otherwise accumulate c.
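The transition rules above can be sketched as a small three-state parser. This is a minimal illustration under stated assumptions, not the tool's actual implementation: the names `State` and `parse_csv` are hypothetical, CRLF is normalised to LF up front, a trailing newline is appended so the last row is flushed, and any character other than a delimiter or newline after a closing quote is treated as an error.

```python
from enum import Enum, auto


class State(Enum):
    FIELD_START = auto()
    QUOTED = auto()
    UNQUOTED = auto()


def parse_csv(text: str, delim: str = ",") -> list[list[str]]:
    # Sketch-only simplifications: normalise CRLF and force a final newline.
    text = text.replace("\r\n", "\n")
    if text and not text.endswith("\n"):
        text += "\n"

    rows: list[list[str]] = []
    row: list[str] = []
    field: list[str] = []
    state = State.FIELD_START
    i = 0
    while i < len(text):
        c = text[i]
        if state is State.FIELD_START:
            if c == '"':
                state = State.QUOTED
            elif c == delim:
                row.append("")  # empty field; remain in FIELD_START
            elif c == "\n":
                row.append("")  # empty final field, then close the row
                rows.append(row)
                row = []
            else:
                field.append(c)
                state = State.UNQUOTED
        elif state is State.QUOTED:
            if c != '"':
                field.append(c)  # delimiters and newlines are literal here
            elif text[i + 1 : i + 2] == '"':
                field.append('"')  # "" inside quotes is one literal quote
                i += 1
            else:
                # Closing quote: emit the field and consume its terminator.
                row.append("".join(field))
                field = []
                state = State.FIELD_START
                nxt = text[i + 1 : i + 2]
                if nxt == delim:
                    i += 1
                elif nxt == "\n":
                    rows.append(row)
                    row = []
                    i += 1
                else:
                    raise ValueError(f"unexpected character after closing quote at offset {i + 1}")
        else:  # State.UNQUOTED
            if c == delim or c == "\n":
                row.append("".join(field))
                field = []
                state = State.FIELD_START
                if c == "\n":
                    rows.append(row)
                    row = []
            else:
                field.append(c)
        i += 1

    if state is State.QUOTED:
        raise ValueError("unbalanced quote: input ended inside a quoted field")
    return rows
```

For example, `parse_csv('1,"Smith, John"\n')` yields one row with two fields, and an input that ends inside a quoted field raises `ValueError`. In this sketch a blank line parses as a row with a single empty field.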
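Column count consistency check: every row must have the same number of fields as the first row,
$$\forall i \in \{0, \dots, m-1\}: \quad \mathrm{len}(\mathrm{row}_i) = n_{\text{expected}}$$
where $n_{\text{expected}} = \mathrm{len}(\mathrm{row}_0)$ (header or first data row). Any row that violates this equality is reported as an error.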
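The column count check can be expressed in a few lines. The sketch below assumes rows are already parsed into lists of strings; the function name `column_count_errors` and the 0-based row indices are choices made for this example.

```python
def column_count_errors(rows: list[list[str]]) -> list[tuple[int, int, int]]:
    """Return (row_index, expected, actual) for every row whose field
    count differs from that of the first row."""
    if not rows:
        return []
    expected = len(rows[0])  # header or first data row sets the width
    return [
        (i, expected, len(row))
        for i, row in enumerate(rows)
        if len(row) != expected
    ]


rows = [
    ["id", "name", "city"],
    ["1", "Ada", "London"],
    ["2", "Grace"],                      # one field short
    ["3", "Alan", "Wilmslow", "extra"],  # one field too many
]
errors = column_count_errors(rows)
print(errors)
```

Each tuple gives the offending row index together with the expected and actual field counts, which is enough to report an exact location.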
Auto-delimiter detection scores each candidate delimiter d using the mean μ and standard deviation σ of its occurrence counts across the first 10 lines.
The delimiter with the highest score (high mean count, low variance) is selected. If all candidates score 0, comma is used as default per RFC 4180.
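One way to realise this scoring is sketched below. The exact score formula is not given here, so the expression `mean / (1 + stdev)` is an assumption chosen only because it rewards a high mean count and low variance and is 0 when a delimiter never appears. The candidate set (comma, semicolon, tab, pipe) and the names `score` and `detect_delimiter` are likewise hypothetical.

```python
from statistics import mean, pstdev

# Assumed candidate set; comma comes first so it wins ties.
CANDIDATES = [",", ";", "\t", "|"]


def score(counts: list[int]) -> float:
    # Assumed formula: high mean and low spread give a high score.
    return mean(counts) / (1.0 + pstdev(counts))


def detect_delimiter(text: str, sample_lines: int = 10) -> str:
    lines = [ln for ln in text.splitlines() if ln][:sample_lines]
    if not lines:
        return ","
    scores = {d: score([ln.count(d) for ln in lines]) for d in CANDIDATES}
    best = max(CANDIDATES, key=lambda d: scores[d])
    # Fall back to comma when no candidate occurs at all.
    return best if scores[best] > 0 else ","


sample = "a;b;c\n1;2;3\n4,5;5;6\n"
print(repr(detect_delimiter(sample)))
```

Note that this sketch counts raw occurrences per line and so does not discount delimiters inside quoted fields; a fuller version would need to account for quoting.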
Reference Data
| Validation Rule | Severity | RFC 4180 Section | Description |
|---|---|---|---|
| Inconsistent Column Count | Error | §2.4 | Row has more or fewer fields than the header row |
| Unbalanced Quotes | Error | §2.5-2.7 | Opening double-quote without matching close |
| Unescaped Quote in Field | Error | §2.7 | Double-quote inside quoted field not escaped as "" |
| Delimiter in Unquoted Field | Error | §2.5 | Field contains the delimiter but is not enclosed in quotes |
| Newline in Unquoted Field | Error | §2.6 | Line break inside a field that lacks quote enclosure |
| Empty File | Error | - | File contains zero parseable rows |
| Data Type Inconsistency | Warning | - | Column contains mixed types (e.g., integers and strings) |
| Duplicate Rows | Warning | - | Two or more rows with identical content across all fields |
| Leading/Trailing Whitespace | Warning | - | Fields with spaces outside quotes that may cause match failures |
| Empty Fields | Info | - | Cells with no content (may be intentional) |
| Mixed Line Endings | Warning | §2.1 | File uses both CRLF and LF line terminators |
| Trailing Delimiter | Warning | - | Row ends with delimiter, creating an empty trailing field |
| BOM Detected | Info | - | UTF-8 Byte Order Mark found at file start |
| Single Column Detected | Warning | - | Only one column found; delimiter may be incorrect |
| Header Duplicates | Warning | - | Two or more header columns share the same name |
| Excessively Long Field | Warning | - | Field exceeds 10,000 characters; possible parse error |
| Non-UTF8 Characters | Warning | - | Characters detected outside standard UTF-8 printable range |
| Completely Empty Row | Warning | - | Row contains only delimiters with no data |
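
Two of the warning-level rules in the table, Duplicate Rows and Data Type Inconsistency, can be sketched as follows. This is an illustrative approximation rather than the tool's own logic: the function names are hypothetical, the type check only distinguishes values that parse as numbers from everything else, empty cells are ignored, and rows are assumed to be data rows of equal length (header excluded).

```python
from collections import defaultdict


def duplicate_rows(rows: list[list[str]]) -> list[list[int]]:
    """Group the indices of rows whose fields are all identical."""
    seen: dict[tuple[str, ...], list[int]] = defaultdict(list)
    for i, row in enumerate(rows):
        seen[tuple(row)].append(i)
    return [idx for idx in seen.values() if len(idx) > 1]


def infer_kind(value: str) -> str:
    """Very rough classification: 'empty', 'number', or 'text'."""
    if value == "":
        return "empty"
    try:
        float(value)
        return "number"
    except ValueError:
        return "text"


def mixed_type_columns(rows: list[list[str]]) -> dict[int, set[str]]:
    """Map column index to the kinds found, for columns with more than one."""
    mixed = {}
    for col, values in enumerate(zip(*rows)):
        kinds = {infer_kind(v) for v in values} - {"empty"}
        if len(kinds) > 1:
            mixed[col] = kinds
    return mixed


data = [["1", "a"], ["2", "b"], ["1", "a"], ["x", "c"]]
print(duplicate_rows(data))
print(mixed_type_columns(data))
```

Here rows 0 and 2 are reported as duplicates, and column 0 is flagged because it mixes numeric values with the text value `x`.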