
About

Switching a CSV file's column delimiter is deceptively error-prone. A naive find-and-replace corrupts any field that contains the delimiter character inside quoted text. This tool implements a full RFC 4180 state-machine parser that correctly handles quoted fields, escaped double-quotes (""), and embedded newlines before re-serializing with your target delimiter. It auto-detects the source delimiter by frequency analysis across the first 20 rows, counting only characters outside quoted regions.

Failure to handle quoting rules when changing delimiters is the most common cause of column-shift corruption in data pipelines. This is especially critical when moving data between European-locale systems (semicolon-delimited) and US/international systems (comma-delimited). The tool re-applies minimal quoting on output: a field is quoted only when it contains the new delimiter, a double-quote, or a newline character. Limitations: files over 50 MB may cause browser memory pressure; for streaming workloads, a CLI tool such as csvkit is more appropriate.


Formulas

The parser operates as a finite state machine with three states:

S ∈ { UNQUOTED, QUOTED, QUOTE_ESCAPE }

Transition rules govern how each character c at position i is processed:

S → QUOTED                 if c = " and S = UNQUOTED at the start of a field
S → QUOTE_ESCAPE           if c = " and S = QUOTED
emit field, advance row    if c = \n and S ≠ QUOTED
emit field                 if c = d_src and S ≠ QUOTED
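The transition rules above can be sketched as a quote-aware parser. This is a minimal illustration under the stated state machine, not the tool's actual source; the function name `parseCsv` and the inline `\r\n` handling are assumptions.

```typescript
type State = "UNQUOTED" | "QUOTED" | "QUOTE_ESCAPE";

function parseCsv(text: string, delim: string): string[][] {
  const rows: string[][] = [];
  let row: string[] = [];
  let field = "";
  let state: State = "UNQUOTED";

  const endField = () => { row.push(field); field = ""; };
  const endRow = () => { endField(); rows.push(row); row = []; };

  for (let i = 0; i < text.length; i++) {
    const c = text[i];
    if (state === "QUOTED") {
      if (c === '"') state = "QUOTE_ESCAPE";            // closing quote or start of ""
      else field += c;                                   // delimiters and newlines are literal here
    } else if (state === "QUOTE_ESCAPE") {
      if (c === '"') { field += '"'; state = "QUOTED"; } // "" -> one literal double-quote
      else { state = "UNQUOTED"; i--; }                  // quote was closing; reprocess c
    } else { // UNQUOTED
      if (c === '"' && field === "") state = "QUOTED";   // opening quote at field start
      else if (c === delim) endField();
      else if (c === "\n") endRow();
      else if (c === "\r") { if (text[i + 1] === "\n") i++; endRow(); } // \r or \r\n
      else field += c;
    }
  }
  if (field !== "" || row.length > 0) endRow();          // flush the trailing row
  return rows;
}
```

Note how a quoted field with embedded delimiters (`"b,c"`) survives intact, while a naive `split(",")` would shift every subsequent column.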

Auto-detection scores each candidate delimiter d by computing its occurrence count per row outside quoted regions. The delimiter with the lowest variance in per-row counts and count > 0 is selected:

d_best = argmin_d σ²(counts_d),  subject to counts_d > 0
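This selection rule can be sketched as follows. `countOutsideQuotes` and `detectDelimiter` are illustrative names, not the tool's API; splitting the sample on newlines before counting ignores newlines embedded in quoted fields, a simplification of the real quote-aware scanner.

```typescript
// Count delimiter occurrences in one line, skipping quoted regions.
function countOutsideQuotes(line: string, delim: string): number {
  let count = 0, inQuotes = false;
  for (const c of line) {
    if (c === '"') inQuotes = !inQuotes;        // "" toggles twice, net no change
    else if (c === delim && !inQuotes) count++;
  }
  return count;
}

// Pick the candidate with the lowest per-row count variance (and count > 0).
function detectDelimiter(sample: string, candidates = [",", ";", "\t", "|"]): string {
  const rows = sample.split(/\r\n|\r|\n/).filter(r => r.length > 0).slice(0, 20);
  let best = ",", bestVar = Infinity;
  for (const d of candidates) {
    const counts = rows.map(r => countOutsideQuotes(r, d));
    const mean = counts.reduce((a, b) => a + b, 0) / counts.length;
    if (mean === 0) continue;                    // candidate never appears: skip
    const variance = counts.reduce((a, b) => a + (b - mean) ** 2, 0) / counts.length;
    if (variance < bestVar) { bestVar = variance; best = d; }
  }
  return best;
}
```

A true delimiter appears the same number of times on every row, so its variance is near zero; stray commas inside quoted fields inflate the variance of the wrong candidates.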

On output, a field f is wrapped in quotes when:

needsQuote(f) = TRUE if f contains d_target ∨ " ∨ \n ∨ \r

Where d_src is the source delimiter and d_target is the target delimiter chosen by the user. Internal double-quotes are escaped as "" per RFC 4180.
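The minimal-quoting rule and the "" escape can be sketched like this; `serializeField` and `serializeRows` are illustrative names for the re-serialization step, not the tool's actual functions.

```typescript
// Quote a field only when the needsQuote rule fires; double any internal quotes.
function serializeField(f: string, delim: string): string {
  const needsQuote = f.includes(delim) || f.includes('"') ||
                     f.includes("\n") || f.includes("\r");
  return needsQuote ? '"' + f.replace(/"/g, '""') + '"' : f;
}

// Re-join rows with the target delimiter and Unix LF line endings.
function serializeRows(rows: string[][], delim: string): string {
  return rows.map(r => r.map(f => serializeField(f, delim)).join(delim)).join("\n");
}
```

Quoting only when necessary keeps the output byte-for-byte minimal while remaining safe to re-parse.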

Reference Data

| Delimiter | Name | Common Use | Escape Character | RFC/Standard | Typical File Extension | Locale Association | Risk Level (Collision) |
|---|---|---|---|---|---|---|---|
| `,` | Comma | General CSV | `"` (double-quote) | RFC 4180 | .csv | US, UK, International | High (numeric decimals) |
| `;` | Semicolon | European CSV | `"` | De facto | .csv | DE, FR, IT, ES, BR | Low |
| `\t` | Tab | TSV files | `"` | IANA TSV | .tsv / .tab | Universal | Very Low |
| `\|` | Pipe | Database exports | `"` or `\` | None | .csv / .dat | Enterprise / Mainframe | Very Low |
| `:` | Colon | /etc/passwd, legacy | None standard | POSIX (passwd) | .dat / .txt | Unix/Linux | Medium (time values) |
| `^` | Caret | Fixed-width alternatives | `"` | None | .dat | Niche | Very Low |
| `~` | Tilde | EDI / Mainframe | None standard | X12 EDI | .edi / .dat | Enterprise | Very Low |
| `\x01` | SOH | Binary-safe exports | None needed | None | .dat | Enterprise | None |
| (space) | Space | Fixed-width text | `"` | None | .txt | Scientific data | Extreme |
| `\x1F` | Unit Separator | ASCII control char | None needed | ASCII | .dat | Legacy systems | None |
| `#` | Hash | Config files | `"` | None | .cfg / .dat | Niche | Medium (comments) |
| `\\` | Backslash | Path data exports | `"` | None | .dat | Windows paths | High |

Frequently Asked Questions

How does the tool handle delimiters inside quoted fields?

The parser uses a finite state machine with three states: UNQUOTED, QUOTED, and QUOTE_ESCAPE. When inside the QUOTED state, delimiter characters are treated as literal field content and never trigger a column split. This is compliant with RFC 4180 Section 2, Rule 6. A naive split() approach would corrupt such fields.

What happens when a field contains the new target delimiter?

During re-serialization, every field is checked against the target delimiter. If a field contains the target delimiter character, a double-quote, or a newline, it is automatically wrapped in double-quotes. Any existing double-quotes within the field are escaped by doubling them (""). This prevents column-shift corruption regardless of field content.

How does delimiter auto-detection work?

Auto-detection parses the first 20 rows using a quote-aware scanner. It counts occurrences of each candidate delimiter only when the parser state is UNQUOTED. It then computes the variance of per-row counts. A true delimiter produces a consistent count across rows (low variance), while commas inside quoted numeric fields are invisible to the counter. The candidate with the lowest non-zero variance wins.

Are empty fields preserved?

Yes. The parser emits every field, including empty strings between consecutive delimiters and at the end of rows. If a row ends with three tabs, three empty trailing fields are preserved. The re-serializer writes the exact number of delimiters needed to maintain column-count parity, so row-to-row column counts stay consistent.

How large a file can the tool handle?

The practical limit is approximately 50 MB. The FileReader API loads the entire file into a JavaScript string, consuming roughly 2x the file size in memory (UTF-16 encoding). For files exceeding this, the browser tab may run out of memory. The tool displays a warning for files over 10 MB and blocks files over 50 MB with an error message recommending a CLI tool such as csvkit or Miller.

How are different line endings handled?

The parser normalizes all line endings. It treats \r\n (Windows), \r (old Mac), and \n (Unix) identically as row terminators when outside a quoted field. On output, the tool uses \n (Unix LF) by default. Newlines embedded inside quoted fields are preserved exactly as they appear in the source data.

Can I convert data that uses a multi-character delimiter such as "||"?

The current implementation supports single-character delimiters only. Multi-character delimiters are not part of RFC 4180 and introduce ambiguity in quoting rules. If your data uses "||", consider first replacing "||" with a single unused character (such as the ASCII Unit Separator \x1F) in a text editor, then processing the file through this tool.
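The "||" workaround described above can be sketched as a one-line pre-processing step. `collapseMultiCharDelimiter` is a hypothetical helper, not part of the tool; a plain replacement is not quote-aware, so this is only safe when the source data contains no quoted fields.

```typescript
const US = "\x1F"; // ASCII Unit Separator: a single character unlikely to occur in text data

// Replace every occurrence of a multi-character delimiter with US,
// producing a single-character-delimited file this tool can process.
function collapseMultiCharDelimiter(text: string, multi: string): string {
  return text.split(multi).join(US);
}
```

After conversion, select `\x1F` as the source delimiter and the desired single character as the target.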