About

CSV parsing appears simple until a quoted field contains your delimiter, a newline, or an escaped double quote. A naive split(",") approach fails on any file that relies on RFC 4180 quoting conventions, and miscounting fields corrupts entire downstream data pipelines. This tool implements a finite-state-machine tokenizer that correctly handles all edge cases defined in RFC 4180, including fields wrapped in double quotes ("), escaped quotes (""), and embedded newlines. The auto-detect algorithm analyzes delimiter frequency across the first 5 rows to identify comma, semicolon, tab, or pipe separation without user intervention.
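
The failure mode is easy to reproduce. The snippet below (sample data invented for illustration) shows split(",") breaking apart a quoted field:

```javascript
// A quoted field containing the delimiter defeats naive splitting.
const line = 'id,"Smith, Jane",42';

// Naive approach: the split happens inside the quoted field.
const naive = line.split(",");
// naive → [ 'id', '"Smith', ' Jane"', '42' ]

console.log(naive.length); // 4 fields instead of the expected 3
```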

Output formatting is configurable: choose a join character between fields, a row separator, optional quoting or wrapping, and whitespace trimming. The tool assumes well-formed input; malformed rows (inconsistent field counts) are flagged but still processed. Pro tip: verify that field counts match your expected column count before feeding output into another system.
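
A minimal sketch of this output stage, assuming hypothetical option names (joinChar, rowSep, wrap, trim) rather than the tool's actual API:

```javascript
// Sketch of the configurable formatter described above; option names are
// illustrative assumptions, not the tool's real interface.
function formatRows(rows, { joinChar = ", ", rowSep = "\n", wrap = "", trim = true } = {}) {
  return rows
    .map(fields =>
      fields
        .map(f => (trim ? f.trim() : f)) // optional whitespace trimming
        .map(f => wrap + f + wrap)       // optional wrapping character
        .join(joinChar)                  // join character between fields
    )
    .join(rowSep);                       // row separator between records
}

console.log(formatRows([["a", " b "], ["c", "d"]], { joinChar: " | ", wrap: '"' }));
// "a" | "b"
// "c" | "d"
```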


Formulas

The CSV parser operates as a finite-state machine with four states: FIELD_START, UNQUOTED, QUOTED, and QUOTE_IN_QUOTED. For each character c at position i in the input string S, the transition function is:

stateᵢ₊₁ = δ(stateᵢ, cᵢ)

Where δ maps: FIELD_START → QUOTED when c = ", FIELD_START → UNQUOTED otherwise, QUOTED → QUOTE_IN_QUOTED when c = ", and QUOTE_IN_QUOTED → QUOTED when c = " (an escaped quote, emitted as a single literal "). When c equals the delimiter, UNQUOTED and QUOTE_IN_QUOTED close the current field and return to FIELD_START.
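
The transition table above can be sketched as a single-record tokenizer. The state names follow the text; the function name and the lenient recovery branch in QUOTE_IN_QUOTED are illustrative assumptions:

```javascript
// Minimal sketch of the four-state tokenizer for one record (no row splitting).
function tokenizeRecord(line, delim = ",") {
  const FIELD_START = 0, UNQUOTED = 1, QUOTED = 2, QUOTE_IN_QUOTED = 3;
  let state = FIELD_START;
  let field = "";
  const fields = [];
  for (const c of line) {
    switch (state) {
      case FIELD_START:
        if (c === '"') state = QUOTED;                   // FIELD_START → QUOTED
        else if (c === delim) fields.push(field);        // empty field
        else { field += c; state = UNQUOTED; }           // FIELD_START → UNQUOTED
        break;
      case UNQUOTED:
        if (c === delim) { fields.push(field); field = ""; state = FIELD_START; }
        else field += c;
        break;
      case QUOTED:
        if (c === '"') state = QUOTE_IN_QUOTED;          // QUOTED → QUOTE_IN_QUOTED
        else field += c;                                 // delimiter is literal here
        break;
      case QUOTE_IN_QUOTED:
        if (c === '"') { field += '"'; state = QUOTED; } // "" → one literal quote
        else if (c === delim) { fields.push(field); field = ""; state = FIELD_START; }
        else { field += c; state = UNQUOTED; }           // lenient recovery (assumption)
        break;
    }
  }
  fields.push(field);
  return fields;
}

console.log(tokenizeRecord('a,"b,""c""",d')); // [ 'a', 'b,"c"', 'd' ]
```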

Auto-detection scores each candidate delimiter d by computing field counts per row:

score(d) = { consistency,  if stdev(counts) = 0
           { 0,            otherwise

Where consistency = mean field count × row count, counts is the array of field counts per row for delimiter d, and stdev is its standard deviation. The delimiter with the highest score is selected; a score of 0 indicates inconsistent splitting and eliminates that candidate.
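
A sketch of this scoring, assuming detection splits naively per candidate (quoting is ignored during the frequency pass); the function name is illustrative:

```javascript
// Score each candidate delimiter over the first 5 rows, as described above.
function detectDelimiter(text, candidates = [",", ";", "\t", "|"]) {
  const rows = text.split(/\r?\n/).filter(r => r.length > 0).slice(0, 5);
  let best = ",", bestScore = 0; // comma is the RFC 4180 fallback
  for (const d of candidates) {
    const counts = rows.map(r => r.split(d).length);
    const mean = counts.reduce((a, b) => a + b, 0) / counts.length;
    const variance = counts.reduce((a, c) => a + (c - mean) ** 2, 0) / counts.length;
    // score(d) = mean field count × row count, but only when splitting is
    // perfectly consistent (stdev = 0) and yields more than one field
    const score = variance === 0 && mean > 1 ? mean * counts.length : 0;
    if (score > bestScore) { best = d; bestScore = score; }
  }
  return best;
}

console.log(detectDelimiter("a;b;c\n1;2;3\n4;5;6")); // ";"
```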

Reference Data

Delimiter  Common Name      Character Code  Typical Use                        Auto-Detect Priority
,          Comma            U+002C          Standard CSV (RFC 4180)            1
;          Semicolon        U+003B          European CSV (Excel EU locale)     2
\t         Tab              U+0009          TSV files, database exports        3
|          Pipe             U+007C          Unix data, log files               4
"          Double Quote     U+0022          Field quoting (RFC 4180)           -
""         Escaped Quote    U+0022 × 2      Literal quote inside quoted field  -
\n         Line Feed        U+000A          Unix row separator                 -
\r\n       CRLF             U+000D U+000A   Windows row separator              -
\r         Carriage Return  U+000D          Classic Mac row separator          -
Common Output Join Patterns

Join   Name         Typical Use
", "   Comma-space  Human-readable lists
" | "  Pipe-padded  Markdown tables, logs
" & "  Ampersand    LaTeX tables
"\t"   Tab          Tab-separated output
" "    Space        Space-delimited output
";"    Semicolon    SQL value lists
RFC 4180 Rules Summary

Rule 1: Each record on a separate line, delimited by CRLF
Rule 2: The last record may or may not have a trailing CRLF
Rule 3: An optional header line with the same format as records
Rule 4: Fields may be enclosed in double quotes
Rule 5: Fields containing CRLF, double quotes, or commas must be quoted
Rule 6: A double quote inside a quoted field is escaped as ""
Rule 7: Spaces inside fields are part of the field value
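
Rules 5 through 7 together define when and how a writer must quote. A minimal sketch (quoteField is a hypothetical helper, not the tool's API):

```javascript
// Quote a field only when it contains the delimiter, a quote, or a line
// break (Rule 5), doubling any embedded quotes (Rule 6).
function quoteField(field, delim = ",") {
  if (/["\r\n]/.test(field) || field.includes(delim)) {
    return '"' + field.replace(/"/g, '""') + '"';
  }
  return field; // Rule 7: surrounding spaces are kept as-is, unquoted
}

console.log(quoteField('he said "hi"')); // "he said ""hi"""
console.log(quoteField("plain"));        // plain
```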

Frequently Asked Questions

How does delimiter auto-detection work?
The algorithm tests each candidate delimiter (comma, semicolon, tab, pipe) against the first 5 rows of input. For each delimiter, it counts the number of fields produced per row. If all rows yield the same field count (standard deviation = 0) and that count is greater than 1, the delimiter scores highly. The candidate with the highest score (consistency × row count) wins. If all candidates fail, comma is used as the RFC 4180 default.

What happens when a field contains the delimiter?
Per RFC 4180 Rule 5, any field containing the delimiter, a newline, or a double quote must be enclosed in double quotes. The parser's QUOTED state handles this correctly: characters between an opening and closing quote are treated as literal field content regardless of whether they match the delimiter. If your CSV does not quote such fields, the parser will split incorrectly, which mirrors how all compliant parsers behave.

Are malformed rows still processed?
Yes. Rows with fewer or more fields than the first row are parsed and included in the output. A warning toast is displayed noting the inconsistency. The field count from row 1 (or the header) is used as the expected count. Short rows are output as-is without padding. This is intentional: padding with empty strings could mask data corruption.

How are embedded newlines inside quoted fields handled?
The parser preserves embedded newlines as part of the field value. In the output string, these newlines appear literally within the field content. If you select a row separator of \n, the embedded newline is distinct because the field will be wrapped in your chosen quote/wrap character. If no wrapping is selected, embedded newlines become indistinguishable from row breaks; this is a known limitation. Enable field wrapping to preserve structure.
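
The limitation can be demonstrated directly; the data below is invented for illustration:

```javascript
// Without field wrapping, an embedded newline cannot be told apart from a
// row break; with wrapping, quotes mark where each field begins and ends.
const rows = [["a", "line1\nline2"], ["b", "c"]];

// No wrapping: the embedded \n looks exactly like the row separator.
const unwrapped = rows.map(r => r.join(",")).join("\n");
console.log(unwrapped.split("\n").length); // 3 apparent lines for 2 real rows

// With wrapping, the embedded \n stays visibly inside one quoted field.
const wrapped = rows.map(r => r.map(f => '"' + f + '"').join(",")).join("\n");
console.log(wrapped.includes('"line1\nline2"')); // true
```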

How large a file can the tool handle?
Processing occurs entirely in the browser using JavaScript string operations, so practical limits depend on available RAM. Files under 10 MB parse near-instantly. Files between 10 and 50 MB may cause a brief UI pause. Files over 50 MB risk triggering the browser's memory limit. For very large files, consider a command-line tool like csvkit or awk.

Does the tool convert data types?
No. CSV is inherently untyped: all fields are strings. This tool outputs all values as string text and does not attempt to infer or cast data types. A field containing "42" remains the string "42" in the output. Type inference is the responsibility of the consuming application, not the converter.