About

CSV parsing fails silently. A misplaced quote, a delimiter inside a field value, a newline embedded in a cell - any of these corrupts your output if the parser lacks RFC 4180 compliance. Most naive implementations split on commas with regex. That approach breaks the moment a field contains delimiter characters within quoted strings. This tool implements a finite-state machine parser that handles quoted fields, escaped double-quotes (""), and embedded newlines correctly. It auto-detects the delimiter by frequency analysis across the first 5 rows, supporting comma, semicolon, tab, and pipe. Limitations: the parser assumes UTF-8 encoding. Binary-encoded CSV or files exceeding 10 MB are rejected to prevent browser memory issues.

Output is a standard JSON array of objects (keyed by header names) or a nested array of arrays if no headers are specified. Indentation is configurable. Pro tip: if your CSV originates from Excel on European locales, the delimiter is almost certainly a semicolon, not a comma. Validate your delimiter choice before processing large datasets. This tool does not round, coerce, or interpret field values - every value remains a string, preserving data fidelity.

Formulas

The CSV parser operates as a finite-state machine with 4 states. For each character c in the input stream, the machine transitions between states to correctly segment fields.

S ∈ { FIELD_START, UNQUOTED, QUOTED, QUOTE_IN_QUOTED }

Transition rules:

FIELD_START → QUOTED if c = "\""
FIELD_START → UNQUOTED if c ≠ "\"" ∧ c ≠ delim
QUOTED → QUOTE_IN_QUOTED if c = "\""
QUOTE_IN_QUOTED → QUOTED if c = "\"" (escaped quote)
QUOTE_IN_QUOTED → FIELD_START if c = delim ∨ c = "\n"
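The transition rules can be sketched in Python as follows. This is an illustrative sketch, not the tool's own browser code. The names parse_csv and State are hypothetical. The rules above do not list the UNQUOTED transitions or the handling of malformed input, so the behavior chosen here (an unquoted field ends at a delimiter or newline, and a stray quote is kept as a literal character) is an assumption.

```python
from enum import Enum, auto


class State(Enum):
    FIELD_START = auto()
    UNQUOTED = auto()
    QUOTED = auto()
    QUOTE_IN_QUOTED = auto()


def parse_csv(text: str, delim: str = ",") -> list[list[str]]:
    """Parse CSV text into rows of string fields with a 4-state machine."""
    rows: list[list[str]] = []
    row: list[str] = []
    field: list[str] = []
    state = State.FIELD_START

    def end_field() -> None:
        row.append("".join(field))
        field.clear()

    def end_row() -> None:
        end_field()
        rows.append(row.copy())
        row.clear()

    # Simplification (assumption): normalize line endings up front.
    # This also rewrites CRLF inside quoted fields to LF.
    text = text.replace("\r\n", "\n").replace("\r", "\n")

    for c in text:
        if state is State.QUOTED:
            if c == '"':
                state = State.QUOTE_IN_QUOTED
            else:
                field.append(c)  # delimiters and newlines are plain content
        elif state is State.QUOTE_IN_QUOTED and c == '"':
            field.append('"')  # "" inside quotes is one literal quote
            state = State.QUOTED
        elif c == delim:
            end_field()
            state = State.FIELD_START
        elif c == "\n":
            end_row()
            state = State.FIELD_START
        elif state is State.FIELD_START and c == '"':
            state = State.QUOTED
        else:
            field.append(c)  # includes stray quotes in malformed input
            state = State.UNQUOTED

    # Flush the last record if the input does not end with a newline.
    if field or row or state is not State.FIELD_START:
        end_row()
    return rows


sample = 'name,quote\n"Doe, Jane","She said ""hello"""\n"line1\nline2",x\n'
for parsed_row in parse_csv(sample):
    print(parsed_row)
```

In this sketch a blank line yields a row with a single empty string. Whether the tool skips blank lines instead is not stated in the text above.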

Delimiter auto-detection counts the frequency of each candidate delimiter across the first n = 5 lines. The candidate with the lowest coefficient of variation (most consistent count per line) is selected:

delim = argmin_d ( σ(freq_d) / mean(freq_d) )

Where freq_d is the list of per-line counts of candidate delimiter d across the sampled lines, σ is its standard deviation, and mean is its average. A delimiter with zero variance (identical count per row) is strongly preferred. Ties are broken by priority order: comma > semicolon > tab > pipe.
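A minimal Python sketch of this selection rule is shown below. The function name detect_delimiter is hypothetical. Two details are assumptions, not taken from the text: the population standard deviation is used, and candidates that never appear (mean of zero, so the ratio is undefined) are skipped.

```python
from statistics import mean, pstdev

# Priority order used to break ties: comma > semicolon > tab > pipe.
CANDIDATES = [",", ";", "\t", "|"]


def detect_delimiter(text: str, sample_lines: int = 5) -> str:
    """Pick the candidate with the lowest coefficient of variation."""
    lines = [ln for ln in text.splitlines() if ln][:sample_lines]
    best, best_cv = CANDIDATES[0], float("inf")
    for d in CANDIDATES:
        counts = [ln.count(d) for ln in lines]
        avg = mean(counts) if counts else 0
        if avg == 0:
            continue  # assumption: a candidate that never appears is skipped
        cv = pstdev(counts) / avg
        # Strict "<" keeps the earlier, higher-priority candidate on ties.
        if cv < best_cv:
            best, best_cv = d, cv
    return best


european = "a;b;c\n1,5;2;3\n4;5;6\n"
print(repr(detect_delimiter(european)))
```

The sketch splits on physical lines and counts every occurrence, so delimiters and newlines inside quoted fields also affect the counts. A more careful version would sample parsed records instead.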

Reference Data

Delimiter	Symbol	Common Source	Auto-Detect Pattern	RFC 4180	Notes
Comma	,	US/UK Excel, Google Sheets export	Most consistent comma count per line	Yes (default)	Fails if decimal commas used
Semicolon	;	European Excel (DE, FR, IT locales)	Most consistent semicolon count per line	Extension	Common in SAP exports
Tab	\t	TSV files, database dumps	Most consistent tab count per line	Extension	Rarely appears inside fields
Pipe	|	Legacy mainframe systems, HL7	Most consistent pipe count per line	Extension	Used in medical data (HL7v2)
Quoted Field	"..."	Any source with special chars	-	Yes	Encloses fields with delimiters/newlines
Escaped Quote	""	Fields containing literal quotes	-	Yes	Two consecutive double-quotes = one literal
CRLF Line End	\r\n	Windows systems	-	Yes (required)	LF-only also accepted by most parsers
LF Line End	\n	Unix/macOS systems	-	Extension	De facto standard in web contexts
CR Line End	\r	Classic Mac OS (pre-X)	-	Extension	Extremely rare today
BOM Marker	\uFEFF	UTF-8 files from Windows Notepad	First character check	Not specified	Stripped automatically by this tool
Empty Field	,,	Sparse datasets	-	Yes	Produces empty string, not null
Newline in Field	"a\nb"	Address fields, descriptions	-	Yes	Must be inside quotes per RFC 4180
Trailing Delimiter	a,b,c,	Some DB exports	-	Ambiguous	Creates extra empty field per row
Header Row	-	Most structured exports	First row analysis	Optional	Used as JSON object keys
No Header	-	Raw sensor data, logs	-	Optional	Output becomes array of arrays

Frequently Asked Questions

The parser uses a finite-state machine with a dedicated QUOTED state. When it encounters an opening double-quote at field start, all subsequent characters (including delimiters and newlines) are treated as field content until a closing quote is found. A delimiter inside quotes does not trigger field separation. This follows RFC 4180 Section 2, Rule 6.

Per RFC 4180, a literal double-quote inside a quoted field must be escaped by preceding it with another double-quote. For example, the value She said "hello" must be encoded as "She said ""hello""" in CSV. The parser's QUOTE_IN_QUOTED state detects consecutive double-quotes and emits a single literal quote. Unescaped quotes in unquoted fields produce undefined behavior and may corrupt the parse.
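The escaping rule can be checked with Python's standard csv module, which doubles quotes by default. This illustrates the RFC 4180 convention only. It is not the tool's own parser.

```python
import csv
import io

# CSV encoding of the single value: She said "hello"
encoded = '"She said ""hello"""\r\n'

# Reading: the doubled quotes collapse to one literal quote each.
rows = list(csv.reader(io.StringIO(encoded, newline="")))
print(rows)

# Writing: the field is wrapped in quotes and inner quotes are doubled.
out = io.StringIO(newline="")
csv.writer(out).writerow(['She said "hello"'])
print(repr(out.getvalue()))
```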

CSV is an untyped format. A field containing 007 could be a numeric value 7, a string identifier, or a James Bond reference. Type coercion (converting "true" to boolean true or "42" to integer 42) is a lossy transformation that destroys information. This tool preserves raw string values to maintain data fidelity. Apply type casting downstream in your application logic where the schema is known.
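Downstream casting might look like the sketch below, assuming the consuming application knows its schema. The column names and the SCHEMA mapping are hypothetical and only illustrate the idea.

```python
from typing import Any, Callable

# Hypothetical schema: column names and casts are illustrative only.
SCHEMA: dict[str, Callable[[str], Any]] = {
    "agent_id": str,  # identifier: keep leading zeros, "007" stays "007"
    "age": int,
    "active": lambda s: s.strip().lower() == "true",
}


def apply_schema(record: dict[str, str]) -> dict[str, Any]:
    """Cast each string value with the column's caster; default to str."""
    return {key: SCHEMA.get(key, str)(value) for key, value in record.items()}


raw = {"agent_id": "007", "age": "42", "active": "true"}
print(apply_schema(raw))
```

Because the cast happens after parsing, the original strings are still available if a value fails to convert.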

The tool samples the first 5 lines and counts occurrences of each candidate delimiter (comma, semicolon, tab, pipe). The delimiter with the most consistent count across lines (lowest coefficient of variation) wins. Override auto-detection when: your file has fewer than 2 rows, the first rows are atypical (e.g., comment lines), or the file uses a rare delimiter not in the candidate set.

The limit is 10 MB. Browser JavaScript runs on the main thread. Parsing a 50 MB CSV string character-by-character can freeze the UI for several seconds and may exceed available heap memory on mobile devices. For files beyond 10 MB, use a server-side tool or a streaming parser. The 10 MB limit ensures responsive interaction on devices with as little as 2 GB RAM.

If a row has fewer fields than the header row, the missing fields are filled with empty strings in the JSON output. If a row has more fields than headers, the extra fields are assigned auto-generated keys (column_N where N is the 1-based index). This prevents data loss while keeping the output structurally valid. A warning toast is displayed when row widths are inconsistent.
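The padding and overflow rules can be sketched as follows. The function name rows_to_objects is hypothetical, and the sketch assumes N in column_N is the 1-based position of the extra field within its row.

```python
import json


def rows_to_objects(headers: list[str], rows: list[list[str]]) -> list[dict[str, str]]:
    """Map rows to objects, padding short rows and keying extra fields."""
    objects = []
    for row in rows:
        obj = {}
        for i, name in enumerate(headers):
            # Missing trailing fields become empty strings.
            obj[name] = row[i] if i < len(row) else ""
        for i in range(len(headers), len(row)):
            # Assumption: N is the 1-based column position of the extra field.
            obj[f"column_{i + 1}"] = row[i]
        objects.append(obj)
    return objects


headers = ["id", "name"]
rows = [["1", "Ann"], ["2"], ["3", "Cy", "extra"]]
print(json.dumps(rows_to_objects(headers, rows), indent=2))
```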

Yes. Windows applications (notably Notepad) prepend a 3-byte BOM (EF BB BF, or U+FEFF) to UTF-8 files. If not stripped, this invisible character becomes part of the first field value or the first header name, causing silent key mismatch bugs in downstream JSON consumers. This tool automatically detects and removes BOM characters before parsing.
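For pipelines that process the same files outside the browser, BOM removal can be sketched in Python as below. The helper name strip_bom is hypothetical. Python's built-in "utf-8-sig" codec does the same thing in one step.

```python
import codecs


def strip_bom(data: bytes) -> str:
    """Decode UTF-8 bytes, dropping a leading BOM (EF BB BF) if present."""
    if data.startswith(codecs.BOM_UTF8):
        data = data[len(codecs.BOM_UTF8):]
    return data.decode("utf-8")


with_bom = b"\xef\xbb\xbfid,name\n1,Ann\n"
text = strip_bom(with_bom)
print(repr(text.splitlines()[0]))

# Equivalent one-liner using the standard "utf-8-sig" codec.
same = with_bom.decode("utf-8-sig")
```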