About

CSV (Comma-Separated Values) parsing appears trivial until a quoted field contains an embedded comma, a newline, or a literal double-quote escaped as "". RFC 4180 defines the grammar, but real-world exports from Excel, Google Sheets, and legacy ERP systems routinely deviate with BOM markers, mixed line endings (CRLF vs LF), and semicolon delimiters dictated by locale. Incorrect parsing silently shifts columns, corrupting downstream analysis. This converter implements a character-level state-machine parser that handles all RFC 4180 edge cases, auto-detects the input delimiter, and outputs clean TXT in your choice of tab-delimited, fixed-width, pipe-separated, or space-separated format. Processing runs entirely in your browser. No data leaves your machine.

The tool enforces strict quoting rules: a field that begins with a double-quote must end with one, and internal quotes must be doubled ("" → "). Malformed rows are flagged, not silently dropped. Fixed-width output pads each column to its maximum observed width plus 2 characters, aligning data for monospaced display or legacy mainframe ingest. Files up to 50 MB are supported. For files exceeding 1 MB, parsing offloads to a Web Worker to keep the UI responsive. Note: this tool assumes UTF-8 encoding. Non-UTF-8 files may produce garbled characters in multibyte sequences.

Formulas

The CSV parser operates as a finite state machine with three states: FIELD_START, IN_QUOTED, and IN_UNQUOTED. Transitions are determined character-by-character:

FIELD_START → IN_QUOTED if char = "
FIELD_START → IN_UNQUOTED if char ≠ "
IN_QUOTED → FIELD_START if char = " ∧ next ≠ "
IN_UNQUOTED → FIELD_START if char = delimiter
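The transitions above can be sketched in a few lines. This is a minimal illustration, not the tool's actual source (which runs in the browser). The names `State` and `parse_csv` are hypothetical. Row-terminator handling, the doubled-quote rule, and raising an error on an unterminated quote are assumptions filled in from the surrounding text, since the transition list does not spell them out.

```python
from enum import Enum, auto


class State(Enum):
    FIELD_START = auto()
    IN_QUOTED = auto()
    IN_UNQUOTED = auto()


def parse_csv(text: str, delimiter: str = ",") -> list[list[str]]:
    """Parse CSV text into rows of string fields, one character at a time."""
    rows: list[list[str]] = []
    row: list[str] = []
    field: list[str] = []
    state = State.FIELD_START
    row_open = False  # True once the current row has consumed any character
    i, n = 0, len(text)

    while i < n:
        ch = text[i]
        if state is State.IN_QUOTED:
            row_open = True
            if ch != '"':
                field.append(ch)  # delimiters and newlines are literal here
            elif i + 1 < n and text[i + 1] == '"':
                field.append('"')  # doubled quote -> one literal quote
                i += 1
            else:
                state = State.FIELD_START  # closing quote
        elif ch in "\r\n":
            if ch == "\r" and i + 1 < n and text[i + 1] == "\n":
                i += 1  # CRLF counts as one row terminator
            row.append("".join(field))
            rows.append(row)
            row, field = [], []
            state, row_open = State.FIELD_START, False
        else:
            row_open = True
            if ch == delimiter:
                row.append("".join(field))
                field = []
                state = State.FIELD_START
            elif ch == '"' and state is State.FIELD_START:
                state = State.IN_QUOTED
            else:
                field.append(ch)
                state = State.IN_UNQUOTED
        i += 1

    if state is State.IN_QUOTED:
        # Assumption: an unterminated quote is reported rather than repaired.
        raise ValueError("unterminated quoted field")
    if row_open:  # final row without a trailing newline
        row.append("".join(field))
        rows.append(row)
    return rows
```

For example, `parse_csv('a,"hello, world",c\n')` yields one row with three fields, the second of which keeps its embedded comma.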

Delimiter auto-detection counts occurrences of each candidate delimiter (, ; \t |) across the first 5 lines. The delimiter with the lowest coefficient of variation in per-line counts is selected:

score = σ / μ

where σ is the standard deviation of per-line counts and μ is the mean. The candidate with the lowest score (most consistent count per line) wins. Ties are broken by priority order: comma > semicolon > tab > pipe.
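The scoring rule can be sketched as follows. This is an illustrative sketch under stated assumptions rather than the tool's code: the name `detect_delimiter` is hypothetical, the population standard deviation is used (the text does not say which variant applies), candidates that never appear are skipped because their mean is zero, and delimiters inside quoted fields are counted naively.

```python
from statistics import mean, pstdev

# Priority order doubles as the tie-breaker: comma > semicolon > tab > pipe.
CANDIDATES = [",", ";", "\t", "|"]


def detect_delimiter(text: str, sample_lines: int = 5) -> str:
    """Pick the candidate whose per-line count varies least (lowest σ / μ)."""
    lines = [ln for ln in text.splitlines() if ln][:sample_lines]
    best, best_score = ",", float("inf")  # assumption: fall back to comma
    for cand in CANDIDATES:
        counts = [ln.count(cand) for ln in lines]
        mu = mean(counts) if counts else 0
        if mu == 0:
            continue  # candidate absent: score is undefined, so skip it
        score = pstdev(counts) / mu
        # Strict "<" keeps the earlier, higher-priority candidate on ties.
        if score < best_score:
            best, best_score = cand, score
    return best
```

With the input `a;b;c`, `1,5;2;3`, `4;5;6` on three lines, the semicolon count is 2 on every line (score 0) while the comma count is 0, 1, 0, so the semicolon is selected.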

Fixed-width output computes column width as:

W_j = max_i(len(cell_{i,j})) + 2, with the maximum taken over all rows i

where W_j is the padded width for column j, and each cell is right-padded with spaces to W_j characters.
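As a rough sketch of the width formula, the hypothetical helper below assumes every row already has the same number of fields and measures length in characters, which may differ from display width for wide or combining characters.

```python
def to_fixed_width(rows: list[list[str]], gutter: int = 2) -> str:
    """Right-pad every cell to its column's maximum width plus a gutter."""
    if not rows:
        return ""
    n_cols = len(rows[0])  # assumption: rows are rectangular
    widths = [max(len(r[j]) for r in rows) + gutter for j in range(n_cols)]
    return "\n".join(
        "".join(cell.ljust(w) for cell, w in zip(r, widths)) for r in rows
    )
```

For the rows `id,name`, `1,Alice`, `22,Bo`, the column widths come out as 4 and 7, so every output line is 11 characters long.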

Reference Data

Output Format	Separator Character	Best For	Column Alignment	Readability	Import Compatibility
Tab-Delimited	\t (U+0009)	Spreadsheets, databases	Variable	Medium	Excel, SQL loaders, R, Python pandas
Fixed-Width	Space padding	Mainframes, COBOL, reports	Exact column alignment	High	FORTRAN, SAS, legacy ETL
Pipe-Delimited	| (U+007C)	Data pipelines, logs	Variable	Medium	Unix tools, awk, sed
Space-Delimited	Single space	Simple text, CLI tools	Variable	Low (if data has spaces)	cut, tr, shell scripts
Custom Delimiter	User-defined character	Proprietary formats	Variable	Varies	Application-specific
Common CSV Input Delimiters (Auto-Detected)
Comma	, (U+002C)	Default RFC 4180	-	-	Universal
Semicolon	; (U+003B)	European locale Excel exports	-	-	German, French, Italian Excel
Tab (TSV)	\t (U+0009)	Tab-separated values	-	-	Widely supported
Pipe	| (U+007C)	Medical (HL7), financial feeds	-	-	Domain-specific
RFC 4180 Quoting Rules
Plain field	No quoting required: hello
Field with comma	Must be quoted: "hello, world"
Field with newline	Must be quoted: "line1\nline2"
Field with quote	Quote doubled inside quotes: "say ""hello"""
Empty field	Two consecutive delimiters: a,,c
Quoted empty	Explicit empty: a,"",c
File Size & Performance
< 100 KB	Instant parsing on main thread (< 50 ms)
100 KB - 1 MB	Main thread, 50 - 500 ms
1 MB - 50 MB	Web Worker parsing, progress indicator shown
> 50 MB	Rejected with error (browser memory limits)

Frequently Asked Questions

The parser samples the first 5 lines and counts occurrences of each candidate delimiter (comma, semicolon, tab, pipe) per line. It then calculates the coefficient of variation (standard deviation divided by mean) for each candidate. A consistent delimiter produces nearly equal counts per line, yielding a low coefficient of variation. The candidate with the lowest score is selected. If two candidates tie, priority order is: comma, semicolon, tab, pipe. This handles European-locale Excel exports that use semicolons because commas serve as decimal separators.

Per RFC 4180, a field enclosed in double quotes may contain line breaks (CRLF or LF). The parser's IN_QUOTED state does not treat newline characters as row terminators. The field continues until a closing double-quote is found that is NOT followed by another double-quote. This means a single logical CSV row can span multiple physical lines. The converter preserves or strips these embedded newlines based on your output format choice. In tab-delimited mode, embedded newlines are replaced with a space to prevent row misalignment in the output TXT.
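The tab-delimited behaviour described here might look like the sketch below. The function name is hypothetical, and the handling of tab characters inside cells is left out because the text does not describe it.

```python
import re


def to_tab_delimited(rows: list[list[str]]) -> str:
    """Join fields with tabs, replacing embedded line breaks with a space."""
    def clean(cell: str) -> str:
        # CRLF is matched first so it collapses to one space, not two.
        return re.sub(r"\r\n|\r|\n", " ", cell)

    return "\n".join("\t".join(clean(c) for c in row) for row in rows)
```

A parsed field containing `line1`, a newline, and `line2` is therefore written as `line1 line2`, and each logical row stays on one physical line.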

Yes, rows with differing field counts are accepted. The parser does not enforce a fixed column count: if row 1 has 5 fields and row 7 has 3 fields, both are parsed as-is. In fixed-width mode, short rows are padded with empty columns to match the maximum column count observed, and a warning toast notes the discrepancy. Such rows are common in real-world exports where trailing empty fields are omitted.
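A minimal sketch of that padding step, assuming rows have already been parsed into lists of strings. The helper name and the idea of returning the short row numbers (for a warning message) are illustrative choices, not the tool's documented interface.

```python
def pad_ragged_rows(rows: list[list[str]]) -> tuple[list[list[str]], list[int]]:
    """Pad short rows with empty fields; also report which rows were short."""
    n_cols = max((len(r) for r in rows), default=0)
    # 1-based row numbers, as a user-facing warning would show them.
    short = [i + 1 for i, r in enumerate(rows) if len(r) < n_cols]
    padded = [r + [""] * (n_cols - len(r)) for r in rows]
    return padded, short
```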

The padding formula adds 2 characters beyond the maximum observed cell width for each column. This ensures visual separation between adjacent columns when displayed in a monospaced font. Without padding, columns would run together wherever a cell reaches maximum width. A two-character gutter is a common choice for plain-text reports: wide enough to read comfortably, narrow enough to avoid excessive whitespace.

The parser detects a UTF-8 BOM (U+FEFF, encoded as EF BB BF) at the start of the file and strips it before parsing. This prevents the BOM from appearing as a phantom character in the first field of the first row, which is a common issue when opening Excel-exported CSVs in Unix tools. The output TXT file is written without a BOM.
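The BOM check amounts to comparing the first three bytes. The sketch below uses a hypothetical function name and operates on raw bytes before decoding.

```python
UTF8_BOM = b"\xef\xbb\xbf"  # U+FEFF encoded as UTF-8


def decode_without_bom(raw: bytes) -> str:
    """Decode UTF-8 bytes, dropping a leading byte-order mark if present."""
    if raw.startswith(UTF8_BOM):
        raw = raw[len(UTF8_BOM):]
    return raw.decode("utf-8")
```

In Python specifically, `raw.decode("utf-8-sig")` has the same effect; the explicit version is shown only to make the byte comparison visible.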

The maximum supported file size is 50 MB. Browser-based JavaScript holds the entire file content as a string in memory. A 50 MB CSV file can expand to approximately 100 MB in memory due to UTF-16 internal string representation. Beyond this, browsers may hit memory limits or become unresponsive. Files between 1 MB and 50 MB are processed in a Web Worker to keep the UI thread free. For files exceeding 50 MB, consider a server-side tool or command-line utility like awk or csvtool.
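The size thresholds and the doubling estimate can be illustrated with a short sketch. Both function names are hypothetical, and treating 1 MB as 1024 × 1024 bytes is an assumption, since the text does not say which definition applies.

```python
MB = 1024 * 1024  # assumption: binary megabytes


def choose_strategy(size_bytes: int) -> str:
    """Map a file size to the handling described in the text."""
    if size_bytes > 50 * MB:
        return "reject"
    if size_bytes > 1 * MB:
        return "worker"
    return "main-thread"


def utf16_size(text: str) -> int:
    """Bytes needed to hold the text as UTF-16 code units."""
    return len(text.encode("utf-16-le"))
```

For ASCII-only content each character takes 1 byte in UTF-8 and 2 bytes in UTF-16, which is where the roughly twofold estimate comes from.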

In CSV, a literal double-quote inside a quoted field is escaped by doubling it: "say ""hello""" represents the value say "hello". The parser un-escapes these during parsing, so each doubled pair becomes one double-quote character. In the output TXT, the raw value (with un-doubled quotes) is written directly, since tab-delimited, pipe, and fixed-width formats do not use quote escaping. If you later need to re-import the TXT as CSV, you would need to re-apply quoting rules.
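Re-applying the quoting rules on re-import could look like the sketch below. The helper names are hypothetical. A field is quoted only when it contains the delimiter, a double-quote, or a line break, in line with the RFC 4180 rules summarized in the reference table.

```python
def quote_field(value: str, delimiter: str = ",") -> str:
    """Quote a field only if it needs it, doubling any internal quotes."""
    if any(ch in value for ch in (delimiter, '"', "\r", "\n")):
        return '"' + value.replace('"', '""') + '"'
    return value


def to_csv_line(fields: list[str], delimiter: str = ",") -> str:
    """Join already-parsed fields back into one CSV record."""
    return delimiter.join(quote_field(f, delimiter) for f in fields)
```

Here the value say "hello" is written back as "say ""hello""", while a plain value such as hello is left unquoted.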