
About

Extracting columns from large CSV files with spreadsheet software risks silent data corruption: quoted fields containing delimiters, embedded newlines, and escaped quotes are parsed incorrectly by naive splitters. This tool implements a fully RFC 4180-compliant state-machine parser that handles every edge case: fields wrapped in double quotes, literal quotes escaped as "", and multiline cell values. Delimiter detection is automatic: the comma, semicolon, tab, and pipe characters are scored across the first 5 rows to determine the most probable separator. The tool processes files up to 50 MB entirely in the browser, with zero server upload.

Limitation: encoding is assumed UTF-8. Files encoded in legacy charsets (Shift-JIS, Windows-1252) may produce garbled headers. BOM markers are stripped automatically. For files exceeding 50 MB, consider chunked command-line tools like csvkit or awk. Pro tip: always verify your output row count matches the source. A mismatch signals unescaped newlines inside fields that your original exporter failed to quote.
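The row-count check suggested above can be sketched in a few lines; the sample data and variable names here are illustrative only, not part of the tool:

```typescript
// Hypothetical sanity check: a quoted field containing a newline makes a
// naive line count exceed the true logical row count.
const raw = 'id,note\n1,"first line\nsecond line"\n';

// Naive split on newlines counts 3 physical lines...
const naiveRows = raw.split("\n").filter(l => l.length > 0).length;

// ...but an RFC 4180-aware parse of the same input yields only 2 logical
// rows (header + one record). The mismatch flags an embedded newline
// rather than a real extra record.
```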


Formulas

The CSV parser operates as a finite state machine with four states. Given an input string S of length n, each character S[i] triggers a state transition:

q0 FIELD_START: decide quoted vs unquoted
q1 UNQUOTED: accumulate until delimiter or EOL
q2 QUOTED: accumulate until closing quote
q3 QUOTE_IN_QUOTED: escape or end field
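A minimal TypeScript sketch of this four-state machine is shown below. It is illustrative, not the tool's actual source; it assumes line endings have already been normalized to LF, and it ignores malformed input such as stray characters after a closing quote:

```typescript
// Illustrative four-state RFC 4180 parser; states mirror q0..q3 above.
type State = "FIELD_START" | "UNQUOTED" | "QUOTED" | "QUOTE_IN_QUOTED";

function parseCsv(input: string, delim = ","): string[][] {
  const rows: string[][] = [];
  let row: string[] = [];
  let field = "";
  let state: State = "FIELD_START";

  const endField = () => { row.push(field); field = ""; state = "FIELD_START"; };
  const endRow = () => { endField(); rows.push(row); row = []; };

  for (const c of input) {
    switch (state) {
      case "FIELD_START": // q0: decide quoted vs unquoted
        if (c === '"') state = "QUOTED";
        else if (c === delim) endField();
        else if (c === "\n") endRow();
        else { field += c; state = "UNQUOTED"; }
        break;
      case "UNQUOTED": // q1: accumulate until delimiter or EOL
        if (c === delim) endField();
        else if (c === "\n") endRow();
        else field += c;
        break;
      case "QUOTED": // q2: delimiters and newlines are literal here
        if (c === '"') state = "QUOTE_IN_QUOTED";
        else field += c;
        break;
      case "QUOTE_IN_QUOTED": // q3: "" -> literal quote, else end of field
        if (c === '"') { field += '"'; state = "QUOTED"; }
        else if (c === delim) endField();
        else if (c === "\n") endRow();
        break;
    }
  }
  if (field.length > 0 || row.length > 0) endRow(); // flush final row
  return rows;
}
```

For example, `"New York, NY"` survives as one field, and `"he said ""hi"""` parses to `he said "hi"`.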

Delimiter auto-detection scores each candidate delimiter d across the first k = 5 lines. The score function counts consistent column counts:

score(d) = Σ_{i=1..k} [count(d, line_i) = count(d, line_1)]

where [·] is the Iverson bracket: 1 if the line's delimiter count matches line 1, and 0 otherwise.

The delimiter with the highest score and at least 1 occurrence is selected. Ties are broken by priority order: , > ; > \t > |.
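The scoring rule can be sketched as follows. This is an illustrative implementation, not the tool's actual source, and it counts raw character occurrences without accounting for delimiters inside quoted fields:

```typescript
// Score each candidate delimiter by how many of the first k lines have the
// same occurrence count as line 1; ties fall back to priority , > ; > \t > |.
function detectDelimiter(text: string, k = 5): string {
  const candidates = [",", ";", "\t", "|"]; // listed in priority order
  const lines = text.split(/\r?\n/).filter(l => l.length > 0).slice(0, k);
  let best = ",";
  let bestScore = -1;
  for (const d of candidates) {
    const count = (line: string) => line.split(d).length - 1;
    const first = count(lines[0] ?? "");
    if (first < 1) continue; // require at least one occurrence
    const score = lines.filter(l => count(l) === first).length;
    // Strict > keeps the earlier (higher-priority) candidate on ties.
    if (score > bestScore) { best = d; bestScore = score; }
  }
  return best;
}
```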

Where: S = raw CSV input string, n = total character count, d = candidate delimiter, k = number of sample lines used for detection, q0..q3 = parser states.

Reference Data

| Item | Character | Common Sources | RFC 4180 | Codepoint | Notes |
|---|---|---|---|---|---|
| Comma | , | Excel (US/UK), Google Sheets export | Yes (default) | U+002C | Most universal CSV delimiter |
| Semicolon | ; | Excel (EU locales: DE, FR, IT) | No | U+003B | Used where comma is decimal separator |
| Tab | \t | TSV exports, database dumps | No | U+0009 | Rarely appears inside field values |
| Pipe | \| | Legacy mainframe exports, SAP | No | U+007C | Good for data containing commas |
| Double Quote | " | Field enclosure (all sources) | Yes | U+0022 | Escaped as "" inside fields |
| CRLF | \r\n | Windows-origin files | Yes (required) | U+000D U+000A | Normalized to LF during parse |
| LF | \n | Unix/Mac origin files | No (tolerated) | U+000A | Accepted by most parsers |
| BOM | \uFEFF | Excel UTF-8 export | No | U+FEFF | Invisible; corrupts first header if not stripped |
| Max Field Size | - | RFC recommendation | Unspecified | - | This tool supports up to 1 MB per field |
| Max Columns | - | Practical limit | Unspecified | - | This tool tested up to 500 columns |
| Empty Field | ,, | All sources | Yes | - | Parsed as empty string, not null |
| Quoted Empty | "", | Some ORMs | Yes | - | Equivalent to empty field |
| Newline in Field | "line1\nline2" | Textarea exports, CRM notes | Yes | - | Must be enclosed in double quotes |
| Header Row | - | Convention | Optional | - | First row assumed header by default |
| Trailing Delimiter | a,b,c, | Some ETL tools | Ambiguous | - | Creates extra empty column |

Frequently Asked Questions

How are commas and newlines inside fields handled?

The parser uses a state machine compliant with RFC 4180. When it encounters an opening double quote at field start, it enters the QUOTED state and accumulates all characters, including delimiters and newlines, until it finds a closing double quote followed by a delimiter or end-of-line. A literal double quote inside a quoted field must be escaped as two consecutive quotes (""). This means a field like "New York, NY" is correctly parsed as a single value, not split into two columns.
What happens when rows have inconsistent column counts?

The tool uses the header row (row 1) as the canonical column count. Rows with fewer fields are padded with empty strings to match. Rows with more fields than the header are truncated, and a warning toast is displayed indicating the number of inconsistent rows detected. This prevents index-shift errors in your extracted output.
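The pad/truncate rule described above can be sketched as a small helper; the function name and return shape are illustrative, not the tool's actual API:

```typescript
// Normalize every row to the header's width: pad short rows with "" and
// truncate long rows, counting how many rows needed fixing.
function normalizeRows(rows: string[][]): { rows: string[][]; inconsistent: number } {
  const width = rows[0]?.length ?? 0; // header row sets the canonical width
  let inconsistent = 0;
  const out = rows.map((r, i) => {
    if (i === 0 || r.length === width) return r;
    inconsistent++;
    return r.length < width
      ? [...r, ...Array(width - r.length).fill("")] // pad short rows
      : r.slice(0, width);                          // truncate long rows
  });
  return { rows: out, inconsistent };
}
```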
Can I process a file without a header row?

Yes. Uncheck the "First row is header" option. Columns will be labeled Column 1, Column 2, etc. based on their positional index. The first row will be treated as data, not as labels, and will appear in your extracted output.
How does automatic delimiter detection work?

The tool samples the first 5 lines of the file and tests four candidate delimiters: comma, semicolon, tab, and pipe. For each candidate, it counts the number of occurrences per line. The candidate that produces the most consistent column count across all sample lines wins. If two candidates tie, priority follows the order: comma > semicolon > tab > pipe. You can override this by manually selecting a delimiter.
How are character encoding and line endings handled?

The file is read as UTF-8 text. A BOM (byte order mark, U+FEFF) at position 0 is automatically stripped. Line endings are normalized internally to LF (\n) during parsing. The exported file uses LF line endings and UTF-8 encoding without BOM. If your source file uses a non-UTF-8 encoding (e.g., Latin-1), characters outside the ASCII range may appear garbled.
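A minimal pre-processing sketch matching the behavior described, assuming the file has already been decoded to a string; the function name is illustrative:

```typescript
// Strip a leading BOM (U+FEFF) and normalize CRLF/CR line endings to LF
// before the parser ever sees the text.
function preprocess(raw: string): string {
  const noBom = raw.charCodeAt(0) === 0xfeff ? raw.slice(1) : raw;
  return noBom.replace(/\r\n?/g, "\n"); // \r\n and bare \r both become \n
}
```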
How large a file can the tool handle?

The tool enforces a 50 MB limit to prevent browser tab crashes. Parsing happens entirely in the browser's main thread with optimized string operations. For a typical CSV with 20 columns, this supports approximately 500,000 to 1,000,000 rows depending on field sizes. For larger datasets, command-line tools like csvkit, awk, or Python's csv module are recommended.