
About

Extracting columns from large CSV files with spreadsheet software risks silent data corruption: quoted fields containing delimiters, embedded newlines, and escaped quotes are parsed incorrectly by naive splitters. This tool implements a fully RFC 4180-compliant state-machine parser that handles every edge case: fields wrapped in double quotes, literal quotes escaped as "", and multiline cell values. Delimiter detection is automatic: the comma, semicolon, tab, and pipe characters are scored across the first 5 rows to determine the most probable separator. The tool processes files up to 50 MB entirely in the browser, with zero server upload.

Limitation: encoding is assumed UTF-8. Files encoded in legacy charsets (Shift-JIS, Windows-1252) may produce garbled headers. BOM markers are stripped automatically. For files exceeding 50 MB, consider chunked command-line tools like csvkit or awk. Pro tip: always verify your output row count matches the source. A mismatch signals unescaped newlines inside fields that your original exporter failed to quote.
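The row-count check suggested above can be sketched in a few lines; the sample data and variable names here are illustrative only, not part of the tool:

```typescript
// Hypothetical sanity check: a quoted field containing a newline makes a
// naive line count exceed the true logical row count.
const raw = 'id,note\n1,"first line\nsecond line"\n';

// Naive split on newlines counts 3 physical lines...
const naiveRows = raw.split("\n").filter(l => l.length > 0).length;

// ...but an RFC 4180-aware parse of the same input yields only 2 logical
// rows (header + one record). The mismatch flags an embedded newline
// rather than a real extra record.
```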


Formulas

The CSV parser operates as a finite state machine with four states. Given an input string S of length n, each character S[i] triggers a state transition:

q0 FIELD_START: decide quoted vs unquoted
q1 UNQUOTED: accumulate until delimiter or EOL
q2 QUOTED: accumulate until closing quote
q3 QUOTE_IN_QUOTED: escape or end field
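A minimal TypeScript sketch of this four-state machine is shown below. It is illustrative, not the tool's actual source; it assumes line endings have already been normalized to LF, and it ignores malformed input such as stray characters after a closing quote:

```typescript
// Illustrative four-state RFC 4180 parser; states mirror q0..q3 above.
type State = "FIELD_START" | "UNQUOTED" | "QUOTED" | "QUOTE_IN_QUOTED";

function parseCsv(input: string, delim = ","): string[][] {
  const rows: string[][] = [];
  let row: string[] = [];
  let field = "";
  let state: State = "FIELD_START";

  const endField = () => { row.push(field); field = ""; state = "FIELD_START"; };
  const endRow = () => { endField(); rows.push(row); row = []; };

  for (const c of input) {
    switch (state) {
      case "FIELD_START": // q0: decide quoted vs unquoted
        if (c === '"') state = "QUOTED";
        else if (c === delim) endField();
        else if (c === "\n") endRow();
        else { field += c; state = "UNQUOTED"; }
        break;
      case "UNQUOTED": // q1: accumulate until delimiter or EOL
        if (c === delim) endField();
        else if (c === "\n") endRow();
        else field += c;
        break;
      case "QUOTED": // q2: delimiters and newlines are literal here
        if (c === '"') state = "QUOTE_IN_QUOTED";
        else field += c;
        break;
      case "QUOTE_IN_QUOTED": // q3: "" -> literal quote, else end of field
        if (c === '"') { field += '"'; state = "QUOTED"; }
        else if (c === delim) endField();
        else if (c === "\n") endRow();
        break;
    }
  }
  if (field.length > 0 || row.length > 0) endRow(); // flush final row
  return rows;
}
```

For example, `"New York, NY"` survives as one field, and `"he said ""hi"""` parses to `he said "hi"`.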

Delimiter auto-detection scores each candidate delimiter d across the first k = 5 lines. The score function counts consistent column counts:

score(d) = Σ_{i=1..k} [count(d, line_i) = count(d, line_1)]

where [·] is the Iverson bracket: 1 if the line's delimiter count matches line 1, and 0 otherwise.

The delimiter with the highest score and at least 1 occurrence is selected. Ties are broken by priority order: , > ; > \t > |.
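The scoring rule can be sketched as follows. This is an illustrative implementation, not the tool's actual source, and it counts raw character occurrences without accounting for delimiters inside quoted fields:

```typescript
// Score each candidate delimiter by how many of the first k lines have the
// same occurrence count as line 1; ties fall back to priority , > ; > \t > |.
function detectDelimiter(text: string, k = 5): string {
  const candidates = [",", ";", "\t", "|"]; // listed in priority order
  const lines = text.split(/\r?\n/).filter(l => l.length > 0).slice(0, k);
  let best = ",";
  let bestScore = -1;
  for (const d of candidates) {
    const count = (line: string) => line.split(d).length - 1;
    const first = count(lines[0] ?? "");
    if (first < 1) continue; // require at least one occurrence
    const score = lines.filter(l => count(l) === first).length;
    // Strict > keeps the earlier (higher-priority) candidate on ties.
    if (score > bestScore) { best = d; bestScore = score; }
  }
  return best;
}
```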

Where: S = raw CSV input string, n = total character count, d = candidate delimiter, k = number of sample lines used for detection, q0..q3 = parser states.

Reference Data

| Item | Character | Common Sources | RFC 4180 | Codepoint | Notes |
|---|---|---|---|---|---|
| Comma | , | Excel (US/UK), Google Sheets export | Yes (default) | U+002C | Most universal CSV delimiter |
| Semicolon | ; | Excel (EU locales: DE, FR, IT) | No | U+003B | Used where comma is decimal separator |
| Tab | \t | TSV exports, database dumps | No | U+0009 | Rarely appears inside field values |
| Pipe | \| | Legacy mainframe exports, SAP | No | U+007C | Good for data containing commas |
| Double Quote | " | Field enclosure (all sources) | Yes | U+0022 | Escaped as "" inside fields |
| CRLF | \r\n | Windows-origin files | Yes (required) | U+000D U+000A | Normalized to LF during parse |
| LF | \n | Unix/Mac origin files | No (tolerated) | U+000A | Accepted by most parsers |
| BOM | \uFEFF | Excel UTF-8 export | No | U+FEFF | Invisible; corrupts first header if not stripped |
| Max Field Size | - | RFC recommendation | Unspecified | - | This tool supports up to 1 MB per field |
| Max Columns | - | Practical limit | Unspecified | - | This tool tested up to 500 columns |
| Empty Field | ,, | All sources | Yes | - | Parsed as empty string, not null |
| Quoted Empty | "", | Some ORMs | Yes | - | Equivalent to empty field |
| Newline in Field | "line1\nline2" | Textarea exports, CRM notes | Yes | - | Must be enclosed in double quotes |
| Header Row | - | Convention | Optional | - | First row assumed header by default |
| Trailing Delimiter | a,b,c, | Some ETL tools | Ambiguous | - | Creates extra empty column |

Frequently Asked Questions

How are commas and newlines inside fields handled?

The parser uses a state machine compliant with RFC 4180. When it encounters an opening double quote at field start, it enters the QUOTED state and accumulates all characters, including delimiters and newlines, until it finds a closing double quote followed by a delimiter or end-of-line. A literal double quote inside a quoted field must be escaped as two consecutive quotes (""). This means a field like "New York, NY" is correctly parsed as a single value, not split into two columns.
What happens when rows have inconsistent column counts?

The tool uses the header row (row 1) as the canonical column count. Rows with fewer fields are padded with empty strings to match. Rows with more fields than the header are truncated, and a warning toast is displayed indicating the number of inconsistent rows detected. This prevents index-shift errors in your extracted output.
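The pad/truncate rule described above can be sketched as a small helper; the function name and return shape are illustrative, not the tool's actual API:

```typescript
// Normalize every row to the header's width: pad short rows with "" and
// truncate long rows, counting how many rows needed fixing.
function normalizeRows(rows: string[][]): { rows: string[][]; inconsistent: number } {
  const width = rows[0]?.length ?? 0; // header row sets the canonical width
  let inconsistent = 0;
  const out = rows.map((r, i) => {
    if (i === 0 || r.length === width) return r;
    inconsistent++;
    return r.length < width
      ? [...r, ...Array(width - r.length).fill("")] // pad short rows
      : r.slice(0, width);                          // truncate long rows
  });
  return { rows: out, inconsistent };
}
```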
Can I process a file without a header row?

Yes. Uncheck the "First row is header" option. Columns will be labeled Column 1, Column 2, etc. based on their positional index. The first row will be treated as data, not as labels, and will appear in your extracted output.
How does automatic delimiter detection work?

The tool samples the first 5 lines of the file and tests four candidate delimiters: comma, semicolon, tab, and pipe. For each candidate, it counts the number of occurrences per line. The candidate that produces the most consistent column count across all sample lines wins. If two candidates tie, priority follows the order: comma > semicolon > tab > pipe. You can override this by manually selecting a delimiter.
How are character encoding and line endings handled?

The file is read as UTF-8 text. A BOM (byte order mark, U+FEFF) at position 0 is automatically stripped. Line endings are normalized internally to LF (\n) during parsing. The exported file uses LF line endings and UTF-8 encoding without BOM. If your source file uses a non-UTF-8 encoding (e.g., Latin-1), characters outside the ASCII range may appear garbled.
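A minimal pre-processing sketch matching the behavior described, assuming the file has already been decoded to a string; the function name is illustrative:

```typescript
// Strip a leading BOM (U+FEFF) and normalize CRLF/CR line endings to LF
// before the parser ever sees the text.
function preprocess(raw: string): string {
  const noBom = raw.charCodeAt(0) === 0xfeff ? raw.slice(1) : raw;
  return noBom.replace(/\r\n?/g, "\n"); // \r\n and bare \r both become \n
}
```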
How large a file can the tool handle?

The tool enforces a 50 MB limit to prevent browser tab crashes. Parsing happens entirely in the browser's main thread with optimized string operations. For a typical CSV with 20 columns, this supports approximately 500,000 to 1,000,000 rows depending on field sizes. For larger datasets, command-line tools like csvkit, awk, or Python's csv module are recommended.