CSV Column Extractor
Extract specific columns from CSV files instantly. Upload, select columns, preview data, and download the filtered CSV. RFC 4180 compliant.
About
Extracting columns from large CSV files with spreadsheet software risks silent data corruption, and naive line-and-delimiter splitters misparse quoted fields that contain delimiters, embedded newlines, or escaped quotes. This tool implements an RFC 4180-compliant state machine parser that handles these cases: fields wrapped in double quotes, literal quotes escaped as "", and multiline cell values. Delimiter detection is automatic: the tool scores , ; \t and | across the first 5 lines and picks the most probable separator. Files up to 50 MB are processed entirely in the browser and are never uploaded to a server.
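To illustrate the failure mode, the short Python sketch below compares a naive split with a quote-aware parse of the same record. It uses Python's standard csv module as a stand-in for an RFC 4180-style parser; the sample record is made up for the example and is not taken from the tool.

```python
import csv

# One record with three fields: a comma inside quotes and an escaped quote.
line = 'id,"Smith, J","said ""hi"""'

# Naive approach: split on every comma, ignoring quoting rules.
naive = line.split(",")

# Quote-aware approach: csv.reader honours double-quoted fields and "" escapes.
parsed = next(csv.reader([line]))

print(len(naive), naive)    # the quoted comma splits one field into two
print(len(parsed), parsed)  # three fields, quotes unescaped
```

The naive split reports four fields and leaves the quote characters in place, while the quote-aware parse returns the three intended values.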
Limitation: input is assumed to be UTF-8. Files in legacy encodings (Shift-JIS, Windows-1252) may produce garbled headers. BOM markers are stripped automatically. For files exceeding 50 MB, consider command-line tools such as csvkit or awk. Pro tip: always verify that the output row count matches the source. A mismatch usually signals newlines inside fields that the original exporter failed to quote.
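For files too large for the browser, the same extraction can be approximated with a short streaming script. The following is a minimal sketch using Python's standard csv module, not the tool's own code; the function name `extract_columns` and its parameters are hypothetical. It returns the number of data rows written so the count can be compared against the source.

```python
import csv
import io


def extract_columns(src, dst, wanted, delimiter=","):
    """Copy only the columns named in `wanted` from src to dst.

    src and dst are text file objects opened with newline="".
    Returns the number of data rows written (header excluded).
    """
    reader = csv.reader(src, delimiter=delimiter)
    writer = csv.writer(dst, delimiter=delimiter, lineterminator="\r\n")
    header = next(reader)  # assumption: the first row is the header
    # header.index raises ValueError if a requested column is missing.
    indexes = [header.index(name) for name in wanted]
    writer.writerow(wanted)
    rows = 0
    for record in reader:
        # Pad short records with empty strings instead of failing.
        writer.writerow([record[i] if i < len(record) else "" for i in indexes])
        rows += 1
    return rows


# Small in-memory demonstration with a quoted comma and an embedded newline.
src = io.StringIO('id,name,note\r\n1,"Smith, J","line1\nline2"\r\n2,Lee,\r\n', newline="")
dst = io.StringIO(newline="")
written = extract_columns(src, dst, ["name", "note"])
print(written)
```

With real files, opening the source as `open(path, newline="", encoding="utf-8-sig")` lets Python drop a leading BOM before the header is read.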
Formulas
The CSV parser operates as a finite state machine with 4 states (q0 to q3). Given an input string S of length n, the parser reads S once from left to right, and each character S[i] triggers exactly one state transition.
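As a rough illustration, a four-state parser of this kind could look like the Python sketch below. The state names, their mapping to q0..q3, and the lenient handling of text after a closing quote are assumptions made for the example, not the tool's actual source.

```python
from enum import Enum, auto


class State(Enum):
    FIELD_START = auto()      # assumed q0: at the start of a field
    UNQUOTED = auto()         # assumed q1: inside an unquoted field
    QUOTED = auto()           # assumed q2: inside a quoted field
    QUOTE_IN_QUOTED = auto()  # assumed q3: saw a quote inside a quoted field


def parse_csv(text, delimiter=","):
    rows, row, field = [], [], []
    state = State.FIELD_START

    def end_field():
        row.append("".join(field))
        field.clear()

    def end_row():
        nonlocal row
        end_field()
        rows.append(row)
        row = []

    # Normalize CRLF and bare CR to LF so a single newline check suffices.
    text = text.replace("\r\n", "\n").replace("\r", "\n")

    for ch in text:
        if state is State.QUOTED:
            if ch == '"':
                state = State.QUOTE_IN_QUOTED
            else:
                field.append(ch)  # delimiters and newlines are literal here
        elif state is State.QUOTE_IN_QUOTED:
            if ch == '"':
                field.append('"')  # "" is an escaped literal quote
                state = State.QUOTED
            elif ch == delimiter:
                end_field()
                state = State.FIELD_START
            elif ch == "\n":
                end_row()
                state = State.FIELD_START
            else:
                field.append(ch)  # lenient: keep text after a closing quote
                state = State.UNQUOTED
        else:  # FIELD_START or UNQUOTED
            if ch == '"' and state is State.FIELD_START:
                state = State.QUOTED
            elif ch == delimiter:
                end_field()
                state = State.FIELD_START
            elif ch == "\n":
                end_row()
                state = State.FIELD_START
            else:
                field.append(ch)
                state = State.UNQUOTED

    # Flush the last record when the input has no trailing newline.
    if field or row or state is not State.FIELD_START:
        end_row()
    return rows


rows = parse_csv('name,note\r\n"Smith, J","said ""hi""\nthen left"\r\n')
print(rows)
```

Each character is examined once, so the running time grows linearly with n. In this sketch an empty line yields a row containing one empty field, which a production parser might prefer to skip.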
Delimiter auto-detection scores each candidate delimiter d across the first k = 5 lines. The score rewards candidates that produce the same column count on every sampled line.
The delimiter with the highest score and at least 1 occurrence is selected. Ties are broken by priority order: , > ; > \t > |.
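The selection rule can be sketched in Python as follows. The exact score function is an assumption: here the score of a candidate is the number of sampled lines that share its most common non-zero occurrence count. The names `score` and `detect_delimiter` are hypothetical, and for brevity the sketch counts raw occurrences without excluding delimiters inside quoted fields.

```python
from collections import Counter

# Candidates listed in tie-break priority order: , > ; > \t > |
CANDIDATES = [",", ";", "\t", "|"]


def score(lines, d):
    """Number of sample lines that agree on the most common non-zero count of d."""
    nonzero = [c for c in (line.count(d) for line in lines) if c > 0]
    if not nonzero:
        return 0  # a delimiter that never occurs cannot be selected
    return Counter(nonzero).most_common(1)[0][1]


def detect_delimiter(text, k=5):
    """Pick the best-scoring candidate from the first k non-empty lines."""
    lines = [ln for ln in text.splitlines() if ln][:k]
    best, best_score = CANDIDATES[0], 0  # falls back to comma if nothing scores
    for d in CANDIDATES:
        s = score(lines, d)
        if s > best_score:  # strict > keeps the higher-priority candidate on ties
            best, best_score = d, s
    return best


sample = "price;qty\n1,5;2\n2,5;3\n"  # EU-style export with decimal commas
print(repr(detect_delimiter(sample)))
```

In the sample, the comma appears on only two of the three lines while the semicolon appears once on every line, so the semicolon scores higher.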
Where: S = raw CSV input string, n = total character count, d = candidate delimiter, k = number of sample lines for detection, q0 to q3 = parser states.
Reference Data
| Delimiter | Character | Common Sources | RFC 4180 | Unicode Codepoint | Notes |
|---|---|---|---|---|---|
| Comma | , | Excel (US/UK), Google Sheets export | Yes (default) | U+002C | Most universal CSV delimiter |
| Semicolon | ; | Excel (EU locales: DE, FR, IT) | No | U+003B | Used where comma is decimal separator |
| Tab | \t | TSV exports, database dumps | No | U+0009 | Rarely appears inside field values |
| Pipe | \| | Legacy mainframe exports, SAP | No | U+007C | Good for data containing commas |
| Double Quote | " | Field enclosure (all sources) | Yes | U+0022 | Escaped as "" inside fields |
| CRLF | \r\n | Windows-origin files | Yes (required) | U+000D U+000A | Normalized to LF during parse |
| LF | \n | Unix/Mac origin files | No (tolerated) | U+000A | Accepted by most parsers |
| BOM | \uFEFF | Excel UTF-8 export | No | U+FEFF | Invisible; corrupts first header if not stripped |
| Max Field Size | - | RFC recommendation | Unspecified | - | This tool supports up to 1 MB per field |
| Max Columns | - | Practical limit | Unspecified | - | This tool tested up to 500 columns |
| Empty Field | ,, | All sources | Yes | - | Parsed as empty string, not null |
| Quoted Empty | "", | Some ORMs | Yes | - | Equivalent to empty field |
| Newline in Field | "line1\nline2" | Textarea exports, CRM notes | Yes | - | Must be enclosed in double quotes |
| Header Row | - | Convention | Optional | - | First row assumed header by default |
| Trailing Delimiter | a,b,c, | Some ETL tools | Ambiguous | - | Creates extra empty column |