CSV Column Delimiter Changer
Change CSV column delimiters between comma, semicolon, tab, pipe, and custom characters. RFC 4180-compliant parser handles quoted fields correctly.
About
Changing a CSV delimiter appears trivial until a quoted field contains the delimiter character itself. A naive find-and-replace corrupts the data. This tool implements a full RFC 4180-compliant state machine parser that tracks whether the scanner is inside a quoted region, correctly handling escaped quotes (""), embedded newlines, and Unicode BOM prefixes. The auto-detect algorithm samples the first 20 lines outside quoted regions and selects the candidate delimiter with the lowest variance in per-line occurrence count. Typical failure scenario: importing a European-locale CSV (semicolon-delimited) into a system expecting commas destroys numeric columns where commas serve as decimal separators.
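The failure mode described above can be reproduced in a few lines. The sketch below uses Python's standard csv module as a stand-in for the tool's own parser, so it illustrates the problem rather than the tool's implementation; the helper names naive_convert, parsed_convert, and read_rows are hypothetical.

```python
import csv
import io

# A comma-delimited file whose quoted fields contain commas.
SOURCE = 'name,price\n"Smith, John","3,14"\n'


def naive_convert(text: str) -> str:
    """Find-and-replace: also rewrites commas inside quoted fields."""
    return text.replace(",", ";")


def parsed_convert(text: str, src: str = ",", dst: str = ";") -> str:
    """Parse with the source delimiter, then re-serialize with the target."""
    rows = csv.reader(io.StringIO(text, newline=""), delimiter=src)
    out = io.StringIO(newline="")
    csv.writer(out, delimiter=dst, lineterminator="\n").writerows(rows)
    return out.getvalue()


def read_rows(text: str, delimiter: str) -> list[list[str]]:
    """Parse text into a list of rows for comparison."""
    return list(csv.reader(io.StringIO(text, newline=""), delimiter=delimiter))


original = read_rows(SOURCE, ",")
after_naive = read_rows(naive_convert(SOURCE), ";")
after_parsed = read_rows(parsed_convert(SOURCE), ";")

print(original[1])      # field values before conversion
print(after_naive[1])   # values altered by find-and-replace
print(after_parsed[1])  # values preserved by parse-then-serialize
```

Comparing the parsed rows, rather than the raw text, makes the corruption visible: the naive version changes the field values themselves, not just the separators between them.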
The converter re-serializes each field, applying quoting only when the target delimiter, a double-quote, or a newline appears within the field value. This minimizes unnecessary quoting and keeps output files compact. Limitation: this tool treats the input as plain text. It does not validate data types, detect encoding beyond UTF-8/ASCII, or handle fixed-width formats. For files exceeding 1 MB, processing moves to a background thread to prevent UI blocking.
Formulas
The parser operates as a finite state machine with three states:
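A minimal sketch of such a parser follows, assuming the three states are UNQUOTED, QUOTED, and QUOTE_IN_QUOTED (a quote was just seen inside a quoted region and may be either an escape or the closing quote). The state names and the parse_csv function are illustrative and not taken from the tool's source.

```python
from enum import Enum, auto


class State(Enum):
    UNQUOTED = auto()         # outside any quoted region
    QUOTED = auto()           # inside a quoted region
    QUOTE_IN_QUOTED = auto()  # saw '"' inside quotes: escape or closing quote


def parse_csv(text: str, delimiter: str = ",") -> list[list[str]]:
    """Parse CSV text into rows of fields (assumed, simplified behavior)."""
    if text.startswith("\ufeff"):  # strip a Unicode BOM prefix
        text = text[1:]
    rows, row, buf = [], [], []
    state = State.UNQUOTED
    dirty = False  # True once the current record has any content

    for ch in text:
        if state is State.QUOTED:
            if ch == '"':
                state = State.QUOTE_IN_QUOTED
            else:
                buf.append(ch)  # delimiters and newlines are literal here
            continue
        if state is State.QUOTE_IN_QUOTED:
            if ch == '"':       # "" is an escaped quote
                buf.append('"')
                state = State.QUOTED
                continue
            state = State.UNQUOTED  # previous quote closed the region
        # UNQUOTED handling (also reached right after a closing quote)
        if ch == '"' and not buf:
            state = State.QUOTED
            dirty = True
        elif ch == delimiter:
            row.append("".join(buf))
            buf.clear()
            dirty = True
        elif ch == "\n":
            if dirty:           # blank lines are skipped in this sketch
                row.append("".join(buf))
                buf.clear()
                rows.append(row)
                row = []
                dirty = False
        elif ch != "\r":        # CR outside quotes is dropped (handles CRLF)
            buf.append(ch)
            dirty = True

    if dirty:                   # final record without a trailing newline
        row.append("".join(buf))
        rows.append(row)
    return rows
```

For example, parse_csv('"Smith; John";"said ""hi"""', ";") yields a single row with the fields `Smith; John` and `said "hi"`. The sketch is lenient about malformed input, such as stray quotes in the middle of an unquoted field, which it keeps as literal characters.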
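Delimiter auto-detection scores each candidate $d$ by computing the variance $\sigma^2$ of per-line counts across the sample:

$$\sigma^2 = \frac{1}{n} \sum_{i=1}^{n} \left(c_i - \bar{c}\right)^2$$

Where $c_i$ is the count of delimiter $d$ on line $i$ (outside quotes), $n$ is the number of sampled lines, and $\bar{c}$ is the mean count. The candidate with the lowest $\sigma^2$ and $\bar{c} \geq 1$ is selected. Ties are broken by priority order: , > ; > \t > |.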
Re-serialization rule: a field f is quoted in the output if and only if f contains the target delimiter, a double-quote character, or a newline character.
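A minimal sketch of this rule, assuming a non-empty single-character delimiter and CRLF record endings. The function names serialize_field and serialize_rows are hypothetical, and treating a bare carriage return as a newline character is an assumption of the sketch.

```python
def serialize_field(field: str, delimiter: str) -> str:
    """Quote a field only if it contains the delimiter, a quote, or a newline."""
    needs_quotes = (
        delimiter in field
        or '"' in field
        or "\n" in field
        or "\r" in field  # assumption: CR counts as a newline character
    )
    if needs_quotes:
        # Escape embedded quotes by doubling them, then wrap in quotes.
        return '"' + field.replace('"', '""') + '"'
    return field


def serialize_rows(rows: list[list[str]], delimiter: str, newline: str = "\r\n") -> str:
    """Join fields with the target delimiter, one record per line."""
    return "".join(
        delimiter.join(serialize_field(f, delimiter) for f in row) + newline
        for row in rows
    )
```

With a semicolon target, serialize_field("3,14", ";") returns the value unquoted, because the comma is no longer special, while serialize_field("a;b", ";") returns the value wrapped in double quotes.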
Reference Data
| Delimiter | Symbol | Common Name | Typical Use Case | RFC 4180 | Risk with Naive Replace |
|---|---|---|---|---|---|
| Comma | , | CSV | US/UK locale exports, most APIs | Yes (default) | Breaks European decimals (3,14) |
| Semicolon | ; | CSV (EU) | Excel exports in EU locales | No (extension) | Rare in field data |
| Tab | \t | TSV | Database exports, bioinformatics | No | Invisible character confusion |
| Pipe | \| | PSV | Legacy mainframe systems, HL7 | No | Rare in natural text |
| Colon | : | Colon-SV | /etc/passwd, some configs | No | Breaks time values (14:30) |
| Space | (space) | SSV | Fixed-width approximations | No | Breaks multi-word fields |
| Caret | ^ | Caret-SV | EDI, some financial feeds | No | Low risk |
| Tilde | ~ | Tilde-SV | Custom enterprise exports | No | Low risk |
| Unit Separator | US (0x1F) | ASCII 31 | Machine-to-machine data | No | Not human-readable |
| SOH | SOH (0x01) | ASCII 1 | FIX protocol tag separator | No | Not human-readable |
| Double Quote | " | Quote char | Field enclosure (not delimiter) | Yes (enclosure) | Must be escaped as "" |
| Newline | \n / \r\n | Row separator | Record boundary | Yes (CRLF) | Embedded newlines in quotes |