Change Separator of Arbitrary Delimited Columns
Convert column delimiters between CSV, TSV, pipe, semicolon, or any custom separator. Handles quoted fields, auto-detects input format.
About
Switching column delimiters in structured text is deceptively error-prone. A naive find-and-replace fails when the source delimiter appears inside quoted fields, when consecutive delimiters represent empty columns, or when mixed line endings corrupt row boundaries. Mishandling these edge cases corrupts datasets by merging columns, shifting values, or silently dropping fields. This tool implements RFC 4180-aware parsing that respects double-quoted fields: a comma inside "New York, NY" is preserved literally, not treated as a column break. It auto-detects the input delimiter by checking how consistently each candidate character splits the first 20 rows.
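To make the quoted-field problem concrete, here is a minimal Python sketch that compares a naive replacement with a quote-aware conversion. It uses Python's standard `csv` module purely as a stand-in for an RFC 4180-style parser; it is not this tool's own code, and the sample line is invented for illustration.

```python
import csv
import io

line = 'Alice,"New York, NY",42'

# Naive replacement also rewrites the comma inside the quoted field,
# so the city is split across two columns.
naive = line.replace(",", "\t")

# Quote-aware conversion: parse into fields first, then re-serialize.
fields = next(csv.reader(io.StringIO(line), delimiter=","))
buffer = io.StringIO()
csv.writer(buffer, delimiter="\t", lineterminator="\n").writerow(fields)
aware = buffer.getvalue().rstrip("\n")

print(repr(naive))  # 'Alice\t"New York\t NY"\t42'  (4 columns, corrupted)
print(repr(aware))  # 'Alice\tNew York, NY\t42'     (3 columns, intact)
```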
The converter handles tab (\t), comma, semicolon, pipe (|), colon, space, and arbitrary multi-character delimiters. Options for trimming cell whitespace and collapsing consecutive delimiters give fine control over output formatting. Limitation: this tool processes plain-text delimited data only. It does not parse binary formats such as .xlsx, and it does not handle fixed-width column layouts. Pro tip: European CSV files often use semicolons because the comma serves as a decimal separator in those locales.
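The effect of the trimming and collapsing options can be illustrated with a small sketch. The function name `split_simple` and its `collapse` and `trim` parameters are hypothetical, chosen for this example, and the sketch deliberately ignores quoted fields to keep the two options in focus.

```python
import re


def split_simple(row, delim, collapse=False, trim=False):
    """Split an unquoted row; optionally collapse delimiter runs and trim cells."""
    pattern = re.escape(delim)
    if collapse:
        # Treat a run of consecutive delimiters as a single separator.
        pattern = f"(?:{pattern})+"
    cells = re.split(pattern, row)
    return [cell.strip() for cell in cells] if trim else cells


log_line = "GET   /index.html  200"
print(split_simple(log_line, " "))                 # runs produce empty cells
print(split_simple(log_line, " ", collapse=True))  # ['GET', '/index.html', '200']
print(split_simple("a ; b ;c", ";", trim=True))    # ['a', 'b', 'c']
```

Collapsing suits space-aligned output such as logs. For CSV-style data it is usually the wrong choice, because consecutive delimiters there mark empty columns.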
Formulas
The core operation is a parse-then-serialize pipeline. Each row of input text is tokenized respecting quoted fields, then re-joined with the target delimiter.
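Here $d_{src}$ is the source delimiter and $d_{tgt}$ is the target delimiter: each output row is the list of fields that the parse function extracts from the input row using $d_{src}$, joined with $d_{tgt}$. The parse function implements a finite state machine with three states, tracked in a state variable S.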
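A parse-then-serialize pipeline of this kind can be sketched as follows. This is an illustrative sketch, not the tool's implementation. Only the state name IN_QUOTED comes from the text; FIELD_START and IN_UNQUOTED are assumed names for the other two states. The re-quoting rule in `serialize_row` is also an assumption.

```python
from enum import Enum, auto


class State(Enum):
    FIELD_START = auto()  # assumed name: positioned at the start of a field
    IN_UNQUOTED = auto()  # assumed name: inside a bare field
    IN_QUOTED = auto()    # named in the text: inside a double-quoted field


def parse_row(row: str, d_src: str) -> list[str]:
    """Split one row on d_src, honouring RFC 4180-style double quotes."""
    if not d_src:
        raise ValueError("source delimiter must not be empty")
    fields, buf, state, i = [], [], State.FIELD_START, 0
    while i < len(row):
        ch = row[i]
        if state is State.IN_QUOTED:
            if ch == '"':
                if row.startswith('""', i):
                    buf.append('"')        # "" is an escaped quote
                    i += 2
                    continue
                state = State.IN_UNQUOTED  # closing quote
            else:
                buf.append(ch)             # delimiters are literal here
            i += 1
        elif row.startswith(d_src, i):
            fields.append("".join(buf))    # delimiter ends the field
            buf, state = [], State.FIELD_START
            i += len(d_src)
        elif ch == '"' and state is State.FIELD_START:
            state = State.IN_QUOTED        # opening quote
            i += 1
        else:
            buf.append(ch)
            state = State.IN_UNQUOTED
            i += 1
    fields.append("".join(buf))
    return fields


def serialize_row(fields: list[str], d_tgt: str) -> str:
    """Join fields with d_tgt, quoting any field that would be ambiguous."""
    out = []
    for field in fields:
        if d_tgt in field or '"' in field or "\n" in field or "\r" in field:
            field = '"' + field.replace('"', '""') + '"'
        out.append(field)
    return d_tgt.join(out)


def convert_row(row: str, d_src: str, d_tgt: str) -> str:
    return serialize_row(parse_row(row, d_src), d_tgt)


print(convert_row('Alice,"New York, NY",42', ",", "|"))  # Alice|New York, NY|42
```

Because `parse_row` compares with `str.startswith`, the same loop covers multi-character delimiters such as `||`.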
When S = IN_QUOTED, delimiter characters are treated as literal content. A double-quote inside a quoted field is escaped as "" per RFC 4180. Auto-detection scores each candidate delimiter $d_c$ by splitting the sampled rows on it and computing the variance $\sigma^2$ of the resulting column counts.
The delimiter with the highest score (lowest variance and highest mean column count) wins. A perfect delimiter produces identical column counts on every row, yielding $\sigma^2 = 0$.
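The scoring idea can be sketched in a few lines. The exact scoring formula is not given in the text, so the ratio used here (mean column count divided by one plus the variance) is an assumption that simply rewards low variance and a high mean. The candidate list and function names are likewise hypothetical, and Python's `csv` module stands in for the quote-aware parser.

```python
import csv
import io
from statistics import mean, pvariance

CANDIDATES = [",", "\t", ";", "|", ":"]  # assumed candidate set


def column_counts(text, delim, sample_rows=20):
    """Column count per non-empty row, over the first sample_rows rows."""
    rows = csv.reader(io.StringIO(text, newline=""), delimiter=delim)
    return [len(row) for _, row in zip(range(sample_rows), rows) if row]


def score(counts):
    """Assumed score: high mean column count, penalized by variance."""
    if not counts or mean(counts) <= 1:
        return 0.0  # this delimiter never splits a row
    return mean(counts) / (1.0 + pvariance(counts))


def detect_delimiter(text):
    return max(CANDIDATES, key=lambda d: score(column_counts(text, d)))


sample = 'name;city;amount\nAlice;"Paris; FR";1,5\nBob;Berlin;2,25\n'
print(repr(detect_delimiter(sample)))  # ';'
```

In the sample, the semicolon yields three columns on every row (variance 0), while the comma splits only the two rows containing decimal values, so its counts vary and it scores lower.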
Reference Data
| Delimiter | Symbol | Escape Sequence | Common Format | RFC / Standard | Typical Use Case |
|---|---|---|---|---|---|
| Comma | , | Quoted field | CSV | RFC 4180 | Spreadsheets, data export |
| Tab | \t | Rarely needed | TSV | IANA TSV | Bioinformatics, database dumps |
| Semicolon | ; | Quoted field | CSV (EU) | De facto | European locale CSV |
| Pipe | \| | Backslash or quote | PSV | HL7, EDI | Healthcare, legacy systems |
| Colon | : | None (fields cannot contain colons) | /etc/passwd | POSIX | Unix config files |
| Space | ␣ | Quoted field | SSV | None | Log files, CLI output |
| Tilde | ~ | None standard | Custom | None | Legacy mainframe exports |
| Caret | ^ | None standard | Custom | None | Special data feeds |
| Double Pipe | \|\| | None standard | Custom | None | Multi-char delimited logs |
| SOH (\x01) | ^A | N/A | FIX Protocol | FIX 4.x | Financial trading messages |
| Unit Sep (\x1F) | US | N/A | ASCII delimited | ISO 646 | Data interchange |
| Null (\0) | NUL | N/A | xargs -0 | POSIX | Filenames with spaces |
| Hash | # | None standard | Custom | None | Color codes, config |
| Ampersand | & | None standard | Query string | RFC 3986 | URL parameters |
| Equals | = | URL encoding | Key-value | RFC 3986 | Config files, env vars |