
About

Tabular text data uses a single character - the delimiter - to mark column boundaries. The most common delimiters are the horizontal tab (\t, U+0009), comma (,), semicolon (;), and pipe (|). Choosing the wrong delimiter when importing data into a database, spreadsheet, or ETL pipeline silently corrupts column alignment: values shift right, numeric fields absorb text, and downstream queries return garbage. This tool performs real, field-aware conversion between any two delimiters. It implements RFC 4180 quoting rules: if a target delimiter or a double-quote character already exists inside a field value, the field is wrapped in double quotes and internal quotes are escaped as "". Auto-detection analyzes character frequency across the first 50 lines to identify the source delimiter without manual guessing.

Limitation: this tool treats each line as a flat record. It does not parse nested JSON within cells or handle multi-line quoted fields that span more than one physical line. For files exceeding 5 MB, processing is offloaded to a background thread to keep the interface responsive. Pro tip: if your source data uses a tab delimiter but was copy-pasted from a spreadsheet, verify that trailing tabs on short rows are preserved - some clipboard implementations strip them.
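The trailing-tab caveat can be verified directly: in Python, a delimiter-based split preserves trailing empty fields, while a generic whitespace split silently drops them (a sketch, not the tool's implementation):

```python
line = "id\tname\t"          # short row: the last column is empty

# Splitting on the delimiter keeps the trailing empty field.
print(line.split("\t"))      # ['id', 'name', '']

# A generic whitespace split silently drops the trailing column.
print(line.split())          # ['id', 'name']
```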


Formulas

Delimiter conversion follows a deterministic two-phase process: parse, then serialize. Each line of the input is split into an ordered field array, then rejoined with the target delimiter.

fields = split(line, d_src)
output = join(quote(fields, d_tgt), d_tgt)
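The two phases can be sketched in Python (`convert_line` and `quote` are illustrative names; this assumes flat, single-line records as described above):

```python
def quote(field: str, d_tgt: str) -> str:
    """RFC 4180: wrap and escape only when the field needs it."""
    if d_tgt in field or '"' in field or "\n" in field:
        return '"' + field.replace('"', '""') + '"'
    return field

def convert_line(line: str, d_src: str, d_tgt: str) -> str:
    """Parse with the source delimiter, then serialize with the target one."""
    fields = line.split(d_src)                          # phase 1: parse
    return d_tgt.join(quote(f, d_tgt) for f in fields)  # phase 2: serialize

print(convert_line("a\tb,c\tend", "\t", ","))  # → a,"b,c",end
```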

The quoting function applies RFC 4180 rules conditionally:

quote(field) = "field"   if field contains d_tgt, a double quote, or a newline
             = field     otherwise

(internal double quotes are doubled to "" before the field is wrapped)
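Python's standard csv module applies the same conditional rule via csv.QUOTE_MINIMAL, which serves as an independent cross-check of the rule above (an illustration, not the tool's own code):

```python
import csv
import io

# Fields exercising each quoting trigger: target delimiter, quote, newline.
rows = [["plain", "has,comma", 'has "quotes"', "multi\nline"]]
buf = io.StringIO()
csv.writer(buf, quoting=csv.QUOTE_MINIMAL, lineterminator="\n").writerows(rows)
print(buf.getvalue())
# plain,"has,comma","has ""quotes""","multi
# line"
```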

Auto-detection scores each candidate delimiter by counting occurrences per line and computing consistency:

score(d) = count / (1 + σ(counts))

Where d_src = source delimiter, d_tgt = target delimiter, σ(counts) = standard deviation of per-line occurrence counts (lower variance means more consistent column structure), and count = mean occurrences per line. The delimiter with the highest score wins.
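A minimal Python sketch of this scoring rule; the function name and candidate set are illustrative assumptions:

```python
import statistics

def detect_delimiter(lines, candidates=("\t", ",", ";", "|", ":")) -> str:
    """Score each candidate by mean count / (1 + stdev); highest score wins."""
    sample = lines[:50]                      # analyze the first 50 lines only
    best, best_score = "\t", 0.0             # tab is the TSV-focused default
    for d in candidates:
        counts = [line.count(d) for line in sample]
        mean = statistics.mean(counts)
        sigma = statistics.pstdev(counts)    # per-line occurrence variance
        score = mean / (1 + sigma)
        if score > best_score:
            best, best_score = d, score
    return best

print(detect_delimiter(["a,b,c", "1,2,3", "4,5,6"]))  # → ,
```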

Reference Data

| Delimiter Name | Character | Unicode | Common File Extension | Typical Use Case | RFC / Standard | Quoting Risk |
|---|---|---|---|---|---|---|
| Tab | \t | U+0009 | .tsv, .tab | Database exports, UNIX utilities | IANA text/tab-separated-values | Low - rarely appears in data |
| Comma | , | U+002C | .csv | Spreadsheets, CRM exports | RFC 4180 | High - common in text & numbers |
| Semicolon | ; | U+003B | .csv (EU locale) | European Excel, SAP exports | No formal RFC | Medium |
| Pipe | \| | U+007C | .psv, .dat | EDI, HL7 health data, mainframes | HL7 v2.x | Low |
| Colon | : | U+003A | /etc/passwd | UNIX config files | POSIX | Medium - in timestamps |
| Tilde | ~ | U+007E | .dat | Legacy banking, NACHA files | NACHA/ACH | Very low |
| Caret | ^ | U+005E | .dat | Mainframe flat files | None | Very low |
| Space | (space) | U+0020 | .txt, .asc | Fixed-width fallback, scientific logs | None | Extreme - appears everywhere |
| Unit Separator | US | U+001F | .dat | ASCII control, binary-safe delimiting | ISO 646 | None - invisible character |
| Record Separator | RS | U+001E | .dat | Multi-record ASCII streams | ISO 646 | None |
| Null | NUL | U+0000 | (none) | Binary streams, C-string termination, xargs -0 | POSIX | None |
| SOH | SOH | U+0001 | .hl7 | HL7 sub-component separator | HL7 v2.x | None |
| Double Pipe | \|\| | Two U+007C | .dat | Custom enterprise integrations | None | Very low |
| Hash | # | U+0023 | .dat | Legacy telecom CDR files | None | Low |
| At Sign | @ | U+0040 | .dat | Custom log formats | None | Low |

Frequently Asked Questions

How does delimiter auto-detection work?

The tool samples the first 50 lines and counts occurrences of each candidate delimiter (tab, comma, semicolon, pipe, colon) per line. It then computes the mean count and standard deviation for each candidate. A consistent column structure produces low variance and a non-zero mean. The candidate with the highest score - defined as mean divided by (1 + standard deviation) - is selected. If all scores are zero or tied, tab is assumed as the default since the tool is TSV-focused.
What happens if a field already contains the target delimiter?

The tool applies RFC 4180 quoting: any field containing the target delimiter, a double-quote character, or a newline is wrapped in double quotes. Existing double quotes within the field are escaped by doubling them (e.g., a field containing He said "hello" becomes "He said ""hello"""). This ensures the output can be re-parsed without ambiguity by any compliant parser.
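The escaping round-trips cleanly; here the field from the example is serialized and then re-parsed with Python's standard csv module (an independent check, not the tool itself):

```python
import csv
import io

field = 'He said "hello"'
buf = io.StringIO()
csv.writer(buf, lineterminator="\n").writerow([field])
print(buf.getvalue())    # "He said ""hello"""

# Re-parsing recovers the original field exactly.
assert next(csv.reader(io.StringIO(buf.getvalue()))) == [field]
```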
Can the tool handle rows with inconsistent field counts?

Yes. Each line is split independently. If row 5 has 3 fields and row 6 has 7 fields, both are converted faithfully. The tool does not enforce rectangular structure. However, this inconsistency may indicate a parsing problem upstream - for example, a multi-line quoted field that was not properly handled by the source system. The tool reports the detected column count range in the output summary.
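A sketch of this per-line independence with a field-count summary (names are illustrative; RFC 4180 quoting is omitted here for brevity):

```python
def convert_ragged(lines, d_src, d_tgt):
    """Convert each line independently; report the field-count range."""
    out, counts = [], []
    for line in lines:
        fields = line.split(d_src)
        counts.append(len(fields))
        out.append(d_tgt.join(fields))   # quoting omitted for brevity
    return out, (min(counts), max(counts))

rows, span = convert_ragged(["a\tb\tc", "x\ty\tz\tq\tr\ts\tt"], "\t", "|")
print(rows)   # ['a|b|c', 'x|y|z|q|r|s|t']
print(span)   # (3, 7) - the detected column count range
```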
Does RFC 4180 quoting slow down conversion?

Minimally. The quoting check is an O(n) scan of each field value for the presence of the target delimiter or quote character. For a 100,000-line file with 10 columns, this adds roughly 1 million short string searches - completing in under 50 ms on modern hardware. The dominant cost is string concatenation for the output, which the tool optimizes by pre-allocating array joins rather than repeated string appends.
Why do some systems prefer pipe or tilde over comma?

Commas appear frequently in natural text (addresses, descriptions, numbers with thousand separators in some locales). Every embedded comma forces quoting, inflating file size and complicating downstream parsing. Pipe (|) and tilde (~) rarely appear in real data, eliminating the need for quoting entirely. This is why EDI standards (ANSI X12), HL7 health records, and many mainframe systems chose non-comma delimiters decades ago.
How are line endings handled?

The tool normalizes all line endings to the format you select: LF (Unix/Mac, \n), CRLF (Windows, \r\n), or CR (legacy Mac, \r). Input is split using a regex that matches any of these three patterns. Default output uses LF. If you plan to open the output in Windows Notepad (pre-2018 versions), select CRLF.
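The three line-ending patterns can be matched with a single regex, as the answer describes; in Python (a sketch, with the alternation ordered so \r\n matches before its component characters):

```python
import re

text = "row1\r\nrow2\rrow3\nrow4"            # mixed CRLF, CR, and LF endings
lines = re.split(r"\r\n|\r|\n", text)        # order matters: try \r\n first
print(lines)                                 # ['row1', 'row2', 'row3', 'row4']
print("\n".join(lines))                      # re-joined with LF (the default)
```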