About

Manual extraction of columnar data from CSV files into programming-language arrays introduces transcription errors, mismatched quoting, and delimiter confusion. A single unescaped comma inside a quoted field breaks naive split(delimiter) logic. This tool implements an RFC 4180-compliant parser that correctly handles quoted fields containing delimiters, escaped double-quotes (""), and embedded newlines. It auto-detects the delimiter by scoring consistency of , ; \t | across sample rows, then transposes the row-major parsed matrix into column-major arrays. Output is generated with proper escaping for 10 target languages. The tool approximates type inference (numeric vs. string) but does not guarantee type safety for ambiguous values like 007 or locale-specific decimals (3,14 vs 3.14).

Formulas

The delimiter auto-detection algorithm scores each candidate delimiter d from the field counts it produces across the sample rows. The delimiter with the lowest variance in field count and the highest mean field count wins.

score(d) = 1 / (σ²(counts) + 1) × n

Where σ²(counts) is the variance of the number of fields per row when split by delimiter d, and n is the mean field count. The score is highest when every row produces the same number of fields (variance = 0) and the mean field count is large; with zero variance the score equals n. The + 1 term in the denominator prevents division by zero.
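As a rough illustration of the scoring formula, the sketch below computes the score from a list of per-row field counts. The function name delimiter_score is hypothetical, and the use of population variance (rather than sample variance) is an assumption, since the text does not say which one the tool uses.

```python
from statistics import mean, pvariance


def delimiter_score(field_counts):
    """Return mean / (variance + 1) for one candidate's per-row field counts.

    Assumption: population variance is used; the source does not specify.
    """
    if not field_counts:
        return 0.0
    return mean(field_counts) / (pvariance(field_counts) + 1)


if __name__ == "__main__":
    # Consistent rows score higher than ragged rows with the same mean.
    print(delimiter_score([3, 3, 3, 3]))
    print(delimiter_score([3, 2, 4, 3]))
```

With counts [3, 3, 3, 3] the variance is 0 and the score equals the mean, 3. With [3, 2, 4, 3] the mean is still 3 but the variance of 0.5 lowers the score to 2.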

Column transposition converts row-major matrix M of dimensions r × c into c arrays of length r:

column_j = [M[0][j], M[1][j], …, M[r−1][j]] for j ∈ [0, c)

Where r = total data rows (excluding header if selected), c = maximum column count across all rows, and missing cells in ragged rows are filled with empty strings.
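The transposition described above could be sketched as follows. The function name transpose_columns and the has_header parameter are illustrative, not taken from the tool; the padding rule and the definition of c follow the text.

```python
def transpose_columns(rows, has_header=False):
    """Turn a row-major list of rows into column-major lists.

    c is the maximum column count across all rows; short (ragged) rows
    are padded with empty strings so every column has the same length.
    """
    width = max((len(row) for row in rows), default=0)
    data = rows[1:] if has_header else rows
    return [
        [row[j] if j < len(row) else "" for row in data]
        for j in range(width)
    ]


if __name__ == "__main__":
    matrix = [["a", "b", "c"], ["d"], ["e", "f"]]
    for column in transpose_columns(matrix):
        print(column)
```

For the sample matrix, the second row has one field and the third has two, so the second and third columns receive empty strings in those positions.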

Reference Data

Language	Array Syntax	String Quote	Numeric Handling	Trailing Comma
JavaScript	const arr = […]	Single or Double	Unquoted	Optional
TypeScript	const arr: string[] = […]	Single or Double	Unquoted	Optional
Python	arr = […]	Single or Double	Unquoted	Optional
PHP	$arr = […];	Single or Double	Unquoted	Allowed
Ruby	arr = […]	Single or Double	Unquoted	Optional
Java	String[] arr = {…};	Double only	Unquoted	Allowed
C#	string[] arr = {…};	Double only	Unquoted	Allowed
Go	arr := []string{…}	Double only	Unquoted	Required when the closing brace is on its own line
Swift	let arr: [String] = […]	Double only	Unquoted	Optional
Rust	let arr: Vec<&str> = vec![…];	Double only	Unquoted	Optional

Delimiter Detection Scoring
Comma (,)	RFC 4180 standard. Most common CSV delimiter worldwide.
Semicolon (;)	Common in European locales where comma is the decimal separator.
Tab (\t)	TSV format. Rarely appears inside field values.
Pipe (|)	Used in legacy systems and database exports.
Colon (:)	Uncommon. Found in /etc/passwd and some log formats.

RFC 4180 Edge Cases
Quoted comma	"New York, NY" → single field: New York, NY
Escaped quote	"She said ""hi""" → She said "hi"
Embedded newline	"Line1\nLine2" → single field with newline
Empty field	a,,c → three fields, middle is empty string
Ragged rows	Rows with fewer columns padded with empty strings
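
The edge cases in the table above can be handled with a small character-by-character state machine. The sketch below is a minimal illustration under stated assumptions, not the tool's actual parser: the name parse_csv is hypothetical, quotes are accepted leniently wherever they appear, and a file consisting only of an empty quoted field is not treated specially.

```python
def parse_csv(text, delimiter=","):
    """Parse CSV text into a list of rows (lists of strings).

    Handles quoted delimiters, doubled quotes ("") and newlines inside
    quoted fields. Accepts LF, CR and CRLF as record separators.
    """
    rows, row, field = [], [], []
    in_quotes = False
    i, n = 0, len(text)
    while i < n:
        ch = text[i]
        if in_quotes:
            if ch == '"':
                if i + 1 < n and text[i + 1] == '"':
                    field.append('"')  # "" inside quotes is a literal quote
                    i += 1
                else:
                    in_quotes = False  # closing quote
            else:
                field.append(ch)  # delimiters and newlines are literal here
        elif ch == '"':
            in_quotes = True
        elif ch == delimiter:
            row.append("".join(field))
            field = []
        elif ch in "\r\n":
            if ch == "\r" and i + 1 < n and text[i + 1] == "\n":
                i += 1  # treat CRLF as a single record separator
            row.append("".join(field))
            rows.append(row)
            row, field = [], []
        else:
            field.append(ch)
        i += 1
    if field or row:  # last record without a trailing newline
        row.append("".join(field))
        rows.append(row)
    return rows


if __name__ == "__main__":
    sample = '"New York, NY",x\n"She said ""hi""",y\na,,c\n'
    for parsed in parse_csv(sample):
        print(parsed)
```

Ragged rows are left as parsed here; padding is a separate step applied during transposition.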

Frequently Asked Questions

The algorithm tests each candidate delimiter against the first 20 rows (or all rows if fewer). It computes the variance of field counts per row. The delimiter producing variance closest to 0 with the highest mean field count wins. For example, if commas yield [3,3,3,3] fields per row (variance = 0) and semicolons yield [1,1,1,1] (variance = 0 but mean = 1), commas win because the mean field count is higher. You can override auto-detection by manually selecting a delimiter.
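
The selection procedure described above might look roughly like this. The sketch uses Python's standard csv module to split the sample rows; whether the tool splits quote-aware during detection, how it breaks ties, and the use of population variance are all assumptions. The names detect_delimiter and CANDIDATES are hypothetical, and the candidate set is the four delimiters listed in the About section.

```python
import csv
import io
from itertools import islice
from statistics import mean, pvariance

CANDIDATES = [",", ";", "\t", "|"]


def detect_delimiter(text, sample_rows=20):
    """Pick the candidate whose field counts are most consistent and largest."""
    best, best_score = ",", float("-inf")
    for candidate in CANDIDATES:
        reader = csv.reader(io.StringIO(text, newline=""), delimiter=candidate)
        # Field count per sampled row; blank lines are skipped.
        counts = [len(row) for row in islice(reader, sample_rows) if row]
        if not counts:
            continue
        score = mean(counts) / (pvariance(counts) + 1)
        if score > best_score:  # ties keep the earlier candidate
            best, best_score = candidate, score
    return best


if __name__ == "__main__":
    european = "a;b;c\n1,5;2,5;3,5\n4;5;6\n"
    print(repr(detect_delimiter(european)))
```

In the sample, semicolons yield [3, 3, 3] fields per row (score 3), while commas yield the inconsistent counts [1, 2, 1], so the semicolon is chosen.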

The parser implements RFC 4180 fully. A field wrapped in double quotes can contain the delimiter, newlines, and even other double quotes (escaped as two consecutive double quotes). For example, the CSV value "Price: $5,000" with comma delimiter is parsed as one field: Price: $5,000. The output array will contain the unescaped string with proper language-specific escaping applied.
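
To illustrate what language-specific escaping involves, the sketch below emits a JavaScript array literal with double-quoted strings. It is written in Python for consistency with the other examples; the names js_string and js_array are hypothetical, and the escape table covers only backslash, double quote, newline, carriage return, and tab, which is an assumption about the minimum a generator would need.

```python
# Characters that must be escaped inside a double-quoted JavaScript string.
_JS_ESCAPES = {
    "\\": "\\\\",
    '"': '\\"',
    "\n": "\\n",
    "\r": "\\r",
    "\t": "\\t",
}


def js_string(value):
    """Return value as a double-quoted JavaScript string literal."""
    return '"' + "".join(_JS_ESCAPES.get(ch, ch) for ch in value) + '"'


def js_array(name, values):
    """Return a `const name = [...]` declaration for a list of strings."""
    return f"const {name} = [{', '.join(js_string(v) for v in values)}]"


if __name__ == "__main__":
    print(js_array("notes", ['She said "hi"', "Price: $5,000"]))
```

The parsed value She said "hi" contains raw double quotes after CSV unescaping, so the generator has to escape them again for the target language. Other targets need their own tables; PHP double-quoted strings, for example, also treat $ specially.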

The converter applies heuristic type inference. Values matching the pattern /^-?\d+\.?\d*$/ are treated as numeric and output without quotes in languages that support mixed arrays (JavaScript, Python, PHP, Ruby). Although leading-zero values such as 007 also match this pattern, they are treated as strings to preserve data integrity. In statically typed languages (Java, C#, Go, Rust, Swift), all values default to string type for safety. You can force all-string output by selecting the "Quote all values" option.
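
A minimal sketch of this heuristic follows. The numeric pattern is the one quoted above; the separate leading-zero check is an assumption about how values like 007 end up as strings, and the function name is_numeric is hypothetical.

```python
import re

# The pattern from the text, restricted to ASCII digits.
NUMERIC = re.compile(r"-?\d+\.?\d*", re.ASCII)
# Assumed extra rule: a zero followed by another digit marks a
# leading-zero value such as 007, which is kept as a string.
LEADING_ZERO = re.compile(r"-?0\d", re.ASCII)


def is_numeric(value):
    """Return True if value would be emitted unquoted."""
    if LEADING_ZERO.match(value):
        return False
    return NUMERIC.fullmatch(value) is not None


if __name__ == "__main__":
    for sample in ["42", "-3.14", "0.5", "007", "3,14", "abc"]:
        print(sample, is_numeric(sample))
```

A locale-specific decimal such as 3,14 does not match the pattern and therefore stays a string, while 0.5 is still numeric because the zero is followed by a decimal point rather than a digit.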

Rows with differing column counts are supported. The parser determines the maximum column count across all rows, and rows with fewer columns are padded with empty strings, so all column arrays have the same length. A warning toast indicates which rows had fewer fields than expected, so you can verify data quality before using the output.

The tool processes data entirely in the browser. Practical limits depend on available RAM. Files under 10 MB (roughly 100,000 rows × 10 columns) process in under 2 seconds on modern hardware. Files exceeding 50 MB may cause browser tab slowdowns. For very large datasets, consider server-side tools like Python's csv module or pandas. The tool will display a warning if the input exceeds 50,000 rows.

Yes. Columns are indexed left to right starting from index 0. If a header row exists and the "First row is header" option is enabled, column names are derived from header values and used as variable names in the output (sanitized to valid identifiers). Without a header, columns are named column_0, column_1, etc.
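
The naming scheme could be sketched as below. The text only says header values are sanitized to valid identifiers, so the specific rules here (replace non-word characters with underscores, prefix names starting with a digit, de-duplicate with a numeric suffix) are assumptions, as is the function name column_names. Reserved words in the target language are not handled.

```python
import re


def column_names(header, width):
    """Derive one identifier per column, falling back to column_<index>."""
    names, seen = [], set()
    for j in range(width):
        raw = header[j] if header and j < len(header) else ""
        # Collapse runs of non-word characters into a single underscore.
        name = re.sub(r"\W+", "_", raw.strip(), flags=re.ASCII).strip("_")
        if not name:
            name = f"column_{j}"  # no usable header value
        elif name[0].isdigit():
            name = "_" + name  # identifiers cannot start with a digit
        base, suffix = name, 2
        while name in seen:  # keep names unique
            name = f"{base}_{suffix}"
            suffix += 1
        seen.add(name)
        names.append(name)
    return names


if __name__ == "__main__":
    print(column_names(["First Name", "2nd", "", "First Name"], 4))
    print(column_names(None, 2))
```

Passing None as the header corresponds to the case where "First row is header" is disabled, so every column receives its index-based name.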