CSV to JSONL Converter
Convert CSV files to JSONL (JSON Lines) format online. Handles quoted fields, custom delimiters, type inference, and large files with Web Worker processing.
About
CSV and JSONL serve fundamentally different data ecosystems. CSV (RFC 4180) uses row-delimited, comma-separated flat text. JSONL stores one self-describing JSON object per line. Converting between them is not trivial. Quoted fields can contain the delimiter character itself. Newlines can appear inside double-quoted values. A naive split-by-comma approach will corrupt your data at the first embedded comma or multiline address field. This converter implements a finite-state-machine parser that correctly handles all RFC 4180 edge cases: escaped quotes (""), embedded delimiters, and multiline quoted fields. It auto-detects the delimiter by frequency analysis and supports optional type inference, converting numeric strings like 3.14 to actual JSON numbers rather than leaving them as strings.
JSONL is the required input format for OpenAI fine-tuning and a supported format for BigQuery batch loads, where it is preferred for nested data. It is also common in streaming data pipelines. Malformed conversion can silently shift columns, drop fields, or inject null bytes. This tool processes files up to several hundred megabytes in a Web Worker thread to keep the browser responsive. Limitations: binary data embedded in CSV cells is not supported, and the parser assumes UTF-8 encoding. For TSV or semicolon-delimited files, select the appropriate delimiter or use auto-detect.
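Because these failures are silent, it can help to check the output before feeding it to a pipeline. The following is a rough sketch, not part of this tool: `check_jsonl` is a hypothetical helper that flags lines that fail to parse, objects whose keys differ from the first object's, and string values containing a null byte.

```python
import json


def check_jsonl(text: str) -> list[str]:
    """Return a list of problems found; an empty list means none were detected."""
    problems: list[str] = []
    expected_keys = None
    for number, line in enumerate(text.split("\n"), start=1):
        if line == "":
            continue  # tolerate blank lines, including the one after a final newline
        try:
            obj = json.loads(line)
        except json.JSONDecodeError as exc:
            problems.append(f"line {number}: invalid JSON ({exc.msg})")
            continue
        if not isinstance(obj, dict):
            problems.append(f"line {number}: not a JSON object")
            continue
        keys = list(obj)
        if expected_keys is None:
            expected_keys = keys  # the first object defines the expected columns
        elif keys != expected_keys:
            # Order-sensitive comparison: a shifted or dropped column shows up here.
            problems.append(f"line {number}: keys {keys} differ from the first object")
        if any(isinstance(v, str) and "\x00" in v for v in obj.values()):
            problems.append(f"line {number}: value contains a null byte")
    return problems
```

The key comparison is deliberately strict, since a CSV-derived file is expected to carry the same columns in the same order on every line.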
Formulas
The CSV parser uses a Finite State Machine with four states. For each character c at position i, the transition function δ determines the next state:
Where the state set is:
And the input alphabet is:
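A minimal Python sketch of such a four-state parser follows. The state names (`FIELD_START`, `UNQUOTED`, `QUOTED`, `QUOTE_IN_QUOTED`), the grouping of input into delimiter, quote, line break, and other characters, and the function name `parse_csv` are assumptions made for illustration; the tool's own labels and implementation may differ.

```python
from enum import Enum, auto


class State(Enum):
    # Hypothetical state names; the tool's own labels are not given in the text.
    FIELD_START = auto()      # at the beginning of a field
    UNQUOTED = auto()         # inside a field that did not open with a quote
    QUOTED = auto()           # inside a double-quoted field
    QUOTE_IN_QUOTED = auto()  # just saw a quote while inside a quoted field


def parse_csv(text: str, delimiter: str = ",") -> list[list[str]]:
    """Split CSV text into rows of string fields (single-character delimiter)."""
    rows: list[list[str]] = []
    row: list[str] = []
    field: list[str] = []
    state = State.FIELD_START

    def end_field() -> None:
        row.append("".join(field))
        field.clear()

    def end_row() -> None:
        nonlocal row
        end_field()
        rows.append(row)
        row = []

    i, n = 0, len(text)
    while i < n:
        c = text[i]
        # Outside an open quoted field, treat CRLF as a single line break.
        if c == "\r" and i + 1 < n and text[i + 1] == "\n" and state is not State.QUOTED:
            i += 1
            c = "\n"
        if state is State.QUOTED:
            if c == '"':
                state = State.QUOTE_IN_QUOTED
            else:
                field.append(c)  # delimiters and newlines are literal here
        elif state is State.QUOTE_IN_QUOTED and c == '"':
            field.append('"')    # doubled quote ("") becomes one literal quote
            state = State.QUOTED
        elif c == delimiter:
            end_field()
            state = State.FIELD_START
        elif c == "\n":
            end_row()
            state = State.FIELD_START
        elif c == '"' and state is State.FIELD_START:
            state = State.QUOTED
        else:
            field.append(c)      # lenient: stray characters are kept as data
            state = State.UNQUOTED
        i += 1

    # Flush a final record that is not terminated by a line break.
    if field or row or state is not State.FIELD_START:
        end_row()
    return rows
```

For example, `parse_csv('a,"x, ""y"""\r\n')` keeps the embedded comma and collapses the doubled quotes, yielding a single row with two fields.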
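Delimiter auto-detection scores each candidate delimiter d by computing the mean μ and standard deviation σ of its per-line occurrence counts across the first n sample lines. Among candidates that actually occur in the sample, the one with the lowest σ and the highest mean count wins, which amounts to maximizing a score of the form: score(d) = μd / (σd + 1)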
Where μd is the mean occurrence count of delimiter d per line, and σd is its standard deviation. The + 1 prevents division by zero for perfectly consistent delimiters. For type inference, each string value v is tested against patterns:
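Both heuristics can be sketched in Python. The scoring expression, mean divided by standard deviation plus one, follows the description above. The candidate list, the 20-line sample size, and the specific patterns (integers, decimals, true/false, null) are assumptions, since the tool's exact rules are not listed here; `detect_delimiter` and `infer_value` are hypothetical names.

```python
import re
from statistics import mean, pstdev

CANDIDATES = [",", ";", "\t", "|"]  # assumed candidate delimiters

# Assumed patterns; a leading zero (e.g. "007") deliberately fails both.
_INT = re.compile(r"-?(0|[1-9]\d*)")
_FLOAT = re.compile(r"-?(0|[1-9]\d*)(\.\d+)?([eE][+-]?\d+)?")


def detect_delimiter(sample: str, max_lines: int = 20) -> str:
    """Pick the candidate with the highest mean / (stdev + 1) score."""
    # Simplification: counts raw characters and ignores quoting.
    lines = [ln for ln in sample.splitlines() if ln][:max_lines]
    if not lines:
        return ","
    best, best_score = ",", 0.0
    for d in CANDIDATES:
        counts = [ln.count(d) for ln in lines]
        score = mean(counts) / (pstdev(counts) + 1)
        if score > best_score:
            best, best_score = d, score
    return best


def infer_value(v: str):
    """Convert a CSV string to a JSON-compatible Python value where it is unambiguous."""
    if _INT.fullmatch(v):
        return int(v)
    if _FLOAT.fullmatch(v):
        return float(v)
    low = v.lower()
    if low in ("true", "false"):
        return low == "true"
    if low == "null":
        return None
    return v  # anything else stays a string
```

Rejecting leading zeros is a design choice in this sketch: identifiers such as ZIP codes or account numbers would otherwise lose digits when converted to numbers.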
Reference Data
| Feature | CSV (RFC 4180) | JSONL (JSON Lines) |
|---|---|---|
| Line Terminator | CRLF (\r\n) | LF (\n) |
| Field Delimiter | Comma (,) default | N/A (self-describing) |
| Quoting | Double-quote (") | N/A |
| Escaped Quote | "" (doubled) | \" (backslash) |
| Nested Structures | Not supported | Full JSON nesting |
| Data Types | All values are strings | String, Number, Boolean, Null, Object, Array |
| Schema | Implicit from header row | Per-object, self-describing |
| Encoding | Typically UTF-8 or ASCII | UTF-8 required |
| Streaming | Line-by-line possible | Line-by-line by design |
| Max File Size (this tool) | Limited by browser memory | Output scales linearly |
| Use Case | Spreadsheets, legacy ETL | ML pipelines, BigQuery, APIs |
| MIME Type | text/csv | application/jsonl (in common use, not IANA-registered) |
| Multiline Values | Allowed inside quotes | Not allowed (one object per line) |
| Comments | No standard (some use #) | Not supported |
| Header Row | Optional but conventional | N/A (keys in every object) |
| Empty Fields | ,, → empty string | null or "" |
| Boolean Representation | true / false as text | true / false native (lowercase literals) |
| Numeric Precision | Arbitrary (text) | IEEE 754 double (15-17 sig. digits) |
| OpenAI Fine-Tuning | Not accepted | Required format |
| BigQuery Load | Supported | Preferred for nested data |
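
Several rows of the table can be seen in a single record. The sketch below uses Python's standard `csv` and `json` modules rather than this tool's own parser, and `csv_to_jsonl` is a hypothetical helper name. It performs no type inference, so every value stays a string.

```python
import csv
import io
import json


def csv_to_jsonl(csv_text: str, delimiter: str = ",") -> str:
    """Convert CSV text with a header row into JSONL, one object per line."""
    # newline="" leaves line endings untouched so the csv module can handle
    # line breaks that appear inside quoted fields.
    reader = csv.DictReader(io.StringIO(csv_text, newline=""), delimiter=delimiter)
    # json.dumps escapes quotes as \" and line breaks as \n, so each record
    # fits on one physical line.
    lines = [json.dumps(row, ensure_ascii=False) for row in reader]
    return "\n".join(lines) + ("\n" if lines else "")


sample = 'name,quote,note\r\n"Smith, Al","He said ""hi""\nthen left",\r\n'
print(csv_to_jsonl(sample), end="")
```

The quoted field containing a comma, a doubled quote, and a line break becomes one JSON string in which the quote is written as `\"` and the line break as `\n`, and the trailing empty field becomes `""`.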