About

Malformed CSV parsing causes silent data corruption. A mishandled quoted field containing a comma splits one column into two, cascading errors across every downstream row. This converter implements a full RFC 4180-compliant finite state machine parser that correctly resolves quoted fields, escaped double quotes (""), and embedded newlines. It auto-detects delimiters (comma, semicolon, tab, pipe) by frequency analysis across the first 5 lines. Output formats include well-formed XML with entity-escaped special characters (&, <, >) and JSON with configurable indentation.

Limitations: the tool assumes UTF-8 encoding. Files with a BOM marker have it stripped automatically. Nested or hierarchical CSV structures (parent-child relationships) are flattened to single-depth objects. For XML output, column headers are sanitized to valid XML element names: spaces become underscores, and leading digits are prefixed with an underscore. The maximum recommended file size is 50 MB; larger files may cause browser memory pressure.


Formulas

The CSV parser operates as a finite state machine with 4 states. For each character c at position i, the transition function δ determines the next state:

δ: S × Σ → S

where S = {FIELD_START, UNQUOTED, QUOTED, QUOTE_IN_QUOTED} and Σ is the input alphabet (all UTF-8 characters).

δ(state, c) =
    QUOTED             if state = FIELD_START and c = "
    UNQUOTED           if state = FIELD_START and c ≠ "
    QUOTE_IN_QUOTED    if state = QUOTED and c = "
    QUOTED             if state = QUOTE_IN_QUOTED and c = "   (escaped quote)
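These transitions can be sketched as a compact JavaScript parser. This is an illustrative sketch, not the tool's actual source; it assumes line endings have already been normalized to \n, and adds the field/row termination logic the transition table leaves implicit:

```javascript
// States mirror the transition function above.
const FIELD_START = 0, UNQUOTED = 1, QUOTED = 2, QUOTE_IN_QUOTED = 3;

function parseCSV(text, delim = ",") {
  const rows = [];
  let row = [], field = "", state = FIELD_START;
  const endField = () => { row.push(field); field = ""; state = FIELD_START; };
  const endRow = () => { endField(); rows.push(row); row = []; };

  for (const c of text) {
    if (state === QUOTED) {
      if (c === '"') state = QUOTE_IN_QUOTED;
      else field += c;                 // delimiters and newlines are literal here
    } else if (state === QUOTE_IN_QUOTED) {
      if (c === '"') { field += '"'; state = QUOTED; }  // escaped quote ""
      else if (c === delim) endField();
      else if (c === "\n") endRow();
      else state = UNQUOTED;           // lenient: quote closed mid-field
    } else {                           // FIELD_START or UNQUOTED
      if (c === '"' && state === FIELD_START) state = QUOTED;
      else if (c === delim) endField();
      else if (c === "\n") endRow();
      else { field += c; state = UNQUOTED; }
    }
  }
  if (field !== "" || row.length > 0) endRow();  // flush a final unterminated row
  return rows;
}
```

With this sketch, `parseCSV('"Smith, John",42\n')` yields a single row of two fields, the embedded comma preserved.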

Delimiter auto-detection calculates a consistency score C for each candidate delimiter d across the first n lines:

C_d = mean(counts_d) / (1 + σ(counts_d))

where counts_d is the per-line count of delimiter d across the first n lines, σ is its standard deviation, and mean(counts_d) is its mean. The delimiter with the highest C_d and a mean count ≥ 1 is selected. For XML output, every text node value v undergoes entity replacement v → escape(v), where escape maps & → &amp;, < → &lt;, > → &gt;, " → &quot;, and ' → &apos;.
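The scoring rule can be sketched as a small helper. The function name is an assumption, and the 1 + σ smoothing in the denominator avoids dividing by zero when a delimiter appears a perfectly consistent number of times per line:

```javascript
// Score each candidate delimiter over the first few lines and pick the
// most consistent one; comma is the RFC 4180 fallback.
function detectDelimiter(text, maxLines = 5) {
  const lines = text.split("\n").filter(l => l.length > 0).slice(0, maxLines);
  const candidates = [",", ";", "\t", "|"];
  let best = ",", bestScore = -Infinity;
  for (const d of candidates) {
    const counts = lines.map(l => l.split(d).length - 1);
    const mean = counts.reduce((a, b) => a + b, 0) / counts.length;
    const variance = counts.reduce((a, c) => a + (c - mean) ** 2, 0) / counts.length;
    const score = mean / (1 + Math.sqrt(variance));  // C_d = mean / (1 + sigma)
    if (mean >= 1 && score > bestScore) { bestScore = score; best = d; }
  }
  return best;
}
```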

Reference Data

Item              Character   Common Use                                  Auto-Detect / Handling
Comma             ,           Standard CSV (RFC 4180)                     Highest consistent count per line
Semicolon         ;           European locales (decimal comma conflict)   Fallback when comma count is inconsistent
Tab               \t          TSV files, database exports                 Detected if tab count ≥ 1 per line
Pipe              |           Legacy systems, mainframe exports           Detected if pipe count is consistent
Double Quote      "           Field enclosure (RFC 4180)                  N/A (enclosure, not delimiter)
Escaped Quote     ""          Literal quote inside quoted field           N/A (escape sequence)
CRLF              \r\n        Windows line ending                         Normalized to \n internally
LF                \n          Unix/macOS line ending                      Primary line break
BOM               \uFEFF      UTF-8 Byte Order Mark                       Stripped if found at position 0
XML Entity: &     &amp;       Escaped in XML output                       All 5 XML entities handled
XML Entity: <     &lt;        Escaped in XML output                       Prevents tag injection
XML Entity: >     &gt;        Escaped in XML output                       Prevents tag injection
XML Entity: "     &quot;      Escaped in XML attributes                   Attribute-safe encoding
XML Entity: '     &apos;      Escaped in XML attributes                   Attribute-safe encoding
JSON Indent: 2    Spaces      Standard readable JSON                      Default setting
JSON Indent: 4    Spaces      Verbose readable JSON                       Optional setting
JSON Indent: Tab  \t          Tab-indented JSON                           Optional setting
JSON Compact      None        Minified JSON (no whitespace)               Smallest file size
Max Safe Rows     500,000     Browser memory limit (~50 MB)               Warning shown above limit
RFC 4180          Standard    Formal CSV specification                    Full compliance implemented
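The five entity replacements listed above can be combined into a single function. A minimal sketch; the function name is an assumption:

```javascript
// Replace the five XML-significant characters with their entities.
function escapeXml(v) {
  return String(v)
    .replace(/&/g, "&amp;")   // must run first to avoid double-escaping
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;")
    .replace(/'/g, "&apos;");
}
```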

Frequently Asked Questions

How does delimiter auto-detection work?

The parser scans the first 5 lines and counts occurrences of each candidate delimiter (comma, semicolon, tab, pipe) per line. It calculates a consistency score by dividing the mean count by one plus the standard deviation. A true delimiter appears the same number of times on each line (low deviation), while a character appearing in field values shows irregular counts. The delimiter with the highest consistency score and a mean count of at least 1 is selected. If all candidates score equally, comma is used as the RFC 4180 default.
How does the parser handle commas or newlines inside quoted fields?

Per RFC 4180, any field containing the delimiter, a newline, or a double quote must be enclosed in double quotes. The parser's QUOTED state consumes all characters - including delimiters and newlines - until it encounters a closing double quote. A literal double quote inside a quoted field is represented as two consecutive double quotes (""), which the parser collapses to a single quote character in the output. This means a field like "Smith, John" correctly parses as a single value Smith, John rather than splitting into two columns.
How are CSV column headers converted to XML element names?

XML element names must start with a letter or underscore and contain only letters, digits, hyphens, underscores, and periods. The converter applies these transformations: spaces and special characters are replaced with underscores, leading digits are prefixed with an underscore (e.g., 3rd Quarter becomes _3rd_Quarter), empty headers receive a generic name column_N where N is the column index, and consecutive invalid characters are collapsed to a single underscore. This ensures the output is always well-formed XML that passes validation.
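The sanitization rules above can be sketched in a few lines. The helper name is an assumption, and this sketch covers only the rules the answer describes:

```javascript
// Turn a CSV header into a valid XML element name.
function sanitizeXmlName(header, index) {
  if (header.trim() === "") return `column_${index}`;   // empty header fallback
  let name = header.replace(/[^A-Za-z0-9._-]+/g, "_");  // runs of invalid chars -> one _
  if (/^[0-9]/.test(name)) name = "_" + name;           // no leading digit
  return name;
}
```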
Are numeric and boolean values converted automatically?

By default, all CSV values are strings. When the "Detect Types" option is enabled, the converter attempts to parse each value: numeric strings (matching the pattern /^-?\d+(\.\d+)?([eE][+-]?\d+)?$/) become JSON numbers, the literals true and false (case-insensitive) become JSON booleans, empty fields become null, and everything else remains a string. This heuristic covers most cases but may misinterpret values like zip codes (07001) or phone numbers. Disable type detection if numeric-looking strings must stay as strings.
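The heuristic can be sketched using the regex quoted above; the function name is an assumption. Note how the zip-code caveat shows up in practice:

```javascript
// Numeric pattern quoted in the answer above.
const NUMERIC = /^-?\d+(\.\d+)?([eE][+-]?\d+)?$/;

function coerce(value) {
  if (value === "") return null;                  // empty field -> null
  if (NUMERIC.test(value)) return Number(value);  // caveat: "07001" becomes 7001
  const lower = value.toLowerCase();
  if (lower === "true") return true;              // case-insensitive booleans
  if (lower === "false") return false;
  return value;                                   // everything else stays a string
}
```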
How large a file can the converter handle, and how fast is it?

The converter processes files up to 50 MB using a Web Worker to avoid blocking the main thread. For files under 50 KB, parsing runs on the main thread for faster response. Between 50 KB and 50 MB, the Web Worker handles parsing with progress updates. Above 50 MB, browser memory pressure may cause crashes depending on available RAM. A 10 MB CSV with 200,000 rows typically converts in under 3 seconds on modern hardware. The output file (especially XML) can be 3-5x larger than the input due to tag overhead.
Can the converter handle rows with missing or extra columns?

Yes. If a row has fewer fields than the header row, the missing fields are filled with empty strings (JSON) or empty elements (XML). If a row has more fields than the header, the extra fields are assigned generated column names (extra_1, extra_2, etc.) in the output. A warning toast notification appears indicating the row numbers with mismatched column counts so you can verify the source data integrity before using the output.
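The reconciliation rule can be sketched for the JSON path; the helper name is an assumption, and the XML path would emit empty elements instead of empty strings:

```javascript
// Map a parsed row onto the header names, padding short rows with ""
// and naming overflow fields extra_1, extra_2, ...
function normalizeRow(headers, row) {
  const out = {};
  headers.forEach((h, i) => { out[h] = i < row.length ? row[i] : ""; });
  for (let i = headers.length; i < row.length; i++) {
    out[`extra_${i - headers.length + 1}`] = row[i];
  }
  return out;
}
```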