About

Malformed CSV files silently corrupt data pipelines. A single trailing space in a join key causes failed lookups. An invisible empty row triggers off-by-one errors in row counters. An empty column inflates storage and breaks fixed-schema importers. This tool performs deterministic, cell-level trimming on RFC 4180-compliant CSV data. It strips leading and trailing whitespace from every cell value, removes structurally empty rows (where all fields are blank after trimming), eliminates fully empty columns, and optionally deduplicates rows by content hash. The parser handles quoted fields containing commas, embedded newlines (CRLF within double quotes), and escaped quote characters ("" sequences) per the RFC specification. It does not guess or infer - it tokenizes character by character. Note: this tool assumes UTF-8 encoding. Files with BOM markers are handled, but mixed encodings (e.g., Latin-1 fields inside a UTF-8 file) may produce garbled output for non-ASCII characters.

Formulas

The trimming pipeline applies operations in a deterministic order to avoid interaction effects between steps:

Pipeline(CSV) = Deduplicate(RemoveEmptyCols(RemoveEmptyRows(TrimCells(Parse(raw)))))

Where raw is the input text after BOM removal and line-ending normalization. Parse tokenizes per RFC 4180. TrimCells applies the regex /^\s+|\s+$/g to each unquoted cell value. RemoveEmptyRows filters rows where every cell satisfies cell = "". RemoveEmptyCols identifies column indices j where $\text{cell}_{i,j} = \texttt{""}$ for all $i \in \{0, \dots, n\}$, and removes them. Deduplicate hashes each row as a joined string and retains only the first occurrence.
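
The post-parse stages above can be sketched as plain functions over a 2D array of strings. This is a minimal illustration under stated assumptions, not the tool's actual source: the function names (trimCells, removeEmptyRows, removeEmptyCols, deduplicate, trimPipeline) are hypothetical, short rows are padded with empty strings, and the duplicate key is built with JSON.stringify rather than a plain join, so that ["a,b", "c"] and ["a", "b,c"] do not collide.

```javascript
// Hypothetical sketch of the post-parse pipeline stages.
// Input: rows already parsed into a 2D array of strings.

function trimCells(rows) {
  // Same regex as described in the text, applied to every cell.
  return rows.map((row) => row.map((cell) => cell.replace(/^\s+|\s+$/g, "")));
}

function removeEmptyRows(rows) {
  // Keep a row only if at least one cell is non-empty.
  return rows.filter((row) => row.some((cell) => cell !== ""));
}

function removeEmptyCols(rows) {
  const width = rows.reduce((w, row) => Math.max(w, row.length), 0);
  const keep = [];
  for (let j = 0; j < width; j++) {
    // Keep column j if any row (header included) has a value there.
    if (rows.some((row) => (row[j] || "") !== "")) keep.push(j);
  }
  // Assumption: rows shorter than the widest row are padded with "".
  return rows.map((row) => keep.map((j) => row[j] || ""));
}

function deduplicate(rows) {
  const seen = new Set();
  return rows.filter((row) => {
    // Assumption: JSON key instead of a plain joined string.
    const key = JSON.stringify(row);
    if (seen.has(key)) return false;
    seen.add(key);
    return true; // first occurrence is kept
  });
}

function trimPipeline(rows) {
  return deduplicate(removeEmptyCols(removeEmptyRows(trimCells(rows))));
}

const cleaned = trimPipeline([
  [" id ", "name ", ""],
  ["1", " Ada", " "],
  ["", "  ", ""],
  ["1", "Ada", ""],
  ["2", "Bob", ""],
]);
console.log(cleaned); // [["id","name"],["1","Ada"],["2","Bob"]]
```

Running deduplication last is what lets the rows "1, Ada" and "1,Ada" collapse into one: they only become identical after trimming.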

Row reduction ratio: $\frac{R_{original} - R_{trimmed}}{R_{original}} \times 100\%$
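
As a worked example of the ratio, the sketch below computes it for a file that shrinks from 1200 rows to 1050. The function name rowReductionRatio is hypothetical, and returning 0 for an empty input is an assumption made here to avoid division by zero; the source does not say how that case is handled.

```javascript
// Hypothetical helper: percentage of rows removed by the pipeline.
function rowReductionRatio(originalRows, trimmedRows) {
  // Assumption: an empty input is reported as 0% reduction.
  if (originalRows === 0) return 0;
  return ((originalRows - trimmedRows) / originalRows) * 100;
}

console.log(rowReductionRatio(1200, 1050)); // 12.5
```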

Reference Data

Trim Operation	Description	Risk if Skipped	RFC 4180 Safe
Cell Whitespace Trim	Removes leading/trailing spaces, tabs from each cell	Join key mismatches, sort errors	Yes
Empty Row Removal	Deletes rows where all cells are blank after trim	Off-by-one row count errors	Yes
Empty Column Removal	Deletes columns where all cells (incl. header) are blank	Schema inflation, wasted storage	Yes
Duplicate Row Removal	Removes rows with identical content (keeps first occurrence)	Double-counted records, inflated aggregates	Yes
Trailing Delimiter Strip	Removes trailing commas producing phantom empty columns	Extra NULL columns in parsers	Yes
BOM Removal	Strips UTF-8 BOM (0xEF 0xBB 0xBF) from file start	First header field unreadable	N/A
Consistent Line Endings	Normalizes CR, LF, CRLF to CRLF	Parsers split or merge rows incorrectly	Yes (CRLF required)
Quote Normalization	Ensures fields with delimiters/newlines are properly quoted	Downstream parsers break on unquoted commas	Yes
Header Trim	Trims header names independently of data rows	Column name lookup failures in code	Yes
Carriage Return in Cell	Preserves CRLF inside quoted fields during trim	Data loss if naively stripped	Yes
Tab-to-Space Collapse	Optionally replaces inner tabs with single space	Misaligned data in fixed-width consumers	N/A
Numeric Whitespace	Trims spaces around numbers (" 42 " → "42")	Type casting failures (NaN)	Yes
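
The BOM Removal and Consistent Line Endings rows in the table correspond to the preprocessing that produces raw before parsing. A minimal sketch, assuming the file has already been decoded into a JavaScript string (where a surviving UTF-8 BOM appears as the single code unit U+FEFF); the function name preprocess is hypothetical:

```javascript
// Hypothetical preprocessing step: strip a leading BOM, then normalize
// every line ending (CRLF, lone CR, lone LF) to CRLF.
function preprocess(text) {
  const withoutBom = text.charCodeAt(0) === 0xfeff ? text.slice(1) : text;
  // CRLF is listed first so an existing CRLF is matched as one unit
  // and is not expanded twice.
  return withoutBom.replace(/\r\n|\r|\n/g, "\r\n");
}

const normalized = preprocess("\uFEFFa,b\nc,d\r\ne,f\r");
console.log(JSON.stringify(normalized)); // "a,b\r\nc,d\r\ne,f\r\n"
```

Note that this sketch normalizes the whole text, so a lone LF inside a quoted field also becomes CRLF.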

Frequently Asked Questions

The parser implements a character-by-character lexer per RFC 4180. When it encounters an opening double-quote, it enters a "quoted" state and treats all characters - including commas, CRLF sequences, and other delimiters - as part of the field value until a closing unescaped double-quote is found. Escaped quotes (two consecutive double-quotes "") are collapsed to a single quote character. This means your multi-line cell data is preserved intact during trimming.
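
The quoted-state behavior described above can be sketched as a small character-by-character lexer. This is an illustration under stated assumptions, not the tool's actual parser: the name parseCsv and the delimiter parameter are hypothetical, a quote appearing in the middle of an unquoted field is leniently treated as opening a quoted section, and an unterminated quote simply runs to the end of the input.

```javascript
// Hypothetical sketch of an RFC 4180-style lexer with a "quoted" state.
function parseCsv(text, delimiter = ",") {
  const rows = [];
  let row = [];
  let field = "";
  let inQuotes = false;
  let pending = false; // true once the current record has consumed a character

  for (let i = 0; i < text.length; i++) {
    const ch = text[i];
    if (inQuotes) {
      if (ch !== '"') {
        field += ch; // commas and CR/LF are ordinary data inside quotes
      } else if (text[i + 1] === '"') {
        field += '"'; // escaped quote: "" collapses to "
        i++;
      } else {
        inQuotes = false; // closing quote
      }
      continue;
    }
    if (ch === "\r" || ch === "\n") {
      if (ch === "\r" && text[i + 1] === "\n") i++; // consume CRLF as one break
      row.push(field);
      rows.push(row);
      row = [];
      field = "";
      pending = false;
      continue;
    }
    pending = true;
    if (ch === '"') inQuotes = true;
    else if (ch === delimiter) {
      row.push(field);
      field = "";
    } else field += ch;
  }
  // Flush the last record when the input has no trailing line break.
  if (pending) {
    row.push(field);
    rows.push(row);
  }
  return rows;
}

const parsed = parseCsv('id,note\r\n1,"line one\r\nline two, with ""quotes"""\r\n');
console.log(parsed.length); // 2
console.log(parsed[1][1]); // prints two lines: line one / line two, with "quotes"
```

Because the field separator is a parameter, the same loop handles tab- or semicolon-separated input, for example parseCsv("a;b", ";").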

Trimming does not change numeric or date values. The trimmer operates on string values only. It removes leading and trailing whitespace characters (spaces, tabs) but does not alter the content between them. A value like " 3.14159 " becomes "3.14159" - no rounding, no format conversion. Date strings like " 2024-01-15 " become "2024-01-15" without reinterpretation. The tool never parses numbers or dates as typed values.

A column is removed only if every cell in that column - including the header row - is blank after whitespace trimming. If even one row has a non-empty value in that column, it is retained. This prevents accidental data loss in sparse datasets where a column may have values in only a few rows.

Duplicate detection operates on the parsed cell values, not the raw CSV text. Two rows are considered duplicates if every corresponding cell value is identical after trimming. Quoting differences (e.g., one row has a quoted field and another has the same value unquoted) are irrelevant - the comparison uses the parsed content. The first occurrence is always kept; subsequent duplicates are removed.

Files up to approximately 50 MB can be processed. For files exceeding 1 MB, parsing is offloaded to a Web Worker to prevent the browser UI from freezing. A progress indicator is displayed during processing. Memory usage scales linearly with file size since the parsed 2D array is held in memory. For extremely large files (beyond 50 MB), consider splitting them with a command-line tool like "split" before using this trimmer.

Delimiters other than comma are supported. You can configure the delimiter character in the settings panel. The default is comma, but tab (TSV) and semicolon (common in European locales where comma is the decimal separator) are selectable. The parser logic is delimiter-agnostic - it uses whatever character you specify as the field separator while still respecting double-quote escaping rules.