
About

CSV files grow deceptively large. A 50 MB export from a database can contain 40% redundant whitespace, unnecessary RFC 4180 quoting, and thousands of repeated cell values. Uploading, transferring, or archiving such files wastes bandwidth and storage. This tool applies real compression: whitespace stripping, quote optimization per RFC 4180 rules, empty row/column removal, and standards-compliant GZIP via the browser's native CompressionStream API. For tabular data with high value repetition, dictionary encoding alone can reduce size by 30-60% before binary compression. All processing runs in your browser; no data is uploaded to any server.
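The clean-then-compress pipeline described above can be sketched in a few lines. The following is an illustrative Python stand-in (the browser tool itself uses the native CompressionStream API, not Python's gzip module); the function name `clean_and_gzip` is hypothetical:

```python
import csv
import gzip
import io

def clean_and_gzip(csv_text: str) -> bytes:
    """Trim cell whitespace, drop fully empty rows, then GZIP.

    Illustrative stand-in for the browser pipeline, which uses the
    native CompressionStream API instead of Python's gzip module.
    """
    reader = csv.reader(io.StringIO(csv_text))
    cleaned_rows = []
    for row in reader:
        trimmed = [cell.strip() for cell in row]
        if any(trimmed):                      # skip rows with no content
            cleaned_rows.append(trimmed)
    out = io.StringIO()
    # csv.writer defaults to QUOTE_MINIMAL: quotes only when RFC 4180 requires
    csv.writer(out, lineterminator="\n").writerows(cleaned_rows)
    return gzip.compress(out.getvalue().encode("utf-8"))

raw = "name , country \nAlice , US \n , \nBob , US \n"
compressed = clean_and_gzip(raw)
restored = gzip.decompress(compressed).decode("utf-8")
print(restored)  # name,country / Alice,US / Bob,US (empty row removed)
```

The cleaning pass is lossless at the data level (cell values are preserved), while the GZIP stage is lossless at the byte level.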

Limitations: dictionary encoding assumes repetition exists. Files with entirely unique cell values see minimal gain from that method. GZIP output (.csv.gz) requires decompression before use in spreadsheet software. The tool approximates compression ratios; actual ratios depend on data entropy. For files exceeding 500 MB, consider command-line tools due to browser memory constraints.


Formulas

Compression ratio quantifies the effectiveness of size reduction. The ratio R is defined as:

R = (S_original − S_compressed) / S_original × 100%

Where S_original is the original file size in bytes and S_compressed is the compressed file size in bytes. A ratio of 80% means the file is 5× smaller.
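A quick sanity check of the formula (the helper name `compression_ratio` is illustrative, not part of the tool's API):

```python
def compression_ratio(original_bytes: int, compressed_bytes: int) -> float:
    """R = (S_original - S_compressed) / S_original * 100."""
    return (original_bytes - compressed_bytes) / original_bytes * 100

# A 50 MB file compressed to 10 MB: 80% reduction, i.e. 5x smaller
print(compression_ratio(50_000_000, 10_000_000))  # 80.0
```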

GZIP uses the DEFLATE algorithm, which combines LZ77 (sliding window dictionary matching) with Huffman coding. The theoretical entropy limit for compression of a byte stream is given by Shannon entropy:

H = −Σᵢ₌₁ⁿ pᵢ log₂ pᵢ

Where pᵢ is the probability of the i-th symbol. Data with low entropy (many repeated values) compresses better. CSV files with categorical columns (e.g., country codes, status flags) typically exhibit low entropy per column, making them excellent candidates for both dictionary encoding and DEFLATE.
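The entropy formula is easy to compute over byte frequencies. A minimal sketch (the function name is illustrative) showing why a repetitive categorical column compresses far better than unique data:

```python
import math
from collections import Counter

def shannon_entropy(data: bytes) -> float:
    """H = -sum(p_i * log2(p_i)) over byte frequencies, in bits per byte."""
    counts = Counter(data)
    total = len(data)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

# A status-flag column repeats heavily: low entropy, highly compressible
low = shannon_entropy(b"active,active,active,active")
# All 256 byte values, each once: maximum entropy of 8 bits/byte
high = shannon_entropy(bytes(range(256)))
print(high)  # 8.0
```

A single constant value gives H = 0; GZIP can do little better than H bits per symbol, which is why high-entropy data (UUIDs, random floats) resists compression.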

Quote optimization follows RFC 4180: a field requires quoting only if it contains the delimiter (,), a double-quote ("), or a newline (CRLF). All other quotes are redundant overhead.
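The RFC 4180 quoting rule can be sketched as a small predicate plus an emitter (both function names are illustrative; embedded double-quotes must also be doubled when a field is quoted):

```python
def needs_quoting(field: str, delimiter: str = ",") -> bool:
    """Per RFC 4180, a field must be quoted only if it contains the
    delimiter, a double-quote, or a line break."""
    return any(ch in field for ch in (delimiter, '"', "\r", "\n"))

def optimize_field(field: str, delimiter: str = ",") -> str:
    """Emit the field quoted only when RFC 4180 requires it,
    doubling any embedded double-quotes."""
    if needs_quoting(field, delimiter):
        return '"' + field.replace('"', '""') + '"'
    return field

print(optimize_field("hello"))     # hello  (quotes would be redundant)
print(optimize_field("a,b"))       # "a,b"
print(optimize_field('say "hi"'))  # "say ""hi"""
```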

Reference Data

| Compression Method | Type | Typical Reduction | Output Format | Reversible | Best For |
| --- | --- | --- | --- | --- | --- |
| Whitespace Removal | Lossless | 5-15% | .csv | N/A (cleaned) | Padded exports |
| Quote Optimization | Lossless | 3-10% | .csv | N/A (cleaned) | Over-quoted CSVs |
| Empty Row Removal | Lossless | 1-20% | .csv | N/A (cleaned) | Sparse datasets |
| Line Ending Normalization | Lossless | 0-2% | .csv | N/A | Cross-platform files |
| BOM Removal | Lossless | 3 bytes | .csv | N/A | UTF-8 with BOM |
| GZIP (DEFLATE) | Lossless | 60-90% | .csv.gz | Yes | Archival, transfer |
| Dictionary Encoding | Lossless | 30-60% | .csv | Yes (with header) | High-repetition data |
| Column Removal | Lossy | Varies | .csv | No | Unneeded columns |
| GZIP Level 1 (Fast) | Lossless | 50-75% | .csv.gz | Yes | Speed priority |
| GZIP Level 9 (Max) | Lossless | 65-92% | .csv.gz | Yes | Size priority |
| Combined (Clean + GZIP) | Lossless | 70-95% | .csv.gz | Partial | Maximum compression |
| TSV → CSV (tab removal) | Lossless | 0-5% | .csv | N/A | Tab-delimited input |

Frequently Asked Questions

Does GZIP compression lose any data?

No. GZIP is a lossless compression algorithm based on DEFLATE (LZ77 + Huffman coding). The decompressed output is byte-identical to the input. However, if you also enable CSV cleaning options (whitespace removal, quote optimization), those transformations modify the CSV text before compression. The cleaning operations are also lossless in terms of data semantics (cell values remain identical), but the raw byte representation changes.
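The byte-identical round trip is easy to verify; a minimal sketch using Python's stdlib gzip module as a stand-in for the browser's CompressionStream:

```python
import gzip

# Highly repetitive CSV bytes, as in a categorical status column
original = b"id,status\n" + b"1,active\n" * 1000

compressed = gzip.compress(original, compresslevel=9)
restored = gzip.decompress(compressed)

assert restored == original  # byte-identical: GZIP is lossless
print(len(original), len(compressed))  # repetitive data compresses hard
```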
Why do some CSV files compress much better than others?

Compression effectiveness depends on data entropy. A CSV with many repeated values (e.g., a "country" column with 10 distinct values across 100,000 rows) has low entropy and compresses extremely well, often exceeding 90% reduction. Conversely, a CSV of random UUIDs or floating-point sensor readings has high entropy and may only compress 20-40%. The Shannon entropy formula H = −Σ pᵢ log₂ pᵢ governs the theoretical minimum.
Can I open a .csv.gz file directly in Excel or Google Sheets?

Google Sheets cannot directly import .csv.gz files. Excel on Windows can sometimes handle them if the file association is configured, but this is unreliable. You should decompress the file first using any archive tool (7-Zip, WinRAR, macOS Archive Utility, or the "gunzip" command). The decompressed file is a standard .csv that opens normally.
Why is there a 500 MB file size limit?

The tool caps input at 500 MB as a safety measure. Browser tabs typically have 1-4 GB of memory available depending on the OS and browser, and a 500 MB CSV requires roughly 1-1.5 GB of RAM to parse and compress simultaneously. For larger files, use command-line tools such as "gzip" on Linux/macOS or "7z a -tgzip output.csv.gz input.csv" on Windows.
Will quote optimization break my CSV for other tools?

No. Quote optimization strictly follows RFC 4180. It removes quotes only from fields that do not contain the delimiter character, double-quote characters, or embedded newlines. Any compliant CSV parser (Python csv module, pandas.read_csv, R read.csv, Excel) will parse the optimized file identically. If your downstream tool relies on non-standard parsing that expects all fields quoted, disable this option.
What is the difference between dictionary encoding and GZIP?

Dictionary encoding is CSV-level semantic compression: it replaces repeated cell values with short integer tokens and stores the mapping in a header, producing a valid (modified) CSV. GZIP is byte-level binary compression applied to the entire file stream. They operate at different layers and can be combined: dictionary-encode first to reduce redundancy, then GZIP the result for maximum compression. Dictionary encoding is most effective when fewer than 1,000 distinct values repeat across millions of cells.
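The token-substitution idea can be sketched for a single column. This is an illustrative sketch, not the tool's actual encoding format (which stores the mapping in a file header); both function names are hypothetical:

```python
def dictionary_encode_column(values):
    """Replace repeated cell values with small integer tokens.

    Returns (tokens, mapping). Keeping the mapping alongside the
    tokens is what makes the transformation reversible.
    """
    mapping = {}
    tokens = []
    for v in values:
        if v not in mapping:
            mapping[v] = len(mapping)  # first occurrence gets next token
        tokens.append(mapping[v])
    return tokens, mapping

def dictionary_decode_column(tokens, mapping):
    """Invert the mapping to recover the original cell values."""
    reverse = {i: v for v, i in mapping.items()}
    return [reverse[t] for t in tokens]

col = ["US", "DE", "US", "US", "FR", "DE"]
tokens, mapping = dictionary_encode_column(col)
print(tokens)  # [0, 1, 0, 0, 2, 1]
assert dictionary_decode_column(tokens, mapping) == col  # reversible
```

Short integer tokens are themselves highly repetitive, which is why a GZIP pass over the dictionary-encoded output compounds the gains.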