
About

CSV files grow deceptively large. A 50 MB export from a database can contain 40% redundant whitespace, unnecessary RFC 4180 quoting, and thousands of repeated cell values. Uploading, transferring, or archiving such files wastes bandwidth and storage. This tool applies real compression: whitespace stripping, quote optimization per RFC 4180 rules, empty row/column removal, and standards-compliant GZIP via the browser's native CompressionStream API. For tabular data with high value repetition, dictionary encoding alone can reduce size by 30-60% before binary compression. All processing runs in your browser; no data is uploaded to any server.
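The clean-then-compress pipeline described above can be sketched in a few lines. The following is an illustrative Python stand-in (the browser tool itself uses the native CompressionStream API, not Python's gzip module); the function name `clean_and_gzip` is hypothetical:

```python
import csv
import gzip
import io

def clean_and_gzip(csv_text: str) -> bytes:
    """Trim cell whitespace, drop fully empty rows, then GZIP.

    Illustrative stand-in for the browser pipeline, which uses the
    native CompressionStream API instead of Python's gzip module.
    """
    reader = csv.reader(io.StringIO(csv_text))
    cleaned_rows = []
    for row in reader:
        trimmed = [cell.strip() for cell in row]
        if any(trimmed):                      # skip rows with no content
            cleaned_rows.append(trimmed)
    out = io.StringIO()
    # csv.writer defaults to QUOTE_MINIMAL: quotes only when RFC 4180 requires
    csv.writer(out, lineterminator="\n").writerows(cleaned_rows)
    return gzip.compress(out.getvalue().encode("utf-8"))

raw = "name , country \nAlice , US \n , \nBob , US \n"
compressed = clean_and_gzip(raw)
restored = gzip.decompress(compressed).decode("utf-8")
print(restored)  # name,country / Alice,US / Bob,US (empty row removed)
```

The cleaning pass is lossless at the data level (cell values are preserved), while the GZIP stage is lossless at the byte level.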

Limitations: dictionary encoding assumes repetition exists. Files with entirely unique cell values see minimal gain from that method. GZIP output (.csv.gz) requires decompression before use in spreadsheet software. The tool approximates compression ratios; actual ratios depend on data entropy. For files exceeding 500 MB, consider command-line tools due to browser memory constraints.


Formulas

Compression ratio quantifies the effectiveness of size reduction. The ratio R is defined as:

R = (S_original − S_compressed) / S_original × 100%

Where S_original is the original file size in bytes and S_compressed is the compressed file size in bytes. A ratio of 80% means the file is 5× smaller.
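A quick sanity check of the formula (the helper name `compression_ratio` is illustrative, not part of the tool's API):

```python
def compression_ratio(original_bytes: int, compressed_bytes: int) -> float:
    """R = (S_original - S_compressed) / S_original * 100."""
    return (original_bytes - compressed_bytes) / original_bytes * 100

# A 50 MB file compressed to 10 MB: 80% reduction, i.e. 5x smaller
print(compression_ratio(50_000_000, 10_000_000))  # 80.0
```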

GZIP uses the DEFLATE algorithm, which combines LZ77 (sliding window dictionary matching) with Huffman coding. The theoretical entropy limit for compression of a byte stream is given by Shannon entropy:

H = −Σᵢ₌₁ⁿ pᵢ log₂ pᵢ

Where pᵢ is the probability of the i-th symbol. Data with low entropy (many repeated values) compresses better. CSV files with categorical columns (e.g., country codes, status flags) typically exhibit low entropy per column, making them excellent candidates for both dictionary encoding and DEFLATE.
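The entropy formula is easy to compute over byte frequencies. A minimal sketch (the function name is illustrative) showing why a repetitive categorical column compresses far better than unique data:

```python
import math
from collections import Counter

def shannon_entropy(data: bytes) -> float:
    """H = -sum(p_i * log2(p_i)) over byte frequencies, in bits per byte."""
    counts = Counter(data)
    total = len(data)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

# A status-flag column repeats heavily: low entropy, highly compressible
low = shannon_entropy(b"active,active,active,active")
# All 256 byte values, each once: maximum entropy of 8 bits/byte
high = shannon_entropy(bytes(range(256)))
print(high)  # 8.0
```

A single constant value gives H = 0; GZIP can do little better than H bits per symbol, which is why high-entropy data (UUIDs, random floats) resists compression.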

Quote optimization follows RFC 4180: a field requires quoting only if it contains the delimiter (,), a double-quote ("), or a newline (CRLF). All other quotes are redundant overhead.
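The RFC 4180 quoting rule can be sketched as a small predicate plus an emitter (both function names are illustrative; embedded double-quotes must also be doubled when a field is quoted):

```python
def needs_quoting(field: str, delimiter: str = ",") -> bool:
    """Per RFC 4180, a field must be quoted only if it contains the
    delimiter, a double-quote, or a line break."""
    return any(ch in field for ch in (delimiter, '"', "\r", "\n"))

def optimize_field(field: str, delimiter: str = ",") -> str:
    """Emit the field quoted only when RFC 4180 requires it,
    doubling any embedded double-quotes."""
    if needs_quoting(field, delimiter):
        return '"' + field.replace('"', '""') + '"'
    return field

print(optimize_field("hello"))     # hello  (quotes would be redundant)
print(optimize_field("a,b"))       # "a,b"
print(optimize_field('say "hi"'))  # "say ""hi"""
```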

Reference Data

| Compression Method | Type | Typical Reduction | Output Format | Reversible | Best For |
| --- | --- | --- | --- | --- | --- |
| Whitespace Removal | Lossless | 5-15% | .csv | N/A (cleaned) | Padded exports |
| Quote Optimization | Lossless | 3-10% | .csv | N/A (cleaned) | Over-quoted CSVs |
| Empty Row Removal | Lossless | 1-20% | .csv | N/A (cleaned) | Sparse datasets |
| Line Ending Normalization | Lossless | 0-2% | .csv | N/A | Cross-platform files |
| BOM Removal | Lossless | 3 bytes | .csv | N/A | UTF-8 with BOM |
| GZIP (DEFLATE) | Lossless | 60-90% | .csv.gz | Yes | Archival, transfer |
| Dictionary Encoding | Lossless | 30-60% | .csv | Yes (with header) | High-repetition data |
| Column Removal | Lossy | Varies | .csv | No | Unneeded columns |
| GZIP Level 1 (Fast) | Lossless | 50-75% | .csv.gz | Yes | Speed priority |
| GZIP Level 9 (Max) | Lossless | 65-92% | .csv.gz | Yes | Size priority |
| Combined (Clean + GZIP) | Lossless | 70-95% | .csv.gz | Partial | Maximum compression |
| TSV → CSV (tab removal) | Lossless | 0-5% | .csv | N/A | Tab-delimited input |

Frequently Asked Questions

Does GZIP compression lose any data?

No. GZIP is a lossless compression algorithm based on DEFLATE (LZ77 + Huffman coding). The decompressed output is byte-identical to the input. However, if you also enable CSV cleaning options (whitespace removal, quote optimization), those transformations modify the CSV text before compression. The cleaning operations are also lossless in terms of data semantics (cell values remain identical), but the raw byte representation changes.
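The byte-identical round trip is easy to verify; a minimal sketch using Python's stdlib gzip module as a stand-in for the browser's CompressionStream:

```python
import gzip

# Highly repetitive CSV bytes, as in a categorical status column
original = b"id,status\n" + b"1,active\n" * 1000

compressed = gzip.compress(original, compresslevel=9)
restored = gzip.decompress(compressed)

assert restored == original  # byte-identical: GZIP is lossless
print(len(original), len(compressed))  # repetitive data compresses hard
```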
Why do some CSV files compress much better than others?

Compression effectiveness depends on data entropy. A CSV with many repeated values (e.g., a "country" column with 10 distinct values across 100,000 rows) has low entropy and compresses extremely well, often exceeding 90% reduction. Conversely, a CSV of random UUIDs or floating-point sensor readings has high entropy and may only compress 20-40%. The Shannon entropy formula H = −Σ pᵢ log₂ pᵢ governs the theoretical minimum.
Can I open a .csv.gz file directly in Excel or Google Sheets?

Google Sheets cannot directly import .csv.gz files. Excel on Windows can sometimes handle them if the file association is configured, but this is unreliable. You should decompress the file first using any archive tool (7-Zip, WinRAR, macOS Archive Utility, or the "gunzip" command). The decompressed file is a standard .csv that opens normally.
Why is there a 500 MB file size limit?

The tool caps input at 500 MB as a safety measure. Browser tabs typically have 1-4 GB of memory available depending on the OS and browser, and a 500 MB CSV requires roughly 1-1.5 GB of RAM to parse and compress simultaneously. For larger files, use command-line tools such as "gzip" on Linux/macOS or "7z a -tgzip output.csv.gz input.csv" on Windows.
Will quote optimization break my CSV for other tools?

No. Quote optimization strictly follows RFC 4180. It removes quotes only from fields that do not contain the delimiter character, double-quote characters, or embedded newlines. Any compliant CSV parser (Python csv module, pandas.read_csv, R read.csv, Excel) will parse the optimized file identically. If your downstream tool relies on non-standard parsing that expects all fields quoted, disable this option.
What is the difference between dictionary encoding and GZIP?

Dictionary encoding is CSV-level semantic compression: it replaces repeated cell values with short integer tokens and stores the mapping in a header, producing a valid (modified) CSV. GZIP is byte-level binary compression applied to the entire file stream. They operate at different layers and can be combined: dictionary-encode first to reduce redundancy, then GZIP the result for maximum compression. Dictionary encoding is most effective when fewer than 1,000 distinct values repeat across millions of cells.
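The token-substitution idea can be sketched for a single column. This is an illustrative sketch, not the tool's actual encoding format (which stores the mapping in a file header); both function names are hypothetical:

```python
def dictionary_encode_column(values):
    """Replace repeated cell values with small integer tokens.

    Returns (tokens, mapping). Keeping the mapping alongside the
    tokens is what makes the transformation reversible.
    """
    mapping = {}
    tokens = []
    for v in values:
        if v not in mapping:
            mapping[v] = len(mapping)  # first occurrence gets next token
        tokens.append(mapping[v])
    return tokens, mapping

def dictionary_decode_column(tokens, mapping):
    """Invert the mapping to recover the original cell values."""
    reverse = {i: v for v, i in mapping.items()}
    return [reverse[t] for t in tokens]

col = ["US", "DE", "US", "US", "FR", "DE"]
tokens, mapping = dictionary_encode_column(col)
print(tokens)  # [0, 1, 0, 0, 2, 1]
assert dictionary_decode_column(tokens, mapping) == col  # reversible
```

Short integer tokens are themselves highly repetitive, which is why a GZIP pass over the dictionary-encoded output compounds the gains.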