Compress TSV
Compress TSV files online: trim whitespace, remove empty rows, deduplicate values, and optimize numeric precision. Download smaller TSV instantly.
About
Tab-separated value files accumulate dead weight fast. Trailing spaces in cells, empty rows from careless exports, quoted values that need no quoting, and decimal numbers padded with zeros all inflate file size without adding information. A 50 MB database export can shed 15 - 40% of its byte count through deterministic cleaning alone. This tool applies five compression passes: cell-level whitespace trimming, empty row and trailing empty column removal, numeric precision normalization (e.g., 3.10000 → 3.1), redundant quote stripping, and optional dictionary encoding for repeated string values. The output remains valid TSV: no proprietary format, no decompression step required.
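The first four passes can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the tool's actual implementation: the function names (`normalize_number`, `strip_quotes`, `clean_tsv`) are hypothetical, only plain decimals such as 3.10000 are normalized, and a value like 4.000 is assumed to collapse to 4.

```python
import re

# Plain decimals with a fractional part, e.g. "3.10000" or "-0.50".
# Integers such as "100" do not match, so their zeros are left alone.
_DECIMAL = re.compile(r"-?\d+\.\d+")


def normalize_number(cell: str) -> str:
    """Strip trailing zeros from a plain decimal: '3.10000' -> '3.1'."""
    if not _DECIMAL.fullmatch(cell):
        return cell
    # Assumption: an all-zero fraction collapses to an integer ('4.000' -> '4').
    return cell.rstrip("0").rstrip(".")


def strip_quotes(cell: str) -> str:
    """Drop surrounding double quotes when the content has no special chars."""
    if len(cell) >= 2 and cell[0] == '"' and cell[-1] == '"':
        inner = cell[1:-1]
        if not any(ch in inner for ch in '"\t\r\n'):
            return inner
    return cell


def clean_tsv(text: str) -> str:
    """Trim cells, strip quotes, normalize numbers, drop empty rows/columns."""
    rows = []
    for line in re.split(r"\r?\n", text):
        cells = [normalize_number(strip_quotes(c.strip())) for c in line.split("\t")]
        if any(cells):  # skip rows where every cell is empty
            rows.append(cells)
    if not rows:
        return ""
    # Width = position of the last column that holds data in at least one row.
    width = max(max(i for i, c in enumerate(r) if c) + 1 for r in rows)
    return "".join("\t".join(r[:width]) + "\n" for r in rows)
```

Calling `clean_tsv` on an export whose rows all end in a stray tab removes that trailing empty column as well as the padding inside each cell. Quoted values containing an embedded quote, tab, or line break are deliberately left untouched.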
Neglecting TSV hygiene before import causes silent failures. Spreadsheet applications miscount columns when trailing tabs exist. Database COPY commands reject rows with inconsistent column counts. This tool validates column-count consistency across all rows and flags violations before you compress, so the output is both smaller and structurally sound. Approximation note: dictionary encoding replaces only values appearing ≥ 3 times with length ≥ 4 characters; short or unique values pass through unchanged.
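A column-count check of this kind might look like the sketch below. It assumes the expected width is the most common column count (the mode) and that blank lines are ignored; `find_column_violations` is a hypothetical name, not part of the tool.

```python
from collections import Counter


def find_column_violations(text: str):
    """Return (expected_columns, [(line_number, column_count), ...])."""
    counts = []
    for line_number, line in enumerate(text.split("\n"), start=1):
        line = line.rstrip("\r")
        if line == "":
            continue  # blank lines are not treated as data rows here
        counts.append((line_number, line.count("\t") + 1))
    if not counts:
        return 0, []
    # Assumption: the most frequent column count is the intended width.
    expected = Counter(n for _, n in counts).most_common(1)[0][0]
    violations = [(ln, n) for ln, n in counts if n != expected]
    return expected, violations
```

Line numbers are 1-based and refer to the original file, so a flagged row can be located directly in a text editor.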
Formulas
The compression ratio is calculated as the percentage reduction in byte size from original to compressed output:
$$R = \frac{S_{\text{original}} - S_{\text{compressed}}}{S_{\text{original}}} \times 100$$
Where $R$ is the compression ratio in percent, $S_{\text{original}}$ is the original byte size (UTF-8 encoded), and $S_{\text{compressed}}$ is the output byte size.
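As a small worked example, the ratio can be computed from two strings by measuring their UTF-8 byte lengths. The function name `compression_ratio` is illustrative, and returning 0 for empty input is an assumption made to avoid division by zero.

```python
def compression_ratio(original: str, compressed: str) -> float:
    """Percentage reduction in UTF-8 byte size from original to compressed."""
    s_original = len(original.encode("utf-8"))
    s_compressed = len(compressed.encode("utf-8"))
    if s_original == 0:
        return 0.0  # assumption: an empty input has nothing to compress
    return (s_original - s_compressed) / s_original * 100
```

Sizes are measured in bytes rather than characters, so a non-ASCII character such as "é" counts as two bytes.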
Dictionary encoding eligibility for a cell value v:
$$\text{len}(v) \geq 4 \quad \text{and} \quad \text{freq}(v) \geq 3$$
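Where len(v) is the character count of the value, and freq(v) is the number of times the value appears across all cells. The savings from replacing value v with token t is:
$$\text{savings}(v) = \text{freq}(v) \times \bigl(\text{len}(v) - \text{len}(t)\bigr) - \bigl(\text{len}(v) + 1 + \text{len}(t)\bigr)$$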
The subtracted term accounts for the dictionary header entry cost: the original value, a separator, and the token definition.
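The eligibility test and savings estimate can be sketched as follows. The thresholds (length of at least 4, frequency of at least 3) come from the text; the savings expression is an assumption based on the description above, namely freq(v) × (len(v) − len(t)) minus a header cost of len(v) + 1 + len(t), with a one-character separator. The function name and the fixed two-character token length are hypothetical.

```python
from collections import Counter

MIN_LENGTH = 4     # minimum characters for a value to be eligible
MIN_FREQUENCY = 3  # minimum occurrences for a value to be eligible


def dictionary_candidates(rows, token_length=2):
    """Map each eligible value to its estimated savings in characters."""
    freq = Counter(cell for row in rows for cell in row)
    candidates = {}
    for value, count in freq.items():
        if len(value) < MIN_LENGTH or count < MIN_FREQUENCY:
            continue
        # Per-occurrence saving, minus the header entry:
        # original value + one separator character + token definition.
        header_cost = len(value) + 1 + token_length
        savings = count * (len(value) - token_length) - header_cost
        if savings > 0:  # only keep replacements that actually shrink the file
            candidates[value] = savings
    return candidates
```

For a value of 7 characters appearing 3 times with a 2-character token, the estimate is 3 × 5 − 10 = 5 characters saved, so a value that barely meets both thresholds yields only a small gain.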
Reference Data
| Compression Pass | Technique | Typical Savings | Reversible | Affects Data |
|---|---|---|---|---|
| Whitespace Trim | Strip leading/trailing spaces per cell | 5 - 15% | No (lossy on whitespace) | No (content preserved) |
| Empty Row Removal | Delete rows where all cells are empty | 1 - 10% | No | No |
| Trailing Empty Columns | Remove trailing tabs producing empty columns | 2 - 8% | No | No |
| Numeric Normalization | Strip trailing zeros: 4.200 → 4.2 | 3 - 12% | No (precision reduced) | Numeric precision only |
| Quote Stripping | Remove quotes from cells without special chars | 1 - 5% | No | No |
| Dictionary Encoding | Replace repeated strings with short tokens | 5 - 25% | Yes (header included) | Format changes (header line added) |
| Consistent Column Count | Pad short rows / trim long rows to mode | 0% (structural fix) | No | Structural only |
| BOM Removal | Strip UTF-8 BOM (0xEF 0xBB 0xBF) | 3 bytes | No | No |
| Line Ending Normalization | Convert CRLF to LF | 1 - 5% | No | No |
| Duplicate Row Removal | Remove exact duplicate rows (optional) | 0 - 50% | No | Yes (rows deleted) |
| UTF-8 Encoding | Standard output encoding | Baseline | N/A | No |
| Max Cell Length (typical) | 32767 characters (Excel limit) | N/A | N/A | Validation check |
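
Three of the byte-level rows in the table (BOM removal, line ending normalization, and optional duplicate row removal) are simple enough to sketch directly. The function names are hypothetical, and keeping the first occurrence of each duplicate row is an assumption.

```python
UTF8_BOM = b"\xef\xbb\xbf"


def normalize_bytes(data: bytes) -> bytes:
    """Strip a leading UTF-8 BOM and convert CRLF line endings to LF."""
    if data.startswith(UTF8_BOM):
        data = data[len(UTF8_BOM):]
    return data.replace(b"\r\n", b"\n")


def drop_duplicate_rows(text: str) -> str:
    """Remove exact duplicate rows, keeping the first occurrence of each."""
    seen = set()
    kept = []
    for line in text.split("\n"):
        if line and line in seen:
            continue  # exact duplicate of an earlier non-empty row
        seen.add(line)
        kept.append(line)
    return "\n".join(kept)
```

Duplicate removal deletes data rather than formatting, which is why the table marks it as optional.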