About

Tab-separated value files accumulate dead weight fast. Trailing spaces in cells, empty rows from careless exports, quoted values that need no quoting, and decimal numbers padded with zeros all inflate file size without adding information. A 50 MB database export can shed 15-40% of its byte count through deterministic cleaning alone. This tool applies five compression passes: cell-level whitespace trimming, empty row and trailing empty column removal, numeric precision normalization (e.g., 3.10000 → 3.1), redundant quote stripping, and optional dictionary encoding for repeated string values. The output remains valid TSV: no proprietary format, no decompression step required.
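The structural passes are simple enough to sketch. Below is a minimal Python illustration of three of them (per-cell trimming, empty-row removal, and trailing empty-column removal). The tool itself runs in the browser, so this is an approximation of the behavior described, not its actual implementation; the function name is illustrative.

```python
def clean_rows(text):
    """Sketch of three structural passes: per-cell whitespace trim,
    empty-row removal, and trailing empty-column removal."""
    rows = []
    for line in text.split('\n'):
        cells = [c.strip() for c in line.split('\t')]
        while cells and cells[-1] == '':   # drop trailing empty columns
            cells.pop()
        if cells:                          # skip rows where every cell is empty
            rows.append('\t'.join(cells))
    return '\n'.join(rows)
```

Note this simplification trims trailing empty cells per row; the tool's column-count validation (described below) additionally checks that rows stay consistent with each other.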

Neglecting TSV hygiene before import causes silent failures. Spreadsheet applications miscount columns when trailing tabs exist, and database COPY commands reject rows with inconsistent column counts. This tool validates column-count consistency across all rows and flags violations before you compress, so the output is both smaller and structurally sound. Approximation note: dictionary encoding replaces only values appearing ≥3 times with length ≥4 characters; short or unique values pass through unchanged.


Formulas

The compression ratio is calculated as the percentage reduction in byte size from original to compressed output:

R = (S_original − S_compressed) / S_original × 100%

Where R is the compression ratio in percent, S_original is the original byte size (UTF-8 encoded), and S_compressed is the output byte size.
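Computed on UTF-8 byte counts, the ratio is a one-liner. A sketch (the function name is illustrative, and it assumes a non-empty input):

```python
def compression_ratio(original: bytes, compressed: bytes) -> float:
    """R = (S_original - S_compressed) / S_original * 100,
    with sizes measured in UTF-8 bytes. Assumes original is non-empty."""
    s_orig, s_comp = len(original), len(compressed)
    return (s_orig - s_comp) / s_orig * 100.0
```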

Dictionary encoding eligibility for a cell value v:

encode(v) = TRUE,  if len(v) ≥ 4 and freq(v) ≥ 3
encode(v) = FALSE, otherwise

Where len(v) is the character count of the value, and freq(v) is the number of times the value appears across all cells. The savings from replacing value v with token t is:

ΔS = freq(v) × (len(v) − len(t)) − (len(v) + len(t) + 2)

The subtracted term accounts for the dictionary header entry cost: the original value, the token, and two delimiter characters.
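Putting the eligibility test and the savings formula together gives a simple planning step. A sketch; the fixed two-character token length and the `dictionary_plan` name are assumptions for illustration, not the tool's actual parameters:

```python
from collections import Counter

def dictionary_plan(cells, min_len=4, min_freq=3, token_len=2):
    """For each eligible value v (len(v) >= 4, freq(v) >= 3), compute
    dS = freq(v) * (len(v) - len(t)) - (len(v) + len(t) + 2)
    and keep only replacements that save bytes."""
    freq = Counter(cells)
    plan = {}
    for v, f in freq.items():
        if len(v) >= min_len and f >= min_freq:
            savings = f * (len(v) - token_len) - (len(v) + token_len + 2)
            if savings > 0:
                plan[v] = savings
    return plan
```

Note that a value can be eligible yet yield zero or negative savings (e.g., a 4-character value appearing exactly 3 times), which is why the plan keeps only strictly positive ΔS.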

Reference Data

| Compression Pass | Technique | Typical Savings | Reversible | Affects Data |
|---|---|---|---|---|
| Whitespace Trim | Strip leading/trailing spaces per cell | 5-15% | No (lossy on whitespace) | No (content preserved) |
| Empty Row Removal | Delete rows where all cells are empty | 1-10% | No | No |
| Trailing Empty Columns | Remove trailing tabs producing empty columns | 2-8% | No | No |
| Numeric Normalization | Strip trailing zeros: 4.200 → 4.2 | 3-12% | No (precision reduced) | Numeric precision only |
| Quote Stripping | Remove quotes from cells without special chars | 1-5% | No | No |
| Dictionary Encoding | Replace repeated strings with short tokens | 5-25% | Yes (header included) | Format changes (header line added) |
| Consistent Column Count | Pad short rows / trim long rows to mode | 0% (structural fix) | No | Structural only |
| BOM Removal | Strip UTF-8 BOM (0xEF 0xBB 0xBF) | 3 bytes | No | No |
| Line Ending Normalization | Convert CRLF to LF | 1-5% | No | No |
| Duplicate Row Removal | Remove exact duplicate rows (optional) | 0-50% | No | Yes (rows deleted) |
| UTF-8 Encoding | Standard output encoding | Baseline | N/A | No |
| Max Cell Length (typical) | 32767 characters (Excel limit) | N/A | N/A | Validation check |

Frequently Asked Questions

Does dictionary encoding affect TSV compatibility?

Yes. When dictionary encoding is enabled, a header comment line is prepended (starting with #DICT:) that maps tokens back to original values. Standard TSV parsers that skip comment lines will see the tokens as literal values. Disable dictionary encoding if you need strict TSV compatibility. All other compression passes produce standard, fully compatible TSV output.

How are numeric values normalized?

The normalizer detects standard decimal formats: it strips trailing zeros after a decimal point (e.g., 7.100 becomes 7.1, and 3.0 becomes 3). It preserves scientific notation (1.20e5 becomes 1.2e5) and handles negative signs correctly. Integer values without a decimal point are never modified. Values like 007 (leading zeros) are left unchanged to avoid breaking codes or identifiers.

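
These rules (trailing-zero stripping, scientific notation, signs, untouched integers and leading-zero codes) can be captured in a single regular expression. A sketch of the described behavior, not the tool's actual code:

```python
import re

# Decimal with a fractional part, optional sign and exponent.
_DEC = re.compile(r'^(-?\d+)\.(\d*?)0*((?:[eE][+-]?\d+)?)$')

def normalize_number(s):
    """Strip trailing zeros after the decimal point; keep sign and
    exponent; leave integers and non-numbers (e.g. '007') unchanged."""
    m = _DEC.match(s)
    if not m:
        return s
    intpart, frac, exp = m.groups()
    return f"{intpart}.{frac}{exp}" if frac else f"{intpart}{exp}"
```

The lazy quantifier on the fractional part lets the trailing `0*` absorb only the zeros that can be dropped, so 7.100 keeps its significant 1 while 3.0 collapses to a bare integer.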
What happens to rows with inconsistent column counts?

The tool detects the modal (most frequent) column count across all rows. Rows with fewer columns are padded with empty trailing cells. Rows with excess trailing empty cells are trimmed. A warning is shown with the count of affected rows. This structural normalization runs before other compression passes to ensure a clean rectangular dataset.

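
A sketch of modal-count normalization as described; the names are illustrative:

```python
from collections import Counter

def rectangularize(rows):
    """Pad short rows with empty cells and trim excess trailing empty
    cells so every row matches the modal column count.
    Returns (fixed_rows, affected_row_count)."""
    mode = Counter(len(r) for r in rows).most_common(1)[0][0]
    fixed, affected = [], 0
    for r in rows:
        out = list(r)
        while len(out) > mode and out[-1] == '':   # trim trailing empties
            out.pop()
        if len(out) < mode:                        # pad short rows
            out += [''] * (mode - len(out))
        if out != list(r):
            affected += 1
        fixed.append(out)
    return fixed, affected
```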
Are multi-line cell values supported?

Multi-line cell values are supported only if the cell is properly quoted per RFC 4180 conventions adapted for TSV. A cell value enclosed in double quotes may contain literal newline characters. The parser respects quoted fields during row splitting. If your TSV uses unquoted multi-line cells, the parser will treat each line as a separate row, which may produce incorrect results.

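
A quote-aware row splitter along these lines can be sketched as follows. This is a simplification: RFC 4180 escaped quotes (`""`) inside quoted fields and a trailing final newline are not specially handled here.

```python
def split_rows(text):
    """Split TSV text into rows of cells, treating newlines inside
    double-quoted cells as literal characters rather than row breaks."""
    rows, row, cell = [], [], []
    in_quotes = False
    for ch in text:
        if ch == '"':
            in_quotes = not in_quotes
            cell.append(ch)
        elif ch == '\t' and not in_quotes:
            row.append(''.join(cell)); cell = []
        elif ch == '\n' and not in_quotes:
            row.append(''.join(cell)); cell = []
            rows.append(row); row = []
        else:
            cell.append(ch)
    row.append(''.join(cell))
    rows.append(row)
    return rows
```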
How large a file can the tool handle?

The tool processes files entirely in browser memory. Practical limits depend on the device: approximately 100 MB on desktop browsers with 4 GB+ RAM, and 20-30 MB on mobile devices. Files above 5 MB trigger processing in a Web Worker to prevent UI freezing. For files exceeding available memory, the browser tab may crash. The tool displays an estimated processing time before compression begins for files over 1 MB.

How are duplicate rows detected?

Duplicate row removal uses exact string matching of the entire row after whitespace trimming. Two rows are duplicates only if every cell value is identical in the same column positions. Column order matters: the same values in different columns are not considered duplicates. The first occurrence is always kept; subsequent duplicates are removed. The result count shows how many rows were eliminated.
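
The matching rule can be sketched as follows; the trim-then-compare order mirrors the description above, and the function name is illustrative:

```python
def dedupe_rows(rows):
    """Remove exact duplicate rows (compared after per-cell trimming),
    keeping the first occurrence. Returns (kept_rows, removed_count)."""
    seen, kept = set(), []
    for r in rows:
        key = tuple(c.strip() for c in r)   # position-sensitive key
        if key not in seen:
            seen.add(key)
            kept.append(r)
    return kept, len(rows) - len(kept)
```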