About

Tab-separated value files accumulate dead weight fast. Trailing spaces in cells, empty rows from careless exports, quoted values that need no quoting, and decimal numbers padded with zeros all inflate file size without adding information. A 50 MB database export can shed 15-40% of its byte count through deterministic cleaning alone. This tool applies five compression passes: cell-level whitespace trimming, empty row and trailing empty column removal, numeric precision normalization (e.g., 3.10000 → 3.1), redundant quote stripping, and optional dictionary encoding for repeated string values. The output remains valid TSV: no proprietary format, no decompression step required.
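The structural passes are simple enough to sketch. Below is a minimal Python illustration of three of them (per-cell trimming, empty-row removal, and trailing empty-column removal). The tool itself runs in the browser, so this is an approximation of the behavior described, not its actual implementation; the function name is illustrative.

```python
def clean_rows(text):
    """Sketch of three structural passes: per-cell whitespace trim,
    empty-row removal, and trailing empty-column removal."""
    rows = []
    for line in text.split('\n'):
        cells = [c.strip() for c in line.split('\t')]
        while cells and cells[-1] == '':   # drop trailing empty columns
            cells.pop()
        if cells:                          # skip rows where every cell is empty
            rows.append('\t'.join(cells))
    return '\n'.join(rows)
```

Note this simplification trims trailing empty cells per row; the tool's column-count validation (described below) additionally checks that rows stay consistent with each other.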

Neglecting TSV hygiene before import causes silent failures. Spreadsheet applications miscount columns when trailing tabs exist, and database COPY commands reject rows with inconsistent column counts. This tool validates column-count consistency across all rows and flags violations before you compress, so the output is both smaller and structurally sound. Approximation note: dictionary encoding replaces only values appearing ≥3 times with length ≥4 characters; short or unique values pass through unchanged.


Formulas

The compression ratio is calculated as the percentage reduction in byte size from original to compressed output:

R = (S_original − S_compressed) / S_original × 100%

Where R is the compression ratio in percent, S_original is the original byte size (UTF-8 encoded), and S_compressed is the output byte size.
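Computed on UTF-8 byte counts, the ratio is a one-liner. A sketch (the function name is illustrative, and it assumes a non-empty input):

```python
def compression_ratio(original: bytes, compressed: bytes) -> float:
    """R = (S_original - S_compressed) / S_original * 100,
    with sizes measured in UTF-8 bytes. Assumes original is non-empty."""
    s_orig, s_comp = len(original), len(compressed)
    return (s_orig - s_comp) / s_orig * 100.0
```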

Dictionary encoding eligibility for a cell value v:

encode(v) = TRUE,  if len(v) ≥ 4 and freq(v) ≥ 3
encode(v) = FALSE, otherwise

Where len(v) is the character count of the value, and freq(v) is the number of times the value appears across all cells. The savings from replacing value v with token t is:

ΔS = freq(v) × (len(v) − len(t)) − (len(v) + len(t) + 2)

The subtracted term accounts for the dictionary header entry cost: the original value, the token, and two delimiter characters.
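Putting the eligibility test and the savings formula together gives a simple planning step. A sketch; the fixed two-character token length and the `dictionary_plan` name are assumptions for illustration, not the tool's actual parameters:

```python
from collections import Counter

def dictionary_plan(cells, min_len=4, min_freq=3, token_len=2):
    """For each eligible value v (len(v) >= 4, freq(v) >= 3), compute
    dS = freq(v) * (len(v) - len(t)) - (len(v) + len(t) + 2)
    and keep only replacements that save bytes."""
    freq = Counter(cells)
    plan = {}
    for v, f in freq.items():
        if len(v) >= min_len and f >= min_freq:
            savings = f * (len(v) - token_len) - (len(v) + token_len + 2)
            if savings > 0:
                plan[v] = savings
    return plan
```

Note that a value can be eligible yet yield zero or negative savings (e.g., a 4-character value appearing exactly 3 times), which is why the plan keeps only strictly positive ΔS.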

Reference Data

| Compression Pass | Technique | Typical Savings | Reversible | Affects Data |
|---|---|---|---|---|
| Whitespace Trim | Strip leading/trailing spaces per cell | 5-15% | No (lossy on whitespace) | No (content preserved) |
| Empty Row Removal | Delete rows where all cells are empty | 1-10% | No | No |
| Trailing Empty Columns | Remove trailing tabs producing empty columns | 2-8% | No | No |
| Numeric Normalization | Strip trailing zeros: 4.200 → 4.2 | 3-12% | No (precision reduced) | Numeric precision only |
| Quote Stripping | Remove quotes from cells without special chars | 1-5% | No | No |
| Dictionary Encoding | Replace repeated strings with short tokens | 5-25% | Yes (header included) | Format changes (header line added) |
| Consistent Column Count | Pad short rows / trim long rows to mode | 0% (structural fix) | No | Structural only |
| BOM Removal | Strip UTF-8 BOM (0xEF 0xBB 0xBF) | 3 bytes | No | No |
| Line Ending Normalization | Convert CRLF to LF | 1-5% | No | No |
| Duplicate Row Removal | Remove exact duplicate rows (optional) | 0-50% | No | Yes (rows deleted) |
| UTF-8 Encoding | Standard output encoding | Baseline | N/A | No |
| Max Cell Length (typical) | 32767 characters (Excel limit) | N/A | N/A | Validation check |

Frequently Asked Questions

Does dictionary encoding affect TSV compatibility?

Yes. When dictionary encoding is enabled, a header comment line is prepended (starting with #DICT:) that maps tokens back to original values. Standard TSV parsers that skip comment lines will see the tokens as literal values. Disable dictionary encoding if you need strict TSV compatibility. All other compression passes produce standard, fully compatible TSV output.

How are numeric values normalized?

The normalizer detects standard decimal formats: it strips trailing zeros after a decimal point (e.g., 7.100 becomes 7.1, and 3.0 becomes 3). It preserves scientific notation (1.20e5 becomes 1.2e5) and handles negative signs correctly. Integer values without a decimal point are never modified. Values like 007 (leading zeros) are left unchanged to avoid breaking codes or identifiers.

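
These rules (trailing-zero stripping, scientific notation, signs, untouched integers and leading-zero codes) can be captured in a single regular expression. A sketch of the described behavior, not the tool's actual code:

```python
import re

# Decimal with a fractional part, optional sign and exponent.
_DEC = re.compile(r'^(-?\d+)\.(\d*?)0*((?:[eE][+-]?\d+)?)$')

def normalize_number(s):
    """Strip trailing zeros after the decimal point; keep sign and
    exponent; leave integers and non-numbers (e.g. '007') unchanged."""
    m = _DEC.match(s)
    if not m:
        return s
    intpart, frac, exp = m.groups()
    return f"{intpart}.{frac}{exp}" if frac else f"{intpart}{exp}"
```

The lazy quantifier on the fractional part lets the trailing `0*` absorb only the zeros that can be dropped, so 7.100 keeps its significant 1 while 3.0 collapses to a bare integer.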
What happens to rows with inconsistent column counts?

The tool detects the modal (most frequent) column count across all rows. Rows with fewer columns are padded with empty trailing cells. Rows with excess trailing empty cells are trimmed. A warning is shown with the count of affected rows. This structural normalization runs before other compression passes to ensure a clean rectangular dataset.

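
A sketch of modal-count normalization as described; the names are illustrative:

```python
from collections import Counter

def rectangularize(rows):
    """Pad short rows with empty cells and trim excess trailing empty
    cells so every row matches the modal column count.
    Returns (fixed_rows, affected_row_count)."""
    mode = Counter(len(r) for r in rows).most_common(1)[0][0]
    fixed, affected = [], 0
    for r in rows:
        out = list(r)
        while len(out) > mode and out[-1] == '':   # trim trailing empties
            out.pop()
        if len(out) < mode:                        # pad short rows
            out += [''] * (mode - len(out))
        if out != list(r):
            affected += 1
        fixed.append(out)
    return fixed, affected
```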
Are multi-line cell values supported?

Multi-line cell values are supported only if the cell is properly quoted per RFC 4180 conventions adapted for TSV. A cell value enclosed in double quotes may contain literal newline characters. The parser respects quoted fields during row splitting. If your TSV uses unquoted multi-line cells, the parser will treat each line as a separate row, which may produce incorrect results.

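
A quote-aware row splitter along these lines can be sketched as follows. This is a simplification: RFC 4180 escaped quotes (`""`) inside quoted fields and a trailing final newline are not specially handled here.

```python
def split_rows(text):
    """Split TSV text into rows of cells, treating newlines inside
    double-quoted cells as literal characters rather than row breaks."""
    rows, row, cell = [], [], []
    in_quotes = False
    for ch in text:
        if ch == '"':
            in_quotes = not in_quotes
            cell.append(ch)
        elif ch == '\t' and not in_quotes:
            row.append(''.join(cell)); cell = []
        elif ch == '\n' and not in_quotes:
            row.append(''.join(cell)); cell = []
            rows.append(row); row = []
        else:
            cell.append(ch)
    row.append(''.join(cell))
    rows.append(row)
    return rows
```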
How large a file can the tool handle?

The tool processes files entirely in browser memory. Practical limits depend on the device: approximately 100 MB on desktop browsers with 4 GB+ RAM, and 20-30 MB on mobile devices. Files above 5 MB trigger processing in a Web Worker to prevent UI freezing. For files exceeding available memory, the browser tab may crash. The tool displays an estimated processing time before compression begins for files over 1 MB.

How are duplicate rows detected?

Duplicate row removal uses exact string matching of the entire row after whitespace trimming. Two rows are duplicates only if every cell value is identical in the same column positions. Column order matters: the same values in different columns are not considered duplicates. The first occurrence is always kept; subsequent duplicates are removed. The result count shows how many rows were eliminated.
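
The matching rule can be sketched as follows; the trim-then-compare order mirrors the description above, and the function name is illustrative:

```python
def dedupe_rows(rows):
    """Remove exact duplicate rows (compared after per-cell trimming),
    keeping the first occurrence. Returns (kept_rows, removed_count)."""
    seen, kept = set(), []
    for r in rows:
        key = tuple(c.strip() for c in r)   # position-sensitive key
        if key not in seen:
            seen.add(key)
            kept.append(r)
    return kept, len(rows) - len(kept)
```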