Censor CSV Data
Censor, redact, mask, or hash sensitive PII columns in CSV files. Supports auto-detection of emails, phones, SSNs, and more.
About
Exposing personally identifiable information (PII) in shared datasets leads to regulatory violations under GDPR, HIPAA, and CCPA. Fines reach €20 million or 4% of annual global turnover. This tool parses RFC 4180-compliant CSV data, auto-detects columns likely containing PII (emails, phone numbers, SSNs, credit card numbers, IP addresses), and applies configurable censoring: full redaction, partial masking preserving k leading/trailing characters, SHA-256 cryptographic hashing via the native Web Crypto API, type-aware randomization, or complete column removal. All processing executes client-side. No data leaves your browser.
The parser handles quoted fields, escaped double-quotes, embedded newlines, and auto-detects delimiters (comma, semicolon, tab, pipe). For datasets exceeding 5000 rows, processing offloads to a Web Worker to prevent UI thread blocking. Limitations: hashing is one-way and non-reversible. Partial masking with k < 3 on short strings may still expose the original value. Randomized replacements approximate format but do not guarantee statistical distribution fidelity. Pro tip: always verify the output preview before distributing censored files.
Formulas
Partial masking preserves k characters from both ends of the original string s of length n, replacing the interior with asterisks:
where k = number of preserved characters per side (default 2), and n = len(s). If 2k ≥ n, the entire string is replaced with "***" to prevent information leakage.
SHA-256 hashing uses the native Web Crypto API to produce a deterministic 256-bit digest:
where s = the original cell value. Identical inputs always produce identical hashes, enabling join operations on censored data without exposing raw values. The digest is irreversible: given hash(s), recovering s requires brute-force search of the input space.
PII column detection scores each column by computing the ratio r of cells matching a known PII regex pattern:
where m = number of matching cells and N = total non-empty cells. A column is flagged as PII when r ≥ 0.6 (60% match threshold).
Reference Data
| PII Type | Detection Pattern | Example Input | Redact Output | Mask Output (k=2) | Hash Output (truncated) | Risk Level |
|---|---|---|---|---|---|---|
| Email Address | user@domain.tld | [email protected] | [REDACTED] | jo***om | a8f5f167… | High |
| Phone Number | Digits with dashes/spaces/parens | (555) 123-4567 | [REDACTED] | (5***67 | b3c8e2d1… | High |
| SSN (US) | XXX-XX-XXXX | 123-45-6789 | [REDACTED] | 12***89 | ef2d127d… | Critical |
| Credit Card | 13 - 19 digit sequences | 4111-1111-1111-1111 | [REDACTED] | 41***11 | 9f86d081… | Critical |
| IP Address (v4) | A.B.C.D | 192.168.1.42 | [REDACTED] | 19***42 | c0a80122… | Medium |
| IP Address (v6) | Hex groups separated by colons | 2001:0db8::1 | [REDACTED] | 20***:1 | d4735e3a… | Medium |
| Date of Birth | Common date formats | 1990-03-15 | [REDACTED] | 19***15 | 4e074085… | High |
| Full Name | Two+ capitalized words | Jane Smith | [REDACTED] | Ja***th | 6b86b273… | Medium |
| Street Address | Number + street word patterns | 123 Main St | [REDACTED] | 12***St | 3c9909af… | Medium |
| ZIP/Postal Code | 5 or 5+4 digit pattern | 90210 | [REDACTED] | 90***10 | 5c2dd944… | Low |
| Passport Number | Alpha-numeric 6 - 9 chars | AB1234567 | [REDACTED] | AB***67 | 7c222fb2… | Critical |
| Driver License | State-specific patterns | D123-4567-8901 | [REDACTED] | D1***01 | 2c624232… | Critical |
| Generic Number | Numeric sequences | 42.195 | [REDACTED] | 42***95 | 19581e27… | Low |
| Generic String | Unmatched text | Engineer | [REDACTED] | En***er | e3b0c442… | Low |