About

Exposing personally identifiable information (PII) in shared datasets can lead to regulatory violations under GDPR, HIPAA, and CCPA. Under GDPR, fines can reach €20 million or 4% of annual global turnover, whichever is higher. This tool parses RFC 4180-compliant CSV data, auto-detects columns likely to contain PII (emails, phone numbers, SSNs, credit card numbers, IP addresses), and applies configurable censoring: full redaction, partial masking that preserves k leading and trailing characters, SHA-256 cryptographic hashing via the native Web Crypto API, type-aware randomization, or complete column removal. All processing executes client-side. No data leaves your browser.

The parser handles quoted fields, escaped double quotes, and embedded newlines, and it auto-detects the delimiter (comma, semicolon, tab, or pipe). For datasets exceeding 5,000 rows, processing is offloaded to a Web Worker to avoid blocking the UI thread. Limitations: hashing is one-way, so original values cannot be recovered from the censored output. Partial masking with k < 3 on short strings may still expose the original value. Randomized replacements approximate the original format but do not guarantee statistical distribution fidelity. Pro tip: always verify the output preview before distributing censored files.
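As a rough illustration of the delimiter auto-detection described above, here is a minimal sketch in Python (the tool itself runs in browser JavaScript, so this is not its actual code). The names `detect_delimiter` and `parse_csv` are hypothetical, and the scoring rule (share of sampled rows that agree on a field count, with ties broken by the larger field count) is an assumption about how "consistency" might be measured. Quoted fields, doubled quotes, and embedded newlines are handled by Python's standard `csv` module.

```python
import csv
import io
from collections import Counter

CANDIDATES = [",", ";", "\t", "|"]  # the four delimiters named in the text


def detect_delimiter(text: str, sample_rows: int = 10) -> str:
    """Return the candidate that splits the first rows most consistently."""
    best, best_score = ",", (0.0, 0)  # fall back to comma if nothing qualifies
    for delim in CANDIDATES:
        reader = csv.reader(io.StringIO(text, newline=""), delimiter=delim)
        widths = []
        for row in reader:
            if not row:
                continue  # ignore blank lines
            widths.append(len(row))
            if len(widths) >= sample_rows:
                break
        if not widths:
            continue
        width, hits = Counter(widths).most_common(1)[0]
        if width < 2:
            continue  # this candidate never actually splits a row
        score = (hits / len(widths), width)  # assumed scoring rule
        if score > best_score:
            best, best_score = delim, score
    return best


def parse_csv(text: str) -> list[list[str]]:
    """Parse the whole text using the detected delimiter."""
    delim = detect_delimiter(text)
    return list(csv.reader(io.StringIO(text, newline=""), delimiter=delim))


sample = 'name;note;zip\n"Smith; Jane";"said ""hi""\nthen left";90210\n'
rows = parse_csv(sample)
print(rows[1])
```

In the sample, the semicolon inside `"Smith; Jane"` and the newline inside the second field stay part of their fields because both are quoted.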


Formulas

Partial masking preserves k characters from both ends of the original string s of length n, replacing the interior with asterisks:

$$\operatorname{mask}(s, k) = s[0..k] + \text{"***"} + s[n-k..n]$$

where k = number of preserved characters per side (default 2), and n = len(s). If 2k ≥ n, the entire string is replaced with "***" to prevent information leakage.
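The masking rule can be sketched in a few lines of Python. This is an illustrative version written from the formula above, not the tool's own code; the function name `mask` mirrors the formula, and rejecting negative k is an added assumption.

```python
def mask(s: str, k: int = 2) -> str:
    """Keep k characters on each side and replace the interior with '***'."""
    if k < 0:
        raise ValueError("k must be non-negative")  # assumption: not stated in the text
    n = len(s)
    if 2 * k >= n:
        return "***"  # safety rule: the preserved parts would cover the whole string
    return s[:k] + "***" + s[n - k:]


print(mask("123-45-6789"))  # 12***89
print(mask("90210"))        # 90***10 (n = 5, so the safety rule does not apply)
print(mask("abcd"))         # *** (2k = 4 >= n = 4)
```

Note that the replacement is always three asterisks, so the masked output does not reveal the original length.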

SHA-256 hashing uses the native Web Crypto API to produce a deterministic 256-bit digest:

$$\operatorname{hash}(s) = \text{SHA-256}(\text{UTF-8}(s)) \rightarrow \text{hex string (64 chars)}$$

where s = the original cell value. Identical inputs always produce identical hashes, enabling join operations on censored data without exposing raw values. The digest is irreversible: given hash(s), recovering s requires brute-force search of the input space.
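For readers who want to reproduce the digests outside the browser, the same computation can be sketched with Python's standard `hashlib` (the tool itself uses the Web Crypto API; `hash_cell` is a hypothetical name). The two small tables are made-up data that show how identical inputs still join after hashing.

```python
import hashlib


def hash_cell(value: str) -> str:
    """SHA-256 over the UTF-8 bytes of the cell, as a 64-character hex string."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()


# Hypothetical tables that share a customer identifier.
orders = [("C-1001", "book"), ("C-1002", "lamp")]
tickets = [("C-1002", "refund request")]

hashed_orders = {hash_cell(cid): item for cid, item in orders}
hashed_tickets = [(hash_cell(cid), note) for cid, note in tickets]

# The join works on the digests alone; raw identifiers are never compared.
joined = [(hashed_orders[h], note) for h, note in hashed_tickets if h in hashed_orders]
print(joined)  # [('lamp', 'refund request')]
```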

PII column detection scores each column by computing the ratio r of cells matching a known PII regex pattern:

$$r = \frac{m}{N}$$

where m = number of matching cells and N = total non-empty cells. A column is flagged as PII when r ≥ 0.6 (60% match threshold).
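The scoring step might look roughly like the following Python sketch. The three regular expressions are simplified stand-ins chosen for illustration, not the patterns the tool actually uses, and `detect_pii` is a hypothetical name.

```python
import re

# Simplified, illustrative patterns (assumptions, not the tool's real regexes).
PII_PATTERNS = {
    "email": re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+"),
    "ssn": re.compile(r"\d{3}-\d{2}-\d{4}"),
    "ipv4": re.compile(r"(?:\d{1,3}\.){3}\d{1,3}"),
}


def detect_pii(cells, threshold=0.6):
    """Return (label, r) for the best-matching pattern if r >= threshold, else None."""
    non_empty = [c.strip() for c in cells if c and c.strip()]
    if not non_empty:
        return None  # N = 0: nothing to score
    best_label, best_ratio = None, 0.0
    for label, pattern in PII_PATTERNS.items():
        m = sum(1 for c in non_empty if pattern.fullmatch(c))
        r = m / len(non_empty)  # r = m / N
        if r > best_ratio:
            best_label, best_ratio = label, r
    if best_label is not None and best_ratio >= threshold:
        return best_label, best_ratio
    return None


column = ["123-45-6789", "987-65-4321", "n/a", "", "111-22-3333"]
print(detect_pii(column))  # ('ssn', 0.75): 3 matches out of 4 non-empty cells
```

Empty cells are excluded from N, so a sparsely filled column is judged only on the values it actually contains.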

Reference Data

| PII Type | Detection Pattern | Example Input | Redact Output | Mask Output (k=2) | Hash Output (truncated) | Risk Level |
|---|---|---|---|---|---|---|
| Email Address | user@domain.tld | [email protected] | [REDACTED] | jo***om | a8f5f167… | High |
| Phone Number | Digits with dashes/spaces/parens | (555) 123-4567 | [REDACTED] | (5***67 | b3c8e2d1… | High |
| SSN (US) | XXX-XX-XXXX | 123-45-6789 | [REDACTED] | 12***89 | ef2d127d… | Critical |
| Credit Card | 13-19 digit sequences | 4111-1111-1111-1111 | [REDACTED] | 41***11 | 9f86d081… | Critical |
| IP Address (v4) | A.B.C.D | 192.168.1.42 | [REDACTED] | 19***42 | c0a80122… | Medium |
| IP Address (v6) | Hex groups separated by colons | 2001:0db8::1 | [REDACTED] | 20***:1 | d4735e3a… | Medium |
| Date of Birth | Common date formats | 1990-03-15 | [REDACTED] | 19***15 | 4e074085… | High |
| Full Name | Two+ capitalized words | Jane Smith | [REDACTED] | Ja***th | 6b86b273… | Medium |
| Street Address | Number + street word patterns | 123 Main St | [REDACTED] | 12***St | 3c9909af… | Medium |
| ZIP/Postal Code | 5 or 5+4 digit pattern | 90210 | [REDACTED] | 90***10 | 5c2dd944… | Low |
| Passport Number | Alphanumeric, 6-9 chars | AB1234567 | [REDACTED] | AB***67 | 7c222fb2… | Critical |
| Driver License | State-specific patterns | D123-4567-8901 | [REDACTED] | D1***01 | 2c624232… | Critical |
| Generic Number | Numeric sequences | 42.195 | [REDACTED] | 42***95 | 19581e27… | Low |
| Generic String | Unmatched text | Engineer | [REDACTED] | En***er | e3b0c442… | Low |

Note: the hash prefixes shown are illustrative placeholders, not the actual SHA-256 digests of the example inputs.

Frequently Asked Questions

No, your data is never uploaded. All parsing, detection, and censoring run entirely client-side in your browser's JavaScript engine. The Web Crypto API performs SHA-256 hashing locally. No HTTP requests are made, and your data never touches a server. You can verify this by disconnecting from the internet before processing.
SHA-256 is deterministic by design: identical plaintext always yields the identical 64-character hex digest. This enables relational joins on censored datasets (e.g., matching customer IDs across tables without exposing them). However, for low-entropy inputs like 5-digit ZIP codes (only 100,000 possibilities), an attacker can build a rainbow table and reverse the hash. For such columns, prefer partial masking or randomization over hashing.
The tool tests every non-empty cell in each column against regex patterns for emails, phone numbers, SSNs (XXX-XX-XXXX), credit card numbers (13-19 digits passing a Luhn-like length check), IPv4/IPv6 addresses, and common date formats. If 60% or more of a column's non-empty cells match a PII pattern, the column is flagged with a warning badge. Detection is heuristic; always verify the suggestions manually, especially for columns with mixed content types.
The parser is RFC 4180-compliant. Fields wrapped in double quotes are treated as atomic units regardless of internal commas, newlines, or other delimiters. Escaped double quotes (two consecutive double-quote characters within a quoted field) are correctly unescaped. The auto-detection tests for comma, semicolon, tab, and pipe delimiters by scoring consistency across the first 10 rows.
Yes, different censoring methods can be applied to different columns of the same file. Each column has independent censoring controls. You can select "Mask" for a name column (preserving initials), "Hash" for an email column (enabling joins), "Redact" for SSN columns (full removal), and leave non-sensitive columns like "Country" untouched, all in a single pass.
For datasets exceeding 5,000 rows, processing is offloaded to an inline Web Worker running on a separate thread. The UI remains responsive and displays a progress bar. The preview table renders only the first 50 rows regardless of file size. The full censored output is generated in memory and made available as a downloadable Blob URL.
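Yes, partial masking can leak short values. If k equals 2 and the original string is 5 characters (e.g., the ZIP code "90210"), the mask output "90***10" reveals 4 of 5 characters, effectively exposing the value. The tool's safety rule replaces the entire value with "***" only when 2k ≥ n (n = string length), which for k = 2 means strings of 4 characters or fewer, so a 5-character ZIP code is not caught by it. For short, sensitive fields, use full redaction instead of masking, and reserve hashing for higher-entropy values, since short values such as ZIP codes can be enumerated.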
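The low-entropy caveat in the hashing answer above can be demonstrated directly. The Python sketch below builds a plain precomputed lookup table (simpler than a true rainbow table, but sufficient here) over all 100,000 five-digit ZIP codes and uses it to reverse an unsalted SHA-256 digest. The function names are hypothetical.

```python
import hashlib


def sha256_hex(value: str) -> str:
    """Unsalted SHA-256 of the UTF-8 bytes, as hex."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()


def build_zip_lookup() -> dict[str, str]:
    """Map digest -> ZIP code for every 5-digit value from 00000 to 99999."""
    return {sha256_hex(f"{i:05d}"): f"{i:05d}" for i in range(100_000)}


lookup = build_zip_lookup()

# An attacker who sees only the digest can recover the ZIP code by lookup.
censored = sha256_hex("90210")
recovered = lookup[censored]
print(recovered)  # 90210
```

Building the table takes a fraction of a second on ordinary hardware, which is why hashing offers little protection for columns with a small input space.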