About

Comparing two datasets manually is where errors compound. A single missed entry in a reconciliation between, say, an inventory export and a purchase order can cascade into fulfillment failures or financial discrepancies. This tool performs formal set operations on two plain-text lists: union (A ∪ B), intersection (A ∩ B), difference (A − B), and symmetric difference (A Δ B). It also detects intra-list duplicates, which spreadsheet VLOOKUP workflows routinely miss. Results are exportable and print-ready.

The comparison engine trims leading and trailing whitespace from each item and supports configurable delimiters (newline, comma, semicolon, tab). Case sensitivity can be toggled. Note: this tool performs exact string matching after normalization. It does not perform fuzzy or phonetic matching. Two entries that differ by a single whitespace character inside the string are treated as distinct items, because trimming only removes whitespace at the start and end of an item.
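The normalization described above can be sketched as follows. This is a minimal illustration, not the tool's actual code: the function name `parse_items` and its parameters are hypothetical, and it assumes the only normalization steps are splitting on the chosen delimiter, trimming, dropping empty items, and optional lowercasing.

```python
def parse_items(text: str, delimiter: str = "\n", case_sensitive: bool = True) -> list[str]:
    """Split text into items, trim each one, and drop empty entries.

    Hypothetical helper: name and parameters are assumptions for illustration.
    """
    items = []
    for raw in text.split(delimiter):
        item = raw.strip()          # removes leading/trailing whitespace only
        if not item:                # skip empty or whitespace-only entries
            continue
        items.append(item if case_sensitive else item.lower())
    return items


# Internal whitespace is preserved, so these two entries stay distinct.
print(parse_items("New York\nNew  York"))
# Case-insensitive mode lowercases every item before comparison.
print(parse_items(" Apple , apple,, APPLE ", delimiter=",", case_sensitive=False))
```

Because only the edges of each item are trimmed, "New York" and "New  York" (two spaces) remain different strings after parsing.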


Formulas

The core comparison relies on set-theoretic operations applied to the parsed item collections.

A ∩ B = { x : x ∈ A ∧ x ∈ B }
A − B = { x : x ∈ A ∧ x ∉ B }
A ∪ B = { x : x ∈ A ∨ x ∈ B }
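These definitions map directly onto Python's built-in `set` operators. The sketch below assumes both lists have already been parsed into items; the function name `compare` and the dictionary keys are illustrative choices, not part of the tool.

```python
def compare(list_a, list_b):
    """Return the basic set operations for two lists of items.

    Hypothetical helper: the name and result keys are assumptions.
    """
    set_a, set_b = set(list_a), set(list_b)   # duplicates collapse here
    return {
        "intersection": set_a & set_b,          # x in A and x in B
        "only_a": set_a - set_b,                # x in A and x not in B
        "only_b": set_b - set_a,                # x in B and x not in A
        "union": set_a | set_b,                 # x in A or x in B
        "symmetric_difference": set_a ^ set_b,  # in exactly one of A, B
    }


result = compare(["1", "2", "3"], ["2", "3", "4"])
print(sorted(result["intersection"]))          # ['2', '3']
print(sorted(result["symmetric_difference"]))  # ['1', '4']
```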

The Jaccard similarity index quantifies how similar the two lists are:

J(A, B) = |A ∩ B| / |A ∪ B|

Where A and B are the sets of unique items from each list, |A| is the cardinality (count of unique items) of set A, and J ranges from 0 (no overlap) to 1 (identical sets).
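As a quick check of the definition, here is a small sketch of the Jaccard index. The function name `jaccard` is hypothetical, and returning 1.0 for two empty lists is an assumption: the formula is undefined when the union is empty, and the source does not say how the tool handles that case.

```python
def jaccard(list_a, list_b) -> float:
    """Jaccard similarity of the unique items in two lists."""
    set_a, set_b = set(list_a), set(list_b)
    union = set_a | set_b
    if not union:
        # Assumption: two empty lists are treated as identical.
        return 1.0
    return len(set_a & set_b) / len(union)


print(jaccard([1, 2, 3], [2, 3, 4]))  # 0.5
```

With A = {1, 2, 3} and B = {2, 3, 4}, the intersection has 2 items and the union has 4, so J = 2 / 4 = 0.5.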

Duplicate detection uses a frequency map: for each item x in the list, increment a counter. Any x with count > 1 is flagged as a duplicate.
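A frequency map of this kind can be sketched with `collections.Counter` from the Python standard library. The function name `find_duplicates` is hypothetical, and the sketch assumes the items have already been trimmed and case-normalized.

```python
from collections import Counter


def find_duplicates(items):
    """Map each repeated item to its total number of occurrences."""
    counts = Counter(items)  # item -> number of occurrences
    return {item: n for item, n in counts.items() if n > 1}


dupes = find_duplicates(["cat", "dog", "cat", "cat", "bird"])
print(dupes)       # {'cat': 3}
print(len(dupes))  # 1 distinct item repeats
```

The length of the returned dictionary is the number of distinct repeating items, while the values give the total occurrences of each.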

Reference Data

Examples use A = {1, 2, 3} and B = {2, 3, 4} unless noted.

Operation             Symbol   Description                                    Example            Result
Intersection          A ∩ B    Items present in both lists                    Common items       {2, 3}
Union                 A ∪ B    All unique items from both lists combined      All items merged   {1, 2, 3, 4}
Difference (A \ B)    A − B    Items only in List A                           Only in A          {1}
Difference (B \ A)    B − A    Items only in List B                           Only in B          {4}
Symmetric Difference  A Δ B    Items in either list but not both              Exclusive items    {1, 4}
Duplicates in A       -        Items appearing more than once within List A   If A = {1,2,2,3}   {2}
Duplicates in B       -        Items appearing more than once within List B   If B = {2,3,3,4}   {3}
Cardinality of A      |A|      Total number of items in List A                Count              3
Cardinality of B      |B|      Total number of items in List B                Count              3
Jaccard Index         J(A,B)   Similarity coefficient: intersection / union   2 ÷ 4              0.50
Overlap Coefficient   O(A,B)   Intersection / min(|A|, |B|)                   2 ÷ 3              0.67
Sørensen-Dice         D(A,B)   2 × intersection / (|A| + |B|)                 4 ÷ 6              0.67
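The two remaining similarity measures in the reference data can be reproduced with a short sketch. The function names are hypothetical, and returning 0.0 when a set is empty is an assumption, since both formulas divide by zero in that case and the source does not specify the behavior.

```python
def overlap_coefficient(list_a, list_b) -> float:
    """Intersection size divided by the size of the smaller set."""
    set_a, set_b = set(list_a), set(list_b)
    smaller = min(len(set_a), len(set_b))
    if smaller == 0:
        return 0.0  # assumption: undefined case reported as 0.0
    return len(set_a & set_b) / smaller


def sorensen_dice(list_a, list_b) -> float:
    """Twice the intersection size divided by the sum of both set sizes."""
    set_a, set_b = set(list_a), set(list_b)
    total = len(set_a) + len(set_b)
    if total == 0:
        return 0.0  # assumption: undefined case reported as 0.0
    return 2 * len(set_a & set_b) / total


a, b = [1, 2, 3], [2, 3, 4]
print(round(overlap_coefficient(a, b), 2))  # 0.67
print(round(sorensen_dice(a, b), 2))        # 0.67
```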

Frequently Asked Questions

When case-sensitive mode is enabled (default), "Apple" and "apple" are treated as two distinct items. When disabled, both inputs are normalized to lowercase before comparison, so "Apple", "APPLE", and "apple" all resolve to the same entry. Choose case-insensitive mode for email lists, domain names, or any dataset where casing is inconsistent.
The tool trims all leading and trailing whitespace from every item by default. This means " hello " and "hello" are treated identically. Empty lines (lines containing only whitespace) are automatically removed and excluded from all counts and operations.
Large lists are supported. The comparison uses hash-based Set operations with O(n) average time complexity, where n is the total number of items across both lists. Lists of up to 100,000 items process in under 200 ms on modern hardware. For extremely large datasets (500k+ lines), you may notice a brief pause; a loading indicator is displayed during processing.
The tool builds a frequency map for each list independently. Any item appearing more than once within the same list is flagged. The duplicate count shown is the number of distinct items that repeat, not the total extra occurrences. For example, if "cat" appears 3 times in List A, it is counted as 1 duplicate item with 3 total occurrences.
The Jaccard Index (J) is the ratio of the intersection size to the union size of two sets. A value of 0.0 means zero overlap; 1.0 means the sets are identical. It is widely used in data deduplication, plagiarism detection, document similarity scoring, and recommendation system evaluation. Values above 0.75 typically indicate high similarity.
If you copy a column from Excel or Google Sheets, the data is newline-delimited by default, so use the "New Line" delimiter. If you copy a row, values are tab-separated, so use the "Tab" option. For CSV exports, use "Comma". The "Semicolon" option handles European-locale CSVs where semicolons replace commas as field separators.
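To make the delimiter choice concrete, the sketch below maps the four option labels to separator characters and splits pasted text accordingly. The `DELIMITERS` mapping and the `split_pasted` function are hypothetical names for illustration. The sketch uses a plain string split, so it assumes fields do not contain the separator inside quotes.

```python
# Hypothetical mapping from the delimiter option labels to separator characters.
DELIMITERS = {
    "New Line": "\n",
    "Tab": "\t",
    "Comma": ",",
    "Semicolon": ";",
}


def split_pasted(text: str, option: str) -> list[str]:
    """Split pasted text on the chosen delimiter, trimming and dropping empties."""
    separator = DELIMITERS[option]
    return [part.strip() for part in text.split(separator) if part.strip()]


print(split_pasted("a\tb\tc", "Tab"))             # a copied spreadsheet row
print(split_pasted("1\n2\n\n3\n", "New Line"))    # a copied spreadsheet column
print(split_pasted("x;y; z", "Semicolon"))        # a European-locale CSV line
```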