About

Comparing two datasets manually is where errors compound. A single missed entry in a reconciliation between, say, an inventory export and a purchase order can cascade into fulfillment failures or financial discrepancies. This tool performs formal set operations - union (A ∪ B), intersection (A ∩ B), and symmetric difference (A Δ B) - on two plain-text lists. It also detects intra-list duplicates, which spreadsheet VLOOKUP workflows routinely miss. Results are exportable and print-ready.

The comparison engine trims leading and trailing whitespace from each item and supports configurable delimiters (newline, comma, semicolon, tab). Case sensitivity can be toggled. Note that the tool performs exact string matching after normalization; it does not perform fuzzy or phonetic matching. Trimming only affects the ends of an item, so two entries that differ by a single whitespace character inside the string (for example, "red car" and "red  car") are treated as distinct items.
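The normalization described above can be sketched in a few lines of Python. This is a minimal illustration, not the tool's actual code: the function name `parse_items` and its parameters are hypothetical, and it assumes empty entries are dropped after trimming.

```python
from __future__ import annotations


def parse_items(text: str, delimiter: str = "\n", case_sensitive: bool = True) -> list[str]:
    """Split raw text into trimmed, non-empty items (hypothetical helper)."""
    items = []
    for raw in text.split(delimiter):
        item = raw.strip()  # removes leading/trailing whitespace only
        if not item:
            continue  # assumption: entries that are empty after trimming are skipped
        if not case_sensitive:
            item = item.lower()  # fold case so "Apple" and "apple" match
        items.append(item)
    return items


print(parse_items(" Apple , apple,,Pear ", delimiter=",", case_sensitive=False))
# ['apple', 'apple', 'pear']

print(parse_items("red car\nred  car"))
# ['red car', 'red  car']  (inner whitespace is preserved, so these stay distinct)
```

The second call shows why exact matching treats "red car" and "red  car" as different items: trimming never touches whitespace inside the string.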

Formulas

The core comparison relies on set-theoretic operations applied to the parsed item collections.

A ∩ B = { x : x ∈ A ∧ x ∈ B }

A − B = { x : x ∈ A ∧ x ∉ B }

A ∪ B = { x : x ∈ A ∨ x ∈ B }
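As an illustration, these definitions map directly onto Python's built-in set operators. The snippet below is a sketch using the example sets A = {1, 2, 3} and B = {2, 3, 4}; the variable names are chosen for illustration, and the symmetric difference is included because it is simply the union of the two one-sided differences.

```python
a = {1, 2, 3}
b = {2, 3, 4}

intersection = a & b   # items in both A and B
union = a | b          # items in A or B
only_in_a = a - b      # A − B
only_in_b = b - a      # B − A
symmetric = a ^ b      # A Δ B = (A − B) ∪ (B − A)

print(sorted(intersection))  # [2, 3]
print(sorted(union))         # [1, 2, 3, 4]
print(sorted(only_in_a))     # [1]
print(sorted(only_in_b))     # [4]
print(sorted(symmetric))     # [1, 4]
```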

The Jaccard similarity index quantifies how similar the two lists are:

J(A, B) = |A ∩ B| / |A ∪ B|

Where A and B are the sets of unique items from each list, |A| is the cardinality (count of unique items) of set A, and J ranges from 0 (no overlap) to 1 (identical sets).
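A short Python sketch of this ratio follows. The function name `jaccard` is illustrative, and the handling of two empty sets (where the union is empty and the ratio is undefined) is an assumption; the tool may treat that case differently.

```python
def jaccard(a: set, b: set) -> float:
    """Return |A ∩ B| / |A ∪ B| for two sets of unique items."""
    union = a | b
    if not union:
        # Assumption: two empty sets are treated as identical.
        return 1.0
    return len(a & b) / len(union)


print(jaccard({1, 2, 3}, {2, 3, 4}))  # 0.5  (2 shared items out of 4 unique)
print(jaccard({1, 2}, {3, 4}))        # 0.0  (no overlap)
print(jaccard({1, 2}, {1, 2}))        # 1.0  (identical sets)
```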

Duplicate detection uses a frequency map: for each item x in the list, increment the counter for x. Any x with count > 1 is flagged as a duplicate.
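One way to sketch the frequency map in Python is with `collections.Counter`. The helper name `find_duplicates` is hypothetical, and returning the occurrence count alongside each flagged item is a choice made for this example.

```python
from collections import Counter


def find_duplicates(items: list) -> dict:
    """Map each item that occurs more than once to its occurrence count."""
    counts = Counter(items)  # frequency map: item -> number of occurrences
    return {item: n for item, n in counts.items() if n > 1}


dups = find_duplicates(["cat", "dog", "cat", "cat", "bird", "dog"])
print(dups)       # {'cat': 3, 'dog': 2}
print(len(dups))  # 2 distinct items repeat
```

Here `len(dups)` counts distinct repeating items, while the dictionary values give the total occurrences of each.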

Reference Data

Operation	Symbol	Description	Example (A = {1,2,3}, B = {2,3,4})	Result
Intersection	A ∩ B	Items present in both lists	Common items	{2, 3}
Union	A ∪ B	All unique items from both lists combined	All items merged	{1, 2, 3, 4}
Difference (A \ B)	A − B	Items only in List A	Only in A	{1}
Difference (B \ A)	B − A	Items only in List B	Only in B	{4}
Symmetric Difference	A Δ B	Items in either list but not both	Exclusive items	{1, 4}
Duplicates in A	-	Items appearing more than once within List A	If A = {1,2,2,3}	{2}
Duplicates in B	-	Items appearing more than once within List B	If B = {2,3,3,4}	{3}
Cardinality of A	|A|	Number of unique items in List A	Count	3
Cardinality of B	|B|	Number of unique items in List B	Count	3
Jaccard Index	J(A,B)	Similarity coefficient: intersection / union	2 ÷ 4	0.50
Overlap Coefficient	O(A,B)	Intersection / min(|A|, |B|)	2 ÷ 3	0.67
Sørensen-Dice	D(A,B)	2 × intersection / (|A| + |B|)	4 ÷ 6	0.67
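The last two rows of the table can be checked with a small Python sketch. The function names are illustrative, and the return values for empty inputs are assumptions made to avoid division by zero rather than documented behavior of the tool.

```python
def overlap_coefficient(a: set, b: set) -> float:
    """Return |A ∩ B| / min(|A|, |B|)."""
    smaller = min(len(a), len(b))
    if smaller == 0:
        return 0.0  # assumption: undefined case reported as 0
    return len(a & b) / smaller


def dice(a: set, b: set) -> float:
    """Return 2 × |A ∩ B| / (|A| + |B|)."""
    total = len(a) + len(b)
    if total == 0:
        return 1.0  # assumption: two empty sets are treated as identical
    return 2 * len(a & b) / total


a, b = {1, 2, 3}, {2, 3, 4}
print(round(overlap_coefficient(a, b), 2))  # 0.67  (2 ÷ 3)
print(round(dice(a, b), 2))                 # 0.67  (4 ÷ 6)
```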

Frequently Asked Questions

When case-sensitive mode is enabled (default), "Apple" and "apple" are treated as two distinct items. When disabled, both inputs are normalized to lowercase before comparison, so "Apple", "APPLE", and "apple" all resolve to the same entry. Choose case-insensitive mode for email lists, domain names, or any dataset where casing is inconsistent.

The tool trims all leading and trailing whitespace from every item by default. This means " hello " and "hello" are treated identically. Empty lines, including lines that contain only whitespace, are automatically removed and excluded from all counts and operations.

Large lists are supported. The comparison uses hash-based Set operations with O(n) average time complexity, where n is the total number of items across both lists. Lists of up to 100,000 items process in under 200 ms on modern hardware. For very large datasets (500k+ lines), you may notice a brief pause; a loading indicator is displayed during processing.

The tool builds a frequency map for each list independently. Any item appearing more than once within the same list is flagged. The duplicate count shown is the number of distinct items that repeat, not the total extra occurrences. For example, if "cat" appears 3 times in List A, it is counted as 1 duplicate item with 3 total occurrences.

The Jaccard Index (J) is the ratio of the intersection size to the union size of two sets. A value of 0.0 means zero overlap; 1.0 means the sets are identical. It is widely used in data deduplication, plagiarism detection, document similarity scoring, and recommendation system evaluation. Values above 0.75 typically indicate high similarity.

If you copy a column from Excel or Google Sheets, the data is newline-delimited by default - use the "New Line" delimiter. If you copy a row, values are tab-separated - use the "Tab" option. For CSV exports, use "Comma". The "Semicolon" option handles European-locale CSVs where semicolons replace commas as field separators.