About

String comparison errors propagate silently. A single misplaced character in a configuration file, API response, or legal document can cause system failures or contractual disputes. This tool performs real diff analysis using the Longest Common Subsequence (LCS) algorithm with O(n × m) dynamic programming. It computes Levenshtein edit distance, the minimum number of insertions, deletions, and substitutions needed to transform string A into string B. Similarity is quantified via the Sørensen–Dice coefficient. The tool does not approximate: it traces exact edit paths across character, word, and line granularities.

Limitations: for inputs exceeding 100,000 characters, the O(n × m) matrix becomes memory-intensive. The tool chunks computation but may slow on very large texts. Pro tip: use line-level mode for large documents and character-level mode for short strings where precision matters most.


Formulas

The primary comparison uses the Longest Common Subsequence (LCS) via dynamic programming. The recurrence relation:

C[i, j] = C[i-1, j-1] + 1               if A[i] = B[j]
C[i, j] = max(C[i-1, j], C[i, j-1])     otherwise
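As an illustrative sketch (not the tool's actual source), the recurrence above translates directly into a Python DP table:

```python
def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence via O(n*m) dynamic programming."""
    n, m = len(a), len(b)
    # C[i][j] = LCS length of the prefixes a[:i] and b[:j]
    C = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if a[i - 1] == b[j - 1]:
                C[i][j] = C[i - 1][j - 1] + 1      # characters match: extend the LCS
            else:
                C[i][j] = max(C[i - 1][j], C[i][j - 1])  # drop one character from a or b
    return C[n][m]

print(lcs_length("kitten", "sitting"))  # 4 (the subsequence "ittn")
```

The full matrix is kept here so an edit path could be traced back from C[n][m]; a distance-only variant needs just two rows.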

Similarity is computed using the Sørensen–Dice coefficient:

S = (2 × L) ÷ (n + m) × 100%
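A minimal sketch of this formula, with a hypothetical helper `dice_similarity` that takes the string lengths and a precomputed LCS length L:

```python
def dice_similarity(n: int, m: int, lcs_len: int) -> float:
    """Sørensen–Dice similarity as a percentage: S = 2L / (n + m) * 100."""
    if n + m == 0:
        return 100.0  # two empty strings are identical by convention
    return 2 * lcs_len / (n + m) * 100

# "kitten" vs "sitting": n = 6, m = 7, LCS length L = 4
print(round(dice_similarity(6, 7, 4), 2))  # 61.54
```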

Levenshtein distance uses a separate DP matrix where each cell represents the minimum edits:

d[i, j] = min(d[i-1, j] + 1, d[i, j-1] + 1, d[i-1, j-1] + c)

where c = 0 if A[i] = B[j], else c = 1. Here n = length of string A, m = length of string B, L = LCS length, d = Levenshtein distance, S = similarity percentage.
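The Levenshtein recurrence above can be sketched as a full-matrix DP in Python (an illustration of the formula, not the tool's implementation):

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum insert/delete/substitute edits to turn a into b, via O(n*m) DP."""
    n, m = len(a), len(b)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i  # transform a[:i] into "" by i deletions
    for j in range(m + 1):
        d[0][j] = j  # transform "" into b[:j] by j insertions
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            c = 0 if a[i - 1] == b[j - 1] else 1  # substitution cost
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + c)    # substitution or match
    return d[n][m]

print(levenshtein("kitten", "sitting"))  # 3
```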

Reference Data

| Metric | Symbol | Definition | Range | Use Case |
| --- | --- | --- | --- | --- |
| Levenshtein Distance | d | Minimum single-character edits (insert, delete, substitute) | 0 to max(n, m) | Spell checking, fuzzy matching |
| LCS Length | L | Length of longest common subsequence | 0 to min(n, m) | Diff algorithms, version control |
| Sørensen–Dice Coefficient | S | 2L ÷ (n + m) × 100% | 0% to 100% | Similarity scoring |
| Insertions | I | Characters/words present in B but not in A | 0 to m | Content addition tracking |
| Deletions | D | Characters/words present in A but not in B | 0 to n | Content removal tracking |
| Substitutions | R | Positions where A and B differ | 0 to min(n, m) | Mutation detection |
| Hamming Distance | H | Positions where corresponding symbols differ (equal-length only) | 0 to n | Error detection in binary data |
| Jaro Similarity | J | Accounts for character transpositions within a match window | 0 to 1 | Record linkage, name matching |
| Jaro–Winkler Similarity | Jw | Jaro with prefix bonus (up to 4 chars) | 0 to 1 | Short string matching (names) |
| Jaccard Index | Jc | \|A ∩ B\| ÷ \|A ∪ B\| (on token sets) | 0 to 1 | Document similarity |
| Cosine Similarity | cos(θ) | Dot product of term vectors divided by magnitude product | −1 to 1 | NLP, document vectors |
| Edit Distance (Damerau) | dDL | Levenshtein + transpositions | 0 to max(n, m) | OCR error correction |
| Normalized Edit Distance | dN | d ÷ max(n, m) | 0 to 1 | Length-independent comparison |
| Longest Common Substring | Ls | Longest contiguous matching block | 0 to min(n, m) | Plagiarism detection |
| Common Prefix Length | P | Characters matching from the start | 0 to min(n, m) | Autocomplete, trie structures |
| Common Suffix Length | Sf | Characters matching from the end | 0 to min(n, m) | File extension matching |
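A few of the simpler metrics in the table (Hamming distance and common prefix/suffix length) are short enough to sketch directly; these are illustrative implementations, not the tool's code:

```python
def hamming(a: str, b: str) -> int:
    """Positions where corresponding symbols differ; defined for equal lengths only."""
    if len(a) != len(b):
        raise ValueError("Hamming distance requires equal-length strings")
    return sum(x != y for x, y in zip(a, b))

def common_prefix_len(a: str, b: str) -> int:
    """Number of characters matching from the start of both strings."""
    k = 0
    while k < min(len(a), len(b)) and a[k] == b[k]:
        k += 1
    return k

def common_suffix_len(a: str, b: str) -> int:
    """Number of characters matching from the end of both strings."""
    k = 0
    while k < min(len(a), len(b)) and a[-1 - k] == b[-1 - k]:
        k += 1
    return k

print(hamming("karolin", "kathrin"))  # 3
```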

Frequently Asked Questions

How do Levenshtein distance and LCS differ?
Levenshtein distance counts the minimum single-character edits (insertions, deletions, substitutions) to transform one string into another. LCS finds the longest subsequence common to both strings without requiring contiguity. LCS-based diff produces a visual edit script (what was added, removed, or kept), while Levenshtein gives a single numeric distance. This tool computes both: the LCS for visual highlighting and Levenshtein for the numeric distance metric.

What is the difference between character, word, and line modes?
Character mode tokenizes each string into individual characters and runs LCS on that sequence. This gives maximum precision but can be noisy for large texts. Word mode splits on whitespace boundaries, treating each word as an atomic unit, which is better for prose where word-level changes matter. Line mode splits on newline characters, ideal for comparing code, configs, or structured documents. The Levenshtein distance and similarity percentage are always recomputed relative to the chosen token granularity.
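A minimal sketch of the three tokenization modes described above (assuming simple whitespace and newline splitting, which may not match the tool's exact rules):

```python
def tokenize(text: str, mode: str) -> list[str]:
    """Split text into the comparison units for the chosen granularity."""
    if mode == "char":
        return list(text)          # every character is a token
    if mode == "word":
        return text.split()        # split on runs of whitespace
    if mode == "line":
        return text.splitlines()   # split on newline characters
    raise ValueError(f"unknown mode: {mode}")

print(tokenize("a b\nc", "word"))  # ['a', 'b', 'c']
print(tokenize("a b\nc", "line"))  # ['a b', 'c']
```

The same LCS/Levenshtein machinery then runs unchanged over these token lists instead of raw characters.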

How do the whitespace and case-sensitivity options work?
If the "Ignore whitespace" option is enabled, trailing/leading spaces and multiple spaces are normalized before comparison. Disable this option to make whitespace significant. Similarly, "Case insensitive" mode normalizes both strings to lowercase before comparison, so "Hello" and "hello" would show as identical.

How does the tool handle very large inputs?
The LCS dynamic programming matrix requires O(n × m) memory. For two strings of 50,000 characters each, that is 2.5 billion cells. The tool uses an optimized approach: it reduces to two-row DP for the distance calculation and uses line-mode or word-mode chunking for the visual diff. A progress indicator appears for computations exceeding 200ms. For extreme inputs, switch to line-level comparison.
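The two-row reduction mentioned above can be sketched as follows; this is an illustrative implementation, not the tool's actual code:

```python
def levenshtein_two_row(a: str, b: str) -> int:
    """Space-optimized Levenshtein distance keeping only two matrix rows: O(m) memory."""
    prev = list(range(len(b) + 1))  # row for the empty prefix of a
    for i, ca in enumerate(a, 1):
        curr = [i] + [0] * len(b)   # first cell: delete all i characters of a[:i]
        for j, cb in enumerate(b, 1):
            c = 0 if ca == cb else 1
            curr[j] = min(prev[j] + 1,       # deletion
                          curr[j - 1] + 1,   # insertion
                          prev[j - 1] + c)   # substitution or match
        prev = curr                 # slide the window down one row
    return prev[len(b)]

print(levenshtein_two_row("kitten", "sitting"))  # 3
```

The trade-off is that only the distance survives; reconstructing the visual edit path needs the full matrix (or a divide-and-conquer scheme such as Hirschberg's algorithm).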

Can the tool detect moved blocks of text?
No. LCS and Levenshtein are sequential algorithms. If you move a paragraph from the beginning to the end, the algorithm sees a deletion at the original position and an insertion at the new position. Detecting block moves requires more complex algorithms like Patience diff or histogram diff used in Git. This tool focuses on sequential edit distance only.

How does normalized edit distance relate to the similarity percentage?
Normalized edit distance is d ÷ max(n, m), giving a value between 0 and 1 where 0 means identical. The Sørensen–Dice similarity is 2L ÷ (n + m), where 1 means identical. They are related but not exact inverses, because Levenshtein counts substitutions as single edits while LCS treats them as a delete-plus-insert pair.
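A quick worked example makes the divergence concrete, using the classic "kitten"/"sitting" pair, for which d = 3, L = 4, n = 6, and m = 7:

```python
# "kitten" vs "sitting": Levenshtein d = 3, LCS length L = 4, lengths n = 6, m = 7
d, L, n, m = 3, 4, 6, 7

norm_dist = d / max(n, m)   # 3/7  ~= 0.429
dice = 2 * L / (n + m)      # 8/13 ~= 0.615

# If the two measures were exact inverses, 1 - norm_dist would equal dice.
print(round(1 - norm_dist, 3), round(dice, 3))  # 0.571 0.615
```

The gap (0.571 vs 0.615) comes exactly from the substitution accounting described above.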