About

String comparison errors propagate silently. A single misplaced character in a configuration file, API response, or legal document can cause system failures or contractual disputes. This tool performs real diff analysis using the Longest Common Subsequence (LCS) algorithm with O(n × m) dynamic programming. It computes Levenshtein edit distance, the minimum number of insertions, deletions, and substitutions needed to transform string A into string B. Similarity is quantified via the Sørensen–Dice coefficient. The tool does not approximate: it traces exact edit paths across character, word, and line granularities.

Limitations: for inputs exceeding 100,000 characters, the O(n × m) matrix becomes memory-intensive. The tool chunks computation but may slow on very large texts. Pro tip: use line-level mode for large documents and character-level mode for short strings where precision matters most.


Formulas

The primary comparison uses the Longest Common Subsequence (LCS) via dynamic programming. The recurrence relation:

C[i, j] = C[i-1, j-1] + 1               if A[i] = B[j]
C[i, j] = max(C[i-1, j], C[i, j-1])     otherwise
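As an illustrative sketch (not the tool's actual source), the recurrence above translates directly into a Python DP table:

```python
def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence via O(n*m) dynamic programming."""
    n, m = len(a), len(b)
    # C[i][j] = LCS length of the prefixes a[:i] and b[:j]
    C = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if a[i - 1] == b[j - 1]:
                C[i][j] = C[i - 1][j - 1] + 1      # characters match: extend the LCS
            else:
                C[i][j] = max(C[i - 1][j], C[i][j - 1])  # drop one character from a or b
    return C[n][m]

print(lcs_length("kitten", "sitting"))  # 4 (the subsequence "ittn")
```

The full matrix is kept here so an edit path could be traced back from C[n][m]; a distance-only variant needs just two rows.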

Similarity is computed using the Sørensen–Dice coefficient:

S = (2 × L) ÷ (n + m) × 100%
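A minimal sketch of this formula, with a hypothetical helper `dice_similarity` that takes the string lengths and a precomputed LCS length L:

```python
def dice_similarity(n: int, m: int, lcs_len: int) -> float:
    """Sørensen–Dice similarity as a percentage: S = 2L / (n + m) * 100."""
    if n + m == 0:
        return 100.0  # two empty strings are identical by convention
    return 2 * lcs_len / (n + m) * 100

# "kitten" vs "sitting": n = 6, m = 7, LCS length L = 4
print(round(dice_similarity(6, 7, 4), 2))  # 61.54
```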

Levenshtein distance uses a separate DP matrix where each cell represents the minimum edits:

d[i, j] = min(d[i-1, j] + 1, d[i, j-1] + 1, d[i-1, j-1] + c)

where c = 0 if A[i] = B[j], else c = 1. Here n = length of string A, m = length of string B, L = LCS length, d = Levenshtein distance, S = similarity percentage.
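The Levenshtein recurrence above can be sketched as a full-matrix DP in Python (an illustration of the formula, not the tool's implementation):

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum insert/delete/substitute edits to turn a into b, via O(n*m) DP."""
    n, m = len(a), len(b)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i  # transform a[:i] into "" by i deletions
    for j in range(m + 1):
        d[0][j] = j  # transform "" into b[:j] by j insertions
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            c = 0 if a[i - 1] == b[j - 1] else 1  # substitution cost
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + c)    # substitution or match
    return d[n][m]

print(levenshtein("kitten", "sitting"))  # 3
```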

Reference Data

| Metric | Symbol | Definition | Range | Use Case |
| --- | --- | --- | --- | --- |
| Levenshtein Distance | d | Minimum single-character edits (insert, delete, substitute) | 0 to max(n, m) | Spell checking, fuzzy matching |
| LCS Length | L | Length of longest common subsequence | 0 to min(n, m) | Diff algorithms, version control |
| Sørensen–Dice Coefficient | S | 2L ÷ (n + m) × 100% | 0% to 100% | Similarity scoring |
| Insertions | I | Characters/words present in B but not in A | 0 to m | Content addition tracking |
| Deletions | D | Characters/words present in A but not in B | 0 to n | Content removal tracking |
| Substitutions | R | Positions where A and B differ | 0 to min(n, m) | Mutation detection |
| Hamming Distance | H | Positions where corresponding symbols differ (equal-length only) | 0 to n | Error detection in binary data |
| Jaro Similarity | J | Accounts for character transpositions within a match window | 0 to 1 | Record linkage, name matching |
| Jaro–Winkler Similarity | Jw | Jaro with prefix bonus (up to 4 chars) | 0 to 1 | Short string matching (names) |
| Jaccard Index | Jc | \|A ∩ B\| ÷ \|A ∪ B\| (on token sets) | 0 to 1 | Document similarity |
| Cosine Similarity | cos(θ) | Dot product of term vectors divided by magnitude product | −1 to 1 | NLP, document vectors |
| Edit Distance (Damerau) | dDL | Levenshtein + transpositions | 0 to max(n, m) | OCR error correction |
| Normalized Edit Distance | dN | d ÷ max(n, m) | 0 to 1 | Length-independent comparison |
| Longest Common Substring | Ls | Longest contiguous matching block | 0 to min(n, m) | Plagiarism detection |
| Common Prefix Length | P | Characters matching from the start | 0 to min(n, m) | Autocomplete, trie structures |
| Common Suffix Length | Sf | Characters matching from the end | 0 to min(n, m) | File extension matching |
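A few of the simpler metrics in the table (Hamming distance and common prefix/suffix length) are short enough to sketch directly; these are illustrative implementations, not the tool's code:

```python
def hamming(a: str, b: str) -> int:
    """Positions where corresponding symbols differ; defined for equal lengths only."""
    if len(a) != len(b):
        raise ValueError("Hamming distance requires equal-length strings")
    return sum(x != y for x, y in zip(a, b))

def common_prefix_len(a: str, b: str) -> int:
    """Number of characters matching from the start of both strings."""
    k = 0
    while k < min(len(a), len(b)) and a[k] == b[k]:
        k += 1
    return k

def common_suffix_len(a: str, b: str) -> int:
    """Number of characters matching from the end of both strings."""
    k = 0
    while k < min(len(a), len(b)) and a[-1 - k] == b[-1 - k]:
        k += 1
    return k

print(hamming("karolin", "kathrin"))  # 3
```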

Frequently Asked Questions

How do Levenshtein distance and LCS differ?
Levenshtein distance counts the minimum single-character edits (insertions, deletions, substitutions) to transform one string into another. LCS finds the longest subsequence common to both strings without requiring contiguity. LCS-based diff produces a visual edit script (what was added, removed, or kept), while Levenshtein gives a single numeric distance. This tool computes both: the LCS for visual highlighting and Levenshtein for the numeric distance metric.

What is the difference between character, word, and line modes?
Character mode tokenizes each string into individual characters and runs LCS on that sequence. This gives maximum precision but can be noisy for large texts. Word mode splits on whitespace boundaries, treating each word as an atomic unit, which is better for prose where word-level changes matter. Line mode splits on newline characters, ideal for comparing code, configs, or structured documents. The Levenshtein distance and similarity percentage are always recomputed relative to the chosen token granularity.
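A minimal sketch of the three tokenization modes described above (assuming simple whitespace and newline splitting, which may not match the tool's exact rules):

```python
def tokenize(text: str, mode: str) -> list[str]:
    """Split text into the comparison units for the chosen granularity."""
    if mode == "char":
        return list(text)          # every character is a token
    if mode == "word":
        return text.split()        # split on runs of whitespace
    if mode == "line":
        return text.splitlines()   # split on newline characters
    raise ValueError(f"unknown mode: {mode}")

print(tokenize("a b\nc", "word"))  # ['a', 'b', 'c']
print(tokenize("a b\nc", "line"))  # ['a b', 'c']
```

The same LCS/Levenshtein machinery then runs unchanged over these token lists instead of raw characters.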

How do the whitespace and case-sensitivity options work?
If the "Ignore whitespace" option is enabled, trailing/leading spaces and multiple spaces are normalized before comparison. Disable this option to make whitespace significant. Similarly, "Case insensitive" mode normalizes both strings to lowercase before comparison, so "Hello" and "hello" would show as identical.

How does the tool handle very large inputs?
The LCS dynamic programming matrix requires O(n × m) memory. For two strings of 50,000 characters each, that is 2.5 billion cells. The tool uses an optimized approach: it reduces to two-row DP for the distance calculation and uses line-mode or word-mode chunking for the visual diff. A progress indicator appears for computations exceeding 200ms. For extreme inputs, switch to line-level comparison.
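The two-row reduction mentioned above can be sketched as follows; this is an illustrative implementation, not the tool's actual code:

```python
def levenshtein_two_row(a: str, b: str) -> int:
    """Space-optimized Levenshtein distance keeping only two matrix rows: O(m) memory."""
    prev = list(range(len(b) + 1))  # row for the empty prefix of a
    for i, ca in enumerate(a, 1):
        curr = [i] + [0] * len(b)   # first cell: delete all i characters of a[:i]
        for j, cb in enumerate(b, 1):
            c = 0 if ca == cb else 1
            curr[j] = min(prev[j] + 1,       # deletion
                          curr[j - 1] + 1,   # insertion
                          prev[j - 1] + c)   # substitution or match
        prev = curr                 # slide the window down one row
    return prev[len(b)]

print(levenshtein_two_row("kitten", "sitting"))  # 3
```

The trade-off is that only the distance survives; reconstructing the visual edit path needs the full matrix (or a divide-and-conquer scheme such as Hirschberg's algorithm).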

Can the tool detect moved blocks of text?
No. LCS and Levenshtein are sequential algorithms. If you move a paragraph from the beginning to the end, the algorithm sees a deletion at the original position and an insertion at the new position. Detecting block moves requires more complex algorithms like Patience diff or histogram diff used in Git. This tool focuses on sequential edit distance only.

How does normalized edit distance relate to the similarity percentage?
Normalized edit distance is d ÷ max(n, m), giving a value between 0 and 1 where 0 means identical. The Sørensen–Dice similarity is 2L ÷ (n + m), where 1 means identical. They are related but not exact inverses, because Levenshtein counts substitutions as single edits while LCS treats them as a delete-plus-insert pair.
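A quick worked example makes the divergence concrete, using the classic "kitten"/"sitting" pair, for which d = 3, L = 4, n = 6, and m = 7:

```python
# "kitten" vs "sitting": Levenshtein d = 3, LCS length L = 4, lengths n = 6, m = 7
d, L, n, m = 3, 4, 6, 7

norm_dist = d / max(n, m)   # 3/7  ~= 0.429
dice = 2 * L / (n + m)      # 8/13 ~= 0.615

# If the two measures were exact inverses, 1 - norm_dist would equal dice.
print(round(1 - norm_dist, 3), round(dice, 3))  # 0.571 0.615
```

The gap (0.571 vs 0.615) comes exactly from the substitution accounting described above.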