Compare Two Strings
Compare two strings side-by-side with character, word, and line-level diff highlighting. Shows Levenshtein distance and similarity percentage.
About
String comparison errors propagate silently. A single misplaced character in a configuration file, API response, or legal document can cause system failures or contractual disputes. This tool computes an exact diff using the Longest Common Subsequence (LCS) algorithm, implemented as O(n × m) dynamic programming. It also computes the Levenshtein edit distance: the minimum number of single-character insertions, deletions, and substitutions needed to transform string A into string B. Similarity is quantified via the Dice-Sørensen coefficient applied to the LCS length. The tool does not approximate; it traces exact edit paths at character, word, and line granularity.
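To make the idea of an exact edit path concrete, here is a minimal Python sketch of an LCS-based diff that works at any of the three granularities. It is not the tool's actual source: the function names `tokenize` and `lcs_diff`, the `=`/`-`/`+` markers, and the whitespace-based word splitting are assumptions made for illustration.

```python
def tokenize(text, mode):
    """Split text into comparison units (hypothetical helper)."""
    if mode == "char":
        return list(text)
    if mode == "word":
        return text.split()        # assumption: words are whitespace-separated
    if mode == "line":
        return text.splitlines()
    raise ValueError(f"unknown mode: {mode}")


def lcs_diff(a, b):
    """Return (op, token) pairs: '=' kept, '-' only in a, '+' only in b."""
    n, m = len(a), len(b)
    # table[i][j] holds the LCS length of the suffixes a[i:] and b[j:].
    table = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            if a[i] == b[j]:
                table[i][j] = table[i + 1][j + 1] + 1
            else:
                table[i][j] = max(table[i + 1][j], table[i][j + 1])

    # Walk forward through the table to recover one exact edit path.
    ops, i, j = [], 0, 0
    while i < n and j < m:
        if a[i] == b[j]:
            ops.append(("=", a[i]))
            i += 1
            j += 1
        elif table[i + 1][j] >= table[i][j + 1]:
            ops.append(("-", a[i]))
            i += 1
        else:
            ops.append(("+", b[j]))
            j += 1
    ops.extend(("-", token) for token in a[i:])
    ops.extend(("+", token) for token in b[j:])
    return ops


word_ops = lcs_diff(tokenize("the quick fox", "word"),
                    tokenize("the slow fox", "word"))
print(word_ops)
```

The same `lcs_diff` function serves all three modes because it only compares tokens for equality; the granularity is decided entirely by how the input is tokenized.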
Limitations: for inputs exceeding 100,000 characters, the O(n × m) matrix becomes memory-intensive (two 100,000-character inputs imply on the order of 10^10 cells). The tool chunks the computation but may still slow down on very large texts. Pro tip: use line-level mode for large documents, where each line counts as one unit, and character-level mode for short strings where precision matters most.
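The tool's own chunking strategy is not described here. As a separate, commonly used way to reduce memory, the Levenshtein distance can be computed while keeping only two rows of the matrix. The sketch below assumes only the distance is needed, not the edit path, and the function name `levenshtein_two_rows` is made up for illustration.

```python
def levenshtein_two_rows(a, b):
    """Levenshtein distance using O(min(n, m)) memory instead of O(n * m)."""
    if len(a) < len(b):
        a, b = b, a                      # keep the shorter sequence along the row
    previous = list(range(len(b) + 1))   # distances from the empty prefix of a
    for i, unit_a in enumerate(a, start=1):
        current = [i]                    # deleting the first i units of a
        for j, unit_b in enumerate(b, start=1):
            cost = 0 if unit_a == unit_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # match or substitution
            ))
        previous = current
    return previous[-1]


# Works on strings (character level) and on lists of lines (line level).
print(levenshtein_two_rows("kitten", "sitting"))
print(levenshtein_two_rows("a\nb\nc".splitlines(), "a\nx\nc".splitlines()))
```

Passing lists of lines instead of raw strings mirrors the line-level tip: the matrix then has one row and column per line rather than per character.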
Formulas
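The primary comparison uses the Longest Common Subsequence (LCS) via dynamic programming. The recurrence relation:

$$
L_{i,j} =
\begin{cases}
0 & \text{if } i = 0 \text{ or } j = 0 \\
L_{i-1,j-1} + 1 & \text{if } A[i] = B[j] \\
\max\left(L_{i-1,j},\ L_{i,j-1}\right) & \text{otherwise}
\end{cases}
$$

Similarity is computed using the Dice-Sørensen coefficient:

$$
S = \frac{2L}{n + m} \times 100\%
$$

Levenshtein distance uses a separate DP matrix where each cell represents the minimum edits:

$$
d_{i,j} = \min\left(d_{i-1,j} + 1,\ d_{i,j-1} + 1,\ d_{i-1,j-1} + c\right), \qquad d_{i,0} = i,\quad d_{0,j} = j
$$

where c = 0 if A[i] = B[j], else c = 1, with characters indexed from 1. Here n = length of string A, m = length of string B, L = L_{n,m} is the LCS length, d = d_{n,m} is the Levenshtein distance, and S = similarity percentage.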
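The quantities defined above (LCS length L, Levenshtein distance d, and similarity S = 2L ÷ (n + m)) can be sketched directly in Python. This is an illustrative implementation rather than the tool's own code; the function names are assumptions, as is the choice to treat two empty strings as 100% similar.

```python
def lcs_length(a, b):
    """L: length of the longest common subsequence of a and b."""
    n, m = len(a), len(b)
    table = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if a[i - 1] == b[j - 1]:          # A[i] = B[j] in 1-based terms
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table[n][m]


def levenshtein(a, b):
    """d: minimum number of insertions, deletions, and substitutions."""
    n, m = len(a), len(b)
    table = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        table[i][0] = i                       # delete i characters
    for j in range(m + 1):
        table[0][j] = j                       # insert j characters
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            c = 0 if a[i - 1] == b[j - 1] else 1
            table[i][j] = min(table[i - 1][j] + 1,
                              table[i][j - 1] + 1,
                              table[i - 1][j - 1] + c)
    return table[n][m]


def similarity_percent(a, b):
    """S = 2L / (n + m), expressed as a percentage."""
    n, m = len(a), len(b)
    if n + m == 0:
        return 100.0                          # assumption: empty vs empty is identical
    return 200.0 * lcs_length(a, b) / (n + m)


print(lcs_length("kitten", "sitting"))         # common subsequence "ittn"
print(levenshtein("kitten", "sitting"))
print(round(similarity_percent("kitten", "sitting"), 1))
```

For "kitten" and "sitting", L = 4 and n + m = 13, so S = 8 ÷ 13, or roughly 61.5%.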
Reference Data
| Metric | Symbol | Definition | Range | Use Case |
|---|---|---|---|---|
| Levenshtein Distance | d | Minimum single-character edits (insert, delete, substitute) | 0 to max(n, m) | Spell checking, fuzzy matching |
| LCS Length | L | Length of longest common subsequence | 0 to min(n, m) | Diff algorithms, version control |
| Dice-Sørensen Coefficient | S | 2L ÷ (n + m), expressed as a percentage | 0% to 100% | Similarity scoring |
| Insertions | I | Characters/words present in B but not in A | 0 to m | Content addition tracking |
| Deletions | D | Characters/words present in A but not in B | 0 to n | Content removal tracking |
| Substitutions | R | Positions where A and B differ | 0 to min(n, m) | Mutation detection |
| Hamming Distance | H | Positions where corresponding symbols differ (equal-length only) | 0 to n | Error detection in binary data |
| Jaro Similarity | J | Accounts for character transpositions within a match window | 0 to 1 | Record linkage, name matching |
| Jaro-Winkler Similarity | Jw | Jaro with prefix bonus (up to 4 chars) | 0 to 1 | Short string matching (names) |
| Jaccard Index | Jc | \|A ∩ B\| ÷ \|A ∪ B\| (on token sets) | 0 to 1 | Document similarity |
| Cosine Similarity | cos(θ) | Dot product of term vectors divided by magnitude product | −1 to 1 (0 to 1 for non-negative term counts) | NLP, document vectors |
| Edit Distance (Damerau) | dDL | Levenshtein + transpositions | 0 to max(n, m) | OCR error correction |
| Normalized Edit Distance | dN | d ÷ max(n, m) | 0 to 1 | Length-independent comparison |
| Longest Common Substring | Ls | Longest contiguous matching block | 0 to min(n, m) | Plagiarism detection |
| Common Prefix Length | P | Characters matching from the start | 0 to min(n, m) | Autocomplete, trie structures |
| Common Suffix Length | Sf | Characters matching from the end | 0 to min(n, m) | File extension matching |
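
Several of the simpler metrics in the table take only a few lines each. The Python sketch below covers Hamming distance, the Jaccard index, and common prefix and suffix length. The function names are illustrative, Jaccard tokens are assumed to be whitespace-separated words, and two empty token sets are assumed to score 1.0.

```python
def hamming(a, b):
    """H: positions where equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance requires equal-length inputs")
    return sum(1 for x, y in zip(a, b) if x != y)


def jaccard(a, b):
    """Jc: |A ∩ B| / |A ∪ B| on sets of whitespace-separated tokens."""
    set_a, set_b = set(a.split()), set(b.split())
    union = set_a | set_b
    if not union:
        return 1.0                     # assumption: two empty sets are identical
    return len(set_a & set_b) / len(union)


def common_prefix_length(a, b):
    """P: number of characters matching from the start."""
    count = 0
    for x, y in zip(a, b):
        if x != y:
            break
        count += 1
    return count


def common_suffix_length(a, b):
    """Sf: number of characters matching from the end."""
    return common_prefix_length(a[::-1], b[::-1])


print(hamming("karolin", "kathrin"))
print(jaccard("a b c", "b c d"))
print(common_prefix_length("interstate", "internet"))
print(common_suffix_length("report.txt", "notes.txt"))
```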