About

Manually scanning two text versions for differences is error-prone. A single misplaced comma in a contract, a silently altered clause, or an overwritten variable name in source code can cascade into costly failures. This tool computes a precise line-level and word-level diff using the Longest Common Subsequence algorithm, reporting every insertion, deletion, and unchanged region. It calculates a similarity ratio S as a percentage, plus Levenshtein edit distance d, character counts, and word counts for both inputs. Results are deterministic and instantaneous for texts up to ~100k characters. The tool assumes plain-text input and compares Unicode codepoints; it does not normalize whitespace or ignore case unless you choose those options explicitly.

Formulas

The similarity ratio is derived from the Longest Common Subsequence length relative to both inputs:

S = (2 × L_LCS) / (L_A + L_B) × 100%

Where L_A and L_B are line counts of Text A and Text B respectively, and L_LCS is the length of their longest common subsequence of lines.
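As a minimal sketch of the ratio above (not the tool's own code), the following Python computes L_LCS with a two-row dynamic-programming table and then applies the formula. The function names `lcs_length` and `similarity_ratio` are illustrative, and treating two empty texts as 100% similar is an assumption, since the formula is undefined when L_A + L_B = 0.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two sequences."""
    # prev[j] holds the LCS length of the items of a seen so far and b[:j]
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def similarity_ratio(text_a, text_b):
    """S = 2 * L_LCS / (L_A + L_B) * 100, computed over lines."""
    lines_a = text_a.splitlines()
    lines_b = text_b.splitlines()
    total = len(lines_a) + len(lines_b)
    if total == 0:
        return 100.0  # assumption: two empty texts count as identical
    return 2 * lcs_length(lines_a, lines_b) / total * 100


a = "alpha\nbeta\ngamma\ndelta"
b = "alpha\ngamma\ndelta\nepsilon"
# Three lines are shared in order, so S = 2 * 3 / (4 + 4) * 100
print(similarity_ratio(a, b))  # 75.0
```

Keeping only two rows of the table reduces memory from O(m×n) to O(n) when only the length is needed; recovering the actual diff requires the full table or a different algorithm.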

The Levenshtein distance between two strings of lengths m and n is computed via dynamic programming:

d(i, j) = min(d(i−1, j) + 1, d(i, j−1) + 1, d(i−1, j−1) + c)

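Where c = 0 if the characters at positions i and j match, and 1 otherwise. The base cases are d(i, 0) = i and d(0, j) = j, and the final distance is d(m, n).

The Jaccard similarity index for word sets, where A and B are the sets of unique words in Text A and Text B, is: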

J = |A ∩ B| / |A ∪ B|
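The two formulas above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the tool's implementation: the function names are hypothetical, words are split on whitespace as in the reference table, and two empty word sets are assumed to give J = 1.

```python
def levenshtein(s, t):
    """Edit distance via the recurrence d(i, j), keeping only two rows."""
    prev = list(range(len(t) + 1))  # base case d(0, j) = j
    for i, cs in enumerate(s, start=1):
        curr = [i]  # base case d(i, 0) = i
        for j, ct in enumerate(t, start=1):
            c = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1,        # deletion
                            curr[j - 1] + 1,    # insertion
                            prev[j - 1] + c))   # match or substitution
        prev = curr
    return prev[-1]


def jaccard(text_a, text_b):
    """J = |A ∩ B| / |A ∪ B| over unique whitespace-delimited words."""
    a, b = set(text_a.split()), set(text_b.split())
    if not a and not b:
        return 1.0  # assumption: two empty word sets are treated as identical
    return len(a & b) / len(a | b)


print(levenshtein("kitten", "sitting"))  # 3
# "the" and "brown" are shared; the union holds 6 unique words, so J = 2/6
print(jaccard("the quick brown fox", "the lazy brown dog"))
```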

Reference Data

Metric	Symbol	Description	Range
Similarity Ratio	S	LCS length as a percentage of the average line count of the two texts	0 - 100%
Levenshtein Distance	d	Minimum single-character edits (insert, delete, substitute)	0 - max(m,n)
Lines Added	L₊	Lines present only in Text B	≥ 0
Lines Removed	L₋	Lines present only in Text A	≥ 0
Lines Unchanged	L₌	Identical lines in both texts	≥ 0
Words (Text A)	W_A	Whitespace-delimited token count in original	≥ 0
Words (Text B)	W_B	Whitespace-delimited token count in modified	≥ 0
Characters (Text A)	C_A	Total Unicode codepoints in original	≥ 0
Characters (Text B)	C_B	Total Unicode codepoints in modified	≥ 0
LCS Length	L_LCS	Length of longest common subsequence (lines)	≥ 0
Edit Operations	E	Total insert + delete operations in diff	≥ 0
Jaccard Index (Words)	J	Intersection over union of unique word sets	0 - 1

Frequently Asked Questions

The LCS-based diff does not track line movement as a distinct operation. A line removed from position 5 and appearing at position 12 will show as one deletion and one insertion. This is consistent with standard unified diff behavior used by Git and GNU diff. For detecting moves, you would need a secondary pass comparing removed and added lines, which this tool does not perform to keep output deterministic and unambiguous.
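The behavior described above can be reproduced with a small LCS-based line diff. The sketch below is an assumption about how such a diff can be written, not the tool's code; `line_diff` is a hypothetical name, and the final "move" pass is exactly the kind of secondary comparison the tool does not perform.

```python
def line_diff(a, b):
    """Return a list of (tag, line) pairs with tag ' ', '-' or '+'."""
    m, n = len(a), len(b)
    # table[i][j] = LCS length of a[i:] and b[j:]
    table = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m - 1, -1, -1):
        for j in range(n - 1, -1, -1):
            if a[i] == b[j]:
                table[i][j] = table[i + 1][j + 1] + 1
            else:
                table[i][j] = max(table[i + 1][j], table[i][j + 1])
    ops, i, j = [], 0, 0
    while i < m and j < n:
        if a[i] == b[j]:
            ops.append((" ", a[i]))
            i += 1
            j += 1
        elif table[i + 1][j] >= table[i][j + 1]:
            ops.append(("-", a[i]))
            i += 1
        else:
            ops.append(("+", b[j]))
            j += 1
    ops.extend(("-", line) for line in a[i:])
    ops.extend(("+", line) for line in b[j:])
    return ops


old = ["import os", "def main():", "    run()", "main()"]
new = ["def main():", "    run()", "main()", "import os"]
ops = line_diff(old, new)
for tag, line in ops:
    print(tag, line)  # "import os" appears once with '-' and once with '+'

# Hypothetical secondary pass (not performed by the tool): a removed line
# that reappears among the added lines is a candidate move.
removed = [line for tag, line in ops if tag == "-"]
added = [line for tag, line in ops if tag == "+"]
print([line for line in removed if line in added])  # ['import os']
```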

The LCS algorithm has O(m×n) space and time complexity in its basic form. For texts exceeding roughly 10,000 lines each, computation may take several seconds. The tool uses an optimized approach that trims common prefixes and suffixes before running the core diff, which dramatically reduces the problem size for typical revision comparisons where most content is unchanged.
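The prefix and suffix trimming mentioned above can be sketched as follows. The helper name `trim_common` and its return shape are assumptions made for illustration; the tool's internal structure is not documented here.

```python
def trim_common(a, b):
    """Strip shared leading and trailing lines.

    Returns (prefix_len, suffix_len, core_a, core_b).
    """
    limit = min(len(a), len(b))
    prefix = 0
    while prefix < limit and a[prefix] == b[prefix]:
        prefix += 1
    suffix = 0
    # Stop before the suffix overlaps the prefix that was already matched
    while suffix < limit - prefix and a[-1 - suffix] == b[-1 - suffix]:
        suffix += 1
    return prefix, suffix, a[prefix:len(a) - suffix], b[prefix:len(b) - suffix]


old = [f"line {k}" for k in range(1000)]
new = list(old)
new[500] = "line 500 (edited)"

prefix, suffix, core_old, core_new = trim_common(old, new)
print(prefix, suffix, core_old, core_new)
# 500 499 ['line 500'] ['line 500 (edited)']
```

In this example the core diff only has to compare one line against one line instead of filling a 1000 × 1000 table, which is why the optimization pays off when most content is unchanged.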

By default, whitespace differences count. Every character, including spaces, tabs, and line endings, contributes to the comparison. Enable the "Trim whitespace" option to strip leading and trailing whitespace from each line before comparison. This prevents indentation changes from inflating the edit count. The similarity ratio S will increase when whitespace-only differences are excluded.
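A minimal sketch of what per-line trimming does before comparison, assuming the option behaves like Python's `str.strip()`. The `normalize` helper and its `trim_whitespace` flag are hypothetical names, not the tool's API.

```python
def normalize(lines, trim_whitespace=False):
    """Optionally strip leading and trailing whitespace from every line."""
    return [line.strip() for line in lines] if trim_whitespace else list(lines)


old = ["def f(x):", "  return x + 1"]
new = ["def f(x):", "    return x + 1"]  # only the indentation changed

print(normalize(old) == normalize(new))              # False
print(normalize(old, True) == normalize(new, True))  # True
```

Interior whitespace, such as the spaces around `+`, is left untouched, so changes inside a line are still reported.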

When two lines are paired as a modification (one removed, one added), the tool runs a secondary character-level LCS on those two lines. Characters present in only the old version are highlighted as deletions; characters present only in the new version are highlighted as insertions. This provides granular visibility into what exactly changed within a single line.
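The secondary character-level pass can be illustrated like this. It is a sketch only: `char_diff` is a hypothetical name, and the `[-x-]` and `{+x+}` markers stand in for whatever highlighting the tool actually renders.

```python
def char_diff(old, new):
    """Mark characters only in old as [-x-] and only in new as {+x+}."""
    m, n = len(old), len(new)
    # table[i][j] = LCS length of old[i:] and new[j:]
    table = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m - 1, -1, -1):
        for j in range(n - 1, -1, -1):
            if old[i] == new[j]:
                table[i][j] = table[i + 1][j + 1] + 1
            else:
                table[i][j] = max(table[i + 1][j], table[i][j + 1])
    out, i, j = [], 0, 0
    while i < m or j < n:
        if i < m and j < n and old[i] == new[j]:
            out.append(old[i])  # unchanged character
            i += 1
            j += 1
        elif j == n or (i < m and table[i + 1][j] >= table[i][j + 1]):
            out.append(f"[-{old[i]}-]")  # present only in the old line
            i += 1
        else:
            out.append(f"{{+{new[j]}+}}")  # present only in the new line
            j += 1
    return "".join(out)


print(char_diff("total = 10", "total = 15"))  # total = 1[-0-]{+5+}
```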

The tool does not understand the syntax of programming languages or data formats. The comparison operates on raw text codepoints with no awareness of language grammar, so it treats Python, JSON, and prose identically. Syntactic equivalences (e.g., reordered JSON keys producing identical objects) will appear as differences. For semantic code comparison, use AST-based tooling. This tool is built for literal textual diffs.
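The gap between textual and semantic equality is easy to demonstrate with Python's standard `json` and `ast` modules. The sample strings are made up for illustration.

```python
import ast
import json

a = '{"name": "diff", "version": 2}'
b = '{"version": 2, "name": "diff"}'

print(a == b)                          # False: the raw text differs
print(json.loads(a) == json.loads(b))  # True: the parsed objects are equal

src_a = "total = price*qty"
src_b = "total = price * qty"

print(src_a == src_b)                                            # False
print(ast.dump(ast.parse(src_a)) == ast.dump(ast.parse(src_b)))  # True
```

A literal text diff reports both pairs as changed, which is the right answer when the exact bytes matter (for example in contracts or configuration files under review) and the wrong one when only the meaning matters.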

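Levenshtein distance counts the minimum number of single-character edits (insertions, deletions, substitutions) needed to transform one string into another. It operates at the character level on the full text. The LCS-based similarity works at the line level, measuring how many lines are shared in sequence, and a line counts as shared only if it matches exactly. The two metrics can therefore diverge: a small Levenshtein distance with low line similarity means many small in-line edits spread across lines that otherwise match, while a large Levenshtein distance with high line similarity means the changes are concentrated in a few long or heavily rewritten lines.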
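A short sketch makes the difference between the two metrics concrete. The helpers below are compact, illustrative versions of the formulas from the Formulas section, with hypothetical names, and the sample texts are made up.

```python
def lcs_length(a, b):
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def levenshtein(s, t):
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (cs != ct)))
        prev = curr
    return prev[-1]


def line_similarity(text_a, text_b):
    """S over lines; assumes at least one of the texts is non-empty."""
    a, b = text_a.splitlines(), text_b.splitlines()
    return 200 * lcs_length(a, b) / (len(a) + len(b))


# One character changed on every line: no line matches exactly
old = "x = 1\ny = 2\nz = 3"
new = "x = 7\ny = 8\nz = 9"
print(line_similarity(old, new), levenshtein(old, new))  # 0.0 3

# One line fully rewritten, four lines untouched
old2 = "alpha\nbeta\ngamma\ndelta\n0123456789"
new2 = "alpha\nbeta\ngamma\ndelta\nQRSTUVWXYZ"
print(line_similarity(old2, new2), levenshtein(old2, new2))  # 80.0 10
```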