Cosine Similarity Calculator
Calculate cosine similarity between two vectors instantly. Get dot product, magnitudes, angle, and 2D visualization. Supports N-dimensional vectors.
About
Cosine similarity measures the cosine of the angle between two non-zero vectors in an inner product space. It yields a value in [−1, 1], where 1 indicates identical orientation, 0 indicates orthogonality, and −1 indicates diametrically opposed directions. The metric is magnitude-invariant: it compares direction only. This property makes it the standard similarity measure in NLP (TF-IDF, word embeddings), recommendation systems, and information retrieval. Misapplying Euclidean distance where cosine similarity is appropriate leads to spurious clustering in high-dimensional sparse spaces. This calculator computes cos(θ) from raw vector components, returning the dot product, both Euclidean norms, and the recovered angle.
The tool handles vectors of arbitrary dimension n ≥ 2. Inputs are validated component-wise. A zero vector has undefined direction and produces an undefined similarity. The formula assumes a real-valued vector space ℝⁿ. For 2-dimensional inputs, a canvas visualization renders both vectors and the subtended angle. Pro tip: in text analytics, normalize your term-frequency vectors before computing similarity to avoid bias from document length.
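The validation rules above (component-wise parsing, minimum dimension, zero-vector rejection) and the L2 normalization tip can be sketched in a few lines of plain Python. This is an illustrative sketch, not the calculator's actual implementation; function names are assumptions.

```python
import math

def validate_vector(components):
    """Parse string components into floats, enforcing the rules above."""
    vec = [float(c) for c in components]  # raises ValueError on non-numeric input
    if len(vec) < 2:
        raise ValueError("at least 2 dimensions required")
    if all(x == 0.0 for x in vec):
        raise ValueError("zero vector: direction and similarity are undefined")
    return vec

def l2_normalize(vec):
    """Scale a vector to unit length; its direction is unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]
```

Because cosine similarity is magnitude-invariant, normalizing first does not change the result; it is useful when the same vectors also feed magnitude-sensitive metrics downstream.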
Formulas
The cosine similarity between two vectors A and B in ℝⁿ is defined as:

cos(θ) = (A · B) / (‖A‖ ‖B‖)
Where A · B is the dot product, ‖A‖ is the Euclidean norm (L2 norm) of vector A, and n is the number of dimensions. The angle is recovered via θ = arccos(cos θ). The input is clamped to [−1, 1] before applying arccos to guard against floating-point rounding errors that could produce NaN.
The dot product expands as: A · B = A₁B₁ + A₂B₂ + … + AₙBₙ. The Euclidean norm: ‖A‖ = √(A₁² + A₂² + … + Aₙ²).
Reference Data
| Similarity Value | Angle (θ) | Interpretation | Typical Use Case |
|---|---|---|---|
| 1.000 | 0° | Identical direction | Duplicate document detection |
| 0.950 | 18.2° | Very high similarity | Near-duplicate / paraphrase detection |
| 0.866 | 30° | High similarity | Related topics in topic modeling |
| 0.707 | 45° | Moderate similarity | Partially related content |
| 0.500 | 60° | Weak similarity | Tangentially related features |
| 0.000 | 90° | Orthogonal (no similarity) | Independent / uncorrelated features |
| −0.500 | 120° | Weak opposition | Sentiment polarity (partial) |
| −0.707 | 135° | Moderate opposition | Contrasting user preference vectors |
| −1.000 | 180° | Diametrically opposed | Perfect negation in signed embeddings |
| **Common Thresholds in Industry** | | | |
| ≥ 0.80 | ≤ 36.9° | "Similar" in most NLP pipelines | Search result deduplication |
| ≥ 0.90 | ≤ 25.8° | "Highly similar" threshold | Plagiarism detection (strict) |
| ≥ 0.95 | ≤ 18.2° | "Near-identical" | Code clone detection |
| ≥ 0.60 | ≤ 53.1° | "Related" in recommendation | Collaborative filtering cutoff |
Comparison with Other Distance Metrics

| Metric | Key Property | Typical Use Case |
|---|---|---|
| Euclidean Distance | Magnitude-sensitive | K-means clustering, KNN |
| Manhattan Distance | L1 norm, robust to outliers | Sparse feature spaces |
| Jaccard Similarity | Set-based, binary features | Shingling, MinHash |
| Pearson Correlation | Centered cosine similarity | Collaborative filtering (mean-adjusted) |
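The "centered cosine" characterization of Pearson correlation can be verified directly: subtracting each vector's mean and then taking cosine similarity reproduces the correlation coefficient. A minimal sketch in plain Python (function names are illustrative):

```python
import math

def cosine(a, b):
    """Plain cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def pearson(a, b):
    """Pearson correlation = cosine similarity of mean-centered vectors."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    return cosine([x - mean_a for x in a], [y - mean_b for y in b])
```

Mean-centering is why Pearson correlation is preferred in collaborative filtering: it removes per-user rating bias (one user's "3 stars" may mean another user's "5 stars") before comparing direction.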