About

Cosine similarity measures the cosine of the angle between two non-zero vectors in an inner product space. It yields a value in [−1, 1], where 1 indicates identical orientation, 0 indicates orthogonality, and −1 indicates diametrically opposed directions. The metric is magnitude-invariant: it compares direction only. This property makes it the standard similarity measure in NLP (TF-IDF, word embeddings), recommendation systems, and information retrieval. Misapplying Euclidean distance where cosine similarity is appropriate leads to spurious clustering in high-dimensional sparse spaces. This calculator computes cos(θ) from raw vector components, returning the dot product, both Euclidean norms, and the recovered angle.

The tool handles vectors of arbitrary dimension n ≥ 2. Inputs are validated component-wise. A zero vector has undefined direction and produces an undefined similarity. The formula assumes a real-valued vector space ℝⁿ. For 2-dimensional inputs, a canvas visualization renders both vectors and the subtended angle. Pro tip: in text analytics, normalize your term-frequency vectors before computing similarity to avoid bias from document length.
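In Python, the computation described above might look like this (a sketch of the approach, not the calculator's actual source):

```python
import math

def cosine_similarity(a, b):
    """Return (dot, norm_a, norm_b, cos, angle_deg), or None if undefined."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("vectors must share a dimension n >= 2")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return None  # zero vector: direction, and thus similarity, is undefined
    cos = dot / (norm_a * norm_b)
    cos = max(-1.0, min(1.0, cos))  # clamp against floating-point drift
    return dot, norm_a, norm_b, cos, math.degrees(math.acos(cos))
```

For example, `cosine_similarity([1, 0], [0, 1])` returns a cosine of 0.0 and an angle of 90.0°, and any zero vector yields `None` rather than a division-by-zero error.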


Formulas

The cosine similarity between two vectors A and B in ℝⁿ is defined as:

cos(θ) = (A ⋅ B) / (‖A‖ × ‖B‖) = ( Σ AᵢBᵢ ) / ( √(Σ Aᵢ²) × √(Σ Bᵢ²) ),  summing over i = 1 … n

Where A ⋅ B is the dot product, ‖A‖ is the Euclidean norm (L2 norm) of vector A, and n is the number of dimensions. The angle is recovered via θ = arccos(cos(θ)). The input is clamped to [−1, 1] before applying arccos to guard against floating-point rounding errors that could produce NaN.
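A minimal sketch of that clamping guard, assuming Python's math.acos (which raises an error for inputs outside [−1, 1]):

```python
import math

def safe_arccos_deg(c):
    # Clamp to [-1, 1] so math.acos never fails on values like
    # 1.0000000000000002 produced by floating-point rounding.
    return math.degrees(math.acos(max(-1.0, min(1.0, c))))
```

Without the clamp, `math.acos(1.0000000000000002)` raises a domain error; with it, the call returns 0.0° as expected.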

The dot product expands as: A ⋅ B = A₁B₁ + A₂B₂ + … + AₙBₙ. The Euclidean norm: ‖A‖ = √(A₁² + A₂² + … + Aₙ²).
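A worked numeric check of these expansions (the example vectors A = (1, 2, 3) and B = (4, 5, 6) are illustrative, not from the tool):

```python
import math

# A . B = 1*4 + 2*5 + 3*6 = 32
dot = 1*4 + 2*5 + 3*6

norm_a = math.sqrt(1**2 + 2**2 + 3**2)  # = sqrt(14)
norm_b = math.sqrt(4**2 + 5**2 + 6**2)  # = sqrt(77)

cos = dot / (norm_a * norm_b)  # 32 / sqrt(1078), approx. 0.9746
```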

Reference Data

Similarity Value | Angle (θ) | Interpretation             | Typical Use Case
1.000            | 0°        | Identical direction        | Duplicate document detection
0.950            | 18.2°     | Very high similarity       | Near-duplicate / paraphrase detection
0.866            | 30°       | High similarity            | Related topics in topic modeling
0.707            | 45°       | Moderate similarity        | Partially related content
0.500            | 60°       | Weak similarity            | Tangentially related features
0.000            | 90°       | Orthogonal (no similarity) | Independent / uncorrelated features
−0.500           | 120°      | Weak opposition            | Sentiment polarity (partial)
−0.707           | 135°      | Moderate opposition        | Contrasting user preference vectors
−1.000           | 180°      | Diametrically opposed      | Perfect negation in signed embeddings
Common Thresholds in Industry

Threshold | Max Angle | Interpretation                  | Typical Use Case
≥ 0.80    | ≤ 36.9°   | "Similar" in most NLP pipelines | Search result deduplication
≥ 0.90    | ≤ 25.8°   | "Highly similar" threshold      | Plagiarism detection (strict)
≥ 0.95    | ≤ 18.2°   | "Near-identical"                | Code clone detection
≥ 0.60    | ≤ 53.1°   | "Related" in recommendation     | Collaborative filtering cutoff
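Applied in code, a threshold check is a one-liner; the helper below is hypothetical and uses the 0.80 "similar" cutoff common in NLP pipelines:

```python
def label_pair(cos_sim, threshold=0.80):
    # Thresholds are domain-specific: 0.80 is a common dedup default,
    # but recommendation pipelines often relax this to 0.60.
    return "similar" if cos_sim >= threshold else "dissimilar"
```

Usage: `label_pair(0.95)` yields `"similar"`, while `label_pair(0.5)` yields `"dissimilar"`.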
Comparison with Other Distance Metrics

Metric              | Key Property                | Typical Use Case
Euclidean Distance  | Magnitude-sensitive         | K-means clustering, KNN
Manhattan Distance  | L1 norm, robust to outliers | Sparse feature spaces
Jaccard Similarity  | Set-based, binary features  | Shingling, MinHash
Pearson Correlation | Centered cosine similarity  | Collaborative filtering (mean-adjusted)

Frequently Asked Questions

What happens if one of the vectors is the zero vector?

A zero vector has no direction and therefore undefined cosine similarity. The denominator becomes 0 because ‖A‖ = 0, producing a division by zero. This calculator detects zero-magnitude vectors and returns an "undefined" result with an explanatory message rather than NaN or Infinity.
Does vector magnitude affect the result?

No. The formula normalizes both vectors by dividing by their norms. The vectors [1, 2] and [100, 200] yield a similarity of 1.000 despite vastly different magnitudes. This is desirable in text analytics (document length shouldn't affect topic similarity) but problematic in domains where magnitude carries information, such as force vectors in physics. In those cases, use Euclidean distance or a magnitude-weighted metric instead.
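This scale invariance is easy to verify directly (the cos_sim helper is my own sketch, not the tool's code):

```python
import math

def cos_sim(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# [100, 200] is just [1, 2] scaled by 100: same direction, same similarity
print(round(cos_sim([1, 2], [100, 200]), 3))  # 1.0
```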
Can cosine similarity be negative?

Yes. Cosine similarity ranges from −1 to 1. Negative values occur when the angle θ between the vectors exceeds 90°. In TF-IDF spaces, term frequencies are non-negative, so similarity is always in [0, 1]. However, in word embedding spaces (Word2Vec, GloVe), where components can be negative, similarity can be negative, indicating semantic opposition.
How does dimensionality affect interpretation?

In very high-dimensional spaces (thousands of dimensions), random vectors tend to become nearly orthogonal, pushing cosine similarity toward 0 for unrelated items. This is the "curse of dimensionality." Practically, this means similarity values of 0.3–0.5 can be significant in a 300-dimensional embedding space, whereas they might be noise in a 50,000-dimensional TF-IDF space. Always calibrate your threshold to the specific vector space and its dimensionality.
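A quick illustrative experiment, assuming Gaussian random vectors and a fixed seed (my own example, not part of the tool):

```python
import math
import random

random.seed(0)  # fixed seed for reproducibility

def cos_sim(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# Two unrelated random vectors in 10,000 dimensions land near orthogonality:
# similarity concentrates around 0 with spread on the order of 1/sqrt(n).
a = [random.gauss(0, 1) for _ in range(10_000)]
b = [random.gauss(0, 1) for _ in range(10_000)]
print(abs(cos_sim(a, b)) < 0.05)
```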
How does cosine similarity relate to Pearson correlation?

Pearson correlation is cosine similarity applied to mean-centered vectors. If you subtract the mean of each vector's components before computing cosine similarity, you get the Pearson correlation coefficient r. Use Pearson when your data has different baselines (e.g., user rating scales differ), and raw cosine similarity when the absolute component values carry meaning (e.g., term frequencies).
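The equivalence can be sketched directly: mean-center, then reuse plain cosine similarity (helper names and example ratings are mine):

```python
import math

def cos_sim(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def pearson_via_cosine(a, b):
    # Subtract each vector's own mean, then take plain cosine similarity.
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    return cos_sim([x - ma for x in a], [y - mb for y in b])
```

Two users rating on different baselines, [4, 5, 3] and [2, 3, 1], both center to [0, 1, −1], so r = 1.0: their preference patterns agree perfectly even though their raw scores differ.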
Does the calculator support complex vectors?

No. This tool operates on real-valued vectors in ℝⁿ. For complex vectors in ℂⁿ, cosine similarity requires the Hermitian inner product (conjugate transpose), not the simple dot product. The formula becomes cos(θ) = Re(A† ⋅ B) ÷ (‖A‖ × ‖B‖), which this tool does not implement.