About

Cosine similarity measures the cosine of the angle between two non-zero vectors in an inner product space. It yields a value in [−1, 1], where 1 indicates identical orientation, 0 indicates orthogonality, and −1 indicates diametrically opposed directions. The metric is magnitude-invariant: it compares direction only. This property makes it the standard similarity measure in NLP (TF-IDF, word embeddings), recommendation systems, and information retrieval. Misapplying Euclidean distance where cosine similarity is appropriate leads to spurious clustering in high-dimensional sparse spaces. This calculator computes cos(θ) from raw vector components, returning the dot product, both Euclidean norms, and the recovered angle.

The tool handles vectors of arbitrary dimension n ≥ 2. Inputs are validated component-wise. A zero vector has undefined direction and produces an undefined similarity. The formula assumes a real-valued vector space ℝⁿ. For 2-dimensional inputs, a canvas visualization renders both vectors and the subtended angle. Pro tip: in text analytics, normalize your term-frequency vectors before computing similarity to avoid bias from document length.
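In Python, the computation described above might look like this (a sketch of the approach, not the calculator's actual source):

```python
import math

def cosine_similarity(a, b):
    """Return (dot, norm_a, norm_b, cos, angle_deg), or None if undefined."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("vectors must share a dimension n >= 2")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return None  # zero vector: direction, and thus similarity, is undefined
    cos = dot / (norm_a * norm_b)
    cos = max(-1.0, min(1.0, cos))  # clamp against floating-point drift
    return dot, norm_a, norm_b, cos, math.degrees(math.acos(cos))
```

For example, `cosine_similarity([1, 0], [0, 1])` returns a cosine of 0.0 and an angle of 90.0°, and any zero vector yields `None` rather than a division-by-zero error.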


Formulas

The cosine similarity between two vectors A and B in ℝⁿ is defined as:

cos(θ) = (A ⋅ B) / (‖A‖ × ‖B‖) = ( Σ AᵢBᵢ ) / ( √(Σ Aᵢ²) × √(Σ Bᵢ²) ),  summing over i = 1 … n

Where A ⋅ B is the dot product, ‖A‖ is the Euclidean norm (L2 norm) of vector A, and n is the number of dimensions. The angle is recovered via θ = arccos(cos(θ)). The input is clamped to [−1, 1] before applying arccos to guard against floating-point rounding errors that could produce NaN.
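A minimal sketch of that clamping guard, assuming Python's math.acos (which raises an error for inputs outside [−1, 1]):

```python
import math

def safe_arccos_deg(c):
    # Clamp to [-1, 1] so math.acos never fails on values like
    # 1.0000000000000002 produced by floating-point rounding.
    return math.degrees(math.acos(max(-1.0, min(1.0, c))))
```

Without the clamp, `math.acos(1.0000000000000002)` raises a domain error; with it, the call returns 0.0° as expected.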

The dot product expands as: A ⋅ B = A₁B₁ + A₂B₂ + … + AₙBₙ. The Euclidean norm: ‖A‖ = √(A₁² + A₂² + … + Aₙ²).
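A worked numeric check of these expansions (the example vectors A = (1, 2, 3) and B = (4, 5, 6) are illustrative, not from the tool):

```python
import math

# A . B = 1*4 + 2*5 + 3*6 = 32
dot = 1*4 + 2*5 + 3*6

norm_a = math.sqrt(1**2 + 2**2 + 3**2)  # = sqrt(14)
norm_b = math.sqrt(4**2 + 5**2 + 6**2)  # = sqrt(77)

cos = dot / (norm_a * norm_b)  # 32 / sqrt(1078), approx. 0.9746
```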

Reference Data

Similarity Value | Angle (θ) | Interpretation             | Typical Use Case
1.000            | 0°        | Identical direction        | Duplicate document detection
0.950            | 18.2°     | Very high similarity       | Near-duplicate / paraphrase detection
0.866            | 30°       | High similarity            | Related topics in topic modeling
0.707            | 45°       | Moderate similarity        | Partially related content
0.500            | 60°       | Weak similarity            | Tangentially related features
0.000            | 90°       | Orthogonal (no similarity) | Independent / uncorrelated features
−0.500           | 120°      | Weak opposition            | Sentiment polarity (partial)
−0.707           | 135°      | Moderate opposition        | Contrasting user preference vectors
−1.000           | 180°      | Diametrically opposed      | Perfect negation in signed embeddings
Common Thresholds in Industry

Threshold | Max Angle | Interpretation                  | Typical Use Case
≥ 0.80    | ≤ 36.9°   | "Similar" in most NLP pipelines | Search result deduplication
≥ 0.90    | ≤ 25.8°   | "Highly similar" threshold      | Plagiarism detection (strict)
≥ 0.95    | ≤ 18.2°   | "Near-identical"                | Code clone detection
≥ 0.60    | ≤ 53.1°   | "Related" in recommendation     | Collaborative filtering cutoff
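Applied in code, a threshold check is a one-liner; the helper below is hypothetical and uses the 0.80 "similar" cutoff common in NLP pipelines:

```python
def label_pair(cos_sim, threshold=0.80):
    # Thresholds are domain-specific: 0.80 is a common dedup default,
    # but recommendation pipelines often relax this to 0.60.
    return "similar" if cos_sim >= threshold else "dissimilar"
```

Usage: `label_pair(0.95)` yields `"similar"`, while `label_pair(0.5)` yields `"dissimilar"`.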
Comparison with Other Distance Metrics

Metric              | Key Property                | Typical Use Case
Euclidean Distance  | Magnitude-sensitive         | K-means clustering, KNN
Manhattan Distance  | L1 norm, robust to outliers | Sparse feature spaces
Jaccard Similarity  | Set-based, binary features  | Shingling, MinHash
Pearson Correlation | Centered cosine similarity  | Collaborative filtering (mean-adjusted)

Frequently Asked Questions

What happens if one of the vectors is the zero vector?

A zero vector has no direction and therefore undefined cosine similarity. The denominator becomes 0 because ‖A‖ = 0, producing a division by zero. This calculator detects zero-magnitude vectors and returns an "undefined" result with an explanatory message rather than NaN or Infinity.
Does vector magnitude affect the result?

No. The formula normalizes both vectors by dividing by their norms. The vectors [1, 2] and [100, 200] yield a similarity of 1.000 despite vastly different magnitudes. This is desirable in text analytics (document length shouldn't affect topic similarity) but problematic in domains where magnitude carries information, such as force vectors in physics. In those cases, use Euclidean distance or a magnitude-weighted metric instead.
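This scale invariance is easy to verify directly (the cos_sim helper is my own sketch, not the tool's code):

```python
import math

def cos_sim(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# [100, 200] is just [1, 2] scaled by 100: same direction, same similarity
print(round(cos_sim([1, 2], [100, 200]), 3))  # 1.0
```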
Can cosine similarity be negative?

Yes. Cosine similarity ranges from −1 to 1. Negative values occur when the angle θ between the vectors exceeds 90°. In TF-IDF spaces, term frequencies are non-negative, so similarity is always in [0, 1]. However, in word embedding spaces (Word2Vec, GloVe), where components can be negative, similarity can be negative, indicating semantic opposition.
How does dimensionality affect interpretation?

In very high-dimensional spaces (thousands of dimensions), random vectors tend to become nearly orthogonal, pushing cosine similarity toward 0 for unrelated items. This is the "curse of dimensionality." Practically, this means similarity values of 0.3–0.5 can be significant in a 300-dimensional embedding space, whereas they might be noise in a 50,000-dimensional TF-IDF space. Always calibrate your threshold to the specific vector space and its dimensionality.
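A quick illustrative experiment, assuming Gaussian random vectors and a fixed seed (my own example, not part of the tool):

```python
import math
import random

random.seed(0)  # fixed seed for reproducibility

def cos_sim(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# Two unrelated random vectors in 10,000 dimensions land near orthogonality:
# similarity concentrates around 0 with spread on the order of 1/sqrt(n).
a = [random.gauss(0, 1) for _ in range(10_000)]
b = [random.gauss(0, 1) for _ in range(10_000)]
print(abs(cos_sim(a, b)) < 0.05)
```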
How does cosine similarity relate to Pearson correlation?

Pearson correlation is cosine similarity applied to mean-centered vectors. If you subtract the mean of each vector's components before computing cosine similarity, you get the Pearson correlation coefficient r. Use Pearson when your data has different baselines (e.g., user rating scales differ), and raw cosine similarity when the absolute component values carry meaning (e.g., term frequencies).
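The equivalence can be sketched directly: mean-center, then reuse plain cosine similarity (helper names and example ratings are mine):

```python
import math

def cos_sim(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def pearson_via_cosine(a, b):
    # Subtract each vector's own mean, then take plain cosine similarity.
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    return cos_sim([x - ma for x in a], [y - mb for y in b])
```

Two users rating on different baselines, [4, 5, 3] and [2, 3, 1], both center to [0, 1, −1], so r = 1.0: their preference patterns agree perfectly even though their raw scores differ.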
Does the calculator support complex vectors?

No. This tool operates on real-valued vectors in ℝⁿ. For complex vectors in ℂⁿ, cosine similarity requires the Hermitian inner product (conjugate transpose), not the simple dot product. The formula becomes cos(θ) = Re(A† ⋅ B) ÷ (‖A‖ × ‖B‖), which this tool does not implement.