About

A confusion matrix is the primary diagnostic instrument for evaluating classification models. Misreading it costs real resources: a false negative in medical screening means a missed diagnosis; a false positive in fraud detection means frozen accounts and lost customers. This calculator extracts over 20 performance metrics from any N×N matrix, including MCC (Matthews Correlation Coefficient), which remains informative even on imbalanced datasets where raw accuracy is misleading. All formulas assume integer counts; results are sample estimates of true population performance and are subject to sampling error.

For binary problems, the tool reports threshold-dependent metrics like F1, LR+, and Diagnostic Odds Ratio. For multi-class problems, it computes per-class breakdowns and micro, macro, and weighted averages. Pro tip: if your classes are severely imbalanced (prevalence < 5%), focus on MCC and F1 rather than accuracy. Note that MCC is undefined when any row or column of the matrix sums to zero.


Formulas

For a binary classification problem, the confusion matrix is a 2×2 table of counts. The primary metrics derive directly from four values:

                      Predicted Positive    Predicted Negative
Actual Positive               TP                    FN
Actual Negative               FP                    TN

where TP = True Positives (correctly predicted positive), FP = False Positives (negative samples incorrectly predicted positive), FN = False Negatives (positive samples incorrectly predicted negative), TN = True Negatives (correctly predicted negative), and N = TP + TN + FP + FN is the total sample count.

Accuracy = (TP + TN) / N
MCC = (TP·TN − FP·FN) / √((TP + FP)(TP + FN)(TN + FP)(TN + FN))
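
The two formulas above can be sketched in Python as follows (the helper name binary_metrics is illustrative, not part of the calculator):

```python
import math

def binary_metrics(tp, fp, fn, tn):
    """Accuracy and MCC from the four confusion-matrix counts."""
    n = tp + fp + fn + tn
    accuracy = (tp + tn) / n
    # MCC denominator: product of all four marginal sums
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else float("nan")
    return accuracy, mcc

acc, mcc = binary_metrics(tp=40, fp=10, fn=5, tn=45)
# acc = (40 + 45) / 100 = 0.85
```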

For Cohen's Kappa, po is observed agreement (accuracy), and pe is expected agreement by chance:

pe = [(TP + FP)(TP + FN) + (FN + TN)(FP + TN)] / N²
κ = (po − pe) / (1 − pe)
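
As a sketch of the same computation (the function name is illustrative):

```python
def cohens_kappa(tp, fp, fn, tn):
    """Cohen's kappa from binary counts: chance-corrected agreement."""
    n = tp + fp + fn + tn
    p_o = (tp + tn) / n  # observed agreement (accuracy)
    # expected agreement by chance, from the marginal totals
    p_e = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / n**2
    return (p_o - p_e) / (1 - p_e)

# With tp=40, fp=10, fn=5, tn=45: p_o = 0.85, p_e = 0.5, kappa = 0.7
```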

For multi-class problems (more than two classes), per-class TPi is the diagonal element, FPi is the column sum minus the diagonal, FNi is the row sum minus the diagonal, and TNi is everything else. Micro-averaging sums all per-class TP, FP, FN before dividing. Macro-averaging computes per-class metrics and then takes the arithmetic mean. Weighted averaging weights each class metric by its support (row sum).
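
The one-vs-rest decomposition and the three averaging schemes can be sketched as follows (function names are illustrative; precision is shown, but the same pattern applies to recall and F1):

```python
def per_class_counts(matrix):
    """One-vs-rest (TP, FP, FN, TN) per class.

    matrix[i][j] = count of samples with actual class i predicted as j.
    """
    total = sum(sum(row) for row in matrix)
    counts = []
    for i in range(len(matrix)):
        tp = matrix[i][i]
        fp = sum(matrix[r][i] for r in range(len(matrix))) - tp  # column i minus diagonal
        fn = sum(matrix[i]) - tp                                 # row i minus diagonal
        tn = total - tp - fp - fn                                # everything else
        counts.append((tp, fp, fn, tn))
    return counts

def precision_averages(matrix):
    """Micro-, macro-, and support-weighted average precision."""
    counts = per_class_counts(matrix)
    per_class = [tp / (tp + fp) if tp + fp else 0.0 for tp, fp, _, _ in counts]
    supports = [sum(row) for row in matrix]  # actual class frequencies
    micro = sum(tp for tp, *_ in counts) / sum(tp + fp for tp, fp, *_ in counts)
    macro = sum(per_class) / len(per_class)
    weighted = sum(p * s for p, s in zip(per_class, supports)) / sum(supports)
    return micro, macro, weighted
```

Note that for single-label classification, micro-averaged precision equals overall accuracy, since every sample contributes exactly one prediction.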

Reference Data

Metric | Formula (Binary) | Range | Best Value | Use When
------ | ---------------- | ----- | ---------- | --------
Accuracy | (TP + TN) ÷ N | [0, 1] | 1 | Balanced classes
Precision (PPV) | TP ÷ (TP + FP) | [0, 1] | 1 | Cost of false positives high
Recall (Sensitivity, TPR) | TP ÷ (TP + FN) | [0, 1] | 1 | Cost of false negatives high
Specificity (TNR) | TN ÷ (TN + FP) | [0, 1] | 1 | Screening tests
F1 Score | 2PR ÷ (P + R) | [0, 1] | 1 | Imbalanced datasets
MCC | (TP·TN − FP·FN) ÷ √((TP+FP)(TP+FN)(TN+FP)(TN+FN)) | [−1, 1] | 1 | Most reliable single metric
Cohen's Kappa | (po − pe) ÷ (1 − pe) | [−1, 1] | 1 | Agreement beyond chance
Balanced Accuracy | (TPR + TNR) ÷ 2 | [0, 1] | 1 | Imbalanced binary
Youden's J | TPR + TNR − 1 | [−1, 1] | 1 | Optimal threshold selection
Prevalence | (TP + FN) ÷ N | [0, 1] | — | Context for other metrics
FPR (Fall-out) | FP ÷ (FP + TN) | [0, 1] | 0 | ROC analysis
FNR (Miss Rate) | FN ÷ (FN + TP) | [0, 1] | 0 | Safety-critical systems
NPV | TN ÷ (TN + FN) | [0, 1] | 1 | Negative predictions matter
FDR | FP ÷ (FP + TP) | [0, 1] | 0 | Multiple hypothesis testing
FOR | FN ÷ (FN + TN) | [0, 1] | 0 | Complement of NPV
LR+ (Positive Likelihood) | TPR ÷ FPR | [0, ∞) | ∞ | Clinical diagnostics
LR− (Negative Likelihood) | FNR ÷ TNR | [0, ∞) | 0 | Clinical diagnostics
DOR (Diagnostic Odds Ratio) | LR+ ÷ LR− | [0, ∞) | ∞ | Overall test effectiveness
Informedness (BM) | TPR + TNR − 1 | [−1, 1] | 1 | Same as Youden's J
Markedness (MK) | PPV + NPV − 1 | [−1, 1] | 1 | Prediction reliability

Frequently Asked Questions

When should I use MCC instead of accuracy?
MCC (Matthews Correlation Coefficient) is superior whenever the class distribution is imbalanced. With 95% negatives and 5% positives, a model predicting all-negative achieves 95% accuracy, yet its MCC is 0 (strictly, the MCC denominator is zero here; the usual convention reports 0), correctly indicating zero predictive power. MCC ranges from −1 to +1, where +1 is perfect, 0 is random, and −1 is complete inversion. It uses all four quadrants of the confusion matrix and is equivalent to the Pearson correlation between observed and predicted binary labels.
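
The 95/5 example can be checked directly (the function name is illustrative; the zero-denominator convention shown matches common implementations such as scikit-learn's):

```python
import math

def accuracy_and_mcc(tp, fp, fn, tn):
    n = tp + fp + fn + tn
    acc = (tp + tn) / n
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Convention: report MCC = 0 when the denominator vanishes.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return acc, mcc

# All-negative model on a 95/5 split: TP=0, FP=0, FN=5, TN=95
acc, mcc = accuracy_and_mcc(0, 0, 5, 95)
# acc = 0.95, mcc = 0.0: high accuracy, zero predictive power
```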

How should I interpret Cohen's Kappa values?
Landis & Koch (1977) benchmarks: κ < 0.00 = poor, 0.00–0.20 = slight, 0.21–0.40 = fair, 0.41–0.60 = moderate, 0.61–0.80 = substantial, 0.81–1.00 = almost perfect agreement. Kappa adjusts accuracy for chance agreement, making it critical when comparing models on datasets with different class distributions. A κ of 0 means performance no better than random assignment proportional to class frequencies.

What is the difference between micro, macro, and weighted averaging?
Micro-averaging aggregates all per-class TP, FP, FN into global counts, then computes a single metric; for single-label classification it is equivalent to overall accuracy for Precision/Recall/F1. Macro-averaging computes each class metric independently and takes the unweighted mean, treating all classes equally regardless of size. Weighted averaging computes each class metric and averages by support (actual class frequency), accounting for class imbalance while still reflecting per-class performance. Use micro when you care about overall correctness, macro when minority classes matter equally, and weighted as a compromise.

Why does the calculator show "N/A" for some metrics?
Division by zero. Precision is undefined when TP + FP = 0 (the model never predicted that class). Recall is undefined when TP + FN = 0 (the class has no actual instances). MCC is undefined when any marginal sum (row or column total) is zero. LR+ is undefined when FPR = 0, and LR− is undefined when TNR = 0. These are not bugs: they reflect genuinely indeterminate conditions. The calculator displays "N/A" and excludes undefined values from macro averages.
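
A defensive treatment of these cases might look like this sketch (names are illustrative; None stands in for the calculator's "N/A"):

```python
def safe_div(num, den):
    """Return None (rendered as "N/A") when the denominator is zero."""
    return num / den if den else None

def macro_average(values):
    """Average per-class values, excluding undefined (None) entries."""
    defined = [v for v in values if v is not None]
    return sum(defined) / len(defined) if defined else None

# A class that was never predicted: precision is undefined, not zero
precision = safe_div(0, 0 + 0)  # TP=0, FP=0 -> None
```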

How are per-class counts derived from a multi-class matrix?
For class i in an N×N matrix: TP = cell(i,i), FP = sum of column i minus cell(i,i), FN = sum of row i minus cell(i,i), TN = total sum minus TP minus FP minus FN. This "one-vs-rest" decomposition lets you compute all binary metrics per class. The calculator performs this automatically and displays per-class breakdowns.

Is F1 the harmonic mean of precision and recall?
Yes. F1 is the harmonic mean of Precision (penalizes FP) and Recall (penalizes FN), weighting them equally. If you need asymmetric weighting, use Fβ: at β = 2, recall is weighted 4× more than precision (useful in medical screening); at β = 0.5, precision is weighted 4× more (useful in spam filtering). This calculator computes the standard F1 (β = 1).
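
A generic Fβ sketch (the function name is illustrative; β = 1 reduces to the standard F1):

```python
def f_beta(precision, recall, beta=1.0):
    """F-beta score: weights recall beta^2 times as much as precision."""
    if precision + recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# f_beta(0.5, 1.0) is the harmonic mean of 0.5 and 1.0, i.e. 2/3;
# f_beta(0.5, 1.0, beta=2) leans toward recall and is higher, 2.5/3
```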