About

A confusion matrix is the primary diagnostic instrument for evaluating classification models. Misreading it costs real resources: a false negative in medical screening means a missed diagnosis; a false positive in fraud detection means frozen accounts and lost customers. This calculator extracts over 20 performance metrics from any N×N matrix, including MCC (Matthews Correlation Coefficient), which remains informative even on imbalanced datasets where raw accuracy is misleading. All formulas assume integer counts; results are sample estimates of true population performance and are subject to sampling error.

For binary problems, the tool reports threshold-dependent metrics like F1, LR+, and Diagnostic Odds Ratio. For multi-class problems, it computes per-class breakdowns and micro, macro, and weighted averages. Pro tip: if your classes are severely imbalanced (prevalence < 5%), focus on MCC and F1 rather than accuracy. Note that MCC is undefined when any row or column of the matrix sums to zero.


Formulas

For a binary classification problem, the confusion matrix is a 2×2 table of counts. The primary metrics derive directly from four values:

                      Predicted Positive    Predicted Negative
Actual Positive               TP                    FN
Actual Negative               FP                    TN

where TP = True Positives (correctly predicted positive), FP = False Positives (negative samples incorrectly predicted positive), FN = False Negatives (positive samples incorrectly predicted negative), TN = True Negatives (correctly predicted negative), and N = TP + TN + FP + FN is the total sample count.

Accuracy = (TP + TN) / N
MCC = (TP·TN − FP·FN) / √((TP + FP)(TP + FN)(TN + FP)(TN + FN))
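
The two formulas above can be sketched in Python as follows (the helper name binary_metrics is illustrative, not part of the calculator):

```python
import math

def binary_metrics(tp, fp, fn, tn):
    """Accuracy and MCC from the four confusion-matrix counts."""
    n = tp + fp + fn + tn
    accuracy = (tp + tn) / n
    # MCC denominator: product of all four marginal sums
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else float("nan")
    return accuracy, mcc

acc, mcc = binary_metrics(tp=40, fp=10, fn=5, tn=45)
# acc = (40 + 45) / 100 = 0.85
```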

For Cohen's Kappa, po is observed agreement (accuracy), and pe is expected agreement by chance:

pe = [(TP + FP)(TP + FN) + (FN + TN)(FP + TN)] / N²
κ = (po − pe) / (1 − pe)
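
As a sketch of the same computation (the function name is illustrative):

```python
def cohens_kappa(tp, fp, fn, tn):
    """Cohen's kappa from binary counts: chance-corrected agreement."""
    n = tp + fp + fn + tn
    p_o = (tp + tn) / n  # observed agreement (accuracy)
    # expected agreement by chance, from the marginal totals
    p_e = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / n**2
    return (p_o - p_e) / (1 - p_e)

# With tp=40, fp=10, fn=5, tn=45: p_o = 0.85, p_e = 0.5, kappa = 0.7
```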

For multi-class problems (more than two classes), per-class TPi is the diagonal element, FPi is the column sum minus the diagonal, FNi is the row sum minus the diagonal, and TNi is everything else. Micro-averaging sums all per-class TP, FP, FN before dividing. Macro-averaging computes per-class metrics and then takes the arithmetic mean. Weighted averaging weights each class metric by its support (row sum).
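
The one-vs-rest decomposition and the three averaging schemes can be sketched as follows (function names are illustrative; precision is shown, but the same pattern applies to recall and F1):

```python
def per_class_counts(matrix):
    """One-vs-rest (TP, FP, FN, TN) per class.

    matrix[i][j] = count of samples with actual class i predicted as j.
    """
    total = sum(sum(row) for row in matrix)
    counts = []
    for i in range(len(matrix)):
        tp = matrix[i][i]
        fp = sum(matrix[r][i] for r in range(len(matrix))) - tp  # column i minus diagonal
        fn = sum(matrix[i]) - tp                                 # row i minus diagonal
        tn = total - tp - fp - fn                                # everything else
        counts.append((tp, fp, fn, tn))
    return counts

def precision_averages(matrix):
    """Micro-, macro-, and support-weighted average precision."""
    counts = per_class_counts(matrix)
    per_class = [tp / (tp + fp) if tp + fp else 0.0 for tp, fp, _, _ in counts]
    supports = [sum(row) for row in matrix]  # actual class frequencies
    micro = sum(tp for tp, *_ in counts) / sum(tp + fp for tp, fp, *_ in counts)
    macro = sum(per_class) / len(per_class)
    weighted = sum(p * s for p, s in zip(per_class, supports)) / sum(supports)
    return micro, macro, weighted
```

Note that for single-label classification, micro-averaged precision equals overall accuracy, since every sample contributes exactly one prediction.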

Reference Data

Metric | Formula (Binary) | Range | Best Value | Use When
------ | ---------------- | ----- | ---------- | --------
Accuracy | (TP + TN) ÷ N | [0, 1] | 1 | Balanced classes
Precision (PPV) | TP ÷ (TP + FP) | [0, 1] | 1 | Cost of false positives high
Recall (Sensitivity, TPR) | TP ÷ (TP + FN) | [0, 1] | 1 | Cost of false negatives high
Specificity (TNR) | TN ÷ (TN + FP) | [0, 1] | 1 | Screening tests
F1 Score | 2PR ÷ (P + R) | [0, 1] | 1 | Imbalanced datasets
MCC | (TP·TN − FP·FN) ÷ √((TP+FP)(TP+FN)(TN+FP)(TN+FN)) | [−1, 1] | 1 | Most reliable single metric
Cohen's Kappa | (po − pe) ÷ (1 − pe) | [−1, 1] | 1 | Agreement beyond chance
Balanced Accuracy | (TPR + TNR) ÷ 2 | [0, 1] | 1 | Imbalanced binary
Youden's J | TPR + TNR − 1 | [−1, 1] | 1 | Optimal threshold selection
Prevalence | (TP + FN) ÷ N | [0, 1] | — | Context for other metrics
FPR (Fall-out) | FP ÷ (FP + TN) | [0, 1] | 0 | ROC analysis
FNR (Miss Rate) | FN ÷ (FN + TP) | [0, 1] | 0 | Safety-critical systems
NPV | TN ÷ (TN + FN) | [0, 1] | 1 | Negative predictions matter
FDR | FP ÷ (FP + TP) | [0, 1] | 0 | Multiple hypothesis testing
FOR | FN ÷ (FN + TN) | [0, 1] | 0 | Complement of NPV
LR+ (Positive Likelihood) | TPR ÷ FPR | [0, ∞) | ∞ | Clinical diagnostics
LR− (Negative Likelihood) | FNR ÷ TNR | [0, ∞) | 0 | Clinical diagnostics
DOR (Diagnostic Odds Ratio) | LR+ ÷ LR− | [0, ∞) | ∞ | Overall test effectiveness
Informedness (BM) | TPR + TNR − 1 | [−1, 1] | 1 | Same as Youden's J
Markedness (MK) | PPV + NPV − 1 | [−1, 1] | 1 | Prediction reliability

Frequently Asked Questions

When should I use MCC instead of accuracy?
MCC (Matthews Correlation Coefficient) is superior whenever the class distribution is imbalanced. With 95% negatives and 5% positives, a model predicting all-negative achieves 95% accuracy, yet its MCC is 0 (strictly, the MCC denominator is zero here; the usual convention reports 0), correctly indicating zero predictive power. MCC ranges from −1 to +1, where +1 is perfect, 0 is random, and −1 is complete inversion. It uses all four quadrants of the confusion matrix and is equivalent to the Pearson correlation between observed and predicted binary labels.
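
The 95/5 example can be checked directly (the function name is illustrative; the zero-denominator convention shown matches common implementations such as scikit-learn's):

```python
import math

def accuracy_and_mcc(tp, fp, fn, tn):
    n = tp + fp + fn + tn
    acc = (tp + tn) / n
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Convention: report MCC = 0 when the denominator vanishes.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return acc, mcc

# All-negative model on a 95/5 split: TP=0, FP=0, FN=5, TN=95
acc, mcc = accuracy_and_mcc(0, 0, 5, 95)
# acc = 0.95, mcc = 0.0: high accuracy, zero predictive power
```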

How should I interpret Cohen's Kappa values?
Landis & Koch (1977) benchmarks: κ < 0.00 = poor, 0.00–0.20 = slight, 0.21–0.40 = fair, 0.41–0.60 = moderate, 0.61–0.80 = substantial, 0.81–1.00 = almost perfect agreement. Kappa adjusts accuracy for chance agreement, making it critical when comparing models on datasets with different class distributions. A κ of 0 means performance no better than random assignment proportional to class frequencies.

What is the difference between micro, macro, and weighted averaging?
Micro-averaging aggregates all per-class TP, FP, FN into global counts, then computes a single metric; for single-label classification it is equivalent to overall accuracy for Precision/Recall/F1. Macro-averaging computes each class metric independently and takes the unweighted mean, treating all classes equally regardless of size. Weighted averaging computes each class metric and averages by support (actual class frequency), accounting for class imbalance while still reflecting per-class performance. Use micro when you care about overall correctness, macro when minority classes matter equally, and weighted as a compromise.

Why does the calculator show "N/A" for some metrics?
Division by zero. Precision is undefined when TP + FP = 0 (the model never predicted that class). Recall is undefined when TP + FN = 0 (the class has no actual instances). MCC is undefined when any marginal sum (row or column total) is zero. LR+ is undefined when FPR = 0, and LR− is undefined when TNR = 0. These are not bugs: they reflect genuinely indeterminate conditions. The calculator displays "N/A" and excludes undefined values from macro averages.
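
A defensive treatment of these cases might look like this sketch (names are illustrative; None stands in for the calculator's "N/A"):

```python
def safe_div(num, den):
    """Return None (rendered as "N/A") when the denominator is zero."""
    return num / den if den else None

def macro_average(values):
    """Average per-class values, excluding undefined (None) entries."""
    defined = [v for v in values if v is not None]
    return sum(defined) / len(defined) if defined else None

# A class that was never predicted: precision is undefined, not zero
precision = safe_div(0, 0 + 0)  # TP=0, FP=0 -> None
```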

How are per-class counts derived from a multi-class matrix?
For class i in an N×N matrix: TP = cell(i,i), FP = sum of column i minus cell(i,i), FN = sum of row i minus cell(i,i), TN = total sum minus TP minus FP minus FN. This "one-vs-rest" decomposition lets you compute all binary metrics per class. The calculator performs this automatically and displays per-class breakdowns.

Is F1 the harmonic mean of precision and recall?
Yes. F1 is the harmonic mean of Precision (penalizes FP) and Recall (penalizes FN), weighting them equally. If you need asymmetric weighting, use Fβ: at β = 2, recall is weighted 4× more than precision (useful in medical screening); at β = 0.5, precision is weighted 4× more (useful in spam filtering). This calculator computes the standard F1 (β = 1).
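
A generic Fβ sketch (the function name is illustrative; β = 1 reduces to the standard F1):

```python
def f_beta(precision, recall, beta=1.0):
    """F-beta score: weights recall beta^2 times as much as precision."""
    if precision + recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# f_beta(0.5, 1.0) is the harmonic mean of 0.5 and 1.0, i.e. 2/3;
# f_beta(0.5, 1.0, beta=2) leans toward recall and is higher, 2.5/3
```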