Confusion Matrix Calculator
Calculate accuracy, precision, recall, F1-score, MCC, Cohen's Kappa and 20+ metrics from any N×N confusion matrix with heatmap visualization.
About
A confusion matrix is the primary diagnostic instrument for evaluating classification models. Misreading it costs real resources: a false negative in medical screening means a missed diagnosis; a false positive in fraud detection means frozen accounts and lost customers. This calculator extracts over 20 performance metrics from any N×N matrix, including MCC (Matthews Correlation Coefficient), which remains informative even on imbalanced datasets where raw accuracy is misleading. All formulas assume integer counts. Metrics computed from a finite sample are estimates of the underlying population values, so interpret them with the sample size in mind.
For binary problems, the tool reports threshold-dependent metrics like F1, LR+, and Diagnostic Odds Ratio. For multi-class problems, it computes per-class breakdowns and micro, macro, and weighted averages. Pro tip: if your classes are severely imbalanced (prevalence < 5%), focus on MCC and F1 rather than accuracy. Note that MCC is undefined when any row or column of the matrix sums to zero.
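A quick numeric illustration of that pro tip, using made-up counts for a screening task with 2% prevalence: a near-constant "negative" predictor scores high accuracy while MCC exposes it.

```python
import math

# Hypothetical counts: 20 actual positives out of 1000 samples,
# and a classifier that almost always predicts "negative".
TP, FP, FN, TN = 2, 1, 18, 979
N = TP + FP + FN + TN

accuracy = (TP + TN) / N  # dominated by the majority class
denom = math.sqrt((TP + FP) * (TP + FN) * (TN + FP) * (TN + FN))
mcc = (TP * TN - FP * FN) / denom if denom else 0.0

print(f"accuracy = {accuracy:.3f}, MCC = {mcc:.3f}")
# accuracy ≈ 0.981 looks excellent; MCC ≈ 0.253 reveals a weak model
```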
Formulas
For a binary classification problem, the confusion matrix is a 2×2 table of counts. The primary metrics derive directly from four values:

Accuracy = (TP + TN) ÷ N
Precision = TP ÷ (TP + FP)
Recall = TP ÷ (TP + FN)
F1 = 2 ⋅ Precision ⋅ Recall ÷ (Precision + Recall)

where TP = True Positives (correctly predicted positive), FP = False Positives (negative samples incorrectly predicted positive), FN = False Negatives (positive samples incorrectly predicted negative), TN = True Negatives (correctly predicted negative), and N = TP + TN + FP + FN is the total sample count.
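The four counts above map directly to code. A minimal sketch (the helper name and zero-denominator fallbacks are our own choices, not the calculator's):

```python
import math

def binary_metrics(TP: int, FP: int, FN: int, TN: int) -> dict:
    """Core metrics from a 2x2 confusion matrix of integer counts."""
    N = TP + FP + FN + TN
    precision = TP / (TP + FP) if TP + FP else 0.0
    recall = TP / (TP + FN) if TP + FN else 0.0
    specificity = TN / (TN + FP) if TN + FP else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    # MCC is undefined when any marginal sum is zero; report 0.0 then.
    mcc_denom = math.sqrt((TP + FP) * (TP + FN) * (TN + FP) * (TN + FN))
    return {
        "accuracy": (TP + TN) / N,
        "precision": precision,
        "recall": recall,
        "specificity": specificity,
        "f1": f1,
        "mcc": (TP * TN - FP * FN) / mcc_denom if mcc_denom else 0.0,
    }

print(binary_metrics(TP=40, FP=10, FN=5, TN=45))
```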
For Cohen's Kappa, po is observed agreement (accuracy), and pe is expected agreement by chance:

κ = (po − pe) ÷ (1 − pe)
pe = [(TP + FP)(TP + FN) + (FN + TN)(FP + TN)] ÷ N²
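In code, the binary case works out to a few lines (a sketch with a hypothetical helper name; pe multiplies the predicted and actual marginals for each class):

```python
def cohens_kappa(TP: int, FP: int, FN: int, TN: int) -> float:
    """Cohen's Kappa from a binary confusion matrix."""
    N = TP + FP + FN + TN
    po = (TP + TN) / N  # observed agreement = accuracy
    # chance agreement: predicted-positive*actual-positive plus
    # predicted-negative*actual-negative, normalized by N^2
    pe = ((TP + FP) * (TP + FN) + (FN + TN) * (FP + TN)) / N**2
    return (po - pe) / (1 - pe)

print(cohens_kappa(TP=40, FP=10, FN=5, TN=45))  # 0.7
```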
For multi-class (N > 2 classes), per-class TPi is the diagonal element, FPi is column sum minus diagonal, FNi is row sum minus diagonal, and TNi is everything else. Micro-averaging sums all per-class TP, FP, FN before dividing. Macro-averaging computes per-class metrics then takes the arithmetic mean. Weighted averaging weights each class metric by its support (row sum).
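The three averaging schemes can be sketched as follows (hypothetical function; rows are taken as actual classes and columns as predicted, matching the row/column conventions above):

```python
def multiclass_f1(matrix: list[list[int]]) -> dict:
    """Micro-, macro-, and weighted-averaged F1 for an N x N matrix."""
    n = len(matrix)
    total = sum(sum(row) for row in matrix)
    tp = [matrix[i][i] for i in range(n)]
    fp = [sum(matrix[r][i] for r in range(n)) - matrix[i][i]  # column - diagonal
          for i in range(n)]
    fn = [sum(matrix[i]) - matrix[i][i]                       # row - diagonal
          for i in range(n)]
    support = [sum(matrix[i]) for i in range(n)]              # row sums

    def f1(tp_, fp_, fn_):
        denom = 2 * tp_ + fp_ + fn_
        return 2 * tp_ / denom if denom else 0.0

    per_class = [f1(tp[i], fp[i], fn[i]) for i in range(n)]
    return {
        "micro": f1(sum(tp), sum(fp), sum(fn)),  # pool counts, then divide
        "macro": sum(per_class) / n,             # unweighted mean of classes
        "weighted": sum(s * f for s, f in zip(support, per_class)) / total,
    }

m = [[50, 3, 2],
     [4, 40, 6],
     [1, 5, 30]]
print(multiclass_f1(m))
```

Note that for single-label multi-class problems, micro-averaged F1 equals overall accuracy, since every false positive for one class is a false negative for another.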
Reference Data
| Metric | Formula (Binary) | Range | Best Value | Use When |
|---|---|---|---|---|
| Accuracy | (TP + TN) ÷ N | [0, 1] | 1 | Balanced classes |
| Precision (PPV) | TP ÷ (TP + FP) | [0, 1] | 1 | Cost of false positives high |
| Recall (Sensitivity, TPR) | TP ÷ (TP + FN) | [0, 1] | 1 | Cost of false negatives high |
| Specificity (TNR) | TN ÷ (TN + FP) | [0, 1] | 1 | Screening tests |
| F1 Score | 2 ⋅ P ⋅ R ÷ (P + R) | [0, 1] | 1 | Imbalanced datasets |
| MCC | (TP⋅TN − FP⋅FN) ÷ √((TP+FP)(TP+FN)(TN+FP)(TN+FN)) | [−1, 1] | 1 | Most reliable single metric |
| Cohen's Kappa | (po − pe) ÷ (1 − pe) | [−1, 1] | 1 | Agreement beyond chance |
| Balanced Accuracy | (TPR + TNR) ÷ 2 | [0, 1] | 1 | Imbalanced binary |
| Youden's J | TPR + TNR − 1 | [−1, 1] | 1 | Optimal threshold selection |
| Prevalence | (TP + FN) ÷ N | [0, 1] | - | Context for other metrics |
| FPR (Fall-out) | FP ÷ (FP + TN) | [0, 1] | 0 | ROC analysis |
| FNR (Miss Rate) | FN ÷ (FN + TP) | [0, 1] | 0 | Safety-critical systems |
| NPV | TN ÷ (TN + FN) | [0, 1] | 1 | Negative predictions matter |
| FDR | FP ÷ (FP + TP) | [0, 1] | 0 | Multiple hypothesis testing |
| FOR | FN ÷ (FN + TN) | [0, 1] | 0 | Complement of NPV |
| LR+ (Positive Likelihood) | TPR ÷ FPR | [0, ∞] | ∞ | Clinical diagnostics |
| LR− (Negative Likelihood) | FNR ÷ TNR | [0, ∞] | 0 | Clinical diagnostics |
| DOR (Diagnostic Odds Ratio) | LR+ ÷ LR− | [0, ∞] | ∞ | Overall test effectiveness |
| Informedness (BM) | TPR + TNR − 1 | [−1, 1] | 1 | Same as Youden's J |
| Markedness (MK) | PPV + NPV − 1 | [−1, 1] | 1 | Prediction reliability |