Accuracy tells you whether predictions are often right. Calibration tells you whether the probabilities can be trusted. This page is a focused implementation companion to the broader Trustworthy Machine Learning article.
A calibrated classifier makes probability estimates that match reality. If a model says a group of cases has 70 percent risk, about 70 percent of those cases should actually become positive over repeated observations.
Calibration matters when probabilities drive decisions: review queues, credit thresholds, medical triage support, churn interventions, fraud escalation, or any workflow where confidence affects action.
This small example computes expected calibration error (ECE) for binary classification, using equal-width probability bins and no external dependencies.
```python
def expected_calibration_error(y_true, y_prob, n_bins=10):
    """Compute binary expected calibration error with equal-width bins."""
    if len(y_true) != len(y_prob):
        raise ValueError("y_true and y_prob must have the same length")
    if not y_true:
        raise ValueError("inputs must not be empty")
    if n_bins < 1:
        raise ValueError("n_bins must be a positive integer")
    if any(y not in {0, 1} for y in y_true):
        raise ValueError("y_true must contain 0/1 labels")
    if any(p < 0 or p > 1 for p in y_prob):
        raise ValueError("probabilities must be between 0 and 1")

    total = len(y_true)
    ece = 0.0
    for bin_index in range(n_bins):
        lo = bin_index / n_bins
        hi = (bin_index + 1) / n_bins
        # Bins are half-open [lo, hi); the last bin also includes p == 1.0.
        in_bin = [
            i for i, p in enumerate(y_prob)
            if lo <= p < hi or (bin_index == n_bins - 1 and p == 1.0)
        ]
        if not in_bin:
            continue
        # Mean predicted probability of the positive class in this bin.
        confidence = sum(y_prob[i] for i in in_bin) / len(in_bin)
        # Observed positive rate in this bin.
        accuracy = sum(y_true[i] for i in in_bin) / len(in_bin)
        # Weight each bin's gap by its share of all samples.
        ece += len(in_bin) / total * abs(accuracy - confidence)
    return ece


print(expected_calibration_error([0, 1, 1, 0], [0.2, 0.8, 0.7, 0.6]))  # approximately 0.325
```
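A reliability diagram plots average predicted probability against observed positive rate by bin. The perfect calibration line is diagonal: predicted probability equals observed frequency. Bins where the average predicted probability exceeds the observed rate are over-confident; bins where it falls short are under-confident.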
| Bin | Average confidence | Observed rate | Interpretation |
|---|---|---|---|
| 0.0 to 0.2 | 0.12 | 0.10 | Close to calibrated. |
| 0.2 to 0.4 | 0.31 | 0.22 | Over-confident. |
| 0.4 to 0.6 | 0.51 | 0.49 | Close to calibrated. |
| 0.6 to 0.8 | 0.72 | 0.60 | Over-confident in a decision-heavy region. |
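The per-bin numbers behind a reliability diagram can be computed directly from labels and predicted probabilities. The following is a minimal sketch, assuming equal-width bins and plain Python lists; the function name `reliability_bins`, the row keys, and the sample data are illustrative and not taken from any library.

```python
def reliability_bins(y_true, y_prob, n_bins=5):
    """Return per-bin rows for a reliability diagram (illustrative helper).

    Assumes y_true holds 0/1 labels and y_prob holds probabilities in [0, 1].
    Empty bins are skipped.
    """
    if len(y_true) != len(y_prob):
        raise ValueError("y_true and y_prob must have the same length")
    rows = []
    for b in range(n_bins):
        lo, hi = b / n_bins, (b + 1) / n_bins
        # Half-open bins [lo, hi); the last bin also includes p == 1.0.
        idx = [
            i for i, p in enumerate(y_prob)
            if lo <= p < hi or (b == n_bins - 1 and p == 1.0)
        ]
        if not idx:
            continue
        rows.append({
            "bin": (lo, hi),
            "count": len(idx),
            "avg_confidence": sum(y_prob[i] for i in idx) / len(idx),
            "observed_rate": sum(y_true[i] for i in idx) / len(idx),
        })
    return rows


# Made-up labels and probabilities, for illustration only.
labels = [0, 0, 1, 0, 1, 1, 0, 1]
probs = [0.05, 0.15, 0.30, 0.35, 0.55, 0.65, 0.70, 0.95]
for row in reliability_bins(labels, probs):
    # A positive gap means the bin is over-confident.
    gap = row["avg_confidence"] - row["observed_rate"]
    print(row["bin"], row["count"], round(row["avg_confidence"], 2),
          round(row["observed_rate"], 2), round(gap, 2))
```

Returning the count with each row makes it easier to discount bins that hold only a handful of samples, where the observed rate is noisy.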
Calibration can drift when the population, label definition, product surface, or model changes. Monitor calibration by segment, time period, model version, and decision threshold. A model can keep similar AUC while becoming less trustworthy as a probability estimator.
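The AUC point can be checked on a toy example: a strictly increasing transformation of the scores leaves the ranking, and therefore AUC, unchanged while pulling the probabilities away from the observed rates. This is a minimal sketch under stated assumptions; the helper names `auc` and `ece`, the made-up data, and the squaring transform are all illustrative, and the transform only simulates miscalibration rather than reproducing real drift.

```python
def auc(y_true, y_score):
    """Pairwise AUC: share of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def ece(y_true, y_prob, n_bins=5):
    """Compact expected calibration error with equal-width bins."""
    total, err = len(y_true), 0.0
    for b in range(n_bins):
        lo, hi = b / n_bins, (b + 1) / n_bins
        idx = [i for i, p in enumerate(y_prob)
               if lo <= p < hi or (b == n_bins - 1 and p == 1.0)]
        if idx:
            conf = sum(y_prob[i] for i in idx) / len(idx)
            rate = sum(y_true[i] for i in idx) / len(idx)
            err += len(idx) / total * abs(rate - conf)
    return err


# Toy data: in each group the positive rate equals the predicted probability.
labels = [0, 0, 0, 0, 1] + [1, 1, 1, 1, 0]
original = [0.2] * 5 + [0.8] * 5
# Squaring is strictly increasing on [0, 1], so the ranking is preserved.
distorted = [p ** 2 for p in original]

print(auc(labels, original), auc(labels, distorted))  # same AUC for both
print(ece(labels, original), ece(labels, distorted))  # ECE is larger after the transform
```

Because AUC depends only on the ordering of scores, it cannot detect this kind of change, which is why calibration needs its own monitoring.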
Useful monitoring views include reliability diagrams by month, ECE by segment, threshold review volume, false-positive cost, and human correction rate.
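One of these views, ECE by segment, can be sketched as follows. This is a minimal illustration, assuming each scored record carries a segment label; the record layout, the segment names, and the helper names `ece` and `ece_by_segment` are hypothetical.

```python
from collections import defaultdict


def ece(y_true, y_prob, n_bins=10):
    """Compact expected calibration error with equal-width bins."""
    total, err = len(y_true), 0.0
    for b in range(n_bins):
        lo, hi = b / n_bins, (b + 1) / n_bins
        idx = [i for i, p in enumerate(y_prob)
               if lo <= p < hi or (b == n_bins - 1 and p == 1.0)]
        if idx:
            conf = sum(y_prob[i] for i in idx) / len(idx)
            rate = sum(y_true[i] for i in idx) / len(idx)
            err += len(idx) / total * abs(rate - conf)
    return err


def ece_by_segment(records, n_bins=10):
    """Group (segment, label, probability) records and compute ECE per segment."""
    grouped = defaultdict(lambda: ([], []))
    for segment, label, prob in records:
        labels, probs = grouped[segment]
        labels.append(label)
        probs.append(prob)
    return {
        segment: {"n": len(labels), "ece": ece(labels, probs, n_bins)}
        for segment, (labels, probs) in grouped.items()
    }


# Hypothetical scored records: (segment, observed label, predicted probability).
records = [
    ("web", 0, 0.1), ("web", 1, 0.9), ("web", 1, 0.9), ("web", 0, 0.1),
    ("mobile", 0, 0.9), ("mobile", 1, 0.9), ("mobile", 0, 0.1), ("mobile", 0, 0.1),
]
for segment, stats in sorted(ece_by_segment(records).items()):
    print(segment, stats["n"], round(stats["ece"], 3))
```

The sample count is returned alongside each ECE value because per-segment estimates on small samples are noisy; the same grouping works for time period or model version by changing the first field of each record.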
For adjacent topics, read Trustworthy Machine Learning and Uncertainty Quantification for LLMs.
Model calibration means predicted probabilities match observed frequencies. If a model predicts 0.8 probability across many cases, about 80 percent of those cases should be positive.
A reliability diagram groups predictions into confidence bins and compares average predicted probability with actual outcome rate in each bin.
Expected calibration error summarizes the weighted gap between predicted confidence and observed accuracy across bins.
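Written out for predictions split into $B$ bins, this weighted gap is commonly expressed as below. The symbols are notation introduced here for illustration: $N$ is the total number of samples, $S_b$ the set of samples in bin $b$, $y_i$ the 0/1 label, and $\hat{p}_i$ the predicted probability of the positive class.

$$
\mathrm{ECE} = \sum_{b=1}^{B} \frac{|S_b|}{N} \left| \mathrm{acc}(S_b) - \mathrm{conf}(S_b) \right|,
\qquad
\mathrm{acc}(S_b) = \frac{1}{|S_b|} \sum_{i \in S_b} y_i,
\qquad
\mathrm{conf}(S_b) = \frac{1}{|S_b|} \sum_{i \in S_b} \hat{p}_i
$$

Empty bins contribute nothing to the sum. For example, four samples that each fall in their own bin with gaps of 0.2, 0.2, 0.3, and 0.6 give $\mathrm{ECE} = \tfrac{1}{4}(0.2 + 0.2 + 0.3 + 0.6) = 0.325$.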