Model Calibration in Python: Reliability Diagrams, ECE, and Decision Thresholds

Accuracy tells you how often predictions are right. Calibration tells you whether the predicted probabilities themselves can be trusted. This page is a focused implementation companion to the broader Trustworthy Machine Learning article.

Contents

What calibration means
Python ECE example
Reliability diagram data
Production monitoring
FAQ

What calibration means

A calibrated classifier produces probability estimates that match observed frequencies. If a model assigns 70 percent risk to a group of cases, about 70 percent of those cases should turn out positive when enough of them are observed.
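Stated more formally, this is the standard textbook condition for a binary classifier, not something specific to this page. The symbols are introduced here for illustration: Y is the 0/1 label and \hat{p}(X) is the model's predicted probability for input X.

```latex
P\bigl(Y = 1 \mid \hat{p}(X) = p\bigr) = p \quad \text{for all } p \in [0, 1]
```

In practice the condition can only be checked approximately, by grouping predictions with similar probabilities and comparing each group with its observed positive rate.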

Calibration matters when probabilities drive decisions: review queues, credit thresholds, medical triage support, churn interventions, fraud escalation, or any workflow where confidence affects action.

Python ECE example

This small example computes expected calibration error for binary classification without external dependencies.

def expected_calibration_error(y_true, y_prob, n_bins=10):
    """Compute binary expected calibration error (ECE) with equal-width bins."""
    if n_bins < 1:
        raise ValueError("n_bins must be at least 1")
    if len(y_true) != len(y_prob):
        raise ValueError("y_true and y_prob must have the same length")
    if len(y_true) == 0:
        raise ValueError("inputs must not be empty")
    if any(y not in {0, 1} for y in y_true):
        raise ValueError("y_true must contain 0/1 labels")
    # Written as "not 0 <= p <= 1" so that NaN values are rejected too.
    if any(not 0 <= p <= 1 for p in y_prob):
        raise ValueError("probabilities must be between 0 and 1")

    total = len(y_true)
    ece = 0.0
    for bin_index in range(n_bins):
        # Half-open bin [lo, hi); the last bin also includes p == 1.0.
        lo = bin_index / n_bins
        hi = (bin_index + 1) / n_bins
        in_bin = [
            i for i, p in enumerate(y_prob)
            if lo <= p < hi or (bin_index == n_bins - 1 and p == 1.0)
        ]
        if not in_bin:
            continue  # empty bins contribute nothing
        # Mean predicted probability vs. observed positive rate in this bin.
        confidence = sum(y_prob[i] for i in in_bin) / len(in_bin)
        observed_rate = sum(y_true[i] for i in in_bin) / len(in_bin)
        # Weight the gap by the share of all predictions that fall in the bin.
        ece += len(in_bin) / total * abs(observed_rate - confidence)
    return ece


print(expected_calibration_error([0, 1, 1, 0], [0.2, 0.8, 0.7, 0.6]))
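Written as a formula, the function above computes the following quantity. The notation is an assumption made here for illustration: B is the number of bins, N the total number of predictions, n_b the number of predictions in bin b, c_b the mean predicted probability in bin b, and o_b the observed positive rate in bin b. Empty bins are skipped.

```latex
\mathrm{ECE} = \sum_{b=1}^{B} \frac{n_b}{N} \, \bigl| o_b - c_b \bigr|
```

In the four-point example above, with the default ten bins, each prediction falls into its own bin, so every weight is 1/4 and the gaps are 0.2, 0.2, 0.3, and 0.6. The sum is 0.25 × (0.2 + 0.2 + 0.3 + 0.6) = 0.325, up to floating-point rounding.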

Reliability diagram data

A reliability diagram plots average predicted probability against observed positive rate by bin. The perfect calibration line is diagonal: predicted probability equals observed frequency.
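The per-bin numbers behind such a diagram can be produced with the same binning logic as the ECE function. The sketch below is one possible way to do it. The function name `reliability_bins`, the row layout, and the toy data are all hypothetical and chosen only for illustration.

```python
def reliability_bins(y_true, y_prob, n_bins=5):
    """Return one row per non-empty bin:
    (lo, hi, count, avg_confidence, observed_rate)."""
    if len(y_true) != len(y_prob):
        raise ValueError("y_true and y_prob must have the same length")
    rows = []
    for b in range(n_bins):
        lo, hi = b / n_bins, (b + 1) / n_bins
        # Half-open bins [lo, hi); the last bin also includes p == 1.0.
        idx = [
            i for i, p in enumerate(y_prob)
            if lo <= p < hi or (b == n_bins - 1 and p == 1.0)
        ]
        if not idx:
            continue  # an empty bin has no point on the diagram
        avg_confidence = sum(y_prob[i] for i in idx) / len(idx)
        observed_rate = sum(y_true[i] for i in idx) / len(idx)
        rows.append((lo, hi, len(idx), avg_confidence, observed_rate))
    return rows


# Hypothetical toy data, for illustration only.
y_true = [0, 0, 1, 0, 1, 1, 1, 0]
y_prob = [0.05, 0.15, 0.25, 0.35, 0.55, 0.65, 0.85, 0.95]

rows = reliability_bins(y_true, y_prob)
for lo, hi, count, conf, rate in rows:
    print(f"{lo:.1f} to {hi:.1f}  n={count}  "
          f"confidence={conf:.2f}  observed={rate:.2f}")
```

Plotting `avg_confidence` on the x-axis against `observed_rate` on the y-axis gives the diagram. Keeping the count in each row is useful because bins with very few cases produce noisy points.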

Bin	Average confidence	Observed rate	Interpretation
0.0 to 0.2	0.12	0.10	Close to calibrated.
0.2 to 0.4	0.31	0.22	Over-confident.
0.4 to 0.6	0.51	0.49	Close to calibrated.
0.6 to 0.8	0.72	0.60	Over-confident in a decision-heavy region.
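The Interpretation column follows from the signed gap between average confidence and observed rate. The sketch below reproduces it from the table rows. The 0.05 tolerance for "close to calibrated" is an assumption made here, not a value given in the article, and `interpret` is a hypothetical helper name.

```python
# Rows copied from the table above: (bin, average confidence, observed rate).
rows = [
    ("0.0 to 0.2", 0.12, 0.10),
    ("0.2 to 0.4", 0.31, 0.22),
    ("0.4 to 0.6", 0.51, 0.49),
    ("0.6 to 0.8", 0.72, 0.60),
]

TOLERANCE = 0.05  # assumed cutoff for "close to calibrated"; not from the article


def interpret(avg_confidence, observed_rate, tolerance=TOLERANCE):
    """Label a bin by the signed gap between confidence and observed rate."""
    gap = avg_confidence - observed_rate
    if abs(gap) <= tolerance:
        return "close to calibrated"
    # Positive gap: the model predicts more positives than actually occur.
    return "over-confident" if gap > 0 else "under-confident"


labels = [interpret(conf, rate) for _, conf, rate in rows]
for (name, conf, rate), label in zip(rows, labels):
    print(f"{name}: gap={conf - rate:+.2f} -> {label}")
```

A suitable tolerance depends on how many cases each bin holds and on how costly a miscalibrated decision is, so treat the cutoff as something to tune rather than a fixed rule.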

Production monitoring

Calibration can drift when the population, label definition, product surface, or model changes. Monitor calibration by segment, time period, model version, and decision threshold. A model can keep similar AUC while becoming less trustworthy as a probability estimator.
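The last point can be illustrated with synthetic data. AUC depends only on how scores are ranked, so any strictly increasing transform of the scores leaves it unchanged while shifting the probabilities. In the sketch below, cubing the scores stands in for drift. That transform, the toy data, and the helper names `auc` and `mean_gap` are assumptions made for illustration. The calibration check is deliberately the simplest one available: the gap between mean predicted probability and observed positive rate, which is what ECE reduces to with a single bin.

```python
def auc(y_true, y_score):
    """Share of (positive, negative) pairs where the positive scores higher.
    Ties count as half a win."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def mean_gap(y_true, y_prob):
    """Gap between mean predicted probability and observed positive rate."""
    return abs(sum(y_prob) / len(y_prob) - sum(y_true) / len(y_true))


# Synthetic data, for illustration only.
y_true = [0, 0, 1, 0, 1, 1, 0, 1]
original = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
# Cubing is strictly increasing on [0, 1], so the ranking is preserved.
drifted = [p ** 3 for p in original]

print("AUC:", auc(y_true, original), auc(y_true, drifted))
print("gap:", round(mean_gap(y_true, original), 4),
      round(mean_gap(y_true, drifted), 4))
```

Both score sets rank the cases identically, yet the transformed scores predict far fewer positives than actually occur. A dashboard that tracks only AUC would not show this change.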

Useful monitoring views include reliability diagrams by month, ECE by segment, threshold review volume, false-positive cost, and human correction rate.
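As one example of such a view, the sketch below tabulates review volume and false positives for a few candidate thresholds. It assumes that a case is sent to review when its predicted probability is at or above the threshold. The function name `threshold_view`, the thresholds, and the toy data are hypothetical.

```python
def threshold_view(y_true, y_prob, thresholds):
    """For each threshold, count flagged cases and how many were negatives."""
    view = []
    for t in thresholds:
        # Assumption: a case is flagged for review when p >= threshold.
        flagged = [y for y, p in zip(y_true, y_prob) if p >= t]
        false_positives = sum(1 for y in flagged if y == 0)
        view.append({
            "threshold": t,
            "review_volume": len(flagged),
            "false_positives": false_positives,
        })
    return view


# Hypothetical toy data, for illustration only.
y_true = [0, 0, 1, 0, 1, 1, 1, 0]
y_prob = [0.05, 0.15, 0.25, 0.35, 0.55, 0.65, 0.85, 0.95]

view = threshold_view(y_true, y_prob, thresholds=[0.3, 0.5, 0.8])
for row in view:
    print(row)
```

Running the same tabulation per month or per model version shows whether a fixed threshold still sends a manageable number of cases to review. Multiplying the false-positive count by a per-case cost would turn it into the false-positive cost view.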

For adjacent topics, read Trustworthy Machine Learning and Uncertainty Quantification for LLMs.

FAQ

What is model calibration?

Model calibration means predicted probabilities match observed frequencies. If a model predicts 0.8 probability across many cases, about 80 percent of those cases should be positive.

What is a reliability diagram?

A reliability diagram groups predictions into confidence bins and compares average predicted probability with actual outcome rate in each bin.

What is expected calibration error?

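Expected calibration error (ECE) is the average gap between predicted confidence and observed outcome rate across bins, with each bin weighted by its share of the predictions. For the binary case on this page, the observed outcome rate is the fraction of positives in the bin.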

Related reading

Trustworthy Machine Learning

Data Products

Uncertainty Quantification for LLMs