
On Model Calibration

For Deep Learning Classification Models

Motivation

When judging the quality of a deep learning model, we are usually most interested in its predictive performance. This can be described using a wide array of metrics, such as accuracy or more elaborate measures like precision, recall, or F1 score.
These metrics are fundamental to assessing model quality, and reporting the most informative ones is crucial for convincing auditors of the quality of the developed model and thus for regulatory success.
However, there is another dimension of model quality that auditors are becoming increasingly aware of in the context of interpretability: model uncertainty.

Model uncertainty is a term that is used differently across domains of machine learning, but in the regulatory sector it typically means: “for a given sample, the probability that the model's prediction is correct”. This information contributes to model interpretability and can be extremely helpful in the medical domain. If a deep learning model is used to diagnose cancer, knowing whether the model is 99% or 51% sure of a specific diagnosis can make all the difference in treatment decisions.

Model Calibration

Conveniently, deep learning classifiers are built in a way that provides class certainty estimates out of the box. The model's output usually assigns each class a score, with the class scores summing to one. During deployment, these scores are often ignored and only the class with the highest score is reported as the prediction. However, this discards the valuable information that class probabilities convey.

Figure 1: Deep learning classifiers assign class probabilities under the hood. Data is fed to a model, and class probabilities are the output.
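
To make the "scores that sum to one" concrete, here is a minimal sketch in Python. It assumes the classifier's last layer is a softmax over raw scores (logits), which is common but not stated in this article; the logit values and the class order [benign, malignant] are made up for illustration.

```python
import numpy as np

def softmax(logits):
    """Turn raw class scores (logits) into scores that sum to one."""
    logits = np.asarray(logits, dtype=float)
    # Subtracting the maximum does not change the result but avoids overflow
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

# Hypothetical raw outputs for one scan, ordered as [benign, malignant]
logits = np.array([2.0, 0.8])
probs = softmax(logits)

predicted_class = int(np.argmax(probs))      # what deployment usually keeps
confidence = float(probs[predicted_class])   # what is usually thrown away

print(predicted_class, round(confidence, 2))  # 0 (benign) with a score of about 0.77
```

Keeping only `predicted_class` is the "raw diagnosis"; `confidence` is the class score whose calibration the rest of this article is about.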

So why aren’t they used in practice? The reason is quite simple: the class probabilities that lie under the hood are usually unreliable, because they often do not match how frequently the model is actually right. Technically speaking, the models are badly calibrated.

Given 1) that notified bodies are increasingly likely to expect uncertainty estimates from your deep learning model, and 2) that these estimates actually seem useful to have, what should you do to tackle this topic? We’ve got great news for you: there is a technique that can improve model calibration in a non-invasive post-processing step that adds a calibration layer to the existing model.

Expected Calibration Error

What does quality of calibration mean? Let’s look at a practical example: consider a binary classifier such as the one outlined in Figure 1.
In the ideal case, the class probabilities actually convey the probability of, for example, a diagnosis being positive or negative. In our case, the model would predict that the patient shown in the scan has a benign diagnosis with a probability of 77%. From a patient’s or doctor’s perspective, this information is far more interesting than the raw diagnosis of “benign”.

But how can we measure the quality of these class probabilities? The usual metric is the expected calibration error. Simply speaking, it measures how far off the model's assigned class probabilities are.

Figure 2: Calibration error as the difference between predicted and actual class probability.

Check out Figure 2 as an example. The upper bar shows that the model assigned a certain class a probability of 77%. However, the actual class probability is 91%. For a well-calibrated model, we would expect these values to be very similar.
Here, there is a difference of 14 percentage points. We call this the calibration error.

If we repeat this process on a larger set of data and calculate how far off the model is on average, we obtain a number called the expected calibration error. Intuitively speaking, if a model has an expected calibration error of, say, 15%, we have to assume that the predicted class probability is off by 15 percentage points on average.
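
The averaging described above can be sketched in a few lines of Python. This is one common way to estimate the expected calibration error, not necessarily the exact procedure behind the figures: predictions are grouped into equal-width confidence bins, and the "actual class probability" of a bin is taken to be the fraction of correct predictions in it. The function name, the choice of 10 bins, and the toy data are assumptions for illustration.

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned estimate: weighted average gap between confidence and accuracy."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Bin index per sample; a confidence of exactly 1.0 lands in the last bin
    bin_ids = np.digitize(confidences, edges[1:-1])
    ece = 0.0
    for b in range(n_bins):
        in_bin = bin_ids == b
        if not in_bin.any():
            continue  # empty bins contribute nothing
        # Gap between observed accuracy and mean predicted confidence
        gap = abs(correct[in_bin].mean() - confidences[in_bin].mean())
        ece += in_bin.mean() * gap  # weight by the share of samples in the bin
    return ece

# Toy data mirroring the example: the model says 77%, but is right 91% of the time
confidences = np.full(100, 0.77)
correct = np.array([1] * 91 + [0] * 9)
ece = expected_calibration_error(confidences, correct)
print(round(ece, 2))  # 0.14
```

With real data, `confidences` would hold the score of the predicted class for each validation sample and `correct` would mark whether that prediction matched the label.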

Improving Calibration in Post-Processing

There are many ways to improve neural network calibration, most of them based on either regularization or post-processing. One such post-processing technique is called Platt scaling. It is both straightforward and very effective. Basically, it fits a simple logistic function on held-out data that maps the model's outputs onto more accurate class probabilities.

Figure 3: Better class probabilities through post-processing, which brings the predicted class probabilities closer to the actual ones.

Check out Figure 3 for an illustration: the predicted class probabilities now match the actual probabilities much more closely. The calibration error has decreased to 0.02, or 2 percentage points. As can be seen in Figure 4, this can be done with a simple post-processing layer added to the end of the model. Congratulations! You now have a well-calibrated model with meaningful class probability estimates!

Figure 4: Accurate class probabilities can be achieved non-invasively by adding post-processing at the end of the model.
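
For readers who want to see what such a post-processing layer might look like, here is a minimal Platt scaling sketch for the binary case. It assumes access to the model's raw score (logit) for the positive class on a held-out validation set, and it fits the two parameters `a` and `b` of `sigmoid(a * score + b)` by plain gradient descent on the log loss. The function names, the optimizer, and the synthetic "overconfident model" are illustrative assumptions, not the setup used for the figures.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_platt_scaling(scores, labels, n_steps=2000):
    """Fit a, b so that sigmoid(a * score + b) matches the observed labels."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=float)
    a, b = 1.0, 0.0  # start from "leave the scores unchanged"
    # Conservative step size derived from an upper bound on the loss curvature
    step = 4.0 / (np.mean(scores ** 2) + 1.0)
    for _ in range(n_steps):
        residual = sigmoid(a * scores + b) - labels  # gradient of the log loss
        a -= step * np.mean(residual * scores)
        b -= step * np.mean(residual)
    return a, b

# Synthetic validation set: labels follow sigmoid(true_logits), but the
# hypothetical model reports scores that are three times too extreme.
rng = np.random.default_rng(0)
true_logits = rng.normal(size=20000)
labels = rng.random(20000) < sigmoid(true_logits)
overconfident_scores = 3.0 * true_logits

a, b = fit_platt_scaling(overconfident_scores, labels)

# The "post-processing layer": applied to the model's scores at inference time
calibrated = sigmoid(a * overconfident_scores + b)
```

In this synthetic setup the fit should end up close to `a = 1/3` and `b = 0`, which undoes the artificial overconfidence. The original model is never retrained; only the two extra parameters are learned, which is what makes the approach non-invasive.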