How to calibrate your classifier in an intelligent way using Venn-Abers Conformal Prediction

Valeriy Manokhin, PhD, MBA, CQF
9 min read · Jul 2, 2022

Venn-Abers predictors are the new kid on the block — and anyone can run them with just a few lines of Python code.

Machine learning classification tasks are everywhere, ranging from differentiating between cats and dogs to diagnosing severe diseases and identifying pedestrians for self-driving cars.

However, these problem definitions often overshadow the true goal of classification: facilitating informed decision-making. Merely having class labels doesn’t suffice; what’s crucial are well-calibrated class probabilities.

Moreover, classification is just a post hoc decision made on top of a probabilistic prediction. Committing to a hard class label is a premature, definitive decision, and it is frequently applied incorrectly in machine learning projects.

On the other hand, probability modeling aligns more closely with the actual objectives of a project.

While many data scientists gauge a model’s efficacy using metrics like accuracy, precision, and recall, these can be misleading outside of basic scenarios like the cats-vs-dogs example. Regrettably, essential topics like classifier calibration often go unaddressed in foundational machine learning courses, such as Andrew Ng’s beginner machine learning course. Even popular textbooks such as ‘Introduction to Statistical Learning’ appear to completely bypass such an important topic as classifier calibration.

For researchers, data scientists in the corporate realm, or machine learning engineers developing crucial applications, classifier calibration should be a top concern. One might wonder, ‘Why the fuss over classifier calibration? Isn’t it simpler to just classify and move on?’

The answer is more nuanced. At the heart of classification lies the goal of fostering informed decision-making. These decisions inherently involve weighing the probabilities of various choices, along with the benefits and costs tied to each option.

Consider a scenario where a bank deliberates on granting a business loan. If you develop a classifier that merely states that a prospective client won’t default, is that genuinely helpful? Not in the slightest.

To make a sound decision, especially with potentially millions on the line, businesses require models that provide accurate probabilities of a client defaulting or repaying the loan. These probability estimates can be integrated with the financial gains or losses from different outcomes to compute the net present value (NPV) of the loan decision.
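To make this concrete, here is a minimal sketch of such a calculation; the principal, margin and loss-given-default figures below are invented purely for illustration:

```python
# Toy expected-value calculation for a loan decision (all figures are illustrative).
def loan_expected_value(p_default, principal=1_000_000, margin=0.08, loss_given_default=0.60):
    """Expected profit of approving a loan, given a calibrated default probability."""
    gain_if_repaid = principal * margin                # profit if the client repays
    loss_if_default = principal * loss_given_default   # loss if the client defaults
    return (1 - p_default) * gain_if_repaid - p_default * loss_if_default

# A miscalibrated score of 0.05 versus a calibrated probability of 0.15
# flips the decision from approve to reject:
print(loan_expected_value(0.05))   # roughly +46,000 -> approve
print(loan_expected_value(0.15))   # roughly -22,000 -> reject
```

The point is not the toy numbers but that the decision flips entirely on the quality of the probability estimate.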

However, the challenge is that many machine learning models don’t genuinely yield class probabilities. At best, they might offer classification scores that try to approximate real probabilities. At worst, these scores can be gravely miscalibrated, leading to erroneous decisions.

In industries like finance, personalized healthcare, and autonomous driving, such miscalibrations can be perilous. Take, for instance, a Tesla autopilot incident where a car collided with a truck due to a deep learning system inaccurately predicting ‘obstacle/no obstacle’ probabilities.

When a convolutional neural network says the road is clear.

Deep learning systems, especially convolutional neural networks, are infamous for their miscalibration. This issue is further detailed in the insightful paper ‘On Calibration of Modern Neural Networks.’

The authors of this paper contend that while recent advancements in deep learning have certainly enhanced the accuracy of neural networks, they have also inadvertently exacerbated their miscalibration.

Subsequent studies consistently show that when deep learning neural networks make erroneous class predictions, they often do so with unwarranted confidence, much like a gambler who wrongly yet assuredly bets all their money on an ill-chosen roulette color.
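A quick way to see this over-confidence in your own models is to bin the predicted probabilities and compare them with the observed event frequencies. Below is a minimal sketch of a binned calibration-error estimate in the spirit of the expected calibration error (ECE) reported in that literature; the toy arrays are invented for illustration.

```python
import numpy as np

def binned_calibration_error(y_true, y_prob, n_bins=10):
    """Weighted average gap between mean predicted probability and observed
    frequency of the positive class, computed over equal-width probability bins."""
    y_true, y_prob = np.asarray(y_true), np.asarray(y_prob)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    bin_ids = np.digitize(y_prob, edges[1:-1])
    error = 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if mask.any():
            error += mask.mean() * abs(y_prob[mask].mean() - y_true[mask].mean())
    return error

# Over-confident predictions: high probabilities, but only one positive outcome.
y_true = np.array([0, 0, 0, 0, 1])
y_prob = np.array([0.9, 0.85, 0.8, 0.95, 0.9])
print(round(binned_calibration_error(y_true, y_prob), 3))  # 0.68
```

scikit-learn’s calibration_curve computes the same binned statistics if you want to plot a reliability diagram instead of a single number.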

One might wonder about the calibration of other model types. Historically, many traditional algorithms have been recognized for their miscalibration.

In their seminal 2005 paper, ‘Predicting Good Probabilities With Supervised Learning,’ Niculescu-Mizil and Caruana showed that classifiers such as SVMs, decision trees, and boosted trees (the family that today includes XGBoost, LightGBM, and CatBoost) often produce miscalibrated probabilities.

However, this influential paper did not get everything right. Notably, it mistakenly asserted that traditional (shallow) neural networks are well-calibrated. Contrary to this claim, more recent research titled ‘Are Traditional Neural Networks Well-Calibrated?’ — which conducted a thorough evaluation — found that both individual multilayer perceptrons and their ensembles tend to be inadequately calibrated.

The researchers concluded that their findings starkly contrast with common recommendations for using neural networks as probabilistic classifiers, suggesting that nearly all classifier models suffer from miscalibration.

But what about statistical models such as logistic regression? Scikit-learn’s page ‘Probability calibration’ claims that “Logistic regression returns well calibrated predictions by default as it has a canonical link function for its loss, i.e. the logit-link for the Log loss.”

It turns out that this claim is incorrect. A recent paper from Salesforce Research and Berkeley, ‘Don’t Just Blame Over-parametrization for Over-confidence: Theoretical Analysis of Calibration in Binary Classification’, proves that “over-parametrization is not the only reason for over-confidence. We prove that logistic regression is inherently over-confident, in the realizable, under-parametrized setting where the data is generated from the logistic model, and the sample size is much larger than the number of parameters.”

To succinctly summarize the research landscape as of 2024, “ALL classification models are miscalibrated.”

Traditional approaches like Platt’s scaler (a parametric method) and isotonic regression (a non-parametric one) emerged as the standard solutions for calibrating classifiers. However, these methods, being somewhat dated, often fall short: their oversimplified assumptions frequently fail to yield well-calibrated probabilities.

Platt’s method, introduced in 1999, was tailored for the sigmoid-shaped distribution of classification scores produced by support-vector machines. Interestingly, Platt did not delve deeply into the mathematics underpinning the sigmoid-based assumptions in his paper, ‘Probabilistic Outputs for Support Vector Machines and Comparisons to Regularized Likelihood Methods’.

Yet, subsequent research established that assuming a sigmoid calibration map is tantamount to presupposing normally distributed per-class scores with equal variance — a notably limiting assumption. Such a constraint curtails Platt’s scaler’s versatility, impeding its efficacy across diverse datasets.

In practical scenarios, the logistic curve calibration of Platt’s scaler can deteriorate probability estimates, making them inferior to the initial scores. Moreover, since the logistic curve family doesn’t encompass the identity function, it can inadvertently degrade an already well-calibrated classifier.
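For reference, Platt’s scaler in its simplest form is just a one-dimensional logistic regression fitted to a classifier’s scores on a held-out calibration set. The sketch below is illustrative (the dataset, model and split sizes are arbitrary), and it omits the target-smoothing detail of Platt’s original recipe:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic binary problem, split into train / calibration / test.
X, y = make_classification(n_samples=2000, random_state=0)
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.5, random_state=0)
X_cal, X_test, y_cal, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)

# Uncalibrated scores from an SVM, the setting Platt originally targeted.
svm = SVC().fit(X_train, y_train)
s_cal = svm.decision_function(X_cal)
s_test = svm.decision_function(X_test)

# Platt scaling: fit a sigmoid (logistic regression on the raw scores).
platt = LogisticRegression().fit(s_cal.reshape(-1, 1), y_cal)
p_platt = platt.predict_proba(s_test.reshape(-1, 1))[:, 1]
```

In practice you would usually reach for scikit-learn’s CalibratedClassifierCV with method='sigmoid', which wraps the same idea.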

Isotonic regression, another traditional calibration method dating back to 2001, entails fitting an isotonic (monotonically increasing) function that maps uncalibrated classification scores to class probabilities. This fitted mapping can subsequently transform raw, uncalibrated scores on the test set into class probabilities.

Isotonic regression, an illustration

Isotonic regression, while useful, is not without its limitations. Its most significant shortcoming is the assumption that the classifier perfectly ranks objects in the test set, essentially implying an ROC AUC of 1 — a lofty expectation.

These stringent assumptions can lead isotonic regression to produce miscalibrated probabilities and exhibit overfitting, especially on smaller datasets.
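For comparison with the Platt sketch above, the isotonic mapping can be fitted directly with scikit-learn’s IsotonicRegression; the snippet reuses the s_cal, y_cal and s_test score arrays from that sketch, and out_of_bounds='clip' simply keeps test scores inside the fitted range:

```python
from sklearn.isotonic import IsotonicRegression

# s_cal, y_cal and s_test come from the Platt-scaling sketch above.
# Fit a monotonically increasing map from raw scores to probabilities.
iso = IsotonicRegression(y_min=0.0, y_max=1.0, out_of_bounds="clip")
iso.fit(s_cal, y_cal)
p_isotonic = iso.predict(s_test)
```

The same recipe is available as CalibratedClassifierCV with method='isotonic'; on small calibration sets the fitted step function is exactly where the overfitting mentioned above tends to appear.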

How to address these pitfalls of classical calibrators in 2022? Enter Venn-ABERS.

Venn-ABERS is a member of the conformal prediction family. If you’re unfamiliar with conformal prediction, take a moment to explore — it’s genuinely transformative.

The ‘Awesome Conformal Prediction’ repository on GitHub offers a treasure trove of resources suitable for novices and experts alike, spanning both academia and industry.

One of the hallmarks of conformal prediction, and by extension Venn-ABERS, is its provision of mathematical assurances of validity (or unbiasedness), irrespective of data distribution, dataset size, or the foundational classification model.

Venn-ABERS predictors owe their inception to Prof. Vladimir Vovk, the pioneer of Conformal Prediction, and two of his PhD students, Ivan Petej and Valentina Fedorova. Their NeurIPS paper ‘Large-scale probabilistic predictors with and without guarantees of validity’ delves into the mathematical intricacies.

Curious about the name? It fuses ‘Venn predictors’ (another type of conformal predictor) and the initials of the authors — Miriam Ayer, H. Daniel Brunk, George M. Ewing, W. T. Reid, and Edward Silverman — who penned a seminal paper on sampling with incomplete data.

So, what sets Venn-ABERS predictors apart? At their core, instead of employing isotonic regression once, they apply it twice, postulating that each test object could belong to either class 0 or 1. This object is then integrated into the calibration set twice, under both labels, leading to two resultant probabilities: p0 and p1.

It’s vital to grasp that both p0 and p1 represent the likelihood of the object being in class 1. Think of them as a prediction range for the genuine class 1 probability, with a mathematical guarantee that the actual probability resides within this span.

In essence, Venn-ABERS beautifully circumvents pitfalls like score distribution assumptions in Platt’s scaler and offers robust mathematical assurances. The prediction interval (p0,p1) also imparts insights into classification confidence. Typically, for large datasets, p0 and p1 are closely aligned. However, discrepancies can emerge for smaller or more challenging datasets, signaling classification difficulties.

Critically, in precarious scenarios, Venn-ABERS not only delivers precise probabilities but also sends an ‘alert’ by expanding the interval width of (p0, p1). For practical decision-making, these probabilities can be unified as p = p1 / (1 - p0 + p1). Incorporate this into decisions like loan approval or autopilot disengagement, and you’ve achieved safer outcomes.
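To make the mechanics concrete, here is a deliberately naive inductive Venn-Abers sketch built directly on scikit-learn’s IsotonicRegression. It refits the calibrator twice per test object exactly as described above (real implementations, such as the venn-abers package, use the much faster algorithm from the paper), and it again reuses the s_cal, y_cal and s_test arrays from the earlier sketches:

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

def venn_abers_naive(s_cal, y_cal, s_test):
    """Naive inductive Venn-Abers: for each test score, refit isotonic regression
    twice (once assuming label 0, once assuming label 1) and read off the
    calibrated probability at that score. Returns (p0, p1, p)."""
    p0, p1 = [], []
    for s in s_test:
        s_aug = np.append(s_cal, s)
        for label, out in ((0, p0), (1, p1)):
            iso = IsotonicRegression(y_min=0.0, y_max=1.0, out_of_bounds="clip")
            iso.fit(s_aug, np.append(y_cal, label))
            out.append(iso.predict([s])[0])
    p0, p1 = np.array(p0), np.array(p1)
    p = p1 / (1.0 - p0 + p1)   # single probability for decision-making
    return p0, p1, p

p0, p1, p = venn_abers_naive(s_cal, y_cal, s_test)
# Wide (p1 - p0) intervals flag test objects the model finds hard to calibrate.
```

This refit-per-object loop is purely illustrative; for anything beyond a toy dataset, use the venn-abers package referenced below.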

Elon Musk, perhaps this could be the answer to road safety challenges?

In summary, Venn-ABERS stands out as the premier calibration technology for all types of classifiers, be it for tabular data, computer vision, or NLP (and yes, even Transformers aren’t immune to miscalibration). As a component of Conformal Prediction, Venn-ABERS is versatile, free from distribution constraints, and compatible with any underlying statistical, machine, or deep learning model.

So how does Venn-Abers perform in comparison with Platt’s scaler and isotonic regression? In the study ‘Probabilistic Prediction in scikit-learn’, Venn-Abers outperformed Platt’s scaling, and both Venn-Abers and Platt’s scaling outperformed isotonic regression. In addition, unlike Platt’s scaling and isotonic regression, Venn-Abers has the unique ability to output not only well-calibrated probability estimates but also the confidence in those estimates.

The new Python package venn-abers is, according to its developer Ivan Petej (one of the coauthors of the original Venn-Abers paper), faster than the original implementation and is the recommended package.

The new venn-abers package implements both binary and multi-class classification; the multi-class case is based on my paper ‘Multi-class probabilistic classification using inductive and cross Venn–Abers predictors’, which also has an accompanying repo, ‘Multi-class-probabilistic-classification’.

Want to learn more about calibration, conformal prediction, time series and forecasting? Follow me on Medium, Twitter and LinkedIn.

Check out my book ‘Practical Guide to Applied Conformal Prediction in Python: Learn and apply the best uncertainty frameworks to your industry applications.’

References:

  1. ‘Practical Guide to Applied Conformal Prediction in Python: Learn and apply the best uncertainty frameworks to your industry applications’
  2. Awesome Conformal Prediction
  3. Machine learning for probabilistic prediction
  4. venn-abers: new Python package by Ivan Petej (the recommended package)
  5. VennABERS: original implementation by Paolo Toccaceli
  6. ‘Multi-class probabilistic classification using inductive and cross Venn–Abers predictors’ (repo: ‘Multi-class-probabilistic-classification’)
  7. Tutorial on Conformal Predictors and Venn Predictors
  8. https://medium.com/towards-data-science/pythons-predict-proba-doesn-t-actually-predict-probabilities-and-how-to-fix-it-f582c21d63fc
  9. ‘On Calibration of Modern Neural Networks’
  10. ‘Predicting Good Probabilities With Supervised Learning’
  11. ‘Are Traditional Neural Networks Well-Calibrated?’
  12. ‘Reliable Probability Estimates Based on Support Vector Machines for Large Multiclass Datasets’
  13. ‘Large-scale probabilistic predictors with and without guarantees of validity’ (paper)
  14. Kaggle notebook ‘Classifier calibration using Venn-ABERS’ by Carl McBride Ellis
  15. Kaggle notebook ‘Conformal_Prediction_PSS3, E2’ by myself
