Mastering Classical (Transductive) Conformal Prediction in Action
Leverage the full dataset efficiently for more precise probabilistic predictions
Conformal Prediction is a sizzling 🔥 area of research and application, garnering significant attention in both academia and industry. The number of research papers on Conformal Prediction published in 2023 alone is expected to exceed 1,000.
At the prestigious NeurIPS 2022 machine learning conference, Stanford Professor Emmanuel Candès delivered a captivating keynote address, "Conformal Prediction in 2022," to thousands of machine learning researchers and practitioners. He concluded by saying, "Conformal inference methods are taking the academic and industrial worlds by storm. In essence, these methods provide exact prediction intervals for future observations without relying on any distributional assumptions, except for having independent and identically distributed (iid) data, or more generally, exchangeable data."
Conformal Prediction: Revolutionizing Machine Learning Applications
Conformal Prediction has been driving critical machine learning applications at top tech companies for nearly a decade, for example anomaly detection in Microsoft Azure. Many industry leaders are exploring ways to create, develop, and deploy solutions powered by this groundbreaking technique.
Over the last 2-3 years, awareness of Conformal Prediction within the data science and tech community has skyrocketed, thanks in part to the release of popular tutorials and open-source libraries like MAPIE and Amazon Fortuna. These tools have made Conformal Prediction models accessible to anyone with just a few lines of code.
While applications have primarily focused on regression problems thus far, MAPIE's 2023 roadmap includes implementing Conformal Prediction for binary classification. Binary classification is a cornerstone of machine learning, with applications spanning industries such as finance, healthcare, and self-driving cars. In these domains, producing robust, well-calibrated, safe, and fair predictions is essential, and that's precisely where Conformal Prediction shines.
Awesome Conformal Prediction, the most comprehensive, professionally curated resource for all things Conformal Prediction, has been adding hundreds of GitHub stars every few weeks, reflecting the growing popularity of Conformal Prediction in both academia and industry. If you have not checked it out yet, please star ⭐ the repo and spread the word.
As the Conformal Prediction framework becomes more popular, a few misconceptions about it have appeared on social media, including one about its "inefficient" use of data.
This criticism refers to the inductive version of Conformal Prediction. Whilst somewhat easier for beginners to understand, it is not the original version of Conformal Prediction, and it does require splitting the data to create a calibration set used to compute nonconformity scores. The data reserved for calibration cannot be used to train the machine learning model, resulting in potentially lower accuracy of both point and probabilistic predictions.
In Inductive Conformal Prediction, the nonconformity score of each test point is simply compared to the nonconformity scores from the calibration set, and voilà: conformal predictions can be made. Inductive Conformal Prediction (ICP) does not require retraining the underlying machine learning model (it is trained only once on the training set). ICP is not only easier to understand, it is also faster than full (transductive) Conformal Prediction; a minimal sketch of the comparison step is shown below.
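As a quick sketch of that comparison step (the variable names and toy numbers below are purely illustrative, not taken from any particular library):

import numpy as np

# Illustrative ICP step: compare a test point's nonconformity score to the calibration scores.
# cal_scores and test_score are hypothetical; in practice they come from your fitted model.
def icp_p_value(cal_scores, test_score):
    # Fraction of calibration scores at least as large as the test score; +1 counts the test point itself
    return (np.sum(cal_scores >= test_score) + 1) / (len(cal_scores) + 1)

cal_scores = np.array([0.05, 0.10, 0.22, 0.31, 0.47])   # toy calibration nonconformity scores
print(icp_p_value(cal_scores, test_score=0.25))          # 0.5, so this label would be kept at the 5% level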
So, given the advantages of ICP, it should be the default tool of choice, right? Not so fast: the criticism of ICP's inefficient use of data is entirely valid. ICP is also not the original version of Conformal Prediction, and to truly understand and appreciate this awesome framework one needs to learn the original formulation: Transductive (full) Conformal Prediction.
In Inductive Conformal Prediction, one has to reserve part of the data to create a calibration set. This data is lost for training the machine learning model, resulting in lower point accuracy. Whilst Inductive Conformal Prediction retains the key property of validity (lack of bias), its prediction intervals tend to be wider and less efficient than those one might obtain using full (transductive) Conformal Prediction.
Whilst this might not matter for large datasets, where there is enough data to build both a large training set and a large calibration set, it matters for medium-sized datasets and certainly for small ones, where one does not want wider prediction intervals if an alternative with reasonable computational cost exists.
To evaluate probabilistic predictors, one should look at validity and efficiency (see "How to Evaluate Probabilistic Forecasts" for more details).
Whilst all models from the Conformal Prediction framework (both inductive and full transductive) automatically deliver valid (unbiased) predictions, the efficiency of those predictions varies with many factors, including how large and easy the dataset is, how good the underlying machine learning model is, and even which nonconformity measure was selected to build the Conformal Prediction model.
Selecting Inductive Conformal Prediction by default might therefore not be the best choice, as one is trading speed for sub-optimal (wider) prediction intervals.
Transductive (full) Conformal Prediction (TCP) is the original framework developed by the inventors of Conformal Prediction. It uses all the data efficiently because it does not require a calibration set: the machine learning model can leverage the full training set, which gives more accurate point predictions, and the efficient use of data also yields more efficient (narrower) prediction intervals.
Unfortunately, until now it has been difficult to find good explanations, with code, of how Transductive (full) Conformal Prediction works under the hood. The purpose of this article is to provide a gentle, easy-to-understand introduction to TCP.
So how does it work exactly? Whilst TCP is somewhat more involved, the basic steps for building a robust full Conformal Prediction model are explained below. We will look at how to create a full Transductive Conformal Prediction binary classifier either by writing code from scratch (for learning purposes) or by using the Nonconformist library with just a few (four, in fact!) extra lines of code.
Step 1: Select a nonconformity measure. This is where things get foggy rather quickly, as selecting a good measure is not an easy task, especially for beginners.
Even finding out which nonconformity measures exist is not easy (unlike regression, where a simple measure is the absolute residual |y - ŷ|). Asking GPT-4 might not help either: it often hallucinates and does not produce correct answers (I checked over several attempts). This is not a cooking recipe but a rather technical area with limited coverage on the internet, a guaranteed way for GPT-4 to fail.
However, I have done all the legwork for you and reviewed the relevant Conformal Prediction research papers, which identify a couple of excellent and easy-to-use nonconformity measures for binary classification.
There are several model-agnostic nonconformity measures that are easy to start with and easy to understand, and they tend to perform well across various datasets and underlying models. One of them is called hinge loss (also known as inverse probability) and is computed simply as 1 - P(y|x), where P(y|x) is the classification score produced by the model for the actual class. Remember, classification scores are not class probabilities, and predict_proba is a very unfortunate and misleading name in Scikit-learn.
How does one compute the inverse probability (hinge) nonconformity score? Suppose your classifier outputs two scores, class_0 = 0.6 and class_1 = 0.4, and the actual label is y = 1. Take the score of the true class (0.4 in this case) and subtract it from 1 to obtain a hinge nonconformity score of 0.6.
Intuitively, this nonconformity score measures the gap between the probability score for the correct class produced by the ideal classifier (1) and the classification score produced by your model.
The better your underlying machine learning classification model, the lower the hinge (inverse probability) score. This again depends on many factors, such as how large and complex the dataset is, what type of machine learning model is used and how well it has been built.
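To make the worked example above concrete, here is a tiny sketch (the score matrix and labels are made up for illustration):

import numpy as np

# Each row holds the classifier's scores [score_class_0, score_class_1]; y holds the actual labels
scores = np.array([[0.6, 0.4],
                   [0.2, 0.8]])
y = np.array([1, 1])

# Hinge (inverse probability) nonconformity: 1 minus the score of the true class
hinge = 1 - scores[np.arange(len(y)), y]
print(hinge)  # [0.6 0.2]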
Step 2: Use the training set to train the underlying classifier models. Conformal Prediction is model agnostic: one can use any model as a classifier, from statistical models to machine learning to deep learning.
The key twist is that the underlying classifier is trained twice for each point in the test set, assigning the test point each of the two potential labels, 0 and 1. This is the crucial distinction from the Inductive Conformal Prediction procedure, where the underlying classifier is trained only once on the training set. In TCP, one appends each test point to the training set twice, once with postulated label 0 and once with postulated label 1, and this procedure is repeated for every point in the test set.
Overall, you will train the classifier 2 x m times, where m is the number of points in your test set.
This can become computationally expensive for huge datasets, where you might want to use Inductive Conformal Prediction instead. For small and medium datasets it is not costly, and to obtain better point predictions and narrower prediction intervals you might well prefer Transductive (full) Conformal Prediction, even at the price of training the classifier 2 x m times. Many algorithms, such as logistic regression, are very fast to train.
Here we use the German credit dataset; this is a classical dataset describing good and bad credit risk based on features such as loan duration, credit history, employment, property, age, housing and others.
We split the dataset into a training set and a test set and iterate through each point in the test set; a rough sketch of this loop is shown below.
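Here is what that loop might look like (toy data stands in for the German credit features; the variable names mirror the snippets that follow):

import numpy as np
from sklearn.model_selection import train_test_split

# Toy stand-in for the German credit features and labels, purely for illustration
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = rng.integers(0, 2, size=100)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

for i in range(len(X_test)):
    # Append the i-th test point to the training features, one test point at a time
    X_train_plus_test = np.vstack([X_train, X_test[i]])
    # Next, postulate label 0 and label 1 for this point and train the classifier twice (see below)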
You can see the first test point (original index 30) appended to the tail of the training set, effectively becoming another point in the augmented training set we have created for training the full Conformal Predictor. This augmented dataset includes all the training points plus one test point.
Now, for this test point, we train two classification models, one with a postulated label of 0 and one with a postulated label of 1. We create the two label vectors as follows:
y_train_plus_test_0 = np.append(y_train, 0)
y_train_plus_test_1 = np.append(y_train, 1)
We train two models using any classifier (in my case I have used Logistic Regression from Scikit-learn):
# train classifier with y_test label 0
model.fit(X_train_plus_test, y_train_plus_test_0)
y_pred_score_train_plus_test_0 = model.predict_proba(X_train_plus_test)
# train classifier with y_test label 1
model.fit(X_train_plus_test, y_train_plus_test_1)
y_pred_score_train_plus_test_1 = model.predict_proba(X_train_plus_test)
You can see that training the model twice matters: the classification scores change not only for the test point but also for the training points.
Step 3: Compute nonconformity scores under both models for the training set enlarged with the appended test point (one test point at a time).
The result is two vectors of nonconformity scores, one for potential label 0 and one for potential label 1.
The InverseProbabilityNC function below shows how to compute the hinge loss for each point of the augmented training set given the two trained classifier models. Remember, we need two sets of nonconformity scores: one for potential label 0 and one for potential label 1.
This means training the classifier twice for each test point added to the training set. Once the computations for the first test point are done, it is removed, the second test point is appended to the training set, and so on.
# Here we use the inverse probability nonconformity measure, also known as hinge loss.
# The function below looks up the probability score the underlying model assigns to the
# correct class and subtracts it from 1, i.e. for each correct label in y the
# nonconformity score is 1 - P_hat(y_i | x_i).
def InverseProbabilityNC(predicted_score, y):
    prob = np.zeros(y.size, dtype=np.float32)
    for i, y_ in enumerate(y):
        if y_ >= predicted_score.shape[1]:
            # Label outside the score matrix: treat the score of the correct class as 0 (maximal nonconformity)
            prob[i] = 0
        else:
            prob[i] = predicted_score[i, int(y_)]
    return 1 - prob
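Continuing with the variable names from Step 2, this is one way (a sketch, not the only way) to obtain the two score vectors and the test point's own scores:

# Nonconformity scores of the augmented set under each postulated label;
# y_pred_score_* are the predict_proba outputs from the two fits in Step 2
alphas_0 = InverseProbabilityNC(y_pred_score_train_plus_test_0, y_train_plus_test_0)
alphas_1 = InverseProbabilityNC(y_pred_score_train_plus_test_1, y_train_plus_test_1)

# The last entry of each vector is the nonconformity score of the test point itself
alpha_test_0, alpha_test_1 = alphas_0[-1], alphas_1[-1]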
We can see from the distribution of nonconformity scores that, for postulated label 0, the test point's nonconformity score is quite typical (conforming) compared to the training set, whilst for postulated label 1 it lies in a low-density region and is hence less likely.
This indicates that the test object can be assigned a label of 0 and that 1 is less likely. But Conformal Prediction is a robust mathematical machine learning framework, so we need to quantify this decision. This is where p-values come in.
Step 4: Compute two p-values for each test point: one for potential class label 0 and one for potential class label 1.
To do that, we use the definition of p-values from Vovk's book "Algorithmic Learning in a Random World."
According to the algorithm, for each test point and each of the two vectors of nonconformity scores alpha (one for postulated label 0, one for postulated label 1), we count how many objects in the augmented set (training set plus the test point) have nonconformity scores larger than or equal to the nonconformity score of the test point, and divide that count by n + 1, where n is the number of training points and the extra 1 accounts for the test point itself: p = #{i in augmented set : alpha_i >= alpha_test} / (n + 1). As a result, we obtain two p-values for each test point, one for class 0 and one for class 1.
This is the central idea of conformal prediction: we use the nonconformity values for each test point to see how well it fits the training set.
If enough points in the training set are as nonconforming with the "bag" of training points as the test point (or more so), we conclude that, based on the data we have seen and the nonconformity score of the test point under that specific postulated label, the postulated label (0 or 1) fits the data well and we include it in the prediction set; otherwise, we exclude it.
This is a form of hypothesis testing for each potential label, similar to the statistical hypothesis testing you might already be familiar with. For each postulated value of the label y, the null hypothesis is that this label conforms with the training data; if the p-value is larger than the significance level, we cannot reject it and the label is included in the prediction set. If the p-value is smaller than the significance level, we reject the hypothesis that the postulated label conforms with the training data and exclude it from the prediction set.
Here is a Python code illustration of how to compute p-values. Don't forget to compute two p-values for each test object, one for potential label 0 and one for potential label 1.
p_values = []
for i, test_point_conformity_score in enumerate(non_conformity_scores_test):
    # Rank the test point's score among the training scores; the +1 accounts for the test point itself
    p_value = (np.sum(non_conformity_scores_train >= test_point_conformity_score) + 1) / (len(non_conformity_scores_train) + 1)
    p_values.append(p_value)
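Tying this back to the two augmented score vectors from Step 3 (a sketch using the names introduced earlier), the two p-values for a single test point are:

# The test point's own score is part of alphas_0 / alphas_1, so the comparison below
# already counts it; this plays the role of the +1 in the formula above
p_value_0 = np.sum(alphas_0 >= alpha_test_0) / len(alphas_0)
p_value_1 = np.sum(alphas_1 >= alpha_test_1) / len(alphas_1)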
Think of test objects as Schrödinger's cats: they don't have a label (as we don't know their labels), so until we know the actual labels, test objects can have two labels simultaneously!
Full transductive Conformal Prediction is brilliant in that way: it even gets one thinking about quantum mechanics. No amount of Inductive Conformal Prediction applied to cats and dogs comes close; in ICP, we know the labels of the calibration set, so there are no Schrödinger's cats there!
No stress! We are almost done.
Step 5: Finally, based on the p-values computed for each test point and the selected significance level epsilon, we include (or exclude) potential labels as follows.
If the p-value for the corresponding potential label is larger than the significance level, we include that label; otherwise, we exclude it.
As an example, suppose that we have obtained two p-values for the first object in the test set:
- The p-value for label 0 is 0.55; as it is larger than the significance level (0.05), we include the postulated label (0 in this case) in the prediction set for this test point.
- The p-value for label 1 is 0.002; as it is smaller than the significance level (0.05), we cannot include the postulated label (1 in this case) in the prediction set for this test point.
The final prediction set for this point is [0].
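As a tiny sketch of this decision rule (p-values hard-coded to the example above):

significance = 0.05                    # significance level epsilon, i.e. 95% confidence
p_value_0, p_value_1 = 0.55, 0.002     # the example p-values from above

prediction_set = []
if p_value_0 > significance:           # 0.55 > 0.05: include label 0
    prediction_set.append(0)
if p_value_1 > significance:           # 0.002 <= 0.05: label 1 stays out
    prediction_set.append(1)
print(prediction_set)                  # [0]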
The possible prediction sets are [] (the empty set: nothing, nada, zilch), [0], [1], and [0,1]. The empty set arises when we exclude both labels: the object is so strange that it does not conform under either postulated label, so at the selected confidence level we are not confident enough to assign any label. The model effectively says "I don't know", which is the safer option in critical applications such as medical imaging for serious disease diagnostics, rather than declaring there is no disease and missing a critical condition.
If you stayed awake reading through the steps, congratulations: you now understand how full transductive Conformal Prediction works under the hood and what TCP has in common with Schrödinger's cats!
As a bonus, now that you understand the framework, if you don't want to code the above by hand there is an excellent classical Conformal Prediction library called Nonconformist that gives you Transductive (full) Conformal Prediction in just four lines of code.
# Imports: the nonconformity factory and error function live in nonconformist.nc,
# the transductive classifier in nonconformist.cp
from nonconformist.nc import NcFactory, InverseProbabilityErrFunc
from nonconformist.cp import TcpClassifier

# Create a default nonconformity function; model is your underlying classifier
nc = NcFactory.create_nc(model, err_func=InverseProbabilityErrFunc())
# Create a transductive conformal classifier
tcp = TcpClassifier(nc)
# Fit the TCP using the proper training set
tcp.fit(X_train, y_train)
# Produce predictions for the test set with a confidence of 95%
prediction = tcp.predict(X_test.values, significance=0.05)
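To my understanding, predict with a significance level returns a boolean matrix with one column per class, where True means the class is included in the prediction set; if so, the prediction sets can be read off like this (a hedged sketch):

import numpy as np

# Assumption: prediction is a boolean array of shape (n_test, n_classes)
prediction_sets = [list(np.flatnonzero(row)) for row in prediction]
print(prediction_sets[:5])   # e.g. [[0], [0, 1], [1], [0], [0]]; actual output depends on your data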
Check out my book "Practical Guide to Applied Conformal Prediction: Learn and apply the best uncertainty frameworks to your industry applications": https://a.co/d/j5p2eSa.
I hope you enjoyed reading this article; you can follow me on LinkedIn and Twitter. Don't forget to like this article and star the Awesome Conformal Prediction repo.
Additional materials:
- Awesome Conformal Prediction
- Colab notebook for this article
- Practical Guide to Applied Conformal Prediction: Learn and apply the best uncertainty frameworks to your industry applications
- "Model-agnostic nonconformity functions for conformal classification"
- A Tutorial on Conformal Prediction by Vovk and Shafer
- Algorithmic Learning in a Random World (Vladimir Vovk's book on Conformal Prediction)
- Nonconformist, the original Conformal Prediction library by Henrik Linusson
- Python's predict_proba does not actually predict probabilities