Conformal Prediction Via Regression-As-Classification
R2CCP: A Recent Addition to the Conformal Prediction Landscape for Robust Uncertainty Quantification in Regression Analysis.
Conformal Prediction is rapidly becoming one of the most celebrated frameworks in the realm of uncertainty quantification, standing out for its robust approach to gauging the reliability of predictions.
This innovative framework has gained tremendous traction among data scientists, researchers and analysts for its ability to provide valid, statistically rigorous measures of confidence, regardless of the underlying data distribution.
Whether it’s financial forecasting, medical diagnosis, or any other predictive modeling task, Conformal Prediction offers a transparent and scientifically grounded method to ensure that the predictions are not just accurate, but also accompanied by a quantifiable measure of certainty.
This makes it an invaluable tool in a world where decisions are increasingly driven by data, providing a much-needed layer of assurance that helps mitigate risk and foster trust in predictive insights.
In my book ‘Practical Guide to Applied Conformal Prediction in Python: Learn and Apply the Best Uncertainty Frameworks to Your Industry Applications’, I cover some of the most prominent Conformal Prediction models for regression, including the highly effective Conformalized Quantile Regression, the robust Jackknife+, and Conformal Predictive Distributions. The latter are particularly distinguished by their ability to output a complete predictive distribution for every test point, ensuring comprehensive uncertainty quantification.
These models equip practitioners and researchers with the tools needed to apply advanced uncertainty quantification in various industry scenarios, enhancing the reliability and interpretability of predictive outcomes.
Recently, I have been exploring new Conformal Prediction models for regression and came across an interesting method, ‘Conformal Prediction via Regression-as-Classification’, by researchers from the RIKEN Center for AI Project, SambaNova Systems, Salesforce and Apple, published as a conference paper at ICLR 2024.
Quantifying uncertainty in regression poses significant challenges, particularly when dealing with outputs that are heteroscedastic, multimodal, or skewed.
Conformal Prediction (CP) techniques are designed to address these complexities under relatively mild assumptions. These methods construct prediction sets that, for a given test input, are highly likely to contain the true (yet unknown) output value. This reliability makes CP an essential tool in statistical modeling, providing a robust framework for managing and interpreting uncertainty effectively.
The construction of a prediction set in Conformal Prediction (CP) is driven by a conformity score, which essentially gauges how well a candidate output conforms to the training examples. The conformal set is built from this score, incorporating the candidate outputs whose conformity scores are sufficiently high. One of the principal hurdles is crafting an appropriate conformity score. Commonly, simple metrics such as the distance to a mean regressor are used, but these can overlook complex characteristics of the output distribution, such as its shape; this oversight might result in symmetric intervals that fail to account for heteroscedasticity.
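As a rough sketch (this is the standard split-conformal construction, not something specific to this paper), given a conformity score s(x, y) and a held-out calibration set of size n, the conformal set at miscoverage level alpha can be written as:

$$
\hat{q} = \text{the } \lfloor \alpha (n+1) \rfloor\text{-th smallest value of } \{ s(x_i, y_i) \}_{i=1}^{n},
\qquad
\mathcal{C}(x_{\text{test}}) = \{\, y : s(x_{\text{test}}, y) \ge \hat{q} \,\}.
$$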
Ideally, it would be advantageous to estimate the (conditional) distribution of the output, perhaps through techniques like kernel density estimation, and use this estimation to construct a confidence interval. However, these estimation methods bring their own set of challenges; they are often sensitive to the choice of kernel and hyperparameters, which can lead to unstable outcomes.
Regression-as-classification (R2CCP) navigates these challenges by leveraging established CP techniques originally developed for classification. The approach transforms the regression task into a classification problem, allowing CP methods designed for classification to be used to construct the conformal set.
The figure shows two examples, one with a heteroscedastic and one with a bimodal output distribution, and how R2CCP adaptively changes the prediction interval.
R2CCP utilizes classification techniques to construct a distribution-based conformal set, adept at adapting to the shape of the output distribution while maintaining the simplicity and efficiency characteristic of CP for classification.
The method begins by discretizing the output space into bins, each treated as a separate class. To maintain the continuity of the original output space, the authors develop an alternative loss function. This function penalizes predictions far from the true output value but incorporates entropy regularization to encourage variability. The resultant method is effectively responsive to complexities such as heteroscedasticity and bimodality in the label distribution, making it a robust solution for handling diverse output scenarios.
R2CCP is designed to compute a conformity function that effectively determines the suitability of a label for a specific data point. Considering that the label distribution can exhibit various complex forms, such as bimodality, heavy-tailedness, or heteroscedasticity, the method is tailored to account for these diverse distribution shapes while maintaining precise coverage.
A common approach involves utilizing the conditional label density as the conformity function, which has proven to yield dependable outcomes in classification scenarios. Typically, in classification-based conformal prediction, practitioners use probability estimates generated by a Softmax neural network. This network processes K output logits with a cross-entropy loss to estimate these probabilities. Let us refer to this approach as the parameterized density.
Traditionally, the neural network is fitted by minimizing the cross-entropy loss on the training set:
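Written out (in my notation, not necessarily the paper's), with f_theta(x) denoting the K output logits of the network:

$$
\hat{\theta} \in \arg\min_{\theta} \; -\sum_{i=1}^{n} \log q_{\theta}(y_i \mid x_i),
\qquad
q_{\theta}(k \mid x) = \frac{\exp\big(f_{\theta}(x)_k\big)}{\sum_{j=1}^{K} \exp\big(f_{\theta}(x)_j\big)}.
$$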
Let’s consider that we have successfully trained a model and obtained a parameter theta that minimizes the traditional cross-entropy loss on our training dataset. A straightforward choice for the conformity score would be the probability of a label as given by the learned conditional distribution.
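In this notation, a straightforward conformity score is simply the estimated probability of the candidate label:

$$
s(x, y) = q_{\hat{\theta}}(y \mid x).
$$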
This method is both simple and effective. Thanks to the flexibility of the neural network, it can adaptively learn a variety of label distributions across different examples without the need for explicitly designing specific prior structures.
In the regression context, the distribution of labels is continuous. One strategy to bridge classification and regression challenges, known outside of conformal prediction literature as “Regression-as-Classification,” involves transforming a regression problem into a classification one.
This is achieved by segmenting the range of output values into bins. Specifically, we create K bins, each being one of K equally spaced intervals covering the range [y_min, y_max], where y_min and y_max are the minimum and maximum label values observed in the training dataset, respectively.
In this method, we define the y_hat values as the midpoints of the discretized bins. The k-th bin encompasses all labels in the range that are closest to its midpoint y_hat_k.
Conceptually, we treat each bin as a distinct class, effectively transforming a regression problem into a classification one. This straightforward approach has surprisingly yielded robust outcomes. Recent studies, such as those by Stewart et al. (2023), have indicated that this binning technique leads to more stable training processes and significantly enhances the ability to learn conditional expectations.
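A minimal sketch of this discretization step (illustrative only; the variable names, the number of bins K, and y_train are my assumptions, not the authors' code):

import numpy as np

K = 50                                    # number of bins, each treated as a class
y_min, y_max = y_train.min(), y_train.max()
edges = np.linspace(y_min, y_max, K + 1)  # K equally spaced intervals over [y_min, y_max]
y_hat = (edges[:-1] + edges[1:]) / 2      # bin midpoints serve as class representatives
# Each continuous label is mapped to the index of the bin that contains it.
class_labels = np.digitize(y_train, edges[1:-1])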
To merge the strategies of classification and regression in conformal prediction, a practical approach is to adopt the Classification Conformal Prediction model with discrete labels. This adaptation allows for training the neural network using modified labels with cross-entropy loss, which results in a discrete distribution.
A key issue with using the cross-entropy loss in classification-based conformal prediction is its disregard for the structural relationships between classes. In standard classification, classes are considered independent with no inherent structure, so the cross-entropy loss focuses solely on maximizing the probability mass on the correct label, regardless of where the remaining mass lands relative to the actual label.
However, in regression settings where labels have an ordinal structure, this approach can be limiting. To improve accuracy, it’s essential to develop a loss function that not only targets the correct bin but also considers the neighboring bins. Ideally, this would involve crafting a density estimate that assigns higher probabilities to points close to the true label y and lower probabilities to those farther away.
Therefore, a logical goal for learning the probability distribution q is to minimize the expected value of the product loss(y, y_hat) · q(y_hat | x); in other words, we aim to identify a distribution q that concentrates its mass on bins close to the true label.
Although the proposed loss function effectively captures the relationships between bins, it tends to produce Dirac distributions, leading to overconfidence issues in neural networks, a common concern in recent research. Traditionally, smoothness is crucial in density estimation. To address this, the authors utilize a classic entropy regularization technique, favoring density estimators that align well with the training label distribution and concentrate probability mass on the most accurate bins.
The authors opt for distributions that maximize entropy, minimizing assumptions about the data’s underlying structure. This is achieved by incorporating the Shannon entropy H of the probability distribution as a penalty in the objective.
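Putting these pieces together, the training objective is roughly of the following form (my paraphrase, where ℓ is a distance-based penalty between the true label and a bin midpoint, H is the Shannon entropy, and τ > 0 controls the strength of the regularization):

$$
\min_{\theta} \; \sum_{i=1}^{n} \left[ \sum_{k=1}^{K} \ell\big(y_i, \hat{y}_k\big)\, q_{\theta}\big(\hat{y}_k \mid x_i\big) \;-\; \tau\, H\big(q_{\theta}(\cdot \mid x_i)\big) \right].
$$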
In the authors’ experiments, R2CCP performed competitively with other methods. To test the method on additional data, it was tried on this Kaggle Prediction Interval competition and demonstrated good results out of the box, without additional parameter tuning.
The authors released the user-friendly R2CCP package, which produces prediction intervals through a scikit-learn-compatible API with just a few lines of code.
import numpy as np
from R2CCP.model import R2CCP  # R2CCP class from the authors' package (see its README for the exact import path)

# Fit the model, compute prediction intervals, and evaluate coverage/length on the test set.
model = R2CCP({'model_path': 'model_paths/model_save_destination.pth', 'max_epochs': 5})
model.fit(X_train, y_train)
intervals = model.get_intervals(X_test)
coverage, length = model.get_coverage_length(X_test, y_test.values)
print(f"Coverage: {np.mean(coverage)}, Length: {np.mean(length)}")
If you’re interested in gaining a deeper understanding of Conformal Prediction techniques and acquiring hands-on experience in quantifying uncertainty in machine learning, consider joining our live, peer cohort-based course ‘Applied Conformal Prediction’.
This interactive learning environment offers an excellent opportunity to explore these advanced techniques with the guidance of experts and the support of a community.
References:
- ‘Applied Conformal Prediction’, live peer cohort-based course
- ‘Practical Guide to Applied Conformal Prediction in Python: Learn and Apply the Best Uncertainty Frameworks to Your Industry Applications’
- ‘How to predict quantiles in a more intelligent way (or “Bye-bye quantile regression, hello Conformalized Quantile Regression”)’
- ‘Jackknife+ — a Swiss knife of Conformal Prediction for regression’
- ‘How to predict full probability distribution using machine learning Conformal Predictive Distributions’
- ‘Conformal Prediction via Regression-as-Classification’, ICLR 2024
- R2CCP Python package