How to Evaluate Probabilistic Forecasts

Valeriy Manokhin, PhD, MBA, CQF
Dec 16, 2021

Have you ever wondered how to objectively and scientifically evaluate probabilistic predictions produced by statistical, machine and deep learning models?

In probabilistic prediction, the two critical evaluation criteria are validity and efficiency.

- 𝐯𝐚π₯𝐒𝐝𝐒𝐭𝐲 (terms such as "calibration" and "coverage" are also used) is essentially all about ensuring that there is no bias in forecasts. How does one measure bias in probabilistic forecasts that, unlike point predictions, produce PIs (prediction intervals)?

The concept is relatively simple. If a forecasting model claims 95% confidence, then (on average) its prediction intervals should, by definition, cover roughly 95% of the actual observations. That is, about 95% of actual observations should fall within the prediction intervals generated by the model.
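As a minimal sketch (with made-up numbers, not taken from the article), empirical coverage is simply the fraction of actual values that fall inside their prediction intervals:

```python
import numpy as np

def empirical_coverage(y_true, lower, upper):
    """Fraction of actual values falling inside their prediction intervals."""
    y_true, lower, upper = map(np.asarray, (y_true, lower, upper))
    return float(np.mean((y_true >= lower) & (y_true <= upper)))

# Toy example: 5 actuals against intervals from a model claiming 95% confidence
y  = np.array([10.0, 12.5,  9.8, 11.1, 14.0])
lo = np.array([ 9.0, 11.0,  9.5, 10.0, 14.5])
hi = np.array([11.0, 13.0, 10.5, 12.0, 15.5])
print(empirical_coverage(y, lo, hi))  # 0.8 -- well below the claimed 0.95
```

A model whose empirical coverage falls persistently below its claimed confidence level is, in the article's terms, not valid.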

In probabilistic prediction, 𝐯𝐚π₯𝐒𝐝𝐒𝐭𝐲 is a must-to-have (necessary) criterion before anything else comes into consideration. If forecasts produced by the model are not probabilistically valid, relying on such forecasts for decision-making is not helpful and could be risky and sometimes outright dangerous. In high-stakes applications such as medicine and self-driving cars, forecasts that lack validity (have a bias) can lead to catastrophic outcomes.

- The second criterion is **efficiency** (terms such as "width" or "sharpness" are also used). Efficiency is desirable but not a must-have; it matters only once the validity requirement has been satisfied. Efficiency relates to the width of prediction intervals: a more efficient predictive model produces narrower PIs (Prediction Intervals).
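Efficiency can be summarised just as directly. A sketch (function name is hypothetical), reporting average width in multiples of the training-target standard deviation, the same normalisation the example later in the article uses:

```python
import numpy as np

def normalized_average_width(lower, upper, y_train):
    """Average PI width, in multiples of the training-set std of y."""
    widths = np.asarray(upper) - np.asarray(lower)
    return float(np.mean(widths) / np.std(y_train))

y_train = np.array([1.0, 3.0])        # std = 1.0, chosen for easy reading
lo = np.array([ 8.0,  9.0, 10.0])
hi = np.array([10.0, 13.0, 12.0])     # widths: 2, 4, 2
print(normalized_average_width(lo, hi, y_train))  # ~2.67 standard deviations
```

Between two models with the same coverage, the one with the smaller normalised average width is the more efficient.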

Validity (calibration/coverage) and efficiency (width/sharpness) are thus the two critical, naturally interpretable metrics for evaluating the predictive uncertainty of any probabilistic prediction (non-time-series) or probabilistic forecasting (time-series) model and/or application.

How can one ensure and optimise validity and efficiency?

In a nutshell, validity in finite samples is automatically guaranteed by only one class of Uncertainty Quantification methods — Conformal Prediction.

None of the alternative Uncertainty Quantification methods has built-in validity guarantees. In the first independent comparative study of all four classes of uncertainty quantification methods, only Conformal Prediction satisfied the validity property.

Nicolas Dewolf, Bernard De Baets and Willem Waegeman, "Valid prediction intervals for regression problems" (2022)
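For intuition, here is a hedged sketch of split (inductive) conformal prediction for regression. The toy mean-predictor stands in for any point model; all names are illustrative rather than taken from a specific library:

```python
import numpy as np

class MeanModel:
    """Toy regressor predicting the training mean (stand-in for any model)."""
    def fit(self, X, y):
        self.mu = float(np.mean(y))
        return self
    def predict(self, X):
        return np.full(len(X), self.mu)

def split_conformal_interval(model, X_train, y_train, X_cal, y_cal, X_new, alpha=0.05):
    """Split conformal PIs with finite-sample coverage >= 1 - alpha."""
    model.fit(X_train, y_train)
    # Conformity scores: absolute residuals on a held-out calibration set
    scores = np.abs(np.asarray(y_cal) - model.predict(X_cal))
    n = len(scores)
    # Finite-sample-corrected quantile of the calibration scores
    q_level = min(1.0, np.ceil((n + 1) * (1 - alpha)) / n)
    q = np.quantile(scores, q_level, method="higher")
    preds = model.predict(X_new)
    return preds - q, preds + q
```

The coverage guarantee holds regardless of the underlying model or data distribution (assuming exchangeability); the choice of conformity measure affects only the efficiency of the resulting intervals.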

Efficiency, on the other hand, depends on multiple factors, including the underlying prediction model, the quantity of data, how "hard" the dataset is, and, in the case of conformal prediction, the conformity measure.

The example below shows validity (coverage/calibration) and efficiency (width/sharpness) for two probabilistic prediction models. Each model produces a prediction interval for each x_i; the interval attempts to cover the true y_i, represented by the red points, at a specified confidence level (say 95%). Validity (coverage) is calculated as the fraction of actual values contained in these intervals, and the intervals' width is reported in multiples of the standard deviation of the training-set y_i values.

One can see that the first model has about 80% coverage, while the second has only about 60%. The first model is therefore better in terms of validity, as it exhibits less bias.

The first model has lower efficiency as the Average Width is about 2.2 standard deviations, whilst the second model's Average Width is only about 0.7 standard deviations.

Want to learn more about the problems and terminology of probabilistic forecasting? The key paper is "Probabilistic forecasts, calibration and sharpness" by Gneiting, Balabdaoui and Raftery (2007).

#timeseries #uncertainty #metrics #forecasting #demandforecasting #probabilisticprediction #machinelearning

Additional materials:

1. "Probabilistic forecasts, calibration and sharpness"

2. Awesome Conformal Prediction — the most comprehensive, professionally curated list of Conformal Prediction tutorials, videos, books, papers and open-source libraries in Python and R.

3. "Valid prediction intervals for regression problems"



Valeriy Manokhin, PhD, MBA, CQF

Principal Data Scientist, PhD in Machine Learning, creator of Awesome Conformal Prediction