How to Evaluate Probabilistic Forecasts
Have you ever wondered how to objectively and scientifically evaluate probabilistic predictions produced by statistical, machine learning, and deep learning models?
In probabilistic prediction, the two critical evaluation criteria are validity and efficiency.
- 𝐯𝐚𝐥𝐢𝐝𝐢𝐭𝐲 (terms such as "calibration" and "coverage" are also used) is essentially all about ensuring that there is no bias in forecasts. How does one measure bias in probabilistic forecasts that, unlike point predictions, produce PIs (prediction intervals)?
The concept is relatively simple and can be illustrated like this. If a forecasting model claims 95% confidence, then (on average) its prediction intervals should (by definition) cover ~95% of actual observations. That is, ~95% of actual observations should fall within the prediction intervals generated by the forecasting model.
In probabilistic prediction, 𝐯𝐚𝐥𝐢𝐝𝐢𝐭𝐲 is a must-have (necessary) criterion before anything else comes into consideration. If forecasts produced by the model are not probabilistically valid, relying on such forecasts for decision-making is not helpful and could be risky and sometimes outright dangerous. In high-stakes applications such as medicine and self-driving cars, forecasts that lack validity (have a bias) can lead to catastrophic outcomes.
- The second criterion is 𝗲𝗳𝗳𝗶𝗰𝗶𝗲𝗻𝗰𝘆 (terms such as "width" and "sharpness" are also used). Efficiency is desirable but, unlike validity, not a must-have; it comes into consideration only after the validity requirement has been satisfied. Efficiency relates to the width of prediction intervals: a more efficient predictive model produces narrower PIs (prediction intervals).
These are the two critical metrics for evaluating any probabilistic predictor or probabilistic forecasting (time series) model and/or application.
Validity (calibration/coverage) and efficiency (width/sharpness) are the natural and interpretable metrics for evaluating the predictive uncertainty of any probabilistic prediction (non-time series) or probabilistic forecasting (time series) model.
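Both metrics are straightforward to compute once you have prediction intervals and the actual observations. Here is a minimal sketch in plain Python; the function names `empirical_coverage` and `mean_interval_width` and the toy numbers are illustrative assumptions, not part of any particular library.

```python
def empirical_coverage(y_true, lower, upper):
    """Validity check: fraction of observations inside their interval."""
    hits = [lo <= y <= hi for y, lo, hi in zip(y_true, lower, upper)]
    return sum(hits) / len(hits)


def mean_interval_width(lower, upper):
    """Efficiency check: average width of the prediction intervals."""
    widths = [hi - lo for lo, hi in zip(lower, upper)]
    return sum(widths) / len(widths)


# Toy data (made up for illustration): five observations and their PIs.
y_true = [10.0, 12.0, 9.0, 15.0, 11.0]
lower = [8.0, 11.0, 9.5, 13.0, 9.0]
upper = [12.0, 14.0, 11.0, 16.0, 12.0]

coverage = empirical_coverage(y_true, lower, upper)
width = mean_interval_width(lower, upper)

print(coverage)  # 0.8 (the third observation falls below its interval)
print(width)     # 2.9
```

If these intervals had been issued at a claimed 95% level, an empirical coverage of 0.8 would hint at a lack of validity, although five points are far too few to draw a conclusion. Width is only worth comparing between models whose coverage is already close to the claimed level.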
How can one ensure and optimise validity and efficiency?
In a nutshell, validity in finite samples is automatically guaranteed by only one class of Uncertainty Quantification methods: Conformal Prediction.
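To make this concrete, here is a minimal sketch of one common variant, split (inductive) conformal prediction for regression with absolute-residual scores. Everything specific in it is an assumption for illustration: the point predictor `predict` is a hypothetical stand-in for any fitted model, the data is simulated and i.i.d., and the helper names are made up. The coverage guarantee relies on calibration and test points being exchangeable, which this simulated data satisfies by construction.

```python
import math
import random


def conformal_quantile(scores, alpha):
    """Return the ceil((n + 1) * (1 - alpha))-th smallest calibration score."""
    n = len(scores)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        return math.inf  # too few calibration points for this alpha
    return sorted(scores)[k - 1]


def predict(x):
    # Hypothetical point predictor standing in for any fitted model.
    return 2.0 * x


def simulate(n):
    # Simulated i.i.d. data (an assumption): y = 2x + Gaussian noise.
    xs = [random.uniform(0, 10) for _ in range(n)]
    ys = [2.0 * x + random.gauss(0, 1) for x in xs]
    return xs, ys


random.seed(0)
x_cal, y_cal = simulate(500)     # held-out calibration set
x_test, y_test = simulate(2000)  # unseen test set

alpha = 0.05  # target 95% coverage

# Nonconformity scores: absolute residuals on the calibration set.
scores = [abs(y - predict(x)) for x, y in zip(x_cal, y_cal)]
q = conformal_quantile(scores, alpha)

# Prediction intervals: point prediction plus/minus the calibrated quantile.
lower = [predict(x) - q for x in x_test]
upper = [predict(x) + q for x in x_test]

coverage = sum(lo <= y <= hi for lo, y, hi in zip(lower, y_test, upper)) / len(y_test)
mean_width = 2 * q

print(f"coverage={coverage:.3f}, mean width={mean_width:.3f}")
```

On data like this, the empirical coverage should land close to the 95% target. Efficiency then depends on the quality of the underlying model: a better point predictor yields smaller residuals, a smaller `q`, and therefore narrower intervals.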