Does NGBoost work? Evaluating NGBoost against key criteria for good probabilistic prediction

In a previous article, “How to evaluate probabilistic forecasts”, I illustrated how to evaluate probabilistic predictors using two critical criteria: validity and efficiency.

Until Conformal Predictive Distributions were invented (check out “How to predict full probability distribution using the machine learning Conformal Predictive Distributions”), there was no general approach to producing perfectly calibrated probabilistic predictions, so many researchers and practitioners resorted to methods that estimated the parameters of various distributions using statistical or machine learning methods.

Whilst such approaches might look like the Holy Grail of probabilistic forecasting, it is usually not publicised that methods relying on parametric distributions don’t come with any calibration guarantees. They often produce uncalibrated predictions, with both the location of the peak and the prediction intervals out of sync with the actual data.

In 2020, the Stanford Machine Learning Group published NGBoost, described as an “algorithm for generic probabilistic prediction via gradient boosting”, to address the estimation of predictive uncertainty, including for critical applications such as healthcare.

The NGBoost approach was to extend the gradient boosting idea to probabilistic regression by treating the parameters of a hypothesised distribution as targets for a multiparameter boosting algorithm. Essentially, instead of estimating only the conditional mean, the NGBoost authors proposed estimating the location and scale parameters of a specified distribution, such as the Gaussian. The NGBoost paper stated that NGBoost “matches or exceeds the performance of existing methods for probabilistic prediction while offering additional benefits in flexibility, scalability, and usability.”
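To make the "parameters as boosting targets" idea concrete, here is a minimal sketch for the Gaussian case. It assumes the distribution is parameterised by the mean and the log of the standard deviation, and that the loss is the negative log-likelihood. The function names are mine, for illustration only, and are not the NGBoost API. At each boosting stage, one base learner per parameter would be fitted to gradients like these.

```python
import math

def gaussian_nll(y, mu, log_sigma):
    """Negative log-likelihood of y under N(mu, sigma^2), with sigma = exp(log_sigma)."""
    sigma = math.exp(log_sigma)
    return 0.5 * math.log(2 * math.pi) + log_sigma + 0.5 * ((y - mu) / sigma) ** 2

def nll_gradient(y, mu, log_sigma):
    """Ordinary gradient of the NLL with respect to (mu, log_sigma)."""
    var = math.exp(2 * log_sigma)
    d_mu = (mu - y) / var
    d_log_sigma = 1 - (y - mu) ** 2 / var
    return d_mu, d_log_sigma

def natural_gradient(y, mu, log_sigma):
    """Gradient rescaled by the inverse Fisher information.

    For N(mu, sigma^2) in the (mu, log_sigma) parameterisation, the Fisher
    information is diag(1 / sigma^2, 2), so the rescaling is element-wise.
    """
    var = math.exp(2 * log_sigma)
    d_mu, d_log_sigma = nll_gradient(y, mu, log_sigma)
    return d_mu * var, d_log_sigma / 2

# Toy example (made-up numbers): observation y = 2, current guess N(0, 1).
print(gaussian_nll(2.0, 0.0, 0.0))
print(nll_gradient(2.0, 0.0, 0.0))
print(natural_gradient(2.0, 0.0, 0.0))
```

Note that the model now has two outputs per input, and nothing in this objective guarantees that the resulting intervals are calibrated on new data.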

In a range of experiments on standard datasets, NGBoost claimed competitive performance compared to the methods existing at the time, most of them of the Bayesian deep learning variety.

NGBoost: Natural Gradient Boosting for Probabilistic Prediction

The authors of the paper, however, did not compare the performance of NGBoost to conformal prediction. In my article “How to predict full probability distribution using machine learning Conformal Predictive Distributions”, I showcased how Conformal Predictive Distributions can produce a perfectly calibrated probabilistic CDF for each test point, regardless of the underlying regressor, for any data distribution and any sample size.

So, given that Conformal Predictive Distributions (invented in 2017) already produced perfectly calibrated probabilistic predictions by the time NGBoost appeared around 2020, and given that NGBoost was never fully benchmarked against the existing SOTA, a more detailed look at what NGBoost does, how it does it, and how it compares to methods not mentioned in the NGBoost paper is warranted.

To test NGBoost, I used the “Concrete” dataset (also used in the NGBoost paper, which claimed SOTA performance on this dataset among others). One can find this dataset in the UCI data repository. The target variable is ‘concrete_compressive_strength’, and the objective is to estimate concrete strength from the other features.

So, let’s take a deeper dive into NGBoost and look beyond the dry performance figures reported in the NGBoost paper.

UCI Concrete dataset

The distribution of the target variable (concrete strength) looks like this, so clearly not normal. NGBoost, meanwhile, uses the normal distribution to estimate the uncertainty of predictions in the general case.

Distribution of the target variable, concrete strength

So what could go wrong if one tried to predict the target variable using NGBoost with the normal distribution under the hood? Let’s see.

We slice the dataset into training, validation and test sets. We then compute the probabilistic predictions produced by NGBoost and plot them for points from the test set.
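The split itself can be sketched in a few lines. This is an illustration only: the 60/20/20 proportions, the seed and the function name `train_val_test_split` are my assumptions here, not the exact settings behind the plots.

```python
import random

def train_val_test_split(rows, val_frac=0.2, test_frac=0.2, seed=0):
    """Shuffle row indices once, then cut them into three disjoint parts."""
    idx = list(range(len(rows)))
    random.Random(seed).shuffle(idx)  # seeded so the split is reproducible
    n_test = int(len(rows) * test_frac)
    n_val = int(len(rows) * val_frac)
    test_idx = idx[:n_test]
    val_idx = idx[n_test:n_test + n_val]
    train_idx = idx[n_test + n_val:]
    return (
        [rows[i] for i in train_idx],
        [rows[i] for i in val_idx],
        [rows[i] for i in test_idx],
    )

# Toy usage with 100 placeholder rows instead of the real dataset.
train, val, test = train_val_test_split(list(range(100)))
print(len(train), len(val), len(test))
```

The validation part stays untouched during fitting so that it can later be used for hyperparameter tuning, and the test part is only used for the final evaluation.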

Predictions produced by NGBoost on the Concrete dataset

What is happening here? Out of nine predictions, only about three are more or less on, another three are somewhat off, and the remaining three are way off.

Let’s check the stated objective of NGBoost: “Not only do we want our models to make accurate predictions, but we also want a correct estimate of uncertainty along with each prediction.”

Based on the validity criterion (lack of bias), NGBoost seems to be hit-and-miss. So perhaps it does better on the efficiency (coverage) criterion? Let’s check the calibration of the Prediction Intervals (PI). Running the numbers for Prediction Interval coverage:

95% PI coverage promised by NGBoost - actual coverage 0.91
80% PI coverage promised by NGBoost - actual coverage 0.74

Not quite as expected: the 95% PI should deliver 95% coverage, and the 80% PI should deliver 80%. Here we have a gap of 4 to 6 percentage points, with the actual coverage well below what was promised.
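For reference, this kind of coverage check can be sketched as follows. It is a minimal sketch, assuming every prediction is a Gaussian described by a predicted mean and standard deviation. The function names are mine and the toy numbers are made up; they are not the results above.

```python
from statistics import NormalDist

def central_interval(mu, sigma, level):
    """Central prediction interval of N(mu, sigma^2) at the given nominal level."""
    tail = (1 - level) / 2
    dist = NormalDist(mu, sigma)
    return dist.inv_cdf(tail), dist.inv_cdf(1 - tail)

def empirical_coverage(y_true, mus, sigmas, level):
    """Fraction of actual values that fall inside their predicted interval."""
    hits = 0
    for y, mu, sigma in zip(y_true, mus, sigmas):
        lower, upper = central_interval(mu, sigma, level)
        hits += lower <= y <= upper
    return hits / len(y_true)

# Toy example: three test points, all predicted as N(0, 1).
y_true = [0.0, 1.0, 3.0]
mus = [0.0, 0.0, 0.0]
sigmas = [1.0, 1.0, 1.0]
print(empirical_coverage(y_true, mus, sigmas, level=0.95))
```

A calibrated predictor should return a value close to `level` on a large enough test set. A value clearly below it means the intervals are too narrow.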

What would be the result of decision-making relying on such predictions considering that this is the dataset for estimating the strength of concrete?

Well, in this case, incorrect estimation of max concrete strength might result in something like this, perhaps?

But wait, one might ask, we did not use the validation set to tune the NGBoost parameters. Perhaps hyperparameter optimisation would do the trick so that we will end up with the correct predictions.

Let’s optimise the hyperparameters on the validation set and check the results again. It does not get any better; judging by the location of the peak versus the actual test point, it seems to be getting worse. The PI coverage is only marginally, and inconsistently, better.

Predictions produced by NGBoost with optimised hyperparameters on the Concrete dataset
95% PI coverage promised by NGBoost - actual coverage 0.94
80% PI coverage promised by NGBoost - actual coverage 0.85

The Prediction Intervals are still not well-calibrated; the 80% PI has flipped from being overconfident (74% actual coverage) to underconfident (85% actual coverage) on the same dataset using the same model!
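Checking two coverage levels only tells part of the story. One way to look at calibration across all levels at once is the probability integral transform (PIT): evaluate each predicted CDF at the actual value and look at the histogram. The sketch below assumes Gaussian predictions. The function names and toy numbers are mine, for illustration only.

```python
from statistics import NormalDist

def pit_values(y_true, mus, sigmas):
    """Predicted CDF evaluated at each actual value."""
    return [NormalDist(mu, s).cdf(y) for y, mu, s in zip(y_true, mus, sigmas)]

def pit_histogram(pits, n_bins=10):
    """Counts of PIT values in equal-width bins on [0, 1]."""
    counts = [0] * n_bins
    for p in pits:
        # A PIT of exactly 1.0 goes into the last bin.
        counts[min(int(p * n_bins), n_bins - 1)] += 1
    return counts

# Toy example: one actual value at the centre and two far out in the tails.
pits = pit_values([0.0, 10.0, -10.0], [0.0, 0.0, 0.0], [1.0, 1.0, 1.0])
print(pit_histogram(pits))
```

For a calibrated predictor the histogram is roughly flat. A U-shape (too many values in the outer bins) indicates overconfident, too narrow distributions, while a hump in the middle indicates underconfident, too wide ones.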

TL;DR NGBoost does not produce unbiased point forecasts, and its Prediction Intervals are uncalibrated.

The solution to the NGBoost issues?

Replace it with your favourite point regression model (e.g. XGBoost) augmented by Conformal Predictive Distributions. Get the best of both worlds: highly accurate point predictions and well-calibrated probabilistic predictions generated by Conformal Predictive Distributions. Conformal Predictive Distributions do exactly what they say on the tin and always provide well-calibrated prediction intervals by default.
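To give a flavour of how this works, here is a minimal sketch of the split (inductive) variant, assuming you already have point predictions from any regressor and the residuals `y - prediction` on a held-out calibration set. The function names are mine, and `tau` is fixed at 0.5 for reproducibility, whereas the randomised version draws it uniformly from [0, 1]. Treat this as an illustration under those assumptions, not a reference implementation.

```python
import math
from bisect import bisect_left, bisect_right

def conformal_cdf(point_pred, cal_residuals, tau=0.5):
    """Return a function y -> predictive CDF value for one test point."""
    # Shift the calibration residuals so they are centred on the point prediction.
    shifted = sorted(point_pred + r for r in cal_residuals)
    n = len(shifted)

    def cdf(y):
        below = bisect_left(shifted, y)
        equal = bisect_right(shifted, y) - below
        return (below + tau * (equal + 1)) / (n + 1)

    return cdf

def conformal_interval(point_pred, cal_residuals, alpha=0.2):
    """Two-sided interval with nominal coverage 1 - alpha from sorted residuals."""
    res = sorted(cal_residuals)
    n = len(res)
    lo = math.floor((n + 1) * alpha / 2)        # 1-based rank of the lower residual
    hi = math.ceil((n + 1) * (1 - alpha / 2))   # 1-based rank of the upper residual
    lower = point_pred + res[lo - 1] if lo >= 1 else -math.inf
    upper = point_pred + res[hi - 1] if hi <= n else math.inf
    return lower, upper

# Toy example: made-up calibration residuals and a point prediction of 40.
residuals = [float(r) for r in range(-9, 10)]
cdf = conformal_cdf(40.0, residuals)
print(cdf(40.0))
print(conformal_interval(40.0, residuals, alpha=0.2))
```

No distributional shape is assumed here: the predictive distribution is built from the empirical calibration residuals, and the coverage guarantee relies on the calibration and test data being exchangeable.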


  1. “How to predict the full probability distribution using machine learning Conformal Predictive Distributions”
  2. “NGBoost: Natural Gradient Boosting for Probabilistic Prediction”
  3. “Probabilistic forecasts, calibration and sharpness”
  4. Awesome Conformal Prediction. The most comprehensive professionally curated list of Awesome Conformal Prediction tutorials, videos, books, papers and open-source libraries in Python and R.



Principal Data Scientist, PhD in Machine Learning, creator of Awesome Conformal Prediction

Valery Manokhin, PhD, MBA, CQF
