Benchmarking Neural Prophet. Part II — exploring electricity dataset.

Valeriy Manokhin, PhD, MBA, CQF
6 min read · Jan 3, 2022

Over the last year, I have written many posts explaining that Facebook Prophet is a consistently underperforming forecasting algorithm; in a nutshell, it does not work.

Not only does Facebook Prophet fail across any reasonable set of time-series data, it also underperforms most other forecasting algorithms, such as ARIMA, TBATS and many other time-proven model families. Most curiously, Facebook Prophet also underperforms on the very time-series patterns it was expressly designed for, including time series with seasonality.

More importantly, as explained in several posts, these issues cannot be rectified, because the flaws are inherent in Prophet's design itself.

Let's discuss it in more detail to understand what is meant by that and how (if) Neural Prophet addresses the shortcomings of its ancestor Facebook Prophet.

Arguably one of the main shortcomings of Facebook Prophet is its rigid parametric GLM (generalised linear model) structure. It is a 'curve-fitter' model that is not adaptive enough to local patterns.

One of the improvements in Neural Prophet is the inclusion of auto-regressive terms. The Yule-Walker equations are covered in any good econometrics textbook. What is less well known is that Yule published his paper in 1927 (95 years ago), and Walker published his in 1931.

The inclusion of auto-regressive terms for modelling time series has been around for almost 100 years because it works.
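For readers who have not seen them, here is a minimal sketch of how the Yule-Walker equations turn sample autocovariances into AR coefficients. The function name `yule_walker` and the use of biased autocovariance estimates are my own illustrative choices; this is not how Neural Prophet fits its AR terms (it trains them by gradient descent).

```python
import numpy as np

def yule_walker(x, order):
    """Estimate AR(order) coefficients by solving the Yule-Walker equations."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()                      # work with the demeaned series
    n = len(x)
    # Biased sample autocovariances r[0], ..., r[order]
    r = np.array([np.dot(x[:n - k], x[k:]) / n for k in range(order + 1)])
    # Toeplitz system R @ phi = r[1:], where R[i, j] = r[|i - j|]
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    phi = np.linalg.solve(R, r[1:])
    # Implied innovation (noise) variance
    sigma2 = r[0] - np.dot(phi, r[1:])
    return phi, sigma2
```

On a long simulated AR(2) series, the estimated `phi` should land close to the coefficients used to generate the data.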

Most winning solutions on Kaggle include auto-regressive terms (lags). The same goes for most of the top AutoML platforms and any successful time-series model, whether econometrics, machine or deep learning.

There are some exceptions, though, Facebook Prophet among them: it doesn't work because it can't model local patterns, since its creators 'wisely' decided that auto-regressive terms don't matter. The result is one of the worst forecasting algorithms of the 21st century. The moral of the story: it often pays to learn what people did before and why they did it.

But this article is not about Facebook Prophet as such; it is about further exploring Neural Prophet.

In the first part of this series, 'Benchmarking Neural Prophet. Part I — Neural Prophet vs Facebook Prophet', we looked at how Neural Prophet performs on the Solar energy dataset and concluded that:

Conclusion: AR terms are crucial.

We will now use the 'Electricity' dataset from UCI. This dataset has been used in many recent research papers, including many deep learning papers, making it a good candidate for benchmarking Neural Prophet.

The UCI 'Electricity' dataset used in many recent forecasting papers is flawed

Upon doing EDA and plotting the dataset, it immediately becomes clear that this dataset has serious issues. In particular, the data for 2011 appears incomplete, so the dataset cannot be used in full because the data was not collected properly [TL;DR, a lot of data in 2011 is missing].

In addition, the downward spikes need to be processed (these are artefacts, as they correspond to the clock changes for summer time).
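The exact processing is not the point of this post, but one plausible way to handle such artefacts is to flag points that fall far below their local median and interpolate over them. The sketch below is an assumption on my part: the helper name `repair_downward_spikes`, the 7-point window and the 0.5 threshold are all hypothetical and would need tuning on the real data.

```python
import pandas as pd

def repair_downward_spikes(series, window=7, threshold=0.5):
    """Replace values far below the local median with linear interpolation.

    window and threshold are illustrative defaults, not tuned values.
    """
    # Centred rolling median as a robust estimate of the local level
    local_median = series.rolling(window, center=True, min_periods=1).median()
    # A point counts as a downward spike if it is below threshold * local level
    is_spike = series < threshold * local_median
    # Blank out the spikes and fill them from their neighbours
    return series.mask(is_spike).interpolate(method="linear", limit_direction="both")
```

Only the flagged points are changed; every other observation is passed through untouched.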

Cleaned up 'Electricity' data resampled to monthly data

We will use clean data with 2011 removed, artefacts processed, and the data resampled to a monthly frequency. This data is then split 90–10 to test the performance of forecasting algorithms out-of-sample.
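For time series, the 90–10 split has to respect time order: the first 90% of observations form the training set and the last 10% the test set, with no shuffling. A minimal sketch, assuming the observations are already sorted by time (the helper name `chronological_split` is mine):

```python
def chronological_split(series, train_frac=0.9):
    """Split a time-ordered sequence into train and test parts without shuffling."""
    if not 0 < train_frac < 1:
        raise ValueError("train_frac must be strictly between 0 and 1")
    cut = int(len(series) * train_frac)   # index of the first test observation
    return series[:cut], series[cut:]
```

For a pandas DataFrame, the same idea would be written with `.iloc[:cut]` and `.iloc[cut:]`.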

Again, similar to Part I, Neural Prophet without AR (autoregressive terms) performs abysmally. This is as expected, as Neural Prophet without AR terms is essentially Facebook Prophet. There is nothing new here apart from yet another experiment confirming what we already knew: that Facebook Prophet is generally a non-performant forecasting algorithm.

The fit on the training data looks like this.

Neural Prophet without auto-regressive terms on the training data

And here are out-of-sample predictions vis-a-vis actual data (even worse, but again nothing unexpected), as without AR terms, Neural Prophet is essentially Facebook Prophet.

Neural Prophet without auto-regressive terms on test data

Now we train Neural Prophet with linear AR terms. Since AR terms are crucial, we obtain a better fit and better predictions on the test dataset.

Neural Prophet with auto-regressive terms on training data

Now with auto-regressive terms included, the training fit looks much better.

And the predictions on the test dataset look better as well.

Neural Prophet with auto-regressive terms on test data

Let's magnify predictions to include only the last month.

Neural Prophet with auto-regressive terms on test data, the last 30-days only.

Before we move on to 'AR-Net' (Neural Prophet with non-linear AR terms modelled by neural network), let's save the performance so that we can compare it with errors obtained from 'AR-Net' later. To avoid doubt, as this is quite important, these are out-of-sample errors on the test set.

Metrics on the test set, linear autoregressive terms.
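The two error metrics compared in this post are MAE and RMSE. As a reminder of what they measure, here is a minimal sketch of both computed on held-out actuals and predictions; the function names are mine, and in practice the forecasting library's own metric output can be used instead.

```python
import numpy as np

def mae(actual, predicted):
    """Mean absolute error: average size of the errors."""
    a, p = np.asarray(actual, dtype=float), np.asarray(predicted, dtype=float)
    return float(np.mean(np.abs(a - p)))

def rmse(actual, predicted):
    """Root mean squared error: penalises large errors more than MAE does."""
    a, p = np.asarray(actual, dtype=float), np.asarray(predicted, dtype=float)
    return float(np.sqrt(np.mean((a - p) ** 2)))
```

Both should be computed on the test set only, using exactly the same test observations for every model being compared.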

Now we can increase model capacity by activating the 'AR-Net' part of Neural Prophet. AR-Net is called the USP (unique selling proposition) of Neural Prophet, and its success (or not) hinges on whether it can deliver on it. Why, one might ask? Well, because ARIMA is a tried and tested tool, it can handle linear autoregressive terms much better than Neural Prophet.

We use 30-day lags and three hidden layers with 16 neurons in every hidden layer.
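To get a feel for how much capacity this adds, one can count trainable parameters. The sketch below assumes plain fully connected layers with a bias on every layer and a single-step output; Neural Prophet's actual AR-Net may differ in these details, so treat the numbers as a rough order of magnitude.

```python
def count_dense_params(n_inputs, hidden_sizes, n_outputs=1):
    """Count weights and biases in a fully connected network."""
    sizes = [n_inputs, *hidden_sizes, n_outputs]
    # Each layer has fan_in * fan_out weights plus fan_out biases
    return sum(fan_in * fan_out + fan_out
               for fan_in, fan_out in zip(sizes[:-1], sizes[1:]))

linear_ar = count_dense_params(30, [])          # 30 lag weights + 1 bias = 31
ar_net = count_dense_params(30, [16, 16, 16])   # 496 + 272 + 272 + 17 = 1057
```

Under these assumptions, the non-linear version has roughly 34 times as many parameters as the linear one, fitted to the same training series.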

Here are the results:

Metrics on the test set, AR-Net.

MAE — 1305; RMSE — 1849. Both errors are now higher than when using linear auto-regressive terms.

Let's plot training set fit and test set predictions vis-a-vis actuals.

Neural Prophet with AR-Net on training data
Neural Prophet with AR-Net on test data

Frankly, I can't see much difference in the visual fit, but we know that the errors went up, so AR-Net's predictions are worse than those from linear auto-regressive terms.

Zooming in on the last month again, the prediction is slightly different but the error is higher.

Neural Prophet with AR-Net on test data, the last 30-days only

Let's compare components:

Linear AR terms.

Neural Prophet with linear auto-regressive terms components.


Neural Prophet with AR-Net components.

Well, I don't know about you, but I would go for the first one (other things being equal) on the basis that:

  1. It is a parsimonious model (Occam's razor principle).
  2. This model is easier to understand and explain.
  3. The AR-Net components look like "A little bit of everything is a whole lot of nothing."

How does one even reconcile AR-Net to physical and economic realities and energy consumption patterns? But again, this is deep learning territory that is difficult to interpret. MIT recently published a paper about convolutional neural networks classifying pictures based on 'pixel dust.'

Conclusion 1: AR terms are crucial. Neural Prophet adds value vis-a-vis Facebook Prophet, but only when auto-regressive terms are included. But, when it comes to forecasting, the concept of value is always relative (relative to "what?"). There are certainly other models that also include AR terms, in particular ARIMA and DeepAR.

Conclusion 2: The deep learning-based AR-Net does not seem to add value; if anything, it seems to destroy it. It also results in 'foggy' AR terms that are not interpretable.

Conclusion 3: As Neural Prophet without AR terms is essentially Facebook Prophet, there is no need to install or use Facebook Prophet. It is redundant.



Valeriy Manokhin, PhD, MBA, CQF

Principal Data Scientist, PhD in Machine Learning, creator of Awesome Conformal Prediction