Benchmarking Neural Prophet. Part I — Neural Prophet vs Facebook Prophet.
In 2020–2021, I wrote many LinkedIn posts explaining that Facebook Prophet is a non-performing forecasting algorithm that not only fails to work across any reasonable set of time-series datasets, but also underperforms most other forecasting algorithms.
2022 update: Meta has walked back the claims made by the original Facebook Prophet development team, including grotesque claims such as 'anyone can achieve forecasting performance on par with human experts by using Facebook Prophet.'
As I mentioned in my interview with Analytics India Magazine (see 'Facebook Prophet falls out of favour'), Facebook Prophet's credibility and popularity have taken a severe hit. I have previously pointed out that recent time-series papers no longer use Facebook Prophet as a baseline, as it does not perform well across any general forecasting task.
More importantly, as explained in several posts, these issues cannot be rectified: they stem from pathological flaws inherent in Facebook Prophet's design itself.
Now 'NeuralProphet' has been launched, trumpeted by the new dev team with great fanfare: "We introduce NeuralProphet, a successor to Facebook Prophet, which set an industry standard for explainable, scalable, and user-friendly forecasting frameworks".
I did not realise that Facebook Prophet set any standard other than one of generally terrible forecasting performance, but let's not spoil the show.
The launch of 'NeuralProphet' caused a strange sense of déjà vu, reminiscent of when the original Facebook Prophet devs claimed that 'anyone can obtain excellent performance on par with human experts by using Facebook Prophet', whilst the Facebook Prophet paper did not properly benchmark the algorithm on any datasets beyond an internal Facebook dataset, or indeed against any other algorithms.
Fast forward 2+ years: many scientific papers, articles and social media posts have demonstrated that Facebook Prophet does not work well compared to other time-series forecasting algorithms. It does not generalise to diverse datasets, and it does not even work well on the data it was expressly designed for: data with trend and seasonality.
Coming back to the new incarnation of the 'prophet': the NeuralProphet dev team recently posted a paper on arXiv claiming there is a need for hybrid solutions to bridge the gap between interpretable classical methods and scalable deep learning methods. However, this claim is not backed by any scientific evidence. The results of the M5 competition demonstrated that data-driven machine learning methods outperformed both the simple and the hybrid methods.
No 'hybrid methods' were seen near the top of the M5 forecasting competition leaderboard, and the creators of the two winning 'hybrids' from the M4 forecasting competition (Slawek Smyl and the team from Monash, who took places #1 and #2) did not win any top spots in the M5 competition either. Instead, the M5 competition was won by a variety of LightGBM-based methods of the kind that have dominated Kaggle contests for a long time.
According to the claims from the NeuralProphet development team:
'Otherwise, NeuralProphet retains the design philosophy of Prophet and provides the same basic model components. Our results demonstrate that NeuralProphet produces interpretable forecast components of equivalent or superior quality to Prophet on a set of generated time series. NeuralProphet outperforms Prophet on a diverse collection of real-world datasets. For short to medium-term forecasts, NeuralProphet improves forecast accuracy by 55 to 92 percent.'
We already know that Facebook Prophet is a low-performance forecasting algorithm of terrible quality that is outperformed by many other algorithms, including on datasets where Facebook Prophet is supposed to work. So if anything, the new paper by the NeuralProphet dev team confirms that Facebook Prophet is terrible, by pointing out that it is outperformed by 55 to 92 per cent by what is conceptually the same type of algorithm [we will talk about the differences later in this article].
But the NeuralProphet paper does not tell us anything about whether NeuralProphet is any good at all in comparison with the many other algorithms available, as it simply does not benchmark NeuralProphet against anything other than… Facebook Prophet.
Why did the paper's authors not include any benchmarks, as is standard in almost any other paper introducing a new forecasting algorithm? After all, who needs another algorithm unless it is proven to perform well against at least a core set of already available methods?
So let's start the journey to see how good (if at all) 'NeuralProphet' really is.
In the first part of the series, we will take NeuralProphet out of the garage, open the bonnet (hood), and kick the tires.
To begin, we will use the same dataset that the NeuralProphet developer team has used: https://neuralprophet.com/html/energy_solar_pv.html
We first do the same experiments here to ensure reproducibility and compare apples with apples.
First Neural Prophet model — no AR terms included (this is broadly the same model as the original Facebook Prophet, which also has no AR terms, so a priori a flawed model, but let's see how it goes anyway).
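Below is a minimal sketch of this baseline setup. It assumes an hourly dataframe df with the Prophet-style 'ds' (timestamp) and 'y' (target) columns; exact argument names may vary slightly between NeuralProphet versions.

from neuralprophet import NeuralProphet

# Baseline: no n_lags, hence no autoregression; broadly the original
# Facebook Prophet decomposition (trend + seasonalities only).
m = NeuralProphet(yearly_seasonality=True, weekly_seasonality=True,
                  daily_seasonality=True)
df_train, df_test = m.split_df(df, valid_p=0.1)
metrics = m.fit(df_train, freq="H")  # assuming hourly data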
As seen in the plot below, without AR (autoregressive) terms the fit is quite poor, even in-sample. This is in line with current domain knowledge: without the AR terms, Neural Prophet is essentially the original Facebook Prophet.
If anything, the new Neural Prophet model is yet another confirmation that Facebook Prophet did not manage to reproduce even the data with seasonalities for which it was expressly designed.
Predictions on the test set, one week and one day ahead, are also terrible.
Let's zoom in to see what happens with predictions one day (24 hours) ahead: Neural Prophet without AR terms fails to capture even this simple pattern. The RMSE for the training set is 118, and for the test set it is 143.
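For reference, these RMSE figures can be reproduced directly from the forecast dataframe. A minimal sketch, assuming NeuralProphet's standard output column name 'yhat1' for the one-step-ahead forecast:

import numpy as np

# Compare actuals ('y') with the 1-step-ahead forecast ('yhat1').
# np.nanmean skips any leading rows that are NaN when lags are used.
forecast = m.predict(df_test)
rmse = np.sqrt(np.nanmean((forecast['y'] - forecast['yhat1']) ** 2))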
So we can reach our first conclusion: without AR terms, Neural Prophet = Facebook Prophet = a generally useless forecasting model that is not fit for purpose.
Second Neural Prophet model — linear AR terms added.
We next fit Neural Prophet with the AR terms included, using the same parameters as on the Neural Prophet website (n_lags = 3*24).
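A minimal sketch of this linear-AR configuration, under the same assumptions as before:

# n_lags=3*24 adds autoregression over the previous 72 hours; with no
# hidden layers configured, the AR component remains linear.
m = NeuralProphet(n_lags=3*24)
df_train, df_test = m.split_df(df, valid_p=0.1)
metrics = m.fit(df_train, freq="H")  # assuming hourly data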
The in-sample fit is now much better. With the AR terms, Neural Prophet can capture the dynamics of the time series much more closely.
The RMSE for the training set is now 53, and for the test set it is around 31.
Much better than Neural Prophet without AR terms; however, just like the original Prophet, Neural Prophet seems unable to constrain radiation forecasts from going below zero…
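One pragmatic workaround, not part of the original exercise, is to clip the forecasts at zero in post-processing, since solar radiation cannot be negative:

# Post-processing guard: solar radiation is non-negative, so clip the
# forecast column ('yhat1' in NeuralProphet's output) at zero.
forecast['yhat1'] = forecast['yhat1'].clip(lower=0)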
Let's try one final architecture from the Neural Prophet website before we kick the tires and open the hood to check the engine.
Third (and last) Neural Prophet model, using the (non-linear) AR-Net.
One-step-ahead forecast with AR-Net, using a neural network with several hidden layers. We use the tuned learning rate of 0.003, as in the final version of the exercise on the Neural Prophet website.
m = NeuralProphet(growth='off', yearly_seasonality=False,
                  weekly_seasonality=False, daily_seasonality=False,
                  n_lags=3*24,          # 72 hourly lags
                  num_hidden_layers=4,  # AR-Net: 4 hidden layers
                  d_hidden=16,          # 16 units per hidden layer
                  learning_rate=0.003)
We train Neural Prophet by providing it with the training set.
# Note: do not re-instantiate NeuralProphet here; that would silently
# discard the AR-Net configuration above. We reuse the model m as configured.
df_train, df_test = m.split_df(df, valid_p=0.1)
train_metrics = m.fit(df_train, freq="H")  # assuming hourly data
test_metrics = m.test(df_test)
I have to say the plots look much nicer now, but never let in-sample pictures charm you. What matters is the out-of-sample performance.
The RMSE for the training set is now 39, and for the test set it is around 31.
We note that, whilst the training error (RMSE) has come down from 53 to 39 thanks to the much higher model capacity (a 4-layer DNN instead of a linear function), the test error did not change. So in this case, for this particular dataset, AR-Net has provided no out-of-sample improvement over linear AR terms.
Conclusion 1: AR terms are crucial. Neural Prophet only adds value vis-à-vis Facebook Prophet when the autoregressive terms are switched on.
On the other hand, other models also include AR terms, in particular the ARIMA family.
Conclusion 2: AR-Net does not seem to add much value once linear AR terms have been included. Whether this holds only for this dataset or is a more general result remains to be seen; but if it is a general observation, then Neural Prophet offers nothing new over the ARIMA / SARIMA model family.
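For context, here is a minimal sketch of what such a classical linear-AR baseline looks like with statsmodels' SARIMAX; the order is chosen purely for illustration and is not tuned:

from statsmodels.tsa.statespace.sarimax import SARIMAX

# Illustrative AR baseline on the same hourly series: a short linear AR
# structure plus a daily (24-hour) seasonal AR term. Orders are not tuned.
model = SARIMAX(df_train['y'], order=(3, 0, 0), seasonal_order=(1, 0, 0, 24))
result = model.fit(disp=False)
pred = result.forecast(steps=24)  # one day (24 hours) ahead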
Let's plot the predictions of the final Neural Prophet model (AR-Net with 4 hidden layers) on the test set.
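A sketch of producing that plot with NeuralProphet's built-in helpers, assuming the AR-Net model m trained above:

# Predict on the held-out set and use NeuralProphet's plotting helpers.
forecast = m.predict(df_test)
fig_forecast = m.plot(forecast)               # forecast vs actuals
fig_components = m.plot_components(forecast)  # trend / seasonality / AR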
So far, so good: the tires did not fall off when we kicked them, but the engine needed some oil and tuning. In our next article, "Benchmarking Neural Prophet. Part II — exploring the electricity dataset", we take our Neural Prophet model for a ride, check what else is on the road, and see whether it can stay in its lane or needs to move over to the slow lane.
To be continued…
References: