Valeriy Manokhin, PhD, MBA, CQF
5 min read · Dec 8, 2021


How to use machine learning to forecast time series (or a retrospective on the results of the Kaggle M5 forecasting competition).

For more than a year I closely tracked the Kaggle M5 forecasting competition: https://www.kaggle.com/c/m5-forecasting-accuracy

TL;DR — after four decades of applied forecasting academia claiming that ‘simple forecasting methods work best’, culminating in the M4 forecasting competition (which attracted hardly any participants from the machine learning community, yet whose two top solutions utilised machine learning), the organisers doubled down on the fanciful claim that it was ‘still not clear’ whether machine learning is a better approach to time series forecasting than simple methods such as exponential smoothing.

Roll forward a couple of years from the end of the M4 competition: this time the organisers put the next competition, M5, on Kaggle, which exposed the ‘M-competitions’ for the first time in their history to a large community of machine learning experts.

The result: a spectacular crash of the long-prevailing academic forecasting dogma and clear evidence (ALL of the top winning solutions used machine learning) that machine learning is the future of time-series forecasting.

What have we learnt since then? Quite a few things — one of them being that the winners of the M5 point forecasting track did not manage to convert their gains into winnings in the probabilistic forecasting track.

Why was this the case, one might ask? And does this indicate a general result — namely, should we assume that it is futile to aim for both excellent point forecasts AND excellent probabilistic forecasts?

Whilst it is still very early days for the adoption of probabilistic forecasting by companies, owing to the complexity of the field and to the fact that applied forecasting academia, after half a century in the business of business forecasting, has unfortunately produced no well-performing methods, what we do know is this:

- Probabilistic prediction is a ‘skills squared’ hard problem. It is hard enough to produce accurate point forecasts; it is even harder to produce point forecasts that are also accurate probabilistic forecasts.

  • Machine learning for time series is still rather a craft. Yes, there are some high-performing models such as N-BEATS, but at the more granular level of forecasting at SKU/store level it is still a hard problem requiring a lot of expert à-la-carte work: creating bespoke features, designing correct time-series-based validation, selecting correct metrics, etc.

[Image: a time series expert at work creating an à-la-carte state-of-the-art forecasting model]

- Even if someone designed a good point forecast model (e.g. the student from South Korea who used and tweaked the publicly posted LightGBM approach for many months, essentially the same as most of the other winners), it takes an inordinate amount of time to create a well-performing probabilistic model on top of an accurate point forecast model. Even with a lot of knowledge shared publicly during the M5 competition on Kaggle for the benefit of all participants (what metric to use, what model type to use, etc.), the winning teams spent months designing, building, testing and tweaking their models further.
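To illustrate what ‘correct time-series-based validation’ means in practice (a generic sketch, not the winners’ actual validation setup), scikit-learn’s TimeSeriesSplit produces folds that respect temporal order, so the model is never validated on data that precedes its training window:

```python
# Illustrative time-series cross-validation: each fold trains on an
# expanding window of past observations and validates on the block that
# immediately follows it, preventing leakage from the future.
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

y = np.arange(24)  # 24 observations (e.g. monthly sales), in time order

tscv = TimeSeriesSplit(n_splits=4, test_size=3)
for fold, (train_idx, val_idx) in enumerate(tscv.split(y)):
    # Training always ends strictly before validation begins.
    assert train_idx.max() < val_idx.min()
    print(f"fold {fold}: train up to t={train_idx.max()}, "
          f"validate t={val_idx.min()}..{val_idx.max()}")
```

Random shuffled cross-validation, by contrast, would let the model peek at future observations and overstate its accuracy.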

The data released by Walmart for the Kaggle M5 competition was a very limited subset, containing data from only 10 stores in three US states. To put things into perspective: thousands of people on Kaggle spent hundreds of thousands of hours solving the problem on this very limited dataset. If Walmart released all their data instead of just 10 stores, it would probably take the best and brightest a millennium of working 24/7 to deliver state-of-the-art forecasts across all of Walmart’s SKUs (Walmart has to create forecasts for over a billion time series).

One can only conclude that there is zero likelihood Walmart would ever use the Kaggle M5 solutions in their production forecasting systems, just as Netflix and later Zillow learnt that winning a Kaggle competition doesn’t mean a solution will work in reality.

[NB: this is not a Kaggle-bashing post. Kaggle has a lot of value, and by its own admission Walmart learnt a lot from the machine learning approaches used in the Kaggle M5 competition and will use those learnings to augment its systems.]

- Even if someone built a very good point forecasting model, it would not be feasible to design another à-la-carte probabilistic forecast on top of it, because 1) it would be prohibitively expensive in terms of time, 2) skills for accurate point forecasting are not transferable into the expert à-la-carte skills needed for probabilistic forecasting, and 3) fiddling with the point forecasting model would most likely impair its point performance, so the company building the model would have to go back to square one.

How to solve this, then? Does a common-sense solution even exist? It turns out that it does.

If one has already worked hard to build an excellent point forecasting model, conformal prediction can augment it with probabilistic prediction intervals, making the model:

1) well calibrated — if one selects 95% (or 99%) confidence, that is exactly the coverage a #conformalprediction-driven model will deliver by default. No ifs, no buts and no excuses: the probabilistic model will be calibrated by default thanks to in-built mathematical guarantees.

2) non-invasive — the second layer that creates the probabilistic model won’t impair the point forecasting model’s performance. Data scientists who spent months creating excellent point forecasting models, rejoice: you won’t have to go back to the whiteboard.

3) parameter-free [hence it does not suffer from bias]. It is well known, from both the academic and the practical perspective, that parametric statistical models suffer from bias and produce overconfident prediction intervals.

What one might think is a prediction interval covering 95% of the data is in reality not covering 95% [I recently wrote an article on Medium illustrating how Facebook Prophet fails to deliver calibrated prediction intervals]:

https://valeman.medium.com/benchmarking-facebook-prophet-53273c3ee9c6

4) free of distributional assumptions about the data. Unlike applied forecasting, machine learning does not rely on distributional assumptions, and conformal prediction follows the same approach. After all, as they say, assumptions are … let’s ask Google.

[Image: the road to forecasting bias is paved with good old parametric assumptions]

5) equipped with in-built guarantees of validity in finite samples. If one asks for 95%, it will produce 95% — no ifs, no buts and no excuses.
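To make the mechanics concrete, here is a minimal sketch of split (inductive) conformal prediction wrapped around an arbitrary point model. The synthetic data, the choice of scikit-learn’s GradientBoostingRegressor as the point model, and the split sizes are all illustrative assumptions, not a prescription:

```python
# Split conformal prediction: calibrate absolute residuals of any point
# forecaster on held-out data, then widen each point forecast by the
# finite-sample-corrected quantile of those residuals.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(42)

# Toy regression data standing in for lag/calendar features of a series.
X = rng.normal(size=(2000, 5))
y = X @ rng.normal(size=5) + rng.normal(scale=0.5, size=2000)

# Three disjoint parts: fit the point model, calibrate, then evaluate.
X_train, y_train = X[:1000], y[:1000]
X_cal, y_cal = X[1000:1500], y[1000:1500]
X_test, y_test = X[1500:], y[1500:]

model = GradientBoostingRegressor().fit(X_train, y_train)

# Nonconformity scores: absolute residuals on the calibration set.
scores = np.abs(y_cal - model.predict(X_cal))

# Conformal quantile with the (n + 1) finite-sample correction,
# guaranteeing at least (1 - alpha) marginal coverage.
alpha = 0.05
n = len(scores)
k = int(np.ceil((n + 1) * (1 - alpha)))  # rank of the quantile
q = np.sort(scores)[k - 1]

# Intervals are the untouched point forecasts plus/minus q.
preds = model.predict(X_test)
lower, upper = preds - q, preds + q

coverage = np.mean((y_test >= lower) & (y_test <= upper))
print(f"Empirical coverage: {coverage:.3f}")
```

Note that the point forecasts themselves are never modified: the conformal layer only adds the interval around them, which is exactly the non-invasive property described above.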

Want to learn about better ways to forecast time series? Follow me on Medium, LinkedIn and Twitter.

#timeseries #machinelearning #forecasting



Written by Valeriy Manokhin, PhD, MBA, CQF

Principal Data Scientist, PhD in Machine Learning, creator of Awesome Conformal Prediction
