Transformers Are What You Do Not Need
In my previous article, “Deep Learning Is What You Do Not Need”, I argued that deep learning for time series is not a solution most companies even need to consider. In this article, we delve deeper into further reasons why transformers are not an effective solution for time series.
2024 update: the GitHub repo Transformers_Are_What_You_Dont_Need is a collection of resources and models showing why transformers might not be the answer for time series forecasting, and showcasing the best SOTA non-transformer models.
The article was published in June 2022 and was a rather prescient call. Since then, many scientific papers have shown that transformers are not the answer for time series forecasting.
In August 2022, a compelling research paper, “Are Transformers Effective for Time Series Forecasting?”, was published by a collaborative team of researchers from the Chinese University of Hong Kong and the International Digital Economy Academy (IDEA). This paper emerged as a significant contribution to the ongoing discourse in the field of time series forecasting, particularly regarding the usage of Transformer-based models.
The study was conducted in response to a notable increase in the proposed solutions that utilised Transformers for long-term time series forecasting (LTSF). This trend in the research community has gained substantial momentum, with more and more researchers turning to Transformer-based models in their attempts to improve forecasting accuracy.
However, the researchers from the Chinese University of Hong Kong and IDEA questioned the validity of this growing research direction. They raised concerns about the efficacy of Transformers in the specific context of LTSF and challenged the assumption that these models were inherently suited to this task. Their work served as a critique and a call for reevaluation, urging the research community to scrutinise the evidence supporting the widespread adoption of Transformer models in time series forecasting.
Let’s consider the limitations of Transformers in Time Series Forecasting.
Temporal Information Loss
Transformers have been a breakthrough in many areas of machine learning, notably in natural language processing. However, when applied to time series forecasting, one of the significant issues is the loss of temporal information due to the permutation-invariant self-attention mechanism of Transformers.
In a Transformer model, the self-attention mechanism allows each token in the input sequence to interact with every other token, thereby understanding their dependencies and relationships. While this mechanism is highly effective for tasks like machine translation, where the meaning of a sentence is largely preserved even if some of its words are reordered, it becomes problematic when applied to time series data.
Time series data is fundamentally sequential, where the order of data points carries significant meaning. A key characteristic of time series is that it’s temporal — each data point is intrinsically linked to its position in time. This is where the permutation-invariant property of the self-attention mechanism becomes a problem. Permutation-invariance means the model treats the sequence as a set, ignoring the order of elements, which leads to the loss of temporal information. This property allows the model to give the same attention values even when the positions of data points are switched around.
Although Transformers try to preserve some order information by employing positional encoding and using tokens to embed sub-series, the self-attention mechanism's inherent permutation-invariance inevitably leads to temporal information loss. This can hinder the model’s ability to capture the temporal relationships in the data, which are often crucial for effective time series forecasting.
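This property is easy to verify in a few lines. The sketch below is a toy single-head attention in NumPy with no positional encoding (my illustration, not the paper's code): permuting the input merely permutes the output rows, showing that the attention mechanism itself sees the series as an unordered set.

```python
import numpy as np

def self_attention(x):
    # x: (seq_len, d_model); single head, no positional encoding
    scores = x @ x.T / np.sqrt(x.shape[1])
    # row-wise softmax over attention scores
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ x

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 4))   # a short "time series" of 8 steps
perm = rng.permutation(8)     # a random reordering of the steps

out = self_attention(x)
out_perm = self_attention(x[perm])

# Permuting the input only permutes the output rows: without positional
# information, attention carries no notion of temporal order.
print(np.allclose(out[perm], out_perm))  # True
```

Any order information a Transformer retains therefore has to come from the positional encodings, not from the attention mechanism itself.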
To showcase the issue of temporal information loss in Transformers, the researchers devised a clever experiment. They decided to scramble the data randomly, then sat back and watched to see how the Transformers would react. In an ideal world, this act of scrambling the data sequence would cause a significant drop in performance for any competent time series forecasting model. However, the Transformers seemed unfazed. Their performance remained pretty much the same, displaying a startling indifference to the order of data, which indicates that they were capturing only limited temporal relations.
The plot thickens when we bring into play a simple linear model dubbed LTSF-Linear. When put to the same test, this humble model, which had been outperforming the Transformers elsewhere, saw its performance drop significantly. Unlike the Transformers, LTSF-Linear responded exactly as you’d expect a well-functioning forecasting model to react when faced with shuffled data. This stark contrast between the Transformers and LTSF-Linear further underscored the limitations of the former when it comes to time series forecasting.
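That order-sensitivity is no mystery: a linear model applies fixed per-position weights to the look-back window, so scrambling the window changes the forecast. A toy illustration with hypothetical weights (not the paper's experiment):

```python
import numpy as np

# Hypothetical weights of a trained one-layer linear forecaster; any fixed
# per-position weighting of the look-back window is order-sensitive.
rng = np.random.default_rng(1)
w = rng.normal(size=36)
window = np.sin(np.linspace(0, 4 * np.pi, 36))  # one look-back window

pred = float(w @ window)
pred_shuffled = float(w @ rng.permutation(window))

# Shuffling the history changes the forecast, so a linear model's error
# degrades on scrambled data, exactly what the shuffle test expects.
print(abs(pred - pred_shuffled))
```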
Basic Linear Models can outperform sophisticated Transformers
In the study “Are Transformers Effective for Time Series Forecasting?”, the researchers made an intriguing revelation. They introduced a collection of simple one-layer linear models, christened LTSF-Linear, to challenge the Transformer-based models in the realm of time series forecasting. Contrary to what might be expected given the complexity and sophistication of Transformer models, these straightforward LTSF-Linear models displayed superior performance across various real-life datasets.
This surprising result highlighted a crucial fact: complexity does not always guarantee superior performance. The Transformer-based models, despite their advanced architecture, sophisticated self-attention mechanism, and widespread success in other areas such as natural language processing, faltered in their ability to accurately forecast time series data compared to the LTSF-Linear models.
The superior performance of LTSF-Linear models was not just marginal but was often by a large margin. This revelation raises questions about the effectiveness of Transformer models in time series forecasting and points to the need for more research and innovation in this area. The study also underscores the value of simple, linear models, which can offer an effective and efficient alternative to more complex models, especially for time series forecasting.
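To make this concrete, here is a minimal sketch of the LTSF-Linear idea: a single linear layer mapping a look-back window of length L directly to a forecast horizon of length H. For simplicity it is fitted with ordinary least squares on a synthetic sine series rather than with gradient descent, and the sizes are illustrative, not those of the paper.

```python
import numpy as np

# One linear layer: look-back window of length L -> forecast of length H
L, H = 96, 24
t = np.arange(2000)
series = np.sin(2 * np.pi * t / 50) + 0.1 * np.random.default_rng(0).normal(size=t.size)

# build (window, horizon) training pairs by sliding over the series
X = np.stack([series[i : i + L] for i in range(len(series) - L - H)])
Y = np.stack([series[i + L : i + L + H] for i in range(len(series) - L - H)])

W, *_ = np.linalg.lstsq(X, Y, rcond=None)   # (L, H) weight matrix
mse = np.mean((X @ W - Y) ** 2)
print(f"train MSE: {mse:.4f}")
```

That is the entire model: no attention, no embeddings, just one weight matrix per channel, which makes its strong benchmark results all the more striking.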
This observation is not limited to this particular paper; in my previous article on this topic, “Deep Learning Is What You Do Not Need”, I mentioned that in the MIT paper “FreDo: Frequency Domain-based Long-Term Time Series Forecasting”, a simple, almost mechanical benchmark outperformed transformers.
Failure to extract temporal relations from longer input sequences
In their quest to evaluate the effectiveness of Transformers in time series forecasting, researchers discovered some significant shortcomings. One of the critical findings was that Transformers struggled to extract temporal relations from long sequences effectively.
Temporal relation in time series forecasting refers to understanding the dependencies between different points in time. In other words, it’s about grasping how past data points influence future ones. This concept is paramount in time series forecasting, allowing models to identify patterns and trends that extend over time.
However, Transformers, which have achieved remarkable success in areas such as natural language processing, surprisingly seemed to falter in this aspect when applied to time series data. The researchers found that Transformers’ forecasting errors were not reduced, and in some cases even increased, as the look-back window size grew. The look-back window size refers to how far back in time the model looks to predict the future.
This discovery is counter-intuitive because, typically, a larger look-back window should provide more historical data, allowing the model to understand the temporal dynamics better and, thus, improve forecasting accuracy. However, the failure of Transformers to reduce forecasting errors with an increased look-back window size indicates that they could not utilise this additional historical data to improve their forecasting effectively.
This finding questions the ability of Transformer models to handle long-term dependencies in time series data, which is a critical aspect of time series forecasting.
Failure to adapt temporal attention
One of the strengths of Transformers is their ability to use attention mechanisms to learn the importance of different input elements when making a prediction. This is particularly useful in tasks such as natural language processing, where the context and meaning of words can vary widely based on their position in a sentence or their relationship with other words.
However, when applied to time series forecasting, a problem arises with this attention mechanism. Specifically, Transformers tend to generate indistinguishable temporal attentions for different series. This means that they assign similar importance to different points in time, regardless of the specific characteristics or behaviour of the series at these points. This can be a serious problem, especially when dealing with non-stationary real-world data, where the joint distribution changes over time.
For example, in a nonstationary time series, there may be periods of rapid change followed by periods of relative stability. A good forecasting model should be able to recognise and adapt to these changes, assigning different levels of importance to different periods based on their characteristics. However, if a Transformer assigns similar attention to all periods, it may fail to capture these changes effectively, thus hindering its predictive capability.
This issue, termed “over-stationarisation”, leads Transformers to generate indistinguishable temporal attentions for different series and impedes the predictive capability of deep models.
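A toy way to see why this can happen (my illustration, not the paper's experiment): per-series standardisation, commonly applied before the attention layers, can map very different series onto the same input, leaving attention nothing to distinguish them by.

```python
import numpy as np

t = np.linspace(0, 4 * np.pi, 64)
a = np.sin(t)                 # small, zero-centred series
b = 100 + 25 * np.sin(t)      # same shape, very different level and scale

# per-series standardisation, as done before feeding many deep models
za = (a - a.mean()) / a.std()
zb = (b - b.mean()) / b.std()

# Both series collapse onto the same normalised input, so any downstream
# attention over them is identical.
print(np.allclose(za, zb))  # True
```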
Outperformed by simpler models
The trajectory of Transformer models in time series forecasting is reminiscent of the path that Facebook’s Prophet model followed. Prophet’s developers at Facebook initially claimed that it could deliver forecasting performance on par with human experts, resulting in a swift rise in its popularity. Over time, however, these claims were moderated, and it is now widely acknowledged that Prophet is not the high-performing forecasting model that was initially suggested.
For context, Facebook’s Prophet model was developed for forecasting time-series data based on an additive model. It was designed to be easy and flexible, accommodating a wide range of potential inputs and providing interpretable parameters. Despite its initial popularity and success, Prophet faced criticism and was found to be outperformed by other models, including some simpler models, on various benchmarks.
In conclusion, the story of Facebook’s Prophet offers a cautionary tale for the trajectory of Transformer models in time series forecasting. While Transformers have shown promise due to their global range modelling ability and success in other domains, such as natural language processing, recent studies have highlighted their limitations in handling time series data effectively.
References: