Timeseries, Transformers, and Two Cultures
How Transformer-Based Neural Networks Could Predict Chaotic Timeseries
In the academic literature on statistical learning (“AI” if you prefer), there is a famous paper by Leo Breiman[1] about two cultures in statistics[2]. One culture assumes that observed data are generated by some stochastic process that can be modeled. The other culture assumes the data-generating process is unknown but that algorithms can make predictions about it, even if they cannot describe the causal process behind those predictions (i.e., a “black box”). At the time the paper was written, the first culture was in vogue – statisticians, economists, and social scientists focused on proposing models that could explain the world and then fitting the data to the model. These models offer interpretable statistical representations and causal theories of the world, but they often perform poorly at prediction, and empirical data frequently deviate from them.

The black-box approach to statistics was, at the time, less common. This second culture caught up, however, as large datasets (i.e., “big data”) and cheap compute made methods that had previously only been hypothesized to work practical. Competition from computer science departments also drove this culture forward. In most prediction and forecasting applications today, the state of the art relies on this second culture. Neural networks[3], in particular, ended up outperforming most other methods in many applications. As of this writing, a relatively simple neural network architecture, called the transformer, has yielded astounding prediction results on language, images, and video (so-called “unstructured” data). The applications of these predictions, in areas such as chatbots, self-driving cars, and video generation, have made the second culture dominant.
However, these impressive advances in deep learning have not yielded similar advances in forecasting chaotic systems. Examples of such systems include geopolitics, financial markets, and the weather. This category of prediction problem shares several characteristics: the data are structured and tabular, most often timeseries; the best-known models fit the data only weakly; the processes are non-stationary[4]; and human performance is poor, and not limited merely by speed, information, or computation. Neural networks, including transformers, have not transformed this field yet. Attempts at applying transformer-based models to timeseries have had mixed results. Amazon and Salesforce, for instance, recently released Chronos[5] and Moirai[6], respectively, both transformer-based foundation models. Interestingly, basic benchmarks show these complex models underperforming ensembles of standard econometric forecasting models[7]. Systematic evidence from the M-Competitions shows that linear models continue to perform well and that tree-based models generally win the competitions[8].
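To make that kind of comparison concrete, here is a minimal sketch of such a benchmark on a toy monthly series: a seasonal-naive baseline, a Holt-Winters (ETS) model from statsmodels, and a simple average of the two as the “ensemble”. The `foundation_model_forecast` stub is a hypothetical placeholder, not the actual Chronos or Moirai API.

```python
# Minimal benchmark sketch: classical baselines vs. a (hypothetical) pretrained
# timeseries foundation model on a toy monthly series. The foundation-model
# call is a placeholder, not the actual Chronos/Moirai API.
import numpy as np
from statsmodels.tsa.holtwinters import ExponentialSmoothing

rng = np.random.default_rng(0)
t = np.arange(240)                                   # 20 years of monthly data
y = 0.05 * t + 2.0 * np.sin(2 * np.pi * t / 12) + rng.normal(0, 0.5, t.size)
h = 12                                               # 12-month forecast horizon
train, test = y[:-h], y[-h:]

def seasonal_naive(train, h, m=12):
    """Repeat the last observed seasonal cycle."""
    return np.tile(train[-m:], int(np.ceil(h / m)))[:h]

def ets_forecast(train, h, m=12):
    """Additive Holt-Winters: a standard econometric workhorse."""
    fit = ExponentialSmoothing(train, trend="add", seasonal="add",
                               seasonal_periods=m).fit()
    return fit.forecast(h)

def foundation_model_forecast(train, h):
    """Placeholder for a pretrained transformer forecaster (hypothetical)."""
    raise NotImplementedError("plug in Chronos, Moirai, or similar here")

mae = lambda forecast: np.mean(np.abs(test - forecast))
ensemble = 0.5 * seasonal_naive(train, h) + 0.5 * ets_forecast(train, h)
print("seasonal naive MAE:", round(mae(seasonal_naive(train, h)), 3))
print("ETS MAE:           ", round(mae(ets_forecast(train, h)), 3))
print("ensemble MAE:      ", round(mae(ensemble), 3))
```

The published benchmarks are of course far larger and more careful; the point of the sketch is only that the classical side of the comparison is cheap, simple, and hard to beat.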
I have heard three competing hypotheses as to whether and how transformer-based neural networks will eventually overcome this relative underperformance in chaotic system forecasting: first, that they will solve it simply given their rate of improvement; second, that not enough data exists for them to do so; and third, that they will solve chaotic forecasting indirectly, by helping practitioners source more specialized statistical methods.
The first hypothesis is the most straightforward. It proposes that, after future improvements, transformer-based models will begin to perform well on chaotic systems prediction. I do not have a deep enough intuition to really evaluate the merit of this hypothesis. All statistical learning approaches – deep learning, tree-based methods, hierarchical Bayes, and others – take the same basic approach of estimating a huge number of parameters and then regularizing them to prevent overfitting[9]. It is certainly conceivable that sufficiently large transformer-based models, with the right balance of parameter count and regularization penalty, could replicate other statistical approaches.
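A toy version of that shared skeleton, with illustrative numbers only: an autoregression given far more lags than it needs, fit with a ridge (L2) penalty chosen by cross-validation.

```python
# Toy illustration of "estimate many parameters, then regularize":
# an autoregression with far more lags than the data warrant, fit with an
# L2 (ridge) penalty chosen by cross-validation. Numbers are illustrative.
import numpy as np
from sklearn.linear_model import RidgeCV

rng = np.random.default_rng(1)
n = 400
y = np.zeros(n)
for i in range(2, n):                       # true process is just an AR(2)
    y[i] = 0.6 * y[i - 1] - 0.3 * y[i - 2] + rng.normal()

p = 100                                     # deliberately excessive lag count
X = np.column_stack([y[p - k - 1 : n - k - 1] for k in range(p)])
target = y[p:]

model = RidgeCV(alphas=np.logspace(-2, 3, 30)).fit(X, target)
print("chosen penalty:", model.alpha_)
print("three largest |coefficients|:", np.round(np.sort(np.abs(model.coef_))[-3:], 2))
# The penalty shrinks the ~100 estimated coefficients toward zero, leaving most
# of the weight on the couple of lags that actually matter.
```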
The second hypothesis is that we do not have enough data – or at least not enough observations of the events that matter – to forecast non-stationary systems, because of the timescales involved. This may sound counterintuitive: financial markets, for instance, generate terabytes of data a day from high-frequency activity, and other chaotic systems, such as astronomy or weather, produce some of the largest structured datasets in existence. The argument hinges on the possibility that certain stochastic processes differ across regimes, and those regimes take an extremely long time to observe. For example, if financial markets behave differently under recessions, it is surely problematic that only around 8 recessions have occurred since 1945[10]. The same can be said of major geopolitical events such as wars, climate patterns like global warming, or astronomical phenomena. The timescale at which events of interest occur makes it very difficult to capture sufficient observations. Related to this reasoning is the problem of reflexivity: the very act of successfully forecasting certain chaotic systems at scale will change those systems. For instance, if OpenAI released a model that could predict stock prices, market participants would immediately employ it, causing it to lose its predictive power. More generally, if participants in a chaotic system react to forecasts and adjust accordingly, top-level models fit on aggregate timeseries will not produce useful forecasts – the Lucas Critique[11] of economics. This hypothesis does imply, however, that micro-level models of the agents in a chaotic system (agent-based modeling[12]) might be a viable approach and, interestingly, might become possible with transformer models.
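Reflexivity can be caricatured in a few lines. The simulation below uses entirely invented dynamics: a drifting series is forecast, agents trade on the published forecast, and the act of trading removes the drift the forecast relied on.

```python
# Toy reflexivity sketch: a published forecast that agents trade on changes
# the very series it predicts. Dynamics are invented purely for illustration.
import numpy as np

rng = np.random.default_rng(2)
n, drift, noise = 500, 0.5, 0.2

def mean_forecast_error(react):
    """`react` = how strongly agents trade on a published forecast of
    tomorrow's move (0 = nobody reacts, 1 = the move is fully front-run)."""
    errors = []
    for _ in range(n):
        forecast_move = drift                       # model: prices drift up
        # Agents who believe the forecast buy today, pulling the predicted
        # move forward and leaving less of it to realize tomorrow.
        realized_move = drift - react * forecast_move + rng.normal(0, noise)
        errors.append(realized_move - forecast_move)
    return np.mean(errors)

print("mean error, nobody reacts:  ", round(mean_forecast_error(0.0), 3))
print("mean error, everyone reacts:", round(mean_forecast_error(1.0), 3))
# Once everyone reacts, the drift is arbitraged away and the forecast becomes
# biased by roughly the amount it once correctly predicted.
```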
A final proposition is that transformer-based models will not directly advance chaotic system forecasting, but will push the state of the art forward by “sourcing” appropriate methods. The number of effective statistical algorithms well suited to such problems far outpaces the private sector’s ability to apply them. Many of these algorithms belong to the first culture of statistics, but only produce empirically good predictions in very specific circumstances. As a result, they tend to be esoteric, difficult to parameterize, and dependent on domain knowledge, and there is a shortage of human capital aware of them and able to apply them. Large language models may well serve as “co-pilots” or “auto-pilots” that identify opportunities to apply highly specialized, non-transformer-based models to make forecasts.
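One possible shape of that workflow, sketched with the OpenAI Python client: describe the series and its quirks, and ask the model to nominate candidate first-culture methods for a human to vet. The series profile, prompt, and model name here are my assumptions, not a tested pipeline.

```python
# Sketch of an LLM as a method-sourcing "co-pilot": it does not forecast;
# it proposes specialized statistical models for a human to evaluate.
# The series profile, prompt, and model name are assumptions.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

series_profile = """\
- Monthly, 180 observations, strong annual seasonality
- Suspected structural break in 2020
- Fat-tailed residuals under a basic ARIMA fit
- Prediction intervals required, not just point forecasts
"""

prompt = (
    "You are a forecasting statistician. Given this timeseries profile, "
    "propose three specialized statistical models (no deep learning), explain "
    "when each is appropriate, and list the key parameters to set:\n"
    + series_profile
)

response = client.chat.completions.create(
    model="gpt-4o",  # placeholder model name
    messages=[{"role": "user", "content": prompt}],
)
print(response.choices[0].message.content)
```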
The somewhat amusing implication of this last hypothesis is that black box neural networks – the epitome of second culture statistics – may propose, develop, and test first culture models. The second culture will lead us back to the first.
On a side note, Breiman is probably under-appreciated, especially outside academic circles, for his contributions to machine learning. Many of those contributions underpin key concepts in AI/machine learning, and some of his statistical learning algorithms, such as Random Forests, remain mainstream.
Unfortunately, I will use the terms Neural Networks, Deep Learning, and Transformer-based models, along with their most common application, Large Language Models (LLMs), interchangeably.
A formal definition of non-stationarity.
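For reference, one standard formalization: a process is weakly stationary when its first two moments do not depend on time, and non-stationary when any of these conditions fails.

```latex
% Weak (covariance) stationarity of a process {y_t}: constant mean, constant
% finite variance, and autocovariance that depends only on the lag k.
\mathbb{E}[y_t] = \mu \quad \forall t, \qquad
\operatorname{Var}(y_t) = \sigma^2 < \infty \quad \forall t, \qquad
\operatorname{Cov}(y_t, y_{t-k}) = \gamma(k) \quad \forall t, k.
% A process is non-stationary when any of these fails
% (e.g. a random walk, whose variance grows with t).
```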
Andrew Gelman has an interesting discussion on this topic in the context of Breiman’s paper here.
I have written about LLM-based agent-based models here.