An Alternative Data Algorithm Example
In other words, what data science/statistics do we actually do at Cybersyn?
There is a lack of academic literature on the pragmatic challenges of working with real-time economic data. For example, it is not publicly known what algorithms hedge funds use to turn raw credit card data into something that predicts the economy.
While there has been some recent interest among economists in working with real-time economic data1, one downside of this literature is that it focuses on the economic policy conclusions, relegating the methodology for dealing with messy and imperfect data to an appendix. That methodology is often performed by graduate students or outsourced to software engineers or data providers, who make critical preprocessing decisions. The principal researchers are interested in the economic implications of the data; overcoming its shortcomings is just a roadblock.
Additionally, the challenges in such data are not interesting or innovative enough for these methodologies to be taken seriously by academic statisticians. They are complex but not mathematically interesting, using algebraic and statistical tools that a second-year undergraduate would easily grasp. Yet the decisions and processes used to clean and normalize such data are of pragmatic and commercial importance (and arguably, they may well invalidate any policy conclusion!).
Aladangady et al. provide a notable exception, dedicating significant discussion to this subject2. Fiserv3 provides merchant-centric consumer spending data to the authors. This data is complete for certain merchants that are Fiserv clients, rather than a sample or panel of consumers. Their data faces two key problems:
Fiserv's acquisition or churn of merchants is at least partially independent of the opening and closure of merchants in the economy.
Fiserv's merchant mix is not representative of merchants in the United States (the authors specifically mention missing eCommerce).
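To make the second problem concrete, here is a minimal sketch of one generic remedy: post-stratification, i.e., aggregating per-sector growth with national rather than in-sample sector weights. This is a standard survey-statistics technique, not the method Aladangady et al. actually use, and all sector names and shares below are invented for illustration.

```python
# Illustrative only: reweight sector-level growth toward national sector
# shares when the sample's merchant mix is skewed. Sectors entirely absent
# from the sample (e.g. eCommerce) cannot be recovered by reweighting.
def reweighted_growth(sample_growth, national_shares):
    """sample_growth: {sector: growth rate observed in the sample}.
    national_shares: {sector: share of national spending}.
    Returns an aggregate growth rate using national weights, renormalized
    over the sectors the sample actually observes."""
    observed = sample_growth.keys()
    total = sum(national_shares[s] for s in observed)
    return sum(sample_growth[s] * national_shares[s] / total for s in observed)

# Made-up example: the sample is restaurant-heavy, the nation is not.
growth = {"restaurants": 0.04, "retail": 0.01}
sample_shares = {"restaurants": 0.7, "retail": 0.3}
national_shares = {"restaurants": 0.3, "retail": 0.5, "ecommerce": 0.2}

# Naive in-sample aggregate overweights the fast-growing sector.
naive = sum(growth[s] * sample_shares[s] for s in growth)
adjusted = reweighted_growth(growth, national_shares)
```

Note that the missing eCommerce segment simply drops out of the renormalization; reweighting fixes skew among observed sectors but cannot conjure data for unobserved ones, which is why the authors flag it separately.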
Aladangady et al.'s methodology strips out the artificial effect of acquisition and churn in Fiserv's data, but it does not address true merchant creation and destruction in the economy. Instead, their methodology holds merchants constant across varying convenience samples and then combines them. There is no mathematical wizardry here: the level of math is equivalent to what most financial analysts do in spreadsheets. However, the methodology and steps are complex, intricate, and clever.4 Their solution is, of course, incomplete, but it is the best open paper I have come across that gives a realistic sense of what working with external, alternative data is actually like. Imagine extending their methodology to the limit, and you have the makings of a proprietary and valuable set of algorithms for anyone needing an informational advantage in measuring the economy. You also now understand what Cybersyn actually does when we say derived data.
Can large language models (LLMs) automate this type of reasoning? Despite the impressive results of text-to-SQL, I have not seen good evidence that LLMs can reason through a net-new dataset with the complications described above, even with significant prompting. The operator needs a firm grasp of the methodology first, even if they are not writing the code themselves.5 One solution would be to add every such paper to the training set, but there are not many.
I would love to be proven wrong on this point. If you disagree, you should come work with us at Cybersyn and use LLMs to automate such analysis. If you agree and want to solve more such problems, that’s also a good reason to come work with us. I can imagine a red team vs. blue team approach to this.
Raj Chetty and Opportunity Insights are particularly notable, but there is even a recent book on the subject.
The data comes from First Data, a company acquired by Fiserv.
Sections 3.1 and 3.3 are the most interesting and relevant to this discussion.
Amusingly, the authors here come up with the methodology but outsource the programming and computation to Palantir. With LLMs, it is clear that the authors could simply have their methodology translated into SQL.