I shared a list of strongly opinionated research principles with my data science team and received positive feedback, so I figured I would share them broadly.
The below is advice for working with alternative data, especially at a buy-side firm or a market research firm like Cybersyn.
I have found that data scientists who come from the tech industry (where you usually work with internal data) or academia (where you rarely see, or at least care about, real data) often take time to learn these principles. When you see hedge funds or venture firms churning through accomplished data science leaders, I suspect it is because they do not follow these principles.
When working with alternative data…
Always measure correlation and mean absolute error (MAE) on year-over-year values for all the timeseries you are trying to predict or nowcast.
This is the toughest standard; if you have high correlation and low MAE on YoY % values, you will have good metrics on any other derivative metric as well and won’t be “fooled” by seasonality, tricked by leverage points, etc.
You need to look at both metrics, and always over the maximum length of time available, for any idea you test.
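As a concrete (if toy) sketch of what I mean, with made-up monthly series standing in for your target and your nowcast:

```python
import numpy as np
import pandas as pd

# Made-up monthly data: `actual` stands in for the series you are nowcasting,
# `predicted` for your model's output.
dates = pd.date_range("2018-01-31", periods=60, freq="M")
actual = pd.Series(100 + np.arange(60) + 5 * np.random.randn(60), index=dates)
predicted = actual * (1 + 0.02 * np.random.randn(60))

# Score on year-over-year % changes, not levels (12 periods for monthly data).
actual_yoy = actual.pct_change(12)
predicted_yoy = predicted.pct_change(12)

both = pd.concat([actual_yoy, predicted_yoy], axis=1, keys=["actual", "pred"]).dropna()
corr = both["actual"].corr(both["pred"])
mae = (both["actual"] - both["pred"]).abs().mean()
print(f"YoY correlation: {corr:.3f}   YoY MAE: {mae:.4f}")
```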
Test ideas by pushing levers to extreme values
Many ideas will just “do nothing” – it’s fastest and most efficient to try extreme values to see if the idea has any “leverage” on the real result. You can always fine-tune later.
You will also save yourself the compute cost: if you make a very small tweak and it does nothing, it is hard to know whether it was a bad idea or whether you tweaked the parameter by too little. Move the lever a lot to start.
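Something like this, where `run_backtest` and `decay` are hypothetical stand-ins for your own pipeline and whatever lever you are testing:

```python
def run_backtest(decay: float) -> float:
    """Stand-in for your full experiment; returns, say, YoY correlation."""
    return 0.6 - 0.3 * abs(decay - 0.5)  # fake response surface for illustration

# First, push the lever to the extremes to see whether it matters at all...
for decay in [0.0, 0.5, 1.0]:
    print(f"decay={decay:.2f} -> corr={run_backtest(decay):.3f}")
# ...and only sweep finely (0.40, 0.45, 0.50, ...) once you know it has leverage.
```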
Always visualize the data in a timeseries
Regardless of summary stats, always plot the data. You'll often see crazy things that are unrealistic immediately. You have to look at the plots.
Try to plot timeseries by strata: if you have a set of variables in your alternative dataset, make multiple plots where you draw a line for the aggregate within each value of the variable. When you see big changes, you get ideas about where problems are coming from.
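A rough sketch of what that looks like with pandas and matplotlib, assuming a tidy table with hypothetical `date`, `region`, and `value` columns:

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Toy panel: one value per (date, region).
dates = pd.date_range("2022-01-31", periods=24, freq="M")
df = pd.DataFrame(
    [(d, r, 100 + 10 * np.random.randn()) for d in dates for r in ["east", "west", "online"]],
    columns=["date", "region", "value"],
)

fig, ax = plt.subplots()
for region, grp in df.groupby("region"):
    grp.set_index("date")["value"].plot(ax=ax, label=region)   # one line per stratum
df.groupby("date")["value"].sum().plot(ax=ax, label="total", color="black", linewidth=2)
ax.legend()
ax.set_title("Aggregate by stratum: sudden level shifts point at the culprit")
plt.show()
```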
Always have a “Strawman” experiment that is the dumbest thing possible
If you do “the dumbest thing possible” – how does your solution compare?
This can be the “null” experiment (i.e., how does it compare to the existing model?)
But it can also be the genuinely “dumb thing” (i.e., doing no re-weighting and just taking the weights as they empirically appear in the data, or an AR(1)[1] model)
You will be surprised at how often the “dumbest” solution is incredibly hard to beat.
It is very very hard to beat linear regression in out-of-sample timeseries forecasting of economic variables over the long term. This upsets everyone who comes to the space from tech, where modern machine learning dominates linear models in almost everything.
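A toy example of the comparison, with a naive “repeat last period” baseline standing in for the dumbest thing possible and a made-up series standing in for your fancier model:

```python
import numpy as np
import pandas as pd

dates = pd.date_range("2018-01-31", periods=72, freq="M")
actual_yoy = pd.Series(0.03 + 0.02 * np.random.randn(72), index=dates)

naive_pred = actual_yoy.shift(1)                         # just repeat last period's YoY value
fancy_pred = actual_yoy + 0.015 * np.random.randn(72)    # stand-in for your model's output

# Score both against the same correlation/MAE standard.
for name, pred in [("naive", naive_pred), ("fancy", fancy_pred)]:
    both = pd.concat([actual_yoy, pred], axis=1, keys=["y", "yhat"]).dropna()
    corr = both["y"].corr(both["yhat"])
    mae = (both["y"] - both["yhat"]).abs().mean()
    print(f"{name:5s}  corr={corr:.3f}  MAE={mae:.4f}")
```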
Use linear relationships where possible
Most alternative data should be linearly related to a response variable. It would be unclear why a relationship would be nonlinear and if you do anything fancy, you should know why it was necessary.
Again, this upsets anyone working in tech but it is a fact of life in finance. If you don’t believe me, try to speak to some quants.
At the very least, use a regression as a “Strawman” against any fancier model. Also test AR(1) models where applicable[2].
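For example, something like this (synthetic data; in practice `x` would be your alternative-data aggregate and `y` the economic series you are nowcasting):

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.tsa.ar_model import AutoReg

rng = np.random.default_rng(0)
n = 80
x = rng.normal(size=n)                  # alt-data signal (e.g., YoY % change)
y = 0.8 * x + 0.2 * rng.normal(size=n)  # target, mostly linear in x by construction

# Strawman 1: plain OLS of the target on the alt-data signal.
ols = sm.OLS(y, sm.add_constant(x)).fit()
print("OLS coefficients:", ols.params, " R^2:", round(ols.rsquared, 3))

# Strawman 2: AR(1) on the target alone -- no alternative data at all.
ar1 = AutoReg(y, lags=1).fit()
print("AR(1) coefficients:", ar1.params)
```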
Sanity checks everywhere
Sadly, with alternative data, a really complex, fancy, and useful model can be made useless because of an accidental “where” clause early on in the pipeline. It’s critical to output sanity checks (i.e., how many users am I left with at each step, how many did I start out with, etc.)
I have found it helpful to plot intermediate steps as well
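A minimal sketch of what I mean, with a hypothetical `user_id` column and made-up filtering steps:

```python
import pandas as pd

def checkpoint(df: pd.DataFrame, step: str, start_rows: int) -> pd.DataFrame:
    """Print how much data survives each step relative to the raw input."""
    pct = 100 * len(df) / start_rows
    print(f"[{step:<20s}] rows={len(df):>8,d}  users={df['user_id'].nunique():>6,d}  ({pct:.1f}% of raw)")
    return df

raw = pd.DataFrame({"user_id": [1, 1, 2, 3, 3, 3], "amount": [10, -5, 20, 7, 0, 3]})
n0 = len(raw)

clean = checkpoint(raw, "raw", n0)
clean = checkpoint(clean[clean["amount"] > 0], "drop non-positive", n0)
clean = checkpoint(clean.drop_duplicates(), "drop duplicates", n0)
```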
Use the smallest sample possible or find a way to iterate faster
It is extremely difficult to do research if you are waiting more than a few seconds between cycles. You will never get anywhere if you have to wait hours between experiments.
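One simple way to do this is to carve out a small, fixed development sample by entity (the column names here are hypothetical; the key details are sampling by entity rather than by row, and fixing the seed so runs are comparable):

```python
import pandas as pd

def dev_sample(df: pd.DataFrame, entity_col: str = "user_id", frac: float = 0.01) -> pd.DataFrame:
    """Keep all rows for a random ~1% of entities, reproducibly."""
    entities = df[entity_col].drop_duplicates().sample(frac=frac, random_state=42)
    return df[df[entity_col].isin(entities)]

# panel = pd.read_parquet("transactions.parquet")  # full data: slow to iterate on
# dev = dev_sample(panel)                          # ~1% of users: fast research loop
```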
Change at most one thing at a time
You can stack innovations, but you want to be able to unwind the contribution of each idea.
Don’t be afraid to “look at the micro”
It is often incredibly helpful to isolate a stratum of the data: a single user, single website, single store, etc. and look longitudinally at what actually happens in the data. It is a good way to get ideas and understand why trends might be appearing in aggregate.
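Concretely, something as simple as this, with illustrative column names:

```python
import pandas as pd

df = pd.DataFrame({
    "user_id": [7, 7, 7, 9, 9],
    "date": pd.to_datetime(["2024-01-05", "2024-02-11", "2024-06-30", "2024-01-02", "2024-01-03"]),
    "merchant": ["WMT", "WMT", "TGT", "WMT", "WMT"],
    "amount": [54.10, 61.75, 23.00, 12.00, 12.00],
})

# Pull one entity's full history and eyeball it longitudinally.
one_user = df[df["user_id"] == 7].sort_values("date")
print(one_user)  # does this look like a real person's behavior, or a bot / duplicate feed?
```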
80/20 rule – always do the easy, low hanging stuff first, even if it means leaving good ideas on the table
Read this and recognize you need to produce commercially useful results in days. Do whatever it takes to do so, and ignore interesting projects that do not move the needle.
If you can make something commercially useful, it is better to do that than incrementally improve something that is already useful.
Divide all ideas into a two-dimensional grid with axes: Easy vs. Hard and Low Impact vs. High Impact
Get familiar with the data’s summary statistics and memorize them.
Is 500 samples a lot in this dataset? How many distinct attributes can these variables have? What is the average magnitude of an individual observation?
If you do not memorize these, it is harder to have intuition or gut feelings about whether what you see in the data is a problem or not.
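For example, the handful of numbers worth committing to memory for a transactions-style table (illustrative columns):

```python
import pandas as pd

df = pd.DataFrame({
    "user_id": [1, 1, 2, 3],
    "merchant": ["WMT", "TGT", "WMT", "COST"],
    "amount": [54.10, 23.00, 12.00, 310.55],
})

print(len(df), "rows,", df["user_id"].nunique(), "users,", df["merchant"].nunique(), "merchants")
print(df["amount"].describe())  # typical magnitude of a single observation
```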
Related to the above, when you see broken data, quantify how big the issue is as a percentage of the total data before doing any work.
It is easy to notice an error (say duplicates, mislabelled data, etc.) and spend enormous effort trying to fix it, only to notice that it affects 0.1% of your sample and thereby fixing it has no effect on the final result.
These types of issues fall into the “Low Impact, but Hard” category and aren’t worth doing. The only way to know is to measure ideas against the scale of the data and you need natural intuition about this to move fast.
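For example, sizing a duplicate-rows issue before touching it (illustrative columns):

```python
import pandas as pd

df = pd.DataFrame({
    "txn_id": [1, 2, 2, 3, 4],
    "amount": [10.0, 25.0, 25.0, 7.5, 40.0],
})

# Measure what share of rows (and of dollar volume) the problem actually touches.
dupe_mask = df.duplicated(subset="txn_id", keep="first")
print(f"duplicates: {dupe_mask.mean():.1%} of rows, "
      f"{df.loc[dupe_mask, 'amount'].sum() / df['amount'].sum():.1%} of volume")
# If both numbers are tiny, this is "Low Impact, but Hard" -- park it.
```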
Never trust a data salesperson with certainty about the integrity or meaning of data
You can never be certain the person providing the data themselves has all the answers. Often fields are not exactly what is in the data dictionary or what you would assume.
It is almost impossible to extract definitive answers; you are better off empirically testing assumptions (and you should empirically test assumptions). Often, asking salespeople questions about data results in a long “game of telephone” in which a lot is lost in translation.
Read the news and get familiar with “what’s reasonable”
If I told you that spending at Walmart increased 30% year-over-year in a non-Covid year, what would you say?
If the answer is not an immediate “that can’t be right, that’s way too high!” then you do not have any intuition about reasonableness.
Become familiar with the US population, GDP growth, consumer spending growth, inflation, etc. You do not need to memorize exact numbers, but you do need enough intuition to look at a timeseries plot and immediately react if the data “looks unreasonable”.
Most investment professionals develop this intuition without thinking about it, but data scientists do not in the course of most ordinary data science jobs in tech (surprisingly!).
Ask real world questions about what the data represents
If you are working with credit card data, for example, remember that different cities and states have different sales taxes, that there is a difference between credit card authorizations versus processed transactions, that certain types of purchases require pre-authorization and others do not, etc.
Every alternative dataset represents something in the real world, but the real world is messy and likely the data you have only captures some part of it. Being acutely aware of what is represented helps spawn ideas.
Alternative data nowcasting / forecasting has (almost) nothing to do with stock returns.
Alternative data is about forecasting economic realities. Those economic realities are reflected in the market in non-obvious ways based on others’ expectations.
Michael Lewis[3] details the case of Jane Street traders correctly predicting Donald Trump would win in 2016… but then failing to make money off the trade anyway. If you are using alternative data to invest in the stock market, remember that converting economic predictions into stock market predictions is a separate step. It is possible to be excellent at the former but bad at the latter.
I do not want to overstate the point; clearly machine learning models can be useful in timeseries, but I am underscoring it because the bias from data scientists on this point is so extreme. Nixtla, a startup, is probably the most interesting innovator in using machine learning for timeseries.