What Makes Alternative Data Scientists Alternative?
Interest in alternative data [1] has grown significantly over the past several years, particularly in the asset management industry [2]. McKinsey [3], BCG [4], and other global consulting firms [5] have pointed out the opportunity for businesses of all kinds to monetize internal data and to incorporate data from outside the enterprise to improve decision making. The growing interest has increased demand for practitioners: data scientists who can help business leaders realize a return on their investment in data. Three qualities distinguish successful data scientists in alternative data [6] from those in adjacent data science fields such as business intelligence, analytics, or machine learning, especially in the technology industry. Working with alternative data requires a particularly pragmatic, parsimonious approach, an ability to understand the broad context of data sources, and an ability to communicate with a broader set of stakeholders than traditional data science roles require.
The benefits of alternative data come from innovative uses of novel data sources in areas where no data previously existed. Improved business results come from cleaner data or from identifying key datasets rather than from optimizing statistical methodology or algorithms. The lift of a machine learning model relative to a straightforward linear model is often marginal, so the opportunity cost of optimizing a methodology, as opposed to exploring and exploiting new datasets, is higher than in adjacent fields. This dynamic occurs because alternative data is most useful either in applications where approximate answers are sufficient or in applications where ground-truth data is so limited that there are not enough training observations or degrees of freedom to support a more complex model.
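The degrees-of-freedom point can be illustrated with a small simulation. This is only a sketch under stated assumptions: the ground truth is taken to be a noisy straight line, only 12 training observations are available, and a degree-6 polynomial stands in for a "more flexible" model. The function name `simulate`, the parameter values, and the synthetic data are all hypothetical and chosen purely for illustration.

```python
import numpy as np


def simulate(n_train=12, n_test=200, flexible_degree=6,
             n_trials=200, noise=1.0, seed=0):
    """Average out-of-sample MSE of a straight line vs. a flexible polynomial.

    Assumption: the true relationship is linear (y = 2x) plus Gaussian noise.
    """
    rng = np.random.default_rng(seed)
    mse = {"linear": [], "flexible": []}
    for _ in range(n_trials):
        # Small training set, larger held-out set drawn from the same range.
        x_train = rng.uniform(-1, 1, n_train)
        x_test = rng.uniform(-1, 1, n_test)
        y_train = 2.0 * x_train + rng.normal(0, noise, n_train)
        y_test = 2.0 * x_test + rng.normal(0, noise, n_test)
        for name, degree in (("linear", 1), ("flexible", flexible_degree)):
            coefs = np.polyfit(x_train, y_train, degree)  # least-squares fit
            pred = np.polyval(coefs, x_test)
            mse[name].append(np.mean((y_test - pred) ** 2))
    return {name: float(np.mean(values)) for name, values in mse.items()}


results = simulate()
# Residual degrees of freedom: observations minus fitted parameters.
residual_dof = {"linear": 12 - 2, "flexible": 12 - 7}
print(results, residual_dof)
```

Under these assumptions the flexible model has fewer residual degrees of freedom and, averaged over trials, a worse out-of-sample error than the straight line. With real data the gap depends on how nonlinear the true relationship is and on how many observations exist.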
A colleague once described the process of working with alternative data to me as “working only on POCs [proofs of concept]”. Because of the large number of data sources and the difficulty of predicting whether a project will succeed, much of the work in alternative data consists of trying new things. A successful alternative data effort is akin to a scientific laboratory maintaining a portfolio of experiments rather than a monolithic software system. System building is very important for increasing the speed at which new datasets can be tested or presented, but the actual “science” done on alternative datasets needs to be parsimonious and constrained. One needs to develop a clear sense of when to stop.
This portfolio-of-experiments approach is a consequence of the multitude of data sources that can solve a business problem. For instance, one could track commodity shipping volumes using satellite imagery, AIS transponders, tariff agency records, or parsed broker emails. Each of these data sources has its own industry jargon, standards, and context. Using them requires understanding their respective opportunities and limitations. It is difficult to separate the data practitioner from the domain expert: the person most familiar with a dataset's schema, causes of distortion, and capture mechanism is the one most likely to fix the resulting issues. It is much easier to work with satellite data if one is familiar with the evolution of imagery resolution over the past several years. Therefore, the successful alternative data scientist will be eager and quick to learn about a wide variety of industries (payment processing, satellite imagery, and regulatory reporting requirements, for example) and must develop a broad knowledge base. This is in stark contrast to data science fields where the data-generating processes are narrower, the providers of the data are internal, and changes can be made to collection processes to address data deficiencies. In such cases, a product manager can relay requests between the data science team and the corresponding engineering team.
Further still, creative solutions using alternative data often come from reading methodology reports. It becomes much easier to predict the unemployment rate if one understands the intricacies of how the Bureau of Labor Statistics calculates it [7]. So, the best alternative data scientists solve problems by proactively finding answers in publicly available data sources, methodology documents, and industry reports. It is difficult (and maybe disingenuous) to ask for “out of the box” thinking, but a prerequisite to such thinking is an ability and willingness to proactively find answers to questions that cannot be answered internally.
This breadth of data sources also necessitates communication with a variety of stakeholders, both inside and outside the organization. Data providers are among the most important external points of contact. Because alternative data does not originate within the firm, a successful data scientist will invariably need to speak to salespeople, IT staff, and domain experts in a given field to develop the aforementioned context. Data scientists in other practice areas are also expected to communicate and present results; the difference is one of degree. Further, the regulatory environment around external data requires communication with lawyers, compliance officers, and others tasked with ensuring the firm remains in compliance with privacy regulations, contract terms, and data provenance laws. While this compliance function can often be fulfilled by product managers, the most innovative uses of data will become apparent only to practitioners who can actively manipulate and explore data, so the corresponding legal questions are best asked by them.
These attributes are not entirely absent in “ordinary” data scientists; it is a question of degree. I do not mean to suggest that data scientists outside alternative data do not communicate with many stakeholders, do not need to be pragmatic, or have narrow context. Rather, these attributes are disproportionately important to succeeding in alternative data. Similarly, alternative data efforts benefit from better algorithms and machine learning, and can occasionally control and improve their sources of data, but these aspects are less critical.
If, after reading this, you are left with a desire to learn more about what a role in alternative data might look like, please do not hesitate to contact me. I am hiring data scientists and engineers for my new company.
[1] The definition of “alternative data” is not well agreed upon. Within the asset management industry, it has come to mean datasets other than accounting, financial, or market-based data that are used to generate “alpha”. A broader definition might consider “external data”: any dataset from outside the enterprise.
[2] I am always skeptical of TAM/market-sizing exercises for alternative data, given that the exact definition can change the market size by a factor of ten or more. Nonetheless, I include this one.
[6] I take a broad view of alternative data to refer to any data science practice in which external data is used. My comments are naturally colored by my experience in the financial services sector, in particular.