A perennial idea among external data consumers is a data marketplace. There are many large specialized data vendors, and thousands of small ones, each often offering hundreds of datasets – so why is there not a central marketplace to discover and buy them?
Have any attempts at data marketplaces succeeded at large scale? I think the answer is sort of. There have been successful pseudo-marketplaces whereby a single buyer takes control of the data and integrates it into a single product. The most successful of these has undoubtedly been the Bloomberg Terminal. Further, some marketplaces have been successful for narrow use cases, especially where there is a single data format or join key being sold. For example, advertising activation marketplaces for audiences have done well.1
There has also been relative success at distributing free data in marketplace form. For example, FRED and Data.gov - government attempts to catalog public data - have been successful, judging by the downloads these repositories drive relative to the individual sources. Non-government open data repositories, such as Kaggle, have built large user bases (traffic estimates for kaggle.com/datasets indicate it is the most popular part of the site). In both cases, these serve an aggregation function while avoiding traditional data marketplace challenges, because the data is in the public domain.
A new wave of data marketplaces, funded by database infrastructure providers such as Snowflake, AWS, Databricks, and Google, has also recently been released. Success is still an open question2, although these marketplaces have a clear core advantage: they tie data buying and transfer to the same place where analysis occurs.
A few recurring issues I have observed across many attempts include:
Lack of neutrality: Large data sellers have made attempts to build their own data marketplaces. These DaaS companies can offer distribution to their already large customer bases, and there can be an efficiency or delivery benefit to the customer, given that the customer already has an MSA with the DaaS incumbent. In practice, the distribution advantage only works if the barrier for existing customers to adopt a new dataset is very low. For example, Bloomberg’s Terminal constantly makes new datasets available to end users without much action needed on the end users’ part – so data sellers get distribution without additional legal work. This model struggles where the DaaS incumbent has a large data business that conflicts with marketplace participants: naturally, FactSet is never going to list its data on S&P’s marketplace, and most asset managers are going to use some FactSet and some S&P content – so neither party can build a complete product.
Discoverability: Finding the right dataset to answer a business question is difficult, and good metadata search is essential. There may also be questions an analyst does not know to ask, because they do not know such data exists. Conviction, a venture firm, has a good write-up about why this problem may well be solved with LLMs.
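To make the LLM angle concrete, here is a minimal sketch of dataset discovery as semantic search over catalog metadata, using the open-source sentence-transformers library. The model name and catalog entries are illustrative assumptions, not a description of any existing marketplace.

```python
# A sketch of LLM-era dataset discovery: embed catalog descriptions and a
# buyer's natural-language question, then rank by semantic similarity.
# Model name and catalog entries are illustrative, not a real catalog.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

catalog = [
    "Daily foot-traffic counts for US retail locations, by store",
    "Quarterly earnings call transcripts for Russell 3000 companies",
    "Anonymized credit-card panel with merchant-level spend",
]

question = "Which retailers are gaining share with younger shoppers?"

# Rank every catalog entry against the question by cosine similarity,
# surfacing datasets a keyword search (or the analyst) might miss.
scores = util.cos_sim(model.encode(question), model.encode(catalog))[0]
for score, description in sorted(zip(scores.tolist(), catalog), reverse=True):
    print(f"{score:.2f}  {description}")
```

Because the match is by meaning rather than keyword, the same approach extends naturally to suggesting datasets for questions the analyst never thought to ask.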
Licensing and Monetization: The process of buying data varies by industry, but generally involves both a contracting process similar to enterprise software and a data-specific compliance process. Certain industries have developed standards for this, but most marketplaces have not meaningfully solved the pain of transacting. As in other failed marketplace categories, the data provider and consumer often find it easier (and cheaper) to take the transaction offline.
A fair question to ask is whether the advent of AI creates opportunities for data marketplaces, either by solving some of the above challenges or by surfacing new ones worth solving. A few ideas I have recently come across include:
AI Inference: I am skeptical that selling data without strong marginal-temporal value for training AIs is a good business. Selling data for AI use at inference time, however, may be a good business. AI use cases may present new opportunities for data sales that are better served by usage- or consumption-based pricing than by traditional bulk licensing. Eric Schmidt recently made a similar prediction: that data usage for AIs will look like music royalties3. Furthermore, selling embeddings of datasets, rather than the raw datasets themselves, may well be a value-add to many LLM software providers.
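As a sketch of what consumption-based pricing could look like in practice, here is a toy metering API that charges a royalty per record served at inference time. The class, rate, and data are all hypothetical, meant only to contrast with bulk licensing.

```python
# A toy, royalty-style metering API: the buyer pays per record served at
# inference time instead of signing a bulk license. All names, rates,
# and data are hypothetical.
from collections import defaultdict

PRICE_PER_RECORD = 0.002  # hypothetical royalty per record served

class MeteredDataAPI:
    def __init__(self, datasets: dict[str, list[dict]]):
        self.datasets = datasets
        self.usage = defaultdict(int)  # records served, keyed by buyer

    def query(self, buyer_id: str, dataset: str, predicate) -> list[dict]:
        """Serve matching records and meter the buyer for each one."""
        hits = [row for row in self.datasets[dataset] if predicate(row)]
        self.usage[buyer_id] += len(hits)
        return hits

    def invoice(self, buyer_id: str) -> float:
        """Accrued royalty owed by this buyer, a la a per-play music royalty."""
        return self.usage[buyer_id] * PRICE_PER_RECORD

api = MeteredDataAPI({"transcripts": [{"ticker": "ACME", "quarter": "2024Q4"}]})
api.query("ai-agent-7", "transcripts", lambda row: row["ticker"] == "ACME")
print(api.invoice("ai-agent-7"))  # 0.002
```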
Legal Indemnification: A key friction in data acquisition and sales is the regulatory and compliance burden placed on the buyer to determine whether they can actually buy and use the data on offer. The growing number of regulations (CCPA in California, look-alike laws in Virginia, Maine, and elsewhere, GDPR in Europe) and the difficulty of interpreting them make staying compliant particularly costly.
Traditional marketplaces absolve themselves of liability for their sellers’ products, for the obvious reason that policing and verifying each seller’s products would add tremendous cost and potential liability to the marketplace. This works fine in areas where liabilities are not a concern for buyers, or where the marketplace platform can enforce very heavy technical standards on providers. In data sales, neither condition holds: compliance liability is a central concern for buyers, and it cannot be screened out with technical standards alone.
Note that this is very different from just providing ‘standard’ terms or suggestions as to what disclosures data providers need to make – such suggestions are relatively low value-add.
If data buyers could be assured that a given dataset was legal and compliant for their use case, the third party providing that assurance would add a lot of value to the selling process – easily enough to justify a significant take rate. The arrangement would work well for data sellers too, as the added confidence would surely increase the liquidity of the market.
Potential data sellers may also be reluctant to enter the business for lack of expertise in judging whether their exhaust data could be compliantly monetized.
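To picture what an indemnifying marketplace might actually sell, here is a sketch of a machine-readable compliance manifest attached to each listing, which a buyer (or a seller evaluating their own exhaust data) could check against a use case. The fields and regime names are illustrative assumptions, not an existing standard.

```python
# A sketch of a machine-readable compliance manifest an indemnifying
# marketplace could attach to each listing. Fields and regime names are
# illustrative assumptions, not an existing standard.
from dataclasses import dataclass, field

@dataclass
class ComplianceManifest:
    dataset_id: str
    contains_pii: bool
    cleared_regimes: set[str] = field(default_factory=set)  # e.g. {"GDPR", "CCPA"}
    permitted_uses: set[str] = field(default_factory=set)   # e.g. {"ai_inference"}

    def cleared_for(self, regime: str, use: str) -> bool:
        """Can a buyer under this regime use the dataset this way?"""
        return regime in self.cleared_regimes and use in self.permitted_uses

manifest = ComplianceManifest(
    dataset_id="retail-foot-traffic-v2",
    contains_pii=False,
    cleared_regimes={"GDPR", "CCPA"},
    permitted_uses={"analytics", "ai_inference"},
)
print(manifest.cleared_for("CCPA", "ai_inference"))  # True
```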
Bundling: I predict that the market for data bundles will grow, as more companies will have increased capacity to process unstructured information. The marginal value of any one specific dataset may not increase, but the need for economical consumption of a very large variety of datasets will. Many of these use cases will initially be experimental, and much of the value added may well be marginal. As such, bundles that provide a wide range of data at a low marginal price will become attractive.
Shishir Mehrotra lays out a great economic explanation for this – bundles are a good deal for both consumers and suppliers when the bundle is constructed such that the number of ‘casual fans’ increases4. I think this is exactly the dynamic that will play out with data to be used for AI inference.
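A toy calculation shows the dynamic. The numbers are invented, and the key assumption is Shishir’s condition: the bundle reaches casual fans who would never pay the standalone price, so it does not cannibalize existing sales.

```python
# Toy casual-fan bundle economics. All figures are invented.
superfans = 100        # e.g. asset managers who already buy standalone
casual_fans = 5_000    # e.g. non-financial firms with experimental AI use cases

standalone_price = 50_000  # roughly the inelastic price discussed below
bundle_increment = 500     # what this dataset can command inside a broad bundle

standalone_revenue = superfans * standalone_price
# Revenue from casual fans that standalone pricing could never reach.
new_bundle_revenue = casual_fans * bundle_increment

print(f"standalone only: ${standalone_revenue:,}")                       # $5,000,000
print(f"with bundle:     ${standalone_revenue + new_bundle_revenue:,}")  # $7,500,000
```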
For example, today there exists a well-defined market for earnings transcripts among asset managers. Firms like S&P and FactSet charge for transcribed earnings calls. Asset managers rely on this information for analysis and trading signals – demand for this dataset among this customer base is relatively inelastic: a price of $50K vs. $250K does not really matter, so long as the data is perfectly accurate, because the cost to an asset manager of not having this data would be much higher.
Earnings transcripts could also be used by non-financial firms to understand and identify new selling opportunities, key supplier risks, and key competitor plans. Startups might also have new ideas that rely on this type of unstructured data. Traditionally, the market for using this data has been very small because (a) the processing and structuring cost for transcripts did not justify the effort and (b) transcript data was expensive. With LLMs, the effort to process and synthesize the data is low, so if the price could be made acceptable, there could be a market. The marginal value of this data may initially be low, because the use cases are new and the data is almost substitutable with news stories, SEC filings, press releases, and so on – but there now exists a clear need for some of this data, and if a data provider or marketplace benefits from economies of scale in collecting it, it makes sense to offer it at a much lower cost in a bundle. In Shishir’s terminology, the number of casual fans for this data has massively increased.
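To illustrate how low the processing effort has become, here is a sketch of turning a raw transcript into the structured signals described above, using the openai Python client; the model choice and prompt are illustrative, not a recommendation.

```python
# A sketch of how cheap transcript processing has become: one prompt turns
# a raw earnings call transcript into structured signals. Uses the openai
# Python client (needs OPENAI_API_KEY); model name and prompt are illustrative.
from openai import OpenAI

client = OpenAI()

def synthesize(transcript: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system",
             "content": "From this earnings call transcript, extract: "
                        "(1) new selling opportunities, (2) key supplier risks, "
                        "(3) stated competitor plans. Return JSON."},
            {"role": "user", "content": transcript},
        ],
    )
    return response.choices[0].message.content

# synthesize(open("acme_2024q4_transcript.txt").read())  # path is hypothetical
```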
1. LiveRamp’s and The Trade Desk’s marketplaces, in particular, are examples of successful execution.
2. As far as I know, none of these companies have released comprehensive metrics on these marketplaces.
One other thing: a big problem with most data marketplaces is that the data is not joined, so the datasets are islands. Bloomberg, at least, joins many of them by ticker.
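To illustrate the join point, a minimal example with invented data: two datasets keyed to the same ticker compose in one line; without the shared key they remain islands.

```python
# The join problem in miniature: two datasets are islands until they
# share a key. Data here is invented.
import pandas as pd

transcript_sentiment = pd.DataFrame(
    {"ticker": ["ACME", "GLOBO"], "call_sentiment": [0.8, -0.2]})
foot_traffic = pd.DataFrame(
    {"ticker": ["ACME", "GLOBO"], "visits_yoy": [0.12, -0.05]})

# With a shared ticker key, the datasets compose in one line.
print(transcript_sentiment.merge(foot_traffic, on="ticker"))
```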