LLM Data Sales: A Market for Lemons?
Why selling data to AI model trainers is a difficult proposition
Many technology investors and pundits predict that proprietary data owners will capture the majority of economic profits from the recent advances in artificial intelligence, particularly in large language models (LLMs) and generative foundation models [1][2]. A natural conjecture is that a market for training datasets will emerge, and model training companies have already reportedly been seeking datasets to buy. The concept of selling data as a core business model is not new – existing data vendors make billions of dollars per year selling data [3]. Selling data for the purpose of training generative models, however, is both new and uniquely difficult. If the associated challenges are not solved, value may well still accrue to proprietary data owners, but only to those able to train and operationalize their own models, rather than more broadly [4].
Traditional data sales largely depend on the value of marginal data points. Marginal data points matter because data becomes outdated. The underlying data could be a timeseries where dates have explicit meaning (e.g., stock prices) or observations that degrade in value over time (e.g., the corporate email addresses of sales leads). In both cases, the data buyer subscribes to an ongoing license, which creates a sustainable business model for the data vendor. In this traditional scenario, the historical data may still be valuable – in the case of stock prices it is, in the case of sales leads it is not – but it is not sufficient. I will refer to this property of data becoming outdated as marginal temporal value.
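To make the contrast concrete, here is a toy sketch (my own illustration, not part of the original argument; every number is made up) of how the value of a single record might decay with age in the two examples above.

```python
# Toy illustration of "marginal temporal value" (hypothetical numbers throughout).
from dataclasses import dataclass


@dataclass
class Record:
    age_years: float  # how old the observation is


def value_of_sales_lead(r: Record) -> float:
    # Contact details go stale quickly: assume the value halves every year.
    return 1.00 * (0.5 ** r.age_years)


def value_of_stock_price(r: Record) -> float:
    # A historical price retains analytical value (e.g., for backtesting),
    # but only recent prices support live decisions; model that as a floor.
    return 1.00 if r.age_years < 1 else 0.25


history = [Record(age_years=a) for a in range(10)]
print(sum(value_of_sales_lead(r) for r in history))   # ~2.0: value sits in fresh records
print(sum(value_of_stock_price(r) for r in history))  # 3.25: history keeps a baseline value
```

In both stylized cases the buyer has to keep paying for fresh records, which is the subscription dynamic described above; the next section argues that training data for generative models inverts this.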
The types of data most in demand for training generative models, on the other hand, lack marginal temporal value. For training, the value is in a dataset’s volume and history. For example, a database of user-generated questions and answers, such as Quora, holds almost all of its value to a model trainer in the historical data [5]. Quora will have an advantage in expanding this dataset – the marginal cost of creating new question-and-answer observations is lower for Quora than for a model trainer – but the majority of Quora’s value proposition is that the dataset already exists. And, while there may be some value in getting up-to-date information for time-sensitive applications [6], for generative foundation models aimed at truly general reasoning, this seems significantly less important than amassing enough historical data to allow the model to learn “to reason”.
A related, corollary property is irrevocability. Once a data vendor shares data with a model trainer, the value exposed becomes difficult to take back. A license might be revoked, but once a model has been trained, the data lives on as derivative model weights. There is relatively little precedent for forcing data licensees to destroy derivatives of formerly licensed data. Beyond precedent, it would be challenging to force model trainers to destroy derivative works because of the high cost of creating them (i.e., the computational cost of training). This conundrum makes pricing and evaluation difficult – what model trainer would agree to a significant price without being certain of the improvement to their model? The vendor and model trainer will struggle to agree on an upfront price because of this lack of information and the high cost of revocation [7].
The last difficulty of selling data to model trainers is that the generative nature of these models (i.e., their ability to perpetually create new datasets) makes governance of downstream use cases hard. Most model trainers offer their API as a service to downstream clients, who in turn build a variety of applications and services [8]. It is practically difficult to control downstream client usage in a way that ensures the downstream client derives no benefit after license termination.
These last two properties – irrevocability and governance – are best illustrated with a comparative example. Consider a traditional scenario in which a bank licenses Bloomberg data for equity research reports and subsequently decides not to renew its license. The bank would no longer be able to issue new equity research reports, but neither would it be compelled to retract past, already-published reports. The bank’s clients relied on those past reports and may still benefit for some relatively short period after the license lapses – but the economic benefit is short-lived, and soon the bank’s clients need new reports to keep benefiting (and so the bank needs a new license). The bank’s clients may also have trained models on these historical reports – Bloomberg almost certainly has no recourse to pull these back, but such models are purpose-built (narrow) and likely decay in value over time. The bank’s clients keep benefiting from Bloomberg’s data, but only in a limited, time-constrained way.
In contrast, the generative nature of foundation models makes this far more complicated, especially as sophisticated downstream clients start using models to build and evaluate other models [9] or to generate synthetic data [10]. Suppose a model trainer licenses Quora’s data. A downstream client of that trainer then uses the trainer’s API to train or fine-tune a more specialized, but still generative, model, using the trainer’s model as an evaluator or as a generator of synthetic data. Quora revokes the license, and perhaps the model trainer now needs to revoke access to its model – but what do they do about the client? The downstream model may well be able to perfectly reproduce Quora’s dataset. On the other hand, what client would sign up for a model where upstream suppliers could influence all future uses, including derived ones?
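To make that downstream pipeline concrete, here is a minimal sketch (my own illustration, not a workflow described in this post) of how a client might use a licensed provider’s API to build a synthetic training corpus. It assumes the OpenAI Python SDK; the model name, prompt, and topic are placeholders. The point is that nothing in the pipeline depends on the upstream data license remaining in force once the synthetic corpus and the fine-tuned weights exist.

```python
# Hypothetical sketch: a downstream client distills value out of an upstream,
# licensed model by generating synthetic training data through its API.
# Assumes the OpenAI Python SDK; the model name and prompt are placeholders.
from openai import OpenAI

client = OpenAI()  # the upstream foundation model provider's API


def synthesize_qa_pairs(topic: str, n: int = 100) -> list[str]:
    """Generate synthetic question/answer pairs in the style of the licensed data."""
    corpus = []
    for _ in range(n):
        resp = client.chat.completions.create(
            model="gpt-4o-mini",  # placeholder upstream model
            messages=[{
                "role": "user",
                "content": f"Write one realistic user question about {topic} "
                           "and a detailed answer. Return JSON with keys 'q' and 'a'.",
            }],
        )
        corpus.append(resp.choices[0].message.content)
    return corpus


# The synthetic corpus is then used to fine-tune a separate, client-owned model.
# Revoking the upstream license later does not claw back whatever value has
# already been embedded in that downstream model's weights.
synthetic_corpus = synthesize_qa_pairs("user-generated Q&A, e.g. personal finance")
```

The data vendor’s leverage effectively ends at the first API hop, which is the governance problem described above.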
Taken together, these three properties of the foundation model training data use case – a lack of marginal temporal value, irrevocability, and difficult downstream governance – make data-as-a-service for generative models very difficult.
Selling data to model trainers is not hopeless; there are some potential partial solutions. For instance, a royalty or revenue-sharing model based on consumption might well help with fair pricing (although downstream governance remains a question). The irrevocability problem could perhaps be circumvented if the marginal value of a dataset could be approximated computationally cheaply. It may also be possible to mitigate these concerns by keeping all usage of models in neutral, third-party-controlled computational sandboxes [11]. Nonetheless, both the business model and the technical problems associated with this type of sale remain difficult in the near term.
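As a thought experiment (entirely my own numbers and structure, not a proposal from any vendor), the consumption-based royalty idea above might look something like the sketch below. The hard, unsolved part is the attribution term: estimating what fraction of the model’s value a single licensed dataset is responsible for.

```python
# Toy sketch of a consumption-based royalty for a licensed training dataset.
# Every parameter here is a hypothetical, negotiated quantity.

def quarterly_royalty(
    tokens_served: int,          # metered usage of the trained model this quarter
    revenue_per_token: float,    # the model trainer's realized price per token
    dataset_attribution: float,  # fraction of model value attributed to this dataset
    royalty_rate: float = 0.20,  # negotiated revenue-share rate
) -> float:
    model_revenue = tokens_served * revenue_per_token
    return model_revenue * dataset_attribution * royalty_rate


# Example: 10B tokens served at $2 per million tokens, with 5% of the model's
# value attributed to the licensed dataset and a 20% revenue share.
print(quarterly_royalty(10_000_000_000, 2e-6, 0.05))  # -> 200.0 (dollars)
```

A computationally cheap approximation of a dataset’s marginal value, as hoped for above, would effectively supply the `dataset_attribution` input; without it, that number is a negotiation rather than a measurement.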
From investors: https://a16z.com/cloud-lessons-for-the-ai-era/
Amusingly, Sequoia, my own investor, claims in their latest AI article that value did not accrue to proprietary data owners. I disagree; I do not think it is a settled question. The best of AI may well still come from proprietary data. I do agree that easy-to-replicate data will become commoditized, and I predict that many copyright challenges will fail, further commoditizing internet data. Of course, predictions are hard, especially about the future.
Quora is an example used for convenience. In actuality, many publishers and social networks have begun charging for data.
Even in time series applications, this is usually inference-oriented rather than training-oriented, as far as I know.
The acrimony around X’s (formerly Twitter’s) data sales demonstrates this issue. An upfront price potentially appropriate for a generative model is incompatible with the traditional way data is valued.
OpenAI, Anthropic, Mistral, etc.
LLMs as a Judge: https://arxiv.org/pdf/2306.05685.pdf
Synthetic data from LLMs: https://www.amazon.science/blog/using-large-language-models-llms-to-synthesize-training-data
Snowflake Container Service or Native Applications, for example.
Reddit user content being sold to an AI company in a $60M/year deal. Interesting that it is an exclusive with an unnamed AI company ... will be curious how they (the unnamed AI company) plan to get a return on that investment.
https://9to5mac.com/2024/02/19/reddit-user-content-being-sold/