In this essay, I reflect on some of the practical challenges of data monetization and relate them to theoretical concepts in information economics1. I then point out how these theoretical failures play out in practice and result in obviously useful data – MLS listings or B2B transactions, for example – not being widely available. I explain two particular market failure conditions, one resulting from the non-excludable nature of data and one from the cold-start problem data aggregators face. By sharing these observations, I hope to solicit solutions to these challenges.
Economists refer to data as non-excludable because once it is sold, infinite copies can be made, depriving the seller of the ability to charge for additional copies. In practice, measures are taken to enforce exclusion, and data markets fail when those measures end up being either not effective enough or too restrictive. The measures can be contractual or technical. Certain attributes of data – such as value that decays quickly with time – can cause data to effectively expire and so also assist in creating exclusion. Contractual measures can be as simple as requiring data consumers to sign a licensing agreement. These generally work well when the data consumers are vetted, reputable, and unlikely to want to engage in legal disputes. Technical measures, such as data clean-rooms, aggregation constraints, and digital watermarks, can often be effective, but they always involve a trade-off between flexibility (and therefore value to the consumer) and protection (and therefore insurance for the seller). If taken too far, both contractual and technical measures can entirely prevent the market, or part of a market, from using a given dataset.
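To make the flexibility-versus-protection trade-off concrete, here is a minimal sketch of one such technical measure: an aggregation constraint that refuses to answer queries computed from too few records. The threshold, field names, and function are my own illustrative choices, not a description of any particular clean-room product.

```python
# Illustrative sketch only: a seller exposes aggregates, never row-level data,
# and refuses any aggregate computed from fewer than a minimum number of rows.
from statistics import mean

MIN_GROUP_SIZE = 25  # hypothetical threshold chosen by the data seller

def average_listing_price(listings, region):
    """Return the average listing price for a region, or None if the group
    is too small to release without risking exposure of individual records."""
    prices = [row["price"] for row in listings if row["region"] == region]
    if len(prices) < MIN_GROUP_SIZE:
        return None  # exclusion kicks in: too few rows to release safely
    return mean(prices)
```

The higher the threshold, the better the protection and the less flexible (and valuable) the data product becomes, which is the trade-off described above.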
For example, data from multiple listing services (MLS)2 is notoriously expensive and difficult to license. MLSs serve as cooperatives among realtors in specific geographic areas. They are highly fragmented and impose extremely specific compliance rules and circumstances on licensing data3. This model makes sense for industry participants – the data is highly localized, and there is a real risk that if it were readily available, unsanctioned competitors could create competing listing or realty services without participating. MLS data is therefore extremely tightly controlled and perhaps intentionally fragmented. These measures (mostly) protect the dominance of the National Association of Realtors and the realtor licensing system4. On the other hand, they create a market failure: noncompetitive consumers at lower prices on the demand curve are left unserved. These consumers cannot pay hundreds of thousands of dollars for access and do not fit the prescribed acceptable use cases, yet they have noncompetitive use cases that have nothing to do with realtors or running competing listing websites5. The lack of an effective technology to price discriminate by use case means that the data remains practically unavailable to certain portions of the market (economists would call this deadweight loss). The severity of this problem depends on how you size the noncompetitive market for real estate data.
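As a toy illustration of the deadweight loss argument, with entirely invented buyers and willingness-to-pay numbers, consider what a single high price leaves on the table versus pricing by use case:

```python
# Invented numbers for illustration; not actual MLS pricing.
buyers = [
    {"use_case": "competing listing site", "willing_to_pay": 250_000},
    {"use_case": "economic research",      "willing_to_pay": 15_000},
    {"use_case": "proptech analytics",     "willing_to_pay": 30_000},
]

single_price = 250_000  # the only price at which competitive use is safely monetized

# Under a single price, only the high-value competitive buyer is served.
served = [b for b in buyers if b["willing_to_pay"] >= single_price]
revenue_single_price = single_price * len(served)                    # 250,000

# Willingness to pay left unserved: a rough proxy for deadweight loss,
# since the marginal cost of another data license is ~zero.
unserved_value = sum(b["willing_to_pay"] for b in buyers
                     if b["willing_to_pay"] < single_price)          # 45,000

# With credible price discrimination by use case, the noncompetitive
# buyers could be served too without enabling competitors.
revenue_price_discrimination = sum(b["willing_to_pay"] for b in buyers)  # 295,000
```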
A second market failure occurs when owners of data have insufficient incentive to share it in any one transaction and no aggregator or coordinator exists – akin to a lack of market makers in financial markets. This situation arises when an owner’s data is valuable only when combined with many other owners’ data and not especially valuable alone. It leads to a “cold start” coordination problem wherein individual owners ask a higher price than any individual data consumer can bid, but the collective demand of a group of consumers for a collection of owners’ data would have a clearing price. To solve this, a data aggregator must incur the cost of acquiring data from each source. Doing so is expensive and logistically difficult because it must happen nearly simultaneously across owners to minimize the time during which the aggregator is licensing some data but does not yet have enough for a viable product to bring to market. An analogous problem arises when the owners’ data requires a large amount of cleaning, modeling, or transformation to be useful. In such cases, the R&D cost may exceed what any single data consumer could bear, even though the buying power of all consumers combined would suffice.
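A toy example with invented asks and bids shows why no bilateral deal clears but an aggregated one can:

```python
# Hypothetical numbers to illustrate the cold-start coordination problem.
n_owners = 10
ask_per_owner = 50_000        # each owner's minimum acceptable price for their slice
n_consumers = 20
bid_per_combined = 40_000     # what each consumer would pay for the combined dataset
value_of_one_slice = 0        # a single owner's data alone is worth ~nothing to a consumer

# Bilateral deals fail: no consumer will pay 50,000 for a slice worth ~0 on its own.
bilateral_deal_clears = value_of_one_slice >= ask_per_owner                    # False

# An aggregator that licenses every slice can clear the market collectively,
# but only by fronting the full acquisition cost before any revenue arrives.
total_acquisition_cost = n_owners * ask_per_owner          # 500,000 paid out up front
total_collective_demand = n_consumers * bid_per_combined   # 800,000
aggregated_market_clears = total_collective_demand >= total_acquisition_cost   # True
```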
Data aggregators, therefore, have two useful roles to play: one is to consolidate raw data and the second is to develop data products. This is quite different from data marketplaces, data catalogs, or data brokers. Beyond coordination, data aggregators also incur duration risk, by which I mean they front capital to data owners today in exchange for data product revenue in the future.
Revenue share agreements are often proposed as an equitable solution to eliminate duration risk. Such agreements involve aggregators paying owners a percentage of data product revenues. In practice, revenue shares are far from perfect for either party. For aggregators, revenue shares limit how much data combination can be done by permanently impairing margins. They can be particularly detrimental to unlocking a data product’s full potential if they are structured as an absolute percentage of sales – eventually, only so many datasets can be combined before the endeavor becomes unprofitable, which leads to a less compelling product (bad for everyone). For data owners specifically, a revenue share may not clear a greater-than-zero dollar profit hurdle in the short term if a significant amount of R&D work (and therefore time) is required to make a useful product. In theory, the perfectly rational, profit-maximizing company would accept any incremental marginal profit from monetizing its data. In practice, companies have dollar profit hurdles much larger than zero to overcome, due either to institutional inertia or to the real but difficult-to-quantify expected costs of data monetization (legal risk, press risk, employee distraction, etc.). Revenue shares are therefore not a panacea for these challenges.
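A back-of-the-envelope sketch, with invented percentages, shows how fixed revenue shares cap the number of datasets that can be combined before margins turn negative:

```python
# Hypothetical percentages for illustration only.
rev_share_per_owner = 0.10   # 10% of product sales paid to each contributing data owner
operating_cost_share = 0.30  # share of revenue spent building and running the product

for n_datasets in (3, 5, 8):
    gross_margin = 1.0 - n_datasets * rev_share_per_owner - operating_cost_share
    print(f"{n_datasets} datasets -> {gross_margin:.0%} margin")
# 3 datasets -> 40% margin
# 5 datasets -> 20% margin
# 8 datasets -> -10% margin: the product keeps improving, but the business stops working
```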
A combination of these unhappy circumstances leads to obviously marketable data being unavailable. For example, B2B transaction data is particularly fragmented, its owners set particularly high profit hurdles, and it requires a particularly heavy amount of modeling to be useful. It is perhaps not surprising, then, that it is largely unavailable as a product today.
API consolidations based on RESO or Zillow’s Bridge are not what they appear at first glance: you still need to license data with each MLS.
You may ask whether this is anticompetitive, and there have been legal cases about it (for example, here), but those cases have largely focused on sharing MLS listings between different types of realtors or brokerages.
You could imagine a range of use cases for the data – investment or economic analysis, property tech startups, etc. – that would not really undermine the purpose of protecting participating realtors and brokerages (regardless of your view of the desirability of that protection).
The first problems you mention perhaps explain why some aggregators sell insights rather than raw data. This approach solves for non-excludability (presuming that the desired insights vary by buyer) and also allows price discrimination (e.g., for MLS data, you’re told the noncompetitive economic/investment question the data needs to answer beforehand and can price accordingly).
The flip-side problem is that generating quality insights has historically required non-scalable human effort, which drastically reduces margins. Arguably, new AI 'data analysts' can change the game here by allowing data owners to create an insights business with more scalability and less investment (this is our entire focus at Datacakes).
Very interesting, thanks for sharing