Creating Impossible Data

Combining panels and ground-truth data leads to entirely new datasets

Dec 19, 2022

Every industry relies on datasets for competitive intelligence: the sale of every product, the web traffic to every website, or the usage of every mobile application. These datasets are difficult to procure, either because no single organization has all the data (for example, internet infrastructure is physically distributed and decentralized) or because organizations are not interested in sharing their data at a reasonable cost (for example, it is unlikely Amazon will share its sales data). Nonetheless, it is possible to approximate such data if the problem is framed as deriving a set of measurements (e.g. total web traffic) based on the actions of a target population (e.g. all internet users) for all entities (e.g. websites) across a domain (e.g. the internet).

Two types of imperfect but obtainable datasets, panel and ground-truth, are required. They can be combined to create estimates for all entities within a domain with a known variance. Major information services providers (“DaaS” companies[1]) rely on such methodologies to build data products, while hedge funds have begun to use the same approaches to better model metrics of interest to investors.

Panel data are longitudinal datasets that track a fraction of a target population over time. An example would be the internet browsing of a few thousand people over time (collected, say, from a mobile application or an internet service provider). The advantage of panel datasets is that they provide insights across all entities in a domain of interest. For example, in a panel of web users, one could see all of the websites users visit as well as how often they visit, click, and shop. Measurements could be calculated for any website appearing in the data, so the entire domain is covered. These datasets are often easy to assemble (in that you only need consent from panel members) and can, to some degree, be scaled by increasing the sample size (presumably by increasing financial incentives). Panel datasets can also be obtained passively, from exhaust data, if some service automatically captures such a longitudinal dataset with the consent of its users.

The term “target population” refers to all the actors whose combined actions make up the measurements of interest. The target population for an election panel would be the eligible voters who end up participating in that election; for a web traffic panel, all internet users; and for a domestic sales panel, all people who can spend money in the country. A panel dataset would be perfect if its sample consisted of the entire target population, although this is never possible in practice due to the diminishing returns to financial incentives or the incomplete market share of a passive source[2].

The disadvantage of panel datasets is that they measure only an incomplete sample of the target population. The sample is not random in a statistical sense, and it covers an unknown percentage of the target population. The former problem arises from panel composition: only some members of the target population respond, and passive sources cover only some subset of it. The latter problem occurs because the size of the target population itself is often not known (for example, how many internet users are there in total?). Taken together, these problems make it very difficult to relate the measurements taken in a panel to the measurements of interest (e.g. in a panel of 5 million internet users, what do 1 million visits mean in terms of actual total visits by the entire target population?).

The other type of data is the ground-truth dataset. Such a dataset provides a measurement for the entirety of a target population, leaving no doubt about the accuracy of the measurement, but only for certain entities. The obvious disadvantage is that these datasets are available for only a very narrow set of entities (not an entire domain), and they cannot be expanded. For example, returning to the problem of measuring website traffic, certain websites such as Wikipedia disclose their total website visits[3]. The number of websites that do this, however, is quite limited, so only a small percentage of the domain is covered. Ground-truth datasets can be very granular or very coarse (as might be the case with quarterly financial filings). In any case, they provide a full, unsampled measurement, but only for a subset of the entities of interest.

Combining ground-truth and panel datasets can allow for measurements across all entities in a domain. The overlap of entities between the panel and ground-truth can be used to calibrate the panel and relate panel measurements to measurements on the target population. In statistical terms, this relation is created by building a model on the overlapping entities, with ground-truth measurements as the dependent variable and panel measurements as the independent variable[4]. In plain English, the simplest such model checks what percentage of the target population the panel covers on average. This model can then be extrapolated to entities not covered by the ground-truth data, thereby giving measurements across the entire domain of entities.
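As a rough illustration of the simplest model described above, the sketch below averages the panel's coverage over the overlapping entities and applies that average to every entity in the panel. All site names and visit counts are made up for illustration, and the variable names are assumptions rather than anything from a real data product.

```python
from statistics import mean

# Hypothetical monthly visit counts; every number here is invented.
panel_visits = {
    "site-a.example": 150_000,
    "site-b.example": 40_000,
    "site-c.example": 9_000,
    "site-d.example": 2_500,  # no ground truth exists for this site
}
ground_truth_visits = {
    "site-a.example": 3_000_000_000,
    "site-b.example": 1_000_000_000,
    "site-c.example": 150_000_000,
}

# Entities observed in both datasets are the calibration set.
overlap = sorted(panel_visits.keys() & ground_truth_visits.keys())

# Coverage: the share of true traffic that the panel observes, per entity.
coverage = {s: panel_visits[s] / ground_truth_visits[s] for s in overlap}
avg_coverage = mean(coverage.values())

# Extrapolate: scale every panel measurement up by the average coverage,
# including entities that have no ground-truth data.
estimates = {s: v / avg_coverage for s, v in panel_visits.items()}

print(f"average coverage: {avg_coverage:.6%}")
print(f"estimated visits for site-d.example: {estimates['site-d.example']:,.0f}")
```

Averaging per-entity coverage is only one possible choice; a regression through the overlapping points would weight large and small entities differently.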

Returning to the illustrative web traffic example, the traffic to any website can be estimated using a moderately sized panel and a narrow ground-truth dataset. For instance, a panel of 5 million users covers those users’ entire internet activity, including their traffic to the websites in the ground-truth dataset, which here is a set of websites that publicly disclose their traffic (e.g. Wikipedia). The simplest plausible model would calculate, for each website with ground-truth data, what percentage of total traffic (as measured by the ground-truth data) the panel measurement represents[5]. With this percentage calculated as an average across all websites present in both datasets, one could scale up the observed panel measurement for all websites. It is further possible to calculate what error rate this “scale up” factor leads to on overlapping entities, and thus be reasonably confident about how accurately we are tracking websites outside the ground-truth data as well. Of course, in practice, such models could do far better than averaging: we could consider that our panel members might be biased and make up a greater percentage of traffic on certain websites than on others, and that this can change over time. These refinements, however, amount to nothing more than adding covariates to the illustrative model.
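One way to put a number on the error of the “scale up” factor is a leave-one-out check: hold out each overlapping website in turn, compute the average coverage from the remaining ones, and compare the resulting estimate with the disclosed figure. The leave-one-out scheme is an assumption made here for illustration, not a method named in this post, and the function name and data are hypothetical.

```python
from statistics import mean


def leave_one_out_errors(panel, truth):
    """Absolute percentage error per overlapping entity, where each entity
    is estimated using coverage averaged over the *other* overlapping entities.
    Needs at least two overlapping entities."""
    overlap = sorted(panel.keys() & truth.keys())
    errors = {}
    for held_out in overlap:
        others = [s for s in overlap if s != held_out]
        coverage = mean(panel[s] / truth[s] for s in others)
        estimate = panel[held_out] / coverage
        errors[held_out] = abs(estimate - truth[held_out]) / truth[held_out]
    return errors


# Hypothetical visit counts, invented for illustration.
panel = {"site-a.example": 150_000, "site-b.example": 40_000, "site-c.example": 9_000}
truth = {
    "site-a.example": 3_000_000_000,
    "site-b.example": 1_000_000_000,
    "site-c.example": 150_000_000,
}

errors = leave_one_out_errors(panel, truth)
for site, err in errors.items():
    print(f"{site}: {err:.1%}")
print(f"mean absolute percentage error: {mean(errors.values()):.1%}")
```

If the held-out errors are large, that is a hint that panel coverage varies by website and that covariates are worth adding.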

[1] See here for a shortlist

[2] A good counterexample in the United States would be the Decennial Census, which at least attempts to build a panel from every member of its target population. The vast majority of US Census data products are only ever a sample, though.

[3] https://stats.wikimedia.org/#/all-projects

[4] The ground-truth data is your y variable and the panel measurement is your x. So your regression is y_i ~ x_i for each website i that is observed in both the ground-truth and the panel data.

[5] For example, our panel says 150,000 people visited Wikipedia in November, while Wikipedia (ground-truth) tells us the actual number was 3B. So we can assume our panel makes up 0.005% of internet traffic. If this were our only ground-truth data point, our best estimate in the simplest model would be to scale up all panel measurements by 1/0.005%, that is, by a factor of 20,000.
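The arithmetic in this note can be checked in a few lines. The two Wikipedia figures are the ones quoted above; the second site and its panel count are hypothetical and only show how the factor would be applied.

```python
panel_visitors = 150_000               # panel figure quoted in the note
ground_truth_visitors = 3_000_000_000  # disclosed figure quoted in the note

# Share of true traffic seen by the panel, as a percentage.
coverage_pct = 100 * panel_visitors / ground_truth_visitors

# Scale-up factor: how many real visitors each panel visitor stands for.
scale_factor = ground_truth_visitors / panel_visitors

# Hypothetical site with no disclosed traffic (made-up panel count).
other_site_panel_visitors = 1_200
other_site_estimate = other_site_panel_visitors * scale_factor

print(f"coverage: {coverage_pct:.3f}%")
print(f"scale factor: {scale_factor:,.0f}")
print(f"estimated visitors for the other site: {other_site_estimate:,.0f}")
```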

Magis
