Governments publish a vast variety of economic data at an impressive level of depth1. In the United States, the Bureau of Labor Statistics alone publishes more than eight hundred thousand monthly time series, covering hundreds of geographic regions.2 Individual time series can track data as specific as the wages of workers in a specific-sized restaurant business in a particular zip code. It is possible to answer questions such as where student debt saddles consumers the hardest, where the wealthy are migrating to, what the relationship between heart disease and median income is, or which socioeconomic factors predict social mobility down to the zip code3. When combined with proprietary economic data, government data is useful for benchmarking, calibration, and reference. As a benchmark, public datasets can be used to test whether proprietary datasets are accurate: for example, granular consumer spending data can be compared to survey-based government measurements to ensure that the granular data is well calibrated and free of any datasource-specific bias. Creative, high-frequency datasets, derived for example from satellite imagery of ships at ports, can be benchmarked against slower but accurate government sources, such as bills of lading, to develop faster ways to measure port congestion. As a calibration target, demographic data points can be used to reweight and correct proprietary datasets that contain survey or respondent bias toward certain demographics. Finally, as a reference dataset (or “join key”), public datasets can be used to align proprietary datasets by providing common definitions of a geographic area, commercial activity, industry, or other entity that is helpful for financial markets to agree on. Therefore, any organization that aspires to make decisions based on external data should not only be cognizant of government (or public) data but also use it fully before purchasing proprietary datasets.
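To make the calibration idea concrete, here is a minimal pandas sketch, assuming a hypothetical proprietary consumer panel that over-represents younger respondents and public population shares for the same age groups; every name and number is illustrative rather than drawn from any real dataset.

```python
import pandas as pd

# Hypothetical proprietary panel: spend per respondent, skewed toward younger users.
panel = pd.DataFrame({
    "age_group": ["18-34", "35-54", "55+"],
    "respondents": [6000, 3000, 1000],
    "avg_spend": [120.0, 180.0, 150.0],
})

# Public (census-style) population shares for the same age groups.
census_share = pd.Series({"18-34": 0.30, "35-54": 0.35, "55+": 0.35})

# Post-stratification weight: target population share / observed panel share.
panel["panel_share"] = panel["respondents"] / panel["respondents"].sum()
panel["weight"] = panel["age_group"].map(census_share) / panel["panel_share"]

# Naive vs. reweighted estimates of average spend.
naive = (panel["panel_share"] * panel["avg_spend"]).sum()
calibrated = (panel["panel_share"] * panel["weight"] * panel["avg_spend"]).sum()
print(f"naive: {naive:.2f}, calibrated: {calibrated:.2f}")
```

The weights simply rescale each group so the panel's demographic mix matches the public benchmark, which is the essence of using government data as a calibration target.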
Despite its usefulness, government data is hard to work with because of the overwhelming variety of formats and the engineering cost of integrating them. First, multiple agencies publish data on the same topic but in different formats, that is, with slightly varying definitions, granularity, and release schedules. Interagency standardization can only be achieved by carefully mapping data fields, which is a very labor-intensive task. Second, the technology that makes government data available varies from agency to agency. These two fundamental problems prevent widespread use of public data, necessitate armies of data analysts to munge through data in order to conduct basic analyses, and create a large engineering burden for enterprises that want to use the data effectively.
The disparate documentation, terminology, and conventions found in public datasets present the first problem. Does a time series identifier refer to a measurement taken in a specific geography, or is geography encoded in a separate field? When data is missing, is it because of privacy suppression or because it was not collected? Do the data contain restatements and corrections? And if a dataset is restated, how is the lineage of an individual data point tracked? There are hundreds of ways of structuring even the simplest of datasets, and there are no canonical answers. Consequently, using such data takes a lot of reshaping to achieve a consistent format. And, as a prerequisite to reshaping, it takes significant analyst effort to read through documentation and methodology documents to ensure the data are reformatted consistently across sources.
To illustrate this point further, consider comparing a time series from the Federal Housing Finance Agency (FHFA)4 to one from the US Census. Whereas the standard geographic unit for a city is a Core Based Statistical Area (CBSA), many government agencies use other designations. The FHFA releases aggregate mortgage benchmarks but groups certain cities into Metropolitan Statistical Divisions (MSDs) and others into the standard CBSA. These are still different from the Metropolitan Statistical Areas and Micropolitan Statistical Areas used by other agencies. Translation tables or crosswalks - separate datasets - are necessary to support such analysis and allow for joining5. Sometimes these crosswalk datasets are released on separate schedules and in separate formats, while at other times they need to be developed internally. In creating these crosswalks, the analyst must decide how to combine data that overlaps a group (for example, if an MSD consists of multiple CBSAs, the analyst would need to aggregate the CBSA time series). Of course, these crosswalks are only partially useful because some units of geography are not hierarchical: zip codes cross state, county, and CBSA boundaries, for example.
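To sketch the crosswalk mechanics, the snippet below joins a hypothetical crosswalk that maps CBSA codes to an MSD against hypothetical CBSA-level series and aggregates them; the codes and values are placeholders, and summing is only one possible aggregation choice, not any agency's prescribed methodology.

```python
import pandas as pd

# Hypothetical CBSA-level monthly time series (values are made up).
cbsa_series = pd.DataFrame({
    "cbsa_code": ["CBSA_A", "CBSA_A", "CBSA_B", "CBSA_B"],
    "month": ["2023-01", "2023-02", "2023-01", "2023-02"],
    "value": [100.0, 102.0, 50.0, 51.0],
})

# Hypothetical crosswalk: which CBSAs roll up into which MSD.
crosswalk = pd.DataFrame({
    "cbsa_code": ["CBSA_A", "CBSA_B"],
    "msd_code": ["MSD_X", "MSD_X"],
})

# Join the crosswalk, then aggregate CBSA values up to the MSD level.
# Summing only makes sense for additive measures (counts, dollar volumes);
# rates or index levels would instead need a weighted average.
msd_series = (
    cbsa_series.merge(crosswalk, on="cbsa_code")
    .groupby(["msd_code", "month"], as_index=False)["value"]
    .sum()
)
print(msd_series)
```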
Beyond data interpretation and formats, no standard technology exists for publishing and sharing government data. Public agencies often develop their own data portals and documentation websites or use third-party software. Some agencies have APIs, while others simply upload static files at given web addresses. Each individual dataset requires custom business logic to retrieve relevant data on the schedule it is released. In any case, users need to build their own software to systematically download data and centralize it into an environment where it can be joined with other sources or internal data. In recent years, much of that analytical computing has taken place in cloud data warehouses, which allow everyone at an organization to analyze data using SQL, a programming language, or business intelligence tools. In practice, this means that significant effort needs to be put into loading data into these warehouses, ideally normalizing and documenting it at the same time.
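As a sketch of the custom retrieval logic described above, the snippet below downloads a static CSV from an agency-style web address and normalizes it into a DataFrame that could then be loaded into a warehouse; the URL and the column cleanup are placeholder assumptions, since each agency's file layout and release cadence differ.

```python
import io

import pandas as pd
import requests

# Placeholder URL: each agency publishes files at its own addresses and cadence.
SOURCE_URL = "https://example.gov/data/monthly_series.csv"

def fetch_series(url: str) -> pd.DataFrame:
    """Download a static file and return it as a lightly normalized DataFrame."""
    response = requests.get(url, timeout=60)
    response.raise_for_status()
    frame = pd.read_csv(io.StringIO(response.text))
    # Normalize column names so downstream joins do not depend on
    # each agency's capitalization and spacing conventions.
    frame.columns = [c.strip().lower().replace(" ", "_") for c in frame.columns]
    return frame

if __name__ == "__main__":
    df = fetch_series(SOURCE_URL)
    # From here the frame would be written to a cloud warehouse
    # (for example via a bulk load or a connector) on the release schedule.
    print(df.head())
```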
These two challenges when working with public data, namely the lack of data standardization and documentation as well as the engineering cost of adding new datasets into cloud warehouses, represent an obstacle that organizations need to overcome. The speed with which data can be loaded is a major competitive advantage: if an organization strives to be data-driven, new sources of data will constantly come up. During the Covid-19 pandemic, it would have been helpful to quickly collate data from every government source, but with every government jurisdiction publishing its own data in its own formats, this proved challenging.
In the context of Cybersyn, these two public data problems are particularly fascinating because these data are a prerequisite for making meaningful use of proprietary, external data sources. It is unlikely that organizations can adopt proprietary sources of data without first having a system for ingesting arbitrary public datasets, which are often used to benchmark proprietary ones. Finally, before spending on proprietary datasets, it makes sense for organizations to exhaust the limits of free data.
My incomplete proposed solutions for some of these problems include:
Using the Snowflake Marketplace6 for transferring data. Enterprises are centralizing their data in cloud warehouses, SQL is the most widely known language for manipulating data, and the modern data stack is largely predicated on these warehouses. The Snowflake Marketplace enables data transfer without writing custom business logic (unlike APIs). The data is updated by the data provider, and every consumer can immediately query it.
Shaping data in an entity-attribute-value (“EAV”) format7. The EAV model is very flexible, allowing new time series, additional data, and new attributes to be adopted without making breaking changes for the consumer of the data. The traditional disadvantages of this approach (such as compute intensity for certain queries) are less relevant in cloud data warehouse contexts. Most importantly, this allows data, attributes, and entities to be appended without changing the schema (critical for data pipelines that span organizations)8. A minimal sketch of this shape appears after this list.
Flatten and extend schema.org9: The labor-intensive data work of mapping crosswalks ultimately has to be done only once and can then ideally be reused for every marginal dataset. In the future, advances in generative AI techniques that do not require custom labeled training data could be used to label data automatically. The caveat is that the accuracy bar here is extremely high.
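Here is the minimal EAV sketch referenced above, in pandas, assuming a hypothetical wide table of zip-code-level measures; the entity identifiers and values are made up, and the point is only that new attributes arrive as rows rather than as schema changes.

```python
import pandas as pd

# Hypothetical wide table: one column per measure, one row per entity and date.
wide = pd.DataFrame({
    "entity_id": ["zip_10001", "zip_10001", "zip_94105"],
    "date": ["2023-01-01", "2023-02-01", "2023-01-01"],
    "median_income": [65000, 65200, 92000],
    "unemployment_rate": [4.1, 4.0, 3.2],
})

# Reshape to entity-attribute-value: one row per (entity, date, attribute).
eav = wide.melt(
    id_vars=["entity_id", "date"],
    var_name="attribute",
    value_name="value",
)

# A brand-new measure is simply appended as more rows; the schema of the
# shared table (entity_id, date, attribute, value) never changes.
new_rows = pd.DataFrame({
    "entity_id": ["zip_10001"],
    "date": ["2023-01-01"],
    "attribute": ["median_rent"],
    "value": [1850],
})
eav = pd.concat([eav, new_rows], ignore_index=True)
print(eav)
```

Because the shared table keeps the same four columns, a provider can add entities or attributes without coordinating a schema migration with every consumer.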
1. But perhaps not at an impressive frequency or using impressive collection methodologies, as I wrote about here.
4. The raw data can be found at https://www.fhfa.gov/DataTools/Downloads and you can find this specific dataset here.
5. At Cybersyn, we have extended DataCommons to include additional geographic entities such as MSDs; see here.
9. I have written extensively about the DataCommons project, extending schema.org.