Governments and other organizations often publish “open” data. The data is typically free to access, but the cost of using it effectively can be significant. In a previous essay[1], I described some of the initial challenges and contemplated solutions in operationalizing and distributing open data. Below is an extended list of additional specific challenges Cybersyn is contemplating as we continue to distribute this data.
Point-in-time Data
It is often useful to understand how data has changed over time. The historical values of some property are valuable, as is the record of when those values changed. Scientific and engineering cultures refer to this concept as point-in-time (PIT) data or bi-temporality[2], or by vendor-specific names like Snowflake’s Time Travel[3].
Capturing revisions as published by the source
Many public data sources issue restatements or revisions. For example, the Bureau of Labor Statistics often restates historical unemployment or inflation numbers as more accurate data becomes available. The challenge here is organizing these revisions so that both the originally published values and the restatements remain accessible.
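As a minimal sketch of what that organization can look like: each observation carries two time dimensions, the period it describes and the date the source published it. The names below are hypothetical, not Cybersyn’s actual schema.

```python
from dataclasses import dataclass
from datetime import date

@dataclass(frozen=True)
class Observation:
    period: date     # the month the statistic describes
    published: date  # when the source published this value
    value: float

def value_as_of(rows: list[Observation], period: date, as_of: date) -> float | None:
    """Return the value for `period` as a reader would have seen it on `as_of`,
    i.e. the most recently published revision on or before that date."""
    candidates = [r for r in rows if r.period == period and r.published <= as_of]
    if not candidates:
        return None
    return max(candidates, key=lambda r: r.published).value

# Example: a statistic for January, restated once.
history = [
    Observation(date(2023, 1, 1), date(2023, 2, 3), 3.4),   # initial release
    Observation(date(2023, 1, 1), date(2023, 3, 10), 3.6),  # restatement
]
assert value_as_of(history, date(2023, 1, 1), date(2023, 2, 15)) == 3.4
assert value_as_of(history, date(2023, 1, 1), date(2023, 4, 1)) == 3.6
```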
Capturing revisions not published by the source
Other public data sources revise data but permanently overwrite or delete the historical values. This means that a consumer of the data today cannot view the history of changes or reproduce what they would have seen at a past point in time. The only solution is for the data distributor, Cybersyn, to snapshot the data and maintain the revision history internally.
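A minimal sketch of that snapshotting approach, assuming the source republishes a full table on each pull and offers no changelog of its own (all names here are illustrative):

```python
from datetime import datetime, timezone

def diff_snapshot(previous: dict[str, float], current: dict[str, float]) -> list[dict]:
    """Compare today's pull against the last snapshot and emit revision
    records, since the source itself overwrites its own history."""
    observed_at = datetime.now(timezone.utc).isoformat()
    revisions = []
    for key, value in current.items():
        old = previous.get(key)
        if old != value:
            revisions.append({
                "key": key,
                "old_value": old,  # None means the row is new
                "new_value": value,
                "observed_at": observed_at,
            })
    for key in previous.keys() - current.keys():
        revisions.append({
            "key": key,
            "old_value": previous[key],
            "new_value": None,  # the source deleted the row
            "observed_at": observed_at,
        })
    return revisions
```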
Capturing revisions and errors incurred by us
As a data distributor, we see value in keeping a record of the actions we actually took or omitted, even when they have nothing to do with the underlying data source. This allows customers to understand how relying on data delivery via Cybersyn would impact them and prevents the consumer from recursively incurring the costs above.
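One way to keep such a record, sketched with hypothetical names below, is to log every load attempt alongside the data itself, including attempts that failed or were skipped:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class LoadEvent:
    source: str
    status: str  # "loaded", "skipped", or "failed"
    detail: str = ""
    at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))

audit_log: list[LoadEvent] = []

def record(source: str, status: str, detail: str = "") -> None:
    """Append a load event so consumers can reconstruct what the
    distributor actually did, independent of the upstream source."""
    audit_log.append(LoadEvent(source, status, detail))

record("bls_unemployment", "loaded")
record("usps_migration", "failed", "upstream endpoint returned 503")
```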
Schema Migrations
Schema migrations introduced by the source
The publisher of a data source may change its underlying schema. This can happen because some data is no longer available, new values become available, or a methodology has changed.
A particularly problematic version of this issue occurs when a data source elects to stop publishing statistics. For instance, the USPS has recently indicated that it will stop publishing population migration data[4].
Schema migrations introduced by us
As a data distributor, structuring data into a consumable, joinable format requires developing a schema that anticipates future improvements. It has proven extremely difficult to anticipate every change or improvement that will be thought of later, so the schema migrates over time to reflect improvements or better ergonomics.
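A toy sketch of what such a migration involves, with invented column names: when a column is renamed or a new one is introduced, rows produced under the old schema have to be mapped onto the new layout.

```python
# Hypothetical migration: "geo" was renamed to "geo_id", and an explicit
# "unit" column was added that older rows predate.
RENAMES = {"geo": "geo_id"}
DEFAULTS = {"unit": "persons"}  # backfill for rows from the old schema

def migrate_row(row: dict) -> dict:
    """Map a row written under the old schema onto the current layout."""
    migrated = {RENAMES.get(k, k): v for k, v in row.items()}
    for column, default in DEFAULTS.items():
        migrated.setdefault(column, default)
    return migrated

old_row = {"geo": "US", "date": "2023-01-01", "value": 1234}
assert migrate_row(old_row) == {
    "geo_id": "US", "date": "2023-01-01", "value": 1234, "unit": "persons",
}
```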
Licenses
Copyleft agreements
Certain public datasets are free but require attribution or adherence to other terms. Often, these terms must be passed along through any derived works and impose restrictions on both data intermediaries and end users. The philosophy of Copyleft aside, a data product that uses a combination of such sources often inherits a tangle of overlapping obligations.
Not-actually-free licenses
In other cases, public datasets contain elements that may not actually be freely licensed. While the publishing agency may avoid scrutiny because of its status as a government agency or non-profit, using those data elements in a commercial product could create serious liabilities. For example, CUSIP codes that identify securities are a proprietary identification system that requires a license to redistribute. Yet, CUSIP codes appear in many ostensibly public domain datasets published by government agencies. It is not clear on what terms those agencies are providing the CUSIP data, and the issue of licensing appears to be a “gray” area. Certain organizations, such as the Small Business Administration, have started moving away from such identifiers (in the SBA’s case, away from Dun & Bradstreet’s DUNS number), but historical data still relies on them.
Machine Readability
Website blockers
Many ostensibly public datasets sit behind logins or rate-limited API endpoints, or lack bulk download options. It remains a mystery to me, for instance, why the Delaware Division of Corporations forbids automated web scraping while not making bulk downloads available. Why shouldn’t public records be public?
Data formats
Data formats are incredibly varied. Often, former standards become today’s inconveniences: XML, a standard championed in the mid-2000s, has fallen out of favor relative to JSON, but many open data sources still rely on various versions of the XML standard. Partially adopted standards pose another challenge: data may be selectively available in XBRL[5] or RDF/SparQL[6] formats, so building a complete dataset involves stitching together formats even from the same source. For reasons outside the scope of this article, Cybersyn provides all data as relational tables in a SQL data warehouse, which necessitates consolidating all of these formats. Regardless of the preferred data format one chooses, some data sources will inevitably be challenging to consume.
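To make the consolidation concrete, here is a minimal sketch that flattens a small XML payload into the kind of flat rows a relational warehouse expects; the element and column names are invented for illustration.

```python
import xml.etree.ElementTree as ET

XML = """
<observations>
  <obs date="2023-01-01" value="3.4"/>
  <obs date="2023-02-01" value="3.6"/>
</observations>
"""

def xml_to_rows(payload: str) -> list[dict]:
    """Flatten each <obs> element into a flat dict suitable for a SQL load."""
    root = ET.fromstring(payload)
    return [
        {"date": obs.get("date"), "value": float(obs.get("value"))}
        for obs in root.iter("obs")
    ]

rows = xml_to_rows(XML)
assert rows[0] == {"date": "2023-01-01", "value": 3.4}
```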
Upstream Aggregators
Finally, it is often convenient to retrieve data from sources that themselves aggregate data from underlying sources, such as the Data Commons Project or the St. Louis Fed’s FRED system. Retrieving data from these aggregators, while convenient, introduces complexity: all of the above issues multiply, since each can be incurred by the aggregator in addition to being incurred by Cybersyn.
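As an illustration of the extra bookkeeping this implies, below is a hedged sketch of pulling a series from FRED and stamping each row with where and when we retrieved it. FRED does expose an observations endpoint of roughly this shape, but treat the exact parameters and response fields as assumptions; a free API key is required.

```python
import json
import urllib.request
from datetime import datetime, timezone

FRED_URL = "https://api.stlouisfed.org/fred/series/observations"

def fetch_series(series_id: str, api_key: str) -> list[dict]:
    """Fetch a series from the FRED aggregator and attach provenance,
    since the aggregator's own revisions and schema changes now stack
    on top of those of the original publishing agency."""
    url = f"{FRED_URL}?series_id={series_id}&api_key={api_key}&file_type=json"
    with urllib.request.urlopen(url) as resp:
        payload = json.load(resp)
    retrieved_at = datetime.now(timezone.utc).isoformat()
    return [
        {
            "series_id": series_id,
            "date": obs["date"],
            "value": obs["value"],
            "retrieved_from": "FRED",  # one hop removed from the source
            "retrieved_at": retrieved_at,
        }
        for obs in payload["observations"]
    ]
```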
Documentation
Standardizing documentation across sources is a significant effort. Documentation for public data sources is scattered across formal documents, websites, and files included alongside the data. Documentation for previous versions of the data is often missing or formatted entirely differently.
The most important goal of Cybersyn’s documentation is to make data discoverable. We have failed if someone looking for data we publish does not find it. The design challenges here have surprised me[7].
If these problems appeal to you, we are hiring. If you do not see a role that specifically applies to you but these problems sound interesting, please reach out.
[5] XBRL is meant to be a structured format for financial reporting, primarily used by the SEC in the United States.
[6] Details on RDF/SparQL: https://www.w3.org/TR/sparql11-query/
[7] If you happen to have feedback on our docs, please email me.