If you have spent extended periods of time with programmers, they will eventually tell you about the semantic web. The original idea, championed by World Wide Web creator Tim Berners-Lee, was to create machine-readable data that would be linked together across the internet in a central knowledge graph[1]. The semantic web would be a layer over the World Wide Web in that it would link together information and data, rather than just webpages. In practice, the semantic web hinged on tagging data elements on the web to identify and structure data points from unstructured data and then linking those data points together in ontologies. Ultimately, the aggregation of these links would create a central knowledge graph. This knowledge graph could be accessed and queried, akin to a Google search, by a human or machine agent, providing deterministic answers to queries like “What products does this company sell?”, “How many counties are there in Idaho?”, or “What is the GDP of the five largest countries by population?”. In effect, it would enable the type of question-and-answer web search that is now becoming popular with the advent of AI tools like Google’s Bard, except without having to invent AI.
The vision was also to encourage governments to publish open data in knowledge graph formats, allowing public information to be easily used in analyses or data-driven applications. Today, significant work goes into joining data sources even across different ministries or departments of the same country. A semantic web would have solved these problems for academic, commercial, and policy purposes.
Technologies specific to the semantic web were developed over the past two decades to enable this vision. Data formats such as RDF, OWL, and JSON-LD were created to store the metadata necessary to accomplish this goal. A query language, SPARQL, became the most popular way to explicitly query these knowledge graph formats. Finally, central ontologies, such as Schema.org, have been published to provide standard vocabularies for referring to the same concepts.
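To make that stack concrete, here is a minimal sketch, assuming Python with rdflib 6+ (which bundles a JSON-LD parser): a Schema.org-annotated product, invented for illustration, is loaded as RDF and then queried with SPARQL.

```python
# A minimal sketch of the semantic web stack: Schema.org terms expressed as
# JSON-LD, loaded into an RDF graph, and queried with SPARQL.
# Assumes rdflib >= 6 (which ships a JSON-LD parser); the product is made up.
from rdflib import Graph

jsonld_doc = """
{
  "@context": {"@vocab": "https://schema.org/"},
  "@type": "Product",
  "name": "Espresso Machine",
  "offers": {"@type": "Offer", "price": "249.99", "priceCurrency": "USD"}
}
"""

g = Graph()
g.parse(data=jsonld_doc, format="json-ld")  # JSON-LD -> RDF triples

# SPARQL query against the resulting (tiny) knowledge graph
query = """
PREFIX schema: <https://schema.org/>
SELECT ?name ?price WHERE {
  ?product a schema:Product ;
           schema:name ?name ;
           schema:offers ?offer .
  ?offer schema:price ?price .
}
"""
for name, price in g.query(query):
    print(name, price)  # -> Espresso Machine 249.99
```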
In practice, these technologies would enable agents to easily parse web pages (for instance, the name or price of a product on an e-commerce website) both for information retrieval and for applications. In effect, the semantic web would have made web scraping far easier, since it would depend on structured labels rather than unreliable parsing or data APIs that require custom integration for each source.
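As a sketch of what scraping looks like when those structured labels exist (the HTML below is hypothetical and only the Python standard library is used), extraction reduces to reading the page’s embedded JSON-LD rather than writing site-specific parsing logic:

```python
# Sketch: pulling structured product data out of a page that embeds
# Schema.org JSON-LD, instead of writing brittle, site-specific scraping.
# The HTML is hypothetical; only the Python standard library is used.
import json
import re

html = """
<html><body>
  <h1>Espresso Machine</h1>
  <script type="application/ld+json">
  {"@context": "https://schema.org", "@type": "Product",
   "name": "Espresso Machine",
   "offers": {"@type": "Offer", "price": "249.99", "priceCurrency": "USD"}}
  </script>
</body></html>
"""

# Grab every JSON-LD block; no per-site CSS selectors needed.
blocks = re.findall(
    r'<script type="application/ld\+json">(.*?)</script>', html, re.DOTALL
)
for block in blocks:
    data = json.loads(block)
    if data.get("@type") == "Product":
        print(data["name"], data["offers"]["price"])  # Espresso Machine 249.99
```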
Empirically, the semantic web has not come to complete fruition, and its heyday was in the 2000s. There are many theories as to why, but one unavoidable problem was the sheer cost of labeling data according to a standard ontology. Perhaps the most comprehensive and amusing criticism is Cory Doctorow’s Metacrap argument. In Doctorow’s own words:
A world of exhaustive, reliable metadata would be a utopia. It's also a pipe-dream, founded on self-delusion, nerd hubris and hysterically inflated market opportunities.
However, there have been some successes. A number of open data sources and projects, such as Data Commons, DBpedia.org, and Wikidata, serve important and occasionally commercial purposes in sharing linked data. For instance, Data Commons powers the graphs you see when you type “population of USA” into Google search. Certain narrow commercial applications are also in production: JSON-LD, a technology developed largely along the principles of the semantic web, powers Google Shopping[2]. Nonetheless, the semantic web, which was originally dubbed “Web 3.0”, has not come to fruition in the way that Web 2.0 is largely seen as a concrete, well-adopted advance.
My own interest in the semantic web comes largely from the promise of joining and unifying concepts across public and private data sources. I am somewhat less interested in the World Wide Web extension than in the ability to derive insights from publicly available data on the internet that is frustratingly hard to use. Data scientists and analysts who deal with this data spend significant time wrangling sources together, repeating the same joins over and over. Even data about relatively unambiguous concepts such as states, counties, and cities can be labeled in a multitude of ways. In certain industries, there are entire companies and products dedicated to building subsets of this type of broader semantic layer[3]. In financial services, for instance, the security master refers to centralized databases that map financial instruments (securities) to the corporate entities that issue them.
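As a toy illustration of that wrangling (the data values are made up; the FIPS codes are real), two public datasets that label the same states differently cannot be joined until someone rebuilds a crosswalk to a shared identifier:

```python
# A toy version of the wrangling analysts repeat constantly: two datasets
# refer to the same states with different labels, so a hand-built crosswalk
# to a shared identifier (state FIPS codes here) is needed before joining.
# The metric values are made up for illustration.
import pandas as pd

population = pd.DataFrame(
    {"state": ["California", "Idaho"], "population": [39_000_000, 1_900_000]}
)
unemployment = pd.DataFrame(
    {"state_abbrev": ["CA", "ID"], "unemployment_rate": [4.8, 3.2]}
)

# The crosswalk every team ends up rebuilding in some form.
crosswalk = pd.DataFrame(
    {
        "state": ["California", "Idaho"],
        "state_abbrev": ["CA", "ID"],
        "fips": ["06", "16"],
    }
)

joined = (
    population.merge(crosswalk, on="state")
    .merge(unemployment, on="state_abbrev")[
        ["fips", "population", "unemployment_rate"]
    ]
)
print(joined)
```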
I have only recently begun diving into this topic, but I have some initial thoughts, as they pertain to Cybersyn, on which I am seeking feedback. Specifically, I am concluding that:
1. No solution exists for universal standard ontologies, but the value of an ontology increases with the number of datasets it joins together.
2. Relational, SQL-based databases are the right format for a semantic database that maps together public data sources.
3. Large language models will solve the cost of labeling unstructured data.
My own view is that getting a series of competing government agencies to agree on standards is likely less tractable than simply mapping the most commercially valuable data sources together independently, post-publication. Any corporate-sponsored standard is unlikely to succeed if clear advantages accrue to the issuer of the standard. OpenFIGI and PlaceKey are reasonable attempts at corporate-sponsored standards, but it remains to be seen whether they are ultimately successful. The pragmatic approach continues to be to offer as many potential identifiers as possible when publishing data. Standards are also more useful the more datasets they join together, so simply merging multiple useful publicly available datasets gets us closer to a valuable solution. For example, at Cybersyn, we use the same geography identifier across every dataset we publish. This immediately makes all of the Data Commons data joinable to every other public domain data source we have added.
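Here is a sketch of what that shared identifier buys in practice. The table and column names are hypothetical rather than Cybersyn’s actual schema, SQLite stands in for a cloud warehouse, and the permit figure is invented; the Data Commons-style identifier (‘geoId/06’ is California) is real.

```python
# Sketch of why a shared geography identifier across datasets matters:
# once every table carries the same id, cross-dataset joins are one-liners.
# Table/column names are hypothetical, not Cybersyn's actual schema;
# SQLite stands in for a cloud data warehouse; the permit count is made up.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE data_commons_population (geo_id TEXT, year INT, population INT);
    CREATE TABLE housing_permits         (geo_id TEXT, year INT, permits INT);

    INSERT INTO data_commons_population VALUES ('geoId/06', 2022, 39000000);
    INSERT INTO housing_permits         VALUES ('geoId/06', 2022, 110000);
    """
)

# No crosswalk needed: both sources already share the same geo_id.
rows = conn.execute(
    """
    SELECT p.geo_id, p.population, h.permits
    FROM data_commons_population AS p
    JOIN housing_permits AS h USING (geo_id, year)
    """
).fetchall()
print(rows)
```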
Second, I am skeptical that knowledge-graph-specific technologies are necessary in most pragmatic contexts. The majority of analytical workloads are performed on OLAP systems that are largely relational and use SQL as the lingua franca. Increasingly, cloud data warehouses (e.g., Snowflake) are becoming centralized repositories of data that also perform the computation. The topic of NoSQL, graph, or even vector databases deserves its own essay, but I subscribe to and defer to Andy Pavlo’s arguments and wager that relational databases will remain dominant. The challenge of converting data analysts to new technologies and formats seems insurmountable. It is also unclear whether analysts would have an easier time performing analysis on NoSQL databases (my guess is no). So any semantic open data solution needs to be distributed as tabular, relational structures.
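To illustrate the claim that relational structures are enough, here is a sketch (the entities and relationships are made up) of a knowledge-graph-style query expressed as an ordinary SQL self-join over a plain triples table, with SQLite again standing in for a warehouse:

```python
# Sketch of the argument that knowledge-graph-style data fits comfortably in a
# relational database: relationships stored as a plain triples table and
# traversed with ordinary SQL self-joins. Entities are illustrative.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE triples (subject TEXT, predicate TEXT, object TEXT)")
conn.executemany(
    "INSERT INTO triples VALUES (?, ?, ?)",
    [
        ("Boise",      "locatedIn", "Idaho"),
        ("Ada County", "locatedIn", "Idaho"),
        ("Idaho",      "locatedIn", "United States"),
    ],
)

# Two-hop containment: everything located in a region that is itself located
# in the United States -- no SPARQL or graph engine required.
rows = conn.execute(
    """
    SELECT t1.subject
    FROM triples AS t1
    JOIN triples AS t2
      ON t1.object = t2.subject
    WHERE t1.predicate = 'locatedIn'
      AND t2.predicate = 'locatedIn'
      AND t2.object = 'United States'
    """
).fetchall()
print(rows)  # [('Boise',), ('Ada County',)]
```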
Finally, a major new catalyst for lowering the data mapping burden is the advent of large language models (LLMs). Computers will likely become able to extract and reason about entities in unstructured but human-readable data faster than humans will label such data to make it machine readable. This will remove much of the labeling cost that stalled the original semantic web. The advent of LLMs, however, will not obviate the need to publish datasets with context: no amount of generative AI can tell you what inflation is without some data source that publishes it. Further, the ability of LLMs to avoid hallucination and accurately access the correct data point remains a problem. Today, an LLM can provide a reasonable-sounding answer to nearly any factual question about economic data, but there are few guarantees that the answer is correct.
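As a sketch of the labeling step an LLM can now take over (the OpenAI Python SDK is used here, the input text and model name are assumptions, and any capable model could be substituted; the output still needs validation against real data), unstructured text becomes a structured record with a single call:

```python
# Sketch of using an LLM to do the labeling work the semantic web asked humans
# to do: turn unstructured text into a structured, machine-readable record.
# Uses the OpenAI Python SDK; the model name and example text are assumptions,
# and the extracted output should still be validated against trusted sources.
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

text = (
    "Acme Corp reported quarterly revenue of $1.2 billion, up 8% year over "
    "year, and opened a new facility in Boise, Idaho."
)

response = client.chat.completions.create(
    model="gpt-4o-mini",  # assumption: substitute whatever model you use
    response_format={"type": "json_object"},
    messages=[
        {
            "role": "user",
            "content": (
                "Extract entities from the text below as JSON with keys "
                "'company', 'metric', 'value', and 'location'. Text: " + text
            ),
        }
    ],
)

print(json.loads(response.choices[0].message.content))
```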
[1] The original article, published in Scientific American.
[3] The term semantic layer has now been used to refer to internal analytics metric definitions. This use is philosophically related to the Semantic Web, but not quite the same thing in practice.