SimCity & Data Commons

Nov 15, 2022

SimCity was my favorite video game growing up. I began to appreciate the complexity of the game just as SimCity 4 was released, so that is the iteration I grew up playing. As a naive 11-year-old, I believed that SimCity was an accurate representation of the job of a government: I literally believed the mayor of Winnipeg could take in the city’s economy, energy, climate, and infrastructure at a glance. If that were the case, then even when poor governing decisions were made, it would be easy to see the results and correct them. Something about this idea guided my later interests in statistics, economics, and operations research.

This raises the question: what would it take for the real world to be modeled and represented by the sort of data available in SimCity? Would it be possible for businesses and governments to actually view the economy as SimCity players do?

The problem is not an engineering one¹. The amount of data required for such a representation is not particularly large compared to the amount processed in fields such as ad-tech, astronomy, or weather forecasting. Recent advances in data warehouse technologies, along with the modern data stack, have created a wealth of tools that researchers and data scientists could use to process the relatively modest amount of data such an effort would entail.

The problem is also not one of new measurements or surveying: the data for such a representation of our economies already exists. Government agencies in the United States alone publish hundreds of thousands of survey results per year across economics, climate, energy, and many other topics. Further, corporations themselves hold tremendous amounts of data measuring what the economy is doing in real time². Finding incentive structures that open up access to corporate data may be a challenge, but even the public data on its own is valuable³.

The missing piece is the integration of datasets and a user interface allowing for the manipulation of that data.

The integration of datasets is the more important of the two. Take a seemingly simple question: is heart disease prevalence associated with projected temperature increases across counties in the United States?⁴ This question could easily be the subject of an academic research paper and take months of data analysis to answer. In practice, most of that work would go into finding sources for the data and finding common identifiers between the datasets so that they can be joined. Most likely, a graduate student would be doing this work, which consists of little more than reading methodology reports and data dictionaries and stitching together code to join things.
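To make that stitching concrete, here is a minimal sketch in Python of joining two county-level tables on a common identifier. Everything in it is an assumption for illustration: the column names (`fips`, `GEOID`, `heart_disease_pct`, `projected_increase_c`) are hypothetical, and the numbers are placeholder values rather than real statistics. The one real-world detail is that U.S. county FIPS codes are five digits with possible leading zeros, so a source that stores them as integers has to be normalized before it will match one that stores them as strings.

```python
from math import sqrt

# Hypothetical toy rows. The values are placeholders, NOT real statistics.
# Source A stores county FIPS codes as integers (leading zeros lost).
heart_disease_rows = [
    {"fips": 1001, "heart_disease_pct": 6.0},
    {"fips": 1003, "heart_disease_pct": 7.0},
    {"fips": 6037, "heart_disease_pct": 5.0},
    {"fips": 48201, "heart_disease_pct": 8.0},
]
# Source B stores the same identifier as a zero-padded string named GEOID.
temperature_rows = [
    {"GEOID": "01001", "projected_increase_c": 2.0},
    {"GEOID": "01003", "projected_increase_c": 3.0},
    {"GEOID": "06037", "projected_increase_c": 1.0},
    {"GEOID": "36061", "projected_increase_c": 2.5},
]


def normalize_fips(value):
    """Return a county FIPS code as a zero-padded 5-character string."""
    return str(value).strip().zfill(5)


def join_on_county(left, left_key, right, right_key):
    """Inner-join two lists of dicts on a normalized county FIPS code."""
    right_by_fips = {normalize_fips(r[right_key]): r for r in right}
    joined = []
    for row in left:
        fips = normalize_fips(row[left_key])
        match = right_by_fips.get(fips)
        if match is None:
            continue  # county missing from the other source
        merged = {"fips": fips}
        merged.update({k: v for k, v in row.items() if k != left_key})
        merged.update({k: v for k, v in match.items() if k != right_key})
        joined.append(merged)
    return joined


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


joined = join_on_county(heart_disease_rows, "fips", temperature_rows, "GEOID")
r = pearson(
    [row["heart_disease_pct"] for row in joined],
    [row["projected_increase_c"] for row in joined],
)
print(f"{len(joined)} counties matched; correlation on toy data: {r:.2f}")
```

The correlation printed here says nothing about the real world, since the inputs are invented. The point is that the join and the statistic take a few lines, while knowing that `fips` and `GEOID` refer to the same thing, and that one of them has lost its leading zeros, is the part that comes from reading data dictionaries.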

The Data Commons project aims to solve this integration issue. Founded by R.V. Guha, the creator of schema.org and a Google Fellow⁵, Data Commons integrates data from public domain sources into a knowledge graph. That knowledge graph powers what you see when you type “Population of India” into the Google search bar⁶. As my own contribution, this data is now also available in Snowflake.

The user interface problem is also significant. There is a shortage of technical talent able to work with data. Further, those with the technical skills to manipulate data are often not the same people who have the most context about the questions to be answered. In academia, this is often handled by having graduate students do the data analysis while professors and researchers pose the questions. In industry, data scientists and data analysts build dashboards in response to specific requests from business leaders. Data Commons aims to solve this issue as well, and I have been working on a contribution here too, in the context of Streamlit.