Measuring web traffic - the volume and characteristics of visitors to web properties - is critical to the digital economy. Website visits are the top of the funnel for all internet-driven eCommerce. Each visitor represents an opportunity to make a sale: a “person that walks into the shop.” Web visitors also make up the audience for the advertising served by that website. Understanding how many people are visiting and who they are (or at least what characteristics describe them) makes it possible to sell and advertise better (the primary monetization of digital properties). Measuring web traffic also allows for competitive insights. What other websites are customers browsing? Is slow growth a result of macroeconomic conditions or specific to a website? Web traffic is the ultimate normalized comparison between digital properties.
Despite its importance, broad internet measurement is extraordinarily difficult without access to every server that hosts content. Measuring web traffic amounts to counting server responses to a client. However, the decentralized nature of the internet makes it difficult to get in between every such request[1]. In contrast, brick & mortar sales in the United States are comparatively very centralized: Visa & Mastercard make up the vast majority of payment processing[2], the IRS collects every business’s revenue, and the US Census can compel any business to respond to surveys[3]. It is far from perfect, but it is pretty straightforward to get accurate estimates[4]. It seems somewhat unfathomable that any entity will ever be similarly positioned with regard to web traffic, and even less likely that such an entity would share the data with third parties.
Of course, this is possible in some large localized network areas: Amazon has access to records for all websites relying on AWS servers, an ISP may have a monopoly in a country, and infrastructure providers like Cisco or Cloudflare[5] or some very large internet agencies have visibility into the traffic that passes through them. One special case is walled gardens, where the measuring entity also controls publication on the platform[6]. In these economically happy situations, the entity can price advertising very efficiently because it knows exactly who is seeing content.
A second core difficulty with web traffic measurement is that metric definitions are practically difficult to standardize. Even if every server owner offered to contribute statistics to a centralized data co-op, everyone would also need to measure traffic using exactly the same definitions and software. Basic metrics such as users, sessions, or page views are immensely complicated to count consistently because of the intricate and varied ways people access web properties. For instance:
Users: When counting users, do you include bot traffic (can you even determine with certainty who is a bot)? Can you keep track of the same user if their IP or MAC address changes? Can you track the same user across devices?
Page views: A page view is simply the count of unique web pages loaded by any user – but even that raises at least a couple of challenges. Many websites use AJAX to load partial pages and content – does an AJAX call count as a page view or not? In some cases, such as a SaaS application, probably not. In other cases, such as when someone scrolls through a list that would otherwise be paginated with unique URLs, it seems like you should count it if you want to make like-for-like comparisons. Does viewing a subdomain also count as a page view for the main page? What about subdomains that sit in iFrames or are accessed primarily through an unrelated domain (e.g. a checkout page hosted by Shopify)?
Sessions: Sessions are similarly ambiguous. How long does a person need to be away or inactive before their next visit counts as a new session? It seems like the answer should depend on the website’s content. Do we count sessions that are initiated simply by someone re-opening their browser and the browser reloading their previous tabs? Can users have parallel sessions (e.g. multiple tabs), or is each tab switch a new session? If users can have parallel sessions, how long can a tab stay inactive before we count it as a new session? And if the user has to be physically away from their browser, how do we deal with mobile phones, where users rarely “shut down” the web browser? Even the simplest of these choices, the inactivity timeout, materially changes the numbers, as the sketch below shows.
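To make the ambiguity concrete, here is a minimal sketch (illustrative only, not any provider’s actual method) that counts sessions from one hypothetical user’s page-view log under different inactivity timeouts. The `count_sessions` helper and the timestamps are made up for the example:

```python
from typing import List

def count_sessions(event_minutes: List[float], timeout_minutes: float) -> int:
    """Count sessions in a sorted event log: a new session starts whenever
    the gap since the previous event exceeds the inactivity timeout."""
    if not event_minutes:
        return 0
    sessions = 1
    for prev, curr in zip(event_minutes, event_minutes[1:]):
        if curr - prev > timeout_minutes:
            sessions += 1
    return sessions

# One hypothetical user's page views, in minutes since their first visit.
events = [0, 2, 5, 25, 27, 70, 71, 150]

for timeout in (10, 30, 60):
    print(f"{timeout}-minute timeout -> {count_sessions(events, timeout)} sessions")
# 10-minute timeout -> 4 sessions
# 30-minute timeout -> 3 sessions
# 60-minute timeout -> 2 sessions
```

Same log, same user, three different answers – and that is before any of the parallel-tab or mobile questions above.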
The point is that defining these metrics does not come down to simply agreeing on a standard; the definitions are necessarily ambiguous because of the variety of web properties and the ways people interact with them. Web traffic measurement is materially different from television measurement, where reach is defined far less ambiguously over discrete, non-continuous events and actions.
So, given these two difficulties (imperfect data access and necessarily ambiguous metric definitions), how well is web traffic estimated? The best public summary of commercial data providers I have found is Rand Fishkin’s November 2022 article, Which 3rd-Party Traffic Estimate Best Matches Google Analytics? The short answer is that none of the major web traffic providers are particularly good at estimating magnitude, although most providers’ data are at least moderately correlated with the ground truth as measured by the websites themselves. The core metric Fishkin reports is the percent of the time that a provider’s monthly user estimate was within +/-30% of the number reported by Google Analytics (the best commercial data provider lands within that range two thirds of the time for large websites). All the providers end up with correlations around ~0.6-0.7, indicating it is comparatively easier to get sequential trends correct (the first derivative) than absolute values[7]. Fishkin doesn’t report rank correlation, which would answer whether these data providers get relative size correct.
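For concreteness, here is a rough sketch (my own illustration, not Fishkin’s methodology or data) of how those three scores could be computed from paired provider-vs-Google-Analytics estimates: the share within +/-30%, the Pearson correlation, and the Spearman rank correlation. The site counts are invented and the pairing (across sites for one month) is an assumption:

```python
from scipy.stats import pearsonr, spearmanr

def pct_within_band(provider, ga, band=0.30):
    """Fraction of sites where the provider's estimate falls within
    +/-band of that site's own Google Analytics number."""
    hits = [abs(p - g) / g <= band for p, g in zip(provider, ga)]
    return sum(hits) / len(hits)

# Hypothetical monthly-user counts for five sites (not Fishkin's data).
ga       = [120_000, 45_000, 980_000, 15_000, 310_000]   # site-reported GA numbers
provider = [150_000, 30_000, 700_000, 18_000, 260_000]   # one third-party's estimates

print(f"within +/-30% of GA: {pct_within_band(provider, ga):.0%}")  # magnitude accuracy
print(f"Pearson r:           {pearsonr(provider, ga)[0]:.2f}")      # correlation with GA values
print(f"Spearman rho:        {spearmanr(provider, ga)[0]:.2f}")     # rank (relative size) agreement
```

Note that a provider can order sites correctly (a high Spearman rho) while still missing most absolute values, which is exactly why rank correlation would be a useful companion to the figures Fishkin reports.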
What can be done to improve? We have been working on this problem for the past few months, and I hope to say more soon. Nonetheless, a principled measurement approach begins with the approach I outlined a year ago:
[1] Wikipedia’s article on internet governance is quite good. There is a persistent debate about how centralized the internet might be, but this seems to be more of a governance point than a practical reality.
[2] To be more accurate, Visa, Mastercard, American Express, and Discover dominate the market: https://wallethub.com/edu/cc/market-share-by-credit-card-network/25531
[5] Cloudflare publishes the useful Cloudflare Radar. Cisco publishes a ranked list of websites based on DNS lookups across its Umbrella network.
[6] Obviously, Meta would be the canonical example.
[7] This might well be because calibrating magnitude for web traffic is particularly difficult, as there is no easily definable target population of internet users (unlike, say, spending households or consumers).