The Average Web Page (Data from Analyzing 8 Million Websites)
The following is a guest post by Catalin Rosu, who, along with some colleagues, dug up a ton of data about the HTML content of websites. This is the most recent study of its kind, and the results are wildly fascinating. I find it especially fun to compare the top results with what I would have guessed.
We’ve all been there. We try to improve our HTML code, making it clean, beautiful, and readable. We do this in pursuit of better semantics and better accessibility, so that everyone can use it. It’s our top priority. And we always have questions:
- What is the best way to structure the markup?
- How are others doing it?
Questions like these were running through my mind. I wondered how people write markup these days, as new web technologies emerge. So I teamed up with a few of my colleagues at AWRCloud, and we came up with a data set of over 8 million pages from Google's top twenty results.
The studies that came before this one
Back in 2005, Ian Hickson, the editor of the HTML5 specification, analyzed a sample of slightly over a billion documents to see what the web is made of. A billion is an enormous number, but to Google, nothing is impossible. From this huge number of documents, he extracted valuable information about popular class names, elements, attributes, and related metadata. The outstanding results were later published as Web Authoring Statistics, which is still the most powerful web authoring study ever made.
More recently, in 2008, the Opera Metadata Analysis and Mining Application crawler, MAMA, ended up analyzing about 3.5 million URLs. Brian Wilson, the author of this impressive work, expanded the study by publishing results detailing page structures, including HTML, CSS, and JavaScript.
One of the analyses from Web Authoring Statistics that later proved vital for the in-progress development of HTML5 was a list of the most popular class names in those HTML documents. The Opera MAMA crawler also searched for the most common class names and, going beyond Google's results, published findings on the most popular ID attribute values given to elements.
What does this study add to the conversation?
The data for this study comes from 8,021,323 index pages gathered from the top twenty Google results for about 30 million keywords, chosen by keyword volume. In other words, we started with 30 million keywords, ran a Google search for each one, added the URLs of the top 20 results to a list, and removed the duplicates.
We can only assume that these web pages are highly relevant to the general web population. That assumption rests on the likelihood that they are popular, high-traffic websites, commensurate with their search result positions.
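The collection step described above (search each keyword, keep the top 20 URLs, drop duplicates) can be sketched roughly as follows. This is only an illustration, not the pipeline the study actually used: the function name `collect_unique_urls`, the `top_results` callback, and the sample data are all hypothetical stand-ins.

```python
from typing import Callable, Iterable


def collect_unique_urls(
    keywords: Iterable[str],
    top_results: Callable[[str], list[str]],
    limit: int = 20,
) -> list[str]:
    """Gather up to `limit` result URLs per keyword, dropping duplicates."""
    seen: set[str] = set()
    unique: list[str] = []
    for keyword in keywords:
        # Only the first `limit` results for each keyword are considered.
        for url in top_results(keyword)[:limit]:
            if url not in seen:
                seen.add(url)
                unique.append(url)
    return unique


# Hypothetical stand-in for a real search lookup (assumption, not study data).
fake_results = {
    "css grid": ["https://a.example/", "https://b.example/"],
    "flexbox": ["https://b.example/", "https://c.example/"],
}

urls = collect_unique_urls(fake_results, lambda kw: fake_results[kw])
print(urls)
```

Because many keywords surface the same sites, the deduplicated list ends up far shorter than keywords times twenty, which is consistent with 30 million keywords yielding about 8 million pages.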
How fresh is this data?
The latest data set is from May 20th, 2016.
This new study will never surpass the former study Google made back in 2005. It’s not about overcoming Opera’s great study either. It’s about finding new and relevant insights on the actual markup used by the most popular and successful web pages on the internet.
So, what does the average HTML page look like nowadays? Take a look at the screenshots below and check out the study for the full statistics.
The Stats
Following our study, we find that the average website index page uses twenty-six different element types.
The top twenty-six elements, ranked by the number of pages they appear on, are:
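To make the "different element types per page" figure concrete, here is a minimal sketch of how such a number could be computed with Python's standard `html.parser` module. It is an assumption about the general approach, not the study's actual tooling. The names `ElementTypeCollector`, `distinct_element_types`, and `average_distinct_types` are made up for this example, and the two sample pages are toy input. Note that this parser only counts tags that literally appear in the source; it does not add implied elements the way a browser would.

```python
from html.parser import HTMLParser


class ElementTypeCollector(HTMLParser):
    """Records the set of distinct tag names seen in a document."""

    def __init__(self):
        super().__init__()
        self.element_types = set()

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names, so <DIV> and <div> count once.
        self.element_types.add(tag)


def distinct_element_types(html):
    collector = ElementTypeCollector()
    collector.feed(html)
    collector.close()
    return collector.element_types


def average_distinct_types(pages):
    """Mean number of distinct element types across a list of HTML strings."""
    counts = [len(distinct_element_types(page)) for page in pages]
    return sum(counts) / len(counts)


# Toy pages for illustration only (not data from the study).
page_one = "<html><head><title>A</title></head><body><p>Hi</p><p>Again</p></body></html>"
page_two = "<html><body><DIV>x</DIV></body></html>"

print(sorted(distinct_element_types(page_one)))
print(average_distinct_types([page_one, page_two]))
```

Repeated elements such as the two `<p>` tags count once per page, since the statistic is about element types rather than element instances.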
Among the document type declarations that specify which version of (X)HTML a page is using, the latest HTML5 doctype is clearly leading the way.
If we look at all the elements that specifically tell browsers or search engines about the site and how to style it, we find about 175 million elements. Here’s how they break down:
The breakdown of the 105 million elements for content sectioning looks like this:
Of the billion text content elements: