News & Analysis

Google Confirms Rankings Doc Leak 

The leaked data reveals API features that Google has long denied in public while discussing how its search works

A large cache of documents describing how Google ranks its search results was spotted on GitHub by a few SEO operatives. They went to town claiming that the documents proved what Google had always denied about its ranking processes. Google initially remained silent on the issue but has now confirmed that the leaked docs are authentic. 

The documents, which appear to have been inadvertently committed to GitHub around mid-March by Google’s own automated tooling, contain data that Google tracks and possibly uses in the company’s secret-sauce ranking algorithm. They give users a peek under the hood of one of the most consequential systems that has shaped the internet. 

A report in The Verge quoted Google spokesperson Davis Thompson as saying that the company would advise caution against making inaccurate assumptions about Search “based on out-of-context, outdated, or incomplete information”. “We’ve shared extensive information about how Search works and the types of factors that our systems weigh, while also working to protect the integrity of our results from manipulation,” the official said. 

What leaked, and how did it leak?

The leaked material was first outlined by search engine optimization experts Rand Fishkin (of SparkToro) and Mike King (of iPullRank), who published their analyses of the documents and their content a couple of days ago. Reports also suggested that the material was actually spotted first by another SEO specialist, Erfan Azimi of EA Digital Eagle. 

The researchers noted that the error had occurred at Google’s end through an automated tooling process back on March 13, when the automation tacked an Apache 2.0 open-source license onto a commit, a standard process for Google’s public documentation. The reports also noted that a follow-up commit on May 7 attempted to undo the earlier one. 

The leaked documents describe an older version of Google’s Content Warehouse API and offer a peek under the hood at how search engine rankings are produced. They do not contain any code. What they do contain are references to internal systems and projects, in what is likely internal documentation of the processes involved. 

It must be noted here that Google has already put a similarly named Google Cloud API document in the public domain, but the one on GitHub appears to go well beyond it. It contains references to what Google considers critical when ranking web pages for relevancy – something the SEO community is now salivating over. 

There’s still a lot one can speculate about

There are over 2,500 pages of documentation that contain more than 14,000 attributes accessible through or associated with the API. Of course, one can only speculate as to how many of these signals Google actually uses, as there is no information on what weight Google gives them in its ranking algorithm. 

Of course, SEO consultants believe that the documents contain enough detail, and that it shows significant variance from the public statements Google has made from time to time. In his post, Fishkin notes that the leak contradicts public statements by Googlers over the years, “in particular the company’s repeated denial that click-centric user signals are employed or that sub-domains are considered separately in rankings.” 

But there’s also stuff that provides clarity

In his post, King points to a statement from Google search advocate John Mueller, who notes in a video that the company doesn’t have anything like a website authority score that measures whether Google considers a site authoritative and therefore worthy of higher rankings in search results. 

However, the docs that appeared on GitHub reveal that among the Compressed Quality Signals Google stores for documents, there is in fact a “siteAuthority” score that can be calculated. Another attribute relates to the importance of click types as a ranking factor in web search, while yet another uses websites viewed on Chrome as a quality signal – this appears in the leaked API as “ChromeInTotal”. 

Of course, there are references in the documents that confirm what we have long known – that Google considers factors such as content freshness, authorship, whether a page is related to a site’s central focus, alignment between page title and content, and the average weighted font size of a term in the document’s body. 

How much of it can we believe?

Though one could question the authenticity and currency of the documents, there is no doubt that they are a treasure trove for the SEO, marketing, and publishing industries, given the secrecy Google has maintained around its algorithms. Coming on the back of Google’s testimony in the US Department of Justice antitrust case, these leaks are significant. 

The choices Google makes on search have a major impact on how the internet works for individuals and businesses of all hues. Not to mention a burgeoning class of experts who claim to be finding ways to outsmart Google’s algorithm through SEO efforts. That Google has always been vague about its processes only adds to the value of these leaked documents. 

What’s more, Google’s guarded response to the leaks, even while confirming their authenticity, further underscores that the industry is probably on to something here. How the company will respond in the future is something we will have to wait and see.