Revisiting Google’s Information Retrieval Based Upon Historical Data
Can patents be said to have family histories? If so, this post is going to introduce a barely known ancestor to one of the most written about search related patents on the Web, as well as a brand new grandchild to the patent.
The patent is Google’s Information retrieval based on historical data, which was filed in 2003, and granted in 2008. When it was published as a pending patent application in 2005, it created a pretty big stir amongst the forums and blogs of the search community.
The patent has two focuses, both of which take advantage of recording changes to a site over time. One is to help identify web spam, and the other is to help avoid stale documents being returned in response to a query. It raised questions among SEOs, such as how important the ages of domains and of links are, as well as:
- Does Google favor fresher sites over older sites, or older sites over fresher sites?
- Even more, how does Google weigh the age of a website?
- Are the search engines looking at whois data to see who owns websites, and if there has been a change of ownership?
- If the content of a site changes, and the anchor text pointing to it remains the same even though it’s no longer relevant, will it still rank for the terms in the anchor text?
- If you buy a website and make changes to it, will the PageRank for that site start to evaporate or expire?
That’s really just scratching the surface.
One thing that caught many eyes at the time was that one of the named inventors on the patent was Google’s Head of Webspam, Matt Cutts, who was well known in the community for his interactions with forum members on behalf of Google, and his participation in conferences and with the press. (Actually, the whole roster of inventors listed on the patent is like an all-star team of search engineers.)
Another was that it said things like the amount of time a domain name was registered might be an indication of whether or not it was intended to be a spam site – with spammers usually only registering a site for a year, and people more “serious” about their businesses registering their sites for longer.
Matt has rebutted that assertion more than once since the patent was published, but hosting businesses such as GoDaddy caught wind of it, and used the FUD (fear, uncertainty, and doubt) behind the patent as a selling point to try to get people to register their domains for longer than a year. Regardless of whether it was true or not, they saw the possibility of using the information within the patent as a path to more profits.
The patent reads in part like a list of ways that Google might try to catch people engaging in web spam.
The Patent’s Father – Information Retrieval Based Upon Historic Information
There are some patents I recommend anyone interested in learning about SEO should read.
The little known first PageRank patent is one of them.
The Reasonable Surfer patent is another.
The one on historical data that I linked to above is a third.
If you can grasp the points and ideas in those, and the assumptions behind them, you’ll have a decent foundation to build upon while learning and doing SEO.
I was doing research on a newer patent filing related to the historical data patent (more on that below), when I noticed that there was an earlier provisional patent filed that was also related, by the name of Information Retrieval Based Upon Historic Information (pdf – Application Number 60/507,617). It covers almost exactly the same ground as the patent that was granted, but it does so in language that’s a little easier to grasp, without as much legal mumbo jumbo.
If you’re teaching yourself SEO or have an SEO training program in place where you work, it definitely wouldn’t hurt to share the document with others, discuss and debate it, try to figure out what things from it Google might be using today, or just revisit it if you hadn’t had the chance to spend some time with it before (or even if you have).
Document Scoring Based Upon Document Content Update
Want to visit websites which look like they might have been designed in the 1990s? Do searches on Google for terms like [bill of rights] where older sites that haven’t seen substantial changes to the main content of their pages are likely favored in search results.
A funny thing happened to the original Historic Data patent along the way to its being granted. Google filed a handful of patent applications that could be said to be the children of the patent. Those traveled through the USPTO and became full fledged granted patents.
I was looking through the recently published patent applications this week and recognized the name of one of those patents, with a completely different set of claims attached to it. The patent still had the original description, but the claims focused upon a different way of identifying which section of a page it might look at to see how much that page had changed.
The patent application is:
Document Scoring Based on Document Content Update
Invented by Anurag Acharya, Jeffrey Dean, Paul Haahr, Monika Henzinger, Steve Lawrence, Karl Pfleger, and Simon Tong
Assigned to Google
US Patent Application 20110258185
Published October 20, 2011
Filed: June 30, 2011
A system may determine a measure of how a content of a document changes over time, generate a score for the document based, at least in part, on the measure of how the content of the document changes over time, and rank the document with regard to at least one other document based, at least in part, on the score.
As I mentioned at the start of this section, if you search for some query terms, the documents returned might make you feel like you were traveling back in time. A search for [bill of rights] is one of them, where the average age of pages returned at the top of the search results is fairly old. Other queries, such as a search for [new avengers movie], unsurprisingly show very new and fresh pages in Google’s search results.
One of the original purposes behind the Historic Data patent was to avoid showing stale documents when they weren’t appropriate. However, for searches on topics like [bill of rights], older documents may be the best results that the search engine could return. With this Document Content Update patent, the focus is upon creating a score based upon the age of the documents listed in search results, and using that score to favor documents based upon an average age for a certain number of the top documents returned in response to a query.
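To make that idea a little more concrete, here is a rough Python sketch of how an average-age adjustment could work. It is only an illustration under my own assumptions, not the method claimed in the patent: the `Result` fields, the `top_n` and `weight` parameters, and the "closeness to the average age" formula are all hypothetical.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Result:
    url: str
    base_score: float   # hypothetical relevance score before any age adjustment
    first_seen: date    # assumed date the page went online or was first crawled


def age_in_days(result, today):
    return (today - result.first_seen).days


def rerank_by_average_age(results, today, top_n=10, weight=0.1):
    """Boost results whose age is close to the average age of the top_n results.

    Assumption: "favoring" a document means multiplying its base score by a
    small bonus that shrinks as its age moves away from that average.
    """
    ranked = sorted(results, key=lambda r: r.base_score, reverse=True)
    top = ranked[:top_n]
    if not top:
        return []
    avg_age = sum(age_in_days(r, today) for r in top) / len(top)

    def adjusted(result):
        distance = abs(age_in_days(result, today) - avg_age)
        # closeness is 1.0 when the age equals the average and falls toward 0
        closeness = 1.0 / (1.0 + distance / max(avg_age, 1.0))
        return result.base_score * (1.0 + weight * closeness)

    return sorted(results, key=adjusted, reverse=True)
```

With a query like [bill of rights], where the top results are old, a sketch like this would nudge an older page above a slightly better-scoring new one. For a query whose top results are all recent, the same logic would favor fresh pages instead.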
This new patent differs from the older one in a couple of ways. First, the original version had a lot of language in its claims about anchor text pointing to the document, and how well a page being scored might match up with the anchor text, while mentions of anchor text in the claims of the newer version have disappeared. That doesn’t mean that Google has stopped doing that, but rather that this version of the patent is focusing upon something different.
That difference is in where it looks on a web page. The original claims for the patent told us that Google might ignore the “boilerplate” language it finds on pages, and the changes to those. In the newer version, instead of mentioning the word boilerplate, the patent tells us that it might calculate the frequency with which words appear on a page (excluding stop words), and look at changes to the section of a page that contains the most frequently used words. In pages about the Bill of Rights, that’s usually going to be a page section that reproduces the amendments.
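Here is a small Python sketch of that idea: count word frequencies across a page (skipping stop words), pick the section that holds the most occurrences of the page's most frequent words, and check whether that section changed between two snapshots. The stop word list, the `top_k` parameter, the assumption that a page is already split into sections, and the function names are all mine for illustration. The patent doesn't spell out an implementation like this.

```python
import re
from collections import Counter

# Tiny illustrative stop word list (an assumption, not Google's list).
STOP_WORDS = frozenset({"the", "a", "an", "of", "to", "and", "or", "in", "is", "for"})


def tokenize(text):
    """Lowercase the text and return its words, minus stop words."""
    return [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOP_WORDS]


def main_section(sections, top_k=5):
    """Return the index of the section with the most occurrences of the
    page's top_k most frequent words. Assumes at least one section."""
    page_counts = Counter(w for s in sections for w in tokenize(s))
    frequent = {w for w, _ in page_counts.most_common(top_k)}

    def hits(section):
        return sum(1 for w in tokenize(section) if w in frequent)

    return max(range(len(sections)), key=lambda i: hits(sections[i]))


def main_section_changed(old_sections, new_sections, top_k=5):
    """Compare only the main section of two snapshots of the same page."""
    old = old_sections[main_section(old_sections, top_k)]
    new = new_sections[main_section(new_sections, top_k)]
    return tokenize(old) != tokenize(new)


# Placeholder page text, not real Bill of Rights wording.
body = "amendment one rights text amendment two rights text amendment three rights text"
page_1998 = ["Home About Contact", body, "Copyright 1998 webmaster"]
page_2011 = ["Home About Contact Blog Shop", body, "Copyright 2011 webmaster updated"]
print(main_section_changed(page_1998, page_2011))
```

In this toy example the navigation and footer change between snapshots, but the section carrying the most frequent words does not, so the page would be treated as unchanged where it counts.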
So pages that contain the full text of the Bill of Rights may have changed in a few ways since the 1990s, but the actual text of those amendments to the US Constitution on the pages shouldn’t have. The “last modified” dates of the HTML files where that content is found might show fairly recent dates, but for those pages that have been online since sometime in the 90s, the date Google looks at for them is either when they first went online, or when Google first became aware of them through crawling or some other process.
I was excited to find the early provisional patent version of the Historic Data patent because it is easier to read through, and I’d definitely recommend people interested in how search engines work read through it. Chances are that not everything covered in it has been implemented by Google, but it provides some great examples of the kinds of things that the search engine might do that may not be very obvious, when it comes to ranking pages within its index.
The newer version of the Document Content Update patent was also interesting reading for a few reasons. One of them is that it shows the importance of keeping older content around and available when the queries it serves are best answered by older content, and that it can be helpful to update content in meaningful ways for queries that might best be served by newer content.
For example, if you want to rank well for the term [world series], you had best be showing fresh new content (try the search), since Google seems to rank pages with fresh content higher for that query.
It can be helpful to know when you’re doing keyword research whether a query term that you may select might favor older content or fresher content.