Revisiting Google’s Information Retrieval Based Upon Historical Data

Can patents be said to have family histories? If so, this post is going to introduce a barely known ancestor to one of the most written about search related patents on the Web, as well as a brand new grandchild to the patent.

The patent is Google’s Information retrieval based on historical data, which was filed in 2003, and granted in 2008. When it was published as a pending patent application in 2005, it created a pretty big stir amongst the forums and blogs of the search community.

The patent has two focuses, both of which take advantage of recording changes to a site over time. One is to help identify web spam, and the other is to help avoid stale documents being returned in response to a query. It raised questions among SEOs, such as how important the ages of domains and of links are, as well as:

  • Does Google favor fresher sites over older sites, or older sites over fresher sites?
  • Even more, how does Google weigh the age of a website?
  • Are the search engines looking at whois data to see who owns websites, and if there has been a change of ownership?
  • If the content of a site changes, and the anchor text pointing to it remains the same even though it’s no longer relevant, will it still rank for the terms in the anchor text?
  • If you buy a website and make changes to it, will the PageRank for that site start to evaporate or expire?

That’s really just scratching the surface.

One thing that caught many eyes at the time was that one of the named inventors on the patent was Google’s Head of Webspam, Matt Cutts, who was well known in the community for his interactions with forum members on behalf of Google, and his participation in conferences and with the press. (Actually the whole roster of inventors listed on the patent is like an all star team of search engineers.)

Another was that it said things like the amount of time a domain name was registered might be an indication of whether or not it was intended to be a spam site – with spammers usually only registering a site for a year, and people more “serious” about their businesses registering their sites for longer.

Matt went on to rebut that assertion more than once after the patent was published, but hosting businesses such as GoDaddy caught wind of it, and used the FUD (fear, uncertainty, and doubt) behind the patent as a selling point to try to get people to register their domains for longer than a year. Regardless of whether it was true or not, they saw the possibility of using the information within the patent as a path to more profits.

The patent reads in part like a list of ways that Google might try to catch people engaging in web spam.

The Patent’s Father – Information Retrieval Based Upon Historic Information

There are some patents I recommend anyone interested in learning about SEO should read.

The little known first PageRank patent is one of them.

The Reasonable Surfer patent is another.

The one on historical data that I linked to above is a third.

If you can grasp the points and ideas in those, and the assumptions behind them, you’ll have a decent foundation to build upon while learning and doing SEO.

I was doing research on a newer patent filing related to the historical data patent (more on that below), when I noticed that there was an earlier provisional patent filed that was also related, by the name of Information Retrieval Based Upon Historic Information (pdf – Application Number 60/507,617). It covers almost exactly the same ground as the patent that was granted, but it does so in language that’s a little easier to grasp, without as much legal mumbo jumbo.

If you’re teaching yourself SEO or have an SEO training program in place where you work, it definitely wouldn’t hurt to share the document with others, discuss and debate it, try to figure out what things from it Google might be using today, or just revisit it if you hadn’t had the chance to spend some time with it before (or even if you have).

Document Scoring Based Upon Document Content Update

Want to visit websites which look like they might have been designed in the 1990s? Do searches on Google for terms like [bill of rights] where older sites that haven’t seen substantial changes to the main content of their pages are likely favored in search results.

A funny thing happened to the original Historic Data patent along the way to its being granted. Google filed a handful of patent applications that could be said to be the children of the patent. Those traveled through the USPTO and became full fledged granted patents.

I was looking through the recently published patent applications this week and recognized the name of one of those patents, with a completely different set of claims attached to it. The patent still had the original description, but the claims focused upon a different way of identifying which section of a page Google might look at to see how much that page had changed.

The patent application is:

Document Scoring Based on Document Content Update
Invented by Anurag Acharya, Jeffrey Dean, Paul Haahr, Monika Henzinger, Steve Lawrence, Karl Pfleger, and Simon Tong
Assigned to Google
US Patent Application 20110258185
Published October 20, 2011
Filed: June 30, 2011

Abstract

A system may determine a measure of how a content of a document changes over time, generate a score for the document based, at least in part, on the measure of how the content of the document changes over time, and rank the document with regard to at least one other document based, at least in part, on the score.

As I mentioned at the start of this section, if you search for some query terms, you may notice that the documents returned for those queries might make you feel like you were doing some time traveling into the past. A search for [bill of rights] is one of them, where the average age of pages returned at the top of the search results is fairly old. Some other queries, such as a search for [new avengers movie], unsurprisingly will show very new and fresh pages in Google’s search results.

One of the original purposes behind the Historic Data patent was to avoid showing stale documents when they weren’t appropriate. However, for searches on topics like [bill of rights], older documents may be the best results that the search engine could return. With this Document Content Update patent, the focus is upon creating a score based upon the age of the documents listed in search results, and using that score to favor documents based upon an average age for a certain number of the top documents returned in response to a query.
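To make the average-age idea concrete, here is a minimal Python sketch. It is not taken from the patent: the function names, the dates, and the scoring formula are all assumptions made up for illustration. It only shows the general shape of the idea, which is to compute the average age of the top results for a query and then score a page by how close its own age is to that average.

```python
from datetime import date

def average_age_days(first_seen, today, top_n=10):
    """Average age in days of the first top_n results.

    first_seen is a list of dates, in rank order, on which each result
    was first seen (hypothetical data, not anything Google publishes).
    """
    top = first_seen[:top_n]
    if not top:
        raise ValueError("need at least one result")
    return sum((today - d).days for d in top) / len(top)

def age_score(doc_first_seen, avg_age, today):
    """Hypothetical score: 1.0 when a page's age equals the average age
    of the top results, falling toward 0 as the gap grows."""
    age = (today - doc_first_seen).days
    return 1.0 / (1.0 + abs(age - avg_age) / max(avg_age, 1.0))

# Made-up example: the top results for a query like [bill of rights]
# were all first seen in the late 1990s.
TODAY = date(2011, 10, 20)
TOP_RESULTS = [date(1996, 10, 20), date(1998, 10, 20), date(2000, 10, 20)]

avg = average_age_days(TOP_RESULTS, TODAY)
old_page = age_score(date(1998, 10, 20), avg, TODAY)  # close to the average age
new_page = age_score(date(2011, 10, 1), avg, TODAY)   # only a few weeks old
print(old_page > new_page)
```

Under these assumptions, a page from 1998 scores higher than one published a few weeks ago, because the query's top results skew old. For a query like [new avengers movie], where the top results are fresh, the same formula would favor the newer page.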

This new patent differs from the older one in a couple of ways. First, the original version had a lot of language in its claims about anchor text pointing to the document, and how well a page being scored might match up with the anchor text, while mentions of anchor text in the claims of the newer version have disappeared. That doesn’t necessarily mean that Google is no longer doing that, but rather that this version of the patent is focusing upon something different.

That difference is in where it looks on a web page. The original claims for the patent told us that Google might ignore the “boilerplate” language it finds on pages, and the changes to those. In the newer version, instead of mentioning the word boilerplate, the patent tells us that it might calculate the frequency with which words appear on a page (excluding stop words), and look at changes to the section of a page that contains the most frequently used words. In pages about the Bill of Rights, that’s usually going to be a page section that reproduces the amendments.

So pages that contain the full text of the Bill of Rights may have changed in a few ways since the 1990s, but the actual text of those amendments to the US Constitution on the pages shouldn’t have. The “last modified” dates of the HTML files where that content is found might show fairly recent dates, but for those pages that have been online since sometime in the 90s, the date Google looks at for them is either when they first went online, or when Google first became aware of them through crawling or some other process.
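The idea of looking at the page section with the most frequently used words can be sketched in a few lines of Python. This is only an illustration under my own assumptions: the tiny stop word list, the way a page is split into sections, the choice of the top five words, and the use of difflib to measure change are all hypothetical, and the patent filing doesn't spell out any of them.

```python
import re
from collections import Counter
from difflib import SequenceMatcher

# Tiny illustrative stop word list (an assumption; real lists are much longer).
STOP_WORDS = {"the", "a", "an", "of", "and", "or", "to", "in", "is", "be",
              "by", "for", "on", "this", "that", "no", "shall"}

def tokenize(text):
    """Lowercase words with stop words removed."""
    return [w for w in re.findall(r"[a-z']+", text.lower())
            if w not in STOP_WORDS]

def main_section(sections, top_n=5):
    """Index of the section holding the most occurrences of the page's
    top_n most frequently used words."""
    page_counts = Counter(w for s in sections for w in tokenize(s))
    top_words = {w for w, _ in page_counts.most_common(top_n)}
    weights = [sum(1 for w in tokenize(s) if w in top_words) for s in sections]
    return max(range(len(sections)), key=lambda i: weights[i])

def main_content_change(old_sections, new_sections):
    """Fraction of the main section that differs between two snapshots
    (0.0 means the main section is unchanged)."""
    old_main = tokenize(old_sections[main_section(old_sections)])
    new_main = tokenize(new_sections[main_section(new_sections)])
    return 1.0 - SequenceMatcher(None, old_main, new_main).ratio()

# Made-up snapshots of one page: navigation, main text, footer.
AMENDMENTS = ("Congress shall make no law respecting an establishment of "
              "religion. The right of the people to keep and bear arms shall "
              "not be infringed. The right of the people to be secure in "
              "their persons shall not be violated.")
OLD_SNAPSHOT = ["Home About Contact", AMENDMENTS, "Copyright 1997 Example Site"]
NEW_SNAPSHOT = ["Home About Contact Blog Newsletter", AMENDMENTS,
                "Copyright 2011 Example Site"]

print(main_content_change(OLD_SNAPSHOT, NEW_SNAPSHOT))
```

In this made-up example the navigation and footer changed between snapshots, but the section holding the most frequent words did not, so the measured change to the main content is zero. That is the behavior the Bill of Rights example relies upon: edits around the edges of a page don't make its core content look newer.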

Conclusion

I was excited to find the early provisional patent version of the Historic Data patent because it is easier to read through, and I’d definitely recommend people interested in how search engines work read through it. Chances are that not everything covered in it has been implemented by Google, but it provides some great examples of the kinds of things that the search engine might do that may not be very obvious, when it comes to ranking pages within its index.

The newer version of the Document Content Update patent was also interesting reading for a few reasons. One of them is that it shows the importance of keeping older content around and available when the queries it serves are best answered by older content, and that it can be helpful to update content in meaningful ways for queries that might best be served by newer content.

For example, if you want to rank well for the term [world series], you had best be showing fresh new content (try the search), since Google seems to rank pages that do higher for that query.

It can be helpful to know when you’re doing keyword research whether a query term that you may select might favor older content or fresher content.


15 thoughts on “Revisiting Google’s Information Retrieval Based Upon Historical Data”

  1. I’ll take your advice and will discuss it with my SEO partner. This is a broad topic but it’s worth learning about. Thanks! :)

  2. Bill, This is in response to: “Are the search engines looking at whois data to see who owns websites, and if there has been a change of ownership?”

    Found a bug in WMT the other day under “Labs” where Google has enabled custom search (CSE). Using the tool to perform a query on a particular verified site in WMT would return results from our other verified sites which resided on the same C-block AND displayed same ownership in Whois.

    The bug was reported and acknowledged as such by Google. Other sites using shared hosting performed normally.
    I find it fascinating.

  3. Great article. I never knew about this patent. It raises some interesting SEO issues for how to approach the optimization of my site. I wonder if the age of incoming links also will play a role in the “historical data” portion of the algorithm.

  4. The age of incoming links should play a role in historical data. Search engines looking up whois details along with IP block owners is something they have been doing for a while now, due to spammers link building using different IP blocks for hosting their websites. You still often find people wanting many different class C IP addresses for SEO uses.

  5. so gottickets.com ranks higher than the major league baseball site? a bit jacked, eh?

  6. Thank you! I’m talking to a MLIS (Master’s in Library and Info Science) class this week and this info was super helpful. I hadn’t looked at the patents in years. Kind of reads like a (smart) kid’s time capsule.

  7. Hi GMGMenchie,

    The patent does cover a lot of possible approaches by Google, and I think it’s really a useful topic around which to discuss SEO and how it works. If I were to write a book about SEO, I’d consider centering a chapter around it. :)

  8. Hi Rick,

    That is a fascinating bug, and it makes you wonder about how Google mixed up the wiring there behind their webmaster tools implementation of custom search engines.

    Been thinking lately that Google should provide a separate reporting layer for Google Webmaster Tools, so that people who aren’t “owners” of a site can also verify their relationship with a site and be given access to that information, but possibly not be given access to some of the controls found at Webmaster Tools.

  9. Hi Richard,

    Thanks. It covers a wide range of issues that have been discussed on a lot of blogs and forums – some that were talked about before it was published, and some that people started discussing after they heard or read about the patent.

    The age of incoming links could possibly play some kind of role in how a page or site might be perceived by the search engines. A Microsoft patent that discussed some similar issues instead discussed the age of the sites doing the linking, and how a newer site might be seen as more mature because it had links to it from more mature sites.

    See: Do Domain Ages Affect Search Rankings?

    Sometimes part of the fun behind a search patent isn’t in the actionable steps that it points to that you might take, but rather the questions that it raises that you can think about and experiment with.

  10. Hi Lee,

    I read the leaked copy of Google’s manual reviewers guide that was circulating on the Web for a while, and it suggested that site evaluators look at whois information when available as well.

    I don’t know that using different C Blocks makes a difference, but I’ve seen that point made in a good number of places.

  11. Hi Mike,

    I guess that depends upon the query used. The MLB site ranks pretty well for [baseball], but there are a number of sites that rank higher than the MLB homepage for [baseball tickets], including the gottickets.com page (which I didn’t go check out).

    Do you think that has something to do with historic data, or with something else? I’m not sure that I would go to mlb.com to buy tickets for a ball game, but it’s definitely one that I would go to with more informational questions.

  12. Hi Ann,

    You’re welcome. The Historic Data patent read to me as if a bunch of very well informed people got together and mind-mapped different ways that they might solve problems related to both web spam and old stale content. I like it because it looks at the problem from a number of different perspectives and provides ideas on how different kinds of information might be used.
