How Google Might Track Changes on Webpages

Many sites on the Web contain elements that change on a regular basis, from advertisements that differ every time a page is shown, to widgets that display constantly updated information, to blog and news homepages that show new posts and articles hourly, daily, or weekly. Ecommerce sites add and remove products regularly, and display updated specials and features on their homepages. Sites that include or focus upon user-generated content may change constantly.

Search engines use web crawling programs to discover new pages and sites, and to index changed content on pages they already know about, by following links from one page to another. A Google patent granted last week explores a potential problem: a crawler comes across a page that has changed only slightly, such as showing a different advertisement, and must decide whether that slight change warrants reindexing the whole page.

The patent also describes the process behind how Google might check for changes on a page, comparing a newer version of a page to an older version, and associating an age with the older content and with the newer content. I’m reminded a little of a Yahoo patent that I wrote about a few years ago, in a post titled A Yahoo Approach to Avoid Crawling Advertisement and Session Tracking Links.

In the Yahoo patent I described in that post, we were told that the search engine might crawl pages a minute or so after a first crawl to see whether any content or links had changed from the first crawl to the second, which might help the search engine identify those sections or links as advertisements, or as URLs containing unique session-tracking parameters for different visitors. Yahoo might not crawl the newer URLs in those links, which it might consider to be “transient.”

In the Google patent, we’re told that if the differences in ages of the older content and the newer content aren’t that great, the search engine may continue to display the old content rather than updating its index to include the new content as well. After some additional crawls, if Google sees the newer content reach a certain age, it may then index the newer version of the page with the new content.

Updating search engine document index based on calculated age of changed portions in a document
Invented by Joachim Kupke and Jeff Cox
Assigned to Google Inc.
US Patent 8,001,462
Granted August 16, 2011
Filed: January 30, 2009

Abstract

A system receives a document that includes new content and aged content, and compares the document with a prior version of the document that includes the aged content but not the new content. The system also separates the new content and the aged content based on the comparison, determines ages associated with the new content and the aged content, and determines whether the ages of the new content and the aged content are greater than or equal to an age threshold.

The system further calculates a checksum of the document based on the aged content when the age of the aged content is greater than or equal to the age threshold, and the age of the new content is less than the age threshold, and stores the calculated checksum.
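The steps the abstract describes can be sketched in code. This is a minimal illustration, not the patent's actual implementation: the function name, the representation of content as sets of blocks, and the threshold value are all my own assumptions.

```python
# Hypothetical sketch of the abstract's logic: names, data structures,
# and the threshold value are invented for illustration.
import hashlib

AGE_THRESHOLD = 3  # assumed units: crawls (or days) content must persist


def should_update_index(old_content, new_content, content_ages):
    """Decide whether to reindex a page, per the abstract's description.

    old_content / new_content: sets of content blocks from the prior and
    current versions of the page. content_ages maps each block to how
    long it has been observed on the page.
    """
    aged = old_content & new_content   # content present in both versions
    fresh = new_content - old_content  # content only in the newer version

    aged_old_enough = all(content_ages.get(b, 0) >= AGE_THRESHOLD for b in aged)
    fresh_still_young = all(content_ages.get(b, 0) < AGE_THRESHOLD for b in fresh)

    if aged_old_enough and fresh_still_young:
        # The new content hasn't aged yet: keep the old version indexed,
        # but store a checksum computed over the aged content only, so a
        # later crawl can cheaply tell whether the stable part changed.
        checksum = hashlib.md5("".join(sorted(aged)).encode()).hexdigest()
        return False, checksum
    return True, None
```

The interesting design choice here is that the stored checksum covers only the aged content, so a transient advertisement swapping in and out doesn't change the checksum from crawl to crawl.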

The focus of this patent seems to be keeping the search engine from reindexing pages after recrawling them when it finds only minor changes, such as new advertisements being displayed or updated lists of related links. It makes sense for a search engine not to reindex the content of a page too quickly after those types of changes, since doing so could result in reindexing many pages where there really haven’t been any substantive changes.

I suspect that this process acts to throttle how quickly a search engine might update its index when it discovers new content on pages, regardless of whether those changes are slight changes to the content of a page or even possibly the posting of a new blog post. Since many pages on the Web do have components that might show new content every time they are crawled, allowing a certain amount of time to pass before reindexing the content of a page might make sense, especially when the age difference between the older content and the newer content isn’t that great.

If the new content is still present on a page after a certain passage of time (minutes, hours, or possibly even days), the page might then be indexed with the new content. The amount of time that a search engine may allow to pass before it will index changes might be based upon a historic view of how frequently those pages have changed in the past.
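One way that historic view could feed into the process is by deriving the wait time from a page's typical interval between changes. This is purely my own speculation about how such a threshold might be chosen; the patent only says the timing "might be based upon" change history, and the floor and cap values here are invented.

```python
# Hypothetical: derive an age threshold from a page's change history.
# The patent does not specify this formula; it is a sketch of one
# plausible approach, with invented floor/cap values.
from statistics import median


def age_threshold_for(change_intervals_hours, floor=1.0, cap=168.0):
    """Given the observed hours between past substantive changes to a
    page, wait roughly as long as the page's typical change interval
    before trusting (and indexing) new content, clamped between one
    hour and one week."""
    if not change_intervals_hours:
        return cap  # no history yet: be conservative
    return min(cap, max(floor, median(change_intervals_hours)))
```

Under this scheme, a news homepage that changes hourly would have its new content indexed quickly, while a mostly static page that suddenly shows new content would be held back longer before reindexing.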

I was also reminded of Google’s patent, Document scoring based on document inception date, while reading about this calculated age patent. In that patent, we’re told that for some queries, a fresher document might be preferred, while for other queries, an older document might be a better result, and that the age of a document might be included as part of a ranking score for that document.

The document inception date patent describes how an update frequency score and an update amount score may play a role in determining the score for an age associated with a document. The update frequency (UF) score might look at the number of changes made to a page over time, while the update amount (UA) might look at what those changes are. The patent tells us more about how a UA score might be calculated:

UA may also be determined as a function of one or more factors, such as the number of “new” or unique pages associated with a document over a period of time. Another factor might include the ratio of the number of new or unique pages associated with a document over a period of time versus the total number of pages associated with that document. Yet another factor may include the amount that the document is updated over one or more periods of time (e.g., n % of a document’s visible content may change over a period t (e.g., last m months)), which might be an average value. A further factor might include the amount that the document (or page) has changed in one or more periods of time (e.g., within the last x days).

According to one exemplary implementation, UA may be determined as a function of differently weighted portions of document content. For instance, content deemed to be unimportant if updated/changed, such as Javascript, comments, advertisements, navigational elements, boilerplate material, or date/time tags, may be given relatively little weight or even ignored altogether when determining UA. On the other hand, content deemed to be important if updated/changed (e.g., more often, more recently, more extensively, etc.), such as the title or anchor text associated with the forward links, could be given more weight than changes to other content when determining UA.
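The weighted-portions idea in that passage can be sketched simply: changes to regions deemed unimportant (ads, navigation, date tags) count for little or nothing, while changes to titles or anchor text count more. The region names and weight values below are my own invention for illustration; the patent gives no concrete numbers.

```python
# Hypothetical weights for different regions of a page -- the patent
# names the categories but does not assign values; these are invented.
REGION_WEIGHTS = {
    "title": 3.0,
    "anchor_text": 2.0,
    "body": 1.0,
    "navigation": 0.1,
    "advertisement": 0.0,  # ignored altogether, per the quoted passage
    "date_tag": 0.0,
}


def update_amount(changed_fractions):
    """changed_fractions maps a region name to the fraction (0..1) of
    that region's content that changed between crawls; return a
    weighted update-amount score."""
    return sum(REGION_WEIGHTS.get(region, 1.0) * frac
               for region, frac in changed_fractions.items())
```

With weights like these, a page whose only change is a swapped advertisement scores zero, while a small edit to a title scores higher than a larger edit to boilerplate.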

The document inception date patent is primarily concerned with how changes to a page might impact the ranking of that page, where either the freshness or the age of a document might positively impact its rankings. What makes it interesting is the depth of the types of changes it might consider, and how some changes might matter less than others. This calculated age patent focuses more upon when the search engine might incorporate changes to a page into its index, and seems to follow a much less detailed analysis of types of changes, possibly to keep decision making faster during the crawling of pages.

Conclusion

The Yahoo patent I mentioned above told us how it might recrawl a page after a minute or so to see if any of the links listed had changed, to determine whether those were “transient” links or links that would change on a regular basis such as advertisements. Because those links would change upon every crawl to a page, it might not have made sense for Yahoo to add those URLs to its list of URLs to be crawled since they were likely to be either advertisements or links with session tracking IDs.

This Google patent seems to aim at something very similar: identifying content that changes quickly and doesn’t stay around for any length of time, possibly to recognize advertising content or other content that has little to do with the actual content of a page.

Knowing that Google might be doing something like this could be helpful in situations like an ecommerce store that wants to display links to specials or featured content on its pages. If those links are randomly displayed and change every time a search spider comes along, they might be ignored. If they change daily or weekly, it’s less likely that they will be ignored.


14 thoughts on “How Google Might Track Changes on Webpages”

  1. Hi Bill,

    Crawl frequency is an interesting one – I suppose it’s the product of a number of factors from the storage of huge amounts of data, the transfer of that data, the relevance of the data to filter the correct sites for the correct terms and ensuring that it remains relevant and consistent with the changes of those sites.

    Getting the crawl frequency and prioritising the type of changes is pretty critical to ensuring that the right page is indexed for the right search query. I’ve tried to read the patent uninterrupted but I’m definitely struggling on the uninterrupted part! The one piece that seems to stick out to me though is the visible part of – “, n % of a document’s visible content may change over a period t”.

    As soon as I get some peace I’ll try and digest this properly – it’s always good when a patent for the google search core product can add some understanding to how they might go about their bread & butter activity.

    nb – I like the little SEO by the Sea image above your picture but it’s still not showing up for me when I try and share with LinkedIn :(

  2. Hi Tom,

    The way I read patents is to copy them into a text file, and then start deleting the stuff that really doesn’t add anything until I have something much smaller and easier to understand left. This way if I’m interrupted, it’s not as great a loss.

    Crawl frequency is definitely one of those things that it’s worth spending some time learning more about. One of the best places to start, if you haven’t read it is this document, which most likely was something Google based some of their approach upon when they first started:

    Efficient Crawling through URL Ordering

    The little image above my photo is actually all text and CSS rather than an image.

  3. has anyone experienced rank drops when changing CMS and redirecting (301) old to new CMS on the same domain and usually then also new URL’s? Any advice on what to look out for the most would be appreciated (as in on-page follow link count maybe?)
    I am pretty sure that too many 301’s will leak juice and drop ranks – the more obstacles for googlebot the worse you will rank, does anyone agree?

  4. Interesting. It seems to me this would also be related to Google’s value of links (not just indexing). Just like they “devalue” (I think) footer links… I would think if they can determine links are not “core” to the page but some supplemental items they could devalue the links in various ways? It seems to me Google gets better and better at figuring out how to programmatically do things that I can look at a page and evaluate well those links there are not really that important, these ones here are very related to the topic of the page…

  5. “..can look at a page and evaluate well those links there are not really that important..”

    Also this depends on the country you are in – In the US or in the UK Google must switch on a lot more filters (like the one you mentioned) whereas in AU or in NZ there is hardly anyone to filter so footer links (even hardcore blackhat like white font on white BG) still do their trick successfully – until the day comes once AU advances with more competition and Google will increase the filter “criteria” count and… BAM! All dodgy & “India” SEO websites >>Gone from page one or even from Google index altogether – I call it a panda slap =)

  6. Hi Ron,

    I think it’s pretty common to see some drops in rankings when moving from one version of URLs to another, including some new URLs. Google doesn’t necessarily capture and follow 301 redirects at the time when it first sees them, but might schedule them for a later crawl, anywhere from days to weeks later. Google might also have to recalculate PageRank for your pages a number of times as it incrementally captures information about your new URLs for old pages and your URLs that are just new.

    It’s not a bad idea to do a little extra to attract some new URLs when you make a change like this, and change the URLs on any other pages that you have some control over (like telecom and directory links and links from other sites that you may also have control over). It also really helps to make sure that you change all internal links to the New URLs instead of relying upon the redirects.

    An old post, but one that I think is still pretty much valid is: Web Decay and Dead Links Can be Bad for Your Site. I raised a very similar question to a group of search engineers at an SES session called “Meet the Crawlers,” and they noted that while an increasing number of internal broken links and redirects might not be a strong signal, it is one they look at. If you also chain a number of internal redirects together, the search engines will stop following them completely once the chains grow past more than a couple of hops.

  7. Hi Ron,

    Interesting thought about Google possibly not applying as many filters in different locales. That sounds like a possibility, though not something that I would want to rely upon too much. Google could turn one of those filters on overnight.

  8. Very interesting to see that Google may be changing the way they cache pages. I refer to analytics frequently and tweak the on-page content depending on a site’s rankings and landing terms. It will be interesting to see if minor changes like that get through the new cache system. If not, these minor changes will have to become more major.

  9. This is an interesting change as part of Google’s cleaning up house with backlinks, indexing, etc. It seems so hard to keep up on all the changes. I guess as a web owner, it is important to stay up on your internal and external links, check for broken links, pinging updated content, etc. You have a lot of valuable info on your blog – thanks for being a go-to SEO expert. :)

  10. Hi Jon,

    I’m not so sure that this is a change to the way that Google caches pages, but it definitely shows that they are thinking about how to do things more economically, with a process in place to help them decide when to update the content that they are indexing.

    I’ve been paying more attention to how and when they are updating certain information, using Google’s “show last 24 hours” results, and I do sometimes see that they will add a new blog post in their index without updating the result for my homepage to show the new post there as well.

  11. Hi Julie,

    Thank you. One of the things I like about looking at patents is that they sometimes describe things that we’ve been seeing happen for a while, but didn’t have the vocabulary to discuss or an idea of some of the processes behind them.

    This patent answered my question of why it is that sometimes when we make small changes to a web page those might take a little longer than I might expect for them to get into Google’s index.
