Updating Google’s Historical Data Patent, Part 2 – Changing Content

Come gather ’round people
Wherever you roam
And admit that the waters
Around you have grown
And accept it that soon
You’ll be drenched to the bone.
If your time to you
Is worth savin’
Then you better start swimmin’
Or you’ll sink like a stone
For the times they are a-changin’.

- Bob Dylan

Can the rate of change upon web pages influence how Google might rank pages of a site?

In part one of this series, I looked at how Google’s patent on Information retrieval based on historical data focused upon Freshness.

This second part of the series explores how Google might look at content changes on Web pages, and how the frequency of those changes might influence how those pages may rank at the search engine. Keep in mind that we don’t know for certain whether Google is even using the processes described in this patent. But it is a possibility.

Change Happens on the Web

Change actually happens fairly frequently on web sites. Some examples of sites with rapidly changing pages include:

1. Ecommerce sites that add and remove products
2. Informational portals such as newspapers, newsletters, and blogs
3. Pages that rely upon user generated content that gets edited, modified, updated, and added to on a consistent basis

Of course, there are other scenarios. For instance, imagine that you have a web site. Its pages rank well in Google, and have for years. You’re afraid of changing anything. But you think that if you make some changes, you might get more conversions on your page.

If you update the page, what kind of impact would the fact that you changed your page have on your rankings?

Or, you have a blog that you update almost every day. You go on vacation for two weeks, and then have a family emergency that keeps you from your web site for another two weeks. You’ve gone a month without a new blog post. Has the failure to update your site influenced how your pages rank in Google?

Content Updates/Changes

Google’s historical data patent recognizes that pages change, and that some of them change more rapidly than others.

It also recognizes that some kinds of changes to pages, to parts of pages, or to a web site as a whole may be considered less important than others.

For example, if someone is showing ads on their site, or using JavaScript to display an RSS feed, and those change on a regular basis, those changes may be considered much less important than a page title change, or a change in the anchor text of a link leading from the page.

The patent gives us a mathematical formula to talk about content change:

U=f(UF, UA)

An “Update score” (U) is calculated using frequency of change, and amount of change.

An “Update frequency score” (UF) may be used to calculate how often a document (or page) changes over time. It could be determined by the average time between updates or the number of updates over a period of time.

An “Update amount score” (UA) represents how much a document (or page) has changed over time. The update amount score looks at a number of possible changes, and gives different weights to different kinds of changes.

The kinds of updates considered in the Update amount (UA) score include:

  • The number of “new” or unique pages associated with a document or site over a period of time.
  • The ratio of the number of new or unique pages associated with a document or site over a period of time versus the total number of pages associated with that document or site.
  • The amount that the page or site is updated over one or more periods of time (e.g., a percentage of a document’s visible content may change over a period such as the last month).
  • The amount that the document (or page) has changed in one or more periods of time, such as the last x days.
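The patent doesn’t tell us what the function f actually is, or how UF and UA are scaled, so the following is only a rough sketch to make the formula easier to think about. It assumes UF is either the number of changes per day or the average gap between changes, and that U is a simple weighted sum. The function names and the 0.5 weights are my own placeholders, not anything taken from the patent.

```python
from datetime import datetime


def update_frequency_score(change_times, window_days):
    """Hypothetical UF: number of observed changes per day in the window."""
    if window_days <= 0:
        raise ValueError("window_days must be positive")
    return len(change_times) / window_days


def average_days_between_updates(change_times):
    """Alternative view of UF: mean gap, in days, between consecutive changes."""
    ordered = sorted(change_times)
    if len(ordered) < 2:
        return None  # not enough history to measure a gap
    gaps = [
        (later - earlier).total_seconds() / 86400
        for earlier, later in zip(ordered, ordered[1:])
    ]
    return sum(gaps) / len(gaps)


def update_score(uf, ua, w_freq=0.5, w_amount=0.5):
    """Hypothetical U = f(UF, UA); a weighted sum is an assumption, not the patent's f."""
    return w_freq * uf + w_amount * ua


# Example: a page seen to change on three dates within a 30-day window,
# with an assumed 40% of its visible content changed (UA = 0.4).
changes = [datetime(2008, 1, 1), datetime(2008, 1, 11), datetime(2008, 1, 31)]
uf = update_frequency_score(changes, window_days=30)
u = update_score(uf, ua=0.4)
```

A real system would presumably normalize UF and UA so that neither one dominates, but the patent leaves those details open.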

Some content may be given different weights than other content when changed. For instance, the following may be considered fairly unimportant and could be given little weight or ignored completely:

  • Javascript
  • Comments
  • Advertisements
  • Navigational elements
  • Boilerplate material, or
  • Date/time tags

Other content, such as page titles and the anchor text of links pointing out to other pages, may be considered much more important if it is updated or changed, with a consideration of how often, how recent, and how extensive those changes are.
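One way to picture this weighting is as a weighted average of how much each part of a page changed. This is a sketch under my own assumptions: the region names and the numeric weights below are invented for illustration, since the patent only says that some kinds of content may count for less (or nothing) and others for more.

```python
# Hypothetical weights: the patent names the categories but gives no numbers.
REGION_WEIGHTS = {
    "title": 1.0,
    "anchor_text": 0.8,
    "body": 0.5,
    "navigation": 0.05,
    "boilerplate": 0.05,
    "comments": 0.05,
    "javascript": 0.0,
    "advertisement": 0.0,
    "date_tag": 0.0,
}
DEFAULT_WEIGHT = 0.25  # assumed weight for any region not listed above


def weighted_update_amount(changed_fractions):
    """Weighted average of per-region change.

    changed_fractions maps a region name to the fraction (0.0 to 1.0)
    of that region that changed between two crawls.
    """
    total_weight = sum(
        REGION_WEIGHTS.get(region, DEFAULT_WEIGHT) for region in changed_fractions
    )
    if total_weight == 0:
        return 0.0  # only ignored regions were measured
    weighted_change = sum(
        REGION_WEIGHTS.get(region, DEFAULT_WEIGHT) * fraction
        for region, fraction in changed_fractions.items()
    )
    return weighted_change / total_weight


# A page where only the ads and the date stamp rotated scores as unchanged,
# while a page with a rewritten title scores as heavily changed.
ads_only = weighted_update_amount({"advertisement": 1.0, "date_tag": 1.0})
new_title = weighted_update_amount({"title": 1.0, "advertisement": 1.0})
```

Under this sketch, rotating ads every day would never look like a meaningful update, which matches the spirit of the patent’s description even if the actual numbers are unknown.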

It’s possible that for some queries, pages with content that hasn’t recently changed might be considered more favorable than pages with content that has recently changed.

A search engine may determine the date when the content of each of the pages in a result set for a query last changed, determine the average date of change for those pages, and modify the scores of the pages (either positively or negatively) based on the difference between a page’s date-of-change and the average date-of-change for all of the pages in those results.
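The adjustment described above can be sketched in a few lines. The patent doesn’t say how large the adjustment is or how a search engine decides which direction to push it, so the `weight` and `prefer_fresh` parameters here are assumptions of mine, as is the simple linear form.

```python
from datetime import date


def adjust_scores_by_change_date(results, weight=0.01, prefer_fresh=True):
    """Shift each score by how far its last-changed date is from the set's average.

    results is a list of (url, score, last_changed) tuples, where
    last_changed is a datetime.date. Returns a dict of url -> adjusted score.
    """
    if not results:
        return {}
    ordinals = [last_changed.toordinal() for _, _, last_changed in results]
    average_ordinal = sum(ordinals) / len(ordinals)
    # For some queries older content may be favored, so the sign can flip.
    direction = 1 if prefer_fresh else -1
    return {
        url: score + direction * weight * (last_changed.toordinal() - average_ordinal)
        for url, score, last_changed in results
    }


# Three hypothetical results with equal starting scores.
results = [
    ("example.com/a", 1.0, date(2008, 6, 1)),
    ("example.com/b", 1.0, date(2008, 6, 11)),
    ("example.com/c", 1.0, date(2008, 6, 21)),
]
fresh_first = adjust_scores_by_change_date(results)
stable_first = adjust_scores_by_change_date(results, prefer_fresh=False)
```

With `prefer_fresh=True`, the page changed most recently moves up and the one changed longest ago moves down; flipping the flag reverses that, which is how I read the patent’s “either positively or negatively.”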

Other Implications of Changes to Content on Web Pages

Since the historical data patent was first published, a number of papers have been written that look at changes to the content of web pages. I’ve collected a list of some of them which provide some interesting ideas about change on the Web.

A three-year study on the freshness of Web search engine databases

Which search engine has the freshest content? This study aims to “analyse the update strategies of the major Web search engines Google, Yahoo, and MSN/Live.com.” It provides a look at how well those search engines capture changes to the content of some specific pages over six weeks spread out over the years 2005, 2006, and 2007.

Recrawl Scheduling Based on Information Longevity (pdf)

Some content is “ephemeral” and may not be worth crawling by a search engine, because by the time it is indexed, it may not be representative of the content of the page it comes from, such as a “quote of the day.” Some content might be considered to be “persistent” and it may persist across multiple page updates because that content remains around for a “sustained period of time.” Blog posts may be considered persistent since they remain around, even though they may be pushed down a blog’s front page, or into an archive.

Can a search engine crawling program distinguish between ephemeral and persistent content, to focus its resources more upon persistent content?

Characterization of the evolution of a news Web site (pdf)

Are there patterns that can be identified by studying the frequency, amount, and types of changes to a news-based web site? This paper identified a number of patterns involving changes to the MSNBC site over a period of 19 continuous weeks. Can a model made from that study help describe the behavior of other news sites? The authors of this paper concluded that it could.

Microscale Evolution of Web Pages (pdf)

A Google poster from the 17th International World Wide Web Conference, held in Beijing this year, explores the rate of change of rapidly changing web pages in order to create a model for determining how frequently to revisit those pages.

The Discoverability of the Web (pdf)

A Yahoo paper which explores the use of “historical statistics to estimate which pages are most likely to yield links to new content.”

Detecting Age of Page Content (pdf)

Parts of web pages may change at different rates than other parts of the same pages. Can page histories using data extracted from external sources help to identify the rate of change of different parts of pages? The authors of this paper describe how this can be done, and how those different objects on pages can be annotated with the dates that they were changed.

What Can History Tell Us? Towards Different Models of Interaction with Document Histories (pdf)

What kinds of benefits might we see if we were able to track and see how a web site has changed over time? If a “past web browser” was available to visitors, so that they could view changes to a web page over time, would they? Viewing changes to web pages over time may be as helpful to people as it could be to search engines.

Conclusion

Change happens on the Web, and the rates of change, amount of changes, and types of content being changed on pages may influence the rankings of web pages, and the frequency of crawling of web pages by search engines. Some kinds of changes carry much less weight than others.

This part of the historical data patent focused upon changes to the content of pages, but there are other factors under the patent such as click through rates, changes in links, and in anchor text that may also play a role in rankings of web pages. Those will be explored in future parts of this series.

15 thoughts on “Updating Google’s Historical Data Patent, Part 2 – Changing Content”

  1. What about a site where the pages are numbered, and the *newest* content is always on page one? Every time a new piece of content is posted, every page of content on the entire site moves to the *next* higher numbered page, and the very oldest content now appears at a brand new URL, one digit higher than the highest number that was in use previously.

    I have no idea how Google can properly assign PR when the URL of any page that has an outgoing link on it changes all the time: maybe daily or even several times per day. I do know that when you click a link in SERPs to visit the site, you will never find the content you expected to find when you do reach the site. That’s because the content has already moved (maybe several times) to a new URL.

    I see these issues on many sites, designers don’t seem to understand the damage that such a system wreaks on their sites.

  2. Maybe there should be a UT ( update topic ) component to the equation. Since some topics and categories of information are going to change more frequently than others by their very nature.

    For instance: topics like celebrity gossip will change frequently seven days a week. Topics like business and stock information will change more frequently Monday – Friday and less frequently Saturday and Sunday. While topics like Genealogy research or some of the slower changing academic disciplines may change much less frequently.

  3. Fun stuff to be sure. Of my many passions, these documents are near the top. There are things for content generators, link builders and more and (like the phrase based) should almost be mandatory reading for SEO peeps. Understanding (potential) historical ranking factors is key IMO.

    L8TR – thanks for the morning reading… always up for HRFs.

  4. I have seen the results of making big changes to sites. Taking a little 5 page site and adding 20 pages of content can throw the search engines for a loop.

  5. Hi g1smd,

    It is surprising the way some people will set up the mechanics behind how a web site might work, and how those can be pretty unfriendly to indexing and search engines. It’s as if the creators of the content systems in use aren’t in touch with the framework of the Web that surrounds their sites.

    A number of the additional papers that I included at the end of my post touch upon some of the challenges that search engines face in trying to keep up with pages that change rapidly, in attempting to keep their indexes fresh with the content they find on the web while crawling.

    For example, the one that studied changes of content at MSNBC, “Characterization of the evolution of a news Web site,” attempted to create a model to help index other news portals. The Google paper also discusses the creation of models to help try to capture information on rapidly changing sites. But those models can be very much challenged by setups like the one that you describe.

  6. Hi People Finder,

    A very good point. It’s possible that kind of component may be included.

    Jon Kleinberg’s paper, Bursty and Hierarchical Structure in Streams was published in 2002, and the ideas in it that some topics may appear out of nowhere rapidly, and then disappear as quickly, on pages that are updated very rapidly on the Web have been floating around likely even before then. And we know that Google keeps an eye on trends.

  7. Hi Dave,

    Happy to see you stop by and comment on this post. I know that you find phrase based indexing as interesting as I do, and I agree completely that this is the kind of topic that people involved in SEO should be keeping a close eye upon.

    Thanks!

  8. Hi Hayden,

    That’s a great example, and a fairly drastic one, though I can’t say that I haven’t seen that happen too. I guess in instances like that we should expect some significant changes to rankings.

    What I liked about this patent filing when it comes to content changes is that it gives us some language to use (or at least a mathematical formula) and to think about when it comes to changes on web sites, and an idea of how a search engine might treat some changes differently than others.

  9. Hi Robert,

    Good points. We’re given some details in the patent, but it covers such a broad range of changes (which is why I’m returning to it in more than one part), that it doesn’t go into a tremendous amount of depth.

    A blog that is regularly updated probably doesn’t get much of a boost from those changes – though it might get some on certain topics, on the assumption that fresh content may contain more timely information.

    You’re right – freshness is important to news, and I’ve seen at least one reference in another patent filing from Google that it may be one of the ranking signals in determining which news stories show up in search results. (How Google Universal Search and Blended Results May Work)

  10. Much food for thought there. I have no doubt that the frequency of updates will have an impact. But I imagine that for example a blog where the home page is constantly changing that each entry page will get more weight for content.

    The flip side of this I imagine is that it is backward compatible too. By that I mean that if a page is touted as a news page and is never updated then it loses all relevancy. This would be due to news needing to be current and frequently updated. A “news page” that hasn’t had an update in 3 months surely can’t be regarded as current?

  11. William, that is exactly what it is… people do not understand the framework of the web… they are small business owners and have other things to worry about. That is why they hire outside firms to take care of these things. Lucky for us. :)

  12. Hi Hayden,

    I do think that there are some small business owners who do strive to learn what they can of the framework of the Web, and address all of the issues that they can.

    I know that for some, it may not be the best use of their time and resources to dig deeply into how search engines work, and how that knowledge might help them with their online businesses, but it can be nice working with small business owners who do. Anyway, it can be really rewarding working with small businesses and being able to have a positive impact on the success of their business.
