Google as an Internet Archive?

Interested in what people were saying the day after Barack Obama was elected president in 2008? Or how people reacted on the Web to the Chicago White Sox winning the World Series in 2005? Or the early news on the Gulf oil spill on April 20, 2010?

When you search at Google, you can click on “more search tools” in the left column and enter a “from” and a “to” date in the custom range section. If you want to see what pages were showing up on Google on a search for Barack Obama on the day after the election, you can enter 11/4/2008 in the from and to fields. To see what pages were ranking on Google on the day after the White Sox series ended, enter 10/28/2005 into the date range text boxes.

A custom date range search at Google for Barack Obama on November 4, 2008.

If you click on any of the results that appear, you see the pages listed in the results as they appear today. If you click on the Google cache links for those entries, you see the most recent cached versions of those pages. But what if you could see a copy of the page as it appeared within the date range selected? What if Google decided that it would create an archive of the Web, where it showed older copies of web pages, and used the custom date range to help you find those pages?

A Google patent granted on April 20th gives us a glimpse at the possibility of Google being able to show us an archive of the Web.

The patent is part of a series of filings on phrase-based indexing from former Google employee Anna Lynn Patterson, so it probably shouldn’t come as a surprise (I wrote about the possibility a few years ago in Google Archives to Appear Soon?). Before joining Google, Anna Patterson developed a search engine for the Internet Archive, so that searchers could view older versions of pages listed in the Archive’s index. That search tool, known as “Recall,” was removed from the Internet Archive around the time that Google was reported to have licensed some technology from Dr. Patterson and subsequently hired her.

The patent is:

Information retrieval system for archiving multiple document versions
Invented by Anna Lynn Patterson
Assigned to Google
US Patent 7,702,618
Granted April 20, 2010
Filed: January 25, 2005

Abstract

An information retrieval system uses phrases to index, retrieve, organize and describe documents. Phrases are identified that predict the presence of other phrases in documents. Documents are the indexed according to their included phrases. Index data for multiple versions or instances of documents is also maintained. Each document instance is associated with a date range and relevance data derived from the document for the date range.

The patent’s description is mostly dedicated to details about phrase-based indexing, but it also jumps into how archived versions of documents might be stored and ranked.
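
To make the last two sentences of the abstract a little more concrete, here is a minimal Python sketch of what keeping multiple instances of a document, each tied to a date range and some relevance data, might look like. It is only an illustration under my own assumptions: the DocumentInstance class, its field names, the single relevance score, and the example.com data are all hypothetical, and none of it is taken from the patent’s actual implementation.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class DocumentInstance:
    """One archived version of a page (all field names are hypothetical)."""
    url: str
    valid_from: date          # first day this version was current
    valid_to: Optional[date]  # last day it was current; None means still current
    relevance: float          # stand-in for the per-version relevance data
    content: str


def instances_in_range(instances, start, end):
    """Return versions whose date range overlaps [start, end], best score first."""
    hits = [
        inst for inst in instances
        if inst.valid_from <= end
        and (inst.valid_to is None or inst.valid_to >= start)
    ]
    return sorted(hits, key=lambda inst: inst.relevance, reverse=True)


# Made-up archive of three versions of the same page.
archive = [
    DocumentInstance("example.com/news", date(2008, 11, 1), date(2008, 11, 3), 0.4, "Polls open Tuesday"),
    DocumentInstance("example.com/news", date(2008, 11, 4), date(2008, 11, 6), 0.9, "Election results"),
    DocumentInstance("example.com/news", date(2008, 11, 7), None, 0.2, "Transition news"),
]

# A custom date range with the same "from" and "to" date.
for inst in instances_in_range(archive, date(2008, 11, 4), date(2008, 11, 4)):
    print(inst.valid_from, inst.content)
```

With the “from” and “to” dates set to the same day, only the version that was current on that day comes back. A wider range returns every overlapping version, ordered by the relevance score stored for that version rather than by anything computed today.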

Presently, Google collects the latest copies of pages it finds to display in a cache copy, and it includes links to the latest cached copy of documents along with search result listings for pages. Google has justified the use of a cached copy of pages as a way of giving people access to a page when there might be a problem with directly accessing the page, such as the server it is hosted upon being down. Google won’t cache copies of pages that use a meta noarchive tag, like the following:

<meta name="googlebot" content="noarchive">
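
For anyone curious how a crawler might honor that tag, here is a rough Python sketch using the standard library’s html.parser module. The NoArchiveParser class and the may_cache function are hypothetical names of my own, not anything Google has published. The sketch also checks the generic “robots” meta name, which accepts the same noarchive directive.

```python
from html.parser import HTMLParser


class NoArchiveParser(HTMLParser):
    """Records whether a page carries a noarchive meta directive."""

    def __init__(self):
        super().__init__()
        self.noarchive = False

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        # Attribute values can be None (e.g. <meta name>), so default to "".
        attr = {key: (value or "") for key, value in attrs}
        name = attr.get("name", "").lower()
        directives = {d.strip().lower() for d in attr.get("content", "").split(",")}
        if name in ("robots", "googlebot") and "noarchive" in directives:
            self.noarchive = True


def may_cache(html):
    """Return True when no noarchive directive is found (hypothetical helper)."""
    parser = NoArchiveParser()
    parser.feed(html)
    return not parser.noarchive


print(may_cache('<meta name="googlebot" content="noarchive">'))
print(may_cache('<html><head><title>Hello</title></head></html>'))
```

An archive built along the lines the patent describes would presumably have to decide whether such a directive applies only to the current copy or to every stored version of the page as well.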

Will Google start showing archived copies of documents?

Many web pages change over time for a wide range of reasons. News sites may update a number of times a day and require subscriptions to see older articles. Other pages change because of new ownership, new designs, new business models, updated content, and corrections of older content.

It’s possible that many site owners may not want people to access older versions of their pages, for a number of reasons as well.

How would you feel about Google providing a historic archive of the Web, with the ability to search and view older versions of pages online?

44 thoughts on “Google as an Internet Archive?”

  1. Hey, somebody tell them to wait,
    it is my PhD subject :)
    I think they started to work on it a looonng time ago. They will come up with a complete archive framework.

  2. Would these multiple archived copies of pages somehow take up less server space than the current cached copies? Are we talking about server farms the size of Montana?

  3. Hi Bob,

    That is one of the issues that is raised in the patent itself, though it really doesn’t provide an answer to how Google might store all of those pages.

    A snippet from the patent that describes that problem:

    Another problem with conventional information retrieval systems is that they can only index a relatively small portion of the documents available on the Internet. It is currently estimated that there are over 200 billion pages on the Internet today. However, even the best search engines index only 6 to 8 billion pages, thereby missing the majority of available pages. There are several reasons for the limited indexing capability of existing systems. Most significantly, typical systems rely on a variation of an inverted index that maintains for every term (as discussed above) a list of every page on which the term occurs, along with position information identifying the exact position of each occurrence of the term on the page. The combination of indexing individual terms and indexing positional information requires a very large storage system.

    A further problem with many information retrieval systems used for searching the Internet is their inability to archive pages that change over time. Conventionally, most Internet search engines only store the relevance information for a current instance (or version) of a given page, and update this information each time the page is re-indexed. As a result, a given search only returns the current versions of pages that satisfy the query. Users are unable as a result to search for prior instances of pages, or pages that were current in a specific date interval. Also, the search engines are likewise do not use version or date related relevance information when evaluating search queries or presenting search results.

    Accordingly, it is desirable to provide an information retrieval system that can effectively index tens of billions, and eventually over 100 billion pages of content, without the substantial storage requirements of existing systems. Further, it is desirable to provide an information retrieval system that can index and retrieve both current and prior instances of documents and pages.

    Archiving old copies of pages could take up a substantial amount of space on its own. Archiving the relevance and importance data associated with those pages adds considerably to the space requirements. The patent describes a primary and secondary indexing system where the primary might store relevance and importance data for a limited number of pages, and the secondary index sounds somewhat like the “supplemental” index or extended index that Google may have been using for years.

    There are likely some gains in ability to index and archive pages under Google’s work on the newer version of the Google File System, which appears to be part of the Google Caffeine update, but hopefully there’s room in Montana for more servers if Google goes ahead with this. :)

  4. I would rather suspect Google’s intention to cope with the “cemented” results problem :) . They have never really solved this problem after the Big Daddy update.

  5. Hi ZP,

    It does look like Google may have been anticipating an archive like this for a number of years, which may be partially why Anna Lynn Patterson joined Google around 7 years ago after creating one of the largest search engines developed back then for the Internet Archive. Does it make sense for them to make such an archive available from a business stance, a legal stance, an information sharing stance? I’m not sure, but it does look like they are working on some pieces that do make it more feasible to offer in the future.

    For instance, Google does have a News Archive Search, that leads to subscription-based or per-article based fees to view some results.

    Allowing people to search by a custom date range also makes it more likely.

  6. I remember reading several years ago that the archives that are being kept by our government may not be available for future generations due to the fact that many of the media technologies used to store them have a finite shelf life.

    Maybe they should look to a commercial company like Google to archive them since they already have a system in place. I realize that there would be an immeasurable amount of red tape associated with something like that, and privacy issues for sure, but having archives run by a commercial company, for which the preservation of data is looked at in terms of money, seems to be a more long-term solution than some method cooked up by federal employees.

    Ultimately, all information will be on the “net” regardless of whose it is. I think that the ‘handwriting is on the wall’ on this one.

  7. Hi Mark,

    That’s a problem that I have some personal experience with from my days in the 90s and early part of this century working with Delaware’s Court system, and exploring ways to maintain and update our records.

    The technologies used to back up and maintain records can get dated quickly, such as 12-inch magneto-optical disks.

    But, many agencies are very much concerned with the security of that data as well, and with limiting the ability of people to access that information. I don’t think that many want to go through a third party like Google and make much of that information available through the net.

  8. This was a rather large problem for using Google for research in years gone by. The search engine will show you information based on what they have in their cache at that moment, rather than what that page was 2 years ago.

    It will be interesting to see if the geniuses at Google allow users to dig into the data they have stored and do some lexical comparisons on websites like the New York Times website.

  9. I’ve never really taken notice of the left-hand sidebar of Google search. Probably because it’s a new feature and I’m used to just typing in my search keyword in the search bar and pressing enter. Now that I know that it can be used that way, I’m appreciating this new Google search bar feature.

    Regarding displaying the older versions online, can’t they just display the updated versions?

  10. I really like what Google has been doing lately. The look is much cleaner in my opinion. It seems like Google has been connecting to all kinds of things lately. It’s so much deeper than just the regular search nowadays. So far I think it’s a good thing. I wonder what will come of G in the future. They are spreading themselves into so many different things.

    I always used the Way Back Machine but I think I am going to have to try out this new archive feature.

  11. Another solid addition for search within Google. I wonder if the rankings will reflect the rankings for certain keywords at that time or if they are just sorted by date.

  12. Hi Robert,

    It would be really interesting to be able to use data from different versions of sites like the New York Times for many different kinds of research. Imagine an API that would allow you to do things like capture that information and analyze trends, build timelines, and visualize the data, and possibly even compare it to information on a topic from other web sites. Thank you.

  13. Hi Andrew,

    I wonder how many casual searchers do end up looking at new features like the ones in the sidebar.

    It’s possible that the new interface has affected the way people using it do search, which makes it worth exploring.

    Presently, if you do a search using the custom date range, you do see results and snippets from the time period that you’ve specified, but if you click the links that are shown, you see the present day versions of those pages, which means that you might not find what you searched for if the pages have changed.

    In my Barack Obama search above, set for a start and end date of November 4, 2008, the third result is for a Washington Post page that likely has changed since the day after the election. Chances are, people clicking on it are going to be disappointed with the result because it doesn’t match up to the description in the page title and the snippet.

  14. Hi CPlus,

    The search by custom range is available now, but Google doesn’t show older versions of pages to go with that search. It’s possible that they might, but I’m guessing that even if they can handle the technological aspects of doing this right now, such as the storage of all those documents, there might be other things holding them back, like issues surrounding copyright and how the owners of those pages might feel about Google archiving their content.

  15. Hi Nathan,

    The patent suggests that Google is showing the rankings for the pages in a custom date range search as the pages were ranked during that date range, rather than as they might be ranked now, with an option for searchers to sort those results by date and time. At least for pages that might have been considered more important during that time. A secondary index might contain less ranking information for pages that weren’t considered as important, like a supplemental index.

    Even if Google limits that ranking information to a certain number of pages that might be considered “important,” it means that the search engine may not only archive versions of documents from the past, but also ranking information for those pages from the past as well.

  16. Hi Mike,

    You’re welcome. Google isn’t showing actual archived pages, but there’s a possibility that they might. It does appear that the rankings for the snippets are using older relevancy signals though – which is pretty interesting in itself.

  17. Hi Alamin,

    Good questions.

    Since many web pages change on a regular basis, it might be helpful to be able to search the web for information that might have been online in the past but is no longer available now.

    For instance, there’s a link to a Washington Post article in my screenshot above about the reaction to Barack Obama being elected President of the United States the day after the election. It’s possible that article is no longer online, or only available to paid subscribers of the site. If Google created this archive, it might either show me a cached copy of the page or lead me to a page where I could pay for access to the article (to the Washington Post).

    One possible reason why Google isn’t showing archived pages, even though they allow you to search using a custom date range and will show you snippets for those older versions of pages, is a concern that web publishers might claim that an archive like this infringes copyright. If the search engines could work out some kind of way to get web publishers involved, like a way for those content providers to get paid for access in some cases, that might make it more likely that Google would start showing archived pages.

    It’s possible that there are other reasons as well, beside copyright concerns, but I think that is one of the bigger stumbling blocks they face.

  18. What’s the main benefit of that archive? And why isn’t Google showing archived pages if they have already done it?

  19. Thanks Bill, that’s exactly what I thought. You have taught me a lot, so now I am another Bill. lol

  20. I imagine we could argue for or against the potential benefits and dangers of archiving. I think the fact that there would be a tag to prevent archiving would address copyright issues (same as the noindex works now). I think Google has recognized that there is an opportunity to provide a relevant social service (similar to the way back machine at archive.org, but useful :). It’s not like they don’t already have the information – I for one appreciate that they would share it with the rest of us!

  21. Hi Aaron,

    Copyright law on the web is still in a pretty gray area. Many of the arguments I’ve seen made by newspapers are that websites like Google should ask for permission, rather than publishers being forced to take some action like adding a noarchive meta tag to their pages. Traditionally, that’s the way that copyright worked.

    I do think that providing this kind of archive could potentially be very beneficial as a social service, but it has some serious issues as well, concerning copyright, privacy, protecting consumers from malware that may have been introduced to older versions of sites, and more.

  22. With the recent completion of Caffeine to deliver results faster and the giving of more weight in the SERPs to real-time results, I think an internet archive could have its place through the use of the custom date range, without any significant conflict with the indexed results delivered by default by date published.

    Up until recently, not so much for the competent searcher or for those in the industry such as ourselves, it has been somewhat frustrating when searching for information and content only for it to be out of date and often invalid. With the rapidly, ever increasing amount of information, changes to regulations, laws and procedures, I am hoping that real-time results can overpower the dated content ranking highly by page/domain age and weighted authority as per relevancy to the search term whilst also providing an internet archive or previous versions of content and data which will undoubtedly prove very useful for various reference and students/studies.

  23. Hmmm. One Bill is enough. You are the one, friend. Really, you are a great asset in the SEO world.

  24. Wow, great point. I knew about the “more tools” button, but didn’t think to search certain dates and use it as an archive tool.

    I really like to look in the past week or few days especially when looking up info about programming. Info that is 2 years old dealing with computers or programming is ancient–searching for more recent info in that area in Google is really helpful.

    I wonder if high school kids are aware of this tool when doing research on certain topics or time periods?

    -Guy

  25. Hi Geoff,

    Trying to show the most recent information in search results, while also trying to show the most relevant/important results, might present a challenge for the search engines. Is a more recent result better than one that presents a higher quality and/or quantity of information about a topic? Or is it better to show the more relevant result, and make it easy to show the more recent result as well, but with a little work such as choosing a date range? I’m not sure of the answer, but I do like many things about the idea of an archive where someone can see older versions of pages, in time periods that they choose.

  26. Hi Guy,

    I like the ability to use the “more tools” and look at recent results as well.

    The actual “archive” piece of this isn’t functioning, and we don’t know if Google will add it in the future, but I agree that it could be a great educational resource.

  27. Hi Bill,

    You raise a good point. It would be difficult for search engines to determine whether the weight of a recent version of content(a) outweighs the quality/authority of content(a). Ultimately this could heavily rely on the nature of the content and the speed at which the topical content changes. I cannot imagine that a simple algorithmic formula to determine ranking would work in this instance.

  28. I personally wouldn’t be okay with Google doing this simply because things change for a variety of reasons. Being able to find older pages could lead to misunderstandings between a site owner and his visitors.

  29. Hi Kevin,

    I agree with you on that point. There are many reasons why someone might change the content of a web page, and some of those changes are inspired by things like removing content that causes harm to someone or content that infringes copyright in some manner. Pages change for more reasons than new designs, or the addition of updated material, and enabling people to view older versions of pages may not always be a very good idea.

  30. Hi Geoff,

    Ranking pages becomes a lot more complex when you try to find ways to measure the kinds of things like that. Google has published a few patent filings that consider a probabilistic approach to ranking pages that is fairly complex, using a considerable amount of user-behavior data involving looking at “instances” or triplets of features involving users, queries, and pages.

    I recently referred to one of those patents and the kinds of data that might be collected in my post How Google Might Suggest Topics for You to Write About, in the section with the heading “Query Statistics and Document Statistics.” Using that kind of data to decide things like whether people might prefer to see the most recent information on a topic, or the most authoritative, may be an approach that Google would use, but as you note, it wouldn’t be simple.

  31. I’ve been using the leftbar “date range” tool in my research – it’s invaluable when finding the first article written about a subject, or the original source of a commonly held view. You can enter just an “end date” in the range and list all pages created before that date, etc.
    I did not realise that this tool listed pages as they “used to be ranked” – I thought it just listed pages by the date they first appeared in the index. Are you sure it’s a history of ranking, and not just a trimming of current results based on the age of the page?

  32. The British Library has been archiving UK websites for quite a few years now. They do it in a ‘selective’ way, rather than just a blanket approach – they do make a valid point, that in maybe 50 to 100 years’ time, there will be a huge amount of data that was only ever published online which could just disappear forever, as there are no copies once a website doesn’t exist anymore. If Google does manage to pull this off, it would be a fantastic resource, maybe not so much now, but in twenty or thirty years’ time – it does beg the question, where are they going to store it? I can feel ‘Google island’ being bought by them sooner or later, just to fill with data centres!!

  33. Hi Alex,

    I’m finding the date range tool pretty useful, too.

    You actually have two choices when you view the date range result. You can order them by ranking or by the age of the page, or as Google refers to them “Sorted by relevance” or “Sorted by date.” By default, you see the results sorted by relevance.

  34. Google has been nicely positioning itself as a middleman between you and what you’re actually looking for. I think its next move will be to edge out webmasters, without totally alienating them.

  35. Hi Will,

    We’re probably going to see the search engines continue to provide even more “direct” information in response to what people are searching for in the future, such as providing answers to factual questions, or showing definitions. The search engines do provide links to the sources of that information, and people interested can follow those to find more. You’re right though – Google needs to be careful not to alienate webmasters when they do something like that.

  36. I love how an organization like the ICSC can say the “low quality” content can threaten the jobs of journalists. Sure, article generation software like article factory (http://bit.ly/aVcMgc) produces low quality content that litters the web, but now journalists jump in the mix claiming we are in a content crisis!? Sorry, but I have been reading the drivel in the print media for years and want to say that a content crisis isn’t a new thing. Quite frankly I’d rather suffer through pages of drivel for at least one or two gems. Demand Media’s eHow site actually contains quite a few good bits of information. I don’t care if their incentive to create content was a link or if half their low cost content articles are pure drivel. ICSC – you and yours had your chance – admit it, the public is taking back the printing press and you don’t like it.

  37. Hi Johnnyhouse,

    I’m of the opposite bent here. I don’t mind sites like Demand that churn out content, but I hope that the higher quality content does find its way to the top rather than being buried in pages that are “just good enough.” I do like the possibility that Google might make public the queries and topics they find that have inadequate results, so that people who are willing to create great content can write for those.

  38. I didn’t know that you could search by dates on Google! They don’t keep an archived version of old websites do they? I don’t think that many website companies would like that feature.

  39. Hi Peter,

    That ability has been around for a while. I can’t say that I can pinpoint exactly when Google made that available, but I believe it’s been more than a couple of years now.

    No archived copies of old websites at this point, but I wouldn’t be surprised if they offer that sometime in the future.
