Google Aiming at 100 Billion Pages?

What would it take for Google to include in its index 100 billion pages?

Could they develop a way for people to search for, and look at, older versions of web pages, while also improving the quality of their search results? Would indexing words within conceptually related phrases make the search process better?

A recent patent application from Google estimates that the web contains around 200 billion pages, and guesses that the largest indexes of the major search engines hold around 6-8 billion pages. The document is Multiple index based information retrieval system, US Patent Application 20060106792, which was published May 18, 2006, and originally filed on January 25, 2005.

In addition to providing us with a rough estimate of the size of the web, and the number of pages indexed by search engines, it also tries to answer the questions I asked at the top of this post.

The inventor listed in the patent filing is Anna Patterson, who has already built a search engine that holds more than 55 billion pages (The Internet Archive). Part of the process described in the document was the subject of a blog post here back in February – Move over pagerank: Google’s looking at phrases?

This method would enable Google to hold so many documents through the use of multiple indexes: one with relevance information, and likely a secondary, or supplemental, index with much less information.

The information retrieval system described in the patent application would:

  1. Comprehensively identify phrases in documents on the web,
  2. Index those documents according to phrases,
  3. Search and rank documents in accordance with their phrases, and
  4. Provide additional clustering and descriptive information about the documents.
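
As a rough illustration of step 2, here is a minimal sketch of an index keyed by phrases rather than single words. It assumes the phrases have already been identified (step 1), which is the hard part and is not shown. The function name `index_by_phrase` and the sample documents are made up for this example and do not come from the patent filing.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set


def index_by_phrase(docs: Dict[int, str], phrases: Iterable[str]) -> Dict[str, Set[int]]:
    """Map each known phrase to the set of document IDs that contain it."""
    index = defaultdict(set)
    phrase_list = [p.lower() for p in phrases]
    for doc_id, text in docs.items():
        # Lowercase, collapse whitespace, and pad with spaces so a phrase
        # only matches on whole words. Punctuation is not handled here.
        padded = " " + " ".join(text.lower().split()) + " "
        for phrase in phrase_list:
            if " " + phrase + " " in padded:
                index[phrase].add(doc_id)
    return dict(index)


# Hypothetical sample documents, keyed by document ID.
docs = {
    1: "Compare platinum credit card rates",
    2: "Credit card debt advice",
    3: "Travel insurance quotes",
}
index = index_by_phrase(docs, ["credit card", "travel insurance"])
```

A search for "credit card" would then look up one phrase entry instead of intersecting separate lists for "credit" and "card".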

But it wouldn’t include information about all of the documents. Instead, it would store higher ranked documents in a primary index, ranked in order of the relevance scores, and store lesser ranked documents in a secondary index, by numerical order of document identifiers assigned to each document.
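
To make the two orderings concrete, here is a small sketch of how the documents for a single phrase might be split. The function name `split_posting_list`, the fixed cutoff `primary_size`, and the sample scores are all assumptions for illustration; the filing, as summarized here, does not say how the cutoff is chosen.

```python
from typing import Dict, List, Tuple


def split_posting_list(
    scores: Dict[int, float], primary_size: int
) -> Tuple[List[Tuple[int, float]], List[int]]:
    """Split one phrase's documents into primary and secondary entries."""
    # Rank by score, highest first; ties are broken by document ID.
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    # Primary index: top documents, kept in relevance order with their scores.
    primary = ranked[:primary_size]
    # Secondary index: everything else, document IDs only, in numerical order.
    secondary = sorted(doc_id for doc_id, _ in ranked[primary_size:])
    return primary, secondary


# Hypothetical document IDs and relevance scores.
scores = {17: 0.2, 4: 0.9, 42: 0.5, 8: 0.1, 23: 0.7}
primary, secondary = split_posting_list(scores, primary_size=2)
print(primary)    # [(4, 0.9), (23, 0.7)]
print(secondary)  # [8, 17, 42]
```

Keeping the secondary entries as bare, sorted identifiers is what would make them so much cheaper to store than the primary entries.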

The relevance score used would be a PageRank-type score, and a number of relevance attributes would also be stored for each document in the primary index.

The kinds of relevance attributes would include at least one of the following:

  1. Total number of occurrences of the phrase in document,
  2. A rank ordered list of anchor documents that also contain the phrase and that point to the document,
  3. The position of each phrase occurrence in the document,
  4. A set of one or more flags indicating a format of the occurrence, or
  5. A portion of the document containing the occurrence.
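
One way to picture a primary index entry carrying these attributes is as a small record. This is only a sketch: the class name `PrimaryPosting`, the field names, the use of character offsets for positions, and the 40-character snippet length are all assumptions, not details from the filing. The anchor document list and format flags are left empty because they would come from link and markup data not modeled here.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class PrimaryPosting:
    doc_id: int
    score: float
    occurrence_count: int = 0                                  # attribute 1
    anchor_doc_ids: List[int] = field(default_factory=list)    # attribute 2
    positions: List[int] = field(default_factory=list)         # attribute 3
    format_flags: List[str] = field(default_factory=list)      # attribute 4
    snippet: str = ""                                          # attribute 5


def build_posting(doc_id: int, score: float, text: str, phrase: str) -> PrimaryPosting:
    """Fill in the attributes that can be derived from the text alone."""
    if not phrase:
        raise ValueError("phrase must not be empty")
    positions = []
    start = text.find(phrase)  # case-sensitive, non-overlapping matches
    while start != -1:
        positions.append(start)
        start = text.find(phrase, start + len(phrase))
    # Use the text around the first occurrence as the stored portion.
    snippet = text[positions[0]:positions[0] + 40] if positions else ""
    return PrimaryPosting(
        doc_id=doc_id,
        score=score,
        occurrence_count=len(positions),
        positions=positions,
        snippet=snippet,
    )


posting = build_posting(
    4, 0.9, "credit card offers: compare each credit card here", "credit card"
)
```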

The secondary index described would contain only document identification information. This secondary index reminded me a little of what many people refer to as the Google Sandbox these days, where new sites have trouble ranking for many keyword phrases.


11 thoughts on “Google Aiming at 100 Billion Pages?”

  1. My only question would be: how necessary would this really be?

    There is always a lot of talk about why Y/G/MSN only shows X amount of results

    For all the engines the best way to improve results would be the ability to drill down and drill down again to give not only the best result for the searcher but also the advertiser, in terms of quality and CPC

    But to be consumer friendly it would need to be no more than a 4 click process

    If you look at the top search terms on goog, in the financial area words like Insurance/Credit Card/Loans are in there, with thousands upon thousands of clicks a day

    It’s so pointless, and all it does is fund their business model, because most agencies/advertisers are buying broad-based terms

    4 search model

    – you search for credit card
    – results shown – but giving options – bad debt/platinum etc….
    – click platinum (I wish :) ) and given further options as well as existing search results based on rate/charges/etc……

    I am well aware this has been tried before by much smaller engines, and Yahoo has tested this, but for me it would be more relevant

    If you want to make search more effective, the engines could learn a lot from price comparison engines as opposed to just relying on existing algos

    That’s my very long-winded answer to “why does 1 billion results matter” :)

  2. Thanks, Devil Fish. You make some great points.

    I think that you’re right about the ability to drill down into different categories that meet the intent of the people searching. I suspect that we are going to start seeing some smarter search results pages in the future, that more commonly cluster results and show categories that allow us to drill down.

    I like some of the approaches that they are taking with this phrase method, like the idea of trying to present results from different categories (based upon phrases) as a mix, presented in the top ten results. That’s discussed a little more in my older post on phrases, which I linked to above – Move over pagerank: Google’s looking at phrases?.

    Making it obvious what those categories are, and associating them with an excellent representative result, and including a “more like this” link would be one way to enable people to drill down into different clusters of categories. It would be more like, as you suggest, one of the very good price comparison engines.

    I agree with your statement about including so many more results – what good does it do if you can only see the first 1,000, if you can’t find a good match for what you are looking for? There are some nice aspects of this approach, and I could see Google taking some of these steps.

    The only way adding so many more results makes sense is by making it easier to find and see only the results you are really looking for. I think that this patent application might be aimed at doing some of that.

  3. Well, the drill down has already started to happen

    I have seen lots of posts on different forums saying they are now clamping down on “pharma spam” so if you type in Viagra you now get a drill on G

    http://www.google.co.uk/search?sourceid=navclient&ie=UTF-8&rls=SNYG,SNYG:2004-46,SNYG:en&q=viagra

    Why is a pharma spammer so different from an “Insurance” spammer – I think people are missing a much bigger issue and long term plan

    It probably hasn’t helped the issues with the expedia.fr blogs and the spamfest fiasco they had, but you can now see that G is at least sticking its toe in the water with real relevancy

    I know this is a first step

    But you can see now

    – User searches for “Viagra” on G: provide more options as well as existing natural search

    – User searches for “Buy Viagra”: provide natural result set first, no additional options

    As they say “don’t be evil”

    Do they really mean

    don’t be evil to the advertisers we really need, who make up the lion’s share of our revenues: financial, travel, ecommerce (so don’t mess with these results)

    Either way, bring on more relevancy. Goog and all the other engines have creamed too much $$$ from the broad-based advertising keyword spend, fighting for the small pot of search terms

    Extra relevancy is brilliant for the user. Now that Google is moving to the next level in the search field, we might see more interesting developments, such as Yahoo writing an algo that was not based on the logic of a 14 year old skateboarder

  4. Where would these additional pages be obtained from? What vast goldmine is out there in cyberspace that has not been exploited?

    Are they considering getting access to “subscription” only material, e.g. Science Journals,

    or, more dynamically created webpages

    or, public government documents

    or, perhaps excerpts/preview pages from BOOKS will be included in SERPS, not just existing on AMAZON or Google Books

    Probably, some of the potential content will be of a very esoteric nature or real-time dynamically updated info – valuable for immediate access for researchers.

    Perhaps in the next decade – university students will just go to Google to download textbooks or buy specific pages as opposed to visiting the college bookstore.

  5. But you can see now

    – User searches for “Viagra” on G: provide more options as well as existing natural search

    I’ve been getting a few questions on those types of query refinements from people. It is a step in the right direction.

    Where would these additional pages be obtained from? What vast goldmine is out there in cyberspace that has not been exploited?

    You mention some possibilities. The patent application points to older versions of sites, and giving people a chance to see those. (Not a surprising suggestion from someone who worked on a search engine for the Internet Archive.)

    Will sites get indexed more deeply than they are now? Maybe, though I see a number of people recently complaining about having fewer pages indexed now, instead of more. Still, it probably wouldn’t take too much to tweak an indexing program to make spiders consider sending some deeper pages back to the index for inclusion.

  6. What about mashups? Google already creates mashup snippets to support a query. Why not make mashup web “pages” pulled from a domain? Call it a page…why not? It fits within the “cached snippets” realm of copyright.

    Every SEO knows how to make 2^n permutations of a list (or something like that :-) so if Google wants users to click through ads, why not create mashups with ads?
