Estimations of the sizes of Google, Yahoo!, and MSN

How much information is included in the databases of the different search engines? How do these numbers strike you?

Google: 53 billion pages
Yahoo!: 8.4 billion pages
MSN: 3.7 billion pages

Those are estimates from four researchers at the Stanford University Dept. of Computer Science, who have come up with a method of Estimating the Index Sizes of Search Engines (the article has been removed from the Stanford pages – see comments below for more details) based only upon information that could be gathered from the public interfaces of the search engines.

There are a number of questions about the results they received, but the authors consider and discuss several potential sources of error that may throw off their numbers.
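The removed paper's exact method isn't described here, so as context only: a classic way to compare index sizes using nothing but public search interfaces is capture-recapture overlap estimation, in the style of Bharat and Broder. Sample random pages from engine A, check what fraction engine B also indexes, do the same in reverse, and take the ratio. A toy sketch of that arithmetic, with made-up overlap fractions (not numbers from the Stanford paper):

```python
def relative_index_size(frac_a_found_in_b: float, frac_b_found_in_a: float) -> float:
    """Estimate |A| / |B| from sampled overlap fractions.

    Sample random pages from engine A and record the fraction also
    indexed by engine B, and vice versa.  Both products approximate
    the size of the overlap |A intersect B|:

        frac_a_found_in_b * |A|  ~  frac_b_found_in_a * |B|

    so the ratio of the two fractions estimates the ratio of the
    two index sizes.
    """
    if frac_a_found_in_b <= 0:
        raise ValueError("need a nonzero overlap fraction to form the ratio")
    return frac_b_found_in_a / frac_a_found_in_b

# Hypothetical example: if 35% of pages sampled from engine A turn up
# in engine B, while 70% of pages sampled from engine B turn up in
# engine A, then engine A's index is roughly twice the size of B's.
print(relative_index_size(0.35, 0.70))  # -> 2.0
```

The hard part in practice isn't this ratio but the sampling: drawing near-uniform random pages from an engine's index through its query box is exactly where the biases discussed in the comments below creep in.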


14 thoughts on “Estimations of the sizes of Google, Yahoo!, and MSN”

  1. Those look ridiculous. I would say MSN’s index is much larger, Google’s much smaller and Yahoo!’s… probably about right.

  2. Well, in their paper, they clearly say that it’s far from being accurate.
    But it’s hard to believe that Google has over 40 Billion pages more than Yahoo!
The paper is very long, but I think it's worth reading to understand how they came up with these numbers.

  3. When I wrote the post, Google’s most popular and least popular top level domains, I did something very similar to their initial step. From the comments on that post, and at Circle ID, some issues arose that may be things they are facing in this attempt at sizing up Google.

They don’t mention the possibility of hitting different data centers when collecting numbers from the public interface of Google. I saw differences of more than a billion results for top level domains from one data center to another. I’d imagine that their numbers could vary considerably depending upon which data center answered a query.

    Let’s look at a query at two different data centers:

    site:www.microsoft.com

    19,200,000 from 66.102.11.99
    120,000,000 from 64.233.161.105

    Let’s try the same with an eight-digit number:

    for the query

    454554545

    29 from 66.102.11.99
    32 from 64.233.161.105

So the results are closer when looking at eight-digit numbers. What do you do with “supplemental results”? Do you count them? What about the results that aren’t showing on the results pages? Are they not showing because they are being filtered out as duplicates?

They do mention some of the issues in comparing numbers from one search engine to another. But I don’t see them factoring in the possibility that they are getting results from different Google data centers.

  4. What disturbs me about this paper is that Motwani was an early investor in Google, and this paper is extremely favorable to Google, and yet I don’t see any mention of his affiliation in the paper.

    I also completely do not believe the numbers. (My own disclaimer: I work at Yahoo! Search.)

  5. Ummmm do birds of a Stanford feather flock together? Tim’s point is actually more than just interesting.

Unless there was a disclaimer about an author’s vested commercial interest in the outcome of the study, this is a serious violation of academic ethics.

    Tim are you sure?

  6. I followed up with Dr. Motwani and found out that the report is preliminary and will change. Based on this and his rather spectacular research reputation I think it’s unfair to suggest any conflict of interest here and I apologize for implying that earlier:

    Rajeev Motwani of Stanford to me on Feb 16:
    This is an extremely preliminary report of our work that was only supposed to be
    circulated amongst our immediate research group for their comments and criticism.
    The results reported there are not yet in a form appropriate for public consumption
    and definitely will change as we conduct more precise experiments

  7. Thanks for following up, and letting us know more, Joe.

    I guess one of the risks of publishing something in a publicly accessible space is that people will read what you’ve written. :)

  8. In response to comment #9, “I guess one of the risks of publishing something in a publicly accessible space is that people will read what you’ve written,” wasn’t this the problem discussed in the now declassified U.S. government’s “psyops” report (reported by the BBC news in January 2006, http://news.bbc.co.uk/2/hi/americas/4655196.stm )– that U.S citizens have access to the entire web and accept as gospel data published abroad that wasn’t intended for state-side Americans to read?

  9. Hi Tony,

    I was thinking of some of the companies that published financial information on their websites without linking to that information, so that only a very few people could access that information. The URLs became known to a much larger audience than they were supposed to, and data about the financial status of those companies were shared before they were supposed to be, making the Securities and Exchange Commission somewhat unhappy.

    I’ve noticed some new documents appear on the Stanford pages with the word “Draft” appearing at the top of the front page in very large letters. Probably a good idea.

    The BBC article raises a good point though. I don’t think that effort by the US Military was very wise.

    I also don’t think the approach by the authors of the paper was a very good one. Search engines presently index the URLs of pages that they find in crawls through the web even if there aren’t corresponding pages to match those URLs. That distortion alone can skew their results significantly.

I agree with your last comment too, William, at the bottom of comment #12: “I also don’t think the approach by the authors of the paper was a very good one. Search engines presently index the URLs of pages that they find in crawls through the web even if there aren’t corresponding pages to match those URLs. That distortion alone can skew their results significantly.” However, until web publishers become aware of the numerous ways to block redirection to a more current page, it remains a great way to capture data in Adobe Acrobat before redirection schemes have a chance to do their crafty little maneuvers. Now more have caught on to such easy-to-use tricks, and it’s harder to do such retrievals except in cases where webmasters haven’t read that they can request that data in the Google cache be removed. My opinion as an infobroker is that if it’s not totally unretrievable by any search engine or data cache in existence, it is still in the public domain and fair game for downloading and sharing.

I think it’s possible to find some interesting stuff by spending a little time reverse hacking URLs on some sites.

    If people don’t want something shared with the public, they should really think twice about uploading it to a web server.

I’ve accessed a number of documents and pages that were only available through a search engine cache or the Wayback Machine over at the Internet Archive, and I’ve become paranoid enough to start saving some PDF documents on my computer for fear that they might disappear.

One example I can think of is the white papers from Applied Semantics on their CIRCLA technology, which disappeared shortly after Google purchased the company (except in the Google cache, for a while).

Comments are closed.