How Do You Estimate the Size of A Search Engine?

And how do you grab a random page from that search engine?

A new Google employee, Ziv Bar-Yossef, gave a presentation at Google on August 17th answering those questions, which is available on a Google Techtalk video: Random Sampling from a Search Engine’s Index (video).

Ziv Bar-Yossef was most recently at Technion – Israel Institute of Technology, Israel, and as noted in the video, became a Google employee a couple of weeks ago. Before Technion, he was a researcher at the IBM Almaden Research Center.

The presentation is based upon a paper that won the 2006 International World-Wide Web Conference Best Paper Award: Random Sampling from a Search Engine’s Index.

Being able to grab random pages from a search engine’s index can provide some interesting information about that search engine. The presentation compares things such as the number of dead pages in Google, MSN, and Yahoo, as well as the freshness of text on each, and what percentage of dynamic pages they have indexed.
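To make that concrete, here is a minimal sketch of how a random sample could be turned into an estimate such as "percentage of dead pages." It is not the method from the paper or the talk; it assumes you already have a uniform random sample of pages and a true/false flag for each one (the function name, the sample size of 1,000, and the 5% rate are all made up for illustration). It uses the standard normal-approximation confidence interval for a proportion.

```python
import math
import random


def estimate_fraction(sample_flags, z=1.96):
    """Estimate the fraction of True flags in a uniform random sample.

    Returns (estimate, low, high), where low/high bound a
    normal-approximation (Wald) interval; z=1.96 is roughly 95%.
    """
    n = len(sample_flags)
    if n == 0:
        raise ValueError("sample must not be empty")
    p = sum(sample_flags) / n
    margin = z * math.sqrt(p * (1 - p) / n)
    # Clamp to [0, 1] because a proportion cannot fall outside that range.
    return p, max(0.0, p - margin), min(1.0, p + margin)


if __name__ == "__main__":
    # Hypothetical data: pretend we sampled 1,000 pages and about 5%
    # of them turned out to be dead when fetched.
    rng = random.Random(0)
    sample = [rng.random() < 0.05 for _ in range(1000)]
    p, low, high = estimate_fraction(sample)
    print(f"dead pages: {p:.1%} (95% CI {low:.1%} to {high:.1%})")
```

The hard part, and the subject of the presentation, is getting a sample that is actually close to uniform; the arithmetic above is only as good as the sample fed into it.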

Patents and patent applications in the US from Ziv Bar-Yossef:

2 thoughts on “How Do You Estimate the Size of A Search Engine?”

  1. Seems to me Google still has the largest collection of indexed pages, even compared to MSFT’s new Bing, which I feel hasn’t lived up to the hype.

  2. Hi Alex,

    I’m interested in seeing how Bing grows in the future – competition amongst the search engines is probably a good thing.

    There was a very interesting post at the Cuil Blog recently, which discussed Bing, and how it appeared that when Microsoft launched Bing, they increased the number of ranking signals that they use to find relevant pages, but they also reduced the number of pages in their index substantially. Here’s a snippet:

    A fairly quick test shows that Bing is now around 20% the size of Google, so it has replaced half its index with additional ranking signals. This means Bing is indexing less than 10B pages, which is pretty sparse. To put this in perspective, Google has not been that small since 2003.

    I do expect the size of Bing’s index to grow over time. The question I have is whether, when it does, it will show more relevant results than some of the other search engines, such as Google. Index size alone isn’t as important as the final objective behind search: showing us pages that are actually helpful in our quest for information, and pages that let us perform the tasks we want to perform online.
