And how do you grab a random page from that search engine?
A new Google employee, Ziv Bar-Yossef, gave a presentation at Google on August 17th answering those questions, which is available on a Google Techtalk video: Random Sampling from a Search Engine’s Index (video).
Ziv Bar-Yossef was most recently at Technion – Israel Institute of Technology, Israel, and as noted in the video, became a Google employee a couple of weeks ago. Before Technion, he was a researcher at the IBM Almaden Research Center.
The presentation is based upon a paper that won the 2006 International World-Wide Web Conference Best Paper Award: Random Sampling from a Search Engine’s Index.
Being able to grab random pages from a search engine’s index can provide some interesting information about that search engine. The presentation compares things such as the number of dead pages in Google, MSN, and Yahoo, as well as the freshness of text on each, and what percentage of dynamic pages they have indexed.
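To make the idea of measuring an index from random pages concrete, here is a minimal sketch of how a fraction such as "dead pages" could be estimated once a uniform random sample is in hand. It does not reproduce the sampling technique from the paper or the talk; it assumes the hard part (getting an unbiased sample) is already done. The function name `estimate_fraction`, the simulated index, and the 3% dead-page rate are all made up for illustration.

```python
import math
import random


def estimate_fraction(flags, z=1.96):
    """Estimate the fraction of True values in a sample.

    Returns (estimate, low, high), where low/high is a normal-approximation
    confidence interval (z=1.96 corresponds to roughly 95%).
    """
    n = len(flags)
    if n == 0:
        raise ValueError("need at least one sampled page")
    p = sum(flags) / n
    half_width = z * math.sqrt(p * (1 - p) / n)
    # Clamp to [0, 1] because a fraction cannot fall outside that range.
    return p, max(0.0, p - half_width), min(1.0, p + half_width)


# Hypothetical stand-in for a search index: True marks a dead page.
# The 3% rate is an invented number, not a figure from the presentation.
rng = random.Random(0)
simulated_index = [rng.random() < 0.03 for _ in range(100_000)]

# Assumption: we can draw pages uniformly at random from the index.
sample = rng.sample(simulated_index, 1000)

estimate, low, high = estimate_fraction(sample)
print(f"estimated dead-page fraction: {estimate:.3f} ({low:.3f} to {high:.3f})")
```

The same calculation would apply to any yes/no property of a page, such as whether it is dynamic. The estimate is only as good as the sample: if the sampled pages are biased toward, say, popular or long documents, the interval above says nothing about that bias.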
Patents and patent applications in the US from Ziv Bar-Yossef:
- Method and system for improving data quality in large hyperlinked text databases using pagelets and templates (US patent 6,968,331)
- System, method, and service for using a focused random walk to produce samples on a topic from a collection of hyper-linked pages (US Patent Application 20060122998)
- Methods and apparatus for assessing web page decay (US Patent Application 20060112089)
- Method and system for improving data quality in large hyperlinked text databases using pagelets and templates (US Patent Application 20030140307)
Seems to me Google still has the largest collection of indexed pages, even compared to MSFT’s new Bing, which I feel hasn’t lived up to the hype.
Hi Alex,
I’m interested in seeing how Bing grows in the future – competition amongst the search engines is probably a good thing.
There was a very interesting post on the Cuil Blog recently that discussed Bing. It appeared that when Microsoft launched Bing, they increased the number of ranking signals that they use to find relevant pages, but they also substantially reduced the number of pages in their index. Here’s a snippet:
I do expect the size of Bing’s index to grow over time. The question I have is whether, when it does, it will show more relevant results than some of the other search engines, such as Google. Index size alone isn’t as important as the final objective behind search: showing us pages that are actually helpful in our quest for information, and pages that let us perform the tasks we want to perform online.