How much information is included in the databases of the different search engines? How do these numbers strike you?
Google 53 Billion Pages
Yahoo! 8.4 Billion Pages
MSN 3.7 Billion Pages
Those are estimates from four researchers at the Stanford University Dept. of Computer Science, who have come up with a method of Estimating the Index Sizes of Search Engines (the article has been removed from the Stanford pages – see comments below for more details) based only upon information that could be gathered from the public interfaces of the search engines.
The results raise a number of questions, but the authors do discuss potential sources of error that may throw off their numbers.
Google isn’t the biggest search engine that Anna Lynn Patterson has worked on. That distinction probably falls to the Internet Archive, which she worked on before joining Google, and which likely has a few billion more pages in its database than Google (the archive has 55 billion web pages right now).
In addition to that feat, Anna is the writer of a pretty good article on search engines, over at ACM Queue, titled Why Writing Your Own Search Engine is Hard.
The latest search engine description from Anna Patterson, published yesterday, involves a search engine that is immune to Google Bombing. It could be said to reward authors for well-written HTML and good punctuation. It can find relevant pages that don’t include the query terms, while remaining immune to Google Bombing. She also describes a way to personalize results with the search engine, and to detect and eliminate duplicates.
The search engine that she has conceived of can also be set to serve a mix of relevant pages from different topics in search results to searchers. For example, a search for “Blues” could easily be set to display pages on the first page of search results that lead to:
Search Engine Watch editor Gary Price will be joining Ask Jeeves as Director of Online Information Resources. He will be leaving his editorial position at Search Engine Watch, to lead an outreach program at Ask, where he will work with the library and education communities, and provide advice on new search products for the company.
It’s a terrific move for Ask Jeeves, and I wish Gary much joy in his new role. His participation at Search Engine Watch will be missed. Gary has more on the change over at ResourceShelf.
We often focus on how search engines respond to queries here, but don’t often look too closely at the pages of the search engines themselves.
How important a role does usability play in determining which search engine a person will use?
One important aspect of search is how quickly results are retrieved. That amount of time seems insignificant these days, but I remember a time not too long ago when you would have to watch your screen for a number of seconds before a list of results appeared in front of you.
Is it still important to see something like this above a set of results from Google:
Results 1 – 100 of about 38,000,000 for search usability. (0.36 seconds)
I will be joining Rand Fishkin of SEOmoz and Jon Glick of Become.com on a panel focusing upon Search Algorithm Research at the New York Search Engine Strategies Conference on February 28, 2006.
If you are going to attend the conference, or are in the New York City area during that week, and want to meet up, or say hello, please let me know. I had a great time at last year’s SES in New York, and met lots of great folks. I’m looking forward to attending this year.
Google has added another top search scientist to their team.
The Seattle Post-Intelligencer noted yesterday that Udi Manber, chief executive of Amazon’s A9, will be leaving the Seattle-based company to become a Vice President at Google.
Both John Battelle and Gary Price have more on the move.
Udi Manber was a professor of Computer Science at the University of Wisconsin and at the University of Arizona. During his academic career, he co-developed a number of popular search software packages.
He joined Yahoo! as their chief scientist in 1998.
A new patent application from Microsoft considers ways to present search results to searchers in clusters, with meaningful names.
Published on February 2, 2006, it was originally filed on July 13, 2004, and is assigned to Microsoft Corporation.
Query-based snippet clustering for search result grouping
Inventors: Hua-Jun Zeng, Qicai He, Guimei Liu, Zheng Chen, Benyu Zhang, and Wei-Ying Ma
US Patent Application 20060026152
A clustering architecture that dynamically groups the search result documents into clusters labeled by phrases extracted from the search result snippets. Documents related to the same topic usually share a common vocabulary. The words are first clustered based on their co-occurrences and each cluster forms a potentially interesting topic. Keywords are chosen and then clustered by counting co-occurrences of pairs of keywords. Documents are assigned to relevant topics based on the feature vectors of the clusters.
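To make the abstract a little more concrete, here is a minimal Python sketch of the general idea it describes: pick keywords from result snippets, group keywords that co-occur, and assign each snippet to the group it overlaps most. This is only an illustration under my own assumptions, not the method claimed in the patent application. The function names, the stopword list, the thresholds (min_doc_freq, min_cooccur), the choice to drop the query terms, and the rule for picking a cluster label are all hypothetical.

```python
import re
from collections import Counter
from itertools import combinations

# Tiny illustrative stopword list (an assumption, not from the patent filing).
STOPWORDS = {"a", "an", "and", "the", "of", "in", "for", "to", "on", "is", "with"}


def tokenize(text):
    """Lowercase the text and return its set of non-stopword terms."""
    return {w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS}


def cluster_snippets(snippets, query, min_doc_freq=2, min_cooccur=2):
    """Group snippet indices under keyword-based labels (simplified sketch)."""
    query_terms = tokenize(query)
    # Drop query terms: they appear in nearly every snippet and would
    # link all keywords into one group.
    docs = [tokenize(s) - query_terms for s in snippets]

    # Keywords are terms that appear in at least min_doc_freq snippets.
    doc_freq = Counter(term for doc in docs for term in doc)
    keywords = {t for t, n in doc_freq.items() if n >= min_doc_freq}

    # Count how many snippets contain each pair of keywords together.
    pair_counts = Counter()
    for doc in docs:
        for pair in combinations(sorted(doc & keywords), 2):
            pair_counts[pair] += 1

    # Merge keywords that co-occur often enough, using a small union-find.
    parent = {k: k for k in keywords}

    def find(k):
        while parent[k] != k:
            parent[k] = parent[parent[k]]
            k = parent[k]
        return k

    for (a, b), n in pair_counts.items():
        if n >= min_cooccur:
            parent[find(a)] = find(b)

    groups = {}
    for k in keywords:
        groups.setdefault(find(k), set()).add(k)
    clusters = list(groups.values())

    # Assign each snippet to the keyword group it shares the most terms with.
    result = {}
    for i, doc in enumerate(docs):
        best = max(clusters, key=lambda g: len(doc & g), default=None)
        if best is None or not (doc & best):
            label = "other"
        else:
            # Label: most frequent keyword in the group, ties broken alphabetically.
            label = min(best, key=lambda t: (-doc_freq[t], t))
        result.setdefault(label, []).append(i)
    return result
```

Called with a handful of snippets for an ambiguous query, the function returns a dictionary that maps each label to the positions of the snippets filed under it. A real system would need phrase extraction, better keyword scoring, and a smarter way to name clusters, all of which this sketch leaves out.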
Some recent research I’ve been doing had me looking at the Infoseek search engine, and its part in the history of search engines. I remembered an old book I have on search engines which has a couple of chapters on Infoseek, and started to reread it.
The book is the Web Developer.com Guide to Search Engines, from February of 1998. It’s been a while since I’ve picked up a book about search engines that doesn’t mention Google. This one focuses upon the search engines on the web at that time, and on adding a search feature to your site.
I didn’t get much past the first section of the first chapter of the book, titled Bow Down and Give Thanks to Archie, before I hopped on the web and started looking at Archie’s role on the net. As it notes there:
The grandfather of all search engines was Archie, created in 1990 by Alan Emtage, a student at McGill University in Montreal.