The Entropy of Search Logs and Personalization with Backoff

Does it matter that Google knows about a trillion web addresses?

Is it important that the new search engine Cuil claims to have indexed 120 billion pages, “three times more than any other search engine”?

The more pages a search engine knows about, the harder it might be to return the “best” pages in response to a search, or to personalize results according to a searcher’s interests.

But what if a couple of search engineers told you that a study of a major commercial search engine’s log files showed that while there are “a lot of pages out there, there are not that many pages that people actually go to”?

Their paper discusses the number of useful pages on the web, meaning the pages that people actually visit through search as opposed to all of the pages that are indexed, and explores the concept of information entropy as it applies to search logs.
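To make the idea of entropy a little more concrete, here is a minimal sketch in Python of how the Shannon entropy of a list of clicked URLs could be computed. The function name `entropy_bits` and the sample URLs are made up for illustration; they are not taken from the paper, and the paper's own measurements may be defined differently.

```python
import math
from collections import Counter


def entropy_bits(events):
    """Shannon entropy, in bits, of the empirical distribution of `events`."""
    counts = Counter(events)
    total = sum(counts.values())
    # H = -sum(p * log2(p)) over every distinct event observed
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


# Hypothetical click log: one URL draws half of all clicks.
clicked_urls = ["a.example", "a.example", "b.example", "c.example"]
print(entropy_bits(clicked_urls))  # 1.5
```

If the four clicks had gone to four different pages, the entropy would be 2 bits. The more that clicks concentrate on a few pages, the lower the entropy, which is one way to express the observation that people actually go to relatively few pages.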


How Using Categories for Queries Can Help Searchers, Writers, and Search Engines

You search for “Foo Fighters,” and the search engine takes your query and starts searching its databases to identify results. It might look through a video database to see if there are any good videos to show you. It may dig through a news database to see if there is any recent news tied to the phrase, or an image database to see if there are any popular pictures of the band. The search engine may also check whether any advertisers are running campaigns using the band’s name.
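As a rough illustration of that fan-out, the sketch below checks the same exact phrase against several small in-memory collections. Everything in it is an assumption made for the example: the function `search_verticals`, the collection names, and the sample documents are invented, and a real search engine's databases and matching are far more involved.

```python
def search_verticals(query, verticals):
    """Return, per collection, the documents containing the exact query phrase."""
    phrase = query.lower()
    results = {}
    for name, documents in verticals.items():
        # Case-insensitive exact-phrase match; a stand-in for real retrieval.
        hits = [doc for doc in documents if phrase in doc.lower()]
        if hits:
            results[name] = hits
    return results


# Hypothetical collections standing in for video, news, and image databases.
verticals = {
    "video": ["Foo Fighters live concert", "Cooking pasta at home"],
    "news": ["Local election results"],
    "images": ["Foo Fighters band photo"],
}

print(search_verticals("Foo Fighters", verticals))
```

With this sample data, the video and image collections each return one match and the news collection returns none, so it is left out of the results.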

Some of that searching is done by taking the exact phrase that you used in your search, “Foo Fighters,” and finding a set of results that you might be satisfied to see. But there are steps that a search engine could take that might give you even better results.

Associating Search Terms with Categories
