Does it matter that Google knows about a trillion Web Addresses?
Is it important that a new search engine, Cuil, has 120 billion pages indexed, which is, according to the company, “three times more than any other search engine”?
The more pages a search engine knows about, the more difficult it might be to provide the “best” pages in response to a search, or personalized results matched to a searcher’s interests.
But what if a couple of search engineers told you that a study of a major commercial search engine’s log files showed that while there are “a lot of pages out there, there are not that many pages that people go to”?
Their paper discusses the number of useful pages on the Web, meaning the pages that people actually reach through searches as opposed to all of the pages that are indexed, and explores the concept of information entropy as it relates to search indexes.
Entropy, in this context, measures the uncertainty about which information a searcher needs, and how much of it might be missed. Knowing that most searches could be answered by millions of pages in a search index, rather than billions, means that the entropy involved in search indexes is smaller than we might imagine, and that it is easier to predict which pages might be useful to a searcher.
The additional information gained through personalization might decrease that entropy even more, by making it easier for a search engine to predict which pages might be useful to a searcher.
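To make that idea a little more concrete, here is a small Python sketch using a made-up toy search log (the users, queries, and URLs are invented for illustration, not data from the Microsoft paper). It computes the uncertainty about which page gets clicked, first on its own, then given the query, and then given the query plus the user; the drop in conditional entropy is the sense in which personalization makes useful pages easier to predict.

```python
from collections import Counter
from math import log2

# Hypothetical toy search log of (user, query, clicked_url) events.
# These numbers and URLs are made up to illustrate the idea; they are
# not data from the Microsoft paper.
log = [
    ("alice", "jaguar", "jaguar.com"),
    ("alice", "jaguar", "jaguar.com"),
    ("bob",   "jaguar", "en.wikipedia.org/wiki/Jaguar"),
    ("bob",   "jaguar", "en.wikipedia.org/wiki/Jaguar"),
    ("alice", "python", "python.org"),
    ("bob",   "python", "python.org"),
]

def entropy(counts):
    """Shannon entropy, in bits, of a frequency distribution."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts)

def conditional_entropy(events, context, target):
    """H(target | context): average uncertainty about the target once the
    context (the query, or the query plus the user) is known."""
    by_context = Counter(context(e) for e in events)
    h = 0.0
    for ctx, n in by_context.items():
        clicks = Counter(target(e) for e in events if context(e) == ctx)
        h += n / len(events) * entropy(clicks.values())
    return h

url = lambda e: e[2]
print("H(URL)               =", round(entropy(Counter(map(url, log)).values()), 3))
print("H(URL | Query)       =", round(conditional_entropy(log, lambda e: e[1], url), 3))
print("H(URL | Query, User) =", round(conditional_entropy(log, lambda e: (e[0], e[1]), url), 3))
```

In this toy log, knowing the query cuts the uncertainty roughly in half, and knowing the query plus the user removes it entirely, which is the pattern the paper describes on a much larger scale.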
They discuss how personalization might still help searchers when not enough information has been collected about an individual user’s interests, by “backing off” to larger groups the individual might belong to. Those groups could be built from shared marketing demographics, or from a collaborative filtering approach that looks at shared searches, pages clicked on, and other user behavior.
If there isn’t enough data about market demographics or click-through information for aggregated or individual searches, other data that is more easily obtained, such as location information derived from IP addresses, might still provide helpful and meaningful results.
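Here is a rough sketch of that backoff idea in Python. The threshold, the group labels, and the function names are my own assumptions for illustration, not details taken from the paper: when there isn’t enough click data for the individual on a query, fall back to a broader group, and finally to the aggregate log.

```python
from collections import Counter

MIN_CLICKS = 10  # hypothetical cutoff for "enough" observations

def normalize(counts):
    """Turn raw click counts into a probability distribution over URLs."""
    total = sum(counts.values())
    return {url: c / total for url, c in counts.items()} if total else {}

def backoff_click_model(query, user_log, group_log, global_log):
    """Use the most specific click history that has enough data for the
    query, backing off from the individual to their group to everyone."""
    for source in (user_log, group_log):
        counts = source.get(query, Counter())
        if sum(counts.values()) >= MIN_CLICKS:
            return normalize(counts)
    # Back off all the way to the aggregate search log.
    return normalize(global_log.get(query, Counter()))

# Example: only 2 personal clicks for "jaguar", so the group's history is used.
user_log   = {"jaguar": Counter({"jaguar.com": 2})}
group_log  = {"jaguar": Counter({"jaguar.com": 30, "en.wikipedia.org/wiki/Jaguar": 20})}
global_log = {"jaguar": Counter({"en.wikipedia.org/wiki/Jaguar": 900, "jaguar.com": 600})}
print(backoff_click_model("jaguar", user_log, group_log, global_log))
```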
The Microsoft paper, presented this year at the First ACM International Conference on Web Search and Data Mining (WSDM 2008), raises some interesting and thoughtful questions about the size of the Web that people actually use, and describes a way of providing personalized search results when a search engine has only limited information about a searcher’s interests in a specific topic.
The paper is Entropy of Search Logs: How Hard is Search? With Personalization? With Backoff? (pdf). Greg Linden wrote up an overview of the paper in his post Does the entropy of search logs indicate that search should be easy?, and a video presentation on the topic is available online.
A couple of papers cited in its references section are also interesting reading:
Adaptive Web Search Based on User Profile Constructed without Any Effort from Users (pdf)
Personalizing Search via Automated Analysis of Interests and Activities (pdf)
The statement that there are lots of pages but only a subset is ever accessed is the outcome I would have expected from search engine and user behaviour.
Consider what an average person would do if they were handed a pile of 1,000 individual pieces of paper about a topic in no specific order. They will rifle through the first n items and then probably get bored with the exercise. Now imagine that someone had sorted the stack of paper into what they consider ‘most relevant first’. The tolerance of the reader probably hasn’t changed (they’re still willing to view n pieces of paper); however, now that they are viewing the most relevant items first, it is even more likely that they’ll view fewer than n before their need (researching topic y) has been met.
As always, interesting read, thanks Bill!
Hi Al,
Great points, and illustration.
Over the past few years, there have been more than a couple of statements from search engines (Google, Yahoo, Microsoft) competing for viewers about how many pages their indexes held. This paper provides a slightly different perspective: thinking about how people actually view and use search results, and how many pages are actually needed to answer most queries posed to a search engine.
Reducing the total number of pages that a search engine has to search through, and rank, means a more efficient search engine – one that can cache results, and one that can spend less time looking for results that don’t match personalization requirements.
Hi Rob,
Thank you.
The problem with just offering ONE number as the amount of Web pages in an index is that the number is not subdivided into the TYPES of Web pages being indexed.
Because society is flooded with changing news as well as opinion blogs, having an exorbitant number of indexed Web pages is valid if those pages are, for example:
current news
news archives
dynamic shopping pages from large sites
news group archives/blog posts
foreign language versions, etc.
Google, Yahoo and MSN have already begun weeding out duplicates and banning obvious spam pages, so essentially many of those indexed Web pages will eventually be viewed.
Hi PR NY,
Good points. It’s really tempting to treat the Web as if it were a static, unchanging collection of documents and objects, but it’s just as easy to consider it a stream of data that doesn’t stay still and doesn’t stay at the same number of pages.
Old pages are removed from the Web, new pages arrive all the time, and existing pages change. An Official Google Blog post from a couple of days ago, To infinity and beyond? No!, addresses some of the duplicate-content and spider traps on Web sites, where a search spider can get caught and find what may appear to be an infinite number of pages.
But the sad news is that there are a lot of pages on the Web that aren’t spam and aren’t duplicates, yet won’t get indexed by search engines or viewed by visitors.
And, as the Microsoft paper notes, the number of pages that people end up going to, at least from a search engine, is much smaller than the number of indexable pages on the Web. I’m not sure that will change even after duplicates and spam are handled better by search engines.