I hate doing the very same task over and over and over again. I’d bet search engines do, too.
Increasing the efficiency of search
Chances are that your search engine of choice, whether it’s Google, Yahoo!, MSN, or another, uses some methods to make what it does a little more efficient and a little less costly to run.
Of course, popular searches are ones that lots of folks search for. If a search engine processed every search request as if it were brand new, grabbing results from its index, or indexes, even for searches repeated often, it would be reinventing the wheel frequently. But what if there was a way to speed that up?
What if it didn’t do a full search for the most popular terms over and over? What if that approach created significant savings in time and processing power?
Caching results pages from popular searches
One method might involve using a cache file, like the cache file that your browser uses to store temporary internet pages. The idea behind the browser cache is that if you return to a page, your computer doesn’t have to ask for the page all over again. It can instead just grab the local copy of the page that is in the cache file.
So, imagine a search engine doing the same thing. It has a separate computer or set of computers that hold cache files. When someone enters a request for a certain word or phrase, the first stop may be a look in those cache files rather than a lookup in the search engine’s index.
If a set of results for the phrase or word searched for is there, the search engine could serve it to you. If it isn’t, the search engine might then perform a full search. That could save some processing power and some time.
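Just to make that idea concrete, here is a minimal sketch in Python of that kind of lookup. It assumes a hypothetical index object with a full_search method and uses a plain dictionary as the cache; a real search engine’s setup would be far more involved.

```python
# Hypothetical sketch: check a cache of results pages before doing a full index lookup.
class ResultsCache:
    def __init__(self):
        self._pages = {}  # query string -> cached results page

    def get(self, query):
        return self._pages.get(query)

    def put(self, query, results_page):
        self._pages[query] = results_page


def answer_query(query, cache, index):
    """Serve a cached results page if one exists; otherwise run the full search."""
    cached = cache.get(query)
    if cached is not None:
        return cached                      # cache hit: no index lookup needed
    results = index.full_search(query)     # cache miss: do the expensive search
    cache.put(query, results)              # remember it for the next person who asks
    return results
```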
Now keep in mind that the most popular searches can account for a large percentage of the requests that a search engine gets. If you look somewhere like the Google Zeitgeist pages, you can see some of the terms that the search engine might be keeping in a cache.
Deciding how many results to cache
This is a tricky area. The efficiency gained by caching some results pages for popular searches could be lost by not caching enough results or caching too many. So how do you figure out how many results to cache? Well, you could look at how people use search engines. For example, if people typically only look at the first page of results – ten links and titles and snippets – then you might only want to keep track of those ten results for a certain query.
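As a toy illustration of that choice, the sketch below caches only the first page’s worth of results per query. The ten-results-per-page figure and the pages_to_cache parameter are just assumptions for the example, not anything a search engine has published.

```python
RESULTS_PER_PAGE = 10  # assume most users never look past the first page

def results_to_cache(query, index, pages_to_cache=1):
    """Cache only as many result pages as users are likely to request."""
    top_k = RESULTS_PER_PAGE * pages_to_cache
    return index.full_search(query)[:top_k]  # keep just the top-ranked results
```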
A paper that looks like it was written in 2004 describes how to make a search engine run more efficiently. One of the authors’ pages notes that it will be published in ACM Transactions on Information Systems, Vol. 24, No. 1, January 2006.
It’s impossible to tell, without seeing that issue, whether the published version will be the same as the one that appears here:
By Tiziano Fagni, Raffaele Perego, Fabrizio Silvestri, and Salvatore Orlando
It could be that this is the version that will be published, or they could be submitting an updated version. Interesting paper, regardless.
If you like thinking about how a search engine does what it does, this is an article worth spending a little time with.
It discusses some of the hardware and software that might make such a system work and how it would be organized. It also looks at other papers that have studied how usage data might make a search engine run better.
Two levels of caching are described: a read-only static set of information and a dynamic set of information, along with replacement policies that determine when a result makes it into the dynamic cache. The paper also discusses prefetching, a way of anticipating what people might look for next and deciding how many results to cache or prefetch based on the query that was submitted.
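Here’s a very rough sketch of that two-level idea, my own simplification rather than the paper’s actual implementation. It fills the static section from a list of past queries and uses a simple LRU policy for the dynamic section purely as an example; the paper evaluates several replacement policies, and the index object here is again hypothetical.

```python
from collections import Counter, OrderedDict

class StaticDynamicCache:
    """Toy two-level cache: a read-only static section filled from historical
    query logs, plus a dynamic section with a simple LRU replacement policy."""

    def __init__(self, past_queries, static_size, dynamic_size, index):
        # Static section: results for the historically most frequent queries.
        most_common = Counter(past_queries).most_common(static_size)
        self.static = {q: index.full_search(q) for q, _ in most_common}
        self.dynamic = OrderedDict()      # query -> results, kept in recency order
        self.dynamic_size = dynamic_size
        self.index = index

    def lookup(self, query):
        if query in self.static:          # static hit: this section never changes
            return self.static[query]
        if query in self.dynamic:         # dynamic hit: refresh its recency
            self.dynamic.move_to_end(query)
            return self.dynamic[query]
        results = self.index.full_search(query)   # miss: fall back to a full search
        self.dynamic[query] = results
        if len(self.dynamic) > self.dynamic_size:
            self.dynamic.popitem(last=False)       # evict the least recently used entry
        return results
```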
How often should that static cache be refreshed? They note that topic as something for further study. However, they also state that the freshness of results can suffer when you use statistics from the past to predict the future.
Bill, every major search engine to my knowledge already caches results to some degree, and has long done so. My understanding is that they tend not to hit disk unless they have a good reason to do so. Those same popular queries that happen over and over are, as I understand it, pulled out of memory, and they’ve got ways to ensure that’s refreshed when needed.
Hi Danny,
Thanks for your comment. This post was pretty much meant as a simple restatement of what search engines do when responding to queries, but it might not be so obvious to some folks who use search engines.
When I wrote it, I didn’t think I was going into any groundbreaking new territory. Rather, I considered it as possibly laying the foundation for some future posts involving some of the mechanics of what we see when we see a search engine in action.
With this type of caching and prefetching, where things start to get a little interesting is with the replacement policies that determine when results within a cache or caches are refreshed and when new documents replace old ones in results pages.
I’ve written a couple of recent posts about patent applications describing predictive queries, and how they are used in places like Google Suggest, and possibly on mobile devices to assist with data entry on smaller or constrained keyboards, or with a stylus.
I remember when I first looked at Google Suggest, I was impressed with the Ajax interface that updated the suggestions shown, without giving much thought to where they were getting those results. The patent application that described how that might work also gave us some insight into where those results were coming from. Caching and prefetching of results was covered pretty well in that document.
I’ve trained a few people who had extensive web design backgrounds but hadn’t had much experience with how search engines work, and I’ve seen them slap themselves on the forehead when I mentioned something like this, saying it’s so obvious that they should have realized it was going on.
Hi folks,
I’m one of the authors of the paper…
I would like to just point out some items that, I hope, should clarify the picture.
Our SDC policy is actually a policy aimed at deciding whether an entry in the cache should be evicted or not. The key point of SDC is that it uses statistics on usage in order to fill a static section of the cache with the pages most likely to be requested in the future. Unlike previous policies, this section doesn’t change, allowing us to keep up with queries that are frequently submitted but not submitted at high rates… The dynamic section, which is a classic cache, is instead concerned with queries that are submitted at a high rate.
About the refresh rate of the static set… In the paper that will appear in the journal (which is a little bit different from the tech report linked in the article), we also measured how often the static set should be refreshed… We have nice results about that… People curious about it may e-mail me and I’ll send a copy of the camera-ready paper.
Thank you, Dr. Silvestri.
The static section of the cache that the paper describes did sound like something unique to me. I appreciate your taking the time to point that out here. I’ll likely be in touch soon to hear more about your results.