Making Search More Efficient: A little about caching and prefetching

I hate doing the very same task over, and over, and over again. I’d bet search engines do, too.

Increasing the efficiency of search

Chances are that your choice of search engine, whether it's Google, Yahoo!, MSN, or another, uses some methods to make what it does a little more efficient, and a little less costly to run.

Of course, popular searches are ones that lots of folks enter. If a search engine processed every search request as if it were a new one, grabbing results from its index, or indexes, even for searches that were repeated often, it would be reinventing the wheel frequently. But what if there were a way to speed that up?

What if that method created significant savings in time and processing power? What if the engine didn't do a full search for the most popular terms over and over?

Caching results pages from popular searches

One method might involve using a cache file, like the one your browser uses to store temporary copies of web pages. The idea behind the browser cache is that if you return to a page, your computer doesn't have to request the page all over again. It can instead grab the local copy stored in the cache file.

So, imagine a search engine doing the same thing. It has a separate computer, or set of computers, that contains cache files, and when someone enters a request for a certain word or phrase, the first stop may be a look in those cache files rather than a lookup in the search engine's index.

If a set of results for the phrase or word searched for is there, the search engine could serve it to you. If it isn’t, the search engine might then perform a full search. That could save some processing power, and some time.
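That cache-first flow can be sketched in a few lines. This is a minimal illustration, not how any particular search engine implements it; all names here (`results_cache`, `index.lookup`) are hypothetical.

```python
# Sketch: serve search results from a cache when possible, and only fall
# back to the expensive full index lookup on a cache miss.
# All names and structures here are hypothetical illustrations.

results_cache = {}  # maps a normalized query string to a cached results page

def normalize(query):
    """Collapse trivial differences so equivalent queries share a cache entry."""
    return " ".join(query.lower().split())

def search(query, index):
    key = normalize(query)
    if key in results_cache:      # cache hit: no index lookup needed
        return results_cache[key]
    results = index.lookup(key)   # cache miss: do the full search
    results_cache[key] = results  # store for the next person who asks
    return results
```

The saving comes from the second and later requests for the same popular query: they never touch the index at all.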

Now keep in mind that the most popular searches can account for a large percentage of the requests that a search engine gets. If you look somewhere like the Google Zeitgeist pages, you can see some of the terms that the search engine might be keeping in a cache.

Deciding how many results to cache

This is a tricky area. The efficiency gained by caching some results pages for popular searches could be lost by not caching enough results, or by caching too many. So how do you figure out how many results to cache? Well, you could look at how people use search engines. If people typically only look at the first page of results – ten links and titles and snippets – then you might only want to keep track of those ten results for a given query.
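A rough sketch of that idea: keep only the first page of results per query, and bound the total number of cached queries by evicting the least recently used one. The page size and cache limit below are illustrative numbers, not figures from any real search engine.

```python
from collections import OrderedDict

# Sketch: cache only the first results page (what most users actually see)
# per query, with least-recently-used eviction to cap total cache size.
# PAGE_SIZE and MAX_CACHED_QUERIES are illustrative, not real-world values.

PAGE_SIZE = 10
MAX_CACHED_QUERIES = 100_000

cache = OrderedDict()

def cache_results(query, full_results):
    cache[query] = full_results[:PAGE_SIZE]  # keep only the first page
    cache.move_to_end(query)
    if len(cache) > MAX_CACHED_QUERIES:
        cache.popitem(last=False)            # evict the least recently used query

def cached_page(query):
    page = cache.get(query)
    if page is not None:
        cache.move_to_end(query)             # mark this query as recently used
    return page
```

Caching only ten results per query trades away the occasional second-page request in exchange for holding far more distinct queries in the same amount of memory.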

A paper that looks like it was written in 2004 describes how to make a search engine run more efficiently. One of the authors' pages notes that it will be published in the ACM Transactions on Information Systems, Vol. 24, No. 1, January 2006.

It's impossible to tell, without seeing that issue, whether the published version will be the same one that appears here:

4 thoughts on “Making Search More Efficient: A little about caching and prefetching”

  1. Bill, every major search engine to my knowledge already caches results to some degree and long have done so. My understanding is that they tend not to hit disk unless they have a good reason to do so. Those same popular queries that happen over and over, my understanding is they are pulled out of memory and they’ve got ways to ensure that’s refreshed when needed.

  2. Hi Danny,

    Thanks for your comment. This post was pretty much meant as a simple restatement of what search engines do when responding to queries, but it might not be so obvious to some folks who use search engines.

    When I wrote it, I didn’t think I was going into any groundbreaking new territory. Rather, I considered it as possibly laying the foundation for some future posts involving some of the mechanics of what we see when we see a search engine in action.

    With this type of caching and prefetching, where things start to get a little interesting is with the replacement policies that determine when results within a cache or caches are refreshed, and when new documents replace old ones in results pages.

    I’ve written a couple of recent posts about patent applications describing predictive queries, and how they are used in places like Google Suggest, and possibly on mobile devices to assist in the entry of data on smaller or constrained keyboards, or by the use of a stylus.

    I remember when I first looked at Google Suggest, I was impressed with the Ajax interface that updated the suggestions shown, without giving much thought to where they were getting those results. The patent application that described how that might work also gave us some insight into where those results were coming from. Caching and prefetching of results was covered pretty well in that document.

    I’ve trained a few people who had extensive web design backgrounds but not much experience with how search engines work, and I’ve seen them slap themselves on the forehead when I mentioned something like this, saying it’s so obvious that they should have realized it was going on.

  3. Hi folks,

    I’m one of the authors of the paper…

    I would like to just point out some items that, I hope, should clarify the picture.

    Our SDC policy is actually a policy aimed at deciding whether an entry in the cache should be evicted or not. The key point of SDC is that it uses statistics on usage information to fill a static section of the cache with the pages most likely to be requested in the future. Unlike previous policies, this section doesn’t change, allowing us to keep up with queries that are frequently submitted but not submitted at high rates… The dynamic section, which is a classic cache, is instead concerned with queries that are submitted at a high rate.

    About the refresh rate of the static set… In the paper that will appear in the journal (which is a little different from the tech report linked in the article), we also measured how often the static set should be refreshed… We have nice results about that… People curious about it may e-mail me and I’ll send a copy of the camera-ready paper.

  4. Thank you, Dr. Silvestri.

    The static section of the cache that the paper describes did sound like something unique to me. I appreciate your taking the time to point that out here. I’ll likely be in touch soon to hear more about your results.

Comments are closed.