When we talk about the results that show up in search engines, we often do so in terms related to relevance and importance of those results.
Sometimes the results we see, and that we don’t see, are influenced by other factors, such as steps taken by the search engines to reduce the amount of work that they have to perform in order to return results to searchers.
Using Two Tiers of Search Results
If a search potentially returns thousands of results, and people only look at the first few pages of those results, it would make sense for a search engine to serve results in batches, and perhaps initially use only a modified (and much smaller) version of its database to answer search queries.
A first index tier may have a number of potential results pruned, so that only the documents most likely to be returned as top answers to searches are kept. The first batch of results returned to searchers may be taken from this pruned index.
While this approach allows a search engine to return results quickly, the first pages of results may miss some documents that should have been included, because those documents weren’t in the top tier of the index. They end up appearing behind the pages that are returned first.
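To make the trade-off concrete, here is a minimal sketch of a two-tier setup. Everything in it is an assumption made for illustration: the toy corpus, the integer scores, the "keep the best few postings per term" pruning rule, and the function names `prune` and `top_k`. It is not how any particular search engine, or the paper discussed below, actually prunes an index.

```python
from collections import defaultdict

# Hypothetical toy corpus: term -> {doc_id: per-term relevance score}.
FULL_INDEX = {
    "search": {"d1": 9, "d2": 8, "d3": 6, "d4": 1},
    "engine": {"d4": 9, "d5": 7, "d3": 6, "d1": 1},
}


def prune(full_index, keep):
    """Build a first tier that keeps only the `keep` best postings per term."""
    pruned = {}
    for term, postings in full_index.items():
        ranked = sorted(postings.items(), key=lambda p: p[1], reverse=True)
        pruned[term] = dict(ranked[:keep])
    return pruned


def top_k(index, terms, k):
    """Rank documents by the sum of their per-term scores found in `index`."""
    totals = defaultdict(int)
    for term in terms:
        for doc, score in index.get(term, {}).items():
            totals[doc] += score
    # Highest total first; ties are broken by document id.
    ranked = sorted(totals.items(), key=lambda p: (-p[1], p[0]))
    return [doc for doc, _ in ranked[:k]]


PRUNED_INDEX = prune(FULL_INDEX, keep=2)
query = ["search", "engine"]

print(top_k(FULL_INDEX, query, 2))    # ['d3', 'd1']
print(top_k(PRUNED_INDEX, query, 2))  # ['d1', 'd4']
```

In this toy data, document d3 is only moderately strong for each word on its own, so it is pruned from both posting lists, yet it has the best combined score for the two-word query. The pruned tier answers quickly but never sees it.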
A new paper from Alexandros Ntoulas of Microsoft and Junghoo Cho of UCLA, Pruning Policies for Two-Tiered Inverted Index with Correctness Guarantee, looks at avoiding “any degradation of result quality due to the pruning-based performance optimization, while still realizing most of its benefit.”
Adding a Correctness Guarantee
The paper provides suggestions for search engines on how they could use a “correctness guarantee” to make sure that top results are included in the pruned index:
How can we avoid the potential degradation of search quality under the two-tier architecture? Our basic idea is straightforward: We use the top-k result from the p-index only if we know for sure that the result is the same as the top-k result from the full index.
The problem with that approach is that calculating the top results of both the pruned index and the full index is more work than just calculating the top results of the full index. Of course, that correctness guarantee doesn’t need to be run every time someone searches for a particular query, and that’s where there are potential savings in computational resources.
The paper delves into how often a correctness guarantee should be run for different queries, and policies for pruning certain keywords and documents.
It’s a nice discussion of how a search engine’s inverted index may be managed and optimized. It also covers the assumptions that the authors make concerning how modern commercial search engines rank documents.
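As a rough illustration of what "knowing for sure" could look like, here is a minimal sketch of one possible check. It is not the algorithm from the paper. It assumes a simple additive, non-negative scoring model, and it assumes the pruned index stores, for each term, the best score that was pruned away (the hypothetical `bounds` table). With that bound, the first tier can sometimes prove its own answer is correct without touching the full index, and fall back to the full index when it cannot. All names and data below are made up for the example.

```python
from collections import defaultdict

# Hypothetical toy corpus: term -> {doc_id: per-term relevance score}.
FULL_INDEX = {
    "search": {"d1": 9, "d2": 8, "d3": 6, "d4": 1},
    "engine": {"d4": 9, "d5": 7, "d3": 6, "d1": 1},
}


def prune_with_bounds(full_index, keep):
    """Keep the best `keep` postings per term and remember the best pruned score."""
    pruned, bounds = {}, {}
    for term, postings in full_index.items():
        ranked = sorted(postings.items(), key=lambda p: p[1], reverse=True)
        pruned[term] = dict(ranked[:keep])
        # No missing posting for this term can score higher than this.
        bounds[term] = ranked[keep][1] if len(ranked) > keep else 0
    return pruned, bounds


def full_top_k(index, terms, k):
    """Second tier: rank by summed per-term scores over the full index."""
    totals = defaultdict(int)
    for term in terms:
        for doc, score in index.get(term, {}).items():
            totals[doc] += score
    return [d for d, _ in sorted(totals.items(), key=lambda p: (-p[1], p[0]))[:k]]


def guaranteed_top_k(pruned, bounds, terms, k):
    """Return the first-tier top-k only if it provably matches the full index."""
    seen = set()
    for term in terms:
        seen.update(pruned.get(term, {}))
    lower, upper = {}, {}
    for doc in seen:
        # Lower bound: only the scores we can see. Upper bound: assume every
        # missing posting is as good as the best one pruned for that term.
        lower[doc] = sum(pruned.get(t, {}).get(doc, 0) for t in terms)
        upper[doc] = sum(pruned.get(t, {}).get(doc, bounds.get(t, 0)) for t in terms)
    top = sorted(seen, key=lambda d: (-lower[d], d))[:k]
    if not top or len(top) < k or any(lower[d] != upper[d] for d in top):
        return None  # some top score is not known exactly
    rivals = [upper[d] for d in seen if d not in top]
    rivals.append(sum(bounds.get(t, 0) for t in terms))  # docs in no pruned list
    return top if lower[top[-1]] > max(rivals) else None


PRUNED, BOUNDS = prune_with_bounds(FULL_INDEX, keep=2)


def search(terms, k):
    """Answer from the first tier when the guarantee holds, else use the full index."""
    answer = guaranteed_top_k(PRUNED, BOUNDS, terms, k)
    if answer is not None:
        return answer, "first tier"
    return full_top_k(FULL_INDEX, terms, k), "full index"


print(search(["search"], 2))            # (['d1', 'd2'], 'first tier')
print(search(["search", "engine"], 2))  # (['d3', 'd1'], 'full index')
```

The check is deliberately conservative: it refuses whenever a pruned posting could possibly change the answer, so the one-word query is served from the small tier while the two-word query is sent to the full index. The cost is that some queries the first tier would have answered correctly still pay for the full lookup.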
3 thoughts on “Why Sometimes Best Search Results aren’t Always Top Search Results”
I have gone down the whole list of search results and have found myself hitting a dead end when I get to about a thousand results. So are you talking along the lines of about the first 200 results, or are you thinking of the first 3 pages?
A very good question. This paper really is about “search engine optimization” where the thing being optimized is a search engine itself.
It’s about efficient use of the resources available to the search engine, and the likelihood that people will look at search results past the first page.
There is a statement in the paper that surprised me a little:
If most people (80%) performing searches are looking at the first three pages of results, it might make sense to try to come up with this first tier of database results for the first 30-60 results, where there are 30-60 results that are relevant for specific queries. It might also make sense to only do something like this for the most popular of results, and only include them within such a first tier if people are actually searching for them.
So, whether there’s a first tier of results may depend upon how often certain queries are searched for. It might depend upon whether there are even 30-60 results for some queries (sometimes there aren’t). The number of results that might be included may depend upon the query itself. If more results are needed, it is always possible to dip into the second tier to retrieve those.