Missing Content Can Lead to Bad Search Results
There are times when you perform a search in a search engine, and the results just aren’t very relevant.
When you don’t get the results you expect from an internet or intranet search engine, is it because the search engine isn’t very good, or because there isn’t much indexable information on the web or in the intranet document repository related to that search?
A new patent application discusses how the people who run search engines might identify difficult queries on topics for which the engine has collected little content. The process in the patent filing gives search engines a chance to suggest alternative queries that may better answer what searchers are looking for, or to focus their indexing efforts on filling those gaps.
The best introduction to the patent filing is probably a couple of pages from IBM which discuss the efforts of the researchers who came up with this process:
- Estimating the difficulty of queries submitted to search engines
- Machine Learning for Information Retrieval
The missing content patent application:
Detection of missing content in a searchable repository
Invented by Andrei Z. Broder, David Carmel, Adam Darlow, Shai Fine, Elad Yom-Tov
Assigned to IBM
US Patent Application 20070016545
Published January 18, 2007
Filed July 14, 2005
Abstract
A method and system for the detection of missing content in a searchable repository are provided. A system includes: a missing content query identifier (401) for identifying queries to a search engine (102) for which no or little relevant content is returned; a missing content detector (110) which clusters missing content queries by topic; and an output provider for providing details of a missing content topic.
While the focus of this patent application is on enterprise search and IBM’s efforts to provide a robust search feature, it offers some insight into how and why a search engine might be fine-tuned to provide more relevant results to searchers. Testing search quality means developing ways to measure the content and coverage of the searchable information within a search engine.
The process for detecting missing content involves:
- Identifying queries to a search engine for which no or little relevant content is returned,
- Clustering missing content queries by topic, and
- Providing details of a missing content topic.
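To make those three steps a little more concrete, here is a minimal sketch of how such a pipeline might look, assuming a toy query log and an off-the-shelf approach (TF-IDF plus k-means from scikit-learn) in place of whatever identification and clustering methods the patent filing actually describes:

```python
# A minimal sketch of the three-step process described in the patent filing,
# using a hypothetical query log and ordinary TF-IDF + k-means clustering.
# The thresholds, field names, and clustering choice are illustrative assumptions,
# not details taken from the patent application itself.
from collections import defaultdict

from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical query log: (query text, number of results judged relevant).
query_log = [
    ("vpn setup on linux", 0),
    ("vpn client configuration", 1),
    ("parental leave policy", 0),
    ("maternity leave forms", 0),
    ("cafeteria menu", 12),
]

# Step 1: identify queries for which no or little relevant content is returned.
missing_content_queries = [q for q, hits in query_log if hits <= 1]

# Step 2: cluster those queries by topic.
vectorizer = TfidfVectorizer()
vectors = vectorizer.fit_transform(missing_content_queries)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(vectors)

# Step 3: provide details of each missing-content topic to administrators.
topics = defaultdict(list)
for query, label in zip(missing_content_queries, labels):
    topics[label].append(query)
for label, queries in topics.items():
    print(f"Missing-content topic {label}: {queries}")
```

In practice the relevance counts would come from judgments or from the behavioural signals discussed in the paragraphs that follow, rather than a hand-labeled number per query.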
Identifying missing content in response to queries could be done by looking at explicit feedback from searchers. People are more likely to report that they couldn’t find something in an enterprise setting, but user feedback about search results can be used by web-based search engines, too.
More implicit indicators of irrelevant searches can be found by looking at how people respond to searches. Do people click through the results of searches? Do they scroll down those pages and spend time on them? If in response to certain queries, people rarely click on the results they are shown or don’t spend much time on those pages, there may be an issue with those results.
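As a rough illustration of those implicit signals, the sketch below aggregates a hypothetical impression log into a per-query click-through rate and dwell time, and flags queries that fall under some arbitrary thresholds. The log format and threshold values are my own assumptions, not anything specified in the patent filing:

```python
# Flag queries whose results are rarely clicked or quickly abandoned,
# based on a hypothetical impression log.
from dataclasses import dataclass

@dataclass
class Impression:
    query: str
    clicked: bool
    dwell_seconds: float  # time spent on the clicked result, 0 if no click

log = [
    Impression("vpn setup on linux", clicked=False, dwell_seconds=0),
    Impression("vpn setup on linux", clicked=True, dwell_seconds=4),
    Impression("cafeteria menu", clicked=True, dwell_seconds=95),
]

def low_satisfaction_queries(log, min_ctr=0.3, min_dwell=10.0):
    """Return queries with a low click-through rate or short average dwell time."""
    by_query = {}
    for imp in log:
        stats = by_query.setdefault(imp.query, {"shown": 0, "clicks": 0, "dwell": 0.0})
        stats["shown"] += 1
        if imp.clicked:
            stats["clicks"] += 1
            stats["dwell"] += imp.dwell_seconds
    flagged = []
    for query, s in by_query.items():
        ctr = s["clicks"] / s["shown"]
        avg_dwell = s["dwell"] / s["clicks"] if s["clicks"] else 0.0
        if ctr < min_ctr or avg_dwell < min_dwell:
            flagged.append(query)
    return flagged

print(low_satisfaction_queries(log))  # e.g. ['vpn setup on linux']
```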
The third approach relies upon machine learning, focusing on indications of low searcher satisfaction with the results returned for a query.
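A hedged sketch of that idea: train a simple classifier, here a logistic regression, to predict searcher dissatisfaction from per-query signals such as click-through rate and dwell time. The features, labels, and choice of model are assumptions for illustration; the patent application does not commit to this particular setup.

```python
# Predict searcher dissatisfaction from per-query behavioural signals.
# Labels might come from explicit feedback; everything here is illustrative.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Features per query: [click-through rate, average dwell time in seconds].
X = np.array([
    [0.05,  2.0],   # results rarely clicked, quickly abandoned
    [0.10,  5.0],
    [0.60, 45.0],   # results clicked and read
    [0.75, 90.0],
])
# Labels: 1 = searchers reported (or behaved as if) the results were unsatisfying.
y = np.array([1, 1, 0, 0])

model = LogisticRegression().fit(X, y)

# Score a new query's aggregated behaviour; a prediction of 1 marks it as a
# candidate missing-content query for the clustering step above.
new_query_signals = np.array([[0.08, 3.0]])
print(model.predict(new_query_signals))
```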
Missing Content Conclusion
The writers of this patent application note a few benefits from this method for enterprise search:
1. Query suggestions can be offered to searchers to help them find what they are looking for.
2. Intranet administrators may be able to identify information that may not be presented in a search engine friendly way.
3. Document creators may be able to identify topics that the intranet should cover in more depth, and add that content.
That last benefit is something that creators of web pages should pay attention to as well. If the information in a certain field or market tends to be hidden behind user logins or appears on pages that aren’t very search engine friendly, the search results for queries about that information may not be very competitive.
There are areas where a Web search engine may be returning results that aren’t very relevant. The fault may not lie with the search engine, but rather with content that simply isn’t out on the web in a search engine friendly form.
I would think it would also give them insight into what content could be produced to satisfy that demand. But they are probably just too busy to bother.
In the enterprise arena, Stever, I would think that it might provide that kind of insight, and that people might act on it.
On the web, it might be an indication to search engineers that they need to do some kind of focused crawling to see if they can’t get more documents indexed which cover those areas with missing content.
Search 2.0 might just be taking some indexing ideas developed within the constraints and confines of enterprise search, and adding them to the indexing processes of web search.
For example, there have been a lot of OneBox searches developed expressly for organizations using Google’s Enterprise search. I might be more inclined to use Google’s Personalized search if I could include a list of sites in a custom search engine that might show up in OneBox results.
Sounds like Search 2.0
Good Evening Bill,
This is both to test my IP again on your blog, and also to say that this is a topic of great interest. Your concluding points about how web page developers might make use of these holes in available content was where my mind was going as I was reading along. Excellent points and a very interesting patent application!
Miriam
Hi Miriam,
Really good to see that your post came through.
As much information as we may think is out on the web, there are gaps in what search engines have indexed, or in what is available to index. In areas where that information tends to be in books, or behind database logins instead of on search engine friendly pages, it’s less likely that we will see relevant results for search queries related to those gaps. Kind of interesting to peer at this from the perspective of those working on search engines.
I type a request into a search engine today and it gives me a result, but when I check again tomorrow it’s no longer there. How could you get that back?
Hi Gylances,
There are a large number of reasons why a page that appears in search results today might not be there tomorrow, or may rank lower than it had previously.
Here are a few of them:
1. The page that was listed may have changed in some way:
a) Its owner may have blocked it from search results using robots.txt or a meta noindex tag (sometimes this even happens accidentally; a quick example appears after this list)
b) The page may have been removed from the web site
c) The owner of the page may have changed the way that pages on the site link to each other, and it may not rank as highly as before
d) The content on the page may have changed, and it may no longer be relevant for the query term that you used
e) The anchor text in links to the page, from within the site or outside of the site, might have changed, and the text in those links was what helped make the page relevant for the search query previously.
f) The page may have lost some links pointing to it (from within and/or from outside of the site), and it may not have as high a ranking score.
g) The page may have been receiving some kind of boost in search rankings based upon any number of factors, such as freshness, number of links to the page from other pages in the top results for the query term, or others, and it is no longer being boosted for that query.
h) The site might have been doing something possibly against a search engine’s guidelines, and the page may have been penalized in search results.
2. Other pages may have changed in some manner:
a) Other pages and sites may have been optimized for the query and risen in the search results above the page.
b) New pages may have been added to the Web that are relevant for the search query
c) New pages may have been boosted in the search results above that page, for one reason or another, such as freshness.
3. The search engine may have changed the way that it ranks pages or may have applied a filter
a) Search engines serve results from different data centers, which may be running slightly different ranking algorithms, and a search engine will sometimes show you results from a different data center when the one you would normally see results from is busy, so you might see different results depending upon the data center you’re viewing.
b) Search engines regularly update and change the algorithms that they use – sometimes in small subtle ways, and sometimes in ways that can change lots of rankings.
c) Search engines sometimes make changes intended to filter out web spam, and this can sometimes affect a small percentage of pages that aren’t web spam.
d) Search engines may sometimes filter out pages that seem very similar to other pages, on the same site or on other sites, and in some instances they filter out the original page and keep a duplicate.
e) The search engine may have had a hardware/software problem that caused a page or pages to be accidentally removed from its database.
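Going back to reason 1a above, blocking via robots.txt looks roughly like the snippet below; the path is only an example, and a single misplaced rule such as Disallow: / can remove far more pages than the owner intended:

```
# robots.txt at the root of the site: this rule asks all crawlers
# to stay out of everything under /private/
User-agent: *
Disallow: /private/
```

The other mechanism is a <meta name="robots" content="noindex"> tag in a page’s head section, which asks search engines to keep that one page out of their index.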
Google has a page that lists a number of steps a webmaster can take when their pages lose rankings or have been removed from Google’s index. It’s worth a look if something like that might have happened to you:
My site isn’t doing well in search