You search for “Foo Fighters,” and the search engine takes your query and starts searching its databases to identify results. It might look through a video database, to see if there are any good videos to show you. It may dig through a News database to see if there was any recent news tied to the phrase, or an Image database to see if there were any popular pictures of the band. The search engine may see if any advertisers were running campaigns using the band’s name.
Some of that searching uses the exact phrase from your query, “Foo Fighters,” to find a set of results that might satisfy you. But there are steps a search engine could take that might give you even better results.
Associating Search Terms with Categories
The search engine might attempt to associate your search term with categories that can provide you with richer additional results, shortcuts to helpful alternative searches, and other information that might either broaden or focus the search results that you see. Using categories might help broaden results if there aren’t very many results for the terms you searched with, and it may focus results if there are too many results for the term you used.
Using categories in such a way might enable the search engine to offer related concepts to a search for “Foo Fighters,” such as “Dave Grohl,” or “tickets,” or “tour dates,” or “Nirvana.”
Putting your search phrase within a hierarchical set of categories might also help the search engine show more relevant or more diverse advertisements.
Why Use Categories?
When someone searches, the intent behind their search may be really difficult to determine, especially since most searches are fairly short, often only a couple of words, and carry very little context.
When a search engine associates a category label with a query, the context and the user intent might be a little easier to understand.
In my example with the Foo Fighters, a search engine could just try to find pages that use the phrase, rank them, and display them to a searcher. Or, it could place the query into a category, and look in associated hierarchical categories to find relationships with other related terms.
The top level of the hierarchy could be fairly general, such as entertainment, travel, or sports. It would be followed by lower levels with categories that get more specific: a second level containing the category “music,” a third level containing “genre,” a fourth level containing “band,” a fifth level containing “albums,” a sixth level containing “songs,” and so on.
Placing the query “Foo Fighters” into such a hierarchical set of categories and seeing what else is located within related categories enables both organic search and paid search to consider the names of band members, albums and songs from the band, bands that perform in the same genre, and much more. That kind of categorization can provide more context and more information that could help in determining the intent of someone searching for Foo Fighters.
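To make the idea of looking around in related categories a little more concrete, here is a minimal sketch in Python. The taxonomy, its labels, and the related_terms function are my own illustrative assumptions, not anything taken from the patent filing. Each term is filed under a path that runs from the broadest category to the most specific one, and related terms are the ones that share the start of that path.

```python
# Hypothetical taxonomy: each term maps to a path from the broadest
# category down to the most specific one. Labels are illustrative only.
TAXONOMY = {
    "Foo Fighters": ("entertainment", "music", "rock", "band"),
    "Nirvana": ("entertainment", "music", "rock", "band"),
    "Dave Grohl": ("entertainment", "music", "rock", "band", "member"),
    "Everlong": ("entertainment", "music", "rock", "band", "albums", "songs"),
    "Muddy Waters": ("entertainment", "music", "blues", "musician"),
}


def related_terms(query, broaden=0):
    """Return terms filed at or below the query's category.

    `broaden` climbs that many levels up the hierarchy first, which
    widens the set of related terms.
    """
    path = TAXONOMY[query]
    keep = max(0, len(path) - broaden)
    prefix = path[:keep]
    return sorted(
        term
        for term, other in TAXONOMY.items()
        if term != query and other[:keep] == prefix
    )


print(related_terms("Foo Fighters"))
print(related_terms("Foo Fighters", broaden=2))
```

With broaden left at 0, the search stays inside the band's own category and the more specific categories below it. Climbing two levels reaches the general “music” category, which is where a bluesman like Muddy Waters would start to show up.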
A patent application from Yahoo explores how a searcher’s query terms can be classified into categories, and how related query terms in associated higher and lower level categories can be found and used to return more meaningful search results, suggestions, recommendations, and advertising to searchers.
System for classifying a search query
Invented by Xiaofei He, and Pradhuman Dasharalhasinh Jhala
Assigned to Yahoo
US Patent Application 20080183685
Published July 31, 2008
Filed: January 26, 2007
How Does This Categorization Of Queries Work?
A rough overview might be:
Your query phrase is submitted to the search engine, and the query is reviewed to see if it has been classified before.
If it has, it is assigned the existing category label. If it hasn’t, a category is calculated for it.
To calculate that category, Web pages are returned in response to the query, and a predefined number of the top returned web pages are identified to “represent” the query.
A model about related terms and information that is meaningful to the original query is created by looking at how the original term is used on the web pages, and how other terms found on those pages are used.
Some of the other terms found on the web pages may be filtered out of the process for one reason or another. For instance:
a) Unwanted terms and/or symbols, and numbers, may be removed to decrease the “amount of noise.” These could be things like “articles, prepositions, conjunctions, etc., e.g. ‘the’, ‘a’, ‘with’, ‘of’, etc.”
b) Terms that are based upon the same root or stem might be removed because they could be considered duplicates of each other (sing, sang, sung, singing, etc.).
c) A term might be reduced to the simplest form in which it can be presented (a “standard” canonical form) by removing prefixes, suffixes, plural designations, and so on, so that there is only one version of the term used from the web pages to relate to the original query term.
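The three filtering steps above can be sketched roughly as follows. This is a simplified illustration under my own assumptions: the stop word list and the suffix rules are hypothetical and tiny, and a real system would likely use a fuller list and a proper stemmer. Plain suffix stripping also won't catch irregular forms such as “sang” or “sung.”

```python
import re

# Hypothetical stop list and suffix rules, kept short for illustration.
STOP_WORDS = {"the", "a", "an", "with", "of", "and", "or", "in", "on", "to"}
SUFFIXES = ("ing", "es", "s")


def canonical(term):
    """Strip one common suffix so related forms collapse to one token."""
    for suffix in SUFFIXES:
        # Leave at least three letters so short words are not mangled.
        if term.endswith(suffix) and len(term) - len(suffix) >= 3:
            return term[: -len(suffix)]
    return term


def clean_terms(text):
    """Drop symbols, numbers, and stop words, then keep one copy of
    each canonical form, in order of first appearance."""
    words = re.findall(r"[a-z]+", text.lower())
    kept = []
    for word in words:
        if word in STOP_WORDS:
            continue
        stem = canonical(word)
        if stem not in kept:
            kept.append(stem)
    return kept


print(clean_terms("The band sings 2 songs, singing with bands"))
# ['band', 'sing', 'song']
```

The number and the comma are dropped as noise, “the” and “with” are dropped as stop words, and “sings,” “singing,” and “bands” collapse into forms that are already on the list.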
The other terms found on the web pages may have categories assigned to them already. Based upon how frequently those other terms are used on those web pages, and other criteria about the terms, the original query term may be placed into a category, or assigned a new category.
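Here is a rough sketch of how that category assignment might look if it were based on term frequency alone. The patent application describes other criteria as well, so treat this as a simplification. The category table, the cache, the min_votes threshold, and the classify function are all hypothetical names that I've made up for the example.

```python
from collections import Counter

# Hypothetical table of terms that already carry a category label.
TERM_CATEGORIES = {
    "guitar": "music",
    "album": "music",
    "tour": "music",
    "stadium": "sports",
}

# Queries that have been classified before, so the work is done once.
KNOWN_QUERIES = {}


def classify(query, top_pages, min_votes=2):
    """Label a query using the categories of terms on its top pages.

    `top_pages` is a list of term lists, one per top-ranked page.
    Returns None when no category gathers at least `min_votes` votes.
    """
    if query in KNOWN_QUERIES:
        return KNOWN_QUERIES[query]

    votes = Counter()
    for page_terms in top_pages:
        for term in page_terms:
            category = TERM_CATEGORIES.get(term)
            if category is not None:
                votes[category] += 1

    if not votes:
        return None
    label, count = votes.most_common(1)[0]
    if count < min_votes:
        return None

    KNOWN_QUERIES[query] = label
    return label


pages = [
    ["guitar", "album", "tour"],
    ["tour", "stadium", "album"],
    ["guitar", "drummer"],
]
print(classify("foo fighters", pages))  # music
```

In this toy data, “music” collects six votes against one for “sports,” so the query is labeled “music” and the label is stored for the next time someone searches for the same phrase.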
How We Use Categories
We perform a similar kind of categorization on our own.
For example, if you think about the old blues musician Muddy Waters, you might create a number of categories for him:
He was a musician.
He played a genre of music known as the blues.
The kind of blues he played is a subgenre known as Chicago Blues.
He was a guitarist.
He was a blues innovator, playing the electric guitar.
He was a singer.
He was born in Mississippi.
He was a resident of Chicago.
He was a resident of Illinois.
He was a strong influencer of rock bands like the Rolling Stones.
If I wanted to perform searches for Muddy Waters, I could expand my queries by looking at related terms within those categories, and many more that I could create. In essence, the search engine is trying to do the same thing, by identifying other terms that appear on pages that rank well for the original query, and seeing what categories those other terms fit within, to assign a category for the original query. Then it explores those other categories, including the higher (or broader) and lower (or more specific) level ones, to find related terms and to expand search results, offer suggestions, and provide advertising.
The patent application goes into a great amount of detail on the process involved in assigning category labels to query terms, including many examples, and explores how user data can be incorporated in the process to check up on the semantic relationships between terms identified in this classification process.
For example, suppose people searching for “Muddy Waters” click on links to pages about the Chicago Blues, and people searching for another bluesman, Howlin’ Wolf, also click on the same links to pages about the Chicago Blues. That provides some verification that both individuals (and queries using their names) belong to that category, and that both are related.
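One simple way to picture that kind of check is to measure how much the clicked pages for two queries overlap. The sketch below uses Jaccard similarity, which is my own choice of measure and not something the patent application specifies. The click log, the URLs, and the click_overlap function are invented for illustration.

```python
# Hypothetical click log: each query maps to the set of pages that
# searchers clicked on. The URLs are made up.
CLICKS = {
    "muddy waters": {
        "example.com/chicago-blues",
        "example.com/electric-guitar",
        "example.com/muddy-waters",
    },
    "howlin wolf": {
        "example.com/chicago-blues",
        "example.com/electric-guitar",
        "example.com/howlin-wolf",
    },
    "foo fighters": {
        "example.com/foo-fighters",
        "example.com/tour-dates",
    },
}


def click_overlap(query_a, query_b):
    """Jaccard similarity of the pages clicked for two queries:
    shared pages divided by all pages clicked for either query."""
    a = CLICKS.get(query_a, set())
    b = CLICKS.get(query_b, set())
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


print(click_overlap("muddy waters", "howlin wolf"))   # 0.5
print(click_overlap("muddy waters", "foo fighters"))  # 0.0
```

A high overlap between two queries is some evidence that they belong in the same category, while an overlap of zero suggests that the classification process shouldn't treat them as closely related.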
Thinking about categories that could be created for queries can be helpful to searchers, to people who write content for web sites, and for search engines trying to deliver search results to people.
Regardless of whether you’re a searcher, a writer, or an indexer of web pages, classifying and categorizing topics and queries can be a helpful process to use in finding information, creating new information, or helping others to locate something. It makes sense for a search engine to explore how categories can help it provide results.
Do you do something similar with categories when you search on the Web, or when you create content for Web pages?