Query classification can be challenging for a search engine when it has only the query itself to go on.
For example, how would you classify the query “lincoln”? It could refer to:
President Abraham Lincoln
The location, Lincoln, Nebraska
The Lincoln brand of car
Query Classification with User Behavior Data
Search engines usually gather information about pages on the Web with a web crawler, which visits many pages and follows the hyperlinks on those pages to find more.
The web crawler grabs the content of those pages, and it is analyzed to index the pages, looking at words from page titles, page headings, page contents, alt text and captions and file names from images, and other content. It stores that information in an index database for use with queries performed on the search engine.
When a query is performed, the index is used to find a list of web pages that best match the query. Each page returned is presented on a search result page with a page title linking to the page, a snippet that is a short description of the content of the page, and a URL for the page that sometimes shows the page’s place in the hierarchy of content on its site.
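The index-then-lookup flow described above can be sketched as a tiny inverted index. This is only an illustration: the function names `build_index` and `search` and the sample pages are made up for the example, and a real search engine index is far more involved.

```python
from collections import defaultdict

def build_index(pages):
    """Map each word to the set of URLs whose text contains it.

    `pages` is a dict of url -> page text (hypothetical sample data).
    """
    index = defaultdict(set)
    for url, text in pages.items():
        for word in text.lower().split():
            index[word].add(url)
    return index

def search(index, query):
    """Return the URLs that contain every word in the query."""
    words = query.lower().split()
    if not words:
        return set()
    results = set(index.get(words[0], set()))
    for word in words[1:]:
        results &= index.get(word, set())
    return results

pages = {
    "example.com/president": "Abraham Lincoln president",
    "example.com/city": "Lincoln Nebraska city",
}
index = build_index(pages)
print(search(index, "lincoln nebraska"))
```

A one-word query such as "lincoln" matches both pages here, which is exactly the ambiguity that query classification tries to resolve.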
These pages are ranked so that those with the best mix of popularity, relevance, and authority appear highest.
This is a query answering result rather than a question answering result. The search engine tries to identify the pages that might help best satisfy a situational or informational need of a searcher rather than providing a fact-based answer to a particular question.
There are some reasons to try to classify a query.
One is that a query may have several meanings, so simply returning the page that is most relevant, most authoritative, or most popular for one of those meanings might not be very satisfying to searchers. If someone searches for Java, hoping to find out more about the programming language rather than the drink or the island, it isn’t a very good search result if the drink or the island, or both, show up before the programming results. How does a search engine increase the quality of its results?
One way is to pay attention to how its users react to the results of a search.
Determining a query classification starts by identifying a number of search entities associated with a query. It then moves on to collect data about how satisfied searchers are with the different search entities.
So our search entities for the [Lincoln] search are Abraham Lincoln, Lincoln Nebraska, and Lincoln cars. There may be other results that are good ones for a search of [Lincoln], but they would need to meet a certain threshold of association with the query. They also need to be internally consistent. So, if there are 20 different manufacturing companies that have made Lincoln cars over the years, there’s less of a probability that people mean Lincoln cars when they search for [Lincoln].
This approach to query classification can mean that if most people searching for [Lincoln] choose Abraham Lincoln-related pages, and tend to stay on those pages longer than on other pages, Google will tend to show pages about Abraham Lincoln at the top of search results when people search for [Lincoln]. Google has to decide which comes first among the best results for pages about the President, the City, and the Lincoln brand of car. Letting your users decide which to show first is probably a good idea.
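A minimal sketch of letting users decide, assuming a click log that already records which entity each clicked page is about. The log format, the sample numbers, and the `rank_entities` function are all hypothetical, and ordering by click count and then mean dwell time is one simple choice, not the method from the patent.

```python
from collections import defaultdict

# Hypothetical click log: (query, entity the clicked page is about, dwell seconds)
clicks = [
    ("lincoln", "Abraham Lincoln", 120),
    ("lincoln", "Abraham Lincoln", 95),
    ("lincoln", "Lincoln, Nebraska", 40),
    ("lincoln", "Lincoln cars", 15),
    ("lincoln", "Abraham Lincoln", 8),
]

def rank_entities(clicks, query):
    """Order the entities for a query by click count, then by mean dwell time."""
    stats = defaultdict(lambda: [0, 0.0])  # entity -> [click count, total dwell]
    for q, entity, dwell in clicks:
        if q == query:
            stats[entity][0] += 1
            stats[entity][1] += dwell
    return sorted(
        stats,
        key=lambda e: (stats[e][0], stats[e][1] / stats[e][0]),
        reverse=True,
    )

print(rank_entities(clicks, "lincoln"))
```

With this made-up data, pages about the President come first because they were chosen most often, and the city outranks the car brand on dwell time.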
The patent is:
Propagating query classifications
Invented by Henele I. Adams and Hyung-Jin Kim
Assigned to Google
US Patent 8,838,587
Granted September 16, 2014
Filed: April 19, 2010
In general, one aspect described can be embodied in a method for determining a classification for a query. The method can include receiving a request to determine whether to assign a classification to a first query, identifying a plurality of search entities that are associated with the first query based upon data associated with each of the plurality of search entities and the first query, and determining whether to assign the classification to the first query based upon classifications for the identified search entities.
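The idea of propagating classifications from search entities to a query might look something like the sketch below. The `classify_query` function, the association weights, and the 0.6 share threshold are assumptions made for illustration; the patent text quoted above does not give a formula.

```python
def classify_query(entities, min_share=0.6):
    """Decide whether to assign a classification to a query.

    `entities` is a list of (classification, association_weight) pairs for
    the search entities associated with the query. The classification is
    returned only if it holds at least `min_share` of the total weight
    (an assumed rule); otherwise the query is left unclassified (None).
    """
    totals = {}
    for label, weight in entities:
        totals[label] = totals.get(label, 0.0) + weight
    total = sum(totals.values())
    if total == 0:
        return None
    label, weight = max(totals.items(), key=lambda kv: kv[1])
    return label if weight / total >= min_share else None

# Hypothetical entities associated with one query, with made-up weights
print(classify_query([("history", 5), ("history", 3), ("automotive", 2)]))
```

Returning `None` when no classification dominates reflects the "determining whether to assign" wording: a query whose entities disagree may simply go unclassified.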
Using ‘Quality of Result Statistics’
Like any website owner, the people at Google want to have an idea of how well they are doing, and to improve when they can. One of the best measures might be whether they are helping people find what they are looking for. There are a number of ways to try to measure how satisfied searchers seem to be with the search results they see. According to the patent, they can include the following:
Long Clicks and Short Clicks
A quality of result statistic for a document can be derived from user behavior data associated with the document, such as “click data.”
How long does someone view, or “dwell” on, a query result document after selecting it from a list of results for a query?
A longer time spent dwelling on a document, termed a “long click”, can indicate that a user found the document to be relevant for their query. See the section on long clicks in: Does Google Use Reachability Scores in Ranking Resources?
A brief period spent viewing a document, termed a “short click”, may be interpreted as a sign that the document was not very relevant.
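Here is a small sketch of how dwell times could be bucketed into long and short clicks. The 60-second and 10-second cutoffs and the function names are assumptions for the example; the patent excerpt above does not state specific thresholds.

```python
LONG_CLICK_SECONDS = 60   # assumed cutoff, not taken from the patent
SHORT_CLICK_SECONDS = 10  # assumed cutoff, not taken from the patent

def click_type(dwell_seconds):
    """Label a single click by how long the searcher stayed on the page."""
    if dwell_seconds >= LONG_CLICK_SECONDS:
        return "long"
    if dwell_seconds <= SHORT_CLICK_SECONDS:
        return "short"
    return "medium"

def long_click_ratio(dwell_times):
    """Share of clicks on a document that were long clicks."""
    if not dwell_times:
        return 0.0
    long_clicks = sum(1 for d in dwell_times if click_type(d) == "long")
    return long_clicks / len(dwell_times)

print(long_click_ratio([120, 5, 90, 30]))
```

A ratio like this is one possible quality of result statistic for a document and query pair.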
Tracking Eye Movements
Another type of user behavior data is based on tracking eye movements of users as they view search results. You can see which results appeared to be most interesting, and which people spent the most time reading.
Purchase Decision Data
Another example of user behavior data is purchase decision data. Such user behavior data can be based on:
- Products searched for by consumers
- Products viewed by consumers
- Details regarding the viewing of products
- Products purchased by consumers
Query Record Information
The patent introduces the kind of record-keeping that we see in the Knowledge Web, with data kept in a form that could be pulled into many different ways of tracking and using it:
Each record (herein referred to as a tuple: <document, query, data>) comprises a query submitted by users, a document reference indicating the document selected by users in response to the query, and an aggregation of click data for all users or a subset of all users that selected the document reference in response to the query.
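The tuple described in that passage can be sketched as a record keyed by document and query, with click data aggregated across users. The `ClickData` fields and the `build_records` function are hypothetical; the patent only says that click data is aggregated, not which fields are stored.

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass
class ClickData:
    """Aggregated click data for one (document, query) pair (assumed fields)."""
    clicks: int = 0
    total_dwell_seconds: float = 0.0

    @property
    def mean_dwell_seconds(self):
        return self.total_dwell_seconds / self.clicks if self.clicks else 0.0

def build_records(log):
    """Aggregate raw (document, query, dwell_seconds) events into tuples."""
    records = defaultdict(ClickData)
    for document, query, dwell in log:
        data = records[(document, query)]
        data.clicks += 1
        data.total_dwell_seconds += dwell
    return dict(records)

# Made-up events for illustration
log = [("d1", "lincoln", 100), ("d1", "lincoln", 50), ("d2", "lincoln", 10)]
records = build_records(log)
print(records[("d1", "lincoln")].mean_dwell_seconds)
```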
Keep in mind that Google is building their own knowledge graph, or knowledge base, about entities that people might search for, and how those are treated by searchers. This information can be used to help the search engine predict which other pages a searcher might be most interested in based upon probabilities when performing another query that might be related in some way.
Extensions to This Record-Based Approach
The patent also tells us that extensions to this tuple-based approach to user behavior data are possible, including keeping track of other kinds of data:
1. Country-Specific or Language-Specific identifiers
Geographic and linguistic information associated with a query classification can be used in building a probability model for future queries. These would include a country-specific tuple showing which country the query came from, and a language-specific tuple showing the language of the user’s query.
2. Low, Medium, or High Favorable User Behavior Data
How frequently is a page with a certain query classification selected, and how long do people dwell there?
3. Document Classification Data
Each document can have more than one classifier associated with it, or can have information about associations with other sites, or metadata self-indicating a slightly different classification. A site selling ice skates can be classified as a commerce store, and as one associated with sports.
4. Greatest Amount of User Behavior Data Associated with the Query
While the process described in this patent attempts to understand how different sites might be treated differently by searchers, it is helpful to have an idea of how much actual information you have, or how many user interactions have been involved.
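The country-specific and language-specific extensions can be sketched by widening the record key. The event format and the `aggregate` function below are hypothetical, and the counters only track click volume, which also gives a rough handle on how much user behavior data sits behind each tuple.

```python
from collections import Counter

def aggregate(events):
    """Count clicks per general, country-specific, and language-specific tuple.

    `events` is an iterable of (document, query, country, language) tuples
    (made-up format for illustration).
    """
    general = Counter()
    by_country = Counter()
    by_language = Counter()
    for document, query, country, language in events:
        general[(document, query)] += 1
        by_country[(document, query, country)] += 1
        by_language[(document, query, language)] += 1
    return general, by_country, by_language

events = [
    ("d1", "pizza", "US", "en"),
    ("d1", "pizza", "US", "en"),
    ("d1", "pizza", "IT", "it"),
]
general, by_country, by_language = aggregate(events)
print(by_country[("d1", "pizza", "US")])
```

Keeping the general and the narrower counts side by side makes it possible to fall back on the general tuple when a country or language has too little data.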
This patent shows how useful query classification can be, especially since people tend to use only a few words in their searches, and those words may be answerable by a wide range of sites. This is one way for Google to try to understand the intent of sites, and they do that by treating queries that use the same words but different classifications as different search entities.
We see Google using a Semantic Web approach in the patent, using tuples to track searches that may use the same words but evidence different intents by searchers and those searchers’ reactions to the pages they see that show up in search results.
When someone searches for [pizza] at Google, they most likely want to order some Pizza for a meal, but some people might want to find how to make their own Pizza and want to see recipes. Some people might be curious about the history of Pizza: did it really originate in Italy, or in the United States? Who invented Pizza?
It is quite possible that Google has a data store full of information about different query classifications around a search for [pizza], the pages that people go to, and user behavior associated with each. That data store likely has similar information about a lot of other subjects, and could potentially be used to predict what people tend to search for after they’ve eaten their Pizza, or cooked some themselves, or finished reading about the origins of Pizza. (OK, now I’m hungry).
Classifying queries can help Google decide what to show a searcher for an individual search.
Collecting knowledge about what people tend to select after performing a search, and how they respond to it can help to determine how satisfied searchers are with specific searches and search results.
Query log data is often a source for meaningful substitutions for queries, as I wrote about in Google Search Synonyms Are Found in Queries.