When visitors to search engines use abbreviations or expand abbreviations in queries, it’s possible that they might be missing out on some pages worth visiting.
For example, use Yahoo to search for [NASA Moon bombing] and compare the results to a search for [National Aeronautics and Space Administration moon bombing]. You’ll see some very different results.
Should those search results be more similar? NASA and National Aeronautics and Space Administration are the same organization. Then again, NASA is also an abbreviation for:
- North American Saxophone Alliance
- National Auto Sport Association
- National Association of Students of Architecture
There’s also a Nasa mountain in Sweden, which is home to the Nasa silver mine. There’s a Swedish band named NASA, a hip-hop artist, a DJ collective, and the Nasa people in Colombia.
How should a search engine handle abbreviations in queries? Should it expand those queries to include the longer expanded version when doing so is likely to improve search results? If a page would rank well for the phrase “National Aeronautics and Space Administration” but not very well for “NASA,” should it be displayed to searchers who use “NASA” in their query?
That’s the question asked in a patent application published by Yahoo last week.
Abbreviation Handling in Web Search
Invented by Xing Wei, Fuchun Peng, and Benoit Dumoulin
Assigned to Yahoo
US Patent Application 20090259629
Published October 15, 2009
Filed April 15, 2008
A method for handling abbreviations in web queries includes:
building a dictionary of a plurality of possible word expansions for a plurality of potential abbreviations related to query terms received or anticipated to be received by a search engine;
accepting a query including an abbreviation;
expanding the abbreviation into one of the plurality of word expansions if a probability that the expansion is correct is above a threshold value, wherein the probability is determined by taking into consideration a context of the abbreviation within the query, wherein the context including at least anchor text;
and sending the query with the expanded abbreviation to the search engine to generate a search results page related to the query.
One of the first steps in associating abbreviations with the words they stand for is to look at anchor text pointing to the same pages. If many pages are linked to with “NASA” in the anchor text, do we also see “National Aeronautics and Space Administration” in anchor text pointing to many of those same pages? If so, the following might be added to an abbreviation dictionary:
- NASA = National Aeronautics and Space Administration
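Here is a rough sketch of how that anchor text comparison could work. It is only an illustration, not the method from the filing: the URLs, the stopword list, the `min_shared_pages` cutoff, and the rule that an expansion's initials must spell the abbreviation are all assumptions of mine.

```python
from collections import defaultdict

# Assumed list of small words that are skipped when forming initials.
STOPWORDS = {"and", "of", "the", "for"}


def initials(phrase):
    """First letters of the non-stopword words, lowercased."""
    words = [w for w in phrase.lower().split() if w not in STOPWORDS]
    return "".join(w[0] for w in words)


def build_dictionary(anchors, min_shared_pages=2):
    """anchors: iterable of (anchor_text, target_url) pairs."""
    pages_by_text = defaultdict(set)
    for text, url in anchors:
        pages_by_text[text.strip().lower()].add(url)

    dictionary = defaultdict(dict)
    for abbr, abbr_pages in pages_by_text.items():
        if " " in abbr:
            continue  # only single tokens are treated as candidate abbreviations
        for phrase, phrase_pages in pages_by_text.items():
            if " " not in phrase or initials(phrase) != abbr:
                continue
            shared = len(abbr_pages & phrase_pages)
            if shared >= min_shared_pages:
                dictionary[abbr][phrase] = shared
    return dict(dictionary)


# Made-up link data for illustration.
ANCHORS = [
    ("NASA", "https://example.org/space"),
    ("NASA", "https://example.org/space/moon"),
    ("National Aeronautics and Space Administration", "https://example.org/space"),
    ("National Aeronautics and Space Administration", "https://example.org/space/moon"),
    ("NASA", "https://example.org/sax"),
    ("North American Saxophone Alliance", "https://example.org/sax"),
]

ABBREVIATIONS = build_dictionary(ANCHORS)
print(ABBREVIATIONS)
```

With this toy data, only the space agency makes it into the dictionary, because the saxophone alliance shares just one linked page with the abbreviation.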
But how would the members of the North American Saxophone Alliance feel about that when they search for “NASA jazz competition” and the search results are filled with races to land something on the Moon?
This abbreviation dictionary may include information from other sources as well. Still, if there’s a high enough level of probability that the expanded version is being referred to by use of the abbreviation, then it’s possible that a query using an abbreviation might include results from the expanded word version as well.
Care has to be taken by a search engine when doing this type of query expansion. The patent gives us the following examples of queries where the abbreviation/word “aim” means different things:
- aim download – it’s likely that “aim” stands for “AOL instant messenger.”
- aim stock – aim is probably an abbreviation for “alternative investment market”
- aim at improvement – aim is probably being used as the word “aim” rather than being used as an abbreviation.
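To make the idea of context concrete, here is a small sketch that scores each possible reading of “aim” against the other words in the query and only expands when one reading clears a threshold. The counts, the smoothing, the 0.8 threshold, and the simple product score are assumptions of mine for illustration; the filing describes a statistical model but this is not it.

```python
# Hypothetical co-occurrence counts: how often each reading of "aim" was seen
# next to a given context word. The literal reading is keyed by the word itself.
CONTEXT_COUNTS = {
    "aim": {
        "aol instant messenger": {"download": 90, "stock": 2},
        "alternative investment market": {"download": 1, "stock": 60},
        "aim": {"at": 80, "improvement": 10},
    }
}


def best_reading(abbr, context, smoothing=1.0):
    """Return (reading, probability) for the most likely reading of `abbr`."""
    scores = {}
    for reading, counts in CONTEXT_COUNTS[abbr].items():
        score = 1.0
        for word in context:
            # Smoothed count, so unseen context words do not zero out a reading.
            score *= counts.get(word, 0) + smoothing
        scores[reading] = score
    total = sum(scores.values())
    reading = max(scores, key=scores.get)
    return reading, scores[reading] / total


def expand(query, threshold=0.8):
    """Replace a known abbreviation only when one expansion is likely enough."""
    tokens = query.lower().split()
    result = []
    for i, token in enumerate(tokens):
        if token in CONTEXT_COUNTS:
            context = tokens[:i] + tokens[i + 1:]
            reading, probability = best_reading(token, context)
            if reading != token and probability >= threshold:
                token = reading
        result.append(token)
    return " ".join(result)


print(expand("aim download"))        # aol instant messenger download
print(expand("aim stock"))           # alternative investment market stock
print(expand("aim at improvement"))  # aim at improvement
```

A query with no context at all, such as “aim” on its own, leaves every reading equally likely, so nothing clears the threshold and the query is left alone.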
There are at least three ways a search engine might learn about abbreviations:
- Query Sessions – If people searching for “aim download” don’t see a relevant result, they might re-write their query as “aol instant messenger download.” This type of user data in query session log files from the search engine can help build that dictionary, and can show how abbreviations might be used in different contexts.
- Anchor Text – If the same pages are linked to with different text, including abbreviations and expanded word versions of those abbreviations, a connection between the abbreviation and expansion can be noted, as well as the context of the use of the words – such as “aim download” and “AOL instant messenger download” pointing to the same page.
- Click Logs – People clicking on the same page when it appears in search results for different queries can mean that those queries may be related. If it happens more frequently, it’s more likely that they are.
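The query session idea is probably the easiest of the three to sketch. The code below looks at consecutive queries in a session and counts cases where one word was swapped for a longer phrase while the rest of the query stayed the same. The session data, the function names, and the `min_count` cutoff are made up for illustration, and a real system would need far more filtering than this.

```python
from collections import Counter


def find_rewrite(before, after):
    """If `after` is `before` with one token replaced by a longer phrase,
    return (token, phrase, context); otherwise return None."""
    a, b = before.lower().split(), after.lower().split()
    if len(b) <= len(a):
        return None
    # Length of the shared prefix.
    start = 0
    while start < len(a) and a[start] == b[start]:
        start += 1
    # Length of the shared suffix that does not overlap the prefix.
    end = 0
    while end < len(a) - start and a[-1 - end] == b[-1 - end]:
        end += 1
    removed = a[start:len(a) - end]
    added = b[start:len(b) - end]
    if len(removed) != 1 or len(added) < 2:
        return None
    context = tuple(a[:start] + a[len(a) - end:])
    return removed[0], " ".join(added), context


def mine_sessions(sessions, min_count=2):
    """Count rewrites across sessions and keep the ones seen often enough."""
    counts = Counter()
    for session in sessions:
        for before, after in zip(session, session[1:]):
            rewrite = find_rewrite(before, after)
            if rewrite:
                counts[rewrite] += 1
    return {rewrite: n for rewrite, n in counts.items() if n >= min_count}


# Made-up query sessions for illustration.
SESSIONS = [
    ["aim download", "aol instant messenger download"],
    ["aim download", "aol instant messenger download", "aol instant messenger for mac"],
    ["aim stock", "alternative investment market stock"],
    ["nasa", "nasa moon bombing"],
]

REWRITES = mine_sessions(SESSIONS)
print(REWRITES)
```

Requiring a rewrite to show up more than once is what keeps one-off changes, like “download” becoming “for mac,” out of the results. The same counting approach could be pointed at click logs by pairing queries that led to clicks on the same page.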
All three of these methods are based upon actual human involvement, whether it involves linking, choosing pages in search results, or refining queries during a search for information on a topic. All of the information is easily accessible to the search engine, and these resources can be used to build a statistical model that can tell the search engine when it might be a good idea to expand an abbreviation.
The patent filing contemplates handling different forms of abbreviations:
- Acronyms, which are pronounced as a word and contain the first letter of each word in a phrase, such as “SARS.”
- Initialisms, which are pronounced wholly or partly letter by letter, such as “IRS,” which is pronounced as “I,” “R,” “S.”
- Portmanteaus, which are words formed by combining two or more words, such as “Don’t.”
- Pseudo-blends, which are abbreviations that have extra or omitted letters, such as “UNIFEM” for “United Nations Development Fund for Women.”
When a search engine finds a word in a query that might be one of these types of abbreviations, it may do one of three things:
- Expand the query term to include pages that include the abbreviation, pages that include the expanded version of the abbreviation, and pages that include both.
- Offer a searcher a query suggestion for the expanded version of the abbreviation.
- Ignore the expanded version, and just return results for pages with the abbreviation.
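One way to picture those three choices is as bands of confidence. The sketch below is my own assumption about how that might be wired up: the filing's claim mentions a single threshold for expansion, and the two cutoffs, the `Action` names, and the OR-style rewritten query here are hypothetical.

```python
from enum import Enum


class Action(Enum):
    EXPAND = "expand"
    SUGGEST = "suggest"
    IGNORE = "ignore"


def choose_action(probability, expand_at=0.9, suggest_at=0.5):
    """Map the probability that an expansion is correct to one of three actions."""
    if probability >= expand_at:
        return Action.EXPAND
    if probability >= suggest_at:
        return Action.SUGGEST
    return Action.IGNORE


def rewrite(query, abbr, expansion, probability):
    """Return (action, query_to_run, suggestion_or_None)."""
    action = choose_action(probability)
    tokens = query.lower().split()
    if action is Action.EXPAND:
        # Match pages with the abbreviation, the expansion, or both.
        either = f'("{abbr}" OR "{expansion}")'
        expanded = " ".join(either if t == abbr else t for t in tokens)
        return action, expanded, None
    if action is Action.SUGGEST:
        # Run the query as typed, and offer the expanded version as a suggestion.
        suggestion = " ".join(expansion if t == abbr else t for t in tokens)
        return action, query, suggestion
    return action, query, None


FULL_NAME = "national aeronautics and space administration"
print(rewrite("NASA moon bombing", "nasa", FULL_NAME, 0.95))
print(rewrite("NASA moon bombing", "nasa", FULL_NAME, 0.60))
print(rewrite("NASA jazz competition", "nasa", FULL_NAME, 0.20))
```

In the last call the low probability leaves the saxophonists' query untouched, which is the kind of caution the “aim” examples call for.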
If you are searching for something, and you are using an abbreviation amongst your query terms, it isn’t a bad idea to try the same query with the abbreviation expanded, especially if you think there’s a chance that you might miss something. If you’re searching for information about a space agency, searching for NASA without searching for National Aeronautics and Space Administration might not be as bad as if you were searching for information about the North American Saxophone Alliance, and you only used “NASA” in your search instead of the expanded version of the abbreviation.
If you are publishing something on the Web that contains abbreviations, it’s often not a bad idea to use the abbreviation and the expanded version on the same page, and to see what else that abbreviation might stand for and what search results for it look like. In the first 20 results I see for “NASA” in Google, all of the pages returned refer to the space agency except for the 9th result, which is about the DJ collective N.A.S.A., and the 10th result, which is about the racing organization, the National Auto Sport Association. No saxophonists in a quick look at the top 50 results.
This patent application is from Yahoo, but researchers at Google and Bing may be considering many of the same ideas.
Be careful with your abbreviations as you search, and as you write for searchers.
19 thoughts on “How Search Engines Might Expand Abbreviations in Queries”
Bill, do you think that users who log in to perform searches would have more of a chance to find exactly what they were looking for when an abbreviation or acronym is involved? Or on any search, for that matter. Didn’t search engineers create a lot more work for themselves by not requiring a login from the very beginning?
That was a great insight about Google’s treatment of abbreviations. I don’t have any abbreviations in my blog, though this information can be useful for my future use.
Thanks for sharing it.
Interesting – my field is patent and prior art searching, where people are generally quite careful about the search terms they choose – but web search is still a part of our tool box. The smarter web search gets, the more we lose the ability to control exactly what is being searched for – that may be a good thing for quick searches but a bad thing for in-depth investigations. I suppose that the lesson is this: it’s always good to stay aware of what your search engine might be “helping” you with!
I don’t think that it makes a difference whether or not someone is logged in to their Google Account and personalized search when it comes to Google handling abbreviations one way or another, much like it doesn’t seem to make a difference for misspellings.
I also think that if Google had started out requiring people to log in before searching, there would be a lot fewer searches, much like we often see abandonment at many shopping carts on ecommerce sites that require someone to create an account before buying something.
Thank you. There are a lot of abbreviations that I think we often take for granted, such as “scuba,” which is short for “Self Contained Underwater Breathing Apparatus” that seem to have evolved into words with meanings of their own.
I know I often use abbreviations when I write that I don’t even think about using without providing an expansion of the abbreviation or acronym. For instance, when I write about getting some new music, I might mention a CD instead of a compact disc. It’s likely that a search engine won’t confuse that CD with a “certificate of deposit,” but the methods described in the patent filing provide some ways of understanding how a search engine might be able to make that distinction.
I agree. One of the reasons that I don’t like using Google’s patent search is that the algorithm used isn’t transparent enough for me to understand what I might be missing when I make searches there. I don’t care about the most popular patents, or the “highest ranked” patents or the most cited patents when I do that search. I just want to see all of the patents that have the words in them that I’m looking for.
Exactly! Although, on the other hand, there are colleagues of mine who love using Google Patent Search as a supplement to a more exhaustive investigation, since it seems to magically pull up a few really good patents that can sometimes be used as good starting points to explore classifications, etc. I guess as long as they keep developing quick fixes we’ll keep using them!
It’s also interesting that every once in a while you will hear that semantic searching will completely solve all our patent search problems – but it’s kind of the same issue. If you can’t see exactly what the algorithm is doing, how do you know what might have been left out? Until it becomes 100% foolproof, patent searching requires too much caution to give up Boolean searching entirely.
On Google, if I type:
the 10th result is
NASA Human Space Flight
if I type:
the 10th result is
NASA Earth Observatory: Home
So Google returns different results for abbreviations and their expanded counterparts and plurals and so forth. I think that’s appropriate because it allows the searcher to decide literally and exactly what they want to look for. Your point, Bill, about writing for the web is well taken. If you want to be picked up for the abbreviation and the expanded phrase then by all means, include them both in your page text. If you are searching for something, it’s best to use a variety of similar phrases and even “reversals” as well, particularly in phrases where there is a geographic location. You might search for South Florida scuba training and also scuba training South Florida. Regards –
That is truth, everyone could sometimes be misguided by the search engine but I think that it would be extremely difficult for those searching programs to make this problem disappear. Again, some search results are fully matching the requested phrase, I think that the key to proper results lies in using proper keywords. Sometimes you can add to NASA e.g. Sweden, or Saxophone. And the results will come out properly.
The chance of missing something important does mean sticking with boolean searching, though I can see how some people might like using something like Google’s patent search to augment their searches. Semantic searches are going to face some of the very same issues.
Hi NJ Web Design,
Interesting difference in results that you received based upon the use of allcaps. It definitely is worth exploring variations of words and phrases, when searching and when writing. Sometimes the results you see in search results for seemingly minor variations are very different – another area where this can happen is with compound words and decompounded words.
Good points. We sometimes ask a lot from search engines, but it is interesting to see how they might try to address potential problems like this one. Sometimes knowing the proper keywords is difficult, especially when you are searching on a topic that you don’t know much about – which is often a reason for searching in the first place.
I guess I was thinking along the lines of a login form with a checkbox below reading something like “remember me/keep me logged in”, where logging in would have been a one time event. Wouldn’t that have given any search engine more insight into what someone wanted as a search result?
For example — if I searched for and showed interest in buying a saxophone, while logged in, let’s say two weeks ago, and then searched for “NASA” today, shouldn’t they lean towards showing me North American Saxophone Alliance instead of the more popular NASA search results…if not, I’m not sure I fully understand the purpose of a Google Account and personalized search.
I’m not sure that it would be helpful. Imagine the programmer who tends to do lots of searches for Java tips and suggestions, who gets an offer to visit an island in Indonesia named Java. He starts searching for information about Java, and all his past searches indicate that he wants programming results. Most of the time the intent behind a search depends upon the circumstances and situation of that present search rather than a past history.
Google’s approach to personalized search may be helpful in searches, but it favors a past history of searches and browsing over a situational need for information, and in the instances where the situational need differs from the past history, it stands a chance of being less helpful than if a searcher wasn’t using personalized search.
Thanks for the excellent explanation, Bill. It almost sounds like people need to be taught how to search before they’re given a box to fill in. Not one of the major engines offers any obvious links to help one learn about how they should search for something. I wonder how many people click on the ‘Advanced Search’ link on Google. I’ve often thought that the page that follows would scare off most of the searching population. Sorry if I’ve drifted off-topic a bit.
You’re welcome. Thanks for asking some interesting questions – I don’t think they are too far off topic. In this patent, the search engine is attempting to understand the intent behind a search based upon the past results of many other searchers and the query refinements they used in search sessions, as well as which results those searchers clicked upon, or in the anchor text used by publishers of content. So, we could ask, does this help our programmer looking for information about the island of Java when he starts doing searches for phrases like “java history”? Would a tutorial on search help that searcher – I’m not sure.