A Taxonomy of Rewriting Search Terms

When I’m looking for information on a topic, I’ll rarely stop at one search regardless of how good or poor the information I find on the topic might be.

I’ll look at some of the results that I receive from my search, and possibly change the words I use in my search based upon what I see in those search results. Sometimes I’ll ignore those results and try out other terms. I might add a word or two to better focus my search, or remove some words to better target what I’m looking for. I might use an advanced search operator, such as a minus sign immediately in front of a word, to try to filter out some results that aren’t relevant to what I’m trying to find.

A couple of researchers from the University of Washington have published a paper to be presented at The 18th ACM Conference on Information and Knowledge Management (CIKM 2009) in November 2009, that takes a close look at how people search on the Web, and how those searchers might reshape and rewrite the query terms they use when trying to find information on a subject.

If you’re a searcher, knowing some of these strategies might help you find information on topics that you’re having trouble with. If you’re a site owner, knowing something about how people search might help you think about how they might find your pages through search engines.

The paper is “Analyzing and Evaluating Query Reformulation Strategies in Web Search Logs.”

The authors, Jeff Huang and Efthimis N. Efthimiadis, looked closely at query logs from AOL released a couple of years ago, to capture information about search sessions from individual searchers, to come up with classifications on how people might change the words that they use when going from one search to another at a search engine. Those query logs contained records of 36,389,567 queries, and the classification method that the researchers used identified 3,411,706 of those as reformulations of previous queries in the log files.

These classifications are presented as a list of “reformulations” or re-writings of query terms, though it’s not a complete list. For example, the authors tell us in the paper that they didn’t try to differentiate between query terms where searchers may have added or included geographical information in their searches. I also don’t see included within the list attempts by searchers to refine a query by including temporal information – such as adding a “2005” to a search for “world series.” The paper also doesn’t discuss the use of advanced search operators, such as a minus sign to filter out some search results.

We are told in the paper that some of the ways that people re-wrote their queries took place more often when searchers didn’t find much in search results that were helpful to them, and that other ways of reformulating query terms were used after it appeared that they did find useful information during their search.

Query Reformulations

Here are the classifications that the authors of the paper came up with for reformulations of search queries that they saw happen commonly in the query log data that they studied:

Word Reordering
Does the order in which words you type as a query matter for a search? People searching for [seattle pizza palace] might change their query to [pizza seattle palace] to find results that they might not have seen with the first search.

Changing Whitespace and Punctuation
Might changing how whitespace and punctuation in your search show you different results? If you search for [ice-cream new york] or [icecream new york] or [ice cream new york], will you see different search results?
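
As a rough illustration of the two reformulations above, here is a minimal Python sketch. It is not the script from the paper; the function names and the rule of treating every non-alphanumeric character as punctuation are my own assumptions.

```python
import re


def is_word_reordering(first: str, second: str) -> bool:
    """True when both queries contain the same words in a different order."""
    a, b = first.lower().split(), second.lower().split()
    return a != b and sorted(a) == sorted(b)


def is_whitespace_or_punctuation_change(first: str, second: str) -> bool:
    """True when the queries differ only in spacing and punctuation."""
    def squash(query: str) -> str:
        # Assumption: anything that is not a letter or digit counts as punctuation.
        return re.sub(r"[\W_]+", "", query.lower())

    return first != second and squash(first) == squash(second)


print(is_word_reordering("seattle pizza palace", "pizza seattle palace"))
print(is_whitespace_or_punctuation_change("ice-cream new york", "icecream new york"))
```

Both checks only say that two queries are related in this way. They say nothing about whether a search engine would return different results for them.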

Removing Words
If you type in a three- or four-word query and, after looking at the results, search again with one or two words removed, you might see a broader range of results. For example, if I search for [cincinnati bengals ohio], I might miss out on a good number of results that I would see if I just searched for [cincinnati bengals].

Adding Words
Sometimes search queries can be too broad, and it might be helpful to add a word or more to focus a search better. A search for [virginia mortgage] probably isn’t as focused as [virginia mortgage rates], and would give me too broad a set of results if it was mortgage rates that I was interested in exploring.
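
Adding and removing words can be pictured as a comparison between two sets of words. The sketch below is only an illustration under that assumption; the function name and the labels it returns are made up for this example and are not taken from the paper.

```python
def word_change(first: str, second: str) -> str:
    """Label how the set of words changed between two consecutive queries."""
    before = set(first.lower().split())
    after = set(second.lower().split())
    if before == after:
        return "same words"
    if after < before:   # every word in the second query was already in the first
        return "removed words"
    if before < after:   # the second query keeps every original word and adds more
        return "added words"
    return "other"


print(word_change("cincinnati bengals ohio", "cincinnati bengals"))
print(word_change("virginia mortgage", "virginia mortgage rates"))
```

Because it works on sets, this ignores word order and repeated words, which is good enough to show the idea.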

URL Stripping
Sometimes people type or copy the URL, or web address, of a page into a search box rather than their browser address bar. They may then remove things such as “.com”, “www.”, and “http” from their original query. If you do this on Google these days, it will usually deliver you to the page for the URL that you’ve typed in rather than showing you search results. If you want to actually search for the URL, you need to put quotation marks around it.
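
Here is one way URL stripping might be spotted in a pair of queries, sketched in Python. The list of endings (.com, .net, .org), the function names, and the example.com address are assumptions for illustration only.

```python
import re


def strip_url(query: str) -> str:
    """Remove the scheme, a leading 'www.', any path, and a common ending."""
    q = query.strip().lower()
    q = re.sub(r"^https?://", "", q)
    q = re.sub(r"^www\.", "", q)
    q = q.split("/", 1)[0]  # drop anything after the host name
    # Assumption: only a few common endings are handled here.
    return re.sub(r"\.(com|net|org)$", "", q)


def is_url_stripping(first: str, second: str) -> bool:
    """True when the second query is a shorter, stripped form of the first."""
    a, b = first.strip().lower(), second.strip().lower()
    return len(b) < len(a) and strip_url(a) == strip_url(b)


print(strip_url("http://www.example.com"))
print(is_url_stripping("www.example.com", "example"))
```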

Stemming
Stemming means stripping a word down to its root. For example, a search for [fishing over bridges] might be rewritten as [fish over bridges] or [fish over bridge].

Using Acronyms
Someone searching for the [National Aeronautics and Space Administration] might decide to use the acronym for the organization (NASA) in their next search.

Expanding Acronyms
Someone searching for [NASA] might decide to expand that acronym to the organization’s full name, [National Aeronautics and Space Administration], in their followup search.
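
A simple way to picture the link between a full name and its acronym is to take the first letter of each word. The Python sketch below does that. The short list of skipped words and the function names are my own assumptions, and plenty of real acronyms won’t follow this rule.

```python
# Assumption: small connecting words usually don't contribute a letter.
SKIPPED_WORDS = {"and", "of", "the", "for"}


def acronym_of(phrase: str) -> str:
    """Build an acronym from the first letter of each significant word."""
    words = [w for w in phrase.split() if w.lower() not in SKIPPED_WORDS]
    return "".join(w[0] for w in words).upper()


def is_acronym_pair(first: str, second: str) -> bool:
    """True when one query is the acronym of the other, in either direction."""
    a, b = first.strip(), second.strip()
    return acronym_of(a) == b.upper() or acronym_of(b) == a.upper()


print(acronym_of("National Aeronautics and Space Administration"))
print(is_acronym_pair("NASA", "National Aeronautics and Space Administration"))
```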

Using Substrings
Where something is removed from the front or back of a search phrase, such as a prefix or suffix. For example, the query [is there spyware on my computer] might be reduced to a smaller string such as [is there spyware].

Using Superstrings
Where something is added to the front or back of a search phrase as a prefix or suffix, such as expanding a query for [nevada police rec] to [nevada police records 2008]
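
The substring and superstring cases mirror each other, and both can be checked with plain prefix and suffix tests. This is a minimal sketch with a hypothetical function name, not the method from the paper.

```python
def prefix_suffix_relation(first: str, second: str) -> str:
    """Describe the second query relative to the first one."""
    a, b = first.strip().lower(), second.strip().lower()
    if a == b:
        return "identical"
    if a.startswith(b) or a.endswith(b):
        return "substring"    # something was cut from the front or back
    if b.startswith(a) or b.endswith(a):
        return "superstring"  # something was added to the front or back
    return "neither"


print(prefix_suffix_relation("is there spyware on my computer", "is there spyware"))
print(prefix_suffix_relation("nevada police rec", "nevada police records 2008"))
```

Note that the comparison is made on characters rather than whole words, which is why [nevada police rec] counts as the start of [nevada police records 2008].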

Using or Expanding Abbreviations
Where words within queries may be lengthened or shortened, such as changing a query of [shortened dict] to [short dictionary].

Substituting Words
Words within a query might be substituted with semantically related words. Those relationships might be synonyms, hyponyms, hypernyms, meronyms, or holonyms. Synonyms are words that have the same meaning, such as “car” and “automobile.” A hyponym is a word that is a specific instance of an original word (or query term), such as the word “scarlet” instead of “red.” A hypernym is where you have the more narrow term, such as scarlet, and replace it with the broader related term, such as “red.” A meronym is a word that names part of some larger whole, such as “finger” for “hand.” A holonym is a word that names a larger whole, rather than the smaller part, such as “hand” for “finger.”

Correcting Misspellings
While this seems self-evident, the researchers only counted a change as a spelling correction when the amount of editing involved was fairly small.
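
One common way to measure “the amount of editing” between two strings is edit distance, the number of single-character insertions, deletions, and substitutions needed to turn one into the other. The sketch below uses that measure as a stand-in. The cutoff of two edits is an assumption picked for illustration, and the function names are hypothetical.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two strings, computed row by row."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete a character from a
                current[j - 1] + 1,      # insert a character into a
                previous[j - 1] + cost,  # substitute, or keep a match
            ))
        previous = current
    return previous[-1]


def looks_like_spelling_fix(first: str, second: str, max_edits: int = 2) -> bool:
    """True when the queries differ by a small, non-zero number of edits."""
    # Assumption: max_edits=2 is an illustrative cutoff, not a published value.
    return 0 < edit_distance(first.lower(), second.lower()) <= max_edits


print(edit_distance("kitten", "sitting"))
print(looks_like_spelling_fix("orginal", "original"))
```

A check like this can’t tell whether the second query is actually spelled correctly. It only tells us that the two queries are close.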

In addition to the classifications above, the researchers also noted that sometimes searchers will change more than just one of the things listed above at a time, such as adding new words, changing the order of words, and others. Some reformulations of queries can be too difficult for a computer algorithm to capture as well, and may require more context or knowledge of popular culture. They give the example of a query reformulation from [how to calculate nutritional values] to [weight watchers calculator].

Conclusion

The patterns in query reformulations that are seen in these classifications might help searchers, site owners, and even search engines to find or to provide better results to searchers.

If you find yourself searching for information about FEMA, you might want to try a followup search for [Federal Emergency Management Agency] to see if you can find some results you otherwise might have missed. Adding words to a query can help better focus a search. Removing words from a query can make an original search that might be too narrow become broader and possibly more useful.

If you’re a site owner trying to use words on your site that you think your visitors will expect to see, or may use to search for and find your site, understanding that searchers may rewrite their query terms in the ways described above may give you some ideas while you’re writing or editing the content for your pages. For example, if I’m writing about NASA, I’m going to make sure that I include the full name of the agency as well as the acronym.

I mentioned earlier that these classifications don’t include the addition of geographic terms, or terms that might add some sense of time to a query. I like to use the advanced search operator minus sign in front of some words, to filter out some search results, and that kind of query reformulation also isn’t included. (I’d love to see a study from one of the major search engines on how often people use advanced search operators in their searches, such as a minus sign or quotation marks around a phrase.)

What strategies do you use when you search that might not have been included in the classifications above?

39 thoughts on “A Taxonomy of Rewriting Search Terms”

  1. I’m a big SEOer and our department collects info on all our possible search terms, phrases, keywords, etc. When collecting we use many of the strategies above as well as searching different categories within search engines, such as Google News or Yahoo! Sports. These strategies collect info from a more focused category. Sometimes we step back and search more generally, going from specific to less specific like you talked about with substitute words. This post is great for re-strategizing and SEOs in general.

  2. Hi Case,

    Welcome to SEO by the Sea, and thanks for sharing a little about how you and your department look at keywords. I like exploring a range of keywords from general through very specific, and in between as well.

    I really enjoy looking at what shows up in search results for keyword terms and phrases, and seeing what else appears in those results, including query suggestions, other phrases in page titles and snippets within the results, what kind of fresh information shows up in news searches, whether or not images and blogs and books show up from blended/universal search and more.

    Taking it to the next step, and comparing that kind of information with what shows up for variations such as singular/plural and stemmed versions of terms, compounded words, and others can sometimes be eye opening. I liked the framework, or classification scheme, that this paper provided of potential ways that queries might be rewritten. It’s interesting to see how search engines might attempt to adjust to some of those as well, such as spelling corrections, abbreviations, and so on.

  3. This paper used the AOL query logs. They no longer seem to be available on the AOL page (anyone know why?), but I found a mirror on http://www.gregsadetsky.com/aol-data/

    Anyways, has anyone tried using these logs to detect reformulations for SEO? Basically I think of them as alternative search terms. The paper links to a python script for classifying reformulations (apparently with high precision, whatever that means).

  4. Hi Tyler,

    There was a lot of publicity about the AOL query log information, but unfortunately, it wasn’t positive. Many were concerned that it contained too much information, and that it might be possible to find personally identifiable information about individuals from the logs.

    The research that I’ve described above is exactly what you are asking about – trying to find reformulations of queries. In this case, the researchers don’t discuss specific reformulations for specific terms, but rather attempted to come up with a classification of how people might try to reformulate the queries that they use, and what the context of those reformulations might be.

    I’ve seen people write about using the AOL data to learn more about specific keywords, so I know some of that is going on. There are limits to using this data for other search engines since each search engine applies its own filters and processes and algorithms and re-ranking processes to show the results that they show.

    Precision is often defined as the number of documents that might be returned for a search that are relevant for the search, divided by the total number of results returned for a search. For example, I search for [ice cream] looking for pages about the dessert. I get results from my search which include those pages, but which also might have the words “ice” and “cream” on them, such as a page that might have the sentence, “I slipped on the ice when I went to the store to buy cream.” So the precision of that particular search might be looked at as the number of pages that are actually about ice cream compared to all of the pages that contain the words “ice” and “cream” on them.

  5. Thanks, K.

    How do you get a hold of unpublished research?

    I usually don’t unless someone sends me something, usually under some kind of embargo or limitation of usage. But, many papers that are created to be presented at places like the CIKM are published elsewhere before the conference, and it’s possible to use sources like Google Scholar to find them, or check to see if the conference has published a list of accepted papers, and then search for the papers, or for the authors of those papers.

  6. But still, sometimes you will not find good information about the topic you are searching for. It happens to me many times.

  7. Hi Vikram,

    Sometimes there isn’t any good information online that’s been indexed by search engines on some topics. It’s possible that information does exist, but hasn’t been indexed, or that there just isn’t any information about a topic. There’s also a great number of pages and databases that require paid subscriptions and memberships to access that haven’t been indexed by search engines, from scientific journals to newspaper archives and many others.

    One strategy that can sometimes be helpful, when you can’t quite find what you are looking for on a topic, especially one that you don’t know too much about, is to find resource pages, such as a Wikipedia entry or a directory that you can drill down categories through, to find related topics and other terms that you can use to search with.

  8. I really think Google and possibly some of the other Engines do a lot of large scale testing on exactly this. This really plays into why there are related search terms. This can be seen on the bottom of the results when doing a Google Search. It is with this kind of information that Google helps to compile the best and most relevant information.

  9. When searching, I have found myself looking at the ad copy that pops up, so I can use phrases found which might lead to other results. Mainly, I go on a tag surfing style search with the help of an engine like Semager. Sometimes, semantic search engines produce really unexpected results (such as a specific firm’s name). However, I most often take the path of looking at one specific item, like “spark arrestor,” which leads to chimney, fireplace, burner, vent, water heater, stove, and so on.

  10. I don’t think there’s any substitute for just opening up Google Analytics and looking at the keywords, country of origin, bounce rate, etc. You have to draw your own conclusions from that.

  11. I only vary my search phrase if the results I get back don’t turn up something I’m satisfied with in the first page results.

    As a site owner, I simply use the adword keyword tool, to get an idea of the variants used for search terms.

  12. Tremendous post. This type of article interests me more than most. I love information that assists in helping us understand human interaction via search. This is the linchpin of quality webmasters and SEO’rs.

  13. I am searching almost daily and it becomes very tedious when you are searching over and over again for the same drug rehab search term and trying to find new info.

  14. Great input! Taking these ideas into consideration will help immensely when optimizing a website (to capture the largest possible audience) and then also with selecting keywords to target for organic & PPC campaigns. Also, obviously, this is valuable for searchers to find what they are looking for as quickly as possible.

  15. Hi Frank,

    Very interesting approach, looking at the words and phrases from ads. I’ve seen some references in a few patents and whitepapers about search engines looking at the terms they choose to advertise with, related to the landing pages that they are showing ads upon, to try to understand relationships between the words used.

    I also like the idea of using different search engines, that might provide some different and unique results. Haven’t tried out Semager yet, but it sounds like I should. Thanks.

  16. Hi Alfred,

    Just using Google analytics can be limiting – it can tell you information about the words that people have used to find your site, but not about the words that you haven’t used on your site, that perhaps you should have.

  17. Hi Crystite,

    I like to look at more than one source of information if I can, regardless of how “trusted” the first might be. Sometimes doing searches on more than one phrase can produce some contradictory information on the same topic from sources that might seem equally “trusted.”

    The adwords keyword tool can sometimes be helpful, and I think it’s good to use. I find approaches like the ones listed in my post can sometimes give me a wider range of possibilities.

  18. Hi Lee,

    It is nice to see research that actually looks closely at what people are doing, and how they are using a search engine. I like these topics as well.

  19. Hi MelissaH,

    It can be helpful to find some other query terms, and expand your searches. Hopefully the methods described in my post might give you some ideas.

  20. Hi Joel,

    Thanks. One of the other things that I liked so much about this research was that it was helpful to searchers, site owners, and search engines. I’d love to see more papers provide helpful information to a wide audience like this.

  21. Hi Adam,

    The major search engines have done a lot of experimentation and testing in this area. I’ve written a number of posts recently about search suggestions and search refinements, based upon patents and whitepapers, many of which can be found in my category on search queries. Some of those queries can be pretty useful, especially if you don’t know much about the topic that you are searching within. Though sometimes, when a query term might have a number of different meanings, I’m not sure that the suggestions being offered are diverse enough.

  22. Hi People Finder,

    That is a very good source of synonyms and related terms that could really help searchers. I wonder if Google would consider offering something similar to searchers as a “query suggestion tool,” that could be accessed directly from a search result page for the query they typed into the search box.

  23. @ Bill,

    That would be a great search option for Google to offer its users and easy to implement.

    On a search engine side note – Have you heard about http://www.80legs.com/ – the site is a web crawler “for hire”? I saw it on Mashable.com today.

    You can input the domains you want and choose options and for $2.00/ million pages and $.03 / CPU hours 80Legs will crawl the sites for you and then you can build your own custom search engine – albeit with a fair amount of your own Java programming to handle the queries and the returned data.

    I just wonder how this really differs from Google Custom Search ( which is free ), where you enter your own domains to search and then easily copy-and-paste the code into your own site. I suppose having your own crawler would give you more flexibility on handling the results and the relevancy assigned to those results.

  24. Hi People Finder,

    I came across 80legs recently too, from a post at alt search engines last Friday.

    It looks like using 80legs would give you some more control over what you are actually crawling and showing in an index than if you were to use Google custom search. Google Custom Search allows you to specify the sites that you want to crawl, and is very easy to set up. 80legs lets you specify seed sites to start with, and looks like it can go beyond those to other sites that aren’t specified, from those seed sites. Yes, it does appear to involve a lot more work, from the FAQ, but as you mention, it also looks like it offers a lot more flexibility.

  25. This is good to know considering how i have been punished in search results for bad mistakes on my writings. I think this was a result of one of my sites getting a lot ranking after those errors that you speak of were never corrected. Great information thanks for sharing!

  26. Hello Bill, thank you for the article. I am writing a school paper about people’s searching behavior. I was looking for similar papers on the authors’ homepages: http://jeffhuang.com/ and http://faculty.washington.edu/efthimis/pubs/pubs.chrono.html

    But unfortunately I cannot access the papers on Dr. Huang’s page because there is no link to them. Bill, do you have a copy of these papers? I found a paper on query abandonment on Dr. Efthimiadis’s page: http://faculty.washington.edu/efthimis/pubs/Pubs/ecir10-stamou-efthimiadis.user.inactivity.results.pdf

  27. Hi Deepti,

    Thank you. I’m not sure that the formatting of those copies of the paper is any different from one site to the other, but it can sometimes be nice to have an alternative location for a paper – sometimes they disappear.

    There are a number of new papers cited on Jeff Huang’s page, but I don’t have any copies of them yet. He does list his email address on the page, and you might want to try to contact him and introduce yourself, and let him know that you are possibly interested in using them for a school paper. I would imagine that there’s a good chance that he would be willing to share, and you might be able to ask questions about the papers as well.

  28. I love how this post breaks down all the refinement types. I think it all boils down to observing the way a web surfer educates themselves or “gets smarter” about what they’re really looking for. In writing new content, it would be wise to include some of the possible refinements for the topic being written about. Additionally, by looking at internal site search terms and entrances by keyword sources (in web analytics) and measuring them against conversion rate, you can get an even more accurate view of customer intent.

  29. Hi Rex,

    The paper did a great job of describing and illustrating those refinement types, and I personally learn a fair amount from thinking about how people might search to find information that I try to offer.

    We don’t always know the intent behind a search when someone refines a query, but knowing something about the many ways that they might do so makes it easier to make educated guesses, and plan ahead for the possibility.

Comments are closed.