When someone searches at one of the major search engines, they often type in keyword phrases, composed as if they were written for human readers. Those phrases may contain words or phrases that show up very frequently in pages on the web, and have little to do with the information being sought by the searcher.
Search engines that focus upon retrieving search results based upon keywords found in queries have often ignored those frequently appearing and irrelevant words contained within search query terms.
Those words have been referred to in the past by Google as “stopwords,” and could be words like: a, and, is, on, of, or, the, was, with. Similar groups of words that appear very commonly on web pages, and are also unconnected to an actual search, could be referred to as “stop-phrases.”
The word “a” in the query “a London hotel” is a stopword.
The phrase “show me” in the query “show me London hotels” is a stop-phrase.
Both “a” and “show me” don’t provide much meaning in a searcher’s intent to find information about hotels in London.
Meaningful Stopwords and Stop-Phrases
Sometimes words and phrases that might be considered stopwords or stop-phrases may actually be meaningful or important. For example, the word “the” in the phrase “the matrix” could be considered a stopword, but someone searching for the term may be looking for information about the movie “The Matrix” instead of trying to find information about mathematical information contained in a table of rows and columns (a matrix).
A search for “Show me the money” might be looking for a movie where the phrase was an important line, repeated a few times in the movie. Or a search for “show me the way” might be a request to find songs using that phrase as a title from Peter Frampton or from the band Styx.
A Google patent granted this week explores how a search engine might look at queries that contain stopwords or stop-phrases, and determine whether or not the stopword or stop-phrase is meaningful enough to include in search results shown to a searcher.
Are Stopwords Important Anymore?
In January, I wrote a post titled New Google Approach to Indexing and Stopwords, which explored a new approach to indexing the content of Web pages, and compressing and uncompressing parts of a search engine index, that appears to allow for better indexing and retrieval of phrases in a search index.
In the past, Google would sometimes tell searchers in the space above a set of search results that their search queries contained “stop words,” and that the stopwords were ignored in the search that the search engine had just performed. In some queries that did contain stopwords that were “meaningful,” Google may not have shown that notification. How did Google know whether the stopwords were meaningful or not?
Also in January, it appears that Google stopped showing notifications about queries containing stop words. Does the search engine still look for stop words and stop phrases, and attempt to determine whether they are meaningful or not?
Using Lists of Known Stopwords and Stop-Phrases and Exceptional Phrases
One way that a search engine could handle stopwords and stop-phrases is to use a list of known stopwords and stop-phrases, and strip those out from a search query before performing a search and presenting search results to a searcher.
That approach might ignore meaningful stopwords and stop-phrases. To avoid that problem, a search engine might then build a list of “exceptional” phrases when determining whether stopwords are included in a query. That list might include phrases like “the matrix” or “show me the money.” Identifying those exceptional phrases, and keeping a list of those phrases up to date might be difficult.
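To make the list-based approach concrete, here is a minimal sketch. The stopword list comes from the examples earlier in this post; the function name, the tiny lists, and the rule of leaving a query untouched whenever it contains an exceptional phrase are assumptions for illustration, not anything taken from the patent.

```python
# Illustrative lists only; a real system would need far larger ones.
STOPWORDS = {"a", "and", "is", "on", "of", "or", "the", "was", "with"}
STOP_PHRASES = {"show me"}
EXCEPTIONAL_PHRASES = {"the matrix", "show me the money"}


def strip_stopwords(query):
    """Remove known stopwords and stop-phrases unless the query
    contains a phrase from the exceptions list (hypothetical rule)."""
    normalized = " ".join(query.lower().split())
    padded = f" {normalized} "  # padding lets us match whole words only

    # If an exceptional phrase is present, leave the query alone.
    if any(f" {phrase} " in padded for phrase in EXCEPTIONAL_PHRASES):
        return normalized

    # Drop multi-word stop-phrases first, then single stopwords.
    for phrase in STOP_PHRASES:
        padded = padded.replace(f" {phrase} ", " ")
    kept = [word for word in padded.split() if word not in STOPWORDS]
    return " ".join(kept)
```

With these lists, “a London hotel” and “show me London hotels” lose their stopwords, while “the matrix” is kept whole. The weak point is the one described above: someone has to keep the exceptions list current.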
Alternative Approaches to Using Known Lists and Exceptions
Another approach might be to identify when a query contains stopwords and stop-phrases, and then to perform searches on queries that contain stopwords with and without the stopwords, so that the results, or lists of categories associated with the search results, could be compared to see if they are substantially similar.
If the sets of data are substantially similar, the removal of the potential stopword or stopwords may not be material to the search. If the results or the categories aren’t substantially similar, the stopword may be considered material to the search, and shouldn’t be removed from the query.
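A rough sketch of that decision might look like the following. The `search` argument stands in for whatever retrieval system is being used, the toy index is made up, and both the Jaccard overlap measure and the 0.5 threshold are assumptions of mine; the patent only speaks of results being “substantially similar.”

```python
def stopword_is_material(query, stripped_query, search, threshold=0.5):
    """Return True when removing the stopwords changes the results enough
    that the stopwords should be kept (threshold is an assumed value)."""
    with_stopwords = set(search(query))
    without_stopwords = set(search(stripped_query))
    if not with_stopwords and not without_stopwords:
        return False  # nothing to compare, so treat removal as harmless

    shared = len(with_stopwords & without_stopwords)
    total = len(with_stopwords | without_stopwords)
    return shared / total < threshold


# A made-up index standing in for a real search engine.
TOY_INDEX = {
    "the matrix": ["matrix-movie", "matrix-trilogy", "keanu-reeves"],
    "matrix": ["linear-algebra", "matrix-math", "matrix-movie"],
    "a london hotel": ["london-hotels", "hotel-deals", "london-guide"],
    "london hotel": ["london-hotels", "hotel-deals", "london-guide"],
}


def toy_search(query):
    return TOY_INDEX.get(query, [])
```

In this toy data, the results for “the matrix” and “matrix” barely overlap, so “the” would be treated as material, while “a London hotel” and “London hotel” return the same pages, so “a” could be dropped.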
The patent is:
Locating meaningful stopwords or stop-phrases in keyword-based retrieval systems
Invented by Simon Tong, Uri Lerner, Amit Singhal, Paul Haahr, and Steven Baker
Assigned to Google Inc.
US Patent 7,409,383
Granted August 5, 2008
Filed: March 31, 2004
Abstract
A stopword detection component detects stopwords (also stop-phrases) in search queries input to keyword-based information retrieval systems. Potential stopwords are initially identified by comparing the terms in the search query to a list of known stopwords. Context data is then retrieved based on the search query and the identified stopwords.
In one implementation, the context data includes documents retrieved from a document index. In another implementation, the context data includes categories relevant to the search query. Sets of retrieved context data are compared to one another to determine if they are substantially similar.
If the sets of context data are substantially similar, this fact may be used to infer that the removal of the potential stopword(s) is not material to the search. If the sets of context data are not substantially similar, the potential stopword can be considered material to the search and should not be removed from the query.
Comparing the Similarity of Results or Categories from Multiple Sets of Queries
The patent explores this stopword process in more depth, including such things as how a list of stopwords might be identified manually, or in an automated fashion by looking at term frequencies on the web, with the most frequently appearing words or phrases likely to be stopwords or stop-phrases. It also touches briefly on how categories can be assigned to query terms. Term frequencies and categories can play a role in determining how similar the results are when looking at search query results with and without the stop words.
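The automated route could be as simple as counting words across a collection of pages and treating the most frequent ones as candidates. This is only a sketch under that assumption; the function name and the `top_n` cutoff are hypothetical, and a real system would work from web-scale counts rather than a handful of strings.

```python
from collections import Counter


def candidate_stopwords(documents, top_n=5):
    """Return the top_n most frequent words across the documents,
    as candidate stopwords (the cutoff is an illustrative assumption)."""
    counts = Counter()
    for document in documents:
        counts.update(document.lower().split())
    return [word for word, _ in counts.most_common(top_n)]
```

The same counting could be applied to two-word or three-word sequences to surface candidate stop-phrases.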
Whether the two sets of results, or context data, are “substantially similar” can be determined by looking at such things as:
1) Word frequencies of terms that show up in search results pages from queries with the stopwords and the same queries without the stopwords. If the frequencies are relatively equal, the sets of results could be considered substantially similar.
2) The percentage of documents that appear in both sets of results could also be used.
3) Sets of categories from the different search results could be compared, by calculating the portion of the categories that are in both sets.
4) Category relevance scores between both sets of queries could be compared.
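The four comparisons above could be sketched along these lines. The specific formulas (cosine similarity for word frequencies, shared fraction for documents and categories, average gap for relevance scores) are my own choices for illustration; the patent describes the kinds of comparisons, and I am not claiming these are the formulas it uses.

```python
import math
from collections import Counter


def word_frequency_similarity(texts_a, texts_b):
    """1) Cosine similarity between word counts of two sets of result texts."""
    freq_a = Counter(w for text in texts_a for w in text.lower().split())
    freq_b = Counter(w for text in texts_b for w in text.lower().split())
    dot = sum(freq_a[w] * freq_b[w] for w in freq_a)
    norm = (math.sqrt(sum(v * v for v in freq_a.values()))
            * math.sqrt(sum(v * v for v in freq_b.values())))
    return dot / norm if norm else 0.0


def shared_fraction(items_a, items_b):
    """2) and 3) Portion of documents (or categories) found in both sets."""
    a, b = set(items_a), set(items_b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def mean_score_gap(scores_a, scores_b):
    """4) Average absolute difference between category relevance scores.
    Categories missing from one side are scored as 0.0 (an assumption)."""
    categories = set(scores_a) | set(scores_b)
    if not categories:
        return 0.0
    return sum(abs(scores_a.get(c, 0.0) - scores_b.get(c, 0.0))
               for c in categories) / len(categories)
```

Each function returns a number that could be checked against a threshold to decide whether two sets of context data count as substantially similar.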
Placeholders
When a search is done on the version of a query that doesn’t include stopwords, the stopwords might be replaced by placeholders, indicating the presence of a word without regard to the actual word being replaced.
Take the search query “show me the way lyrics.” The search engine might identify “show me” and “the” as stopwords. To compare search results for the term both with, and without the stopwords, the search engine might use “way lyrics” or it might use placeholders, such as “* * * way lyrics,” where “*” represents the placeholder words.
Multiple queries might actually be used and compared, with placeholders for some of the identified stopwords, as well as including some stopwords or stop-phrases and not others.
Example
Original query: “show me the way lyrics”
Alternative queries:
way lyrics
show me * way lyrics
* * the way lyrics
* * * way lyrics
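The alternative queries in that example can be generated mechanically. In this sketch, each stopword or stop-phrase is passed in as a (start, length) span over the query's words; that representation and the function name are assumptions made for illustration.

```python
from itertools import product


def alternative_queries(tokens, stop_units, placeholder="*"):
    """Build alternative queries: one with all stopwords removed, plus every
    mix of kept and placeholder-replaced stopword units except the original.

    tokens: list of query words.
    stop_units: list of (start, length) spans marking stopwords/stop-phrases.
    """
    stop_positions = {
        i for start, length in stop_units for i in range(start, start + length)
    }
    # Version with the stopwords dropped entirely.
    alternatives = [
        " ".join(t for i, t in enumerate(tokens) if i not in stop_positions)
    ]

    for keep_flags in product([True, False], repeat=len(stop_units)):
        if all(keep_flags):
            continue  # keeping every unit just reproduces the original query
        words = list(tokens)
        for (start, length), keep in zip(stop_units, keep_flags):
            if not keep:
                words[start:start + length] = [placeholder] * length
        alternatives.append(" ".join(words))
    return alternatives


queries = alternative_queries(
    ["show", "me", "the", "way", "lyrics"],
    [(0, 2), (2, 1)],  # "show me" and "the"
)
```

For the query above, `queries` holds the same four alternatives listed in the example, in the same order.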
Conclusion
It’s interesting to see how Google may have attempted to understand whether stopwords were meaningful or not when they appeared in search queries, by comparing result sets with and without the stop words (and by using placeholders in some of the comparisons).
Searches on Google with queries that contain stopwords do seem to provide results that focus upon returning pages that have phrases which contain the stopwords within the query – much of the time. Sometimes, results that show other words where the stopwords originally were appear instead. For example, a search on “a room for a view” (without the quotation marks) shows results for the phrase “a room with a view.”
Is Google still following the comparison process above, with placeholders for stopwords, or is it doing something else, such as providing a result by expanding a query based upon user data, such as query revisions during individuals’ search sessions? Or something else completely?






Interesting! Now that you mention it, I do notice that the prompt is no longer shown. I am continually amazed at how much control search engines have over what people see and ultimately purchase. If people are unable to find a result because of stopwords, that has serious consequences.
Always exciting to hear the constant evolution of Search Engines, thanks for sharing!
Bill, great article, tons of valuable information! I never knew that Google recently stopped showing notifications about queries containing stop words. And my answer to your question above is that I think they are still following the comparison process, from what I see. Time will tell!
Being someone new to the use of personal computers, I have experienced the problem of trying to find products, nightclubs, movies, etc.
It can be frustrating, and in the end, there are times I have just given up!
And (as you stated) there are times when the stop word is a necessary part of the search.
At least I know now what part of the problem could be. Thank you!
How does Google’s ignoring STOP words in searches reconcile with Google giving ‘the’ and ‘in’ authority site status?
Really interesting patent.
I’ve got a better understanding of how Google manages to rewrite title tags using those stop(-phrases|words) on SERPs.
Thanks for sharing,
Thomas.
Hi Garrett,
Thank you. It’s funny when something like the stopwords notifications disappear, because it can be a while before you notice their absence. In the place of stopwords indicators, we may have better phrase-based matching from Google. Honestly, I’d like to see the results of testing that Google may have performed to compare search results from the days where they were telling us that they were ignoring stopwords in results to today’s results.
I remember back when Google would tell me that it had removed stopwords in a search for the “to be or not to be” phrase from Hamlet (without the quotes), and then performing the search a few months later and getting much more relevant results, as if the phrase had been added to an “exceptional phrases” list.
It’s funny that now, a top ten result for that phrase I’m seeing is a page titled “2-Bee or Not-too-bee,” where the alternative “to be or not to be” doesn’t appear on the page. I’m not sure if that is the result of a slightly different comparison process, or an attempt to diversify results, and include a commercial result amongst the references to Shakespeare.
Hi Rob,
Search engines do seem to have more and more control.
They are increasingly becoming alternatives to the navigation that sites contain. Not only is there a decent chance that a search engine might deliver someone to a page other than the home page of your site, but with things like the site links that appear under some first results for queries, Google is attempting to provide navigational shortcuts directly to pages within a site. It apparently does that when it believes a query is navigational in nature, and that showing site links will help someone arrive more quickly at a final destination page.
In some search results, Google is even showing a search box under a first result, a search within a site, that people can use to increase the “likelihood of finding the exact page they are looking for.” Or that Google thinks they might be looking for.
Better phrase-based indexing should mean that a search engine shouldn’t have to rely as much on comparing query results with and without stopwords in them, to see if the stopwords are meaningful. I do wonder if that kind of comparison is still being used. It is amazing to see how search engines are evolving.
Hi Dave,
Thanks. I know what you’re talking about.
Trying to search when you don’t know the right words to use in your query, because you don’t know the correct name for something, or because you don’t know enough about the topic you’re trying to learn about is one of the very real challenges that search engines face.
I’ll often turn to a directory when that happens, so that I can find out more about the topic. I find that can help. So, if you’re looking for a movie name, or a song title, or a place name, or a product, finding a directory geared towards that topic might be helpful.
Thanks for the analysis and update Bill. Stop Words are pretty interesting, imho. My big question is: how is this actionable? Or is this patent more on the personal enrichment side?
Hi John,
I’m not quite sure that I understand your question. I’ll try to give you an answer, but if I’m off the mark on what you’re asking about, please feel free to rephrase your question, and I’ll give it another try.
The phrase “authority status” could potentially have a number of different meanings. One form of “authority status” could potentially be sites identified by Google relevance raters as “vital” pages when it comes to a particular query (Nice rundown of the handbook at Pandia in their post The Secret Google Quality Raters’ Handbook).
That really doesn’t have much to do with a stop word similarity analysis.
Hi Thomas,
Google may have used this stopword process in the past, but it’s questionable whether they still do. Look at my post here:
New Google Approach to Indexing and Stopwords.
Hi Gab,
Good question. If Google was using this kind of “substantially similar” analysis, we don’t know that they’ve stopped. It may still be helpful to see if stopwords in queries are meaningful, or to perform that kind of analysis with placeholders.
The patent provides us with some insights that we didn’t possess before about some processes that the search engine may have been using. Does that information give us some possible insights into other processes? It might.
The discussion about using categories of query terms was interesting, too. A recent Yahoo patent application discussed that topic, which I wrote about in How Using Categories for Queries Can Help Searchers, Writers, and Search Engines. How much attention are you paying to how a search engine might categorize a query term?
One actionable item is to look at the tool that you might be using for keyword research, and see how it treats stopwords.
Nice article, but not sure that reference to “Styx” was necessary /snicker.
I tried a few queries and could not find any really good examples of stopwords. Tough debate how that patent would actually work sitting here trying to figure it out in my head.
Hi Mack,
Thanks. I’m not sure the Styx reference was necessary, either.
I’ve been doing a number of searches with stop words in them since January, to see how Google might be handling searches with stop words. It does seem like the kind of phrase matching process I pointed to in the post I linked to above (New Google Approach to Indexing and Stopwords) has replaced this stopword process, but it’s interesting to think about how the two methods may have differed.
It’s a good summary of the patent, thank you, I’ve directed people to it from my blog, no need to reinvent the wheel
There’s quite a bit of work and people are divided on whether to use stopwords or not. A lot of people report less accurate results and of course there’s the huge index size that is an issue. This is an interesting paper though by someone who found that keeping stopwords in actually improved things:
http://dspace.mit.edu/bitstream/handle/1721.1/30604/nuggeteer.pdf?sequence=1
Although it wasn’t specifically used for IR as such. Gregory states:
“Removing stopwords significantly hurt precision among description-only runs because many of the descriptions were now so short that recall became more coarse-grained, and thus more difficult to threshold.”
I’d like to see loads more testing in different environments. It’d be really interesting.
Hi CJ,
You’re welcome.
Thanks for the kind words, and for the link. I agree that keeping the stopwords in the index is probably a move in the right direction, and that a phrase-based search which includes those will likely provide more precision.
Chances are that Google was following something very similar to the process described in this patent for a while. The analysis to see whether the stopwords in a query were meaningful, and should have been included in the results shown to searchers was an interesting attempt to try to avoid such a loss of precision.
I’d love to be able to compare results under the old approach which removed stopwords in queries when they weren’t considered meaningful, and the new approach that Google is now using.
“One actionable item is to look at the tool that you might be doing for keyword research, and see how it treats stopwords. ”
Ah – thanks, good idea. Although based on how they seem to approach it, wouldn’t you say that perhaps a niche-oriented tool would be more useful?
[...] A few days ago, while browsing my feed updates from the last few days, I came across this interesting article written by Bill Slawski, which discusses a new system patented by Google to improve the search experience by working on the keywords entered. [...]
Very good article, Bill. Frankly speaking, I had not really noticed these stop-words or stop-phrases until I read your article and realized the story behind them. But my personal experience says that a search without stop words or phrases gives more accurate results than a search with them.
Hi Eva,
Thank you for your kind words. I do like the idea behind this patent from Google, attempting to try to understand if the stop words or stop phrases in a query are meaningful and important, and should be included in a search.
But I think that it’s better for the search engine and searcher to return results based upon the searcher’s query, stopwords and all, and if the person searching decides that the stopwords in their query aren’t important, they can then revise their query to not include those words.
Hi William,
What you’re saying makes sense. However, how many people outside of the SEO and computer science industries do you know who are familiar with stopwords?
I can imagine a lot of people deciding that all the stopwords are important all the time. Or just ignoring the whole thing because they don’t really understand it, and quite frankly have no interest in it either.
There are quite a few not so nice quotes and saying about users (which actually are quite funny if you’re into that kind of thing) but there is one that says:
“Unix is user-friendly. It’s just very selective about who its friends are.” – switch unix to search engine commands and so on.
There is also one that says something like “the user should do as little work as possible”.
I think the point of working towards an algorithm to detect meaningful stopwords is really the way to go. Not to mention all the computer scientists who would like to read the research and learn from it…
Hi CJ,
I completely agree that search should be as easy as possible for the people doing the search.
I’m not completely sure that Google needs to go through the analysis of whether or not a stopword is meaningful anymore, if it follows the processes described in Document compression scheme that supports searching and partial decompression.
That process does allow for the possibility of query refinements and expansion based upon things like user behavior metrics that might, for instance, determine that a search for something like “the matrix” is more likely a search for information about the movie than about mathematical matrices, without going through the substantially similar comparison and analysis described in the stopwords patent. And the process would not only be transparent to a searcher, but would be influenced by searcher behavior in choosing search results, and interacting with those results.
Hey,
Yes, I’m familiar with that method, although I haven’t tried it. I think it’s really important for Google to automate something like this as much as possible because such methods can be used elsewhere, for other systems that use NLP. It’s not just for user searching. It would definitely help any text classification system, or conversational system.
But, I haven’t seen any results yet or anything like that so I’m not sure even whether “meaningful” stopwords will actually improve anything to be honest. And language is so fluid and ambiguous too.
I guess user refinement (or repair) is a way of getting the system to go through supervised learning anyway so it can’t be bad. There will be a lot of data to collect. Only something like Google could do that because of their huge user base.
You have a good point.
Hi CJ,
Thanks. We can’t be certain of the approach that Google might be following, but it does seem to make some sense for them to try to follow a method that can be used elsewhere, like you point out.
Google does collect a lot of data, but so does Yahoo.
Perhaps Google tries to get as much information as possible about a user and tries to determine the user’s intentions. Stop words provide meaning and purpose for a search phrase. Without them, it’s not possible to get the true meaning in some cases. There are some language usage issues also. Maybe Google is categorizing users and queries. This is a guess.
Hi Jaya,
I think those are good guesses.
Ignoring stopwords was a shortcut for a search engine because they are words that appear so frequently that searching for them would mean that a search engine would have to use a lot of resources in their search…
This algorithm for finding whether stopwords were “meaningful” was developed because there are times when the meaning of some queries is lost when stopwords within those query terms are ignored.
The newer algorithm that may have replaced this one uses a different approach – looking for the least frequently appearing word or words in a search query, and then checking to see if the more commonly occurring words (the words formerly known as stopwords) are nearby. And when those phrases are found, they may be ranked based in part upon user behavior, and upon categories for users, for queries, and for web pages returned.
Try searching google for – Web design and development solutions – you will come up with Contactbridge.com
Search Google for web design development solutions you will come up with a different #1 and Contactbridge.com will come up at about the 3rd page.
Apparently “and” is a stop word.
So then why the difference in the number one ranked site between the two phrases?
Contactbridge.com has been heavily optimized in general and specifically for these words, but the point is the same: if “and” is a stop word and theoretically ignored, the site should have the same rank with either phrase.
Hi Ravi,
This patent was aimed at determining whether a stopword was actually meaningful enough so that it shouldn’t be ignored as a stopword. It appears that Google is not ignoring stopwords like they once did, however.
The results of your search, with and without the “and,” should probably produce differences in ranking because the search engine doesn’t appear to be ignoring words like “and” anymore.
Hi William,
Great article!
I actually was wondering the same. Using, for example, “and” in a search produces slightly different results than when I don’t use “and.” When I was briefly looking through the meta keywords of these different sites, I noticed that it obviously does matter whether you use stop words in your keywords or not.
As most people out there will use stop words in their search my conclusion is that I should revisit my already defined ‘old’ keywords and stuff them again with stop words. But this will make it easier to use the keywords as organic keywords in the content as the stop words embed the key words / key phrases better in text. But then again it is work to do without getting paid for…
Regards,
Christoph
Hi Christoph,
Thanks. You raise a really great point.
I’m not sure that it’s a bad idea to return to the keywords that you’ve optimized pages for on a regular basis anyway, and see if they are still what you should be using for the pages that you may have optimized. And, if you’re being paid for SEO services, perhaps that is something that should be included in your services.
It really might be worth considering including ongoing keyword research and review in your SEO practices, and watching for trends and changes in language and search behavior.
If you have site search on a site, are there terms that people are always looking for?
What kinds of changes are your competitors making to their sites in terms of keywords?
In places on the web where people talk about the kinds of things offered on your site, are there new words and phrases in use?
Are the terms that you’ve targeted drawing the kind of attention that you’ve hoped for, or expected?
How much have the search results for those phrases changed since optimizing those pages?
From the phrase “stop words”, does that mean the spiders don’t scan anything after the stop word or just that it skips over it and keeps on scanning?
Hi Rob,
The idea behind stopwords was that they appear so frequently on pages on the web that it would be pretty time consuming to try to index them everywhere they show up, and not very helpful to finding a resource on the web. One problem though is identifying when words that are often considered stopwords might actually be meaningful terms, such as in the phrases “to be or not to be,” or “The Matrix.”
Bill, what are your thoughts on questions and how Google might use stop words or not use them to bring back relevant results? “When was the Olympics” and “Where were the Olympics” are two different questions, yet removing stop words might give them similar results.
Hi Answer Blip,
I mentioned in my post a new approach from Google, where they seem to be finding phrases regardless of whether or not a query contains stop words. I think to some degree, that may be true with some question and answer type queries. Some question and answer type queries may also trigger Google’s Q&A feature, such as “Where was Derek Jeter Born?”
It’s possible that when you ask a question, that Google might try to match the phrase in your question with the results that it shows. It might also try to expand the query that you entered, and results for other similar questions might show up in search results. So a question asking “When was the Olympics” might also include results that show “Where were the Olympics.” Not because the search engine is ignoring stopwords as much as because it might see the questions as very similar.
[...] an “old” post (August 2008) on the site SEObytheSEA that reported on a Google patent about how its search engine can [...]
[...] an “old” post (August 2008) on the site SEObytheSEA that reported on a Google patent about how its search engine or the engine [...]