What are Stopwords?
When someone searches at one of the major search engines, they often type in keyword phrases, composed as if they were written for human readers. Those phrases may contain words or phrases that show up very frequently in pages on the web and may have little to do with the information being sought by the searcher.
Search engines that focus upon retrieving search results based upon keywords found in queries have often ignored those frequently appearing and irrelevant words contained within search query terms.
Those words have been referred to in the past by Google as “stopwords,” and could be words like: a, and, is, on, of, or, the, was, with. Similar groups of words that appear very commonly on web pages, and are also unconnected to an actual search could be referred to as “stop-phrases.”
The word “a” in the query “a London hotel” is a stopword.
The phrase “show me” in the query “show me London hotels” is a stop-phrase.
Neither “a” nor “show me” adds much meaning to a searcher’s intent to find information about hotels in London.
Meaningful Stopwords and Stop-Phrases
Sometimes words and phrases that might be considered stopwords or stop-phrases are meaningful or important. For example, the word “the” in the phrase “the matrix” could be considered a stopword, but someone searching for the term may be looking for information about the movie “The Matrix” rather than about the mathematical structure of values arranged in rows and columns (a matrix).
A search for “Show me the money” might be looking for a movie where the phrase was an important line, repeated a few times in the movie. Or a search for “show me the way” might be a request to find songs using that phrase as a title from Peter Frampton or the band Styx.
A Google stopwords patent granted this week explores how a search engine might look at queries that contain stopwords or stop-phrases, and determine whether or not the stopword or stop-phrase is meaningful enough to include in search results shown to a searcher.
Are Stopwords Important Anymore?
In January, I wrote a post titled New Google Approach to Indexing and Stopwords, which explored a new approach to indexing the content of Web pages and compressing and uncompressing parts of a search engine index that appears to allow for better indexing and retrieval of phrases in a search index.
In the past, Google would sometimes tell searchers in the space above a set of search results that their search queries contained “stop words,” and that the stopwords were ignored in the search that the search engine had just performed. In some queries that did contain stopwords that were “meaningful,” Google may not have shown that notification. How did Google know whether the stopwords were meaningful or not?
Also in January, it appears that Google stopped showing notifications about queries containing stop words. Does the search engine still look for stop words and stop phrases, and attempt to determine whether they are meaningful or not?
Using Lists of Known Stopwords and Stop-Phrases and Exceptional Phrases
One way that a search engine could handle stopwords and stop-phrases is to use a list of known stopwords and stop-phrases, and strip those out from a search query before performing a search and presenting search results to a searcher.
That approach might ignore meaningful stopwords and stop-phrases. To avoid that problem, a search engine might then build a list of “exceptional” phrases when determining whether stopwords are included in a query. That list might include phrases like “the matrix” or “show me the money.” Identifying those exceptional phrases, and keeping a list of those phrases up to date might be difficult.
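The list-based approach described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the stopword list comes from the examples in this post, and the names `STOP_PHRASES`, `EXCEPTIONAL_PHRASES`, and `strip_stopwords` are hypothetical, not anything a search engine is known to use. For simplicity, an exceptional phrase is only recognized when it makes up the whole query.

```python
# Illustrative lists taken from the examples in this post (not a real engine's lists).
STOPWORDS = {"a", "and", "is", "on", "of", "or", "the", "was", "with"}
STOP_PHRASES = [("show", "me")]
EXCEPTIONAL_PHRASES = {"the matrix", "show me the money"}


def strip_stopwords(query):
    """Remove known stopwords and stop-phrases unless the query is an exception."""
    normalized = " ".join(query.lower().split())
    # Simplifying assumption: exceptions must match the entire query.
    if normalized in EXCEPTIONAL_PHRASES:
        return normalized

    tokens = normalized.split()
    kept = []
    i = 0
    while i < len(tokens):
        # Try multi-word stop-phrases before single stopwords.
        matched_len = 0
        for phrase in STOP_PHRASES:
            if tuple(tokens[i:i + len(phrase)]) == phrase:
                matched_len = len(phrase)
                break
        if matched_len:
            i += matched_len
            continue
        if tokens[i] not in STOPWORDS:
            kept.append(tokens[i])
        i += 1
    return " ".join(kept)


print(strip_stopwords("a London hotel"))         # london hotel
print(strip_stopwords("show me London hotels"))  # london hotels
print(strip_stopwords("The Matrix"))             # the matrix
```

The sketch also shows why this approach is brittle: every meaningful phrase has to be anticipated and added to the exceptions list by hand.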
Alternative Approaches to Using Known Lists and Exceptions
Another approach might be to identify when a query contains stopwords and stop-phrases and then to perform searches on queries that contain stopwords with and without the stopwords, so that the results, or lists of categories associated with the search results, could be compared to see if they are substantially similar.
If the sets of data are substantially similar, the removal of the potential stopword or stopwords may not be material to the search. If the results or the categories aren’t substantially similar, the stopword may be considered material to the search, and shouldn’t be removed from the query.
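A rough sketch of that comparison might look like the following. Everything here is an assumption made for illustration: the `search` argument is a hypothetical stand-in for a real search backend, the overlap measure (shared results divided by all results) is one plausible choice, and the 0.5 threshold is arbitrary. The patent does not specify these details in this form.

```python
def stopword_is_material(query, stopwords, search, threshold=0.5):
    """Return True if removing the stopwords changes the results substantially.

    `search` is any callable that maps a query string to a list of result IDs.
    """
    tokens = query.lower().split()
    stripped = " ".join(t for t in tokens if t not in stopwords)

    with_results = set(search(" ".join(tokens)))
    without_results = set(search(stripped))

    union = with_results | without_results
    if not union:
        return False  # nothing to compare, so treat the stopwords as removable
    overlap = len(with_results & without_results) / len(union)
    return overlap < threshold


# Toy stand-in for a search backend, with made-up result IDs.
TOY_RESULTS = {
    "the matrix": ["movie1", "movie2", "movie3"],
    "matrix": ["math1", "math2", "movie1"],
    "a london hotel": ["h1", "h2", "h3"],
    "london hotel": ["h2", "h1", "h3"],
}


def toy_search(query):
    return TOY_RESULTS.get(query, [])


print(stopword_is_material("the matrix", {"a", "the"}, toy_search))      # True
print(stopword_is_material("a london hotel", {"a", "the"}, toy_search))  # False
```

With this toy data, dropping “the” from “the matrix” changes most of the results, so the word would be kept, while dropping “a” from “a London hotel” changes nothing.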
The patent is:
Locating meaningful stopwords or stop-phrases in keyword-based retrieval systems
Invented by Simon Tong, Uri Lerner, Amit Singhal, Paul Haahr, and Steven Baker
Assigned to Google Inc.
US Patent 7,409,383
Granted August 5, 2008
Filed: March 31, 2004
A stopword detection component detects stopwords (also stop-phrases) in search queries input to keyword-based information retrieval systems. Potential stopwords are initially identified by comparing the terms in the search query to a list of known stopwords. Context data is then retrieved based on the search query and the identified stopwords.
In one implementation, the context data includes documents retrieved from a document index. In another implementation, the context data includes categories relevant to the search query. Sets of retrieved-context data are compared to one another to determine if they are substantially similar.
If the sets of context data are substantially similar, this fact may be used to infer that the removal of the potential stopword(s) is not material to the search. If the sets of context data are not substantially similar, the potential stopword can be considered material to the search and should not be removed from the query.
Comparing the Similarity of Results or Categories from Multiple Sets of Queries
The patent explores this stopword process in more depth, including such things as how a list of stopwords might be identified manually, or in an automated fashion by looking at term frequencies on the web, with the most frequently appearing words or phrases likely to be stopwords or stop-phrases. It also brushes over how categories can be assigned to query terms. Term frequencies and categories can play a role in determining how similar the results are when looking at search query results with and without the stop words.
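The automated route to a stopword list can be illustrated with a small frequency count. This is a minimal sketch, assuming that “term frequency on the web” is approximated by the share of documents in a sample that contain a term; the function name `candidate_stopwords` and the cutoff parameter are hypothetical.

```python
from collections import Counter


def candidate_stopwords(documents, min_doc_fraction=0.8):
    """Return terms that appear in at least `min_doc_fraction` of the documents."""
    doc_freq = Counter()
    for doc in documents:
        # Count each term once per document.
        doc_freq.update(set(doc.lower().split()))
    cutoff = min_doc_fraction * len(documents)
    return {term for term, count in doc_freq.items() if count >= cutoff}


sample_docs = [
    "the hotel is in london",
    "the matrix is a movie",
    "a room with the view",
    "the way is long",
]
print(sorted(candidate_stopwords(sample_docs, min_doc_fraction=0.75)))  # ['is', 'the']
```

A real collection would need far more documents and proper tokenization, and the same idea would have to be extended to word sequences to find stop-phrases.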
Whether the two sets of results or context data are “substantially similar” can be determined by looking at such things as:
1) Word frequencies of terms that show up in search results pages from queries with the stopwords and the same queries without the stopwords. If the frequencies are relatively equal, the sets of results could be considered substantially similar.
2) The percentage of documents that appear in both sets of results could also be used.
3) Sets of categories from the different search results could be compared, by calculating the portion of the categories that are in both sets.
4) Category relevance scores between both sets of queries could be compared.
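The four signals listed above could be computed in many ways. The sketch below shows one possible reading of each, and all of the specifics are assumptions: cosine similarity for word frequencies, shared items divided by all items for documents and categories, and an average absolute difference for category relevance scores. None of these function names or formulas come from the patent.

```python
import math
from collections import Counter


def word_frequency_similarity(pages_a, pages_b):
    """Cosine similarity between the word counts of two sets of result pages."""
    counts_a = Counter(w for page in pages_a for w in page.lower().split())
    counts_b = Counter(w for page in pages_b for w in page.lower().split())
    dot = sum(counts_a[w] * counts_b[w] for w in counts_a)
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def shared_fraction(items_a, items_b):
    """Portion of documents (or categories) that appear in both sets."""
    a, b = set(items_a), set(items_b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def relevance_score_gap(scores_a, scores_b):
    """Average absolute difference between per-category relevance scores."""
    categories = set(scores_a) | set(scores_b)
    if not categories:
        return 0.0
    total = sum(abs(scores_a.get(c, 0.0) - scores_b.get(c, 0.0)) for c in categories)
    return total / len(categories)


print(shared_fraction(["d1", "d2", "d3"], ["d2", "d3", "d4"]))  # 0.5
```

How high a similarity counts as “substantially similar” would be a tuning decision; the threshold is not something this sketch can supply.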
When a search is done on the version of a query that doesn’t include stopwords, the stopwords might be replaced by placeholders, indicating the presence of a word without regard to the actual word being replaced.
Take the search query “show me the way lyrics.” The search engine might identify “show me” and “the” as stopwords. To compare search results for the term both with and without the stopwords, the search engine might use “way lyrics” or it might use placeholders, such as “* * * way lyrics,” where “*” represents the placeholder words.
Multiple queries might be used and compared, with placeholders for some of the identified stopwords, as well as including some stopwords or stop-phrases and not others.
Original query: “show me the way lyrics”
show me * way lyrics
* * the way lyrics
* * * way lyrics
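The variants above can be generated mechanically once the stopwords and stop-phrases in a query have been located. The following is a small sketch, assuming each stopword or stop-phrase is given as a range of token positions; the function name `placeholder_variants` and its parameters are hypothetical.

```python
from itertools import product


def placeholder_variants(tokens, stop_spans, placeholder="*"):
    """Build every combination of keeping or masking each stop span.

    `stop_spans` is a list of (start, end) token index ranges, end exclusive.
    Each masked word is replaced by one placeholder, so positions are preserved.
    """
    variants = []
    for choices in product([False, True], repeat=len(stop_spans)):
        out = list(tokens)
        for (start, end), mask in zip(stop_spans, choices):
            if mask:
                out[start:end] = [placeholder] * (end - start)
        variants.append(" ".join(out))
    return variants


tokens = "show me the way lyrics".split()
# "show me" covers tokens 0-1, "the" covers token 2.
for variant in placeholder_variants(tokens, [(0, 2), (2, 3)]):
    print(variant)
# show me the way lyrics
# show me * way lyrics
# * * the way lyrics
# * * * way lyrics
```

The number of variants doubles with each stopword or stop-phrase found, so a long query would probably need some limit on how many combinations are actually searched.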
It’s interesting to see how Google may have attempted to understand whether stopwords were meaningful or not when they appeared in search queries, by comparing results sets with and without the stop words (and by using placeholders in some of the comparisons).
Searches on Google with queries that contain stopwords do seem, much of the time, to provide results that focus upon returning pages that have phrases that contain the stopwords within the query. Sometimes, the results instead show other words where the stopwords originally appeared. For example, a search on “a room for a view” (without the quotation marks) shows results for the phrase “a room with a view.”
Is Google still following the comparison process above, with placeholders for stopwords, or is it doing something else, such as providing a result by expanding a query based upon user data such as query revisions during individuals’ search sessions? Or something else completely?
34 thoughts on “Google Stopwords and Stop-Phrases”
Bill, great article, tons of valuable information! I never knew that Google recently stopped showing notifications about queries containing stop words. And my answer to your question above is that I think they are still following the comparison process, from what I see. Time will tell!
Being someone new to the use of personal computers, I have experienced the problem of trying to find products, nightclubs, movies, etc.
It can be frustrating, and in the end, there are times I have just given up!
And (as you stated) there are times when the stop word is a necessary part of the search.
At least I know now what part of the problem could be. Thank you!
Interesting! Now that you mention it, I do notice that the prompt is no longer shown. I am continually amazed at how much control search engines have over what people see and ultimately purchase. If people are unable to find a result because of stopwords, that has a strong consequence.
Always exciting to hear the constant evolution of Search Engines, thanks for sharing!
How does Google’s ignoring STOP words in searches reconcile with Google giving ‘the’ and ‘in’ authority site status?
Really interesting patent.
I’ve got a better understanding of how Google manages to rewrite title tags using those stop(-phrases|words).
Thanks for sharing,
Thank you. It’s funny when something like the stopwords notifications disappear, because it can be a while before you notice their absence. In the place of stopwords indicators, we may have better phrase-based matching from Google. Honestly, I’d like to see the results of testing that Google may have performed to compare search results from the days where they were telling us that they were ignoring stopwords in results to today’s results.
I remember back when Google would tell me that it had removed stopwords in a search for the “to be or not to be” phrase from Hamlet (without the quotes), and then performing the search a few months later and getting much more relevant results, as if the phrase had been added to an “exceptional phrases” list.
It’s funny that now, a top ten result for that phrase I’m seeing is a page titled “2-Bee or Not-too-bee,” where the alternative “to be or not to be” doesn’t appear on the page. I’m not sure if that is the result of a slightly different comparison process, or an attempt to diversify results, and include a commercial result amongst the references to Shakespeare. 🙂
Thanks. I know what you’re talking about.
Trying to search when you don’t know the right words to use in your query, because you don’t know the correct name for something, or because you don’t know enough about the topic you’re trying to learn about is one of the very real challenges that search engines face.
I’ll often turn to a directory when that happens, so that I can find out more about the topic. I find that can help. So, if you’re looking for a movie name, or a song title, or a place name, or a product, finding a directory geared towards that topic might be helpful.
I’m not quite sure that I understand your question. I’ll try to give you an answer, but if I’m off the mark on what you’re asking about, please feel free to rephrase your question, and I’ll give it another try.
The phrase “authority status” could potentially have a number of different meanings. One form of “authority status” could potentially be sites identified by Google relevance raters as “vital” pages when it comes to a particular query (Nice rundown of the handbook at Pandia in their post The Secret Google Quality Raters’ Handbook).
That really doesn’t have much to do with a stop word similarity analysis.
Google may have used this stopword process in the past, but it’s questionable whether they still do. Look at my post here:
New Google Approach to Indexing and Stopwords.
Good question. If Google was using this kind of “substantially similar” analysis, we don’t know that they’ve stopped. It may still be helpful to see if stopwords in queries are meaningful, or to perform that kind of analysis with placeholders.
The patent provides us with some insights that we didn’t possess before about some processes that the search engine may have been using. Does that information give us some possible insights into other processes? It might.
The discussion about using categories of query terms was interesting, too. A recent Yahoo patent application discussed that topic, which I wrote about in How Using Categories for Queries Can Help Searchers, Writers, and Search Engines. How much attention are you paying to how a search engine might categorize a query term?
One actionable item is to look at the tool that you might be using for keyword research, and see how it treats stopwords.
Search engines do seem to have more and more control.
They are increasingly becoming alternatives to the navigation that sites contain. Not only is there a decent chance that a search engine might deliver someone to a page other than the home page of your site, but with things like the site links that appear under some first results for queries, Google is attempting to provide navigational shortcuts directly to pages within a site. It apparently does that when it believes that showing site links will help someone arrive more quickly at a final destination page, for queries that the search engine believes are navigational in nature.
In some search results, Google is even showing a search box under a first result, a search within a site, that people can use to increase the “likelihood of finding the exact page they are looking for.” Or that Google thinks they might be looking for.
Better phrase-based indexing should mean that a search engine doesn’t have to rely as much on comparing query results with and without stopwords in them, to see if the stopwords are meaningful. I do wonder if that kind of comparison is still being used. It is amazing to see how search engines are evolving.
Thanks for the analysis and update Bill. Stop Words are pretty interesting, imho. My big question is: how is this actionable? Or is this patent more on the personal enrichment side?
Nice article, but not sure that reference to “Styx” was necessary /snicker.
I tried a few queries and could not find any really good examples of stopwords. Tough debate how that patent would actually work sitting here trying to figure it out in my head.
Thanks. I’m not sure the Styx reference was necessary, either. 🙂
I’ve been doing a number of searches with stop words in them since January, to see how Google might be handling searches with stop words. It does seem like the kind of phrase matching process I pointed to in the post I linked to above (New Google Approach to Indexing and Stopwords) has replaced this stopword process, but it’s interesting to think about how the two methods may have differed.
“One actionable item is to look at the tool that you might be using for keyword research, and see how it treats stopwords.”
Ah – thanks, good idea. Although based on how they seem to approach it, wouldn’t you say that perhaps a niche-oriented tool would be more useful?
Very good article, Bill. Frankly speaking, I had not really noticed these stop-words or stop-phrases until I read your article and realized the story behind them. But one thing from my personal experience is that a search without stop words or phrases gives more accurate results than a search with them.
Thank you for your kind words. I do like the idea behind this patent from Google, attempting to try to understand if the stop words or stop phrases in a query are meaningful and important, and should be included in a search.
But I think that it’s better for the search engine and searcher to return results based upon the searchers query, stopwords and all, and if the person searching decides that the stopwords in their query aren’t important, they can then revise their query to not include those words.
Perhaps Google tries to get as much information as possible about a user and tries to determine the user’s intentions. Stop words provide a meaning and a purpose for a search phrase. Without them, it’s not possible to get the true meaning in some cases. There are some language usage issues also. Maybe Google is categorizing users and queries. This is a guess.
I think those are good guesses.
Ignoring stopwords was a shortcut for a search engine because they are words that appear so frequently that searching for them would mean that a search engine would have to use a lot of resources in their search…
This algorithm for finding whether stopwords were “meaningful” exists because there are times when the meaning of some queries is lost when stopwords within those query terms are ignored.
The newer algorithm that may have replaced this one uses a different approach: looking for the least frequently appearing word or words in a search query, and then checking to see if the more commonly occurring words (the words formerly known as stopwords) are nearby. And when those phrases are found, they may be ranked based in part upon user behavior, and upon categories for users, for queries, and for web pages returned.
Try searching google for – Web design and development solutions – you will come up with Contactbridge.com
Search Google for web design development solutions you will come up with a different #1 and Contactbridge.com will come up at about the 3rd page.
Apparently “and” is a stop word.
so then why the difference in the number one ranked site between the two phrases?
Contactbridge.com has been heavily optimized in general and specifically for these words. But the point is the same: if “and” is a stop word and theoretically ignored, the site should have the same rank with either phrase.
This patent was aimed at determining whether a stopword was actually meaningful enough so that it shouldn’t be ignored as a stopword. It appears that Google is not ignoring stopwords like they once did, however.
The results of your search, with and without the “and,” should probably produce differences in ranking because the search engine doesn’t appear to be ignoring words like “and” anymore.
I actually was wondering the same. Using, for example, “and” in a search produces slightly different results than when I do not use “and.” When I was briefly looking through the meta keywords these different sites were giving, I noticed that it obviously does matter whether you use stop words in your keywords or not.
As most people out there will use stop words in their search my conclusion is that I should revisit my already defined ‘old’ keywords and stuff them again with stop words. But this will make it easier to use the keywords as organic keywords in the content as the stop words embed the key words / key phrases better in text. But then again it is work to do without getting paid for… 🙂
Thanks. You raise a really great point.
I’m not sure that it’s a bad idea to return to the keywords that you’ve optimized pages for on a regular basis anyway, and see if they are still what you should be using for the pages that you may have optimized. And, if you’re being paid for SEO services, perhaps that is something that should be included in your services.
It really might be worth considering including ongoing keyword research and review in your SEO practices, and watching for trends and changes in language and search behavior.
If you have site search on a site, are there terms that people are always looking for?
What kinds of changes are your competitors making to their sites in terms of keywords?
In places on the web where people talk about the kinds of things offered on your site, are there new words and phrases in use?
Are the terms that you’ve targeted drawing the kind of attention that you’ve hoped for, or expected?
How much have the search results for those phrases changed since optimizing those pages?
From the phrase “stop words”, does that mean the spiders don’t scan anything after the stop word or just that it skips over it and keeps on scanning?
The idea behind stopwords was that they appear so frequently on pages on the web that it would be pretty time consuming to try to index them everywhere they show up, and not very helpful to finding a resource on the web. One problem though is identifying when words that are often considered stopwords might actually be meaningful terms, such as in the phrases “to be or not to be,” or “The Matrix.”
Bill, what are your thoughts on questions and how Google might use stop words or not use them to bring back relevant results? “When was the Olympics” and “Where were the Olympics” are two different questions, yet removing stop words might give them similar results.
Hi Answer Blip,
I mentioned in my post a new approach from Google, where they seem to be finding phrases regardless of whether or not a query contains stop words. I think to some degree, that may be true with some question and answer type queries. Some question and answer type queries may also trigger Google’s Q&A feature, such as “Where was Derek Jeter Born?”
It’s possible that when you ask a question, Google might try to match the phrase in your question with the results that it shows. It might also try to expand the query that you entered, and results for other similar questions might show up in search results. So a question asking “When was the Olympics” might also include results that show “Where were the Olympics.” Not because the search engine is ignoring stopwords as much as because it might see the questions as very similar.
Google has taken out patents on almost every other thing I search for, and I am just confused about the nature of patents. Suppose Google has this patent on stop words. If I want to use the same list on my site, do I have to get permission for it from Google? I mean, what is the need for registering such patents? Ranking patents are somewhat understandable and make some sense, but such generalized patents are confusing for me. How can they benefit Google?
The idea of stop words themselves has been around for years. What this patent covers isn’t so much the idea behind stopwords, but rather that sometimes stopwords are meaningful and should be indexed. The following words are stopwords: to, be, or, not. But when you put them together into the phrase, “to be or not to be,” they are words that should be indexed. The phrase “the matrix” could be about a particular mathematical structure, especially if you don’t index the word “the” because it’s often treated as a stop word. But the phrase is the title of a fairly popular movie, and if “the” is treated as a stopword instead of an actually meaningful word in that instance, then Google wouldn’t be able to show search results about the movie very easily when someone searches for [the matrix].
The patent doesn’t cover the “idea” of stopwords, and you’re free to use them if you want to if you decide to set up a search engine. What it does cover is the process of determining whether a word that might be considered a stopword is instead meaningful in some way, so that it shouldn’t be ignored when Google includes it in its index.