New Google Approach to Indexing and Stopwords

screen shot of Google results on a search for a room with a view.

Not too long ago, if you searched Google for the phrase “a room with a view” (without quotation marks), you might have received a warning that your query contained “stop words.”

Stop words are words that appear so frequently in documents and on web pages that search engines would often ignore them when indexing the words on pages. These could be words like: a, and, is, on, of, or, the, was, with.

Goodbye to stop words?

In that search for “a room with a view,” you might have received results like “a room for a view,” or “room to view,” or other phrases that replaced some stop words with others. That made it less likely that you would find exactly what you were looking for when you searched for a phrase with stop words in it.
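
To see why those different phrases could end up looking the same to a search engine, here is a minimal sketch of stop word stripping. The stop word list and the `strip_stop_words` helper are hypothetical; the list uses the examples above plus “for” and “to,” which I added so that the sample phrases collapse. It is not how Google actually processed queries.

```python
# Hypothetical stop-word list: the examples named in the post, plus "for" and "to"
# (added here as an assumption so the sample phrases collapse to the same terms).
STOP_WORDS = {"a", "and", "is", "on", "of", "or", "the", "was", "with", "for", "to"}


def strip_stop_words(query):
    """Lowercase the query, split it on whitespace, and drop any stop words."""
    return [word for word in query.lower().split() if word not in STOP_WORDS]


queries = ["a room with a view", "a room for a view", "room to view"]
reduced = [strip_stop_words(q) for q in queries]

for query, terms in zip(queries, reduced):
    print(f"{query!r} -> {terms}")
```

Once the stop words are gone, all three queries reduce to the same two terms, which is one way the original phrase could get lost among its near neighbors.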

I’m not seeing Google ignoring stop words any more. Last week, Dan Thies asked, “Stop Words Are Dead! Did I Miss Another Memo?”

This newly granted Google Patent seems to hold some answers to the disappearance of stop words, and to potentially a number of other indexing issues from Google:

Document compression scheme that supports searching and partial decompression
Invented by Olcan Sercinoglu
Assigned to Google
US Patent 7,319,994
Granted January 15, 2008
Filed May 23, 2003

The abstract isn’t easy reading, but it’s the summary that the inventor gave to the patent, so it’s worth looking at:

One embodiment of the present invention provides a system that facilitates accessing a compressed representation of a set of documents, wherein the compressed representation supports searching and partial decompression.

During operation, the system receives a search request containing terms to be searched for in the set of documents. In response to the search request, the system identifies occurrences of the terms in the set of documents by following pointers through the compressed representation.

This compressed representation encodes occurrences of a term as a pointer to the next occurrence of the term to facilitate rapid enumeration of the occurrences of the term. Moreover, the compressed representation maintains sequential ordering between adjacent terms in the set of documents, which allows fast access to neighboring terms.
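
The two ideas in that last paragraph, a pointer from each occurrence of a term to its next occurrence and terms kept in their original order, can be sketched roughly as follows. This is only an illustration under my own assumptions: the names `build_tokenspace` and `occurrences` are made up, and the sketch does no actual compression, which is a large part of what the patent covers.

```python
def build_tokenspace(tokens):
    """Store each token next to a pointer to the next occurrence of the same term.

    Returns (slots, first), where slots[i] is (term, next_index_or_None) and
    first maps each term to the index of its first occurrence.
    """
    slots = [None] * len(tokens)
    next_seen = {}
    # Walk backwards so that the "next occurrence" of each term is already known.
    for i in range(len(tokens) - 1, -1, -1):
        term = tokens[i]
        slots[i] = (term, next_seen.get(term))
        next_seen[term] = i
    # After the backwards walk, next_seen holds the first occurrence of each term.
    return slots, dict(next_seen)


def occurrences(slots, first, term):
    """Enumerate the positions of a term by following the next-occurrence pointers."""
    pos = first.get(term)
    while pos is not None:
        yield pos
        pos = slots[pos][1]


tokens = "a room with a view and a room for a view".split()
slots, first = build_tokenspace(tokens)

room_positions = list(occurrences(slots, first, "room"))
# Sequential order is preserved, so the neighbors of a hit are one slot away.
neighbors = [(slots[p - 1][0], slots[p + 1][0]) for p in room_positions]
print(room_positions, neighbors)
```

Because the tokens stay in document order, finding the words on either side of a match is a matter of looking one slot to the left or right, rather than consulting a separate list for each neighboring word.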

There are lots of implications behind this beyond stopwords disappearing. The patent does directly address indexing using stop words:

Typically, given a query, the performance bottleneck is the time it takes to decode the occurrences (which are typically delta encoded to save space, and thus have to be followed from the beginning) of the most frequently occurring term, especially if this term is a so-called stop-word such as “the”.
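
For readers who haven't run into delta encoding before, here is a small sketch of the general technique. The positions are made up for illustration, and the function names are my own; nothing here is taken from the patent itself.

```python
def delta_encode(positions):
    """Replace each sorted position with its gap from the previous position."""
    gaps = []
    previous = 0
    for position in positions:
        gaps.append(position - previous)
        previous = position
    return gaps


def delta_decode(gaps):
    """Rebuild the positions by summing the gaps from the very beginning."""
    positions = []
    running_total = 0
    for gap in gaps:
        running_total += gap
        positions.append(running_total)
    return positions


# Made-up positions for a frequent term such as "the".
the_positions = [3, 7, 8, 15, 42]
the_gaps = delta_encode(the_positions)
print(the_gaps)
print(delta_decode(the_gaps))
```

The gaps are smaller numbers than the positions, so they take less space to store, but the only way to learn where the last occurrence sits is to add up every gap before it. For a word that appears as often as “the,” that is a lot of adding.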

The system would look for the less popular terms that appear in the query, and then look to see if the stop words in the query are nearby.

We are also told that searches for phrases under this system would become much quicker:

Note that in particular, phrase matches would become much faster since we would only need to decode a limited number of terms that are immediately after or before the least-popular term. This operation would have the time complexity O(K*L*N) where K is the term identifier encoding frequency (discussed earlier), L is the length of the phrase, and N is the number of occurrences of the least-frequent term in the phrase.
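
A rough sketch of that idea: find the least frequent word in the phrase, visit only its occurrences, and check the words immediately around each one. This is my own simplified illustration, using a plain uncompressed token list and the hypothetical helpers `index_positions` and `phrase_matches`, not the encoding from the patent.

```python
from collections import defaultdict


def index_positions(tokens):
    """Map each term to the list of positions where it occurs."""
    positions = defaultdict(list)
    for i, term in enumerate(tokens):
        positions[term].append(i)
    return positions


def phrase_matches(tokens, positions, phrase):
    """Return the start positions of exact matches for the phrase."""
    words = phrase.split()
    if any(word not in positions for word in words):
        return []
    # Anchor on the least frequent word so the fewest candidates are checked.
    anchor = min(range(len(words)), key=lambda k: len(positions[words[k]]))
    matches = []
    for pos in positions[words[anchor]]:
        start = pos - anchor
        if start < 0 or start + len(words) > len(tokens):
            continue
        # Only the handful of terms right around the anchor are inspected.
        if tokens[start:start + len(words)] == words:
            matches.append(start)
    return matches


text = "a room with a view is not a room for a view or a room to view"
tokens = text.split()
positions = index_positions(tokens)

print(phrase_matches(tokens, positions, "a room with a view"))
print(phrase_matches(tokens, positions, "a view"))
```

In the first search the anchor is “with,” which appears once, so only one candidate is checked even though “a” appears five times. The work grows with the number of occurrences of the rarest word and the length of the phrase, which is the spirit of the N and L in the patent's estimate.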

Related patent filings

I’ve written before about some related patent documents that explore other processes that work with aspects of the compression method described in this patent.

Google looks at multi-stage query processing, which describes a way that searches could be processed in a number of stages, under the patent application: Multi-stage query processing system and method for use with tokenspace repository

Google on Multi-Tiered Indexing and Multi-Staged Query Processing explores Google’s patent System and method for encoding and decoding variable-length data

A reason for the loss of supplemental results, too?

Back in December, a post at the Official Google Blog told about The Ultimate Fate of Supplemental Results. In that post, Google told us that “rather than searching some part of our index in more depth for obscure queries, we’re now searching the whole index for every query.”

Use of the indexing processes in these three patent filings might explain some changes to the results that we see in Google, if they are being used. Might they also account for the disappearance of supplemental results? What do you think?

36 comments to New Google Approach to Indexing and Stopwords

  • Hey Bill,
    I think you’re on to something here! Great post and deconstruction of this information!

  • 5ubliminal

    Supplementals are still there. No loss in their numbers.
    Actually their count grew in the past few days for me.

    PS: To find them, search site:domain.com/* to find the non-supplementals.
    Some disagree with this method but I keep a close eye on them and only pages that show up for this search (/*) do get any kind of traffic.

  • Thanks Bill!

    Interesting take on the whole stop words issue, thanks for sharing!

  • What a great movie, too. Sweet, sweet Helena…

    Excellent find, Bill. Very revealing.

  • Great insight Bill, thanks for sharing.

  • spostareduro

    Bill I’d like to see how the future of the supplemental indexing may change as a result of this..even if it begins to take shape with time..I can see that this may be very valid.

    Thanks!

  • Hi Kim,

    What they describe in this patent is a way of compressing the data in their index in a new way, and only decompressing parts that they want to look at when they need to. Sounds complicated, but what it might mean is that the size of the index is smaller while containing more information, organized in a different way – and there may not be a need to split up the index into a main section and an extended, or supplemental database.

    The Official Google Blog post that I linked to above, The Ultimate Fate of Supplemental Results, states that they are searching their whole index for every query. If there is still an additional database, which there possibly might be, it’s being searched too.

    So one impact would be deeper searches. But there are others, like the ability to index phrases better, and to include results that pay attention to what used to be stop words.

    Hi Christopher.

    Thanks. I used to laugh in my undergraduate English classes when we would talk about “deconstructing” literature. Now I find myself doing that same thing to patents.

    Hi Subliminal,

    There definitely is something to what you’re describing with that search. I’ve done some testing on stop words, and that does seem to have changed from the old method. If the “multi-staged query processing” described in one of the other related patent filings has kicked in, that could provide some interesting results, too.

    If Google claims to have gotten rid of supplementals, like they did in their Blog post, then just what do we call those results that aren’t showing up in the modified search that you suggest? And why would the numbers be getting larger? It’s worth exploring.

    Hi Wil,

    It’s good to see you. Hope that you are doing well.

    Hi Dan,

    Definitely some interesting stuff here. I’m exploring some of the results, keeping the stop words, and the multi-stage query processing stuff in mind.

    I think that one is worth exploring in more detail now, too.

    Hi Zak,

    Thanks for stopping by and commenting.

  • 5ubliminal

    Indeed. But I keep an eye out for traffic on those pages.
    Any page that shows for /* query brings traffic, the others are ‘dead’. Once a ‘dead’ page pops out, it starts getting traffic and when it ‘dies’ again traffic vanishes. And numbers fluctuate chaotically with no obvious pattern.

    Not explicitly advertising pages as supplementals does not mean they are gone. Look at most scraper sites. They have less than 1% in the main index. It’s cool to supplemental those sites but so many whitehat sites suffer from this too … .

    The actual truth is that it seems Google goes any length to take down new (not yet popular) websites and move them to hosted solutions (like blogger … and many other such services) on strong domains in order to rank and I for one will never agree with that.

    Regards.

  • Another great find Bill. Particularly liked Google’s return for SEO BY THE SEA

  • I agree that Google no longer shows the warnings about stop words, but it still ignores some of them if they are not used in a popular key phrase.

    If you search for “the best things on the net,” the word “the” in the first part, “the best things,” is not treated as a stop word, but the words “on the” are still treated as stop words. Just run that query and look at the snippet for result number 8:

    “This is the best thing that’s ever happened to me!” says Brian Clark on Page Rank Decreases. … Here’s to longevity on thenet.

    You don’t see “on the” shown in bold here, so those words are still being treated as stop words.

  • [...] Other changes Google makes regarding its indexing algorithm and stopwords. __________________ Design | Community | Blog | Directory by [...]

  • Hi Eric,

    The multistaged query processing approach does do some query expansion, and may include some other variations which would include other phrases that may appear similar. That might make it appear like they are treating those words as stop words.

    But, a search for “the best things on the net” with the quotation marks included only shows 79 results. So, I wonder whether, if there just aren’t that many results for a phrase (searched for without quotation marks) Google might consider other wordings as more relevant, especially if all of the words searched for appear on those pages.

  • Thanks Charlie,

    I like Google’s results for SEO by the Sea, too.

  • Hi 5ubliminal,

    Yes, there are aspects of what Google is doing that are difficult to explain, and some of the potential reasons could appear questionable.

    But I’m not sure that we know enough to make claims about their intent. They do bear watching, though.

  • [...] For starters, Google doesn’t always ignore stopwords. The Fly and Fly produce different search results. Beyond that, “or” is sometimes assumed to be a word you’re searching on, not an operator — for an example, try live free or die and see the line of text that comes back under the search box. (I’m not sure whether this ever works for “and” as well — even Sanford and Son returns the usual harangue that “the AND operator is unnecessary”.) This is all a pretty clear indicator that Google is looking at phrases. Bill Slawski’s patent-analysis-heavy SEO blog has a lot more to say on that subject, specifically on an indexing scheme that addresses the problems that indexing stopwords in might otherwise cause. [...]

  • This has answered a lot of pending questions. Thanks.

  • [...] New around here? Download the ebook and subscribe to the blog feed. Thanks for visiting! Stop words are articles, prepositions, pronouns, and short, common words that search engines ignore in users’ searches, e.g.: o, a, de… Recently, several blogs noticed that Google has stopped showing warnings along the lines of “a” is a common word and was not included in your search. That may mean that Google is already including the proximity relationships between words in the results it presents to users. There is even a Google patent that discusses the presence of these words in results: The system would look for the less popular terms that appear in the query, and then look to see if the stop words in the query are nearby. [...]

  • [...] A few days ago, SEOs in the USA had already noticed that Google no longer points out when a search query contains stop words such as “in” or “und.” However, the sources were not yet sure whether only the notice is missing, or whether the search results are affected as well. [...]

  • [...] New Google Approach to Indexing and Stopwords [...]

  • Thanks for the article. I would never have thought I would be sitting here pondering stop words for the last 30 minutes.

  • You’re welcome.

    It’s interesting to see the way that search engines may be transforming how they work, and not stripping out stop words from searches may mean that phrase searching will work better in the future. :)

    I think that’s a big step forward…

  • [...] In fact, the latest advancement of search has Google recently putting stop words (I, and, in, the, etc) back into Search. Barry discusses it here with links to Bill and Dan. [...]

  • [...] As reported in detail by Dan Thies and Bill Slawski, formerly labeled stop words aren’t causing enormous stirs in the SEO world.  But for copywriters, there is reason to take note. [...]

  • [...] In fact, the latest search advancement has Google recently putting these words (I, and, in, the, etc.) back into search. Barry explains this here, with links to Bill and Dan. [...]

  • [...] In fact, the latest search advancement at Google now includes articles, apostrophes, and words such as “etc.” and “before” in search results. Barry discusses this new addition here, with links to Bill and Dan. [...]

  • [...] In January, I wrote a post titled New Google Approach to Indexing and Stopwords, which explored a new approach to indexing the content of Web pages, and compressing and uncompressing parts of a search engine index that appears to allow for better indexing and retrieval of phrases in a search index. [...]

  • I noticed the stop words change but wasn’t sure about it. It seems that a couple of improvements to search are going on at Google Inc.

  • Hi Elite,

    I think Google is continuously making changes to how its search algorithms work. Some may have a large impact on how search works, while others may only impact a small number of searches, but it is happening. One of Google’s engineers stated in an interview this summer that Google had made more than 450 changes to its ranking algorithms last year. Chances are good that they’ve continued to make changes this year as well.

  • [...] Stop Words for search engines Are Stop Words Dead? New Google Approach to Indexing and Stopwords Stop Words Are Dead! Did I Miss Another Memo? __________________ The Penn State Ticket Man [...]

  • Clinton Barett

    Sorry for bumping this old thread, but I think stop words are an interesting topic, and I’m currently doing some research in this area, trying to understand the algorithm’s logic in dealing with them. This patent did shine some light.

  • Hi Clinton,

    This was an interesting patent. I’m not sure if you’ve looked into some of Google’s patents on phrase-based indexing as well, but you might find those interesting. Words that might be considered to be stopwords may also be considered to be parts of “good phrases” that would be indexed. For example, if you search at Google for “the matrix,” you stand a good chance of seeing many more references to the movie “the matrix” than you do to mathematical matrices.

  • Is there any official reference or article from Google with a comprehensive list of stopwords? It makes sense to me to avoid using stopwords in the title tag of an article, but avoiding them in the description or text is almost impossible.

    It is now 2010. I searched for “a room with a view” (without quotes), but I didn’t get any warning about stopwords.

    Has Google updated its policy on stopwords? If so, please do reply.

    Your guidance will help me learn new things.

    Rds, Gaurav.

  • Hi Gaurav,

    Google hasn’t shown warnings about stopwords being included in a query in a few years now.

    I don’t believe that there is any official reference anywhere on which terms Google might consider stop words, but that’s fine – if you read through this post you might get the idea that Google does now index words that might be considered stopwords, and that it has a way of analyzing those to see if they are meaningful.

    To use an example that I’ve included in a few comments above, the word “the” might be considered a stop word, but when you include it in a query such as “the matrix,” it becomes much more meaningful as part of the name of a specific movie. This approach in determining whether or not certain very frequently appearing words might be meaningful or not is a much better approach than just excluding “stop” words from a search.

  • I learned about the stop words issues several years ago and can understand the indexing issues. One solution I use is using the “&” in the place of “and”. I’ve gotten so used to writing page titles without stop words that it’s like second nature! :-0). Thanks for the awesome post!

  • Hi Alexander,

    The problem isn’t so much that Google is or isn’t indexing stop words, but rather that sometimes stop words are meaningful. For instance, when the word “the” is used in the phrase “the matrix,” chances are that the movie of that name is being referred to, rather than some random mathematical matrix. Years ago, when you performed a search for the phrase “to be or not to be” (without quotation marks), you wouldn’t get very good search results. Now, you get some great Hamlet references for that search.

  • Well, thank you so much for the tutorial. It is helpful for better understanding SEO.
