New Google Approach to Indexing and Stopwords

Screenshot of Google results for a search on “a room with a view.”

Not too long ago, if you searched Google for the phrase “a room with a view” (without quotation marks), you might have received a warning that your query contained “stop words.”

Stop words are words that appear so frequently in documents and on web pages that search engines would often ignore them when indexing the words on pages. These could be words like: a, and, is, on, of, or, the, was, with.

Goodbye to stop words?

In that search for “a room with a view,” you might have received results like “a room for a view,” or “room to view,” or other phrases that replaced some stop words with others. That made it less likely that you would find exactly what you were looking for when you searched for a phrase with stop words in it.

I’m not seeing Google ignore stop words any more. Last week, Dan Thies asked, “Stop Words Are Dead! Did I Miss Another Memo?”

This newly granted Google patent seems to hold some answers to the disappearance of stop words, and potentially to a number of other indexing issues at Google:

Document compression scheme that supports searching and partial decompression
Invented by Olcan Sercinoglu
Assigned to Google
US Patent 7,319,994
Granted January 15, 2008
Filed May 23, 2003

The abstract isn’t easy reading, but it’s the summary that the inventor gave to the patent, so it’s worth looking at:

One embodiment of the present invention provides a system that facilitates accessing a compressed representation of a set of documents, wherein the compressed representation supports searching and partial decompression.

During operation, the system receives a search request containing terms to be searched for in the set of documents. In response to the search request, the system identifies occurrences of the terms in the set of documents by following pointers through the compressed representation.

This compressed representation encodes occurrences of a term as a pointer to the next occurrence of the term to facilitate rapid enumeration of the occurrences of the term. Moreover, the compressed representation maintains sequential ordering between adjacent terms in the set of documents, which allows fast access to neighboring terms.
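
To make the abstract a little more concrete, here is a minimal Python sketch of the two ideas it names: each occurrence of a term points to the next occurrence of that term, and the terms stay in document order so neighbors are easy to reach. The function names (`build_next_pointers`, `occurrences`) and the sample sentence are my own assumptions for illustration, and nothing is actually compressed here, so this is not the patent's encoding.

```python
def build_next_pointers(tokens):
    """For each position, record where the same term occurs next (or None).

    Hypothetical illustration only: the patent describes a compressed
    encoding, while this sketch keeps everything in plain Python lists.
    """
    next_pos = [None] * len(tokens)
    first_pos = {}   # term -> position of its first occurrence
    last_seen = {}   # term -> most recent position seen so far
    for i, term in enumerate(tokens):
        if term in last_seen:
            next_pos[last_seen[term]] = i
        else:
            first_pos[term] = i
        last_seen[term] = i
    return first_pos, next_pos


def occurrences(term, first_pos, next_pos):
    """Enumerate a term's positions by following the pointer chain."""
    pos = first_pos.get(term)
    while pos is not None:
        yield pos
        pos = next_pos[pos]


tokens = "a room with a view and a room to view".split()
first_pos, next_pos = build_next_pointers(tokens)

room_positions = list(occurrences("room", first_pos, next_pos))
# Because the tokens stay in sequence, the neighbor of each hit is one step away.
words_after_room = [tokens[p + 1] for p in room_positions if p + 1 < len(tokens)]
```

In this sketch, finding every “room” means following two pointers, and checking what comes right after each one is a single list lookup.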

There are lots of implications behind this beyond stopwords disappearing. The patent does directly address indexing using stop words:

Typically, given a query, the performance bottleneck is the time it takes to decode the occurrences (which are typically delta encoded to save space, and thus have to be followed from the beginning) of the most frequently occurring term, especially if this term is a so-called stop-word such as “the”.
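
The phrase “delta encoded” in that passage refers to storing the gap between one occurrence and the next rather than each position itself. A small Python sketch, with made-up positions for the word “the” and function names of my own choosing, shows why such a list has to be read from the beginning:

```python
def delta_encode(positions):
    """Store each position as the gap from the previous one."""
    deltas = []
    previous = 0
    for p in positions:
        deltas.append(p - previous)
        previous = p
    return deltas


def delta_decode(deltas):
    """Rebuild positions with a running sum, starting from the first gap."""
    positions = []
    total = 0
    for d in deltas:
        total += d
        positions.append(total)
    return positions


# Hypothetical positions of "the" in a document (assumed for illustration).
the_positions = [3, 7, 8, 15, 21]
encoded = delta_encode(the_positions)
decoded = delta_decode(encoded)
```

The gaps are smaller numbers than the positions, which saves space, but no single position can be recovered without adding up every gap before it. For a word as common as “the,” that running sum is the bottleneck the patent describes.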

The system would look for the less popular terms that appear in the query, and then look to see if the stop words in the query are nearby.

We are also told that searches for phrases under this system would become much quicker:

Note that in particular, phrase matches would become much faster since we would only need to decode a limited number of terms that are immediately after or before the least-popular term. This operation would have the time complexity O(K*L*N) where K is the term identifier encoding frequency (discussed earlier), L is the length of the phrase, and N is the number of occurrences of the least-frequent term in the phrase.
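
As a rough illustration of that idea, the Python sketch below finds a phrase by starting from its least frequent term and checking only the words immediately around each occurrence of it. The names `build_positions` and `find_phrase` and the sample sentence are assumptions of mine, and the sketch works on an uncompressed word list, so it ignores the K factor from the patent, which relates to decoding cost.

```python
from collections import defaultdict


def build_positions(tokens):
    """Map each term to the list of positions where it occurs."""
    positions = defaultdict(list)
    for i, term in enumerate(tokens):
        positions[term].append(i)
    return positions


def find_phrase(tokens, positions, phrase):
    """Return start positions of `phrase`, anchoring on its rarest term."""
    if not phrase:
        return []
    # Pick the phrase term with the fewest occurrences (N in the patent's notation).
    anchor_offset = min(
        range(len(phrase)),
        key=lambda i: len(positions.get(phrase[i], [])),
    )
    matches = []
    for pos in positions.get(phrase[anchor_offset], []):
        start = pos - anchor_offset
        if start < 0 or start + len(phrase) > len(tokens):
            continue
        # Compare only the L words around this occurrence of the rare term.
        if tokens[start:start + len(phrase)] == phrase:
            matches.append(start)
    return matches


text = "she booked a room with a view but he wanted a room for a view of the sea"
tokens = text.split()
positions = build_positions(tokens)

with_view = find_phrase(tokens, positions, "a room with a view".split())
for_view = find_phrase(tokens, positions, "a room for a view".split())
```

In the sample sentence, “a” appears four times but “with” only once, so the search for “a room with a view” inspects a single window of five words instead of walking through every “a.” The work grows with the phrase length and the number of occurrences of the rarest term, which matches the L and N in the quoted complexity.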

Related patent filings

I’ve written before about some related patent documents that explore some other processes that work with aspects of the compression method described in this patent.

Google looks at multi-stage query processing, which describes a way that searches could be processed in a number of stages, under the patent application: Multi-stage query processing system and method for use with tokenspace repository

Google on Multi-Tiered Indexing and Multi-Staged Query Processing explores Google’s patent System and method for encoding and decoding variable-length data

A reason for the loss of supplemental results, too?

Back in December, a post at the Official Google Blog told about The Ultimate Fate of Supplemental Results. In that post, Google told us that “rather than searching some part of our index in more depth for obscure queries, we’re now searching the whole index for every query.”

Use of the indexing processes in these three patent filings might explain some changes to the results that we see in Google, if they are being used. Might they also account for the disappearance of supplemental results? What do you think?

36 thoughts on “New Google Approach to Indexing and Stopwords”

  1. Supplementals are still there. No loss in their numbers.
    Actually their count grew in the past few days for me.

    PS: To find them, search site:domain.com/* to list the non-supplementals.
    Some disagree with this method but I keep a close eye on them and only pages that show up for this search (/*) do get any kind of traffic.

  2. Bill, I’d like to see how the future of supplemental indexing may change as a result of this, even if it only begins to take shape with time. I can see that this may be very valid.

    Thanks!

  3. Hi Kim,

    What they describe in this patent is a way of compressing the data in their index in a new way, and only decompressing parts that they want to look at when they need to. Sounds complicated, but what it might mean is that the size of the index is smaller while containing more information, organized in a different way – and there may not be a need to split up the index into a main section and an extended, or supplemental database.

    The Official Google Blog post that I linked to above, The Ultimate Fate of Supplemental Results, states that they are searching their whole index for every query. If there is still an additional database, which there possibly might be, it’s being searched too.

    So one impact would be deeper searches. But there are others, like the ability to index phrases better, and to include results that pay attention to what used to be stop words.

    Hi Christopher.

    Thanks. I used to laugh in my undergraduate English classes when we would talk about “deconstructing” literature. Now I find myself doing that same thing to patents.

    Hi Subliminal,

    There definitely is something to what you’re describing with that search. I’ve done some testing on stop words, and that does seem to have changed from the old method. If the “multi-staged query processing” described in one of the other related patent filings has kicked in, that could provide some interesting results, too.

    If Google claims to have gotten rid of supplementals, like they did in their Blog post, then just what do we call those results that aren’t showing up in the modified search that you suggest? And why would the numbers be getting larger? It’s worth exploring.

    Hi Wil,

    It’s good to see you. Hope that you are doing well.

    Hi Dan,

    Definitely some interesting stuff here. I’m exploring some of the results, keeping the stop words, and the multi-stage query processing stuff in mind.

    I think that one is worth exploring in more detail now, too.

    Hi Zak,

    Thanks for stopping by and commenting.

  4. Indeed. But I keep an eye out for traffic on those pages.
    Any page that shows for /* query brings traffic, the others are ‘dead’. Once a ‘dead’ page pops out, it starts getting traffic and when it ‘dies’ again traffic vanishes. And numbers fluctuate chaotically with no obvious pattern.

    Not explicitly advertising pages as supplementals does not mean they are gone. Look at most scraper sites. They have less than 1% in the main index. It’s cool to supplemental those sites but so many whitehat sites suffer from this too … .

    The actual truth is that it seems Google goes to any length to take down new (not yet popular) websites and move them to hosted solutions (like blogger … and many other such services) on strong domains in order to rank, and I for one will never agree with that.

    Regards.

  5. I agree that Google no longer shows the warnings on stop words, but it still ignores some if they’re not used in a popular key phrase.

    If you make a query on “the best things on the net,” in the first part, “the best things,” the word “the” is not considered a stop word, but the words “on the” are still considered stop words. Just make that query and look at the snippet on result number 8 here:

    “This is the best thing that’s ever happened to me!” says Brian Clark on Page Rank Decreases. … Here’s to longevity on thenet.

    You don’t see “on the” shown in bold here, so those words are being treated as stop words.

  6. Hi Eric,

    The multistaged query processing approach does do some query expansion, and may include some other variations which would include other phrases that may appear similar. That might make it appear like they are treating those words as stop words.

    But a search for “the best things on the net” with the quotation marks included only shows 79 results. So I wonder whether, if there just aren’t that many results for a phrase (searched for without quotation marks), Google might consider other wordings as more relevant, especially if all of the words searched for appear on those pages.

  7. Hi 5ubliminal,

    Yes, there are aspects of what Google is doing that are difficult to explain, and some of the potential reasons could appear questionable.

    But I’m not sure that we know enough to make claims about their intent. They do bear watching, though.

  8. Pingback: Stop Words nos resultados - Marketing de Busca
  9. You’re welcome.

    It’s interesting to see the way that search engines may be transforming how they work, and not stripping out stop words from searches may mean that phrase searching will work better in the future. :)

    I think that’s a big step forward…

  10. I noticed the stop words change but wasn’t sure about it. It seems that a couple of improvements to search are going on at Google Inc.

  11. Hi Elite,

    I think Google is continuously making changes to how its search algorithms work. Some may have a large impact on how search works, while others may only impact a small number of searches, but it is happening. One of Google’s engineers stated in an interview this summer that Google had made more than 450 changes to its ranking algorithms last year. Chances are good that they’ve continued to make changes this year as well.

  12. Sorry for bumping this old thread, but I think stop words are an interesting topic, and I’m currently doing some research in this area, trying to understand the algorithm’s logic in dealing with them. This patent did shed some light.

  13. Hi Clinton,

    This was an interesting patent. I’m not sure if you’ve looked into some of Google’s patents on phrase-based indexing as well, but you might find those interesting. Words that might be considered to be stopwords may also be considered to be parts of “good phrases” that would be indexed. For example, if you search at Google for “the matrix,” you stand a good chance of seeing many more references to the movie “the matrix” than you do to mathematical matrices.

  14. Is there any official reference or article from Google with a comprehensive list of stopwords? It makes sense to me to avoid using stopwords in the title tag of an article, but avoiding them in the description or text is almost impossible.

    It is now 2010. I searched for “a room with a view” (without quotes), but I didn’t get any warning about stopwords.

    Has Google updated its policy on stopwords? If so, please reply.

    Your guidance will help me learn new things.

    Rds, Gaurav.

  15. Hi Gaurav,

    Google hasn’t shown warnings about stopwords being included in a query in a few years now.

    I don’t believe that there is any official reference anywhere on which terms Google might consider stop words, but that’s fine – if you read through this post you might get the idea that Google does now index words that might be considered stopwords, and that it has a way of analyzing those to see if they are meaningful.

    To use an example that I’ve included in a few comments above, the word “the” might be considered a stop word, but when you include it in a query such as “the matrix,” it becomes much more meaningful as part of the name of a specific movie. This approach in determining whether or not certain very frequently appearing words might be meaningful or not is a much better approach than just excluding “stop” words from a search.

  16. I learned about the stop words issues several years ago and can understand the indexing issues. One solution I use is using the “&” in the place of “and”. I’ve gotten so used to writing page titles without stop words that it’s like second nature! :-0). Thanks for the awesome post!

  17. Hi Alexander,

    The problem isn’t so much that Google is or isn’t indexing stop words, but rather that sometimes stop words are meaningful. For instance, when the word “the” is used in the phrase “the matrix,” chances are that the movie of that name is being referred to, rather than some random mathematical matrix. Years ago, when you performed a search for the phrase “to be or not to be” (without quotation marks), you wouldn’t get very good search results. Now, you get some great Hamlet references for that search.

  18. Well, thank you so much for the tutorial. It is helpful for better understanding SEO.