In the post 10 Most Important SEO Patents, Part 5 – Phrase-Based Indexing, I wrote about a newsletter that Google’s then Head of Webspam sent to librarians. It described the inverted index that Google used to organize terms in its index of the web. It is no longer available online, but it was a great way for SEOs to learn how Google’s index worked.
Matt Cutts Writes the First Librarian Newsletter on a Google Inverted Index
Besides ranking documents based upon the quality and quantity of links pointing to a page, Google also looks at whether the query terms searched for also appear upon specific pages. Google’s Matt Cutts wrote one of the best descriptions of how Google does this in the first Google Librarian Newsletter. The newsletter appears to have disappeared from the Web not too long ago. But, I found a copy on the University of Michigan website. It was a highly recommended document. Unfortunately, it is no longer available on the internet archive as of the past week.
That first newsletter asked and answered the question, “How does Google collect and rank results?” If you were able to read it, you would have seen it refer to “posting lists.” A posting list records, for each term in the inverted index of the Web, the documents in which that term appears. Google matches the terms in a query against those lists to find documents on the Web. It appears that a tweet from Nicholas McDonough has returned a link to a copy of that post:
Required reading. A lot of useful info…
I think I found the piece you mentioned by Matt Cutts. It might not be the exact one but it's close 🙂https://t.co/sC3tltW7jR
— Nicholas McDonough (@Callmenicholi) July 10, 2021
The Matt Cutts post is: How does Google collect and rank results?
This was very helpful to an SEO learning about how Google’s inverted index worked, and it had me interested in learning more about information retrieval.
You can look at the phrase-based indexing patent I linked below. You will see references to how phrases are in posting lists as well. It is impossible to tell if Google has actually done the work of making an inverted index of phrases on the web that would work with phrase-based indexing. Having around 20 related patents about Phrase-based indexing shows that they have spent a lot of time working on the processes behind phrase-based indexing.
An Inverted Index is an Information Retrieval Approach to Indexing the Web
This is one of the information retrieval approaches to making an index. It involves creating an inverted index of terms found in documents on the web. If a query contains more than one word, Google will try to return search results that consist of the pages that contain all of the words found in the query. That is the intersection of the posting lists for those words, just as Matt Cutts describes a Google inverted index in his newsletter article for librarians.
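Here is a minimal sketch of that idea in Python. It is not Google's code. The function names (`build_inverted_index`, `intersect`, `search`) and the three tiny documents are made up for illustration, and a real index would also handle tokenization, stemming, and compression. It builds a posting list for each term and then answers a multi-word query by keeping only the pages that appear in every term's posting list.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to a sorted posting list of the document IDs containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

def intersect(a, b):
    """Walk two sorted posting lists together, keeping IDs found in both."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out

def search(index, query):
    """Return the IDs of documents that contain every term in the query."""
    terms = query.lower().split()
    if not terms:
        return []
    # Start with the shortest posting list so intermediate results stay small.
    lists = sorted((index.get(t, []) for t in terms), key=len)
    result = lists[0]
    for posting_list in lists[1:]:
        result = intersect(result, posting_list)
    return result

# Made-up documents, for illustration only.
docs = {
    1: "library of babel short story",
    2: "inverted index of the web",
    3: "the library index",
}
index = build_inverted_index(docs)
print(search(index, "library index"))  # [3]
print(search(index, "of the"))         # [2]
```

The lookup never scans the documents themselves at query time. It only reads the posting lists for the query terms, which is what makes an inverted index practical at the scale of the web.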
Stanford University has a page, A first take at building an inverted index, that does a nice job of illustrating how an inverted index works. This is one of the information retrieval-based approaches to indexing the Web that search engines use. Google’s innovation was to build its Web index on an inverted index and then also use PageRank to sort and rank the pages it displayed in search results.
Search Results Ordered By a Combination of Information Retrieval Score and Authority Score
Google may calculate an information retrieval (IR) score based on whether query terms appear on the page according to the inverted index. It can also look at the location of those query terms on the page. So a page with a query term in a more important place on the page, such as the page title, may rank higher than if the query term was in paragraph-based content on the page. In addition to the IR score, Google calculates an authority score based on a link-based analysis such as PageRank, and combines the two. Ranking on that combined score meant that a different set of pages ranked highly for a query at Google than at other search engines.
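To make that concrete, here is a rough sketch in Python of combining an IR score with an authority score. Everything specific in it is an assumption for illustration: the field weights, the `alpha` blending parameter, the linear combination, and the example pages and authority values are all made up. Google has not published how it weighs these signals.

```python
# Hypothetical weights: a query term in the title counts more than one in the body.
FIELD_WEIGHTS = {"title": 3.0, "body": 1.0}

def ir_score(page, query_terms):
    """Score a page by which query terms it contains and where they appear."""
    score = 0.0
    for field, weight in FIELD_WEIGHTS.items():
        words = page[field].lower().split()
        score += weight * sum(1 for term in query_terms if term in words)
    return score

def combined_score(page, query_terms, alpha=0.5):
    """Blend the IR score with a link-based authority score (assumed linear blend)."""
    return alpha * ir_score(page, query_terms) + (1 - alpha) * page["authority"]

def rank(pages, query, alpha=0.5):
    """Order pages for a query by their combined score, highest first."""
    terms = query.lower().split()
    return sorted(pages, key=lambda p: combined_score(p, terms, alpha), reverse=True)

# Made-up pages with made-up authority values standing in for something like PageRank.
pages = [
    {"url": "a.example", "title": "Inverted Index Basics",
     "body": "how posting lists work", "authority": 0.2},
    {"url": "b.example", "title": "Search Notes",
     "body": "an inverted index maps terms to pages", "authority": 0.9},
]

for page in rank(pages, "inverted index"):
    print(page["url"])
```

With these made-up numbers, the page with the query terms in its title comes first. Setting `alpha=0` ranks on authority alone and reverses the order, which shows how much the mix of the two scores can change a result set.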
Other Meaningful Results Returned when Query Terms are Missing or Replaced
It is possible to do searches at Google where a search annotation appears with a result in a SERP and tells us that one of the query terms is missing. This has been happening for a while, and I wanted to document it when it does. Here is one example that I found, where Google decided to show search results even though one or more query terms are not in a document returned for the query. I searched for the Jorge Luis Borges short story “Library of Babel” and the book “Ficciones.” The story appears in more than one book from the author, and some results don’t include the name of the book “Ficciones.” I found one of those, and it has a search annotation that allows me to see only results that include the name of that book.
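One way to picture this behavior is as a relaxed version of matching every query term. The Python sketch below is my own illustration, not a description of Google's implementation. The function name `search_with_missing`, the `max_missing` parameter, and the small index of document IDs are all hypothetical.

```python
def search_with_missing(index, query, max_missing=1):
    """Return (doc_id, missing_terms) pairs, allowing a few query terms to be absent."""
    terms = query.lower().split()
    # Every document that contains at least one query term is a candidate.
    candidates = set()
    for term in terms:
        candidates |= index.get(term, set())
    results = []
    for doc_id in candidates:
        missing = [t for t in terms if doc_id not in index.get(t, set())]
        if len(missing) <= max_missing:
            results.append((doc_id, missing))
    # Documents with no missing terms come first, then by document ID.
    results.sort(key=lambda r: (len(r[1]), r[0]))
    return results

# Hypothetical index: term -> set of document IDs.
index = {
    "library": {1, 2},
    "babel": {1, 2},
    "ficciones": {1},
    "labyrinths": {2, 3},
}

for doc_id, missing in search_with_missing(index, "library babel ficciones"):
    note = f"Missing: {', '.join(missing)}" if missing else "all terms present"
    print(doc_id, note)
```

Calling the same function with `max_missing=0` plays the role of the annotation that lets me see only results that include every term.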
Sometimes Google will find meaningful alternatives to some of the words in a query. It may use a process such as Hummingbird or some other synonym substitution to replace those query terms. I searched for “Best Place in Encinitas to Order Lasagna?” and Google gave me a featured snippet in response. I was looking for a restaurant but didn’t include the word “restaurant” in the query. See the featured image at the top of this post to see how the word “place” from my query has been rewritten to use “restaurant.”
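As a toy illustration of that kind of substitution, here is a Python sketch that swaps a query term only when another word in the query supplies the right context. The `REWRITES` table and the `rewrite_query` function are hypothetical. Hummingbird and Google's other query rewriting systems are far more sophisticated and are not public, so treat this only as a picture of the idea.

```python
# Hypothetical rules: (term, context word that must also be in the query) -> replacement.
REWRITES = {
    ("place", "order"): "restaurant",
}

def rewrite_query(query):
    """Replace terms that have a rule whose context word also appears in the query."""
    terms = query.lower().replace("?", "").split()
    present = set(terms)
    rewritten = []
    for term in terms:
        replacement = term
        for (old, context), new in REWRITES.items():
            if term == old and context in present:
                replacement = new
        rewritten.append(replacement)
    return " ".join(rewritten)

print(rewrite_query("Best Place in Encinitas to Order Lasagna?"))
print(rewrite_query("a place in the sun"))
```

The first query has “place” rewritten to “restaurant” because “order” is also present. The second is left alone. The rewrite changes only the query that gets matched against the index, not the pages in the index.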
A Patent on a Google Inverted Index
When I first thought about these patents, I searched for “inverted index” on the USPTO.gov website. Surprisingly, it returned a relevant result.
Rather than provide details of how this patent works, I will link to it and provide the abstract, and if you want to check it out, you can do that. Here is that patent:
Updating inverted indices
Inventors: Muthian Sivathanu, Saurabh Goyal, and Rajiv Mathews
Assignee: GOOGLE LLC
US Patent: 10,073,874
Granted: September 11, 2018
Filed: November 21, 2013
Abstract
Implementations provide an indexing system with an instant failover that uses a moving snapshot window. For example, a method may include receiving, by a processor, a query and determining that a main query processing engine is not responding. The method may further include generating a search result for the query using a secondary query processing engine that applies at least one snapshot record to a portion of a posting list, the snapshot record including the portion of the posting list as it appeared before a modification, and the modification occurring within a predetermined time before receiving the query. The portion is a fixed size smaller than the posting list. Applying the snapshot record can include overlaying the portion of the posting list with the snapshot record beginning at an offset specified by the snapshot record. The main query processing engine generates a search result without applying snapshot records.
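The overlay step in that abstract can be pictured with a short Python sketch. This is only my reading of the abstract, not the patented implementation. The `SnapshotRecord` class, its field names, the use of plain lists of document IDs, and the choice to apply the newest records first (so that the oldest saved state wins when records overlap) are all assumptions.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class SnapshotRecord:
    offset: int         # where the saved portion begins in the posting list
    before: List[int]   # the portion as it appeared before the modification
    modified_at: float  # when the modification happened

def apply_snapshots(posting_list, records, query_time, window):
    """Overlay saved portions for modifications made within `window` of the query."""
    view = list(posting_list)  # work on a copy; the live posting list is untouched
    recent = [r for r in records if 0 <= query_time - r.modified_at <= window]
    # Assumption: newest first, so the oldest pre-modification state ends up on top.
    for record in sorted(recent, key=lambda r: r.modified_at, reverse=True):
        view[record.offset:record.offset + len(record.before)] = record.before
    return view

# Made-up posting list after a modification changed the two entries at offset 2.
current = [3, 8, 15, 30, 42, 57]
rec = SnapshotRecord(offset=2, before=[12, 21], modified_at=100.0)

print(apply_snapshots(current, [rec], query_time=105.0, window=10.0))
print(apply_snapshots(current, [rec], query_time=105.0, window=2.0))
```

In the first call the modification falls inside the window, so the secondary engine sees the list as it was before the change. In the second call the modification is older than the window, so the record is ignored and the current list is used as is.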
Inverted Index for Phrase-Based Indexing
Another Google patent tells us about a different inverted index of the web for complete and meaningful phrases used with phrase-based indexing. This means that Google keeps track of frequently co-occurring phrases on pages of the web (unlike LSI Keywords). This patent is at:
Index server architecture using tiered and sharded phrase posting lists
Inventors: Pei Cao, Nadav Eiron, Soham Mazumdar, Anna L. Patterson, Russell Power, and Yonatan Zunger
Assignee: Google Inc.
US Patent: 9,652,483
Granted: May 16, 2017
Filed: November 23, 2015
Abstract:
An information retrieval system uses phrases to index, retrieve, organize and describe documents. Phrases are from the document collection. Documents get indexed according to their included phrases, using phrase posting lists. The phrase posting lists get stored in a cluster of index servers. The phrase posting lists can become tiered into groups and sharded into partitions. Phrases in a query get identified based on possible phrasifications. A query schedule based on the phrases can get created from the phrases and optimized to reduce query processing and communication costs. The execution of the query schedule can get managed to further reduce or eliminate query processing operations at various index servers.
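A small Python sketch can show what phrase posting lists split into shards might look like. It is a simplification under stated assumptions: the list of phrases is supplied by hand rather than extracted from the collection, documents are assigned to shards by document ID modulo the number of shards, tiers and query scheduling are not modeled, and the names `build_sharded_phrase_index` and `lookup` are hypothetical.

```python
from collections import defaultdict

def build_sharded_phrase_index(docs, phrases, num_shards=2):
    """Build one phrase -> posting list map per shard, sharding by document ID."""
    shards = [defaultdict(list) for _ in range(num_shards)]
    for doc_id in sorted(docs):
        # Normalize whitespace and pad with spaces so phrases match on word boundaries.
        text = " " + " ".join(docs[doc_id].lower().split()) + " "
        for phrase in phrases:
            if f" {phrase} " in text:
                shards[doc_id % num_shards][phrase].append(doc_id)
    return shards

def lookup(shards, phrase):
    """Ask every shard for its part of the phrase posting list and merge the parts."""
    hits = []
    for shard in shards:
        hits.extend(shard.get(phrase, []))
    return sorted(hits)

# Made-up documents and a hand-picked phrase list, for illustration only.
docs = {
    1: "The Library of Babel is a short story",
    2: "a short story collection",
    3: "an inverted index of the web",
    4: "the library of babel appears in ficciones",
}
phrases = ["library of babel", "short story", "inverted index"]
shards = build_sharded_phrase_index(docs, phrases)

print(lookup(shards, "library of babel"))  # [1, 4]
print(lookup(shards, "short story"))       # [1, 2]
```

Each shard holds only part of a phrase's posting list, so a query has to visit the shards and merge what they return. Reducing that work is what the query schedule in the abstract is about.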
I wrote about this inverted index in the post Are You Using Google Phrase-Based Indexing?
I had to write about the Google inverted index because it is something that I haven’t written about on this blog, even though it is one of the basic SEO 101 ideas behind how SEO works. I also wanted to show how that method can be used in phrase-based indexing, where it is used to build phrase-based posting lists that index phrases on the Web.
Wow never knew this side of the SEO
Thanks for writing, Bill. And, thanks for the education.
Hi Koray,
The information retrieval roots of SEO with things like inverted indexes need to come out, because we are seeing them used again in places like phrase-based indexing. 🙂
Bill
Hi James,
I was reading a book on information retrieval and it started with inverted indexes. That was part of a presentation on a Google podcast as well. I remembered Matt Cutts’s Librarian Newsletter, and how helpful it was to have someone explain how an inverted index fits into the indexing of the web, and that inspired this post.
Bill
Hi Bill,
Very interesting article, but don’t you think that frequently co-occurring phrases can also be LSI phrases?
Thanks!
Hi Sjoerd,
There is no indication at all that Latent Semantic Indexing is being used by anyone to build an index of the web. An inverted index does not need Latent Semantic Indexing, and LSI is not used to count the words on a page or determine if that page is relevant for a query. LSI is a technology that was patented and developed by researchers at Bell Labs before there was a Web. I wrote about its limitations in the post Does Google Use Latent Semantic Indexing (LSI)?
Anna Lynn Patterson, the inventor of Phrase-Based Indexing, developed her algorithm to use frequently co-occurring complete and meaningful phrases. I wrote about it in Google Phrase-Based Indexing Updated. The technology behind LSI does not use frequently co-occurring phrases to index content on the Web. That is not how LSI works. LSI has nothing to do with inverted indexes or co-occurring phrases. You can learn more about LSI by reading the patent behind it: Computer information retrieval using latent semantic structure.
Hi Bill,
Is this the newsletter you’re referring to?
https://web.archive.org/web/20110412045528/https://www.google.com/librariancenter/articles/0512_01.html
Hi Sam,
I had saved the internet archive copy on a university site, and that one had disappeared on me within a week. I just added this link to this post a short time ago after someone else from Twitter sent me a copy. That is the newsletter that I was referring to, and it’s just as useful now in teaching how an inverted index works. Thanks.
Bill
Hi Bill,
I have saved this post as a bookmark in my browser. It’s a great article with lots of valuable information that, in my opinion, should be explored further and always kept in mind.
Thanks for this information Bill.
Hi Gabriele,
Thank you. That is very good to hear. I have been thinking about making sure I write about some SEO basics that aren’t talked about very much these days. A couple came to mind.
Bill
Thank you very much Bill for all this information.
What are the changes you can measure after they implemented RankBrain to “Posting Lists”?
Hi Sessions,
I’m not sure that there is any metric that you can use to measure the effects of RankBrain on posting lists. RankBrain works to rewrite queries, which are created by searchers. Posting lists are created from web pages, which are created by site owners. RankBrain has no effect on posting lists.
An example of the use of RankBrain is a search for [New York Times Puzzles]. RankBrain may notice that whenever people talk about New York Times puzzles, they are looking for crossword puzzles. Those words co-occur in word vectors. This doesn’t change the pages of the New York Times that carry crossword puzzles, but if you perform a search at Google for [New York Times Puzzle], you will see search results in response to the query showing New York Times crossword puzzles.
The pages themselves are not touched; RankBrain rewrites the query. There is no effect on the posting lists.
Bill
Amazing info. Bill, that looks like you have uncovered a secret of the Google indexing process.
Hi Rowdy,
I didn’t uncover a secret. Sadly, it’s just something that doesn’t get talked about much, and should be discussed more. This is a basic of how SEO works, and it isn’t that complicated, but it still is different than stuffing a page with things you assume might be semantically related. Most people who write about Semantic SEO are getting that wrong too.
Great read as always. I will check out the patent for more information but first a pit stop at the phrase-based indexing article.
Amazing post! The information you have given about Google’s Inverted Index of the Web is really informative and useful. Thanks for sharing your blog post.
love your blogs,
Awesome and amazing blog,
I really inspired so much for your thought and I also proceed to admire in my life,
thank you so much for knowledge information
Keep uploading more.
Good luck cheers!
It’s hard to come by experienced people about this subject, but you seem like you know what you’re talking about! Thanks
Hi Bill,
Thanks for the quality content.
Do you think latent semantic indexing still plays a huge role in 2021 SEO?
Thanks
Hi Preventivo Sito Web.
I believe that LSI has played no role at all in SEO, and there is no reason for it to matter in 2021. It is a technology that was patented by Bell Labs. The patent likely expired in 2008, but there does not appear to have been any reason for Google to use on the Web a technology that was created to index smaller, static databases. The patent for LSI explains that the index needs to be updated every time more content is added to the corpus being indexed. Google has developed ways of indexing content using information retrieval methods such as inverted indexes, term frequencies, and inverse document frequencies, along with approaches such as Semantic Frame language models and BERT language models. LSI is inadequate to index Web content, and there are more modern and up-to-date tools that Google can use. Google has proven over the years that they like semantics, but that does not mean in any way that they have decided to use Latent Semantic Indexing. Also, to dispel the concept of LSI Keywords, there is no proof that Google has ever used that technology, and a few SEOs who are proponents of LSI Keywords treat almost any way of generating content related to keywords as an “LSI Keywords” approach, which it truly is not.
Bill
a quality content, keep uploading more of this!
Fascinating! Not your typical content to read. Really interesting and I would love to read this everyday!
Hello Bill,
In all my years of studying SEO and trying to understand all that is needed to get your content indexed and then served up to searchers, I have never come across this style of indexing.
Do you think there’s a possibility of this coming back into use?
Thanks for this eye-opener.
By the way, great content!
Regards
Hi TechTechTalk.
I don’t think that Google has moved away from using inverted indexes when indexing content on the web. They are likely one of the foundations of how Google decides to rank content on the web, as foundational as letters are to writing.
I liked reading this post. I also want to bookmark this page to visit this blog again to get some new information. I will definitely come here again.