A Phrase Posting List is an Inverted Index of Complete and Meaningful Phrases on the Web
Say you want to find out who the chief of police is in New York City. You might type the following words into a search box at Google:
- New York police chief
When Google attempts to find an answer for you, it may break your query into individual words to find all the documents that might be the best match for your search:
- New AND York AND police AND chief
Google may then take all the returned documents, see which ones contain all the terms you used, and then rank those based upon some of the ranking algorithms the search engine uses to show you the best matches for your query.
But what if Google tried to find phrases in your query instead, phrases that appear on web pages and match your search? What if Google used something it refers to as phrasification? Google might take your query and break it into different combinations of phrases, such as the following:
- police AND chief AND “New York”
- “New York” AND “police chief”
- chief AND New AND “York Police”
- “York police chief” AND New
- “police chief” AND New AND York
- “New York Police Chief”
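To make the idea concrete, here is a minimal sketch of how the candidate phrasifications of a query could be enumerated. It assumes a phrasification is any way of splitting the query into runs of adjacent words, which is my reading of the examples above rather than something the patent spells out here; the function name `phrasifications` is made up for illustration.

```python
from itertools import product


def phrasifications(query):
    """Return every way to split the query into phrases of adjacent words."""
    words = query.split()
    if not words:
        return []
    results = []
    # Each gap between adjacent words is either a break (True) or a join (False).
    for breaks in product([False, True], repeat=len(words) - 1):
        phrases, current = [], [words[0]]
        for word, is_break in zip(words[1:], breaks):
            if is_break:
                phrases.append(" ".join(current))
                current = [word]
            else:
                current.append(word)
        phrases.append(" ".join(current))
        results.append(phrases)
    return results


candidates = phrasifications("New York police chief")
for phrases in candidates:
    # Quote multi-word phrases, leave single words bare.
    print(" AND ".join(f'"{p}"' if " " in p else p for p in phrases))
```

A query of n words has n - 1 gaps, so this approach produces 2^(n-1) candidates, which stays manageable for typical short queries.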
How Phrasifications Might Get Scored
Each of these phrasifications may get scored by using a scoring model that includes:
- Expected probability of the phrase occurring in a document,
- Phrases in the phrasification,
- Confidence of each phrase,
- Adjustable parameters to control the precision and recall of searches on the phrases.
The highest-scoring phrasifications may best represent the phrases in a query. They could lead to a combination that best matches what you intend to find with your search.
For instance, it’s much more likely that you were searching for the chief of police in New York City than you were the new chief of the York Police.
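A rough sketch of how such scoring might play out follows. Everything in it is an assumption for illustration: the confidence values are invented, the default for unknown phrases is arbitrary, and taking the geometric mean of phrase confidences is my own simplification, not the scoring model from the patent.

```python
import math

# Hypothetical confidence values, invented for illustration only.
PHRASE_CONFIDENCE = {
    "New York": 0.9,
    "police chief": 0.8,
    "York police": 0.1,
    "York police chief": 0.05,
    "New York police chief": 0.2,
}
# Assumed tuning knob: raising it favors single words over phrases.
SINGLE_WORD_CONFIDENCE = 0.5
# Assumed default for multi-word phrases that are not in the table.
UNKNOWN_PHRASE_CONFIDENCE = 0.01


def score(phrasification):
    """Geometric mean of the confidence of each phrase in the phrasification."""
    confidences = [
        PHRASE_CONFIDENCE.get(p, UNKNOWN_PHRASE_CONFIDENCE)
        if " " in p
        else SINGLE_WORD_CONFIDENCE
        for p in phrasification
    ]
    return math.prod(confidences) ** (1 / len(confidences))


candidates = [
    ["police", "chief", "New York"],
    ["New York", "police chief"],
    ["chief", "New", "York police"],
    ["York police chief", "New"],
    ["police chief", "New", "York"],
    ["New York police chief"],
]
best = max(candidates, key=score)
print(best)
```

With these made-up numbers, the combination of “New York” and “police chief” comes out on top, which matches the intuition about what a searcher most likely means.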
That analysis might also tell it that a phrase such as “Chief of Police” might also be helpful to find pages that may match the meaning behind your search.
If Google’s index contained information about phrases that appear on web pages in addition to individual terms, the phrasification approach might improve the results that you see at Google.
Google’s Phrase-Based Indexing and A Phrase Posting List
Over the past few years, many related Google patent applications came out. They describe how the search engine might use a phrase-based indexing system.
We don’t know for certain if Google has adopted the approach in those patent filings. It appears that Google has now started publishing a second generation of patent filings on that phrase-based indexing system. They add more technical details on how such an index would work, including how phrases that frequently co-occur might be recorded in a phrase posting list.
Previous Posts on Phrase-Based Indexing
My previous posts on that phrase-based approach include the following:
- What are the Top Phrases for Your Website?
- Google Phrase Based Indexing Patent Granted
- Phrase Based Information Retrieval and Spam Detection
- Google Aiming at 100 Billion Pages?
- Move over pagerank: Google’s looking at phrases?
This week, the USPTO granted Google a patent that describes how such a system might collect and store information about phrases it finds on web pages in a phrase posting list.
To get a sense of how this phrase-based indexing system works, it can help to look back at what Google has written about how an inverted index works. Also, look at how the search engine might explore different combinations of words it finds on pages to see how it may index concepts, or phrases instead of just individual words.
Individual Terms in an Inverted Index System and Phrase Posting List
How does a search engine save and store information about pages it finds on the Web?
Back in 2005, Google’s Matt Cutts published How does Google collect and rank results?, which provides an overview of how Google might collect and index words found on web pages in a type of index known as an inverted index.
That kind of index relates web pages to individual words found on each page by associating each unique word with a posting list that identifies documents containing that word. A posting list is a list of all documents that contain a specific word. Thus, when someone searches, the query they enter into a search box is first broken into individual terms, and the posting lists for those terms get looked at.
The documents from those posting lists are then ranked according to statistical measures, such as:
- Frequency of occurrence of the query terms,
- Host domain,
- Link analysis, and
- The like.
Documents that contain all the words in a query might come before documents that contain fewer than all the words. The lists of documents are then displayed to the searcher, usually in ranked order.
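The term-based approach described above can be sketched in a few lines. This is a toy illustration with made-up documents and hypothetical function names, not Google's implementation: each word maps to a posting list (here a set of document IDs), and a query is answered by intersecting the posting lists of its terms.

```python
from collections import defaultdict

# Toy documents, invented for illustration.
documents = {
    1: "new york police chief named",
    2: "york police welcome new chief",
    3: "new york weather report",
}


def build_inverted_index(docs):
    """Map each unique word to the set of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def search_all_terms(index, query):
    """Return the IDs of documents that contain every term in the query."""
    posting_lists = [index.get(term, set()) for term in query.lower().split()]
    if not posting_lists:
        return set()
    return set.intersection(*posting_lists)


index = build_inverted_index(documents)
matches = search_all_terms(index, "New York police chief")
print(matches)
```

Note that document 2, which is about the York police, matches just as well as document 1. An index of individual terms has no way to tell the two apart without further ranking signals.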
This approach is a direct “Boolean” matching of query terms, and it has some limitations. For instance, a search for “Australian Shepherds” wouldn’t return any documents about other herding dogs, such as Border Collies. Yet it might return documents about Australia that have nothing to do with dogs, along with unrelated pages about shepherds.
This kind of approach focuses on individual terms rather than concepts.
Concepts and Indexing Systems
The ideas captured in language often take on new meanings when expressed in phrases. For example, if we were to try to search for and understand the words “President” and “United” and “States” separately, we would get a host of different meanings and possible pages associated with them. For instance, a page about the President of a Union in the United States might be as relevant as a page about the United States President.
If, instead, we look at those words as phrases such as “President of the United States,” we get a better sense of the kinds of web pages that might be most relevant for that specific phrase as a query.
Will The Search Engine Focus More on Phrase-Indexing, Using that Phrase Posting List?
Conventional search engine systems looking at individual terms in an inverted index may sometimes expand their index to a limited number of well-known phrases. If the search engines focused on more phrases, it could be taxing on that search engine. As the patent’s inventors tell us:
Indexing phrases is typically avoided because of the perceived computational and memory requirements to identify all possible phrases of, say, three, four, or five or more words.
For example, assuming that any five words could constitute a phrase and a large corpus would have at least 200,000 unique terms, there would be approximately 3.2 × 10^26 possible phrases, clearly more than any existing system could store or otherwise programmatically manipulate.
The further problem is that phrases continually enter and leave the lexicon in terms of their usage, much more frequently than new individual words are invented.
New phrases are always emerging from technology, arts, world events, and law. Other phrases will decline in usage over time.
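The figure in the passage above is easy to check. A quick sketch, assuming as the inventors do that any five of 200,000 unique terms could form a phrase and that word order matters:

```python
unique_terms = 200_000
phrase_length = 5

# Every position in the phrase can hold any of the unique terms.
possible_phrases = unique_terms ** phrase_length
print(f"{possible_phrases:.1e}")
```

That is 200,000 multiplied by itself five times, which is why the number of candidate phrases grows so quickly with phrase length.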
Search engines may also pay attention to how often different words may show up in the same documents to understand concepts. For instance, a search for the word “president” may return documents that may contain many of the same words, such as “white” and “house.”
Understanding this may result in a way to rerank search results so that pages with more related words like this rank higher in search results. But, this way of relating individual words that tend to show up in the same documents isn’t as powerful as looking for phrases that tend to co-occur on the same pages.
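Here is a minimal sketch of what counting co-occurrence could look like once documents have been reduced to the phrases found in them. The documents, the threshold of two shared documents, and the function name are all assumptions made for illustration; the patent filings describe more careful statistical measures.

```python
from collections import Counter
from itertools import combinations

# Toy documents, each already reduced to the set of phrases found in it.
doc_phrases = [
    {"president", "white house", "oval office"},
    {"president", "white house", "press secretary"},
    {"president", "labor union"},
    {"white house", "oval office"},
]

# Count how many documents each pair of phrases appears in together.
cooccurrence = Counter()
for phrases in doc_phrases:
    for pair in combinations(sorted(phrases), 2):
        cooccurrence[pair] += 1


def related_phrases(phrase, min_count=2):
    """Phrases that share at least min_count documents with the given phrase."""
    related = []
    for (a, b), count in cooccurrence.items():
        if count >= min_count and phrase in (a, b):
            related.append(b if a == phrase else a)
    return sorted(related)


print(related_phrases("president"))
print(related_phrases("white house"))
```

Pages that contain both a query phrase and its related phrases could then be treated as better matches than pages that contain the query phrase alone.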
Will We See a Phrase-Based Index at Google?
A phrase-based indexing system would be huge and would need to use many servers that share information across those servers. Such a search engine would use a Phrase Posting List.
The new Google patent introduces concepts like phrasification. It explores ways to efficiently and effectively capture information about which pages different phrases appear upon and to use phrase-based indexing to return more meaningful search results to searchers.
The patent is:
Index server architecture using tiered and sharded phrase posting lists
Invented by Pei Cao, Nadav Eiron, Soham Mazumdar, Anna Patterson, Russell Power, and Yonatan Zunger
Assigned to Google
US Patent 7,693,813
Granted April 6, 2010
Filed March 30, 2007
Abstract
An information retrieval system uses phrases to index, retrieve, organize, and describe documents.
Phrases are extracted from the document collection. Documents are then indexed according to their included phrases, using phrase posting lists. The phrase posting lists are stored in a cluster of index servers. The phrase posting lists can be tiered into groups and sharded into partitions.
Phrases in a query are identified based on possible phrasifications. A query schedule based on the phrases is created and then optimized to reduce query processing and communication costs. The execution of the query schedule is managed to further reduce or eliminate query processing operations at various index servers.
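To picture what tiered and sharded phrase posting lists might look like, here is a small sketch. The tier boundaries, the choice to tier by posting list length, and the rule of assigning a document to a shard by its ID modulo the number of shards are all my assumptions for illustration, not details taken from the abstract.

```python
# Assumed tier layout: (maximum posting list length, number of shards).
TIERS = [(2, 1), (4, 2), (float("inf"), 4)]


def pick_tier(posting_list):
    """Return (tier number, shard count) for a posting list based on its length."""
    for tier, (max_len, num_shards) in enumerate(TIERS):
        if len(posting_list) <= max_len:
            return tier, num_shards


def shard_posting_list(posting_list, num_shards):
    """Split a posting list into partitions, one per hypothetical index server."""
    shards = [[] for _ in range(num_shards)]
    for doc_id in posting_list:
        shards[doc_id % num_shards].append(doc_id)
    return shards


# Toy phrase posting lists: phrase -> sorted document IDs.
postings = {
    "york police chief": [7, 12],
    "new york": [1, 2, 3, 5, 7, 8, 12, 13],
}

layout = {}
for phrase, plist in postings.items():
    tier, num_shards = pick_tier(plist)
    layout[phrase] = (tier, shard_posting_list(plist, num_shards))

for phrase, (tier, shards) in layout.items():
    print(phrase, "tier", tier, shards)
```

The point of a layout like this is that a rare phrase can live on a single server, while a very common phrase is spread across several so that no one server has to hold or scan the whole list.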
We are also told about some related unpublished patent applications at the US Patent Office:
- Query Scheduling Using Hierarchical Tiers of Index Servers, filed Mar. 30, 2007;
- Index Updating Using Segment Swapping, filed Mar. 30, 2007;
- Phrase Extraction Using Subphrase Scoring, filed Mar. 30, 2007; and
- Bifurcated Document Relevance Scoring, filed Mar. 30, 2007
Phrase Posting List Conclusion
The Phrase Posting Lists patent itself is fairly long and detailed. It describes how phrases get taken from web pages and indexed. It also looks at how those indices work across multiple servers, how the phrasification process works in more depth, and how this phrase-based information system can look at co-occurrence to identify related phrases.
If you’re interested in how Google indexes content found on web pages and are willing to dig into technical details, you may want to spend some time with the phrase-based indexing system patent filings.
I have had a few people link to some of my earlier posts on phrase-based indexing and state that they are indications that Google is using Latent Semantic Indexing because the indexing system pays attention to different phrases that tend to co-occur on web pages. While that part of the indexing system is interesting and worth studying, it isn’t latent semantic indexing.
Google is a lot smarter than people realize, with things such as semantics, synonyms, etc. People fret about the exact phrases and titles, but it is more about building a mass of overall content than anything else.
Yes, I agree with Drew that Google is a lot smarter and it is evolving to be better every time. If phrasification is used, does this mean that long tail keywords would be better off?
Google really is something, don’t you think?…
Just when webmasters think they have cracked Google’s code, Google comes up with a new algorithm. First it was just keywords, then comes keyphrases apart from LSI among others.
What can I say??… Google is just like a woman — very unpredictable!
Fascinating turn of events Bill and many thanks for the great overview. This patent clarifies their direction. This patent, combined with patents for storing different “meanings” show that the Semantic Web has been here for quite some time and search engineers are doing their best to embrace it as well as emerging behavioral trends on the part of users. Win-win for us all. Now, can we strike the term “keyword” from the lexicon of search and replace it with “keyphrase.”
Hi Drew,
We are seeing Google and the other search engines evolve towards understanding more of the meaning behind sites than just matching keywords from queries, but it doesn’t hurt to spend some time thinking carefully about some of the words and phrases that people who might be interested in what a site has to offer might use to find that site in search engines.
Hi Andrew,
It’s always been a good idea to consider the possibility of having pages show up for long tail queries as well as head terms that those pages might be optimized for, and I don’t think that this patent by itself changes that. Pages that include words and phrases that are related to queries (whether individual words or phrasifications) that people might use to find a page may be more likely to show up in search results for those queries. And those pages may be more likely to show up in search results for those related words and phrases in search results as well – increasing the likelihood that a page shows up for long tail terms.
Hi marianne,
Thank you. I do think that this approach from Google should make us think more carefully about the way that we use words and phrases on our pages, and can improve the results that searchers receive from their queries.
Hi Pidro,
This patent points to a move that is somewhat inevitable – considering relevance to be more than matching keywords on a page. I should get similar results from a search whether I try to find “car dealerships” or “automobile dealerships.”
The core of this patent seems to be “Phrases in a query are identified based on possible phrasifications” – since, as you describe, phrases change over time, and in context.
I sometimes wonder how much of what Google does is ‘human-generated’ vs. algo – or put another way, how many ‘exceptions’ or semantic adjustments are there in the algorithm, that are specified by human users, rather than derived purely algorithmically. Regardless, thanks for the detailed and interesting post.
I was under the impression they were already using “phrasification” Oh well, if it improves searches, and cuts out spam sounds good!
Google allows phrase based searching by enclosing the search terms in quotation marks – but I guess not many users make use of this facility. I think it could be useful in some cases for a search engine to default to phrase searching.
Where there are insufficient phrase matches, the phrase-based results could be followed by non-phrase based search results (i.e. where all the search terms are found but not as a phrase in the correct order.)
Hi sms,
I suspect that there may be some things that are manually changed, but with the vast number of pages on the Web, and the billions of queries that Google receives each month, the overwhelming majority of what Google does to rank pages has to be automated. There may be human input in creating language models and training sets, as well as human reviews of the relevance of results, but too much of that is unlikely.
Hi Shaun,
It is very much possible that Google is using the techniques described in this patent, and in the other phrase-based indexing patent filings. I really didn’t find much in the way of references to the term “phrasification” itself before I started writing this post, but it may be something that Google has been doing for the last 3-4 years at least.
I’m not sure that we’ve seen a description of a process before from Google of the search engine breaking queries into different possible combinations of words and phrases, comparing those to find the most likely combinations, and then providing results from an index that includes phrases instead of just words.
Hi Ted,
I don’t know how many people are aware that they can use quotation marks for phrases either. We have had the ability in Google to put quotation marks around phrases and have the search engine return exact matches for years. I believe that you may have been able to do that back when Google was in Beta more than a decade ago. I believe Google has always defaulted to a findall search, where it would attempt to make sure that all of the keywords in a query appeared in the documents returned – and if there weren’t enough, it would start returning documents with less than all of the terms.
The nice thing about this phrase based approach is that it wouldn’t return phrase results by default, but instead try to determine if there were meaningful phrases within a query before searching, and if there are, it would try to focus upon returning results with those phrases first. One main difference is that in this phrase based indexing system, instead of just including individual words, Google’s index would also include phrases and their locations on documents on the Web.
Thanks for the helpful and informative post! When deciding on appropriate keyword or keyword phrases to optimize for in a website, it is essential to look at how people are actually phrasing their searches to get effective results. Search engine optimization really is based on so many different aspects of a website, that readers are very lucky to have such a great SEO best practices resource! Keep up the good work!
Great stuff Bill. No need to reply just wanted to give you kudos on another fascinating read my friend!
Hi lgenetti,
Thank you for your kind words. I do think that it can be important if you are considering attempting to use a particular phrase when writing a page, with the idea that people might search for that phrase, to have some idea of how Google or the other major search engines might attempt to interpret that phrase when they see it in a query. Sounds like common sense, but to use an extreme example, if you are writing about a new police chief in the City of York, you probably don’t want to title your article “New York Police Chief Appointed.” :)
Re constructing phrases in content for a search engine – I guess that was what metadata was for!
It would be nice to have HTML markup that identified specific phrases for search engines (that can recognize phrases) to pick up – City of York Police Chief but this could also be subject to the same abuses as metadata.
Sounds like Google keeps looking at LSI. Google is going to search their index for what they think you are looking for, not what you type in the search box.
Hi Ted,
Metadata such as the meta keywords element has always been somewhat suspect because it doesn’t actually appear on the page of a site. The kind of markup that you mention might be a nice addition, but I agree with you that it might be prone to the same kind of abuses as meta data. For some phrases, such as acronyms, you do get something that could signal to a search engine that you intended a certain phrase, but that can only be used in a limited fashion, such as in the following example.
<p>The National Aeronautics and Space Administration (<acronym title="National Aeronautics and Space Administration">NASA</acronym>) is undergoing a transformation as the US Government is considering ending the agency’s role in manned space flight.</p>
I’m not sure that the major search engines are paying much attention, if any at all, to the use of the acronym element.
I imagine people producing content will have to study linguistics now. Considering how detailed the field of linguistics has become in deconstructing languages into elements which follow rules, it seems that Google and others can well apply these rules to understanding content appropriateness to a query. (Coupled with semantics). Does this mean that my degree in English Lit with work in Linguistics will help me find new work? :)
Hi Mike,
Google may use something like PLSI, but I don’t think that they would use LSI – when it was conceived in the late 80s and early 90s, it wasn’t really intended for use on the Web, but rather for much smaller document repositories that don’t change much and don’t have features like links between documents.
I do agree with you though that Google seems to be moving from matching keywords from your queries to keywords found in documents to a system that might better understand the intent behind your search and return pages that may be a better fit for what you mean than the words that you used.
Hi Frank,
Knowing more about linguistics might help, but I’m not sure that it’s necessary. On the other hand, I’ve always felt that my degree in English Lit has helped me be a better writer. :)
Being able to find related words and phrases to use within the copy of a page can be helpful in a phrase-based indexing system. Understanding how what you write might be deconstructed by a computerized system might help too. What’s interesting is that Google isn’t so concerned with the technical rules of linguistics as it is with the observable patterns that it sees in documents on the Web, and the ways that people use words and phrases.