Phrasification and Revisiting Google’s Phrase Based Indexing

A newly granted Google patent on phrase-based indexing calls for a new look at that approach to indexing phrases on the Web, including a process referred to as phrasification.

Say you want to find out who the chief of police is in New York City. You might type the following words into a search box at Google:

  • New York police chief

When Google attempts to find an answer for you, it may break your query into individual words to find all of the documents that might be a best match for your search:

  • New AND York AND police AND chief

Google may then take all the documents that are returned, and see which ones contain all of the terms you used, and then rank those based upon some of the ranking algorithms the search engine uses to try to show you the best matches for your query.

But what if Google instead tried to find phrases from your query that appear on web pages matching your search? What if Google used something they refer to as phrasification? Google might start out by taking your query and breaking it into different combinations of phrases, such as the following:

  • “New York” AND police AND chief
  • “New York” AND “police chief”
  • New AND “York Police” AND chief
  • New AND “York police chief”
  • New AND York AND “police chief”
  • “New York Police Chief”
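
To make the idea concrete, here is a minimal Python sketch that enumerates every way of splitting a query into contiguous phrases. The function name `phrasifications` is my own and isn't taken from the patent, and a real system would presumably prune candidates rather than list them all. For a four-word query it produces eight candidates: the six listed above plus two more (all four words on their own, and “New York police” AND chief).

```python
from itertools import product


def phrasifications(query):
    """Return every way to split the query into contiguous phrases."""
    words = query.split()
    if not words:
        return []
    results = []
    # Each gap between adjacent words either ends a phrase (True) or not (False).
    for cuts in product([True, False], repeat=len(words) - 1):
        phrases, current = [], [words[0]]
        for word, cut in zip(words[1:], cuts):
            if cut:
                phrases.append(" ".join(current))
                current = [word]
            else:
                current.append(word)
        phrases.append(" ".join(current))
        results.append(phrases)
    return results


candidates = phrasifications("New York police chief")
for phrases in candidates:
    # Quote multi-word phrases, leave single words bare.
    print(" AND ".join(f'"{p}"' if " " in p else p for p in phrases))
```

A query of n words has 2^(n-1) such splits, which is manageable for the short queries most people type.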

Each of these phrasifications may be scored by using a scoring model that includes:

  • The expected probability of the phrase occurring in a document,
  • The number of phrases in the phrasification,
  • A confidence measure of each phrase,
  • Some adjustment parameters for controlling the precision and recall of searches on the phrases.

The highest scoring phrasifications may be selected as best representing the phrases contained in a query, and possibly lead to a combination that best matches what you may intend to find with your search.
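
The post doesn't reproduce the patent's actual scoring formula, so the following is only a toy sketch of how the factors listed above might be combined. Everything in it is an assumption made for illustration: the probabilities and confidence values in `PHRASE_STATS` are invented, the log-based formula is my own, and `phrase_penalty` stands in for the adjustment parameters.

```python
import math

# Hypothetical (probability of appearing in a document, confidence) pairs.
# These numbers are invented for illustration only.
PHRASE_STATS = {
    "New": (0.30, 1.0),
    "York": (0.02, 1.0),
    "police": (0.05, 1.0),
    "chief": (0.04, 1.0),
    "New York": (0.015, 0.95),
    "York police": (0.00002, 0.2),
    "police chief": (0.004, 0.9),
    "New York police": (0.0005, 0.5),
    "York police chief": (0.00001, 0.1),
    "New York police chief": (0.00005, 0.4),
}


def segmentations(words):
    """Yield every split of the word list into contiguous phrases."""
    if not words:
        yield []
        return
    for i in range(1, len(words) + 1):
        head = " ".join(words[:i])
        for rest in segmentations(words[i:]):
            yield [head] + rest


def score(phrases, phrase_penalty=0.5):
    """Toy score: log probability plus log confidence, minus a per-phrase penalty."""
    total = 0.0
    for phrase in phrases:
        probability, confidence = PHRASE_STATS[phrase]
        total += math.log(probability) + math.log(confidence)
    return total - phrase_penalty * len(phrases)


def best_phrasification(query):
    return max(segmentations(query.split()), key=score)


best = best_phrasification("New York police chief")
print(best)
```

With these made-up numbers, “New York” AND “police chief” scores highest, because both phrases are far more common as units than chance would suggest, while “York police” is rare and carries low confidence.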

For instance, it’s much more likely that you were searching for the chief of police in New York City than for the new chief of the York Police.

That analysis might also tell Google that a phrase such as “Chief of Police” could be helpful in finding pages that match the meaning behind your search.

If Google’s index contained information about phrases that appear on web pages in addition to individual terms, the phrasification approach might work to improve the results that you see at Google.

Google Phrase-Based Indexing

Over the past few years, a number of Google patent applications were published which describe how the search engine might use a phrase-based indexing system.

We don’t know for certain if Google has adopted the approach in those patent filings, but it appears that Google has now started publishing a second generation of patent filings on that phrase-based indexing system that go into more technical detail on how such an index might be constructed. I’ve written a number of previous posts on that phrase-based approach.

Google was granted a patent this week that describes how such a system might collect and store information about phrases it finds on Web pages.

To get a sense of how this phrase-based indexing system works, it can help to look back at what Google has written about how an inverted index works, and to look at how the search engine might explore different combinations of words it finds on pages to see how it may index concepts, or phrases, instead of just individual words.

Individual Terms in an Inverted Index System

How does a search engine save and store information about pages it finds on the Web?

Back in 2005, Google’s Matt Cutts published “How does Google collect and rank results?”, which provides an overview of how Google might collect and index words found on web pages in a type of index known as an inverted index.

That kind of index relates web pages to individual words found on each page, by associating each unique word with a posting list that identifies documents containing that word. A posting list is a list of all documents that contain a specific word. When someone searches, the query they enter into a search box is first broken into individual terms, and the posting lists for each of those terms are accessed.
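
A minimal sketch of an inverted index in Python might look like the following. The three sample documents are invented, and the posting lists are plain Python sets rather than the sorted, compressed lists a real search engine would use.

```python
from collections import defaultdict

# Invented sample documents, keyed by document ID.
docs = {
    1: "new york police chief speaks",
    2: "york appoints new chief of police",
    3: "new chief executive in york",
}


def build_index(docs):
    """Map each unique word to the set of document IDs containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index


def search_all(index, query):
    """Return IDs of documents that contain every term in the query."""
    terms = query.lower().split()
    if not terms:
        return []
    postings = [index.get(term, set()) for term in terms]
    return sorted(set.intersection(*postings))


index = build_index(docs)
print(sorted(index["york"]))
print(search_all(index, "New York police chief"))
```

Notice that document 2, which is about the city of York, matches the query just as well as document 1 does, since all four words appear in it.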

The documents from those posting lists are then ranked according to statistical measures, such as:

  • Frequency of occurrence of the query terms,
  • Host domain,
  • Link analysis, and
  • Other similar signals.

Documents that contain all of the words in a query might be shown before documents that contain fewer than all of the words. The lists of documents are then displayed to the searcher, usually in their ranked order.
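
As a rough illustration of that ordering, the sketch below ranks documents first by how many distinct query terms they contain and then by how often those terms occur. It is a simplification of my own: host domain and link analysis are left out entirely, and the sample documents are invented.

```python
from collections import Counter

# Invented sample documents.
docs = {
    "a": "new york police chief names new york deputy",
    "b": "police chief retires",
    "c": "new york weather",
    "d": "york police chief appointed new",
}


def rank(docs, query):
    """Order documents by distinct query terms matched, then term frequency."""
    terms = set(query.lower().split())
    scored = []
    for doc_id, text in docs.items():
        counts = Counter(text.lower().split())
        matched = sum(1 for term in terms if counts[term] > 0)
        frequency = sum(counts[term] for term in terms)
        if matched:
            scored.append((matched, frequency, doc_id))
    # More matched terms first, then higher frequency, then ID to break ties.
    scored.sort(key=lambda item: (-item[0], -item[1], item[2]))
    return [doc_id for _, _, doc_id in scored]


ranking = rank(docs, "New York police chief")
print(ranking)
```

Documents “a” and “d” contain all four terms and come first; “b” and “c” each contain only two and follow.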

This approach is known as a direct “Boolean” matching of query terms, and it has some limitations. For instance, a search for “Australian Shepherds” wouldn’t return any documents about other herding dogs such as Border Collies, but it might return and show documents about Australia that have nothing to do with dogs, and other pages about shepherds.

This kind of approach focuses upon individual terms rather than concepts.

Concepts and Indexing Systems

The ideas captured in language often take on new meanings when they are expressed in phrases. For example, if we were to try to search for and understand the words “President” and “United” and “States” separately, we would get a host of different meanings, and possible pages associated with them. For instance, a page about the President of a Union in the United States might be as relevant a result as a page about the President of the United States.

If instead, we look at those words as a phrase such as “President of the United States,” we get a better sense of the kinds of web pages that might be most relevant for that specific phrase as a query.
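
The difference can be sketched by building two small indexes over the same pages, one keyed by words and one keyed by phrases. This is only an illustration: the two sample documents are invented, and `KNOWN_PHRASES` is a hand-picked list, whereas a real phrase-based system would have to extract its phrases from the pages themselves.

```python
# Invented sample documents.
docs = {
    1: "The President of the United States spoke today",
    2: "The president of a union in the United States spoke today",
}

# Hand-picked phrases; a real system would extract these from the corpus.
KNOWN_PHRASES = ["president of the united states", "united states"]


def normalize(text):
    return " ".join(text.lower().split())


def build_word_index(docs):
    """Map each word to the sorted list of documents containing it."""
    index = {}
    for doc_id, text in sorted(docs.items()):
        for word in set(normalize(text).split()):
            index.setdefault(word, []).append(doc_id)
    return index


def build_phrase_index(docs, phrases):
    """Map each known phrase to the documents where it appears verbatim."""
    index = {phrase: [] for phrase in phrases}
    for doc_id, text in sorted(docs.items()):
        padded = f" {normalize(text)} "  # pad so matches fall on word boundaries
        for phrase in phrases:
            if f" {phrase} " in padded:
                index[phrase].append(doc_id)
    return index


word_index = build_word_index(docs)
phrase_index = build_phrase_index(docs, KNOWN_PHRASES)

print(word_index["president"], word_index["united"], word_index["states"])
print(phrase_index["president of the united states"])
```

Both pages appear in the posting lists for “president,” “united,” and “states,” but only the first appears in the posting list for the full phrase.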

Conventional search engine systems that look at individual terms in an inverted index may sometimes expand their index to include a limited number of well-known phrases. If a search engine tried to index many more phrases, it could be taxing on that search engine. As the patent’s inventors tell us:

Indexing of phrases is typically avoided because of the perceived computational and memory requirements to identify all possible phrases of say three, four, or five or more words.

For example, on the assumption that any five words could constitute a phrase, and that a large corpus would have at least 200,000 unique terms, there would be approximately 3.2 × 10^26 possible phrases, clearly more than any existing system could store or otherwise programmatically manipulate.

A further problem is that phrases continually enter and leave the lexicon in terms of their usage, much more frequently than new individual words are invented.

New phrases are always being generated, from sources such as technology, arts, world events, and law. Other phrases will decline in usage over time.
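
The estimate in the passage quoted above is easy to check: if any of 200,000 unique terms can fill each of five positions, the number of possible five-word phrases is 200,000 raised to the fifth power. The variable names below are my own.

```python
unique_terms = 200_000
phrase_length = 5

# Every position in the phrase can hold any of the unique terms.
possible_phrases = unique_terms ** phrase_length

print(possible_phrases)
print(f"{possible_phrases:.1e}")
```

That works out to about 3.2 × 10^26, and it only counts phrases of exactly five words.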

Search engines may also pay attention to how often different words may tend to show up in the same documents, to try to understand concepts. For instance, a search for the word “president” may return documents that may contain many of the same words, such as “white” and “house.”

Understanding this may result in a way to rerank search results so that pages with more related words like this are ranked higher in search results. But, this way of relating individual words that tend to show up in the same documents isn’t as powerful as looking for phrases that tend to co-occur on the same pages.
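
Here is a small sketch of that word-level co-occurrence idea, under assumptions of my own: the four sample documents are invented, a pair of words counts as related if it shows up together in at least two documents, and reranking simply counts how many related words a page contains.

```python
from collections import Counter
from itertools import combinations

# Invented sample documents.
docs = [
    "club president announces bake sale",
    "president white house press briefing",
    "white paint for your house",
    "president signs bill at white house",
]


def cooccurrence_counts(docs):
    """Count how many documents each pair of words appears in together."""
    counts = Counter()
    for text in docs:
        words = sorted(set(text.lower().split()))
        counts.update(combinations(words, 2))
    return counts


def related_words(counts, word, min_count=2):
    """Words that share at least min_count documents with the given word."""
    related = {}
    for (first, second), n in counts.items():
        if n >= min_count and word in (first, second):
            related[second if first == word else first] = n
    return related


def rerank(docs, word, related):
    """Put matching documents with more related words first."""
    hits = [doc for doc in docs if word in doc.lower().split()]
    return sorted(hits, key=lambda doc: -len(related.keys() & set(doc.lower().split())))


related = related_words(cooccurrence_counts(docs), "president")
reranked = rerank(docs, "president", related)
print(related)
print(reranked)
```

In this toy collection, “white” and “house” come out as related to “president,” so the bake sale page drops below the two White House pages.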

A phrase-based indexing system would be very large, and would need to use multiple servers that share information across those servers.

The new Google patent introduces concepts like phrasification, and explores ways to efficiently and effectively capture information about which pages different phrases appear upon, and to use phrase-based indexing to return more meaningful search results to searchers.

The patent is:

Index server architecture using tiered and sharded phrase posting lists
Invented by Pei Cao, Nadav Eiron, Soham Mazumdar, Anna Patterson, Russell Power, and Yonatan Zunger
Assigned to Google
US Patent 7,693,813
Granted April 6, 2010
Filed March 30, 2007

Abstract

An information retrieval system uses phrases to index, retrieve, organize and describe documents.

Phrases are extracted from the document collection. Documents are then indexed according to their included phrases, using phrase posting lists. The phrase posting lists are stored in a cluster of index servers. The phrase posting lists can be tiered into groups, and sharded into partitions.

Phrases in a query are identified based on possible phrasifications. A query schedule based on the phrases is created from the phrases, and then optimized to reduce query processing and communication costs. The execution of the query schedule is managed to further reduce or eliminate query processing operations at various ones of the index servers.
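
The abstract doesn't say how the tiers and shards are chosen, so the sketch below rests entirely on my own assumptions: a posting list is split into shards by taking each document ID modulo the number of shards, and a phrase is assigned a tier based on the length of its posting list, using made-up thresholds. The function names and sample posting lists are hypothetical.

```python
def shard_posting_list(doc_ids, num_shards):
    """Split a posting list into partitions by document ID (an assumed scheme)."""
    shards = [[] for _ in range(num_shards)]
    for doc_id in sorted(doc_ids):
        shards[doc_id % num_shards].append(doc_id)
    return shards


def assign_tier(posting_list_length, tier_limits=(3, 6)):
    """Pick a tier from the posting list length; the limits are made up."""
    for tier, limit in enumerate(tier_limits):
        if posting_list_length <= limit:
            return tier
    return len(tier_limits)


def sharded_intersection(list_a, list_b, num_shards):
    """Intersect two posting lists one shard at a time, then merge the results."""
    shards_a = shard_posting_list(list_a, num_shards)
    shards_b = shard_posting_list(list_b, num_shards)
    merged = []
    for shard_a, shard_b in zip(shards_a, shards_b):
        merged.extend(set(shard_a) & set(shard_b))
    return sorted(merged)


# Invented posting lists for two phrases.
new_york = [1, 2, 3, 4, 5, 6, 7, 8]
police_chief = [2, 5, 8]

print(shard_posting_list(new_york, 3))
print(assign_tier(len(new_york)), assign_tier(len(police_chief)))
print(sharded_intersection(new_york, police_chief, 3))
```

Because both lists are partitioned by the same rule, each shard can be intersected on its own, which hints at why splitting the work across index servers might reduce the communication between them.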

We are also told about a number of related patent applications that don’t appear to have been published yet at the US Patent Office:

  • Query Scheduling Using Hierarchical Tiers of Index Servers, filed Mar. 30, 2007;
  • Index Updating Using Segment Swapping, filed Mar. 30, 2007;
  • Phrase Extraction Using Subphrase Scoring, filed Mar. 30, 2007; and
  • Bifurcated Document Relevance Scoring, filed Mar. 30, 2007

Conclusion

The Phrase Posting Lists patent itself is fairly long and detailed, and describes how phrases are extracted from web pages and indexed, how those indices are arranged across multiple servers, how the phrasification process is handled in more depth, and how this phrase-based information system can look at co-occurrence to identify related phrases.

If you’re interested in how Google indexes content found on web pages, and willing to dig into some of the technical details, you may want to spend some time with the phrase-based indexing system patent filings.

I have had a few people link to some of my earlier posts on phrase-based indexing, and state that they are indications that Google is using Latent Semantic Indexing because the indexing system pays attention to different phrases that tend to co-occur on web pages. While that part of the indexing system is interesting and worth studying, it isn’t latent semantic indexing.


30 thoughts on “Phrasification and Revisiting Google’s Phrase Based Indexing”

  1. Google is a lot smarter than people realize, in areas such as semantics, synonyms, etc. People fret about exact phrases and titles, but it is more about building a mass of overall content than anything else.

  2. Yes, I agree with Drew that Google is a lot smarter and it is evolving to be better every time. If phrasification is used, does this mean that long tail keywords would be better off?

  3. Fascinating turn of events Bill, and many thanks for the great overview. This patent clarifies their direction. This patent, combined with patents for storing different “meanings,” shows that the Semantic Web has been here for quite some time and search engineers are doing their best to embrace it as well as emerging behavioral trends on the part of users. Win-win for us all. Now, can we strike the term “keyword” from the lexicon of search and replace it with “keyphrase”?

  4. Google really is something, don’t you think?…

    Just when webmasters think they have cracked Google’s code, Google comes up with a new algorithm. First it was just keywords, then comes keyphrases apart from LSI among others.

    What can I say??… Google is just like a woman — very unpredictable!

  5. Hi Drew,

    We are seeing Google and the other search engines evolve towards understanding more of the meaning behind sites than just matching keywords from queries, but it doesn’t hurt to spend some time thinking carefully about some of the words and phrases that people who might be interested in what a site has to offer might use to find that site in search engines.

  6. Hi Andrew,

    It’s always been a good idea to consider the possibility of having pages show up for long tail queries as well as head terms that those pages might be optimized for, and I don’t think that this patent by itself changes that. Pages that include words and phrases that are related to queries (whether individual words or phrasifications) that people might use to find a page may be more likely to show up in search results for those queries. And those pages may be more likely to show up in search results for those related words and phrases in search results as well – increasing the likelihood that a page shows up for long tail terms.

  7. Hi marianne,

    Thank you. I do think that this approach from Google should make us think more carefully about the way that we use words and phrases on our pages, and can improve the results that searchers receive from their queries.

  8. Hi Pidro,

    This patent points to a move that is somewhat inevitable – considering relevance to be more than matching keywords on a page. I should get similar results from a search whether I try to find “car dealerships” or “automobile dealerships.”

  9. The core of this patent seems to be “Phrases in a query are identified based on possible phrasifications” – since, as you describe, phrases change over time, and in context.

    I sometimes wonder how much of what Google does is ‘human-generated’ vs. algo – or put another way, how many ‘exceptions’ or semantic adjustments are there in the algorithm, that are specified by human users, rather than derived purely algorithmically. Regardless, thanks for the detailed and interesting post.

  10. I was under the impression they were already using “phrasification.” Oh well, if it improves searches and cuts out spam, sounds good!

  11. Google allows phrase based searching by enclosing the search terms in quotation marks – but I guess not many users make use of this facility. I think it could be useful in some cases for a search engine to default to phrase searching.

    Where there are insufficient phrase matches, the phrase-based results could be followed by non-phrase based search results (i.e. where all the search terms are found but not as a phrase in the correct order.)

  12. Hi sms,

    I suspect that there may be some things that are manually changed, but with the vast number of pages on the Web, and the billions of queries that Google receives each month, the overwhelming majority of what Google does to rank pages has to be automated. There may be human input in creating language models and training sets, as well as human reviews of the relevance of results, but too much of that is unlikely.

  13. Hi Shaun,

    It is very much possible that Google is using the techniques described in this patent, and in the other phrase-based indexing patent filings. I really didn’t find much in the way of references to the term “phrasification” itself before I started writing this post, but it may be something that Google has been doing for the last 3-4 years at least.

    I’m not sure that we’ve seen a description of a process before from Google of the search engine breaking queries into different possible combinations of words and phrases, comparing those to find the most likely combinations, and then providing results from an index that includes phrases instead of just words.

  14. Hi Ted,

    I don’t know how many people are aware that they can use quotation marks for phrases either. We have had the ability in Google to put quotation marks around phrases and have the search engine return exact matches for years. I believe that you may have been able to do that back when Google was in Beta more than a decade ago. I believe Google has always defaulted to a findall search, where it would attempt to make sure that all of the keywords in a query appeared in the documents returned – and if there weren’t enough, it would start returning documents with less than all of the terms.

    The nice thing about this phrase based approach is that it wouldn’t return phrase results by default, but instead try to determine if there were meaningful phrases within a query before searching, and if there are, it would try to focus upon returning results with those phrases first. One main difference is that in this phrase based indexing system, instead of just including individual words, Google’s index would also include phrases and their locations on documents on the Web.

  15. Thanks for the helpful and informative post! When deciding on appropriate keyword or keyword phrases to optimize for in a website, it is essential to look at how people are actually phrasing their searches to get effective results. Search engine optimization really is based on so many different aspects of a website, that readers are very lucky to have such a great SEO best practices resource! Keep up the good work!

  16. Hi lgenetti,

    Thank you for your kind words. I do think that it can be important if you are considering attempting to use a particular phrase when writing a page, with the idea that people might search for that phrase, to have some idea of how Google or the other major search engines might attempt to interpret that phrase when they see it in a query. Sounds like common sense, but to use an extreme example, if you are writing about a new police chief in the City of York, you probably don’t want to title your article “New York Police Chief Appointed.” :)

  17. Re constructing phrases in content for a search engine – I guess that was what metadata was for!

    It would be nice to have HTML markup that identified specific phrases for search engines (that can recognize phrases) to pick up – City of York Police Chief but this could also be subject to the same abuses as metadata.

  18. Sounds like Google keeps looking at LSI. Google is going to search their index for what they think you are looking for, not what you type in the search box.

  19. Hi Ted,

    Metadata such as the meta keywords element has always been somewhat suspect because it doesn’t actually appear on the page of a site. The kind of markup that you mention might be a nice addition, but I agree with you that it might be prone to the same kind of abuses as meta data. For some phrases, such as acronyms, you do get something that could signal to a search engine that you intended a certain phrase, but that can only be used in a limited fashion, such as in the following example.

    “<p>The National Aeronautics and Space Administration (<acronym title=”National Aeronautics and Space Administration”>NASA</acronym>) is undergoing a transformation as the US Government is considering ending the agency’s role in manned space flight.</p>”

    I’m not sure that the major search engines are paying much attention, if any at all, to the use of the acronym element.

  20. I imagine people producing content will have to study linguistics now. Considering how detailed the field of linguistics has become in deconstructing languages into elements which follow rules, it seems that Google and others can well apply these rules to understanding content appropriateness to a query. (Coupled with semantics). Does this mean that my degree in English Lit with work in Linguistics will help me find new work? ;)

  21. Hi Mike,

    Google may use something like PLSI, but I don’t think that they would use LSI. When it was conceived in the late 80s and early 90s, it wasn’t really intended for use on the Web, but rather for much smaller document repositories that don’t change much and don’t have features like links between documents.

    I do agree with you though that Google seems to be moving from matching keywords from your queries to keywords found in documents to a system that might better understand the intent behind your search and return pages that may be a better fit for what you mean than the words that you used.

  22. Hi Frank,

    Knowing more about linguistics might help, but I’m not sure that it’s necessary. On the other hand, I’ve always felt that my degree in English Lit has helped me be a better writer. :)

    Being able to find related words and phrases to use within the copy of a page can be helpful in a phrase-based indexing system. Understanding how what you write might be deconstructed by a computerized system might help too. What’s interesting is that Google isn’t so concerned with the technical rules of linguistics as it is with the observable patterns that it sees in documents on the Web, and the ways that people use words and phrases.
