A newly granted Google patent on phrase-based indexing calls for a new look at that approach to indexing phrases on the Web, including a process referred to as phrasification.
Say you want to find out who the chief of police is in New York City. You might type the following words into a search box at Google:
- New York police chief
When Google attempts to find an answer for you, it may break your query into individual words to find all of the documents that might be a best match for your search:
- New AND York AND police AND chief
Google may then take all the documents that are returned, and see which ones contain all of the terms you used, and then rank those based upon some of the ranking algorithms the search engine uses to try to show you the best matches for your query.
But, what if Google tried to find phrases from your query instead, that appear on web pages that are a match for your search. What if Google used something they refer to as phrasification? Google might start out by taking your query and breaking it into different combinations of phrases, such as the following:
- “New York” AND police AND chief
- “New York AND “police chief”
- New AND “York Police” AND chief
- New AND “York police chief”
- New AND York AND “police chief”
- “New York Police Chief”
Each of these phrasifications may be scored by using a scoring model that includes:
- The expected probability of the phrase occurring in a document,
- The number of phrases in the phrasification,
- A confidence measure of each phrase,
- Some adjustment parameters for controlling the precision and recall of searches on the phrases.
The highest scoring phrasifications may be selected as best representing the phrases contained in a query, and possibly lead to a combination that best matches what you may intend to find with your search.
For instance, it’s much more likely that you were searching for the chief of police in New York City then you were the new chief of the York Police.
That analysis might also tell it that a phrase such as “Chief of Police” might also be helpful to find pages that may match the meaning behind your search.
If Google’s index contained information about phrases that appear on web pages in addition to individual terms, the phrasification approach might work to improve the results that you see at Google.
Google Phrase-Based Indexing
Over the past few years, a number of Google patent applications were published which describe how the search engine might use a phrase-based indexing system.
We don’t know for certain if Google has adopted the approach in those patent filings, but it appears that Google has now started publishing a second generation of patent filings on that phrase-based indexing system that go into more technical details on how such an index might be constructed. My previous posts on that phrase-based approach include the following:
- What are the Top Phrases for Your Website?
- Google Phrase Based Indexing Patent Granted
- Phrase Based Information Retrieval and Spam Detection
- Google Aiming at 100 Billion Pages?
- Move over pagerank: Googleâ€™s looking at phrases?
Google was granted a patent this week that describes how such a system might collect and store information about phrases it finds on Web pages.
To get a sense of how this phrase-based indexing system works, it can help to look back at what Google has written about how an inverted index works, and to look at how the search engine might explore different combinations of words it finds on pages to see how it may index concepts, or phrases, instead of just individual words.
Individual Terms in an Inverted Index System
How does a search engine save and store information about pages it finds on the Web?
Back in 2005, Google’s Matt Cutts published How does Google collect and rank results?, which provides an overview on how Google might collect and index words found on web pages in a type of index known as an inverted index.
That kind of index relates web pages to individual words found on each page, by associating each unique word with a posting list that identifies documents containing that word. A posting list is list of all documents that contain a specific word. When someone searches, the query they enter into a search box is first broken into individual terms, and the posting lists for each term is accessed.
The documents from those posting lists are then ranked according to statistical measures, such as:
- Frequency of occurrence of the query terms,
- Host domain,
- Link analysis, and;
- The like.
Documents that contain all of the words in a query might be shown before documents that contain less than all of the words. The lists of documents are then displayed to the searcher, usually within their ranked order.
This approach is known as a direct “Boolean” matching of query terms, and it has some limitations. For instance, a search for “Australian Shepherds” wouldn’t return any documents about other herding dogs such as Border Collies, but it might return and show documents about Australia that have nothing to do with dogs, and other pages about shepherds.
This kind of approach focuses upon individual terms rather than concepts.
Concepts and Indexing Systems
The ideas captured in language often take on new meanings when they are expressed in phrases. For example, if we were to try to search for and understand the words “President” and “United” and “States” separately, we would get a host of different meanings, and possible pages associated with them. For instance, a page about the President of a Union in the United States might be as relevant a result as a page about the President of the United States.
If instead, we look at those words as a phrase such as “President of the United States,” we get a better sense of the kinds of web pages that might be most relevant for that specific phrase as a query.
Conventional search engines systems looking at individual terms in an inverted index may sometimes expand their index to a limited number of well known phrases. If the search engines tried to focus on more phrases, it could be taxing on that a search engine. As the patent’s inventors tell us:
Indexing of phrases is typically avoided because of the perceived computational and memory requirements to identify all possible phrases of say three, four, or five or more words.
For example, on the assumption that any five words could constitute a phrase, and that a large corpus would have at least 200,000 unique terms, there would be approximately 3.2.times.10.sup.26 possible phrases, clearly more than any existing system could store or otherwise programmatically manipulate.
A further problem is that phrases continually enter and leave the lexicon in terms of their usage, much more frequently than new individual words are invented.
New phrases are always being generated, from sources such technology, arts, world events, and law. Other phrases will decline in usage over time.
Search engines may also pay attention to how often different words may tend to show up in the same documents, to try to understand concepts. For instance, a search for the word “president” may return documents that may contain many of the same words, such as “white” and “house.”
Understanding this may result in a way to rerank search results so that pages with more related words like this are ranked higher in search results. But, this way of relating individual words that tend to show up in the same documents isn’t as powerful as looking for phrases that tend to co-occur on the same pages.
A phrase-based indexing system would be very large, and would need to use multiple servers that share information across those servers.
The new Google patent introduces concepts like phrasification, and explores ways to efficiently and effectively capture information about which pages different phrases appear upon, and to use phrase-based indexing to return more meaningful search results to searchers.
The patent is:
Index server architecture using tiered and sharded phrase posting lists
Invented by Pei Cao, Nadav Eiron, Soham Mazumdar, Anna Patterson, Russell Power, and Yonatan Zunger
Assigned to Google
US Patent 7,693,813
Granted April 6, 2010
Filed March 30, 2007
An information retrieval system uses phrases to index, retrieve, organize and describe documents.
Phrases are extracted from the document collection. Documents are the indexed according to their included phrases, using phrase posting lists. The phrase posting lists are stored in an cluster of index servers. The phrase posting lists can be tiered into groups, and sharded into partitions.
Phrases in a query are identified based on possible phrasifications. A query schedule based on the phrases is created from the phrases, and then optimized to reduce query processing and communication costs. The execution of the query schedule is managed to further reduce or eliminate query processing operations at various ones of the index servers.
We are also told about a number of related patent applications that don’t appear to have been published yet at the US Patent Office:
- Query Scheduling Using Hierarchical Tiers of Index Servers, filed Mar. 30, 2007;
- Index Updating Using Segment Swapping, filed Mar. 30, 2007;
- Phrase Extraction Using Subphrase Scoring, filed Mar. 30, 2007; and
- Bifurcated Document Relevance Scoring, filed Mar. 30, 2007
The Phrase Posting Lists patent itself is fairly long and detailed, and describes how phrases are extracted from web pages and indexed, how those indices are arranged across multiple servers, how the phrasification process is handled in more depth, and how this phrase-based information system can look at co-occurrence to indentify related phrases.
If you’re interested in how Google indexes content found on web pages, and willing to dig into some of the technical details, you may want to spend some time with the phrase-based indexing system patent filings.
I have had a few people link to some of my earlier posts on phrase-based indexing, and state that they are indications that Google is using Latent Semantic Indexing because the indexing system pays attention to different phrases that tend to co-occur on web pages. While that part of the indexing system is interesting and worth studying, it isn’t latent semantic indexing.