Phrase Based Information Retrieval and Spam Detection
Imagine an information retrieval system that uses phrases to index, search, rank, and describe documents on the web. This system would look at the way those phrases were used across the web to decide if they were “valid” or “good” phrases. In addition to considering if their use on all webpages was significant statistically by how frequently they were used, it would also look at how those phrases might have been related to each other – certain phrases tend to be mentioned in the same documents as other phrases
For example, a document that talks about the “President of the United States” may also be likely to include the phrase “white house.” So the appearance of some phrases can be used to predict the appearance of other phrases. And a spam document might contain an excessive number of related phrases.
Some “spam” pages have little meaningful content, but may instead be made up of large collections of popular words and phrases. These are sometimes referred to as “keyword stuffed pages.” Similar pages containing specific words and phrases that advertisers might be interested in are often called “honeypots,” and are created for search engines to display along with paid advertisements. To searchers looking for meaningful content, those pages can be a waste of time, and cause of frustration.
Google’s Anna Patterson is the listed inventor on a number of patent applications that describe a phrase based information system that probably should be getting more attention than it has in the past. I’ve written about a couple of these patent filings, and a new one was published this week. The newest filing, and a couple of the others, are explicitly assigned to Google.
- * Multiple index based information retrieval system (20060106792) Assigned to Google
- * Phrase-based searching in an information retrieval system (20060031195) Assigned to Google
- * Phrase-based indexing in an information retrieval system (20060020607)
- * Phrase-based generation of document descriptions (20060020571)
- * Phrase identification in an information retrieval system (20060018551)
I focused upon a couple of these in Google Aiming at 100 Billion Pages? in May, and in February’s Move over pagerank: Google’s looking at phrases?, and those posts summarize some of the aspects of this system. It has a number of useful features which allow for indexing of a very large number of web pages, a historic index of older versions of web pages, a supplemental index for less popular pages, an elimination of “Google Bombing,” duplicate content detection, and as noted in the new patent filing, a way to identify web spam in the form of keyword stuffed pages and honeypots.
Detecting spam documents in a phrase based information retrieval system
Invented by Anna Lynn Patterson
US Patent Application 20060294155
Published December 28, 2006
Filed: June 28, 2006
An information retrieval system uses phrases to index, retrieve, organize and describe documents. Phrases are identified that predict the presence of other phrases in documents. Documents are the indexed according to their included phrases. A spam document is identified based on the number of related phrases included in a document.
Search System overview
Indexing happens by this system identifying phrases in documents on the web, and indexing those according to their phrases.
A searcher submits a query to search system, which attempts to provide relevant pages in response, while looking for phrases in the query, and then ranking the results using phrases to influence the ranking order.
When results are presented to searchers, they are first modified to remove near duplicate documents, and to create snippets (topical descriptions of the pages) for the pages.
This system includes a primary index and a secondary index which stores indexing information pertaining to documents, as well as a phrase data store that stores phrases, and related statistical information.
The previous patent application on a Multiple Index System tells us that information about some pages may be stored in the secondary, or supplemental index, that doesn’t capture as much information as would be found for results in the primary index.
Parts of the Indexing System
There are three primary functions performed by the indexing system:
- identification of phrases and related phrases,
- indexing of documents with respect to phrases, and;
- generation and maintenance of a phrase-based taxonomy.
Phrase identification acts to identify “good” and “bad” phrases in the document collection.
Good phrases stand out because they:
- Appear in more than certain percentage of the documents on the web, and/or;
- Are distinguished in appearance, by being marked up by html tags or “other morphological, format, or grammatical markers.”
- Predict other good phrases rather than being mere sequences of words appearing in the lexicon
An example of the predictive ability of good phrases:
The phrase “President of the United States” predicts other phrases such as “George Bush” and “Bill Clinton.”
Other phrases may not be predictive, such as “fell down the stairs” or “top of the morning,” “out of the blue.” Idioms and colloquisms like these are widely used, and often appear with many other different and unrelated phrases. Looking at how frequently phrases co-occur on individual pages, within the whole collection of indexed pages, can tell us whether or not the appearance of one phrase might be used to predict the appearance of another.
Functional Stages of the Phrase Identification Process
These can also be broken down into three steps:
- Collect possible and good phrases, along with frequency and co-occurrence statistics of the phrases,
- Classify possible phrases to either good or bad phrases based on frequency statistics, and;
- Prune good phrase list based on a predictive measure derived from co-occurrence statistics.
The patent application goes into a good deal of detail on how those steps are taken. A summary:
1. The documents are indexed in manageable partitions.
2. Phrases are identified at different word lengths, paying attention to stop words, ends of lines, paragraph returns, markup tags, and other possible ways to recognize changes in content or format. This step is known as traversal.
Example of traversal
The phrase window starts at the word “stock” and extends 5 words to the right.
The first word in the window is candidate phrase i, and the each of the sequences i+1, i+2, i+3, i+4, and i+5 is likewise a candidate phrase.
In this example, the candidate phrases are: “stock”, “stock dogs”, “stock dogs for”, “stock dogs for the”, “stock dogs for the Basque”, and “stock dogs for the Basque shepherds”.
In each phrase window, each candidate phrase is checked in turn to determine if it is already present in the good phrase list or the possible phrase list. If it isn’t in either, then the candidate has already been determined to be “bad” and is skipped.
If the candidate phrase is in the good phrase list as entry g.sub.j, then the index entry for phrase g.sub.j is updated to include the document (e.g., its URL or other document identifier), to indicate that this candidate phrase g.sub.j appears in the current document.
An entry in the index for a phrase g.sub.j (or a term) is referred to as the posting list of the phrase g.sub.j.
The posting list includes a list of documents d (by their document identifiers, e.g. a document number, or alternatively a URL) in which the phrase occurs.
In one embodiment, the document number is derived by a one-way hash of the URL, using, for example, MD5.
3. The identified phrases are put into a possible phrase list, and statistics are collected about those phrases.
a) P(p): Number of documents on which the possible phrase appears;
b) S(p): Number of all instances of the possible phrase; and
c) M(p): Number of interesting instances of the possible phrase. This may be where the possible phrase is distinguished from neighboring content in a document by grammatical or format markers, (boldface, or underline, anchor text in a hyperlink, or in quotation marks, or others). These distinguishing appearances involve various HTML markup language tags and grammatical markers. This information is retained for phrases when they are placed on the good phrase list.
Updating the Good list from the Possible List
After traversal in a partition is completed, the next step is to update the good phrase list from the possible phrase list.
The frequency of appearance of a phrase and the number of documents it appears within may indicate that it is used as semantically meaningful phrase.
The good phrase list will include individual words as phrases, as well as multi-word phrases.
A list of bad phrases isn’t stored – only possible and good phrases.
If a phrase appears for the very first time, it is very unlikely to be considered a good phrase at that time. It may be just coming into usage, and might be seen to be increasingly common – and if so will satisfy thresholds for being recongized as a good phrase.
Updating the co-occurrence matrix
A co-occurrence matrix is maintained and updated for the good phrases. This helps to keep track of when different phrases appear together in the same documents.
The matrix G has a dimension of m.times.m, where m is the number of good phrases.
Each entry G(j, k) in the matrix represents a pair of good phrases (g.sub.j, g.sub.k).
The co-occurrence matrix logically (though not necessarily physically) maintains three separate counts for each pair (g.sub.j, g.sub.k) of good phrases with respect to a secondary window that is centered at the current word i, and extends +/-h words.
In one embodiment, such as illustrated in FIG. 3, the secondary window 304 is 30 words. The co-occurrence matrix thus maintains:
1) R(j,k): Raw Co-occurrence count. The number of times that phrase g.sub.j appears in a secondary window 304 with phrase g.sub.k;
2) D(j,k): Disjunctive Interesting count. The number of times that either phrase g.sub.j or phrase g.sub.k appears as distinguished text in a secondary window; and
3) C(j,k): Conjunctive Interesting count: the number of times that both g.sub.j and phrase g.sub.k appear as distinguished text in a secondary window. The use of the conjunctive interesting count is particularly beneficial to avoid the circumstance where a phrase (e.g., a copyright notice) appears frequently in sidebars, footers, or headers, and thus is not actually predictive of other text.
Pruning the Good list Using the co-occurrence matrix
The third stage of the indexing operation is to prune the good phrase list using a predictive measure derived from the co-occurrence matrix.
Without pruning, the good phrase list is likely to include many phrases that while legitimately appearing in the lexicon, themselves do not sufficiently predict the presence of other phrases, or themselves are subsequences of longer phrases.
To identify good phrases, a predictive measure is used which expresses the increased likelihood of one phrase appearing in a document given the presence of another phrase.
Pruning the Good Phrase List to Remove Incomplete Phrases
The final step of this stage is to prune the good phrase list to remove incomplete phrases. An incomplete phrase is a phrase that only predicts its phrase extensions, and which starts at the left most side of the phrase (i.e., the beginning of the phrase).
The phrase “President of” predicts “President of the United States”, “President of Mexico”, “President of AT&T”, etc.
All of these latter phrases are phrase extensions of the phrase “President of” since they begin with “President of” and are super-sequences thereof.
It’s useful because it can predict one of those other phrases. But, if it doesn’t predict at least one other phrase that isn’t an extension of it, it may be seen as an incomplete phrase.
“President of the United” is an incomplete phrase because the only other phrase that it predicts is “President of the United States” which is an extension of the phrase.
This incomplete phrase list might be kept to help searchers. When a search query is received, it can be compared against the incomplete phase list.
For example, if the search query is “President of the United,” the search system can automatically suggest to the user “President of the United States” as the search query.
Each good phrase is used with sufficient frequency and independence to represent meaningful concepts or ideas expressed in the corpus.
Identification of Related Phrases and Clusters of Related Phrases
Looking at the co-occurence of phrases, and collecting information about them can help organize this information:
This approach provides a useful organization for clusters. First, rather than a strictly–and often arbitrarily–defined hierarchy of topics and concepts, this approach recognizes that topics, as indicated by related phrases, form a complex graph of relationships, where some phrases are related to many other phrases, and some phrases have a more limited scope, and where the relationships can be mutual (each phrase predicts the other phrase) or one-directional (one phrase predicts the other, but not vice versa). The result is that clusters can be characterized “local” to each good phrase, and some clusters will then overlap by having one or more common related phrases.
Ordering related phrases by information gain (how likely they will predict other phrases) helps to create a taxonomy for naming the clusters of the phrase. The patent tells us that
The above process provides a very robust way of identifying significant phrases that appear in the document collection, and beneficially, the way these related phrases are used together in natural “clusters” in actual practice. As a result, this data-driven clustering of related phrases avoids the biases that are inherent in any manually directed “editorial” selection of related terms and concepts, as is common in many systems.
The process used here has three parts:
- Identify related phrases having a high information gain value.
- Identify clusters of related phrases.
- Store cluster bit vector and cluster number.
Indexing Documents with Phrases and Related Phrases
I’m going to provide a very quick summary of most of the rest of the document – this is getting to be a pretty long post. Here are the other parts covered in this patent application:
Indexing Documents with Phrases and Related Phrases
Once there is a good phrase list, with the information about related phrases and clusters, the next step is to index the documents with respect to the good phrases and clusters, and store the updated information in the primary index and the secondary index.
Documents are pre-ranked according to information retreival practices with respect to the phrases.
The scoring algorithm for pre-ranking the documents may be the same underlying relevance scoring algorithm used in the search system to generate a relevance score. In one embodiment, the IR score is based on the page rank algorithm, as described in U.S. Pat. No. 6,285,999.
Alternatively or additionally, statistics for a number of IR-relevant attributes of the document, such as the number of inlinks, outlinks, document length, may also be stored, and used alone or in combination in order to rank the documents.
For example, the documents may be ranked in declining order according to the number of inlinks.
To further facilitate the fastest possible retrieval of information from the primary index, the entries in each posting list are physically stored on the appropriate primary server in the rank ordering by the IR-type score.
The lower an entry is ranked, the less information may be retained for it, and those lower ranked documents may be kept in a secondary server:
The foregoing storage arrangement enables storing significantly more entries in a given amount of hard disk storage than conventional techniques.
1) Elimination of the term position information for every phrase in every document provides approximately a 50% reduction in the amount of storage needed for a given set of documents, thereby effectively doubling the number of documents that can be stored.
2) Partitioning the posting lists between the primary index and secondary indices and storing relevance information only in the primary index provides further substantial savings. Many phrases have over 100,000, even 1,000,000 documents in their posting lists. Storing the relevance information for only a limited number of entries in the primary index eliminates the storage needed for the documents that are not likely to be returned in search. This aspect provides approximately a ten-fold increase in the number of documents that can be stored.
3) Further savings (approximately 25%-50% reduction in required storage capacity) are achieved by selectively storing less relevance information in the primary index for the less relevant (lower ranked) documents in each posting list.
The patent discusses keeping older versions of pages for archival purposes, and to allow people to search by date range. It also covers future indexing, and how changes in pages might be identified, and may impact how phrases are indexed.
Identifying Phrases in Queries
An important aspect of this whole process is the ability to identify whether or not a query contains a phrase. If it does, than this phrase indexing can be helpful in locating relevant documents for searchers. Identifying phrases in a query is a little similar to identifying phrases on a page to be indexed. But there are some differences.
For instance, capitalization may play a role in understanding what someone is looking for when they type a query into a search box.
Comparing queries to indexed phrases is a little complex, but the patent application notes four types of approaches they might take in doing some matching, when they see the following in a query:
– A single query phrase (the results are already pre-ranked)
– Two common query phrases (an intersection of results need to be ranked based upon the relevance data collected for each)
– Two rare query phrases (similar to when there are two common query phrases, but the likelihood of having to go to the secondary index is unlikely.)
– A common phrase and a rare phrase (primary index documents are joined together, and then secondary documents are added for the more common phrase, and all of those documents are ranked.)
Different Kinds of Rankings
The patent describes in detail, rankings based upon:
– Contained phrases
– Anchor Phrases
– Date Range Relevance
Identifying Spam Documents
Don’t know if you’re still with me here, but this is the main addition that this patent application brings to phrase based indexing.
How does this system know if keyword stuffing or honeypot activity is going on?
From the foregoing, the number of the related phrases present in a given document will be known. A normal, non-spam document will generally have a relatively limited number of related phrases, typically on the order of between 8 and 20, depending on the document collection. By contrast, a spam document will have an excessive number of related phrases, for example on the order of between 100 and 1000 related phrases. Thus, the present invention takes advantage of this discovery by identifying as spam documents those documents that have a statistically significant deviation in the number of related phrases relative to an expected number of related phrases for documents in the document collection.
There are some more details, and some additional information about identifying phrases associated with spam to find spam pages, but the volume of related phrases seems to be a large (and possibly very effective) part of this process.
A long and detailed patent application, I apologize for the length of this post. But I wanted to capture a fair amount of it, especially since my last two posts on phrase-based indexing didn’t go into a lot of depth on the subject.
Is Google using a process like this? It’s possible.
For something similar from Yahoo, which they’ve at least tested live, and supposedly implemented for “related results” since 2003, see my post on Reranking Search Results Based Upon Concepts in Users’ Queries.