Phrase Based Information Retrieval and Spam Detection

Skipjack Chestertown

Imagine an information retrieval system that uses phrases to index, search, rank, and describe documents on the web. This system would look at the way those phrases were used across the web to decide if they were “valid” or “good” phrases. In addition to considering if their use on all webpages was significant statistically by how frequently they were used, it would also look at how those phrases might have been related to each other – certain phrases tend to be mentioned in the same documents as other phrases

For example, a document that talks about the “President of the United States” may also be likely to include the phrase “white house.” So the appearance of some phrases can be used to predict the appearance of other phrases. And a spam document might contain an excessive number of related phrases.

Some “spam” pages have little meaningful content, but may instead be made up of large collections of popular words and phrases. These are sometimes referred to as “keyword stuffed pages.” Similar pages containing specific words and phrases that advertisers might be interested in are often called “honeypots,” and are created for search engines to display along with paid advertisements. To searchers looking for meaningful content, those pages can be a waste of time, and cause of frustration.

Google’s Anna Patterson is the listed inventor on a number of patent applications that describe a phrase based information system that probably should be getting more attention than it has in the past. I’ve written about a couple of these patent filings, and a new one was published this week. The newest filing, and a couple of the others, are explicitly assigned to Google.

I focused upon a couple of these in Google Aiming at 100 Billion Pages? in May, and in February’s Move over pagerank: Google’s looking at phrases?, and those posts summarize some of the aspects of this system. It has a number of useful features which allow for indexing of a very large number of web pages, a historic index of older versions of web pages, a supplemental index for less popular pages, an elimination of “Google Bombing,” duplicate content detection, and as noted in the new patent filing, a way to identify web spam in the form of keyword stuffed pages and honeypots.

Detecting spam documents in a phrase based information retrieval system
Invented by Anna Lynn Patterson
US Patent Application 20060294155
Published December 28, 2006
Filed: June 28, 2006

Abstract

An information retrieval system uses phrases to index, retrieve, organize and describe documents. Phrases are identified that predict the presence of other phrases in documents. Documents are the indexed according to their included phrases. A spam document is identified based on the number of related phrases included in a document.

Search System overview

Indexing happens by this system identifying phrases in documents on the web, and indexing those according to their phrases.

A searcher submits a query to search system, which attempts to provide relevant pages in response, while looking for phrases in the query, and then ranking the results using phrases to influence the ranking order.

When results are presented to searchers, they are first modified to remove near duplicate documents, and to create snippets (topical descriptions of the pages) for the pages.

This system includes a primary index and a secondary index which stores indexing information pertaining to documents, as well as a phrase data store that stores phrases, and related statistical information.

The previous patent application on a Multiple Index System tells us that information about some pages may be stored in the secondary, or supplemental index, that doesn’t capture as much information as would be found for results in the primary index.

Parts of the Indexing System

There are three primary functions performed by the indexing system:

  1. identification of phrases and related phrases,
  2. indexing of documents with respect to phrases, and;
  3. generation and maintenance of a phrase-based taxonomy.

Phrase Identification

Phrase identification acts to identify “good” and “bad” phrases in the document collection.

Good phrases stand out because they:

  1. Appear in more than certain percentage of the documents on the web, and/or;
  2. Are distinguished in appearance, by being marked up by html tags or “other morphological, format, or grammatical markers.”
  3. Predict other good phrases rather than being mere sequences of words appearing in the lexicon

An example of the predictive ability of good phrases:

The phrase “President of the United States” predicts other phrases such as “George Bush” and “Bill Clinton.”

Other phrases may not be predictive, such as “fell down the stairs” or “top of the morning,” “out of the blue.” Idioms and colloquisms like these are widely used, and often appear with many other different and unrelated phrases. Looking at how frequently phrases co-occur on individual pages, within the whole collection of indexed pages, can tell us whether or not the appearance of one phrase might be used to predict the appearance of another.

Functional Stages of the Phrase Identification Process

These can also be broken down into three steps:

  1. Collect possible and good phrases, along with frequency and co-occurrence statistics of the phrases,
  2. Classify possible phrases to either good or bad phrases based on frequency statistics, and;
  3. Prune good phrase list based on a predictive measure derived from co-occurrence statistics.

The patent application goes into a good deal of detail on how those steps are taken. A summary:

1. The documents are indexed in manageable partitions.

2. Phrases are identified at different word lengths, paying attention to stop words, ends of lines, paragraph returns, markup tags, and other possible ways to recognize changes in content or format. This step is known as traversal.

Example of traversal

The phrase window starts at the word “stock” and extends 5 words to the right.

The first word in the window is candidate phrase i, and the each of the sequences i+1, i+2, i+3, i+4, and i+5 is likewise a candidate phrase.

In this example, the candidate phrases are: “stock”, “stock dogs”, “stock dogs for”, “stock dogs for the”, “stock dogs for the Basque”, and “stock dogs for the Basque shepherds”.

In each phrase window, each candidate phrase is checked in turn to determine if it is already present in the good phrase list or the possible phrase list. If it isn’t in either, then the candidate has already been determined to be “bad” and is skipped.

If the candidate phrase is in the good phrase list as entry g.sub.j, then the index entry for phrase g.sub.j is updated to include the document (e.g., its URL or other document identifier), to indicate that this candidate phrase g.sub.j appears in the current document.

An entry in the index for a phrase g.sub.j (or a term) is referred to as the posting list of the phrase g.sub.j.

The posting list includes a list of documents d (by their document identifiers, e.g. a document number, or alternatively a URL) in which the phrase occurs.

In one embodiment, the document number is derived by a one-way hash of the URL, using, for example, MD5.

3. The identified phrases are put into a possible phrase list, and statistics are collected about those phrases.

a) P(p): Number of documents on which the possible phrase appears;

b) S(p): Number of all instances of the possible phrase; and

c) M(p): Number of interesting instances of the possible phrase. This may be where the possible phrase is distinguished from neighboring content in a document by grammatical or format markers, (boldface, or underline, anchor text in a hyperlink, or in quotation marks, or others). These distinguishing appearances involve various HTML markup language tags and grammatical markers. This information is retained for phrases when they are placed on the good phrase list.

Updating the Good list from the Possible List

After traversal in a partition is completed, the next step is to update the good phrase list from the possible phrase list.

The frequency of appearance of a phrase and the number of documents it appears within may indicate that it is used as semantically meaningful phrase.

The good phrase list will include individual words as phrases, as well as multi-word phrases.

A list of bad phrases isn’t stored – only possible and good phrases.

If a phrase appears for the very first time, it is very unlikely to be considered a good phrase at that time. It may be just coming into usage, and might be seen to be increasingly common – and if so will satisfy thresholds for being recongized as a good phrase.

Updating the co-occurrence matrix

A co-occurrence matrix is maintained and updated for the good phrases. This helps to keep track of when different phrases appear together in the same documents.

The matrix G has a dimension of m.times.m, where m is the number of good phrases.

Each entry G(j, k) in the matrix represents a pair of good phrases (g.sub.j, g.sub.k).

The co-occurrence matrix logically (though not necessarily physically) maintains three separate counts for each pair (g.sub.j, g.sub.k) of good phrases with respect to a secondary window that is centered at the current word i, and extends +/-h words.

In one embodiment, such as illustrated in FIG. 3, the secondary window 304 is 30 words. The co-occurrence matrix thus maintains:

1) R(j,k): Raw Co-occurrence count. The number of times that phrase g.sub.j appears in a secondary window 304 with phrase g.sub.k;

2) D(j,k): Disjunctive Interesting count. The number of times that either phrase g.sub.j or phrase g.sub.k appears as distinguished text in a secondary window; and

3) C(j,k): Conjunctive Interesting count: the number of times that both g.sub.j and phrase g.sub.k appear as distinguished text in a secondary window. The use of the conjunctive interesting count is particularly beneficial to avoid the circumstance where a phrase (e.g., a copyright notice) appears frequently in sidebars, footers, or headers, and thus is not actually predictive of other text.

Pruning the Good list Using the co-occurrence matrix

The third stage of the indexing operation is to prune the good phrase list using a predictive measure derived from the co-occurrence matrix.

Without pruning, the good phrase list is likely to include many phrases that while legitimately appearing in the lexicon, themselves do not sufficiently predict the presence of other phrases, or themselves are subsequences of longer phrases.

To identify good phrases, a predictive measure is used which expresses the increased likelihood of one phrase appearing in a document given the presence of another phrase.

Pruning the Good Phrase List to Remove Incomplete Phrases

The final step of this stage is to prune the good phrase list to remove incomplete phrases. An incomplete phrase is a phrase that only predicts its phrase extensions, and which starts at the left most side of the phrase (i.e., the beginning of the phrase).

Example:

The phrase “President of” predicts “President of the United States”, “President of Mexico”, “President of AT&T”, etc.

All of these latter phrases are phrase extensions of the phrase “President of” since they begin with “President of” and are super-sequences thereof.

It’s useful because it can predict one of those other phrases. But, if it doesn’t predict at least one other phrase that isn’t an extension of it, it may be seen as an incomplete phrase.

“President of the United” is an incomplete phrase because the only other phrase that it predicts is “President of the United States” which is an extension of the phrase.

This incomplete phrase list might be kept to help searchers. When a search query is received, it can be compared against the incomplete phase list.

For example, if the search query is “President of the United,” the search system can automatically suggest to the user “President of the United States” as the search query.

Each good phrase is used with sufficient frequency and independence to represent meaningful concepts or ideas expressed in the corpus.

Identification of Related Phrases and Clusters of Related Phrases

Looking at the co-occurence of phrases, and collecting information about them can help organize this information:

This approach provides a useful organization for clusters. First, rather than a strictly–and often arbitrarily–defined hierarchy of topics and concepts, this approach recognizes that topics, as indicated by related phrases, form a complex graph of relationships, where some phrases are related to many other phrases, and some phrases have a more limited scope, and where the relationships can be mutual (each phrase predicts the other phrase) or one-directional (one phrase predicts the other, but not vice versa). The result is that clusters can be characterized “local” to each good phrase, and some clusters will then overlap by having one or more common related phrases.

Ordering related phrases by information gain (how likely they will predict other phrases) helps to create a taxonomy for naming the clusters of the phrase. The patent tells us that

The above process provides a very robust way of identifying significant phrases that appear in the document collection, and beneficially, the way these related phrases are used together in natural “clusters” in actual practice. As a result, this data-driven clustering of related phrases avoids the biases that are inherent in any manually directed “editorial” selection of related terms and concepts, as is common in many systems.

The process used here has three parts:

  1. Identify related phrases having a high information gain value.
  2. Identify clusters of related phrases.
  3. Store cluster bit vector and cluster number.

Indexing Documents with Phrases and Related Phrases

I’m going to provide a very quick summary of most of the rest of the document – this is getting to be a pretty long post. Here are the other parts covered in this patent application:

Indexing Documents with Phrases and Related Phrases

Once there is a good phrase list, with the information about related phrases and clusters, the next step is to index the documents with respect to the good phrases and clusters, and store the updated information in the primary index and the secondary index.

Documents are pre-ranked according to information retreival practices with respect to the phrases.

The scoring algorithm for pre-ranking the documents may be the same underlying relevance scoring algorithm used in the search system to generate a relevance score. In one embodiment, the IR score is based on the page rank algorithm, as described in U.S. Pat. No. 6,285,999.

Alternatively or additionally, statistics for a number of IR-relevant attributes of the document, such as the number of inlinks, outlinks, document length, may also be stored, and used alone or in combination in order to rank the documents.

For example, the documents may be ranked in declining order according to the number of inlinks.

To further facilitate the fastest possible retrieval of information from the primary index, the entries in each posting list are physically stored on the appropriate primary server in the rank ordering by the IR-type score.

The lower an entry is ranked, the less information may be retained for it, and those lower ranked documents may be kept in a secondary server:

The foregoing storage arrangement enables storing significantly more entries in a given amount of hard disk storage than conventional techniques.

1) Elimination of the term position information for every phrase in every document provides approximately a 50% reduction in the amount of storage needed for a given set of documents, thereby effectively doubling the number of documents that can be stored.

2) Partitioning the posting lists between the primary index and secondary indices and storing relevance information only in the primary index provides further substantial savings. Many phrases have over 100,000, even 1,000,000 documents in their posting lists. Storing the relevance information for only a limited number of entries in the primary index eliminates the storage needed for the documents that are not likely to be returned in search. This aspect provides approximately a ten-fold increase in the number of documents that can be stored.

3) Further savings (approximately 25%-50% reduction in required storage capacity) are achieved by selectively storing less relevance information in the primary index for the less relevant (lower ranked) documents in each posting list.

The patent discusses keeping older versions of pages for archival purposes, and to allow people to search by date range. It also covers future indexing, and how changes in pages might be identified, and may impact how phrases are indexed.

Identifying Phrases in Queries

An important aspect of this whole process is the ability to identify whether or not a query contains a phrase. If it does, than this phrase indexing can be helpful in locating relevant documents for searchers. Identifying phrases in a query is a little similar to identifying phrases on a page to be indexed. But there are some differences.

For instance, capitalization may play a role in understanding what someone is looking for when they type a query into a search box.

Comparing queries to indexed phrases is a little complex, but the patent application notes four types of approaches they might take in doing some matching, when they see the following in a query:

– A single query phrase (the results are already pre-ranked)
– Two common query phrases (an intersection of results need to be ranked based upon the relevance data collected for each)
– Two rare query phrases (similar to when there are two common query phrases, but the likelihood of having to go to the secondary index is unlikely.)
– A common phrase and a rare phrase (primary index documents are joined together, and then secondary documents are added for the more common phrase, and all of those documents are ranked.)

Different Kinds of Rankings

The patent describes in detail, rankings based upon:

– Contained phrases
– Anchor Phrases
– Date Range Relevance

Identifying Spam Documents

Don’t know if you’re still with me here, but this is the main addition that this patent application brings to phrase based indexing.

How does this system know if keyword stuffing or honeypot activity is going on?

From the foregoing, the number of the related phrases present in a given document will be known. A normal, non-spam document will generally have a relatively limited number of related phrases, typically on the order of between 8 and 20, depending on the document collection. By contrast, a spam document will have an excessive number of related phrases, for example on the order of between 100 and 1000 related phrases. Thus, the present invention takes advantage of this discovery by identifying as spam documents those documents that have a statistically significant deviation in the number of related phrases relative to an expected number of related phrases for documents in the document collection.

There are some more details, and some additional information about identifying phrases associated with spam to find spam pages, but the volume of related phrases seems to be a large (and possibly very effective) part of this process.

Conclusion

A long and detailed patent application, I apologize for the length of this post. But I wanted to capture a fair amount of it, especially since my last two posts on phrase-based indexing didn’t go into a lot of depth on the subject.

Is Google using a process like this? It’s possible.

For something similar from Yahoo, which they’ve at least tested live, and supposedly implemented for “related results” since 2003, see my post on Reranking Search Results Based Upon Concepts in Users’ Queries.

Share

36 thoughts on “Phrase Based Information Retrieval and Spam Detection”

  1. Hey Bill,

    congratulations on this fantastic quality post… I’m stunned about its length and yet comprehensive style explaining me this concept.

    Now this is pretty sure the reason for really low quality content created by “seo copywriters” (focussing on keyword density and such crap) goes supplemental… (maybe combined with better dupe content checks… )

    thanks & happy new year
    cheers, christoph
    – the marketingfan

  2. Hi Christoph,

    It’s good to see you here. Thank you for your kind words.

    This may be one of my longest posts here, but it’s also one of the longest patent applications that I’ve written about, and potentially one of the most important.

    Happy New Year to you.

  3. _____________________________________________

    This patent is basically an advanced branch of CONCEPT SEARCHING (remember EXCITE) – and fundamentally, is already being used to some extent is some legal search applications. :-|

    However, it is possible, that classic sites like this:
    cuiwww.unige.ch/meta-index.html
    (which, for years, until recently, has been in the top 20 on Google for the term SEARCH ENGINES) would be seen as spammy.

    There will also be false positives with this method, only because of aggressive SEOing or extremely information-packed homepages. Or the aggressive use of synonyms or acronyms to cover all basis. Also, those who write in a taxonomy style (just using keywords without stop words) will suffer!!!!!!!

    However, if this method is used in balance with link popularity and popularity links, it would be worth evaluating the SERPs.

    But it must be remembered that, the new priority of search engines ALGOs – analyzing the anchor text/ back links from high trust ranked sites, makes it now nearly impossible for those honeypot sites to get high on the SERPs.

    Most searchers do not use complex search terms – so many no longer get spam sites to the degree they would have gotten a few years ago,
    And for those who do use complex terms, usually reference sites come up first. Poor sites usually remain near the bottom 900 – 1000 end of the serps

    if Google does buy into this, the so-called bad phrases sites might go into the supplimental listings.

  4. Wow, had to read this several times. It sounds great in theory, it will be interesting to see how it works. As users become more sophisticated queries are getting longer which should help phrase matching. It seems like part LSA, part search behavior analysis and part of the process being used to determine duplicate content. I know that’s a pretty simplistic observation but then again consider the source.

  5. However, if this method is used in balance with link popularity and popularity links, it would be worth evaluating the SERPs.

    Right, Search Engines WEB. None of that stuff goes away. This phrase-based indexing is a little like a reranking approach in that it fits over the information retrieval and link popularity methods in place. I really haven’t seen any other white papers or patents from Google that discuss supplemental results like these do, and under this, the bad phrases results do go into the supplemental results. And the solutions that we’ve been hearing about supplementals – links to the pages, and changing titles, meta descriptions, and other parts of pages that might be duplicates would work within the framework of these patents.

    Thanks, Brian of SearchDaddy. Good luck to the Bengals on Monday.

  6. As soon as there is enough processing power to do it, you’ll see a lot of new statistical methods of checking for spam.

    Of course, with enough processing power you could create statistically valid pages as well that were spam but don’t look like spam so the sword cuts both ways.

    I’m actually amazed that I haven’t seen google counting periods, commas, and other stuff like that. From what I’ve seen when looking at various spam pages, they have an inordinately low or high number of them depending on how the pages are created.

    G-Man

  7. I agree with you, G-Man.

    That reminds me of the Microsoft paper Spam, Damn Spam, and Statistics (pdf).

    This patent does mention punctuation in the context of helping to understand where phrases may stop, but nothing about using that type of statistical analysis to recognize spam. I wouldn’t be surprised if they were doing some of the things that the Microsoft paper suggests, though.

  8. I, like David, had to read this a couple of times (plus the links to other info), but it makes sense that the search engines would be wanting to achieve this. Certainly hope it starts to kill those awful articles that make no sense and waste the reader’s time.

  9. Hi Arnie,

    If it can have that effect, that would be tremendous.

    It’s hard to tell if any of this phrase based indexing has been implemented, but I think that parts of it may be. I’m still seeing Google bombing work, which we’re told under one of the patent applications wouldn’t be as prevalent. But the existence of supplemental results point towards the possible use of some of the ideas in these patent applications.

  10. Great post Bill,

    At first glance, this seems like an attempt to filter out people who scrape the SERPS and feed it right back to Google as keyword rich content. (Sorry G-Man)

    It might affect some SEO type of directories that have a high ratio of the same type of phrases in the titles, descriptions and listings listings for a competitive niche as well. Human edited directories may be less likely to be affected by this as the listings would likely be less repetitive.

  11. Thanks.

    I like the series of patent applications on phrase indexing a lot, and I’m hoping that some of the other types of things mentioned within them are added to Google, such as the ability to search by date range, and find older versions of pages. I guess we wait.

    The first four of these “phrase-based indexing” patent filings mention that there were six patents filed on the same day that are related to each other. Only four of them have now been published, and two others have also make their way to public scrutiny, including this one which was filed almost two years after the original six.

    This one does seem to have the potential to negatively affect Made for Adsense (MFA) sites as well as sites with keyword phrases stuffed into them. I wouldn’t be upset if it did.

    Good points on the different types of directories.

  12. Pingback: Yahoo Phrase Based Indexing in a Nutshell |
  13. Excellent article, quite long so will need to read again to comprehend the extensive analysis provided. While we`ll probably never be rid of crap, having come online after this article was authored I can say for certain at least the “quality of crap” has risen somewhat.

  14. i ve read alot about keywords and seo but this was the first article that i came across about phrase based indexing…thanks for the in depth details…Cheers !!!

  15. Old article, but very interesting patent. Always makes me wonder how Google gets so smart. I believe phrase based indexing also works now with text that’s been through article spinning.

  16. Hi Piotr,

    Interesting points.

    Google has also done a lot of research on statistical language models in the past few years, which can help with identifying synonyms amongst other things. Articles that are “spun” may not do the best job of appearing natural, and phrase-based indexing might also be used to identify them as well.

Comments are closed.