Document Level Classifiers and Google Spam Identification

There have been a number of news opinion pieces and blog posts appearing on the Web in recent months telling us that Google has become less useful because of web spam from pages scraping content from other sites, as well as from low-quality articles on content farms. Google’s head of Web Spam, Matt Cutts, responded to those criticisms by announcing some new efforts at Google to keep those kinds of pages from ranking as well in search results. From the Official Google Blog, on January 21, 2011:

As we’ve increased both our size and freshness in recent months, we’ve naturally indexed a lot of good content and some spam as well. To respond to that challenge, we recently launched a redesigned document-level classifier that makes it harder for spammy on-page content to rank highly.

The new classifier is better at detecting spam on individual web pages, e.g., repeated spammy words—the sort of phrases you tend to see in junky, automated, self-promoting blog comments.

Matt Cutts – Google search and search engine spam

When I got to the section of that post about a “redesigned document-level classifier,” I started asking myself, “What might Matt mean by a document level classifier, and how might it work to reduce the amount of spam found in search results?”

To provide a little insight into what a Document Level Classifier is, and how it might work, I dug into Google’s patents to see if I could find an example of a patent that specifically referred to a document level classifier that I may not have written about before.

I found the following patent which uses a document level classifier to understand the language being used on a web page:

Identifying language attributes through probabilistic analysis
Invented by Alexander Franz, Brian Milch, Eric Jackson, Jenny Zhou, and Benjamin Diament
Assignee: Google
US Patent 7,386,438
Granted June 10, 2008
Filed: August 4, 2003

Abstract

A system and method for identifying language attributes through probabilistic analysis is described. A set of language classes and a plurality of training documents are defined. Each language class identifies a language and a character set encoding. Occurrences of one or more document properties within each training document are evaluated.

For each language class, a probability for the document properties set conditioned on the occurrence of the language class is calculated. Byte occurrences within each training document are evaluated. For each language class, a probability for the byte occurrences conditioned on the occurrence of the language class is calculated.
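The process the abstract describes amounts to Naive Bayes classification: each language class gets a prior probability, and per-class probabilities for document properties and byte occurrences are combined (summed in log space) to pick the most likely class. Here is a minimal byte-level sketch with add-one smoothing; the class names, toy training strings, and equal priors are illustrative assumptions, not values from the patent:

```python
import math
from collections import Counter

def train_class_model(docs):
    """Estimate smoothed byte-occurrence probabilities for one
    language class from its training documents (add-one smoothing)."""
    counts = Counter()
    for doc in docs:
        counts.update(doc)  # iterating a bytes object yields ints 0-255
    total = sum(counts.values())
    return {b: (counts[b] + 1) / (total + 256) for b in range(256)}

def log_score(doc, model, prior):
    """Log-probability of a document's bytes under one language class."""
    return math.log(prior) + sum(math.log(model[b]) for b in doc)

# Toy training data; a real system would use large labeled corpora.
models = {
    "english/latin-1": train_class_model([b"the quick brown fox jumps"]),
    "spanish/latin-1": train_class_model(["el veloz zorro marrón salta".encode("latin-1")]),
}
best = max(models, key=lambda c: log_score(b"the fox", models[c], 0.5))
# best is "english/latin-1"
```

Each language class here pairs a language with a character set encoding, as the abstract describes, so the same text in two different encodings would score as two different classes.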

A Document Level Classifier

A document level classifier is simply a program that looks at a number of attributes it finds on a page to calculate a probability for a classification of that page. In the case of this patent about language attributes, those attributes could possibly include things like a character set and a language meta tag, like the following:

<head><meta charset="iso-latin-1"> <meta lang="fr"></head>

But, the patent tells us that language and character set meta tags appear on pages only rarely, and are often incorrect when they do.

We’re also told that the search engine could look at other clues to identify the language of a page, such as whether or not the domain the page is on uses a specific top-level country code. For example, it’s possible that a site on a “.es” domain is from Spain, and may be in Spanish.
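Taken together, meta tags and the country-code TLD amount to a small set of weak, sometimes-wrong hints that a classifier could collect before analyzing the text itself. Here is a minimal sketch; the `language_hints` helper and the ccTLD mapping are my own illustrative assumptions, not part of the patent:

```python
import re
from urllib.parse import urlparse

# Illustrative mapping only; real ccTLD handling covers far more codes.
CCTLD_LANG = {"es": "spanish", "fr": "french", "de": "german"}

def language_hints(html, url):
    """Collect weak language/encoding hints from a page: meta tags plus
    the country-code top-level domain. Any may be absent or wrong."""
    hints = {}
    m = re.search(r'charset=["\']?([\w-]+)', html, re.IGNORECASE)
    if m:
        hints["charset"] = m.group(1).lower()
    m = re.search(r'\blang=["\']?([a-z]{2})', html, re.IGNORECASE)
    if m:
        hints["meta_lang"] = m.group(1).lower()
    tld = urlparse(url).hostname.rsplit(".", 1)[-1]
    if tld in CCTLD_LANG:
        hints["cctld_lang"] = CCTLD_LANG[tld]
    return hints

page = '<head><meta charset="iso-latin-1"> <meta lang="fr"></head>'
hints = language_hints(page, "http://example.es/page")
# hints → {'charset': 'iso-latin-1', 'meta_lang': 'fr', 'cctld_lang': 'spanish'}
```

Notice that in this example the hints disagree with each other (a French language tag on a Spanish domain), which is exactly why the patent treats them as probabilistic evidence rather than ground truth.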

Instead, the approach that this patent takes is to look at features like those, but to also use a text analysis approach that breaks text upon pages into n-grams, or groupings of words that are “n” words long. In this patent, the suggested length is 3 words.

So, if the patent’s process were to look at this page to attempt to classify what language it might be in, it might start at the first sentence of this page and break it down into n-grams three words long. So, it might take my first sentence and turn it into three-word phrases like this:

There have been
have been a
been a number
a number of
number of news
of news opinion
news opinion pieces
opinion pieces and

These n-grams might be compared to a number of other pages on the web where the language of those pages is known, to determine that my page (or at least parts of my page) is in English. Note, the Google Books N-gram Viewer runs on a similar body of data as that used in this language detection approach.
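Generating those overlapping phrases is simple to implement. Here is a minimal sketch; the function name `word_ngrams` is my own, not from the patent:

```python
def word_ngrams(text, n=3):
    """Break text into overlapping n-word phrases, as shown above."""
    words = text.split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

sentence = "There have been a number of news opinion pieces"
trigrams = word_ngrams(sentence)
# first three: ['There have been', 'have been a', 'been a number']
```

A classifier would then compare the frequencies of these n-grams against per-language models built from documents whose language is already known.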

The n-gram approach has been used in a number of ways by Google, as noted in the Official Google Research Blog post All Our N-gram are Belong to You:

Here at Google Research we have been using word n-gram models for a variety of R&D projects, such as statistical machine translation, speech recognition, spelling correction, entity detection, information extraction, and others.

Using a Document Classifier to Identify Web Spam

A document level classifier doesn’t necessarily have to use an n-gram approach to identify web spam pages, but it’s possible that it might. A Google patent granted this past August included an n-gram approach to classify pages. I wrote about it in How Google Might Fight Web Spam Based upon Classifications and Click Data.

Another post that I wrote about a Google patent which describes how the search engine might identify web spam is Google Patent on Web Spam, Doorway Pages, and Manipulative Articles. That patent lists examples which might indicate that a page is web spam. Those examples include:

  • Whether the text of the document appears to be normal language or language that might have been generated by a computer. For example, it might contain a large number of keywords, but no actual sentences.
  • Whether or not the page uses meta tags, and if it does, whether or not those contain a large number of repeated keywords.
  • Whether there are scripts in the document that redirect visitors to another document when they access that page.
  • Whether the text on the page is the same color as the background of the page.
  • Whether the page contains a large number of unrelated links.
  • The history of the document, and whether certain things might have changed, such as the content, the linking structure, and possibly the ownership of the document.
  • The number of links and the amount of anchor text on a page. If there are a lot of links, and very little text that isn’t link text, that could be a sign that the page is web spam.
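Many of these checks reduce to simple page-level features that can be counted and thresholded. Here is a toy sketch; the thresholds, function signature, and feature inputs are purely illustrative assumptions, since neither the patent nor the blog post gives actual numbers:

```python
# Illustrative thresholds (no real values are published anywhere).
MAX_KEYWORD_RATIO = 0.10    # any single word over 10% of all words
MAX_LINK_TEXT_RATIO = 0.80  # link text over 80% of all visible text

def spam_signals(words, link_text_chars, total_text_chars,
                 has_redirect_script, text_color, bg_color):
    """Count how many of the listed web-spam signals a page trips."""
    tripped = 0
    if words:
        top = max(words.count(w) for w in set(words))
        if top / len(words) > MAX_KEYWORD_RATIO:
            tripped += 1   # repeated-keyword stuffing
    if total_text_chars and link_text_chars / total_text_chars > MAX_LINK_TEXT_RATIO:
        tripped += 1       # mostly links, very little other text
    if has_redirect_script:
        tripped += 1       # redirects visitors when they access the page
    if text_color == bg_color:
        tripped += 1       # hidden text
    return tripped
```

A real classifier would more likely feed features like these into a trained probabilistic model rather than simple hard-coded cutoffs, but the sketch shows how page attributes become classification inputs.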

Chances are that the redesigned document level classifier that Google is using to try to identify whether or not a page is web spam may look at many of these features as well as others.

What features would you look at if you were designing a document level classifier to identify web spam?


31 thoughts on “Document Level Classifiers and Google Spam Identification”

  1. I think it’s interesting that Google talks about document classification, not page classification. So by applying page segmentation techniques they could determine that a blog has a spammy footer and discount the links therein, whilst indexing the post content and passing pagerank from contextual links.

    As blogging gives way to social updates, which are shorter, link-heavy and content-light, classifying page segments becomes powerful. For example, if 5% of my microblog updates are self-promotional spam, a search engine could ignore those and still make use of the 95% ham.

  2. Hi Andy,

    Interesting thoughts.

    Many of the patents that Google files use the word “document” instead of “page,” because it potentially allows them to apply what they’ve come up with in their patents to more than just web pages. So, PDF files, Word documents, audio and video files, and other kinds of documents are potentially within the scope of those patents. There really isn’t any other significance than that when it comes to the use of the word “document” as opposed to “page.”

    Having said that, it is very much possible that Google is using a visual gap segmentation or VIPS approach to breaking a page up into parts, and looking at the different parts carefully. A document level classifier could include a careful analysis of different parts of pages.

    I’m not sure that I see “blogging giving way to social updates.” They’re different things, with different purposes, and while some people may prefer writing microposts, others may find 140 characters too limiting.

    Regarding those social updates, every individual tweet or social update may be ranked both individually, and based upon an overall reputation score for the person posting them.

  3. I could also imagine that they look at the appearance of words from different topics. On spam blogs you often find all kinds of topics, for example from mortgage posts to vitamin pills posts to software reviews.
    So sites that publish this kind of material and fail on some other metrics, such as quality links and traffic, can be easily excluded in search engine rankings.

  4. Pingback: Anonymous
  5. Based on my observations, Google makes SERP adjustments based on internal structure before it shows the change in structure on the search result snippet. I.e., a better title tag may boost your ranking even though Google appears to have not indexed your page and thus displays the old title tag.

    Your spam indicator “Looking at the history of the document, and whether certain things might have changed, such as the content, the linking structure, and possibly the ownership of the document” could tie into that very well. Sort of like a Bait-and-switch. Google gives an SEO rankings for a new structure, the SEO freaks out and reverts back to old structure since that’s what Google will appear to have indexed, Google says “gotcha”.

    Just a thought that needs a lot more testing!

  6. I would think that keyword-heavy page titles (hx-tags, not the title attribute) would be an excellent spam indicator.

    Newspapers and quality blogs write their page titles to communicate the contents of the article and persuade readers to click and read. Spam typically stuffs multiple related keyphrases in their page titles.

    Perhaps an even better indicator would be synonyms in close proximity in page titles, internal anchor text and other user-facing elements of a document.

    I see zero evidence of this being used, but a title like “Virginia Small Business SEO (Search Engine Optimization)” or “Chicago DUI Lawyer – Illinois Drunk Driving Attorney” is clearly not written for human intelligence and, depending on where it occurs, would be a good start when identifying spam documents.

  7. Frankly, I’d be surprised if Google had only been using these language modelling techniques in research at the time of that blog entry! I saw some very effective examples of language classification in the late ’90s using similar techniques to model the documents, and also examples of it in classification roles.

    My reckoning is that some variation of this could be used to detect the fluff that you get in the poor content so often turned out by the content farms, against the much richer language found in decent articles.

  8. So it sounds to me that if the spammers abide by a few SEO good practices, they will continue to fly right under the radar. If Google really has tightened things up, a few sites may be caught out, but it won’t take long until the culprits adapt their approach anyway.

  9. If I understand this post correctly, I get the feeling that the n-grams seem to be checking for correct grammar (for the language as it is identified by the classifier) which may be somewhat lacking in keyword stuffed or spun content. Wow. What a messy way to have to determine whether or not content is SPAM.

    Mark

  10. I hope Damon’s not right. Optimized titles are one of the only successful strategies a small business can use to get into the search results. It used to be relatively easy for that tree service, drain cleaner or dog groomer to get new customers – just purchase a yellow page ad. I’m not at all for splogs, but these poor home improvement/service guys need a break.

  11. Hi Andreas,

    There may be an element involving looking at the range of topics that appear on a page to try to classify that page, though it’s possible that a site that offers multiple topics in a manner like that might be as likely to be a news site as a spam page. In one way, the kind of document segmentation that Andy mentions above might be helpful in ranking those kinds of pages.

    And, it is possible that certain topics might trigger a manual review as well, such as when a site covers topics that seem to attract a lot of spam, such as NFL jerseys, or supplements of many types.

  12. Hi Joseph,

    Good points. Thank you.

    It can be hard to tell when Google has captured changes to your document, and it’s often tempting to change things back after you’ve made a change and aren’t seeing corresponding changes in search results and the snippets shown. It would probably be a mistake to say that Google doesn’t consider changes made to a page over time when it does something like classify a document.

  13. Hi Ted,

    Thanks for the link to your article.

    The patent on document inception dates was originally part of Google’s patent on Information Retrieval based on Historic Data, which was published in March of 2005, and provides many potential things that a search engine might look at to determine if a page might be web spam.

    The entropy concept was one of many approaches from that historic data patent that opened a lot of eyes about how closely Google was paying attention to many different aspects of a web page and information about that page. It’s difficult to tell how many of the methods described in the patent filing have been adopted by the search engine, but it’s likely that some percentage of them have been.

  14. Hi Damon,

    It’s not unusual to use keywords in places like page titles, headings, anchor text pointing to pages, and in the content body of pages in an effort to get a page to rank for a specific keyword. It’s also not unusual to include related terms and synonyms on those pages as well while optimizing a page.

    If a page is about the terms that you’re optimizing for, it really isn’t unnatural to include those terms in those places, and to also use synonyms within your content.

    But there probably is a point where people overdo it when trying to optimize a page for specific words, and the content of those pages can be painful to read.

    How might a search engine recognize when an author of a page becomes heavy-handed with the way that they include keywords on pages? I think it might be easier for a person to recognize what you’re writing about, but I’m not quite as sure that it’s as easy for a computer to do the same.

  15. Hi Chris,

    I pulled the Google patent on language detection out as an example of a document level classifier, but I agree with you that the search engines are exploring approaches using n-grams that are more advanced than what we find in the patent.

    You might like the following document from the SIGIR 2010 Web N-Gram Workshop, which includes a few papers that describe how Microsoft is using n-grams in a number of ways:

    http://research.microsoft.com/en-us/events/webngram/sigir2010web_ngram_workshop_proceedings.pdf

  16. Hi Steve,

    How does someone (or a computer) distinguish between low quality content and spam? Do you start doing things like calculating the reading levels of articles? The length of those articles? The number of related concepts that might be introduced on a page?

    A page might be low quality, and not be spam. There are topics on the web where there just may not be many pages that provide quality results for a search engine to return. A lot of the content farms that we see on the web have been acting to fill those gaps, but the pages created might just not be very helpful.

    Is the best approach to combating these kinds of low quality pages to remove them from search results?

    Google did publish a patent last year where they suggested that they might let searchers know that there just aren’t very many relevant and quality results for some queries. They also suggested that they might create a specialized search engine that people could use to find queries or topics where better pages could be created for those terms or subjects. I wrote about it in How Google Might Suggest Topics for You to Write About.

    I actually like that approach, and think it might inspire people to start creating some quality pages for the Web on topics where there might not be many. Interestingly, Google’s head of Web Spam, Matt Cutts, is one of the authors/inventors of that patent filing.

  17. Hi Criação de Sites,

    Interestingly, we found that Google decided to focus upon pages of the kind that you describe before launching something targeting pages from “content farms.”

    So, it’s possible that content that might have been scraped or copied from other sites might not rank as well as the original content. Hopefully Google finds a way to do that right, and the original content is what ends up ranking higher than copies.

  18. Hi Mark,

    The patent that I wrote about involving language identification for web pages uses an n-gram approach to classify web documents. It’s possible that the approach that Google comes up with might also use n-grams to classify documents, but there’s no guarantee that they do. I used the language identification patent as an example of one that uses a document level classifier to achieve its goals.

    Google may not use that n-gram approach, and instead just find a number of other ways to determine whether or not a page is webspam. But, it is possible that they may use n-grams as part of that approach.

  19. Hi Juliemarg,

    I’m not sure that Google really has a problem with titles and headings and content that contains keywords in an intelligent manner, and I wouldn’t be too worried about that practice. There is a point where someone might overly stuff keywords upon a page, and there’s a risk that might set off some kind of web spam trigger, but I wouldn’t worry too much about a page that’s been optimized well.

  20. I have noticed that spam is creeping up too, just the other day I hit a redirect page that made it look to be scanning my computer. Even at that for the most part my searches return pretty relevant results.

  21. Hi Crystal,

    I really don’t like that kind of redirect that masquerades as antivirus or antimalware software. I can’t close those pages fast enough. I also hate the ones that have a pop-up window asking you to click “OK” if you really want to leave the page.

  22. Hi Eric,

    I don’t think that Google considers wire news articles to be spam, but they may have a process in place specifically for those when they display them in Google News. I wrote about it in Google News Rankings and Quality Scores for News Sources.

    Since many of those news stories are taken from wire services, they do tend to present the same information, though a news organization is free to publish as much of those syndicated stories as they want, and even add to them. It’s not unusual for a newspaper that is local to a breaking news event to add to a wire service article, and the “novel” or additional material may result in one of those stories ranking higher than others. A Google whitepaper discusses how they might identify those stories with added information: Detecting the Origin of Text Segments Efficiently (pdf).

  23. Thanks for looking up the patent so as to provide us further information on this topic.
    The issues of abusive keyword density or stuffing are pretty easy for Google to detect. I’m glad to hear that hx tags are not becoming outlawed if used in a normal way.
    To answer your question about what to look for, I guess one of the challenges is detecting the different strategies used to multiply original content by spinning articles or manipulating titles and descriptions while the content is mostly the same. Especially if the language is not English.

  24. Hi Lisa,

    You’re welcome. I’ve had the challenge in the past of trying to create content for a number of pages where the topics were similar but the details were different.

    For instance, it’s not easy writing about the legal processes to do the same thing in 50 different states, and trying to come up with 50 different ways to say something similar is tough. But what I found after trying to do that on the same site for a number of years is that the magic is in understanding the differences and the details, and while there might be considerably more work in crafting unique content, the value that you get out of it is worth the effort. Not sure that the language really makes much of a difference.

  25. Pingback: 3 How Search Engines Work | Ignite Research

Comments are closed.