How Google Might Fight Web Spam Based upon Classifications and Click Data

When you enter a set of keywords into Google, the search engine attempts to find pages containing those keywords and returns a set of results ordered by a combination of relevance and importance scores. But many of the pages that could be returned in response to such a search may not be good matches for the topic behind the query terms, or may be spam pages.

According to a Google patent filed in 2006 and granted today, around 90 percent of web pages that could be returned for topics such as computer games, movies, and music are spam pages, which exist only to “misdirect traffic from search engines.” The patent tells us that those pages are usually unrelated to those “topics of interest” and try to get a visitor to purchase things such as pornography, software, or financial services.

The patent presents an automated process that might be used by the search engine to classify documents based in part upon user-behavior data, to help weed out web spam.

There are multiple steps behind this process, but it begins with identifying a number of “seed” queries related to a specific topic. The queries are searched upon at the search engine, and the pages appearing as results of those queries are analyzed for common features.

For instance, the words that appear within a certain top number of those documents might be analyzed to see how often certain n-grams, or combinations of words, appear within them. An "n-gram" is a run of consecutive words of a given length: bi-grams are two-word combinations, tri-grams are three-word combinations, and so on. Taken from a phrase on a page such as "The quick brown fox jumps over the lazy dog," the tri-grams would look like the following:

  • The quick brown
  • quick brown fox
  • brown fox jumps
  • fox jumps over
  • jumps over the
  • over the lazy
  • the lazy dog
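
As a hypothetical illustration (the patent doesn't describe an implementation), extracting the n-grams of a given length from a phrase is straightforward:

```python
def ngrams(text, n):
    """Return the consecutive n-word combinations in a piece of text."""
    words = text.split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

# The tri-grams listed above:
for trigram in ngrams("The quick brown fox jumps over the lazy dog", 3):
    print(trigram)
```

Counting how often each n-gram appears across the top documents for a topic's seed queries would then give the kind of common-feature statistics the patent describes.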

Another feature might compare how frequently specific words appear on a page with how frequently those same words appear on other pages across the web that contain them.
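
That kind of comparison resembles the familiar TF-IDF weighting from information retrieval. The sketch below is my own illustration of the idea, not code from the patent:

```python
import math

def tf_idf(term, page_words, corpus_pages):
    """Weight a term by its frequency on one page relative to how many
    pages in the corpus contain it; high scores mark distinctive terms."""
    tf = page_words.count(term) / len(page_words)
    docs_with_term = sum(1 for page in corpus_pages if term in page)
    idf = math.log(len(corpus_pages) / (1 + docs_with_term))
    return tf * idf
```

A word that is frequent on one page but rare across the corpus scores high, which is exactly the sort of signal that helps separate topic-specific vocabulary from filler.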

Those and similar types of features might be used to classify web pages based upon the words that appear within them, and to annotate those documents so that classification information is associated with them.

The topic information for the queries used is compared to the classification information for the pages that appear within the search results, to attempt to determine whether a page is:

  • Related to the specific topic
  • A spam document
  • Not related to the specific topic or is off-topic

While some pages may contain the keywords used in a query, that doesn’t necessarily mean that those pages are on the same topic as the query itself. Because of that, the patent tells us that user input might also be considered, such as looking at:

Click-through rates – how often certain pages are selected in search results in response to a query, compared to how often those pages are shown in response to that query.

Click durations – the amount of time that someone remains on a given page when visiting that page after finding it in search results.

Other, unnamed navigational operations may also be used to determine whether pages should keep the classifications assigned to them, based upon user behavior.
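
The two named signals could be computed per query–page pair roughly as follows. This is a sketch; the threshold values are illustrative assumptions of mine, not numbers from the patent:

```python
def click_signals(impressions, clicks, total_dwell_seconds):
    """Compute click-through rate (selections / presentations) and
    average click duration for one page shown for one query."""
    ctr = clicks / impressions if impressions else 0.0
    avg_duration = total_dwell_seconds / clicks if clicks else 0.0
    return ctr, avg_duration

def looks_spammy(ctr, avg_duration, min_ctr=0.01, min_duration=5.0):
    """Per the patent, a low value for either rate suggests spam.
    The cutoffs here are hypothetical."""
    return ctr < min_ctr or avg_duration < min_duration
```

In practice a search engine would likely combine these with many other signals rather than flag a page on either rate alone.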

This combination of topic classification and click information can be used to determine whether pages are on-topic, off-topic, and/or spam. On-topic documents might be boosted in search results, while off-topic or spam pages might be lowered in rankings or removed from those results.

The patent is:

Method and apparatus for classifying documents based on user inputs
Invented by Jun Wu, Zhengzhu Feng, Quji Guo, and Zhe Qian
Assigned to Google
United States Patent 7,769,751
Granted August 3, 2010
Filed January 17, 2006

Abstract

One embodiment of the present invention provides a system that automatically classifies documents (such as web pages) based on user inputs. During operation, the system obtains a “classified” set of documents which are classified as relating to a specific topic. The system also obtains queries related to the specific topic. These queries produce “query results” which enable the user to access documents related to the query.

The queries also include “click information” which specifies how one or more users have accessed the query results. The system uses this click information to identify documents in the classified set of documents which are not related to the specific topic or are off-topic. If such documents are identified, the system shifts the identified documents so that they are regarded as off-topic and/or spam, and removes the identified documents from the classified set of documents.
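
The filtering step the abstract describes — removing documents from a topic's classified set when click data suggests they don't belong — could be sketched like this (the function and parameter names are my own, not the patent's):

```python
def filter_classified_set(classified_docs, click_info, min_ctr=0.01):
    """Partition a topic's classified documents into a kept set and a
    demoted (off-topic/spam) set based on per-document click-through
    rates. min_ctr is an illustrative threshold."""
    kept, demoted = [], []
    for doc in classified_docs:
        ctr = click_info.get(doc, 0.0)
        (kept if ctr >= min_ctr else demoted).append(doc)
    return kept, demoted
```

The demoted documents would then be re-labeled as off-topic and/or spam, as the abstract puts it, rather than simply discarded.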

Conclusion

This patent was originally filed in 2006, and while it describes one way that Google might attempt to identify whether pages which appear in search results are either spam or unrelated to the topic behind a set of query terms, it likely isn’t the only approach that Google may have found to try to filter out web spam.

Another approach to identify spam pages that the search engine could possibly use is described as part of Google’s Phrased-Based Indexing process, which I wrote about back in 2006, in Phrase Based Information Retrieval and Spam Detection. The process detailed in the patent behind that post doesn’t include a look at user-behavior data such as the click throughs and click duration mentioned in this patent, but it’s possible that it could.

Unfortunately, there’s still plenty of web spam to be found in search results, but it’s possible that approaches like the one described in this patent have been effective in filtering out some of the spam that we’ve seen in the past by combining the use of classification of pages and queries with user-behavior data. Google’s Matt Cutts, the head of their web spam team recently asked the visitors of his blog to identify web spam projects that they would like to see Google tackle over the next year.

Given the large number of responses to Matt Cutts, and the wide variety of problems identified as web spam within those responses, it’s pretty clear that the problem of web spam is far from being solved, and that the definition of web spam may be expanding to include content on web pages that may be relevant to a topic but of fairly low quality.


40 thoughts on “How Google Might Fight Web Spam Based upon Classifications and Click Data”

  1. Data such as click-thru rates and click durations as you mentioned are really the only logical next step behind link query data for modern search engines. IMO this makes perfect sense, especially since results based on links alone can be manipulated. This information would be much harder for webmasters to manipulate. It forces a webmaster to produce fresh quality content if a high SERP is to be maintained.

  2. Does “click duration” refer to the actual time spent on a page (which I doubt Google can measure) or to a bounce rate? If they’re checking to see if the user returns quickly to the SERP, this won’t help in the case of MFA sites: the user clicks through from the SERP, sees that there’s no useful content on the page, but decides to click an AdSense link on the page rather than their back button.

    In such a case, Google wouldn’t receive the signal that the page was potentially spam unless they were also measuring the time between an ad unit loading and being clicked.

  3. @mark, it’s actually surprisingly easy to manipulate both the ctr and bounce rate/time on site of a webpage in the serps. The issue is doing it on high traffic volume keywords, where you would need significant scale to actually begin to influence the “real” data.

    Nevertheless, the guys who spam the serps will naturally gravitate to spamming the ctr given enough time…

  4. Hi Bob,

    Here’s what the patent tells us about click duration:

    Next, the system feeds “click information” 124 from the identified queries 122 into a document-filtering module 126 which filters out irrelevant/spam pages. For example, this click information 124 can include a “click-through rate,” which indicates the number of times a given document is selected divided by the number of times the given document is presented for selection. It can also include a “click duration,” which indicates an amount of time that a user remains on a given document while accessing query results. If either of these rates is low, the associated web page is likely to be a spam page.

It’s not clear if this time is measured by how quickly someone might return to the search result, or might be measured in some other manner, perhaps by something like the Google toolbar, which might collect a trail of pages visited by a browser, including the search result URL before the selection and a new URL visited afterwards, with timestamps associated with each action.

    The issue of MFA pages is something worth considering within this context too.

    If an ad is clicked upon on a MFA page, Google should be able to tie that click to that search result as well.

    While the thought of Google leading people to pages where it’s more likely that they might click upon an ad may seem like it would be tempting to Google from a financial stance, would having search results filled with low quality MFA pages lead to less actual searches on Google itself? Is this something that Google considers when determining whether a MFA page appears in search results?

  5. Hi dchuk,

    Very good points. That’s another reason why data like this might not be viewed in isolation, but rather within the context of things like larger query sessions and search trails. The cost of spoofing clicks and visit durations becomes more expensive when those visits aren’t viewed alone, but rather as part of a larger process, and one that would need to be random enough to not seem to follow an identifiable pattern.

  6. Hi Mark,

    Good points. It does make sense that the more “meaningful” signals a search engine has to use to determine whether or not a page is a good result for a specific query, the better the results we’ll see.

    Even then, I can see a couple of potential problems.

One is that some pages may be ideal results for certain queries, yet provide answers to those queries so quickly that people viewing the pages return to search results almost immediately. In that instance, it may be worth looking at those searchers’ next queries to see if they’ve moved on to a new topic, for example.

    Other pages may not be on topic, but may be fairly engaging and interesting on their own, and may get clicked upon and have people spend a great deal of time there as well. It may also be worth looking at their future queries within their query session to see if they continue searching for more information on a topic even if their behavior might indicate that an off topic page was actually on topic.

    This patent really doesn’t raise the ideas of looking at query suggestions and user-behavior within the context of choices made over a larger query session, or within the context of a search trail. Those may be areas where a process like this has evolved.

  7. Hi Bill,
    I use sitemeter for my blog’s traffic statistics, and occasionally Google analytics. Lately I’ve noticed the “googlebot” visiting my site several, and I mean several times a day. Each time accessing random pages, sometimes for one or two minutes other times the bot’s on board at least 20 minutes.

    Does this mean anything?

I believe that Google only measures CTR in search results. I have enough well-ranking MFA sites with carefully selected descriptions that they resemble “normal” sites. In fact, Google isn’t comparing raw CTR but normalized CTR, because CTR differs depending on a result’s position in the search results.

Patents can be notoriously expensive grants to obtain. There probably aren’t too many scientists/engineers who can assess and qualify the “newness” in some of the patent claims. This will hold the granting of a patent up. If they have applied for a world-wide patent (which I’m pretty sure they would have), then it has to be tested in all the patent offices from which the patent is sought.

    As technology progresses, these claims may well take longer and longer to obtain. Once a patent has been applied for, the technology is covered but they can’t take any legal action against any infringements until the patent has been granted.

  10. Hi Roschelle,

    It’s not unusual for a search engine to visit, leave, and come back to a site to crawl pages. That’s considered being polite – if the crawling program stayed for a longer period, and visited pages really quickly, it could potentially use up your server’s resources and that could impact whether or not other people could visit your pages as well.

  11. Hi Andrew,

    The patent process can take a fair amount of time, and four years isn’t all that unusual. Here’s the patent office transaction history for this particular patent:

    Transaction History
    Date Transaction Description
    08-03-2010 Recordation of Patent Grant Mailed
    07-14-2010 Issue Notification Mailed
    08-03-2010 Patent Issue Date Used in PTA Calculation
    06-25-2010 Dispatch to FDC
    06-25-2010 Application Is Considered Ready for Issue
    06-22-2010 Issue Fee Payment Verified
    06-22-2010 Issue Fee Payment Received
    06-02-2010 Mail Response to 312 Amendment (PTO-271)
    06-02-2010 Response to Amendment under Rule 312
    05-13-2010 Amendment after Notice of Allowance (Rule 312)
    03-24-2010 Mail Examiner’s Amendment
    03-24-2010 Mail Notice of Allowance
    03-24-2010 Document Verification
    03-24-2010 Notice of Allowance Data Verification Completed
    03-14-2010 Examiner Interview Summary Record (PTOL – 413)
    03-23-2010 Examiner’s Amendment Communication
    01-28-2010 Miscellaneous Incoming Letter
    01-29-2010 Date Forwarded to Examiner
    01-28-2010 Request for Continued Examination (RCE)
    01-29-2010 Disposal for a RCE / CPA / R129
    01-28-2010 Workflow – Request for RCE – Begin
    12-07-2009 Miscellaneous Incoming Letter
    12-14-2009 Mail Examiner Interview Summary (PTOL – 413)
    12-08-2009 Examiner Interview Summary Record (PTOL – 413)
    10-28-2009 Mail Final Rejection (PTOL – 326)
    10-26-2009 Final Rejection
    09-26-2009 Date Forwarded to Examiner
    09-04-2009 Response after Non-Final Action
    09-14-2009 Mail Examiner Interview Summary (PTOL – 413)
    09-03-2009 Examiner Interview Summary Record (PTOL – 413)
    06-04-2009 Mail Non-Final Rejection
    06-03-2009 Non-Final Rejection
    04-03-2009 Date Forwarded to Examiner
    04-03-2009 Date Forwarded to Examiner
    04-01-2009 Request for Continued Examination (RCE)
    04-03-2009 Disposal for a RCE / CPA / R129
    04-01-2009 Workflow – Request for RCE – Begin
    03-31-2009 Case Docketed to Examiner in GAU
    03-20-2009 Mail Advisory Action (PTOL – 303)
    03-19-2009 Advisory Action (PTOL-303)
    03-14-2009 Date Forwarded to Examiner
    03-12-2009 Amendment after Final Rejection
    01-12-2009 Mail Final Rejection (PTOL – 326)
    01-12-2009 Final Rejection
    12-19-2008 Date Forwarded to Examiner
    11-25-2008 Response after Non-Final Action
    11-07-2008 Mail Examiner Interview Summary (PTOL – 413)
    11-05-2008 Examiner Interview Summary Record (PTOL – 413)
    10-30-2008 Correspondence Address Change
    08-27-2008 Mail Non-Final Rejection
    08-26-2008 Non-Final Rejection
    07-27-2008 Date Forwarded to Examiner
    07-27-2008 Date Forwarded to Examiner
    07-15-2008 Request for Continued Examination (RCE)
    07-27-2008 Disposal for a RCE / CPA / R129
    07-15-2008 Workflow – Request for RCE – Begin
    06-25-2008 Mail Advisory Action (PTOL – 303)
    06-23-2008 Advisory Action (PTOL-303)
    06-19-2008 Date Forwarded to Examiner
    06-16-2008 Amendment after Final Rejection
    04-16-2008 Mail Final Rejection (PTOL – 326)
    04-14-2008 Final Rejection
    04-01-2008 Case Docketed to Examiner in GAU
    03-20-2008 Date Forwarded to Examiner
    02-28-2008 Response after Non-Final Action
    11-28-2007 Mail Non-Final Rejection
    11-26-2007 Non-Final Rejection
    10-17-2007 Withdraw Flagged for 5/25
    10-15-2007 Flagged for 5/25
    10-09-2007 Case Docketed to Examiner in GAU
    09-04-2007 Case Docketed to Examiner in GAU
    08-30-2007 Case Docketed to Examiner in GAU
    07-05-2007 Case Docketed to Examiner in GAU
    12-12-2006 IFW TSS Processing by Tech Center Complete
    04-07-2006 Application Return from OIPE
    04-07-2006 Application Is Now Complete
    04-06-2006 Application Return TO OIPE
    04-06-2006 Application Dispatched from OIPE
    04-06-2006 Application Is Now Complete
    03-24-2006 Additional Application Filing Fees
    03-24-2006 A statement by one or more inventors satisfying the requirement under 35 USC 115, Oath of the Applic
    02-23-2006 Notice Mailed–Application Incomplete–Filing Date Assigned
    01-17-2006 PGPubs nonPub Request
    02-06-2006 Cleared by OIPE CSR
    02-02-2006 IFW Scan & PACR Auto Security Review
    01-17-2006 Initial Exam Team nn

As you can see, there’s a lot of interaction between the patent office and the people or person who filed the patent. Some of the claims in the patent were rejected a few times, and the patent was amended at least three times before finally being granted.

  12. Hi Tihomir,

    I’m not quite sure of the point that you’re making. But, I do think it’s important to think about how Google treats MFA sites. There are a number of fairly large sites that primarily scrape together content from a mix of sources, while adding only a little new content, such as Mahalo, that seem to do well in search results at Google. Where does Google draw the line between spam and thin MFA pages?

  13. Hi Jan,

    Thank you. Those are very good points about the patent process. Patent Examiners not only need to know enough about the processes behind a patent to make a decision (or many decisions about them), but they also need to be very good at finding patents involving similar processes, and know a considerable amount about the patent process itself.

    The professional rules of responsibility (ethical codes) for Attorneys that are in place in most jurisdictions in the United States don’t allow lawyers to advertise specialties in most areas of the law with two exceptions – Admiralty Law and Patent Law. The reason for that is that both areas can be extremely complex.

  14. @Bob Gladstein and @Bill Slawski (about Click durations)
    Hi guys.
    I am wondering exactly the same thing. How can Google know how much time every user spends on a page. It is impossible. BUT it is possible for SOME of the people. If the user has google toolbar installed or if he uses Google Chrome, then Google can get that information and I am sure that they do that.

Thank you for spending some time to answer my question. I am really amazed and touched. At least now we can rest in the knowledge that patents are examined really hard before being granted. Again, thank you.

Wondering why this has taken so long. Surely anything that helps combat spam in the search results, and therefore makes things more relevant and useful for the searcher, is something that should be implemented as soon as possible.

  17. I might not be understanding things correctly, but I thought Google has been using click-through rates for quite awhile–at least the past three years. I don’t have any hard data or technical evidence, but I’ve experienced it first hand.

One common example of the power of CTR is abused by spammers already. If you find a popular company or product and write a review with the word “scam” in it, the click-through rate is typically very high, I would imagine. It certainly draws people’s attention. You see affiliate marketers abuse this technique by using “scam” in the title, then trying to flip the person into buying the product by writing a positive review.

  18. @karen
    Ha-Ha. Very interesting method. I have not thought of that. I guess they could say in the title: “Product X – scam or not?” This should rank for “product X scam”.
    But CTR is not enough for good ranking :)

  19. Hi Derek,

    Yes, the toolbar and the use of the Google browser can help the search engine in those situations. When we think about how much user-behavior data may affect things like rankings, filtering of spam, and in other ways that a search engine may operate, I think we do need to be concerned about how the search engines might be collecting that data, and if it’s really capturing information from a wide enough audience. If more people start using Chrome, will the data being collected start to become more useful to the search engine? I’m guessing it will.

  20. Hi Ian,

    It’s possible that Google started using something like this back when the patent was filed, rather than waiting for the day that it was granted.

  21. Hi Andrew,

    You’re welcome.

    I’m not sure that every patent is scrutinized as deeply as it should be, but I often find myself impressed when I start digging through some of the documents like the ones that I’ve listed, and read about why specific claims in original patent filings might have been rejected, and then look at the amendments to patents that address those rejections. The patent process can be pretty serious, and it should be.

  22. Hi Karen,

    It’s likely that click-throughs have been used by a number of search engines since at least the late 1990s, when a search engine named Direct Hit started using click-throughs to influence the rankings of web pages. Like many of the processes that Google uses, chances are that click-throughs are likely only one of a large number of signals that a search engine might use for ranking, for spam identification, and so on, to avoid things like manipulation.

  23. Hi Sam,

    I agree – click-throughs by themselves aren’t enough for good rankings. For example, in the process described in this patent filing, they discuss how they go through a classification process for both queries and for web pages based upon which actual words are used on pages, and the frequencies of some words or phrases as well. Of course the process here is more about deciding whether or not a page should be filtered out of results as spam rather than for ranking purposes, but search engines do look at as much data as they think might be helpful.

  24. One thing to bear in mind is that although the patent was only recently granted, it doesn’t mean Google aren’t using this method right now, it’s just that they now have the exclusive right to use this particular method of document classification.

    Google SERP links are JavaScript now, and have been for at least a couple of years since I checked. Each link you click triggers a JS event before sending you to the page. What’s the official line on what these are for?

    As far as bounce rate is concerned how can this be determined unless Google Analytics is on the page? Cuttsy says here http://www.youtube.com/watch?v=CgBw9tbAQhU that his team will never ask the Analytics people for their data. Hope he’s not a liar.

  25. @Jon. The JS links aren’t standard. I’ve seen them used at times, but not all the time. If you’re seeing them on every SERP, then it may be that Google has included you in some test. I just ran a search, and the source code for the link in the first result was this (using square brackets so I don’t generate HTML here):

    [h3 class="r"][a target="_self" href="http://www.xxx.com/" class="l"]

  26. Wow everything that I want to say and want to ask are already being answered by the comments. But I would like to say that this is really a great post because it really gives us newbies some information about SEO. I also get to see the video provided by Jon. You guys are really pros. Hope to be like you guys someday.

  27. Hi Jon,

    I’m not sure if Google has published an “official line” on what may specifically happen when you click on a link in their search results. I don’t think it hurts to look at the source code from Google search results every so often to see what might be happening under the page, though.

The amount of time someone spends on a web page after clicking on a link in search results could be collected in other ways than looking at Google Analytics, such as looking at Google toolbar data or Google Chrome data.

  28. Hi Bob,

    Thanks for the example of what you’ve been seeing in Google’s search results. I’m seeing an “onmousedown” js event attached to what you’ve written, as well.

  29. I just checked again, and I’m still not seeing any JS on the link anchored by the page title, but it is checking for onmousedown on the link to the cached copy of the page.

    However, it appears that if I disable Greasemonkey (I have a few scripts running when I use Google), the javascript does appear. Whether I’m randomly getting SERPs with the script and I happened to get one right after I temporarily disabled Greasemonkey or one of the scripts I run is somehow disabling Google’s tracking script I don’t know, but I may have been wrong in telling Jon that the scripts aren’t standard.

  30. That’s an interesting approach, but probably not too useful against today’s dynamic web spam. Click information won’t be available for a page until it has been up long enough to see a significant number of clicks. Spam pages (and especially hostile code pages) tend to have short lifetimes today.

    This also requires very intrusive logging of user activity. If you have the Google Toolbar installed, and have Web History enabled, Google knows what pages you visit and when. (Google could compute, for example, “time spent watching porn”.) Only with that level of intrusive logging is enough information captured for this approach to work.

  31. Hi Bob,

    Interesting that your use of Greasemonkey might keep Google’s tracking script from appearing. I haven’t regularly made a habit of viewing the source code for search results pages, but I might start making that something that I do daily.

  32. Hi John,

    If Google is going offer search personalization in an effective manner, they do need to collect a considerable amount of information about search and browsing behavior. That level of logging of information also can help them in many other ways, such as in deciding which query refinements to show searchers, determining spelling corrections, deciding when to show onebox results of different types, and more.

    Very interesting point about the short lifetimes of spam pages. What would you attribute those short lifetimes to?

  33. I don’t think spam of any sort will be manageable by computers until you have a computer smart enough to say, “hey, that’s a pretty cool webpage/email/im — I’d better tell the boss”. Until then, hands-on human edited/flagged results seem the only way.

  34. Hi SeaninBali,

    The search engines have come up with a number of ways to automate the identification of spam pages. I’ve written about a number of them in my Web Spam category. There’s still a lot of spam on the web, but manual editing and removal of automated spam is likely never to keep up.
