Google Turns to Deep Learning Classification to Fight Web Spam

In the past few years, Google has been busy building what has become known as the Google Brain team, which got its start by having a deep learning system watch videos until it learned to recognize cats.

Google has been hiring a number of people to strengthen its deep learning team, including a pricey acqui-hire in the UK earlier this year, as described in More on DeepMind: AI Startup to Work Directly With Google’s Search Team.

Web Spam Classification Patent

This patent describes methods that include:

  • Receiving an input comprising a plurality of features of a resource, wherein each feature is a value of a respective attribute of the resource
  • Processing each of the features using a respective embedding function to generate one or more numeric values
  • Processing the numeric values using one or more neural network layers to generate an alternative representation of the features of the resource, wherein processing the floating point values comprises applying one or more non-linear transformations to the floating point values
  • Processing the alternative representation of the input using a classifier to generate a respective category score for each category in a pre-determined set of categories, wherein each of the respective category scores measure a predicted likelihood that the resource belongs to the corresponding category
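The four steps above can be sketched in a few lines of Python. This is only an illustration under loud assumptions: the patent does not publish code, so the category names, the dimensions, the `embed` function, the `tanh` non-linearity, the softmax classifier, and the random (untrained) weights are all hypothetical stand-ins, not Google's method.

```python
import math
import random

# Hypothetical category set; the patent only says the set is pre-determined.
CATEGORIES = ["not_spam", "content_spam", "link_spam", "cloaking_spam"]
EMBED_DIM = 4   # assumed size of each feature embedding
HIDDEN_DIM = 6  # assumed size of the hidden layer


def embed(attribute, value, dim=EMBED_DIM):
    """Hypothetical embedding function: map one feature to `dim` floats.

    A trained system would look up learned vectors; here the values are
    pseudo-random but deterministic for a given (attribute, value) pair.
    """
    rng = random.Random(f"{attribute}={value}")
    return [rng.uniform(-1.0, 1.0) for _ in range(dim)]


def random_matrix(rows, cols, rng):
    """Placeholder for learned weights."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def dense(inputs, weights):
    """Plain matrix-vector product (biases omitted for brevity)."""
    return [sum(w * x for w, x in zip(row, inputs)) for row in weights]


def softmax(logits):
    """Turn raw scores into values in [0, 1] that sum to 1."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def category_scores(features):
    """Steps 1-4: features -> embeddings -> hidden layer -> category scores."""
    # Step 2: embed each feature and concatenate the numeric values.
    numeric = []
    for attribute, value in sorted(features.items()):
        numeric.extend(embed(attribute, value))

    rng = random.Random(42)  # untrained placeholder weights

    # Step 3: one hidden layer with a non-linear transformation (tanh).
    hidden = [math.tanh(h) for h in dense(numeric, random_matrix(HIDDEN_DIM, len(numeric), rng))]

    # Step 4: classifier producing one score per category.
    logits = dense(hidden, random_matrix(len(CATEGORIES), HIDDEN_DIM, rng))
    return dict(zip(CATEGORIES, softmax(logits)))


scores = category_scores({"domain": "example.com", "title": "Cheap pills", "age_years": 2})
```

Because the weights are random, the scores are meaningless as predictions; the point is only the shape of the pipeline. A softmax forces the categories to compete, which is one possible reading of "predicted likelihood"; independent per-category scores would be another.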

That “pre-determined set of categories” can include a search engine spam category. The category score for the “resource” measures a predicted likelihood that the resource is a search engine spam resource.

[Diagram from the patent: category scorer]

The pre-determined set of categories can include a respective category for each of a plurality of types of search engine spam. The pre-determined set of categories includes a respective category for each resource type in a group of resource types. Category scores can be used to:

  • Determine whether or not to index resources in a search engine index.
  • Generate and order search results in response to received search queries.

A deep network can be effectively used to classify resources into categories. For example, resources can be effectively classified as being spam or not spam, as being one of several different types of spam, or as being one of two or more resource types. The patent tells us:

Using the deep network to classify resources into categories may result in a search engine being able to better satisfy users’ informational needs, e.g., by effectively detecting spam resources and refraining from providing search results identifying those resources to users or by providing search results that identify resources that belong to categories that better match the user’s informational needs.

Classifying Resources Using a Deep Network
Invented by Qingzhou Wang, Yu Liang, Ke Yang and Kai Chen
Assigned to Google
US Patent Application 20140279774
Published September 18, 2014
Filed: March 13, 2013

Abstract

Methods, systems, and apparatus, including computer programs encoded on computer storage media, for scoring concept terms using a deep network.

One of the methods includes:

  • Receiving an input comprising a plurality of features of a resource, wherein each feature is a value of a respective attribute of the resource
  • Processing each of the features using a respective embedding function to generate one or more numeric values
  • Processing the numeric values using one or more neural network layers to generate an alternative representation of the features, wherein processing the floating point values comprises applying one or more non-linear transformations to the floating point values
  • Processing the alternative representation of the input using a classifier to generate a respective category score for each category in a pre-determined set of categories, wherein each of the respective category scores measure a predicted likelihood that the resource belongs to the corresponding category.

The patent tells us that this resource classification system could classify resources as “search engine spam resources or not search engine spam resources.” It doesn’t define web spam in much detail, but does tell us that it might look at typical types of spam such as:

    • Content spam
    • Resources that include link spam
    • Cloaking spam resources, and
    • So on

    The features of a resource can include words from the content of the site in a tokenized form, URLs from the site, the title of the site, its domain name, categories or entity types relevant to the site, and the age of the site. Each of these many features might be used to calculate a probability that the site is spam, which could determine whether or not it gets indexed or is reduced in rankings:
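To make the idea of "features of a resource" concrete, here is a minimal sketch of collecting a few of the attributes listed above into one input record. The function name, the field names, and the simple regex tokenizer are my own assumptions for illustration; the patent does not say how these values are actually extracted.

```python
import re
from urllib.parse import urlparse


def extract_features(url, title, body_text, first_seen_year, current_year):
    """Build a feature record for one resource (hypothetical field names)."""
    # Assumed tokenizer: lowercase runs of letters and digits.
    tokens = re.findall(r"[a-z0-9]+", body_text.lower())
    return {
        "tokens": tokens,
        "url": url,
        "title": title,
        "domain": urlparse(url).netloc.lower(),
        # Age in whole years, never negative.
        "age_years": max(0, current_year - first_seen_year),
    }


features = extract_features(
    "https://Example.com/cheap-pills",
    "Cheap Pills",
    "Buy CHEAP pills now!",
    first_seen_year=2012,
    current_year=2014,
)
```

Each value in a record like this would then be passed through its own embedding function before reaching the neural network layers.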

    For example, when the scores represent a likelihood that a resource is a search engine spam resource, the search system can use the score in a decision process so that a resource that is more likely to be spam is less likely to be indexed in the index database. As another example, when the scores represent likelihoods that a resource is one of several different types of search engine spam, the search system can determine that resources having a score that exceeds a threshold score for one of the types not be indexed in the index database.
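The threshold rule in that passage could look something like the sketch below. The spam type names and the threshold values are invented for illustration; the patent gives no numbers.

```python
def should_index(type_scores, thresholds, default_threshold=0.9):
    """Return False if any spam-type score exceeds its threshold.

    `type_scores` maps a spam type to its category score; `thresholds`
    maps a spam type to a cutoff. Both sets of names are hypothetical.
    """
    for spam_type, score in type_scores.items():
        if score > thresholds.get(spam_type, default_threshold):
            return False
    return True


# Assumed per-type cutoffs, chosen only for this example.
thresholds = {"content_spam": 0.8, "link_spam": 0.7, "cloaking_spam": 0.6}

clean_page = {"content_spam": 0.10, "link_spam": 0.05, "cloaking_spam": 0.02}
cloaked_page = {"content_spam": 0.20, "link_spam": 0.10, "cloaking_spam": 0.75}

index_clean = should_index(clean_page, thresholds)      # True
index_cloaked = should_index(cloaked_page, thresholds)  # False
```

The first half of the quoted passage describes a softer rule, where a higher spam score makes indexing less likely rather than impossible; a hard cutoff like this one corresponds to the second half.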

    In some other implementations, the search system can make use of the generated scores in generating search results for particular queries. For example, when the scores represent a likelihood that a resource is a search engine spam resource, the search system can use the score for a given resource to determine whether or not to remove a search result identifying the resource before providing the search results for presentation to the user or to demote the search result identifying the resource in an order of the search results. Similarly, when the scores represent a likelihood that a resource belongs to one of a pre-determined group of resource types, the search system can use the scores to promote or demote search results identifying the resource in an order of search results generated in response to particular search queries, e.g., search queries that have been determined to be seeking resources of a particular type.
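Removal and demotion at query time might be sketched as follows. The removal cutoff, the demotion weight, and the way the spam score is combined with a relevance score are all assumptions on my part; the patent only says that results may be removed or demoted.

```python
def apply_spam_scores(results, spam_scores, remove_above=0.95, demotion_weight=0.5):
    """Drop very likely spam and demote the rest in proportion to its score.

    `results` is a list of (doc_id, relevance) pairs; `spam_scores` maps
    doc_id to a spam likelihood in [0, 1]. Parameter values are hypothetical.
    """
    kept = []
    for doc_id, relevance in results:
        spam = spam_scores.get(doc_id, 0.0)
        if spam > remove_above:
            continue  # remove the result before it is shown to the user
        kept.append((doc_id, relevance * (1.0 - demotion_weight * spam)))
    # Re-order by the adjusted score, best first.
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return kept


results = [("a", 1.0), ("b", 0.9), ("c", 0.8)]
spam_scores = {"a": 0.99, "b": 0.8, "c": 0.0}
reranked = apply_spam_scores(results, spam_scores)
# "a" is removed, and "b" drops below "c" after demotion.
```

The same pattern would work in the other direction for resource types: a score for a type that matches what the query seems to be seeking could boost a result instead of demoting it.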

    While the patent doesn’t provide much in the way of details on training and classification of features under this machine learning model, it does refer to a paper that does, at:

    Large Scale Distributed Deep Networks, Jeffrey Dean, et al., Neural Information Processing Systems Conference, 2012.

    Google’s long-time head of Web Spam, Google Distinguished Engineer Matt Cutts, has been on his first extended leave after 15 years at Google. He is due to return in October. So that’s pretty interesting timing, with the patent released during his first real vacation in years. I wonder if it was turned on while he was gone?

    15 thoughts on “Google Turns to Deep Learning Classification to Fight Web Spam”

    1. Bill, when you think about what Google saves in cycles and bandwidth costs, fighting spam in this manner not only theoretically improves results, it also saves a lot of cycles and bandwidth, and that == $$$$$$.

    2. On another note….. what are the chances that the delayed Penguin…. is a result of this at play?

    3. Hi Terry,

      Agreed completely. I feel like I’m seeing some smart stuff coming out from Google lately. I was talking with my sister and mother this morning about how hard it was to find a puppy two years ago when they were looking. My sister told me that search was so difficult that she gave up on Google for a while. Same search today, enter the dog breed in the search box and one of the top two results is the American Kennel Club, which has a site link to a search box where you can find puppies of different types. It’s a lot easier.

      The only patent I could find that’s already published that seemed like a good fit for Penguin is an old one from 2003 or 2004. It’s on annoying and manipulative links, found through dense subgraphs of links of the type that usually indicate the existence of things like doorway pages. This one is better.

    4. Giant article you have written. Where did you get that category score and resource feature data store graph?

      Another thing is you mentioned about content spam. How do they measure content spam? Any idea?

    5. Hi Ekhlas,

      The diagram is from the patent office, and is one of the pictures that was filed with the patent.

      The patent tells us that part of this category scoring involves looking at a lot of features around a web site and looking at patterns involving those that might be a sign of spam. There’s not a lot of description in the patent about how this process works, but if you have the time, I’d recommend reading Ray Kurzweil’s book “How to Create a Mind.” There’s a section in there where he talks about the architecture of the neocortex and how it captures patterns of information, and builds upon them. I suspect that this deep learning approach does something very similar.

    6. Google fights web spam through its new search algorithm updates, like the Panda and Penguin updates. It is so difficult for a site owner to get a better ranking on Google’s search results page.

    7. Great article . It is great to see such complex issues put so simply. Spam is a big issue.

    8. Bill,

      Another great article, but I wonder if this will open MORE opportunities for those Negative SEO’s out there?

      I battle Neg SEO almost daily – it is very real in my business, and this concerns me that it gives more opportunity to spam innocent websites to achieve de-indexing.

      Someone could easily create a spam category “pattern” at any time and “blast my site out of the water” – no?

    9. Hey Bill,

      This is really good and I must say a great attempt which only Google can do to minimize the web spam.

      But it’s just an attempt right now, and I’m not sure of the results, since there are lots of spam bloggers these days who are continuously working to find new spammy methods that really do work.

      Only Google knows what’s upcoming and its surely going to be good for genuine bloggers. 🙂

    10. Hi Kailash,

      There are a lot of people out there who have automated spamming methods and attempt to manipulate Google’s search results. Hopefully Google can make it cost too much to keep on doing that. I think that’s one of their aims.

    11. Hi Bill,

      Thanks for sharing this helpful stuff. The diagram describes well.

      Definitely, web spam is not a good thing and Google is very strict about it. Google makes and changes its policies at a very high rate to decrease web spam.

    12. Hi Steven,

      Thank you.

      I’m just not sure that there is enough in the patent itself that would give us an idea of how it might go about trying to distinguish actual spam on a page or site from spam that might be the result of a negative SEO campaign.

      It’s possible that this system might create a “negative SEO” category, covering whoever it might be who is acting to make another site look like it is spamming.

      Again, the patent itself doesn’t address the issue of negative SEO enough to give us an idea of how it might address it.

    13. This is awesome. I just found your blog and am ecstatic to say the least. Having a technical analysis of the algorithms and a review of the patents being put out by Google is something I have yet to see many people write about. I’m curious to see what you think about Dwayne’s negative SEO comment above.

    14. If Google could actually make their webspam algos work, then they wouldn’t need to have such a massive manual action web spam team manually taking action against the networks it deems to have violated its T&Cs. The fact that Matt Cutts and his team do much of this manually indicates that the algos can’t detect half as much as they would have us believe.
      Neg SEO is a whole can of worms that hasn’t even begun to bite yet, but given time it will, I believe, be seen in more and more high-profile cases.
