Yesterday, Google’s Distinguished Engineer Matt Cutts published a post on the Google Webmaster Central Blog titled Another step to reward high-quality sites, which started out by praising SEOs who help improve the quality of the websites they work on. The post also noted:
In the next few days, we’re launching an important algorithm change targeted at webspam. The change will decrease rankings for sites that we believe are violating Google’s existing quality guidelines.
We’ve always targeted webspam in our rankings, and this algorithm represents another improvement in our efforts to reduce webspam and promote high quality content.
This isn’t something new, but it sounds like Google is turning up the heat on violations of their guidelines, and we’ve seen patents and papers in the past that describe some of the approaches they might take to accomplish this change.
A good starting point is the Google patent Methods and systems for identifying manipulated articles.
There are a couple of different elements to this patent.
One is that a search engine might identify a cluster of pages that are related to each other in some way, such as being on the same host, or being interlinked through doorway pages and the articles those pages target.
Once such a cluster is identified, documents within it might be examined for individual signals, such as whether the text appears to have been generated by a computer, whether meta tags are stuffed with repeated keywords, whether pages contain hidden text, or whether they contain a large number of unrelated links.
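To make one of those per-document signals concrete, here is a minimal sketch of a keyword-stuffing check for a meta keywords tag. The repeat-fraction heuristic and the 0.5 threshold are my own illustrative choices, not anything taken from the patent.

```python
import re
from collections import Counter

def keyword_stuffing_score(meta_keywords: str) -> float:
    """Fraction of meta-keyword tokens that are repeats of earlier tokens.
    A purely illustrative heuristic, not the patent's actual method."""
    tokens = re.findall(r"[a-z0-9]+", meta_keywords.lower())
    if not tokens:
        return 0.0
    repeats = sum(count - 1 for count in Counter(tokens).values())
    return repeats / len(tokens)

def looks_stuffed(meta_keywords: str, threshold: float = 0.5) -> bool:
    # The 0.5 cutoff is arbitrary, chosen only for this example.
    return keyword_stuffing_score(meta_keywords) > threshold
```

A tag like "buy cheap shoes cheap shoes cheap shoes" has more than half of its tokens as repeats and would be flagged, while a short natural tag scores zero; a real system would combine many such signals rather than rely on any one of them.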
I wrote about this patent in more detail when it was granted back in 2007, in Google Patent on Web Spam, Doorway Pages, and Manipulative Articles.
Google’s computing and indexing capacity has grown by leaps and bounds since the 2005 paper, Spam: It’s not Just for Inboxes Anymore (pdf), which describes many of the types of web spam mentioned in the Google Blog post, and approaches that a search engine might take to address them.
Infrastructure updates like Google’s Big Daddy and Caffeine, and Google’s very recent move to Software Defined Networks, provide the search engine with the capacity to handle more complex tasks than it could in the past.
Google has also been developing more sophisticated approaches, such as statistical language models to identify “unnatural word distribution” on web pages. Google has been working with n-grams for a few years to build upon the statistical language models it uses for a number of purposes, from speech recognition to machine translation to identifying synonyms in context.
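As a toy illustration of how a statistical language model can surface “unnatural word distribution,” the sketch below trains a bigram model on a tiny corpus and scores new text by its average per-word log probability; text whose word sequences look unlike natural language scores noticeably lower. The miniature corpus and add-one smoothing are assumptions for the example, far simpler than anything Google would use.

```python
import math
from collections import Counter

def train_bigrams(sentences):
    """Count unigrams and bigrams over a (toy) corpus of natural text."""
    unigrams, bigrams = Counter(), Counter()
    for sent in sentences:
        words = ["<s>"] + sent.lower().split()
        unigrams.update(words)
        bigrams.update(zip(words, words[1:]))
    return unigrams, bigrams

def avg_log_prob(text, unigrams, bigrams):
    """Average per-word log probability with add-one smoothing.
    Unusually low values hint at an unnatural word distribution."""
    vocab = len(unigrams)
    words = ["<s>"] + text.lower().split()
    total = 0.0
    for prev, cur in zip(words, words[1:]):
        total += math.log((bigrams[(prev, cur)] + 1) / (unigrams[prev] + vocab))
    return total / (len(words) - 1)

uni, bi = train_bigrams(["the cat sat on the mat", "the dog sat on the rug"])
natural = avg_log_prob("the cat sat on the rug", uni, bi)
stuffed = avg_log_prob("cheap cheap cheap cheap cheap", uni, bi)
```

Here the keyword-stuffed string scores well below the natural sentence, which is the intuition behind flagging machine-generated or stuffed text at scale.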
Google also has access to a much greater amount of data than it did in the past, and in the 2010 paper, The Unreasonable Effectiveness of Data (pdf), Alon Halevy, Peter Norvig, and Fernando Pereira from Google describe how having very large amounts of data can make even relatively unsophisticated algorithms work well.
See my post and the video within it at Big Data at Google for a deeper look at how having access to a great amount of data can make simple algorithms more effective.
Google also has more potential approaches in their hands to identify web spam than they did in the past. For example, in my recent series on the 10 most important SEO patents, one family of patents I wrote about was Phrase-Based Indexing, which provides ways to identify scraped content aggregated on pages, anchor text within links on pages that might be considered unusual, and possibly even a way to combat Google Bombing.
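One way to picture the scraped-content side of phrase-based indexing is to compare the n-word phrases two pages share. The shingle-overlap sketch below is a simplification of mine, not the actual mechanism described in the patents.

```python
def phrases(text: str, n: int = 3) -> set:
    """All n-word phrases ('shingles') appearing in a text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def shared_phrase_fraction(page_a: str, page_b: str, n: int = 3) -> float:
    """Fraction of page_a's phrases that also appear on page_b;
    values near 1.0 suggest page_a aggregated or scraped page_b."""
    a, b = phrases(page_a, n), phrases(page_b, n)
    return len(a & b) / len(a) if a else 0.0
```

A page built by stitching together paragraphs from other sites would share an unusually high fraction of its phrases with those sources, while independently written pages on the same topic would not.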
I’ve written about a couple of other Google patents, granted since the identifying manipulated articles patent came out, that describe some other ways Google might identify web spam.
One of them looks at how prevalent redirects of different types might be on a site when determining whether the search engine should apply a duplicate content filter to a page on that site. Google would prefer to filter out the page it considers more likely to be spam. The post I wrote about that patent is How Google Might Filter Out Duplicate Pages from Bounce Pad Sites.
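A crude sketch of that idea: measure how often crawled URLs from a site respond with redirects, and when two pages are duplicates, filter out the one from the more redirect-heavy site. The data shapes and the tie-breaking rule here are simplifications of mine, not claims from the patent.

```python
def redirect_prevalence(status_codes) -> float:
    """Fraction of crawled responses from a site that were 3xx redirects."""
    if not status_codes:
        return 0.0
    return sum(300 <= code < 400 for code in status_codes) / len(status_codes)

def duplicate_to_filter(page_a, page_b):
    """Each page is a (url, status_codes_seen_on_its_site) pair.
    Filter the duplicate hosted on the more redirect-heavy site,
    treating heavy redirection as a rough proxy for a bounce-pad site."""
    if redirect_prevalence(page_a[1]) >= redirect_prevalence(page_b[1]):
        return page_a
    return page_b
```

Given a duplicate on a site full of 301/302 responses and one on a site that serves mostly 200s, this heuristic would drop the former from the results.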
The other patent takes a different approach, using n-grams to classify a site by topic, based on the idea that a very high percentage of sites about topics such as computer games, movies, and music tend to be spam, and then looking at how often pages within those spammier categories are clicked on when they appear in search results. My post on that patent is How Google Might Fight Web Spam Based upon Classifications and Click Data.
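Sketching how those two signals might combine: a per-category spam prior blended with observed click-through rate, so pages in spam-heavy categories that rarely earn clicks score highest. Every category, prior, and weight below is invented purely for illustration.

```python
# Hypothetical per-category spam priors -- illustrative values, not Google's.
CATEGORY_SPAM_PRIOR = {"computer games": 0.6, "movies": 0.5, "gardening": 0.05}

def spam_score(category: str, impressions: int, clicks: int) -> float:
    """Blend a category spam prior with (1 - click-through rate).
    The 0.7/0.3 weights are arbitrary choices for this sketch."""
    prior = CATEGORY_SPAM_PRIOR.get(category, 0.1)
    ctr = clicks / impressions if impressions else 0.0
    return 0.7 * prior + 0.3 * (1.0 - ctr)
```

A rarely clicked page in a spam-heavy category like computer games ends up with a much higher score than a frequently clicked page in a low-spam category, which is the gist of combining classifications with click data.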
While Google has provided us with a number of examples within patents of how they might identify spam pages, chances are that Google has also come up with other methods and approaches that they consider trade secrets, so that people creating web spam have less chance of learning about the methods the search engine might use.
Interestingly though, researchers from the major search engines and academia have also been sharing notes since 2005 in a series of workshops referred to as AIRWeb, or Adversarial Information Retrieval on the Web. Even before the AIRWeb workshops, we’ve seen whitepapers like Microsoft’s Spam, Damn Spam, and Statistics (pdf) shared with the search community to help combat web spam.
Since Matt Cutts’s announcement was only made yesterday, we don’t yet know exactly how much of an impact this new update will have. The post tells us that it might affect around 3% of all queries in languages such as English, German, Chinese, and Arabic, and that in languages with more web spam, such as Polish, it could affect 5% of all queries.