Google Scoring Gibberish Content to Demote Pages in Rankings?

This week, Google was awarded a patent that describes how it might score content on how much gibberish it contains, a score that could then be used to demote pages in search results. Gibberish content here refers to content that might be representative of spam.

The patent defines gibberish content as content on web pages that might contain a number of high-value keywords but might have been generated through:

  • Using low-cost untrained labor (from places like Mechanical Turk)
  • Scraping content and modifying and splicing it randomly
  • Translating from a different language

Gibberish content also tends to include text sequences that are unlikely to represent natural language text strings, that is, strings structured in a conversational syntax of the kind that typically occurs in resources such as web documents.

The patent tells us that spammers might generate revenue from the traffic to gibberish web pages by including:

  • Advertisements
  • Pay-per-click links
  • Affiliate programs

It also tells us that since those pages were created “using high value keywords without context, the web page typically does not provide any useful information to a user.”

This process involves (a rough sketch of the last two steps follows the list):

  • Creating language models for pages on the Web and applying those models to the text of those pages
  • Generating a language model score for the resource by applying a language model to the text content of the resource
  • Generating a query stuffing score for the resource, the query stuffing score being a function of term frequency in the resource content and a query index
  • Calculating a gibberish score for the resource using the language model score and the query stuffing score
  • Using the calculated gibberish score to determine whether to modify a ranking score of the resource
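
The patent doesn’t disclose how the two signals are weighted or what threshold triggers a demotion, but a minimal sketch of those last two steps, with invented weights, threshold, and function names, might look something like this in Python:

    def gibberish_score(lm_score, query_stuffing_score, lm_weight=0.7, qs_weight=0.3):
        # Hypothetical combination: both inputs are assumed to be normalized so
        # that higher values mean "more gibberish-like". The weights are invented.
        return lm_weight * lm_score + qs_weight * query_stuffing_score

    def adjusted_ranking_score(ranking_score, gib_score, threshold=0.8):
        # Demote the resource's ranking score only if the gibberish score is high
        # enough; the threshold and the demotion factor are illustrative guesses.
        if gib_score > threshold:
            return ranking_score * (1.0 - gib_score)
        return ranking_score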

These gibberish scores might be created for each page based upon multiple queries that are contained on those pages.
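
As a rough illustration of that per-query idea, a query stuffing score could be built from how often the terms of known queries repeat within a page’s text. The query index, the counting method, and the aggregation below are assumptions made for illustration rather than details taken from the patent:

    from collections import Counter

    # Hypothetical query index of high-value queries the search engine knows about.
    QUERY_INDEX = ["cheap car insurance", "best credit card offers"]

    def query_stuffing_score(page_text):
        words = page_text.lower().split()
        term_counts = Counter(words)
        total_words = max(len(words), 1)
        per_query_scores = []
        for query in QUERY_INDEX:
            # Share of the page made up of this query's terms.
            frequency = sum(term_counts[term] for term in query.lower().split()) / total_words
            per_query_scores.append(frequency)
        # Aggregate across the multiple queries found on the page.
        return max(per_query_scores) if per_query_scores else 0.0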

The pages may be ranked initially by information retrieval relevance scores and importance scores such as PageRank.

Pages may then be re-ranked or demoted based upon a statistical review in which the content of those pages is broken down into n-grams, such as five-word-long n-grams, splitting the content of a page into consecutive groupings of the words found on that page. Statistics about those groupings can then be computed and compared to n-gram groupings from other pages on the Web. Here is an example n-gram analysis of a well-known phrase using five-word n-grams (a short code sketch follows the example):

The quick brown fox jumps
quick brown fox jumps over
brown fox jumps over the
fox jumps over the lazy
jumps over the lazy dog
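
Here is a small sketch of that breakdown in Python; the function name and the simple whitespace tokenization are my own illustration, not anything specified in the patent:

    def word_ngrams(text, n=5):
        # Split a text into consecutive n-word groupings, as in the example above.
        words = text.split()
        return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

    for gram in word_ngrams("The quick brown fox jumps over the lazy dog"):
        print(gram)  # prints the five 5-grams listed above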

The statistical patterns found in a language model can be used to identify languages, to apply machine translation, and to do optical character recognition.
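
The patent doesn’t publish a scoring formula, but one common way to turn such n-gram statistics into a page-level language model score is an average log-probability of the page’s n-grams under a reference model; the toy probability table and smoothing value below are invented purely for illustration:

    import math

    # Toy reference model: probabilities of word trigrams estimated elsewhere
    # from a large corpus of natural-language documents (values invented here).
    REFERENCE_TRIGRAM_PROBS = {
        ("the", "quick", "brown"): 1e-6,
        ("quick", "brown", "fox"): 5e-7,
    }
    UNSEEN_PROB = 1e-12  # crude smoothing for trigrams the model has never seen

    def language_model_score(text):
        # Average log-probability per trigram; gibberish pages full of spliced or
        # stuffed text should score much lower (more negative) than natural prose.
        words = text.lower().split()
        trigrams = [tuple(words[i:i + 3]) for i in range(len(words) - 2)]
        if not trigrams:
            return 0.0
        log_probs = [math.log(REFERENCE_TRIGRAM_PROBS.get(t, UNSEEN_PROB)) for t in trigrams]
        return sum(log_probs) / len(log_probs)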

The patent is:

Identifying gibberish content in resources
Invented by Shashidhar A. Thakur, Sushrut Karanjkar, Pavel Levin, and Thorsten Brants
Assigned to Google
US Patent 8,554,769
Granted October 8, 2013
Filed June 17, 2009

Abstract

This specification describes technologies relating to providing search results.

One aspect of the subject matter described in this specification can be embodied in methods that include the actions of receiving a network resource, the network resource including text content; generating a language model score for the resource including applying a language model to the text content of the resource; generating a query stuffing score for the reference, the query stuffing score being a function of term frequency in the resource content and a query index; calculating a gibberish score for the resource using the language model score and the query stuffing score; and using the calculated gibberish score to determine whether to modify a ranking score of the resource.

It’s not a surprise that Google might use natural language statistical models like the one described here to identify content they might consider low quality. Having a technical name (gibberish content) to refer to that kind of content is helpful, as is having a patent to point others to when describing the dangers of creating low-quality content through one approach or another.

28 thoughts on “Google Scoring Gibberish Content to Demote Pages in Rankings?”

  1. This is really interesting. To be honest, I thought Google must use something like this in their ranking algorithm already, but presumably they don’t, if they’ve only just been awarded this patent.

    Anyway, this is good news for legitimate website owners, but do you think it might inadvertently penalise some legit web writers for whom English is not their first language?

  2. A great spot on the patent. I’d expect 99.9% of website owners don’t try and spam Google. In today’s SEO world the changes mean it’s impossible to “get away” with spammy gibberish content. It’s good to know that penalties will be applied where applicable.

  3. What exactly do they class as “gibberish” content? It would be helpful if they weren’t so gosh darned vague all the time. I presume this has been announced in line with the Hummingbird search changes – the searches should be more accurate now, so it figures they’d want to demote nonsense articles. That’s fine by me so long as people stop doing spun articles etc. If you work on good content you want it to rise above the dross.

  4. Wow, this is eye opening indeed. I thought this would have been the easiest spam to combat. This should take down many link networks for sure! Or at least the ones that use low grade writers or spin software.

  5. If this is a new patent award, does that mean Google might already be using it? Or they will be using it in the future? Either way, it will reward the sites that truly want to serve their audience by creating engaging, high quality information.

  6. Hi Bill,

    Kudos to google for raising the quality of blogs and web sites for that matter.

    Although as noted above, I would have thought the Big G was using such protocols for years.

    They probably were ;)

    Thanks for sharing!

  7. Being able to better identify gibberish sounds like a step in the right direction, so hopefully this patent will help to cut down on spammers. It seems like most websites now are focusing on better content that’s more natural due to the Google updates, but this patent has potential to crack down on those remaining sites that use short cuts!

  8. Hey Bill,
    First of all, thanks for an amazing article blended with research. I have a query. Is this going to affect the existing websites and blogs that have managed to climb to the top despite the poor quality of their content and their violation of the natural language statistical model?

    Thanks
    Susan

  9. I hope we see these changes sooner rather than later. Still seeing a lot of gibberish, spam and link farms ranking high. Frustrating as a consumer, frustrating as someone who has relevant content and deserves to be there more :)

  10. It’s possible that G has been filtering this for some time now, but the patent is a good sign – unless you’re a spammer, of course. I can see where link farms would obviously be affected, but that’s been true for some time now. Thanks for the research …

  11. Pingback: Hoffnung für Text-Profis: Google versteht jetzt auch Journalismus (in English: Hope for Text Pros: Google Now Understands Journalism Too)

  12. One thing which has come out of the concerted efforts of Google is that search queries now give quality results instead of the spammy, low-quality content they used to return some years back. Great article, thanks.

  13. It’s about time they found a way to put an end to spun content all over the web. I have high hopes for this patent; hopefully it doesn’t have too many flaws and end up punishing good original content as well.

  14. I have seen this many times for JustAnswer.com aka Pearl.com aka Answerbag.com aka Eanswer.com, etc.
    They will be found in the most unexpected places. For instance, an Australian car website allegedly for enthusiasts is chock full of fake posts mentioning JustAnswer.com, even advertisements in the middle of a fellow’s post. I never heard anyone mention it before, but it seems crooked.
