This week, Google was awarded a patent that describes how it might score content on how much gibberish it contains, a score that could then be used to demote pages in search results.
Gibberish content, in this context, refers to content that is likely representative of spam.
The patent defines gibberish content as content on pages that contain a number of high-value keywords but that might have been generated by:
- Using low-cost untrained labor (from places like Mechanical Turk)
- Scraping content and modifying and splicing it randomly
- Translating from a different language
Gibberish content also tends to include text sequences that are unlikely to represent natural language: strings that do not follow conversational syntax, or that are not structured the way text typically appears in resources such as web documents.
The patent tells us that spammers might generate revenue from the traffic to gibberish content web pages by including:
- Advertisements
- Pay-per-click links
- Affiliate programs
It also tells us that since those pages were created “using high-value keywords without context, the web page typically does not provide any useful information to a user.”
This gibberish content identification process involves:
- Creating language models from pages on the Web
- Generating a language model score for a resource by applying a language model to its text content
- Generating a query stuffing score for the resource, a function of term frequency in the resource's content and a query index
- Calculating a gibberish score for the resource from the language model score and the query stuffing score
- Using the calculated gibberish score to determine whether to modify the resource's ranking score
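The patent doesn't spell out formulas for these scores, but the steps above can be sketched in code. In the sketch below, everything is an assumption for illustration: the language model is stood in for by a simple check of how many word bigrams on a page also appear in a reference set of natural-language bigrams, the query stuffing score is the share of a page's words that are high-value query terms, and the weighting that blends them is invented.

```python
from collections import Counter


def language_model_score(text, common_bigrams):
    """Fraction of the text's word bigrams that also occur in a reference
    set of natural-language bigrams (a hypothetical stand-in for the
    patent's language model). Higher means more natural-looking text."""
    words = text.lower().split()
    bigrams = list(zip(words, words[1:]))
    if not bigrams:
        return 0.0
    hits = sum(1 for b in bigrams if b in common_bigrams)
    return hits / len(bigrams)


def query_stuffing_score(text, query_terms):
    """Share of the page's words that are high-value query terms."""
    words = text.lower().split()
    if not words:
        return 0.0
    counts = Counter(words)
    return sum(counts[t] for t in query_terms) / len(words)


def gibberish_score(text, common_bigrams, query_terms, alpha=0.5):
    """Blend the two signals; higher means more gibberish-like.
    The 50/50 weighting is an assumption, not from the patent."""
    lm = language_model_score(text, common_bigrams)
    qs = query_stuffing_score(text, query_terms)
    return alpha * (1.0 - lm) + (1.0 - alpha) * qs
```

A page of natural prose with few stuffed keywords would score low here, while a page of disconnected high-value terms would score high, which is the kind of separation the re-ranking step would need.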
These gibberish content scores might be created for each page based upon multiple queries that are contained on those pages.
The pages may be ranked initially by information retrieval relevance scores and importance scores such as PageRank.
Pages may then be re-ranked or demoted based upon a statistical review in which the content of a page is broken down into n-grams, such as 5-word n-grams: consecutive groupings of the words found on a page. Statistics about those groupings can then be compared to n-gram statistics from other pages on the Web. Here is an example n-gram analysis of a well-known phrase using 5-word groupings:
The quick brown fox jumps
quick brown fox jumps over
brown fox jumps over the
fox jumps over the lazy
jumps over the lazy dog
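Generating those sliding 5-word windows is straightforward; here is a minimal sketch (the function name is my own, not from the patent):

```python
def ngrams(text, n=5):
    """Split text into consecutive n-word sequences (sliding window)."""
    words = text.split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


# Reproduces the five 5-grams listed above:
for gram in ngrams("The quick brown fox jumps over the lazy dog"):
    print(gram)
```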
The statistical patterns found in a language model can be used to identify languages, to apply machine translation, and to do optical character recognition.
The gibberish content patent is:
Identifying gibberish content in resources
Invented by Shashidhar A. Thakur, Sushrut Karanjkar, Pavel Levin, and Thorsten Brants
Assigned to Google
US Patent 8,554,769
Granted October 8, 2013
Filed: June 17, 2009
Abstract
This specification describes technologies relating to providing search results.
One aspect of the subject matter described in this specification can be embodied in methods that include the actions of receiving a network resource, the network resource including text content; generating a language model score for the resource including applying a language model to the text content of the resource; generating a query stuffing score for the reference, the query stuffing score being a function of term frequency in the resource content and a query index; calculating a gibberish score for the resource using the language model score and the query stuffing score; and using the calculated gibberish score to determine whether to modify a ranking score of the resource.
It’s not a surprise that Google might use statistical natural language models like the one described here to identify content that it considers low quality. Having a technical name (“gibberish content”) for that kind of content is helpful, as is a patent to point others to when describing the dangers of creating low-quality content through one approach or another.
Last updated June 8, 2019.
This is really interesting. To be honest, I thought Google must use something like this in their ranking algorithm already, but presumably they don’t, if they’ve only just been awarded this patent.
Anyway, this is good news for legitimate website owners, but do you think it might inadvertently penalise some legit web writers for whom English is not their first language?
A great spot on the patent. I’d expect 99.9% of website owners don’t try to spam Google. In today’s SEO world, the changes mean it’s impossible to “get away” with spammy gibberish content. It’s good to know that penalties will be applied where applicable.
A win for us all! 🙂
What exactly do they class as “gibberish” content? It would be helpful if they weren’t so gosh-darned vague all the time. I presume this has been announced in line with the Hummingbird search changes – the searches should be more accurate now, so it figures they’d want to demote nonsense articles. That’s fine by me so long as people stop doing spun articles etc. If you work on good content, you want it to rise above the dross.
Wow, this is eye opening indeed. I thought this would have been the easiest spam to combat. This should take down many link networks for sure! Or at least the ones that use low grade writers or spin software.
If this is a new patent award, does that mean Google might already be using it? Or they will be using it in the future? Either way, it will reward the sites that truly want to serve their audience by creating engaging, high quality information.
Hi Bill,
Kudos to Google for raising the quality of blogs and web sites, for that matter.
Although as noted above, I would have thought the Big G was using such protocols for years.
They probably were 😉
Thanks for sharing!
Bill – I couldn’t resist, here’s Eric Idle’s old Rutland Weekend Television sketch called “Gibberish”, which seems very apropos here. Enjoy!
– Ted
http://www.youtube.com/watch?v=hU0QZQRTNr0
Being able to better identify gibberish sounds like a step in the right direction, so hopefully this patent will help to cut down on spammers. It seems like most websites now are focusing on better content that’s more natural due to the Google updates, but this patent has potential to crack down on those remaining sites that use short cuts!
Good find, Bill. This isn’t a new thing despite the patented date, but it’s interesting to see it surface.
Hey Bill,
First of all, thanks for an amazing article blended with research. I have a query: is this going to affect the existing websites and blogs that have managed to climb to the top despite poor-quality content that violates the natural language statistical model?
Thanks
Susan
I hope we see these changes sooner rather than later. Still seeing a lot of gibberish, spam and link farms ranking high. Frustrating as a consumer, frustrating as someone who has relevant content and deserves to be there more 🙂
It’s possible that G has been filtering this for some time now, but the patent is a good sign – unless you’re a spammer, of course. I can see where link farms would obviously be affected, but that’s been true for some time now. Thanks for the research …
It’s really good to see Google’s efforts to reward real, high content sites. I just hope this one doesn’t penalize sites written by people whose first language isn’t English, as sometimes those appear quite unnaturally written yet can still contain great articles and information.
One thing which has come out of the concerted efforts of Google is that search queries now give quality results instead of the spammy, low-quality content they used to return some years back. Great article, thanks.
Great article, Bill. Patents take a long time, so it seems Google has had this on their “get done asap list” for a while, which is promising to us non-spammy SEO/internet marketing professionals out there. Loving your blog as well and all the neat content and articles. I’ve got you bookmarked to come back for more. Thanks for sharing your info about Google’s most recent patent. Let’s see what else they file for and get approved for in the months/years to come! – Patrick with Whiteboard Creations
It’s about time they found a way to put an end to spun content all over the web, I have high hope for this patent, just hopefully they make sure it doesn’t have too many flaws and ends up punishing good original content as well.
I have seen this many times for JustAnswer.com aka Pearl.com aka Answerbag.com aka Eanswer.com etc etc.
They will be found in the most unexpected places; for instance, an Australian car website allegedly for enthusiasts is chock full of fake posts mentioning JustAnswer.com, even advertisements in the middle of a fellow’s post. I never heard anyone mention it before, but it seems crooked.