Google’s Pierre Far announced on his Google+ page that Google was releasing a new Panda update that supposedly included some new signals that could potentially help “identify low-quality content more precisely.”
The Google+ post also tells us that this change can help lead to a “greater diversity of high-quality small- and medium-sized sites ranking higher, which is nice.”
A new patent application shows off a quality scoring approach for content, based upon phrases. More on that patent filing below, but it might have something to do with this update.
So it sounds like this release of the Panda update could potentially be good news for some sites that were impacted by Panda in the past.
I looked through a few forum threads linked to by Barry Schwartz’s post on Search Engine Roundtable, “Google Panda 4.1 Now Rolling Out; Aims To Help Smaller Web Sites.”
In one thread, a poster stated he noticed a change in traffic levels to his site starting on September 19. Another thread had someone suggesting that the change was one targeting spun and poor content.
I noticed that Navneet Panda, whom Google’s Panda Update was named after, has another patent out recently. When the first patent with his name on it came out, I asked if it was The Panda Patent. With the many updates to Panda (and “data refreshes”), at least one of the changes to the algorithm may have been described in that patent. And this latest filing, on quality scoring of content, may describe the kind of update we are seeing now.
The patent application is at:
Predicting Site Quality
Invented by Yun Zhou and Navneet Panda
US Patent Application 20140280011
Published September 18, 2014
Assigned to Google
Filed: March 15, 2013
Abstract
Methods, systems, and apparatus, including computer programs encoded on computer storage media, for predicting a measure of quality for a site, e.g., a web site.
In some implementations, the methods include:
- Obtaining baseline site quality scores for multiple previously scored sites;
- Generating a phrase model for multiple sites including the previously scored sites, wherein the phrase model defines a mapping from phrase-specific relative frequency measures to phrase-specific baseline site quality scores;
- For a new site that is not one of the previously scored sites, obtaining a relative frequency measure for each of a plurality of phrases in the new site;
- Determining an aggregate site quality score for the new site from the phrase model using the relative frequency measures of phrases in the new site; and
- Determining a predicted site quality score for the new site from the aggregate site quality score.
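Reading those steps literally, a toy sketch might look like the following. This is not Google’s implementation; the function names, the frequency-weighted averaging, and the fallback score for unseen phrases are all assumptions made purely for illustration.

```python
from collections import Counter

def build_phrase_model(scored_sites):
    """Map each phrase to an average of the baseline quality scores of the
    sites it appears on, weighted by its relative frequency on each site.
    scored_sites: list of (tokens, baseline_quality_score) pairs."""
    totals, weights = Counter(), Counter()
    for tokens, score in scored_sites:
        counts = Counter(tokens)
        total = sum(counts.values())
        for phrase, count in counts.items():
            rel_freq = count / total           # phrase-specific relative frequency
            totals[phrase] += rel_freq * score
            weights[phrase] += rel_freq
    return {phrase: totals[phrase] / weights[phrase] for phrase in totals}

def predict_site_quality(model, tokens, default=0.5):
    """Aggregate the per-phrase scores for a new, unscored site into a single
    predicted quality score (unseen phrases fall back to `default`)."""
    counts = Counter(tokens)
    total = sum(counts.values())
    numerator = denominator = 0.0
    for phrase, count in counts.items():
        rel_freq = count / total
        numerator += rel_freq * model.get(phrase, default)
        denominator += rel_freq
    return numerator / denominator if denominator else default
```

A model built this way would push a new site’s predicted score toward the baseline scores of whatever sites its most frequent phrases tend to appear on.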
The patent describes the use of a phrase algorithm, where content from pages is broken down into tokens (individual words plus some things like punctuation), and the frequency of phrases is counted on those pages to be calculated into a score for each page.
The patent doesn’t explain in much detail what a “phrase” is, the way Google’s “phrase-based indexing” patents do. We have no idea whether Google ever used those patents, but it is possible.
Errors that appear in tokens on pages might be counted rather than ignored in a normalization process. Some very rare tokens (words that barely appear on the web at all) might be ignored in this quality score calculation.
Anchor text pointing to a page might be treated as a phrase that actually appears on the page being pointed to. This was an interesting statement in the patent, and its significance wasn’t explained. What it might end up doing is adding a lot of phrases of a specific type to a page, if there are a lot of links pointing to that page using the same anchor text.
These tokens might be broken down into groups of 1, 2, 3, 4, or 5 tokens (words and punctuation), or n-grams (where “n” can be a specific number). Google has used n-grams in other ways as well, such as the n-gram viewer.
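A minimal sketch of that tokenization and n-gram grouping, using a simple regular expression of my own choosing (the patent doesn’t specify how tokens are split), might look like this:

```python
import re

def tokenize(text):
    # Split text into word tokens plus individual punctuation tokens
    return re.findall(r"\w+|[^\w\s]", text.lower())

def ngrams(tokens, n):
    # All contiguous groups of n tokens
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = tokenize("Quality content, written naturally.")
# tokens: ['quality', 'content', ',', 'written', 'naturally', '.']
bigrams = ngrams(tokens, 2)
```

Counting how often each such n-gram occurs on a page, relative to the page’s total, would give the kind of “relative frequency measure” the patent describes.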
A Google Research Blog post, All Our N-gram are Belong to You, tells us of a number of experiments at Google that used n-grams, involving work such as:
- Statistical machine translation
- Speech recognition
- Spelling correction
- Entity detection
- Information extraction
- Others
I’ve linked to the patent filing above if you want to go through it and discuss different aspects of it. While it may describe a content scoring algorithm separate from any update to Panda, the timing is interesting, and it’s worth thinking about.
Thanks for sharing insights from those tests, Michael. The frequency counts could potentially show a very high level of repetition of phrases, especially if anchor text is treated as described in the patent.
Interesting find, Bill. This patent describes a process that would explain the results of a reverse-engineering project I participated in back in 2011-2012, where we tested many different factors on Panda-downgraded sites. For our test site, it came down to high repetition of phrases on the page.
That finding agreed with what we were hearing from Googlers who talked with a few friends: that Panda was looking at repetition of text. I don’t think we have enough evidence to say conclusively that this is the Panda algorithm, but I’ll put my money on this patent until something better comes along.
Bill, thanks for sharing this! And it totally supports my analysis of Panda 4.1 (which I published yesterday).
I saw a major uptick in keyword stuffing on sites hit by 4.1. You should check out my post, especially in light of the patent.
Thanks for the interesting article, Bill. One of my sites has been penalized by Panda.
I was able to recover by deoptimizing it: I simply reduced the keyword density and deoptimized the H1 tag.
This was enough.
In the end, it seems we just need to write naturally, without trying to game Google.
The part I find most interesting is the first one: “Generating a phrase model.”
I would suspect that, in addition to repetitive text, they are also looking to improve the ability to spot content farm copy/paste jobs. A bulk of the low quality articles are copies of an original template where the words are changed around, but overall the ideas and sentence structure are the same. It should be easier to identify these cloned articles once you build a database of phrases from existing articles to compare against.
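A sketch of what that comparison could look like (my own illustration, not anything described in the patent): break each article into overlapping phrase “shingles” and measure how much a new article overlaps with ones already in the database.

```python
def shingles(tokens, n=3):
    # All contiguous n-token phrases in an article, as a set
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def phrase_overlap(tokens_a, tokens_b, n=3):
    # Jaccard similarity of the two articles' phrase sets:
    # near 1.0 for copy/paste jobs with a few words swapped, near 0.0 otherwise
    set_a, set_b = shingles(tokens_a, n), shingles(tokens_b, n)
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0
```

Cloned articles that keep the original sentence structure would share most of their shingles with the template, even after individual words are swapped around.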
I wonder how they use this on eCommerce sites? They have gridviews filled with product names that would have the same keyword over and over. Have you seen anything in the tests that shows they are handled differently?