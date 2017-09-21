Navneet Panda, whom the Google Panda update is named after, has co-invented a new patent that focuses on site quality scores. It’s worth studying to understand how it determines the quality of sites.
Back in 2013, I wrote the post Google Scoring Gibberish Content to Demote Pages in Rankings, about Google using ngrams from sites and building language models from them to determine if those sites were filled with gibberish, or spammy content. I was reminded of that post when I read this patent.
Rather than explaining what ngrams are in this post (which I did in the gibberish post), I’m going to point to an example of ngrams at the Google n-gram viewer, which shows Google indexing phrases in scanned books. This article published by the Wired site also focused upon ngrams: The Pitfalls of Using Google Ngram to Study Language.
An ngram phrase could be a 2-gram, a 3-gram, a 4-gram, or a 5-gram, where pages are broken down into two-word, three-word, four-word, or five-word phrases. If a body of pages is broken down into ngrams, those ngrams could be used to create language models or phrase models to compare against other pages.
Language models, like the ones that Google used to create gibberish scores for sites, could also be used to determine the quality of sites, if example sites were used to generate those language models. That seems to be the idea behind the new patent granted this week. The summary section of the patent tells us about this use of the process it describes and protects:
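To make the idea of breaking a page into ngrams concrete, here is a minimal sketch in Python. It assumes simple lowercasing and whitespace splitting, which is my simplification and not something the patent specifies; the function names `word_ngrams` and `all_ngrams` are made up for illustration.

```python
def word_ngrams(text, n):
    """Return the n-word phrases (n-grams) found in text, in reading order."""
    # Assumption: lowercase and split on whitespace; real tokenization
    # would likely handle punctuation and markup as well.
    words = text.lower().split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


def all_ngrams(text, sizes=(2, 3, 4, 5)):
    """Collect the 2-, 3-, 4-, and 5-grams of a page, keyed by phrase length."""
    return {n: word_ngrams(text, n) for n in sizes}


page = "google may predict site quality from phrases"
print(word_ngrams(page, 2))
print(all_ngrams(page)[5])
```

A page shorter than the phrase length simply yields no ngrams of that size.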
In general, one innovative aspect of the subject matter described in this specification can be embodied in methods that include the actions of obtaining baseline site quality scores for a plurality of previously-scored sites; generating a phrase model for a plurality of sites including the plurality of previously-scored sites, wherein the phrase model defines a mapping from phrase-specific relative frequency measures to phrase-specific baseline site quality scores; for a new site, the new site not being one of the plurality of previously-scored sites, obtaining a relative frequency measure for each of a plurality of phrases in the new site; determining an aggregate site quality score for the new site from the phrase model using the relative frequency measures of the plurality of phrases in the new site; and determining a predicted site quality score for the new site from the aggregate site quality score.
The newly granted patent from Google is:
Predicting site quality
Inventors: Navneet Panda and Yun Zhou
Assignee: Google
US Patent: 9,767,157
Granted: September 19, 2017
Filed: March 15, 2013
Abstract
Methods, systems, and apparatus, including computer programs encoded on computer storage media, for predicting a measure of quality for a site, e.g., a web site. In some implementations, the methods include obtaining baseline site quality scores for multiple previously scored sites; generating a phrase model for multiple sites including the previously scored sites, wherein the phrase model defines a mapping from phrase specific relative frequency measures to phrase specific baseline site quality scores; for a new site that is not one of the previously scored sites, obtaining a relative frequency measure for each of a plurality of phrases in the new site; determining an aggregate site quality score for the new site from the phrase model using the relative frequency measures of phrases in the new site; and determining a predicted site quality score for the new site from the aggregate site quality score.
In addition to generating ngrams from text on sites, some implementations of this patent will also generate ngrams from the anchor text of links pointing to pages of the sites. Building a phrase model involves calculating the frequency of n-grams on a site “based on the count of pages divided by the number of pages on the site.”
The patent tells us that site quality scores can impact the rankings of pages from those sites:
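The relative frequency calculation can be sketched in a few lines of Python. I am assuming here that “the count of pages” means the number of pages on the site that contain the phrase; the helper names and the sample pages are invented for illustration, and the whitespace tokenization is my simplification.

```python
def page_ngrams(text, n):
    """Set of n-word phrases on one page (lowercased, whitespace-split)."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def relative_frequency(pages, phrase):
    """Pages containing the phrase divided by the number of pages on the site.

    Assumption: "count of pages" is read as "pages that contain the phrase".
    """
    if not pages:
        return 0.0
    n = len(phrase.split())
    hits = sum(1 for text in pages if phrase.lower() in page_ngrams(text, n))
    return hits / len(pages)


# Hypothetical four-page site.
site_pages = [
    "cheap pills buy now",
    "buy now and save",
    "our guide to site quality",
    "contact us",
]
print(relative_frequency(site_pages, "buy now"))  # 2 of 4 pages -> 0.5
```

Matching against whole-word ngrams, and not raw substrings, keeps a phrase like "site quality" from matching inside "website quality".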
Obtain baseline site quality scores for a number of previously-scored sites. The baseline site quality scores are scores used by the system, e.g., by a ranking engine of the system, as signals, among other signals, to rank search results. In some implementations, the baseline scores are determined by a backend process that may be expensive in terms of time or computing resources, or by a process that may not be applicable to all sites. For these or other reasons, baseline site quality scores are not available for all sites.
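Here is a rough Python sketch of how the steps in the abstract might fit together: build a phrase model from sites that already have baseline scores, then use it to score a new site. The bucketing of frequencies into four ranges, the use of a plain average, and every name and number below are my own assumptions for illustration. The patent does not spell out these choices in the passages quoted here, and the final step from aggregate score to predicted score is not modeled.

```python
from collections import defaultdict
from statistics import mean

NUM_BUCKETS = 4  # hypothetical: frequency ranges [0, .25), [.25, .5), ...


def bucket(freq):
    """Map a relative frequency in [0, 1] to a bucket index."""
    return min(int(freq * NUM_BUCKETS), NUM_BUCKETS - 1)


def build_phrase_model(scored_sites):
    """Build {phrase: {bucket: average baseline score}}.

    scored_sites is a list of (phrase_freqs, baseline_score) pairs, where
    phrase_freqs maps each phrase to its relative frequency on that site.
    """
    samples = defaultdict(lambda: defaultdict(list))
    for phrase_freqs, baseline in scored_sites:
        for phrase, freq in phrase_freqs.items():
            samples[phrase][bucket(freq)].append(baseline)
    return {phrase: {b: mean(scores) for b, scores in buckets.items()}
            for phrase, buckets in samples.items()}


def aggregate_score(model, phrase_freqs):
    """Average the phrase-specific scores the model holds for a new site.

    Returns None when the model covers none of the site's phrases.
    """
    scores = [model[p][bucket(f)] for p, f in phrase_freqs.items()
              if p in model and bucket(f) in model[p]]
    return mean(scores) if scores else None


# Made-up previously-scored sites: (phrase frequencies, baseline score).
scored = [
    ({"buy now": 0.9, "site quality": 0.0}, 0.2),
    ({"buy now": 0.8, "site quality": 0.1}, 0.3),
    ({"buy now": 0.1, "site quality": 0.6}, 0.9),
]
model = build_phrase_model(scored)
new_site = {"buy now": 0.85, "site quality": 0.05}
print(aggregate_score(model, new_site))
```

In this toy data, the new site uses "buy now" about as often as the two low-scoring sites do, so its aggregate score lands near theirs.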
Comments
Robin Khokhar says
Hi Bill,
Hearing about the Ngram phrase for the first time. Thanks for sharing this useful information.
Robin
Ankit says
Same here, read about Ngram for the first time. Thanks for sharing 🙂
Tony says
Hi Bill, what are the differences between this new patent and the older version which you wrote about, found here: http://www.seobythesea.com/2014/09/new-panda-update-new-panda-patent/
Bill Slawski says
Hi Tony,
The earlier post I wrote was about the same patent, but as a patent application. That application has now been officially granted by the USPTO. I thought it was worth pointing out how the gibberish patent and this one both use ngrams to identify site quality, or lack of quality.
Clint Butler says
Thanks for the summary, Bill. It’s interesting how this will potentially carry over to anchor text as well, which I can see being used to determine link quality over time. The only issue I see is that not all sites will get a grade due to computing power, which leaves the question to be asked, “Which sites will Google deem important enough to grade?”
Bill Slawski says
Hi Clint,
It was interesting seeing that anchor text is included in this site quality scoring. You raise an important issue with looking at patents to try to get an idea of what is happening at Google: we may be left with questions that we don’t have answers to, and likely can’t ask anyone in particular about. Since patents aren’t written by the marketing or PR people at Google, they may leave gaps of information that we might want to find out more about. I respect the people at Google who work as webmaster evangelists, but sometimes they don’t know the answers to some questions (for instance, I asked how the “result scores” from the Google KG API are calculated, and they had no idea).
Tobias Würflein says
Dear Bill,
Thank you very much for the update on ngram models!
Nothing really new, but now it’s absolutely clear.
This is what I call truly helpful information in terms of SEO.
Best wishes
Tobias
Bill Slawski says
Hi Tobias,
Yes, it’s good seeing how some patents might be connected, and those connections can show us interesting things about how they are being used.
Deanna Friel says
fares says
Hi Bill,
Just out of curiosity, why are you interested in this patent when other patents like this one exist:
https://www.google.com/patents/US9760641?dq=inassignee:”Google+Inc.”+predicting+site+quality&hl=fr&sa=X&ved=0ahUKEwj71qKxhsbWAhVMvRoKHa59CGIQ6AEIXDAG
Bill Slawski says
Hi fares,
I’ve written about that one already: How Google May Calculate Site Quality Scores (from Navneet Panda). They are both co-invented by Navneet Panda, and they have differences, so it’s worth comparing them to see how they are similar to and different from each other.