Using Ngram Phrase Models to Generate Site Quality Scores


Scrabble-phrases
Source: https://commons.wikimedia.org/wiki/File:Scrabble_game_in_progress.jpg
Photographer: McGeddon
Creative Commons License: Attribution 2.0 Generic

Navneet Panda, whom the Google Panda update is named after, is a co-inventor on a new patent that focuses on site quality scores. It’s worth studying to understand how Google might determine the quality of sites.

Back in 2013, I wrote the post Google Scoring Gibberish Content to Demote Pages in Rankings, about Google using ngrams from sites and building language models from them to determine if those sites were filled with gibberish, or spammy content. I was reminded of that post when I read this patent.

Rather than explaining what ngrams are in this post (which I did in the gibberish post), I’m going to point to an example of ngrams at the Google n-gram viewer, which shows Google indexing phrases in scanned books. This article published by the Wired site also focused upon ngrams: The Pitfalls of Using Google Ngram to Study Language.

An ngram phrase could be a 2-gram, a 3-gram, a 4-gram, or a 5-gram, where pages are broken down into two-word, three-word, four-word, or five-word phrases. If a body of pages is broken down into ngrams, those ngrams could be used to create language models or phrase models to compare to other pages.
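
As a rough illustration of that breakdown, here is a minimal Python sketch. It assumes simple whitespace splitting, which the patent does not specify; the function name `ngrams` and the sample sentence are made up for this example.

```python
def ngrams(text, n):
    """Return the n-word phrases found in text, in reading order."""
    # Assumption: naive whitespace tokenization, punctuation stays attached.
    words = text.split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


sentence = "the quick brown fox jumps over the lazy dog"
for n in (2, 3, 4, 5):
    print(n, ngrams(sentence, n))
```

A text of N words yields N - n + 1 phrases of length n, so longer ngrams produce fewer phrases from the same page.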

Language models, like the ones that Google used to create gibberish scores for sites, could also be used to determine the quality of sites if example sites were used to generate those language models. That seems to be the idea behind the new patent granted this week. The summary section of the patent tells us about this use of the process it describes and protects:

In general, one innovative aspect of the subject matter described in this specification can be embodied in methods that include the actions of obtaining baseline site quality scores for a plurality of previously-scored sites; generating a phrase model for a plurality of sites including the plurality of previously-scored sites, wherein the phrase model defines a mapping from phrase-specific relative frequency measures to phrase-specific baseline site quality scores; for a new site, the new site not being one of the plurality of previously-scored sites, obtaining a relative frequency measure for each of a plurality of phrases in the new site; determining an aggregate site quality score for the new site from the phrase model using the relative frequency measures of the plurality of phrases in the new site; and determining a predicted site quality score for the new site from the aggregate site quality score.

The newly granted patent from Google is:

Predicting site quality
Inventors: Navneet Panda and Yun Zhou
Assignee: Google
US Patent: 9,767,157
Granted: September 19, 2017
Filed: March 15, 2013

Abstract

Methods, systems, and apparatus, including computer programs encoded on computer storage media, for predicating a measure of quality for a site, e.g., a web site. In some implementations, the methods include obtaining baseline site quality scores for multiple previously scored sites; generating a phrase model for multiple sites including the previously scored sites, wherein the phrase model defines a mapping from phrase specific relative frequency measures to phrase specific baseline site quality scores; for a new site that is not one of the previously scored sites, obtaining a relative frequency measure for each of a plurality of phrases in the new site; determining an aggregate site quality score for the new site from the phrase model using the relative frequency measures of phrases in the new site; and determining a predicted site quality score for the new site from the aggregate site quality score.

In addition to generating ngrams from text on sites, some implementations of this patent will also generate ngrams from the anchor text of links pointing to pages of the sites. Building a phrase model involves calculating the frequency of n-grams on a site “based on the count of pages divided by the number of pages on the site.”
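
The quoted wording is terse, so the sketch below is only one possible reading: it assumes the “count of pages” means the number of pages on the site that contain a given n-gram. The function names and the sample pages are hypothetical and not taken from the patent.

```python
from collections import Counter


def ngrams(text, n):
    """Split text on whitespace and return its n-word phrases."""
    words = text.lower().split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


def relative_frequencies(pages, n=2):
    """Map each n-gram to (pages containing it) / (total pages on the site)."""
    if not pages:
        return {}
    page_counts = Counter()
    for page_text in pages:
        # set() so a phrase repeated on one page counts once for that page
        page_counts.update(set(ngrams(page_text, n)))
    return {phrase: count / len(pages) for phrase, count in page_counts.items()}


# Hypothetical four-page site
site_pages = [
    "cheap credit card offers",
    "compare credit card rates",
    "contact us today",
    "about our team",
]
freqs = relative_frequencies(site_pages)
print(freqs["credit card"])  # the phrase appears on 2 of the 4 pages
```

Under this reading, a phrase that shows up on every page of a site gets a relative frequency of 1, no matter how often it repeats within a single page.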

The patent tells us that site quality scores can impact rankings of pages from those sites:

Obtain baseline site quality scores for a number of previously-scored sites. The baseline site quality scores are scores used by the system, e.g., by a ranking engine of the system, as signals, among other signals, to rank search results. In some implementations, the baseline scores are determined by a backend process that may be expensive in terms of time or computing resources, or by a process that may not be applicable to all sites. For these or other reasons, baseline site quality scores are not available for all sites.
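
The steps in the abstract (build a phrase model from previously scored sites, then score a new site from it) could be sketched along these lines. This is a guess at the general shape, not the patent's method: bucketing relative frequencies, averaging baseline scores within each bucket, and averaging across phrases are all assumptions made for illustration, as are every name and number in the code. The final step of turning the aggregate into a predicted score is not detailed in this post, so the sketch stops at the aggregate.

```python
from collections import defaultdict
from statistics import mean

NUM_BUCKETS = 10  # hypothetical granularity for relative frequencies


def bucket(freq):
    """Assign a relative frequency in [0, 1] to one of NUM_BUCKETS bins."""
    return min(int(freq * NUM_BUCKETS), NUM_BUCKETS - 1)


def build_phrase_model(scored_sites):
    """scored_sites: list of (phrase -> relative frequency, baseline score).

    Returns phrase -> bucket -> average baseline score of the sites whose
    frequency for that phrase falls in that bucket (an assumed mapping).
    """
    collected = defaultdict(lambda: defaultdict(list))
    for freqs, baseline_score in scored_sites:
        for phrase, freq in freqs.items():
            collected[phrase][bucket(freq)].append(baseline_score)
    return {
        phrase: {b: mean(scores) for b, scores in buckets.items()}
        for phrase, buckets in collected.items()
    }


def aggregate_score(model, new_site_freqs):
    """Average the phrase-specific scores the model holds for the new site."""
    scores = [
        model[phrase][bucket(freq)]
        for phrase, freq in new_site_freqs.items()
        if phrase in model and bucket(freq) in model[phrase]
    ]
    return mean(scores) if scores else None


# Made-up training data: (relative frequencies, baseline quality score)
scored_sites = [
    ({"credit card": 0.05, "contact us": 0.02}, 0.9),
    ({"credit card": 0.08, "about us": 0.03}, 0.7),
    ({"credit card": 0.85, "cheap loans": 0.60}, 0.1),
]
model = build_phrase_model(scored_sites)
new_site = {"credit card": 0.07, "cheap loans": 0.65}
print(aggregate_score(model, new_site))
```

Phrases the model has never seen at a given frequency level are skipped here, and a site with no usable phrases gets `None` rather than a score.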


32 thoughts on “Using Ngram Phrase Models to Generate Site Quality Scores”

  1. Hi Tony,

    The earlier post I wrote was about the same patent, but as a patent application. That application has now been officially granted by the USPTO. I thought it was worth pointing out how the gibberish patent and this one both used ngrams to identify site quality, or lack of quality.

  2. Thanks for the summary Bill, it’s interesting how this will potentially carry over to anchor text as well, which I can see being used to determine link quality over time. The only issue I see is that not all sites will get a grade due to computing power, which leaves the question to be asked, “Which sites will Google deem important enough to grade?”

  3. Hi Clint,

    It was interesting seeing that anchor text is included in this site quality scoring. You raise an important issue with looking at patents to try to get an idea of what is happening at Google: we may be left with questions that we don’t have answers to, and likely can’t ask anyone in particular about. Since patents aren’t written by the marketing or PR people at Google, they may leave gaps of information that we might want to find out more about. I respect the people at Google who work as webmaster evangelists, but sometimes they don’t know the answers to some questions (for instance, I asked how the “result scores” from the Google KG API are calculated, and they had no idea).

  4. Dear Bill,

    thank you very much for the update on ngram models!
    Nothing really new, but now it’s absolutely clear.
    This is what I call truly helpful information in terms of SEO.

    Best wishes
    Tobias

  5. Hi Tobias,

    Yes, it’s good seeing how some patents might be connected, and those connections can show us interesting things about how they are being used.

  6. Hearing about the Ngram phrase for the first time. Thanks for sharing this useful information. Thanks for the summary Bill, it’s interesting how this will potentially carry over to anchor text as well, which I can see being used to determine link quality over time. The only issue I see is that not all sites will get a grade due to computing power, which leaves the question to be asked, “Which sites will Google deem important enough to grade?”

  7. Hi Bill, cool stuff! It raises all kinds of interesting questions, like if site quality scores are based on aggregate quality scores which are based on “commonly used phrases” (I’m simplifying here)… then what kind of impact does the devolution of language have? Doesn’t it lower the bar rather than raising it because fewer and fewer people know how to write grammatically correct sentences, which will be reflected in the aggregate score over time? Another unintended consequence could be giving people an incentive to create (near) duplicate content simply because it’s believed you can rank higher by saying what everyone else in the SERPs you’re gunning for is saying “because of aggregate scores man!” – end of rant – haha

  8. Hi Bill, once again an amazingly informative article. The challenge here is getting content structured uniformly to benefit from this.
    Thanks again for all the insight you bring to your articles.

  9. Hi Hijhem,

    I’m not quite sure that the things you are worried about, such as a devolution of language or near duplicate content, are issues that might arise from the use of ngram models to score pages based on site quality scores.

  10. I guess I’m also on the list of people who weren’t aware of “Ngram Phrase”. You have provided in-depth information about it.

    Thanks

  11. Hey, Bill,
    I am a little bit of confused with the new pattern. Can I have a sentence which includes an ngram phrase? If you provide one, then I will be clear about this concept. BTW, you have discussed all things very well, but I am not a good student ):

  12. N-gram phrase is a new concept for me. But after reading it I felt happy to know something new. Very informative post, thanks for sharing.

  13. Hi TI,

    If I were to turn the first sentence of your comment into 2-grams, it would look like this:

    I am
    am a
    a little
    little bit
    bit of
    of confused
    confused with
    with the
    the new
    new pattern.

    I included punctuation in there because the patent said that it would likely include punctuation in the ngrams they broke content down into. I hope that makes it clearer for you.

    Bill

  14. At its most basic, it would likely easily detect ‘spun’ content. But I suspect they are way past that stage by now -this confirms some beliefs I’ve had about how Google looks at content 🙂

  15. Hey Bill
    Such a wonderful information and I m hearing about Ngram for the first time, thanks for fantastic post keep posting good work…!!!

    See You
    Tc 👍

  16. The author of this patent is Navneet Panda. There are at least a couple more search engineers at Google with the last name Panda. When the Panda Update came out from Google, I was guessing that Biswanath Panda was the Google employee that the panda update was named after. I later came across the Google+ page from Navneet Panda, and a “bragging rights” section on his profile page, where he had written that he was the Panda that the update was named after. He has been a co-inventor on a few other patents, which show that he is both very smart, and very good at search; like the patent that this post is about.

  17. Hi Andy,

    Google likely has been using ngrams for a few years to identify spun content. I suspect that they’ve gotten pretty good at reviewing content using machine learning to gauge its quality by now. I imagine it might be a little frightening seeing how good they are with it, considering the Books ngram studies they have done, which covered a lot of material (as does the web these days).

  18. hi Billi,
    Very nice one. Before reading this article I didn’t know about Ngram Phrase models and how to use them to generate a good score for sites. After reading I got some useful information. Thank you very much.

  19. Such a brilliant data and I m catching wind of Ngram out of the blue, a debt of gratitude is in order for fabulous post continue posting great work… !!!

  20. This phrase was a little confusing:
    “Building a phrase model involves calculating the frequency of n-grams on a site “based on the count of pages divided by the number of pages on the site.””

    “count of pages” == “numbers of pages”, no?

    Presumably, this should have read “based on the count of pages with a certain number of specific n-grams divided by the number of pages on the site”?

  21. Also, the phrase “a plurality of previously-stored sites” implies surely that their starting point is a set of quality sites that have been assessed manually by humans. A bit like Majestic did with Trust Flow?

    Presumably, they take the standard n-gram pattern of a set of ‘quality sites’ and compare that to another site?

    For example, take the phrase “I spent some time looking for the cheapest credit card I could find.” 13 words, so 13 x 1-grams, 12 x 2-grams, 11 x 3-grams, etc

    Now, they could look at the overall pattern of 1-grams, 2-grams, 3-grams from site to site, but that would be ridiculous surely if they don’t remove 1-gram prepositions, articles, etc from their calculations. So they therefore surely have to look at the individual n-grams. Take “cheapest credit card” as an example of a 3-gram. If the sample set of ‘quality sites’ has a frequency of x=>y and your site falls within that range, then your site would get a ‘tick against that particular n-gram’ as your use of that phrase doesn’t look spammy. And so on…. see how the pattern matches for all n-grams and assess whether one site looks like it matches the pattern for a quality site…..

    Just thinking aloud… your thoughts?

  22. Hi Matt,

    Having a training set of sites makes sense, and gives them something to compare sites to. If you look at the Google Books ngram viewer (https://books.google.com/ngrams), you’ll see that it tends to work better if you have phrases that contain more than one word. I think that makes a lot of sense.

Comments are closed.