Is This Really the Panda Patent?

Does Google’s newly granted patent co-invented by Navneet Panda describe Google’s Panda Update?

Search Quality vs. Web Spam

Many of the patent filings that I’ve written about from Google address Web Spam issues, and how the search engine may take steps or follow approaches to keep its search results from being manipulated. An early example of Google tackling such issues is their patent filed in 2003 titled Methods and systems for identifying manipulated articles.

But many of the patents I’ve written about involve ways that Google is trying to improve the quality of search results that searchers see.

For example, one of Google’s first patents (remember that PageRank was Stanford’s patent and not Google’s) involved looking at a top number of search results in response to a query, and boosting some of those in search rankings for that query if they were linked to by other top ranking results for the same query. That patent, Ranking search results by reranking the results based on local inter-connectivity, was aimed at improving the quality of the top ranking results.

Google’s Phrase-Based Indexing patents involve looking at meaningful words and phrases that tend to co-occur or show up in the search results for a specific query, and then boosting the rankings of pages where those phrases do appear, or boosting how much weight is passed along through anchor text using one of those related co-occurring terms. These are search quality patents.

There are a number of phrase-based indexing patents, and at least one of them also addresses Web Spam by checking to see if there is a statistically abnormal amount of co-occurring words from the results on a page. So, the phrase-based indexing approach also included a way to detect web spam.

Focus On Quality

The patent, granted to Navneet Panda and Vladimir Ofitserov, Ranking search results, is aimed at improving search results rather than penalizing sites or identifying attempts to manipulate search results.

The patent lists only one “advantage” to following the process it describes:

Search results identifying low-quality resources can be demoted in a presentation order of search results returned in response to a user’s query. Thus, the user experience can be improved because search results higher in the presentation order will better match the user’s informational needs.

Before the Panda update was launched, there had been a number of highly public criticisms of the quality of search results showing up in searches at Google.

Here are a few examples:

December 13, 2009 – Dishwashers, and How Google Eats Its Own Tail – Paul Kedrosky

Google has become a snake that too readily consumes its own keyword tail. Identify some words that show up in profitable searches — from appliances, to mesothelioma suits, to kayak lessons — churn out content cheaply and regularly, and you’re done. On the web, no-one knows you’re a content-grinder.

December 13, 2009 – Content Farms: Why Media, Blogs & Google Should Be Worried – Richard MacManus

From my analysis of Demand Media and similar sites, such content is very generic and lacks depth. While I wouldn’t go as far as wikiHow founder Jack Herrick and say that it “lacks soul,” it certainly lacks passion and often also lacks knowledge of the topic at hand. Arrington’s analogy with fast food is apt – it is content produced quickly and made to order.

January 2, 2011 – On the increasing uselessness of Google….. – Alan Patrick

But this year it really hit home just how badly Google’s systems have been spammed, as typically anything on Page 1 of the search results was some form of SEO spam – most typically a site that doesn’t actually sell you anything, just points to other sites (often doing the same thing) while slipping you some Ads (no doubt sold as “relevant”). The other main scamsite type is one that copies part of the relevant Wikipedia entry and throws lots of Ads at you

January 3, 2011 – Trouble In the House of Google – Jeff Atwood

Like any sane person, I’m rooting for Google in this battle, and I’d love nothing more than for Google to tweak a few algorithmic knobs and make this entire blog entry moot. Still, this is the first time since 2000 that I can recall Google search quality ever declining, and it has inspired some rather heretical thoughts in me — are we seeing the first signs that algorithmic search has failed as a strategy? Is the next generation of search destined to be less algorithmic and more social?

It’s a scary thing to even entertain, but maybe gravity really is broken.

January 27th, 2011 – Google Search Quality Decline or Elitism? – AJ Kohn

Google could certainly do that. They could stand up and say that fast food content from Demand Media wouldn’t gain prime SERP real estate. Google could optimize for better instead of good enough. They could pick fine dining over fast food.

But is that what the ‘user’ wants?

Improving Quality

As you can see from the quotes I included from the blogs above, there was definitely a sense of Google results being broken, and showing results that were more focused upon matching queries than returning quality results.

These criticisms were being heard, even in the halls of the Googleplex, and in February of 2011, we were told of an update by the Official Google Blog, in a post titled Finding more high-quality sites in search. The impact of this change covered a fair number of searches, and was clearly aimed at surfacing higher quality content:

But in the last day or so we launched a pretty big algorithmic improvement to our ranking—a change that noticeably impacts 11.8% of our queries—and we wanted to let people know what’s going on. This update is designed to reduce rankings for low-quality sites—sites which are low-value add for users, copy content from other websites or sites that are just not very useful. At the same time, it will provide better rankings for high-quality sites—sites with original content and information such as research, in-depth reports, thoughtful analysis and so on.

So the question that I have, after watching the Panda update, reading a lot of threads in forums and other places about sites that were impacted by Panda, and working on some sites that definitely were, is whether or not the patent from Navneet Panda describes the update, and its attempts to improve the quality of search results.

Here’s a quick summary from the patent of what happens in the process it describes:

  • Determining, for each of a plurality of groups of resources, a respective count of independent incoming links to resources in the group;
  • Determining, for each of the plurality of groups of resources, a respective count of reference queries;
  • Determining, for each of the plurality of groups of resources, a respective group-specific modification factor, wherein the group-specific modification factor for each group is based on the count of independent links and the count of reference queries for the group; and
  • Associating, with each of the plurality of groups of resources, the respective group-specific modification factor for the group, wherein the respective group-specific modification for the group modifies initial scores generated for resources in the group in response to received search queries.

So the patent has multiple parts, which work together.

The first involves looking at the links pointing to the pages of a site, and removing all of the back links that look like they might be affiliated (under co-ownership or control) with the site, or reducing the number of independent links to account for things like site-wide links. It’s quite possible that this is done to get a sense of how many different unrelated pages and sites are linking to the pages of this site. More independent links from more sources could be seen as a sign of quality.

The second is an analysis of whether or not pages appear to be targeted at specific reference queries. While it’s not unusual for someone doing SEO for a site to try to make every page on a site a potential landing page, many of the sites that we refer to as content farm sites often use every page to target highly commercial queries and multiple variations of those queries. So a content farm type of site might include many pages that attempt to refer to a lot of queries.

The independent link count and the reference query count for each of the groups that a site might be broken down into are looked at as a ratio, with independent link count over reference query count. If there are a lot of independent links and few reference queries, this number could be over one. If there are few independent links and lots of reference queries, the number could be less than one.

This number based upon the links and the queries would then be multiplied by a score that has been modified by whether or not each page is seen as a navigational type result for a query term or phrase. The more it is like a navigational term or phrase, the higher this part of the score. The final score could boost ranking scores for some results and diminish scores for other results.
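The arithmetic described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the neutral value used when a group has no reference queries, and the way the navigational signal is folded in as a simple weight are all hypothetical, and nothing in the summary above gives exact formulas.

```python
def modification_factor(independent_links: int, reference_queries: int) -> float:
    """Ratio of independent links to reference queries for one group.

    Assumption: a group with no reference queries gets a neutral factor
    of 1.0; the source does not say how that case is handled.
    """
    if reference_queries <= 0:
        return 1.0
    return independent_links / reference_queries


def adjusted_score(initial_score: float, factor: float,
                   navigational_weight: float = 1.0) -> float:
    """Scale an initial ranking score by the group-specific factor.

    navigational_weight is a hypothetical stand-in for the navigational
    signal: values above 1.0 mean the page looks more like a
    navigational result for the query.
    """
    return initial_score * navigational_weight * factor


# Many independent links, few reference queries: factor above one.
boost = modification_factor(independent_links=200, reference_queries=50)

# Few independent links, many reference queries: factor below one.
demote = modification_factor(independent_links=10, reference_queries=400)
```

With these made-up counts, the first group's scores would be scaled up and the second group's scaled down, which matches the boost-or-diminish behavior described above.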

Groups Rather Than Pages

Instead of targeting specific pages or sites as many ranking algorithms do, the patent tells us that it looks at “groups” of resources. A group might be defined a number of different ways. Each resource can be included in only one group.

A group might be address based, so that all of the resources within the group are all in the same domain name, such as http://www.example.com. Or all in the same host name on a domain, such as http://host1.example.com or http://host2.example.com.
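As a rough sketch of address-based grouping, the Python below assigns each URL to exactly one group, keyed either by host name or by domain name. The function names are mine, and the domain logic is a naive assumption (it keeps the last two labels of the host), so it would get suffixes such as "co.uk" wrong without a public suffix list.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def group_key(url: str, by: str = "host") -> str:
    """Return the group a URL belongs to, by host name or by domain name."""
    host = urlsplit(url).hostname or ""
    if by == "host":
        return host
    # Naive assumption: the domain is the last two labels of the host.
    labels = host.split(".")
    return ".".join(labels[-2:])


def group_resources(urls, by: str = "host") -> dict:
    """Place every URL in exactly one group."""
    groups = defaultdict(list)
    for url in urls:
        groups[group_key(url, by)].append(url)
    return dict(groups)


urls = [
    "http://host1.example.com/a",
    "http://host2.example.com/b",
    "http://www.example.com/",
]
by_host = group_resources(urls, by="host")      # three separate groups
by_domain = group_resources(urls, by="domain")  # one group: example.com
```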

Groups of resources might be partitioned by a count of reference queries for each of the groups, “so that each partition includes groups of resources whose counts of reference queries are within a respective range of counts of reference queries.”
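One way to picture that partitioning is a simple bucketing step. In the sketch below, the range boundaries and the function names are invented for illustration, since the patent language quoted here doesn't give any actual ranges.

```python
from bisect import bisect_right

# Hypothetical boundaries between ranges of reference query counts.
RANGE_BOUNDS = [100, 1_000, 10_000]


def partition_index(reference_query_count: int) -> int:
    """Return the index of the range that a count falls into."""
    return bisect_right(RANGE_BOUNDS, reference_query_count)


def partition_groups(counts: dict) -> dict:
    """Map each partition index to the groups whose counts fall in its range."""
    partitions = {}
    for group, count in counts.items():
        partitions.setdefault(partition_index(count), []).append(group)
    return partitions


partitions = partition_groups({
    "small.example.com": 50,
    "medium.example.com": 5_000,
    "large.example.com": 20_000,
})
```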

Under this approach, one website might be broken into more than one group, or might be part of a group that contains more than one web site. To rank the pages within these groups, the ratio of independent links to reference queries might be multiplied by a score involving navigational signals to determine a final rank.

Independent Links

If the purpose behind this patent is to rank pages higher that are higher quality, one way to do that is to look at the number of independent links to those pages, or groups of pages.

For each of these groups of resources, the patent tells us that it might count the number of links to those groups – but not all of the links. And not necessarily express links – links that you can click upon to get to another page. It might also count implied links, which sound more like what we often tend to refer to as citations. An express link can be used to navigate to a place, whereas an implied link can’t be clicked upon to bring a person to the target of that link.

Why doesn’t the patent mention PageRank? Both this metric and PageRank are supposed to be signals of quality, but not every signal from Google has to include PageRank. This reliance on independent links eliminates the benefit of having a site with lots and lots of pages to be linked to from the same site, or sites that are under the same ownership or control. Or linked to site wide from other sites.

An independent link is where the source of the link and the target are determined to be independent of each other. The source group that a link is in, and the target group can be checked to see if they are independent of each other as well.

Determining that links from one group to another are not independent links can also involve determining that those groups of resources are likely to be related, such as owned, hosted or created by the same entity.

If the resources have similar or identical content, or images, or formatting, or CSS or so on, their similarity is another sign that the resources are not independent.

Where there may be multiple links from one resource to a targeted group, only one of the links might be counted as an independent link. Though it’s not said in the patent, this could keep site-wide links from being counted more than once.
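Here is a minimal sketch of how the counting rules in this section might fit together. The function and its inputs are hypothetical: in particular, the `owner_of` mapping is a stand-in for whatever relatedness signals (ownership, hosting, similar content or formatting) a real system would use, and the URLs are made up.

```python
def count_independent_links(links, group_of, owner_of):
    """Count independent incoming links per target group.

    links: iterable of (source_url, target_url) pairs.
    group_of: dict mapping each URL to its group.
    owner_of: dict mapping a group to its controlling entity, a
        hypothetical stand-in for relatedness signals.
    """
    counted = set()
    for source, target in links:
        src_group, tgt_group = group_of[source], group_of[target]
        if src_group == tgt_group:
            continue  # links within a group are not independent
        src_owner = owner_of.get(src_group)
        if src_owner is not None and src_owner == owner_of.get(tgt_group):
            continue  # groups under common control are treated as related
        # Only one link per source resource to a target group is counted.
        counted.add((source, tgt_group))

    counts = {}
    for _, tgt_group in counted:
        counts[tgt_group] = counts.get(tgt_group, 0) + 1
    return counts


group_of = {
    "http://a.example/1": "a.example",
    "http://b.example/1": "b.example",
    "http://sister.example/1": "sister.example",
    "http://target.example/x": "target.example",
    "http://target.example/y": "target.example",
}
owner_of = {"target.example": "Acme", "sister.example": "Acme"}
links = [
    ("http://a.example/1", "http://target.example/x"),
    ("http://a.example/1", "http://target.example/y"),       # same source, counted once
    ("http://b.example/1", "http://target.example/x"),
    ("http://sister.example/1", "http://target.example/x"),  # same owner, skipped
    ("http://target.example/y", "http://target.example/x"),  # internal, skipped
]
independent = count_independent_links(links, group_of, owner_of)
```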

Reference Queries

In addition to an analysis of the links pointing to the different reference groups, this process also looks at the pages of the site, and the queries that each might be targeting. How well do those pages satisfy those queries?

If a page includes the term “example.com”, it might be said to refer to the home page of a site. If it includes terms that are commonly used by searchers to refer to the pages of a site, it might be said to include referring queries that refer to those pages. The patent provides an example of others, by telling us that:

…if the terms “example sf” and “esf” are often used by searchers to refer to the resource whose URL is “http://www.sf.example.com,” queries that contain the terms “example sf” or “esf”, e.g., the queries “example sf news” and “esf restaurant reviews,” can be counted as reference queries for the group that includes the resource whose URL is “http://www.sf.example.com.”
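The quoted example can be read as a term-matching step over queries. The sketch below counts a query toward a group when it contains a term known to refer to a resource in that group. The term table is built only from the patent's own example terms, while the function names and the whole-word matching rule are my assumptions.

```python
# Terms assumed to be recognized as referring to a resource, mapped to
# the group that contains that resource (taken from the patent's example).
REFERRING_TERMS = {
    "example.com": "www.example.com",
    "example sf": "sf.example.com",
    "esf": "sf.example.com",
}


def contains_term(query: str, term: str) -> bool:
    """True if the words of term appear consecutively in the query."""
    q, t = query.lower().split(), term.lower().split()
    return any(q[i:i + len(t)] == t for i in range(len(q) - len(t) + 1))


def count_reference_queries(queries) -> dict:
    """Count, per group, the queries that contain a referring term."""
    counts = {}
    for query in queries:
        matched = {group for term, group in REFERRING_TERMS.items()
                   if contains_term(query, term)}
        for group in matched:
            counts[group] = counts.get(group, 0) + 1
    return counts


reference_counts = count_reference_queries([
    "example sf news",
    "esf restaurant reviews",
    "example.com login",
    "how to boil water",
])
```

Matching on whole words keeps a term like "esf" from being found inside a longer, unrelated word.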

Navigational Queries

In the post, How Google May Identify Navigational Queries and Resources I wrote about how Google used a document classification approach to determine whether or not a page was one that searchers entered a query for, expecting to find a specific page, such as the official homepage of the product or service included within the query.

To a degree, this kind of inquiry isn’t too different from the set of questions that are raised in Amit Singhal’s Official Google Blog post, More guidance on building high-quality sites. It’s possible that such questions were worked into the analysis of a site at this stage, though the patent doesn’t refer to them specifically.

The patent is:

Ranking search results
Invented by Navneet Panda and Vladimir Ofitserov
Assigned to Google
United States Patent 8,682,892
Granted: March 25, 2014
Filed: September 28, 2012

Abstract

Methods, systems, and apparatus, including computer programs encoded on computer storage media, for ranking search results.

One of the methods includes:

  • Determining, for each of a plurality of groups of resources, a respective count of independent incoming links to resources in the group;
  • Determining, for each of the plurality of groups of resources, a respective count of reference queries;
  • Determining, for each of the plurality of groups of resources, a respective group-specific modification factor, wherein the group-specific modification factor for each group is based on the count of independent links and the count of reference queries for the group; and
  • Associating, with each of the plurality of groups of resources, the respective group-specific modification factor for the group, wherein the respective group-specific modification for the group modifies initial scores generated for resources in the group in response to received search queries.

Observations

Chances are that Google tweaked and changed the Panda Algorithm in the weeks and months after it was first applied, and may have made many changes to it as it went through an initial beta period.

I’ve seen a number of denials from people about this particular patent describing the Panda update since I wrote about finding it last week in Google’s Panda Granted a Patent on Ranking Search Results. These denials were based upon the existence of a link analysis described within the patent, without looking more closely at the actual process involved, and claiming that the patent more likely detailed the Penguin approach than the Panda approach.

But the link analysis here, involving independent links and reference queries, is more of an attempt to gauge the quality of a site than the back link profile of that site. The “navigational” query analysis, which could involve issues such as the example 23 questions that Amit Singhal provided us with, is also an attempt to understand the quality of pages.

I ask, with the title of this post, if this patent is “really the Panda Patent”, and I do think it is. Still, I am open to the possibility that the Panda Updates followed a somewhat different course as they were implemented and tested.

60 thoughts on “Is This Really the Panda Patent?”

  1. Thanks for sharing such a detailed article on the Panda patent, Bill. You have put great analysis into this blog post. Thanks again for sharing your thoughts!

  2. Very interesting Bill.

    The first few parts of the patent, as you describe them, seemed aimed at Demand Media. Removing ‘controlled’ links (from other properties under corporate ownership) and then the links to referring queries ratio, which would target those sites that were ‘sharding’ keywords (i.e. – how to boil water, how to boil hot water, how to boil cold water).

    And the groups of resources certainly dovetails with Panda too, since we all saw that it was generally a site based demotion and not done on the document level.

    I’m guessing this was a material part of the Panda update. And it would make sense since it was undertaken because of the chatter you reference. Hence they started with a target group – a learning set so to speak.

    Not only that, but it underscores the fact that links were (and still are) an important way to determine quality. That doesn’t mean PageRank but simple link graph analysis.

  3. Hi Bill,

    Great analysis!

    A small detail caught my attention so I’ve checked it in the official doc you’ve mentioned – in regard to reference queries the patent says: “For example, a term that refers to a resource may be all of or a portion of a resource identifier, e.g., the URL, for the resource.” But isn’t this statement in conflict with the keyword stuffed domains issue, which Google is said to have reduced in value as a ranking factor?

    For instance if you are reaching to a given resource thanks to a keyword query and it matches the domain of a given relevant website (with keyword stuffed url) then it should not be taken so much into account as to boost the given website in the SERPs … or I am missing something. Anyway – perfect explanation, nicely done:)

  4. People should not discount the link analysis aspect of the patent.

    BTW — it’s probably prudent to suggest that Google might have applied for more than one patent to protect the Panda process.

  5. Stellar stuff as always Bill.

    Like AJ I find the “Groups Rather Than Pages” section a compelling possible Panda paw print. Not just because sites were hit, but because even on hard-hit domains, parts of a site were hit harder than others.

    Which makes sense in the context of sites that were targeting “highly commercial queries and multiple variations of those queries.” What one would typically find on content farms were what might be described as “target query page clusters” which, ironically, exposed themselves for classification as low quality by virtue of the nature of their targeting. Pretty clever of Google to leverage this particular signal.

  6. Hey Bill,

    Illuminating post. Could you clarify the meaning of ‘reference queries’ for me in the context of “Determining, for each of the plurality of groups of resources, a respective count of reference queries”.

    I’m surprised to see you refer to PageRank as a quality signal, instead of a popularity signal. I’m guessing you mean that quality is inferred by popularity.

    I agree with Michael M that there must be many other patents that cumulatively protect Panda.

    BTW, I like the site’s new header image and design.

  7. My understanding is that a U.S. patent must be filed within 12 months of the invention’s first sale, use, or publication.

    This patent was filed in September 2012, more than 18 months after the initial Panda update in February 2011.

    Wouldn’t that indicate that this patent describes something other than the Panda algorithm?

  8. Hi Dave,

    Given changes to US patent law in the last couple of years, and the changes those have made to prior art, I don’t know for certain. Would the use of the Panda Update so much earlier than the filing of this patent, and the manner in which it was used, have made it count as prior art? I am not a patent attorney, and I can’t give you a clear answer on that question.

    The claims within the patent, and the description do seem like they would fit the Panda algorithm, though the algorithm is one that appears to have been evolving over time as well, and this patent might cover a process that was adopted along the path of its development rather than at the start of it.

    And, as Michael Martinez notes earlier, we don’t know if there might or might not be other patents that are possibly related that have been filed as well. If so, they could cover parts of the Panda algorithm as well.

  9. Hi AJ,

    The patent does seem like it was written with the kinds of sites that were talked about in some of the criticisms of Google’s search results, and the approaches described do seem to address aspects of what we’ve seen with Panda. I would suspect that there may be other ways to accomplish some of those types of targeted goals, so this might not necessarily be the first iteration of how the algorithm might work. But, I do like seeing an approach spelled out in a way that would address some of the issues involving boosting higher quality resources.

  10. Hi Nevyana

    Thank you. There do seem to be some similarities with identifying referring queries – pages that seem to be targeted at specific queries based upon content within them. What we don’t know is how Google addressed that keyword stuffed domain issue. Google did publish one patent, which they originally filed in 2003, which wasn’t granted until 2011. I wrote about it in Google’s Exact Match Domain Name Patent (Detecting Commercial Queries), and it presented a number of different ways that Google might determine whether or not terms in queries might be commercial, and if so, whether or not they should be devalued.

    Doing an analysis to determine whether or not specific pages could be associated with referring queries because those terms might be in the domain name isn’t the same as discontinuing to give that same page a slight boost in search results for the query terms because the query terms are in the domain name.

  11. Hi Michael,

    I agree on the importance of the link analysis part of the patent. I also agree that there may be one or more additional relevant patents out there involving the Panda process. I mentioned the phrase-based indexing patents in the post, and there are a number of those that apply to different aspects of how phrase-based works, with different filing dates for many of them.

  12. Hi Aaron,

    Thanks. I do like the approaches described in the patent, and it makes sense to use them in a manner such as this. I am expecting to see more on the Panda algorithm at some point, and I’m hoping that we don’t have to wait too long.

  13. Hi Matthew,

    If you visit a page on the Web that has been optimized for a specific query, chances are that you will see keyword terms or phrases in page titles, in headings, in page content, and so on. You can get a sense of what query or queries someone might have attempted to optimize a page for.

    PageRank is based in part on popularity, but the search engines usually refer to it as a “quality” measure. It’s not based solely upon a sheer number of links, but instead is based upon a concept that is analogous to academic citations in scientific papers. The notion that important pages tend to link to important pages isn’t just a vote for the page with the most links, and those votes aren’t equal. A link from the home page of the NY Times is much more important than a link from the Fauquier Times (my local paper that comes out 2 times a week). A link from the NY Times Homepage might be worth hundreds or thousands of links from the home page of a small town newspaper online.

    It is definitely quite possible that there are other related patents out there that haven’t been made public yet involving Panda.

    Thanks regarding the new look for the site.

  14. Solid read here Bill! I’ve got to admit though, that a lot went over my head. I’m a noob in this space compared to some of y’all mavericks! But man, I completely enjoy reading and learning from your articles. Always thorough and “complete”! I’m especially fond of your aversion to speculation, unlike some other folks in the SEO space, but enough of that, nobody likes folks that name names. Keep doing this great work! Also, like one of the other guys up above, gotta congratulate you on the new look even though I do miss the “big fish” :-]! Great job. Modern and clean. Nice.

    Skolar.

  15. Thanks for posting this Bill, it makes a lot of sense.

    Shortly after Panda was rolled out there was speculation if Google is using a ratio of the number of links / number of pages on a site. A number of links / number of referring queries though seems much more interesting.

    You could think of it this way – if a group of pages is receiving a disproportionately high amount of search referrals while having a disproportionately low number of links, what does this say about user satisfaction?

    It could also explain why many traditional recommendations of dealing with Panda were unsuccessful – removing duplicate, empty or even low content pages wouldn’t necessarily reduce the number of referring queries. Maybe SEOs shouldn’t have been removing pages that don’t have content, but rather pages that were ranking for large numbers of queries.

  16. Here’s an interesting thought experiment: forget for a moment that this patent has Navneet Panda’s name on it. If you were to read it without knowing the inventor’s name, what Google update would you most associate it with?

    A lot of the mechanics described in the patent — especially those regarding the application of a group (site) modification factor to individual search results — fit well with the conventional wisdom about the Panda algorithm.

    However, the basic value judgement embedded in the patent, based on the ratio of independent links to reference queries, doesn’t seem to line up well with Google’s public statements about Panda, nor with individual stories of Panda penalty and recovery. Most of those have centered around content quality rather than link factors.

    The patent seems to make the assumption that there is a “natural” ratio between the number of independent links to a site and the number of reference queries to that site. That is, for every person who naturally creates a link to any given site, some number of people also perform searches referencing that site.

    This makes intuitive sense: people are going to link to a high-quality site, and they’re also going to search for it. Not as many people will link to a low-quality site; nor will they search for it. Regardless of a site’s size, how much traffic it receives, or even its quality, the ratio of links to reference queries probably remains fairly consistent.

    Google knows what this ratio is. While it may vary somewhat by niche (“plurality of groups of resources”), the ratio for most sites falls within bounds that are well-understood by Google. This algorithm seeks to penalize those sites that fall outside those bounds.

    So what sorts of factors might affect the ratio of independent links to reference queries?

    Site quality seems unlikely to affect this ratio. A high-quality site is going to attract both links and reference queries (people searching for the site by name). A low-quality site will not attract many links, but is also unlikely to be searched for by many people. While the absolute numbers of links and reference queries would both vary with quality, the ratio between the two is likely to stay fairly consistent, regardless of site quality.

    What sorts of things would affect the ratio of links to queries? The most obvious one is unnatural link building to a low-quality site. Let’s say a site owner launches a massive link-building campaign and doubles the number of independent links to the site. This effort is unlikely to affect the number of people searching for that same site, and would thus result in a big change in the site’s ratio of links to reference queries, which would in turn affect the site’s group modification factor.

    What I take away is that this specific algorithm is better-suited for detecting and penalizing unnatural linking than it is for site quality. It would also seem to open the door wide for negative SEO attacks.

    In those regards, the core value judgement embedded in the patent seems to be more closely aligned with Penguin than with Panda.

    That all said, Navneet Panda’s name is on this patent. I wouldn’t discount the possibility that he is responsible for both of those algorithms, nor that this patent describes an amalgamation of the two.

  17. Hi Dave,

    As soon as I hit the section of the patent involving “navigational queries”, that nailed it for me that the patent was about the Panda update. There was some talk from Google about the Panda update being a change to a document classifier, and the patents Google has published about navigational queries are all about classifications of documents.

    The “reference queries” section isn’t about what queries people are or were finding pages on a site for, but rather what queries the pages appear to be targeting – like the multiple keyword variations that content farms were targeting, such as “How to tie a shoe,” “How to knot a shoe,” “How to put on a shoe.” That’s Panda, and not Penguin, regardless of the name Navneet Panda being on the patent.

  18. Hi Nemek,

    Thanks. I do like the approach described in the patent.

    Creating pages so that they are purposefully responsive to specific queries isn’t anything unusual, but having groups of pages that reference content in a way that targets lots of specific queries seems to have been a characteristic of the content farm type sites that the Panda Update seems to be targeting.

    As the New York Times article, Google’s War on Nonsense puts it:

    Mr. Miller’s job, as he made clear in an article last week in The Faster Times, an online newspaper, was to cram together words that someone’s research had suggested might be in demand on Google, position these strings as titles and headlines, embellish them with other inoffensive words and make the whole confection vaguely resemble an article. AOL would put “Rick Fox mustache” in a headline, betting that some number of people would put “Rick Fox mustache” into Google, and retrieve Mr. Miller’s article.

    Getting rid of the low quality or substantially duplicate content pages probably wasn’t a bad idea, and would be something that I would recommend as part of a site audit regardless of Panda. But, it does seem like the target should be on the content aimed not at being quality content, but rather aimed more at inducing people to come to a site for specific queries being targeted by pages – not referring queries, but rather reference queries.

    Again, it’s not unusual for people to create pages that seem to be about something that people might search for, but pages that are set up like that, and yet have few or no independent links to them are probably going to rank lower.

  19. Hi RogueSkolar,

    Thanks. It’s hard to leave some speculation out, and given the impact that both the Panda and Penguin updates have had on many people and their sites, there definitely are going to be some strong feelings and emotional attachments to specific ideas and theories about how each of those updates work.

    Chances are that if there are any other patents that might be involved in the Panda and the Penguin updates, that we won’t see those until they are granted, and sometimes it can take a while for a patent to make it through the prosecution process and become public, so we don’t know what else is out there and when it might become public knowledge.

    Thanks, regarding the new look for the site as well. I’m still getting used to it myself. :)

  20. The patent seems pretty clear that a “reference query” is one in which the user is trying to reach a specific site or page:

    A reference query for a group of resources is a search query that has been submitted to a search engine and has been classified as referring to a resource in the group. A query can be classified as referring to a particular resource if the query includes a term that is recognized by the system as referring to the particular resource. For example, a term that refers to a resource may be all of or a portion of a resource identifier, e.g., the URL, for the resource. For example, the term “example.com” may be a term that is recognized as referring to the home page of that domain, e.g., the resource whose URL is “http://www.example.com”. Thus, search queries including the term “example.com” can be classified as referring to that home page. As another example, if the system has data indicating that the terms “example sf” and “esf” are commonly used by users to refer to the resource whose URL is “http://www.sf.example.com,” queries that contain the terms “example sf” or “esf”, e.g., the queries “example sf news” and “esf restaurant reviews,” can be counted as reference queries for the group that includes the resource whose URL is “http://www.sf.example.com.”

    I’m having a hard time squaring that paragraph with your definition and the shoe-tying example.
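
The term-matching case in the quoted paragraph can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the lookup table `REFERRING_TERMS` and the function `referenced_resource` are hypothetical names, and the quoted passage does not say how the system learns which terms refer to a resource.

```python
# Hypothetical table of terms "recognized by the system" as referring to a
# resource. The entries mirror the examples in the quoted paragraph; how such
# a table would be built is not described there.
REFERRING_TERMS = {
    "example.com": "http://www.example.com",
    "example sf": "http://www.sf.example.com",
    "esf": "http://www.sf.example.com",
}


def referenced_resource(query):
    """Return the URL a query refers to, or None if no known term appears."""
    words = query.lower().split()
    for term, url in REFERRING_TERMS.items():
        term_words = term.split()
        n = len(term_words)
        # Match the term as a whole-word sequence anywhere in the query.
        for i in range(len(words) - n + 1):
            if words[i:i + n] == term_words:
                return url
    return None


print(referenced_resource("example sf news"))
print(referenced_resource("esf restaurant reviews"))
print(referenced_resource("how to tie shoes"))
```

Under these assumptions the first two queries resolve to the sf.example.com resource, and the third, which contains no recognized term, resolves to nothing.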

  21. Hi Dave,

    If you read the second part of that first sentence, it says “…and has been classified as referring to a resource in the group.”

    It doesn’t say that the pages of a site being looked at have to rank for those query terms, though they possibly could. It tells us that Google will look at queries that have been submitted to it, and see if the pages on the site in question appear to be targeted at any of those queries, or “refer” to those queries. From the passage you quoted, the classification of a referral query is the important thing here.

  22. Bill,

    I agree that whether or not the site ranks for a given query term is unmentioned and irrelevant to this section of the patent.

    But I don’t see anywhere in the patent that it discusses the concept of a page referring to or targeting a query. (In fact, I see no mention of on-page factors at all.)

    The paragraph I quoted above discusses (and fairly explicitly for a software patent) the concept of a user search query that refers to a specific website, usually by name or URL:

    A reference query for a group of resources is a search query that has been submitted to a search engine and has been classified as referring to a resource in the group.

    Replacing some of the cumbersome patent-speak with more common terms makes it even clearer:

    A reference query for a website is a search query that has been submitted to a search engine and refers to a page in the site.

    The query refers to the page or site, not the other way around.

    The examples provided in the same paragraph further emphasize that they’re talking about a query referring to a site, rather than a site targeting a query:

    search queries including the term “example.com” can be classified as referring to that home page

    queries that contain the terms “example sf” or “esf”, e.g., the queries “example sf news” and “esf restaurant reviews,” can be counted as reference queries for the group that includes the resource whose URL is “http://www.sf.example.com.”

    I’m not trying to pick a fight here, but the definition of “reference query” seems pretty central to understanding the patent, and I don’t think that your interpretation is supported by the content of the patent.

  23. Hi Dave,

    The query refers to a resource on the site. So the question is, how does that happen?

    There are a few different ways. One would be that specific pages on the site are optimized for those query terms that “refer to the resource.”

    Yes, this could be done through on-page optimization, or it could be done by pointing specific anchor text at pages within the resource, or both.

    Unfortunately, that doesn’t tell us that this is Penguin or Panda, though.

  24. Could you please cite the portions of the patent upon which you’re basing your conclusions?

    I don’t see anything in there to suggest that a reference query is anything other than a query that contains the URL or name of a specific site.

  25. Hi Dave,

    I am basing what I am saying on a knowledge of how pages rank for something. The patent’s description provides an example of a reference query that does more than just contain the URL or the name of the specific site:

    “…queries that contain the terms “example sf” or “esf”, e.g., the queries “example sf news” and “esf restaurant reviews,” can be counted as reference queries for the group that includes the resource whose URL is “http://www.sf.example.com.”

    That second example of a reference query, “esf restaurant reviews” doesn’t contain the name of the URL or the name of the site.

  26. Just a quick thought

    “this could keep site-wide links from being counted more than once.”

    Do you think it’s more likely Google simply ignores these links, or that sites they determine to not be independent with mass links to each other may actually be caught up in an algo “penalty”?

  27. Hi Todd,

    I think Google just ignores them when it’s making this calculation, so that the number of independent links isn’t prejudiced by the existence of some site-wide links. The idea seems to be to gauge the quality of pages being examined without the influence of dependent links (under the same control and/or ownership) and without counting multiple links from the same independent sources more than once. In other words, how likely is it that people will link to pages on the site (or groups of resources on the site) without having some kind of relationship with the site? Counting multiple links from the same site can throw that off.
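
The counting idea described in this reply can be sketched as follows. It is a minimal illustration, not the patent's method: the function name, the `owner_of` mapping, and the owner labels are all assumptions made for the example, and a real system would need some other way of deciding which sites share control or ownership.

```python
from urllib.parse import urlparse


def count_independent_links(link_sources, target_owner, owner_of):
    """Count distinct independent source sites linking to a group of pages.

    link_sources: URLs of pages that link to the group (repeats allowed).
    target_owner: owner label of the group being scored (hypothetical).
    owner_of: dict mapping a host name to its owner label (hypothetical).
    """
    independent_hosts = set()
    for url in link_sources:
        host = urlparse(url).netloc.lower()
        # Skip dependent links: same control and/or ownership as the target.
        if owner_of.get(host) == target_owner:
            continue
        # A set keeps each source site once, so site-wide links count once.
        independent_hosts.add(host)
    return len(independent_hosts)


links = [
    "http://a.example/page1",
    "http://a.example/page2",  # site-wide link repeated on many pages
    "http://a.example/page3",
    "http://b.example/review",
    "http://network.example/footer",  # same owner as the target
]
owners = {"network.example": "acme"}
print(count_independent_links(links, "acme", owners))
```

Here the three links from a.example count once, the link from the commonly owned site is skipped, and the result is 2.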

    In that example, “ESF” is an abbreviation for “Example SF,” the name of the site http://www.sf.example.com.

    The next paragraph of the patent drives home the point, equating reference queries with navigational queries:

    In addition or in the alternative, a query can be categorized as referring to a particular resource when the query has been determined to be a navigational query to the particular resource. From the user point of view, a navigational query is a query that is submitted in order to get to a single, particular web site or web page of a particular entity.

  29. Hi Dave,

    ESF is definitely an abbreviation for “Example SF”, but it’s not the domain, domain and host name, or site name. Some kind of actual processing had to happen for Google to understand that. If we go back to the first part of that sentence and look at it again, we get a hint of where Google learned that “ESF” refers to the site:

    As another example, if the system has data indicating that the terms “example sf” and “esf” are commonly used by users to refer to the resource whose URL is “http://www.sf.example.com,” queries that contain the terms “example sf” or “esf”, e.g., the queries “example sf news” and “esf restaurant reviews,” can be counted as reference queries for the group that includes the resource whose URL is “http://www.sf.example.com.”

    There are a few different places where Google could learn what might be “commonly used by users to refer to the resource.” Those could be query sessions that include searches for the domain, or searches such as “esf restaurant reviews” that one or more pages of the site might be optimized for and returned for.

    As for a referring query also being a navigational query, that’s a little like saying that a square can be a rectangle. Not all referring queries are going to be navigational queries. In the post, I did link to an earlier post I wrote about navigational queries, and a Google patent for navigational queries.

    So, what does any of this have to do with the Penguin update? How do you see that fitting in with the process described in the patent?

    For all intents and purposes “ESF” is a name for the site in question, in the same way that “National Football League” and “NFL” are both names for www.nfl.com.

    As you correctly point out, there’s necessarily a classifier (whose details are not described in this patent) that determines whether or not a term refers to a specific site.

    The important point is that it’s classifying certain queries as referring to specific web sites. It’s not classifying web sites as referring to specific queries.

    What does all this have to do with Penguin? Not necessarily anything. But comparing the number of people linking to a site (“independent links”) to the number of people searching for it (“reference queries”) seems like a great way to detect unnatural linking patterns. And that seems much more Penguin than Panda. See my earlier comment for details.

  31. Hi Dave,

    A search engine will not make the association that “national football league” is nfl.com automatically. It needs to do some kind of analysis, and apply some level of confidence that the longer name is the same as nfl. The patent refers to user data as being one source of information about what queries might be classified as referring to resource groups that make up a site. That kind of user data can be helpful in making such a determination.

    The patent does refer to classifying queries as being referring queries for a site. When I stated that a website might be used to help determine what those queries were, I was saying that the words on the pages of the site and the words that the pages of the site might be optimized for can be determining factors in that classification of queries as referring queries for a resource.

    As for Penguin, I don’t see how this process can be used to determine unnatural linking patterns when it’s easy enough just to analyze the link graph for things like dense bipartite graphs that indicate unnatural linking patterns. The 2003 Google patent I mentioned does just that.

    It’s not a matter of a specific ratio, but rather just that more independent links make it more likely that a “resource” will be higher quality since they aren’t nepotistic. The more reference queries for a “resource,” the less likely that it’s higher quality. If the count is close to equal for the two (independent links/reference queries), then this initial score is pretty close to one. This really doesn’t act to ferret out unnatural linking patterns in any way. It seems instead to be trying to find good links to determine a final score for a resource.
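
The arithmetic behind the “initial score” as it is described in this reply (independent links divided by reference queries) can be shown with a tiny sketch. The function name and the handling of a zero query count are assumptions added for the example. Nothing here says how such a score would be used further.

```python
def initial_score(independent_links, reference_queries):
    """Ratio of independent links to reference queries, as described above.

    Returns None when there are no reference queries; that choice is an
    assumption of this sketch, not something stated in the discussion.
    """
    if reference_queries == 0:
        return None
    return independent_links / reference_queries


print(initial_score(100, 100))
print(initial_score(300, 100))
print(initial_score(50, 200))
```

With equal counts (100 and 100) the sketch returns 1.0, the “close to one” case mentioned above. More independent links than reference queries push the value above one, and fewer push it below.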

  32. Bill, I respect your experience in this area, but in this case I think it’s doing you a disservice and leading you to attribute properties to this process that aren’t supported by the text of the patent.

    Of course there’s a classifier, and obviously it’s going to have to look at a variety of factors to determine that “national football league” and “NFL” both refer to nfl.com. But for the purposes of the process described in this patent, how the classifier reaches its conclusions is irrelevant. (If the full classification process was a part of the invention, the patent would have to describe it, which it does not.)

    As defined in this patent, “NFL” is a reference query to nfl.com, and only to nfl.com. It does not refer to any other sites, regardless of how much they have tried to optimize for the term. A plain reading of the patent should make that abundantly clear.

    If you want to imbue “reference query” with all sorts of properties that aren’t mentioned in the patent, so be it. We can agree to disagree. But the simplest explanation is usually the right one, so I’ll choose to accept the term as it’s actually defined in the patent.

  33. Interesting post, very detailed. I’m not sure, to be honest, I don’t particularly study Google in such detail.

    There are some things about Google which do interest/bother me, one of which cropped up yesterday on Matt Cutts’ video section. A guy asked him, “How do you separate simple popularity over true authority?” And his response was pretty vague. I remember the Panda update rolling out and the widespread sense of panic which seemed to be about the place. The next one is on the way, no doubt, and I hope they call it Walrus.

  34. Bill, a very interesting exchange of thoughts between you and Dave on “reference queries,” and I always understood it the way Dave does. I truly believe it is about how users use queries to reference resources within a group, and not the other way round.

    The problem with content farms was that they had links, but unlike Wikipedia or NYT, people never used as many reference queries for them as they did when searching for resources in Wikipedia or NYT. This is also confirmed by how some sites like wiki.answers.com still rank very well even though they seem to have been created like a content farm as you had described above.

    But there are some content farms which still rank well because of their brand (reference queries). Just do a search for “Smallest Bird in the World” and see how ask.com and wiki.answers.com rank on the first page. And if you look at those pages, they actually do what you describe a content farm as doing. But they still rank well because of their brand (reference queries) strength.

  35. Hi Alex,

    Thanks. I did see the video you’re referring to, responding to a question from AJ Kohn (Blind Five Year Old). The question was:

    “As Google continues to add social signals to the algorithm, how do you separate simple popularity from true authority?”

    Matt purposefully ignored the first part of that question about social signals, but focused upon the “simple popularity vs. true authority” part of the question.

    I thought his answer was pretty interesting, especially in light of some of the discussion above, especially when he starts talking about algorithms working to find evidence that some sites might be good matches for specific topical queries.

    I don’t know if we will see another big update like Panda or Penguin soon, but we’ll definitely see new ranking signals and approaches. Walrus? :)

  36. Hi Rajesh

    I’m not sure that you could have “always” viewed reference queries the way that Dave had, since there really hasn’t been a reference to them like in this patent anywhere else before that I can locate. The only time you had to begin viewing them that way was in reading the patent.

    The definition from the patent is a definition of examples rather than a definition of what a “reference query” actually is, but unfortunately, it’s easy to ignore the part of one example that clearly states that Google would look at things like user data, to equate something like “ESF Restaurant Reviews” with the site in the example.

    If reference queries were only queries that included the domain name or host name and domain name or site name, then there just wouldn’t be many reference queries for any one site. That was an example, not a full definition.

  37. Bill, I used the word “always” to mean how I understood the term “reference queries” from the time I read it in the patent.

    Since “Group” as defined in that patent is address based and is defined as having the same domain name or host name, “reference queries” would mean any phrase that includes terms to refer to a resource in the group. Such terms can only be domain names or host names or anything widely used by searchers to refer to that group, and this could include abbreviations like NFL. So in the context of the patent, reference queries are determined by search users.

    Another interesting development immediately after Panda was introduced was people complaining about sites copying their content ranking ahead of pages (on their site) when they do a quoted search of longer phrases or sentences unique to their pages. This was said to happen (and I have seen it myself) even if the copying site linked back to the originator. The following gives a clue as to why this problem could have occurred.

    “As another example, the system may have access to data that indicates how similar two resources are in one or more aspects, e.g., based on whether the two resources have identical or similar content, identical or similar images, identical or similar formatting, e.g., identical or similar Cascading Style Sheets (CSS), and so on.”

    From the above, it is clear that link backs from sites copying a substantial portion of the content of the originating site weren’t being considered as independent links, and hence the value of such link backs has been zeroed out for the target site by the “independence” factor of incoming links in the Panda algo.

  38. First of all, thank you Bill for drawing our attention to this patent. I’ve been studying Panda and trying to figure it out since April 2011 – so this was a huge find.

    Having analysed the patent in detail – I must confess that my understanding was that “reference queries” was pretty much synonymous with “navigational queries” – ie. queries intended to find a particular site.

    So when you said (above):

    ———————————-
    The more reference queries for a “resource,” the less likely that it’s higher quality. If the count is close to equal for the two (independent links/reference queries), then this initial score is pretty close to one.
    ———————————-

    I was surprised – because my understanding was the exact opposite: For a given number of backlinks, the more reference queries a site has, the better quality (or actually the bigger brand) it’s likely to be.

    So for example – searchers are more likely to type:

    “New York Times Malaysia Airways Flight”
    or
    “CNN Barack Obama Speech”
    or
    “WebMD Blood Pressure”

    rather than “Noname spammy site blood pressure”

    and that’s because the big brand sites have built a reputation for quality that searchers recognise – and they have a good experience on those sites that they want to repeat – and hence search for them by name. Hence the correlation Google has found between navigational searches and higher quality.

    Remember Matt Cutts said here: http://www.wired.com/2011/03/the-panda-that-hates-farms/2/

    ——–
    And we actually came up with a classifier to say, okay, IRS or Wikipedia or New York Times is over on this side, and the low-quality sites are over on this side. And you can really see mathematical reasons
    ——–

    So he specifically mentions big brand sites that people are more likely to do navigational searches for.

    And that is what we saw after Panda – a huge boost for the major brands.

    For all Panda’s shortcomings – this IS quite a clever way of distinguishing between sites people would actively look for – and those who are merely good at SEO.

    I’ve also done some research in my own market – and saw some indication that people who did well after Panda have a better ratio of navigational searches to links than those who did badly – (but it’s a very small sample, and certainly not conclusive).

    My plan for my Panda-hit site having studied the patent is to try to get more natural navigational searches by boosting our brand recognition. But you’ve read a LOT more patents than me, so I hope I’m not heading in precisely the wrong direction…

  39. Hi Bill,
    I really don’t see how this could possibly be the initial Panda patent. Panda was very clearly about Google making a judgement call on site quality, and I don’t see enough of that described in this patent to justify calling it ‘the’ Panda patent.

    Yes, I think this is probably part of the Panda algorithm as it currently stands, but I don’t agree that this describes the initial update.

    Do you think it is possible that this was part of an update to the full Panda algorithm (e.g. http://www.seroundtable.com/google-panda-20-15789.html)?

  40. Hi Patrick,

    This patent starts off with pages that already rank in response to a query based upon things like relevance and PageRank.

    It is completely and totally about site quality and about page classification. Given that, I think it could easily describe the initial update. I don’t know if Search Engine Watch or Search Engine Roundtable had any reasoning behind how they numbered the Panda Updates, but I would guess that their numbers mean little to nothing to Google. I would not call the version of Panda linked to by Search Engine Roundtable in that post the “full” Panda algorithm.

  41. Thanks for another in-depth article (makes my head spin every time, but it’s always a welcome one). I’m on the fence with this, although it’s leaning slightly more to this actually being Panda than not.

  42. Bill,
    Once again, you’ve shed light to the likes of me who take Google’s patents to be hieroglyphics.

    I’d like to cite this as a resource and post this a step simpler in my blog. This has definitely shed some light on co-citation, and shown that there are more specific ways to do it than just having your brand cited in a website.

  43. Hi Sean,

    This has nothing to do with co-citation, so I’d warn you about doing that. It also doesn’t mention the word “brand” at all, and I’ve seen a couple of blog posts written about that which overdo it a lot.

  44. Hmm,

    So wait, did I understand this phrase wrong:
    “It might also count implied links, which sound more like what we often tend to refer to as citations. An express link can be used to navigate to a place, where an implied link can’t be clicked upon to bring a person to the target of that link.”

    Because how I understand it is that if you co-cite in such a way that it is a navigational indication, it affects rankings somewhat.

    Correct me if I’m wrong.

  45. Hi Sean

    I don’t understand what you are trying to say when you write: “Because how I understand it is that if you co-cite in such a way that it is a navigational indication, it affects rankings somewhat.”

    I would also seriously suggest not using “co-cite” or “co-citation,” especially if you mean something that might have been written about at Moz or SEOmoz on those things, because there were some confusingly written things posted about co-occurrence that used the term “co-citation” incorrectly instead.

  46. Bill,

    That clarifies things. Knowing this, I definitely have to look deeper into the differences between “Co-citation” and “Co-occurrence” so I don’t make the mistake of using one when I mean the other.

    Would you be so kind as to help me understand what this phrase means in layman’s terms:
    “It might also count implied links, which sound more like what we often tend to refer to as citations. An express link can be used to navigate to a place, where an implied link can’t be clicked upon to bring a person to the target of that link.”

    Thanks Bill!

  47. Hi Sean

    An “implied” link on a page is one where the person who created the page didn’t insert an actual link to that page, but might have instead included the URL for the page (in a non-clickable fashion) or referred to “the homepage of ESPN” or referred to a site in a manner which isn’t clickable. An example is a citation (citation, not “co-citation”) like the kind that we often refer to in local search, where we see the name of a business and some additional geographical information such as a phone number or part or all of a street address.

    There have been a couple of articles posted which have made a big deal of express links versus implied links, and some kind of ratio between the two (which is not part of this patent at all, but which you might think you see if you’re not careful when reading it). There is a reference to a ratio, but it’s a ratio of “independent links” to “referring queries.”

    It’s really a waste of your time to try to better understand the distinction between co-citation and co-occurrence when it comes to this patent itself, since the patent has nothing to do with either. If you are going to research the topic, please ignore anything written about co-citation or co-occurrence at Moz though, since Rand just confused a lot of things about it the last couple of times he wrote about it (I hate having to write that, but it’s true). Hopefully Rand will try again, and straighten that out (I can hope).

  48. Bill,

    I see. That explains it.
    Knowing this, has there ever been a patent that tells us that co-citation and co-occurrence are factors when determining rankings? Or is it just baseless hype that the big blogs have predicted or, worse, concocted?

    Thanks a lot for your time clarifying things Bill.

  49. Hi Sean,

    There are a handful of patents from Google that mention co-citation, but they primarily focus upon how Google might use co-citation to find sites that are similar, like when you see in search results a message that similar sites to one you are looking up might be example.com, example1.com, example2.com, etc. This is because those sites tend to be linked to by many of the same sites (or “co-cited”).

    There are over 100 granted patents assigned to Google that include and discuss co-occurrence, covering query re-writing and other topics that can include boosting some results in a set of SERPs for a particular query. I’ve written blog posts on some of them (and I’ve written about co-occurrence in a number of other posts as well):

    HOW GOOGLE MAY REFORM QUERIES BASED ON CO-OCCURRENCE IN QUERY SESSIONS
    RANKING WEBPAGES BASED UPON RELATIONSHIPS BETWEEN WORDS (GOOGLE’S CO-OCCURRENCE PATENT)
    HOW GOOGLE MAY SUBSTITUTE QUERY TERMS WITH CO-OCCURRENCE

  50. Great article Bill, I am a bit of a newb when it comes to understanding the way Google works but articles like this really help gain a better understanding of this update.

  51. I’m a bit late to the party here and I’m not even clear why this is even a question because it was fully and contemporaneously answered by Matt and Amit in the interview here: http://www.wired.com/2011/03/the-panda-that-hates-farms/2/

    For us geeks, the algorithm is pretty clearly explicated as a feature-based machine learning algorithm that uses (chiefly) link text, link topology, and agreement with site content as the prime features.

  52. I am pretty certain that this is the Panda patent. I was working with a network of sites that had most of the sites slammed by Panda 1.0. I was always baffled as to why a few of the sites, with just as much problematic content, were untouched and even benefited from the update.

    The fact that network links were taken into account in the algo update now makes perfect sense to me. The entire network crosslinked the hell out of each other, but the handful of sites that survived the update actually had a significant amount of good non-network links.

  53. Hi Eli,

    Appreciate your feedback on this one. I’ve started on a followup to this post, and I’ll be looking forward to seeing what you think about that one.

    Thanks.
