Added – 3/4/2022 – I have written a follow-up to this post about this patent and the Google Panda Update, which I recently updated. I will edit this post some more too, but I wanted to make the updates from that post available here. It is essential to look at what Navneet Panda means by “Ranking Search Results” and at the ratio behind measuring site quality. That ratio looks at the referral queries that pages on the site are optimized for and at the implied links (not actual links) that are mentioned on other sites and may point towards areas covered in this patent.
The other post is at: Google’s Panda Granted a Patent on Ranking Search Results
This post has a lot of comments, and some of the commenters do grasp that the point behind the patent is to understand one of the earliest versions of determining site quality.
Historically, the “Panda” update’s focus has been viewed as penalizing written content and not links to the site. However, under the “Farmer approach,” some sites were seen as targeting an extensive range of very similar queries, such as “how to tie a knot” and “how to knot a tie.” Because of that, this patent focuses on which queries pages are being optimized for, in search engines or in the page content, and on how those pages are referred to in implied links. The ratio that the patent points at is what is used to decide whether a site is of low quality and whether higher-quality pages should outrank its pages, thus the “Ranking Search Results” name.
Search Quality vs. Web Spam
Many patent filings I’ve written about from Google address webspam. They also look at how search engines may follow approaches to keep search results from manipulation. An early example is the patent from 2003 titled Methods and systems for identifying manipulated articles.
But many patents I’ve written about involve Google trying to improve the quality of search results for searchers.
One of Google’s patents (remember that PageRank was Stanford’s patent and not Google’s) described reranking results in response to a query, boosting some results for that query if they are linked to by other top-ranking results for the same query. That patent, Ranking search results by reranking the results based on local inter-connectivity, was aimed at improving the quality of the top-ranking results.
Google’s Phrase-Based Indexing patents involve meaningful words and phrases that tend to co-occur in the search results for a specific query. The rankings of those pages could be boosted when those phrases do appear, or could be influenced by how much weight gets passed along through anchor text using one of those related co-occurring terms. These are search-quality patents.
There are several phrase-based indexing patents. At least one of those also addresses web spam by checking to see if there is a statistically abnormal amount of co-occurring words from the results on pages. The phrase-based indexing approach included a way to detect web spam.
Focus On Quality
A patent granted to Navneet Panda and Vladimir Ofitserov, Ranking search results, aims at improving search results rather than penalizing sites or identifying attempts to manipulate search results.
The Ranking Search Results patent does list only one “advantage” to following the process it describes:
Search results identifying low-quality resources can be demoted in a presentation order of search results returned in response to a user’s query. Thus, the user experience can be improved because search results higher in the presentation order will better match the user’s informational needs.
Just before the Panda update, there had been many public criticisms of the quality of search results showing up in searches at Google.
Here are a few examples:
December 13, 2009 – Dishwashers, and How Google Eats Its Tail – Paul Kedrosky
Google has become a snake that too readily consumes its own keyword tail. Identify some words that show up in profitable searches – from appliances, to mesothelioma suits, to kayak lessons – churn out content cheaply and regularly, and you’re done. On the web, no one knows you’re a content-grinder.
December 13, 2009 – Content Farms: Why Media, Blogs & Google Should Get Worried – Richard MacManus
From my analysis of Demand Media and similar sites, such content is very generic and lacks depth. While I wouldn’t go as far as wikiHow founder Jack Herrick and say that it “lacks soul,” it certainly lacks passion and often also lacks knowledge of the topic at hand. Arrington’s analogy with fast food is apt – it is content produced quickly and made to order.
January 2, 2011 – On the increasing uselessness of Google….. – Alan Patrick
But this year it hit home how Google’s systems have gotten spammed, as typically anything on Page 1 of the search results was some form of SEO spam – most commonly a site that doesn’t sell you anything, points to other sites (often doing the same thing) while slipping you some Ads (no doubt sold as “relevant”). The other primary scam site type copies part of the relevant Wikipedia entry and throws lots of Ads at you.
January 3, 2011 – Trouble In the House of Google – Jeff Atwood
Like any sane person, I’m rooting for Google in this battle. I’d love nothing more than for Google to tweak a few algorithmic knobs and make this entire blog entry moot. Still, this is the first time since 2000 that I can recall Google search quality ever declining. It has inspired some somewhat heretical thoughts in me – are we seeing the first signs that algorithmic search has failed as a strategy? Is the next generation of inquiry destined to be less algorithmic and more social?
It’s a scary thing to even entertain, but maybe gravity has broken.
January 27th, 2011 – Google Search Quality Decline or Elitism? – AJ Kohn
Google could certainly do that. They could stand up and say that fast food content from Demand Media wouldn’t gain prime SERP real estate. Google could optimize for better instead of good enough. They could pick fine dining over fast food.
But is that what the ‘user’ wants?
As you see from those quotes, there was a sense of Google results getting broken and showing results more focused upon matching queries than returning quality results.
These criticisms were heard, even at the Googleplex. In February of 2011, the Official Google Blog told us of an update in Finding more high-quality sites in search. This change covered a fair number of queries and was aimed at surfacing higher quality content:
But in the last day or so, we launched a pretty significant algorithmic improvement to our rankings, a change that impacts 11.8% of our queries, and we wanted to let people know what’s going on. This update is designed to reduce rankings for low-quality sites, which are low-value add for users, copy content from other websites, or sites that are not very useful. At the same time, it will provide better rankings for high-quality sites with original content and information such as research, in-depth reports, thoughtful analysis, and so on.
After watching the Panda update, reading a lot of threads in forums and other places about sites impacted by Panda, and working on sites that were, the question I have is whether the patent from Navneet Panda describes the update and its attempts to improve the quality of search results.
Here’s a quick summary, from the patent, of the process it describes:
- Determining, for each of a plurality of groups of resources, a respective count of independent incoming links to resources in the group
- Determining, for each of the plurality of groups of resources, a respective count of reference queries
- Determining, for each of the plurality of groups of resources, a respective group-specific modification factor, wherein the group-specific modification factor for each group is based on the count of independent links and the count of reference queries for the group
- Associating, with each of the plurality of groups of resources, the respective group-specific modification factor for the group, wherein the respective group-specific modification for the group modifies initial scores generated for resources in the group in response to received search queries
So the patent has many parts which work together.
The first involves looking at the links pointing to the pages of a site and removing all the backlinks that look like they might be affiliated (under co-ownership or control) with the site, or reducing the number of independent links to account for things like sitewide links. This may give a sense of how many different pages and sites are linking to the pages of this site. More independent links from more sources could be seen as a sign of quality.
The second is an analysis of whether pages appear to be targeted at specific referring queries. While it’s not unusual for someone doing SEO to try to make every page on a site a potential landing page, many of the sites that we refer to as content farms often use every page to target commercial queries and many variations of those queries. So a content farm type of site might include many pages that each attempt to refer to a large number of queries.
The independent link count and the referring query count for the different groups that a site might be broken into are then looked at as a ratio, with independent link count over referring query count. If there are a lot of independent links and few referring queries, this number could be over one. If there are few independent links and lots of referring queries, the number could be a fraction of one.
This number, based upon the links and the queries, would then be multiplied by a score that is modified by whether each page is seen as a navigational type result for a query term or phrase. The more it is like a navigational term or phrase, the higher the score. The final score could boost ranking scores for some results and diminish scores for other results.
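To make the arithmetic concrete, here is a minimal Python sketch of the ratio and how it might scale an initial ranking score. The function names, the zero-query fallback, and the navigational_weight parameter are all assumptions made for illustration; the patent does not publish a formula in this form.

```python
def link_query_ratio(independent_links: int, reference_queries: int) -> float:
    """Independent link count divided by reference query count for one group."""
    # Assumption: treat a group with no reference queries as having one,
    # purely to avoid dividing by zero. The patent does not say this.
    return independent_links / max(reference_queries, 1)


def modified_score(initial_score: float, independent_links: int,
                   reference_queries: int,
                   navigational_weight: float = 1.0) -> float:
    """Scale an initial ranking score by a hypothetical group-level factor."""
    # Hypothetical factor: the ratio multiplied by a navigational weight.
    factor = link_query_ratio(independent_links, reference_queries) * navigational_weight
    return initial_score * factor


# Many independent links, few reference queries: the ratio is above one.
print(link_query_ratio(200, 50))
# Few independent links, many reference queries: the ratio is a fraction of one.
print(link_query_ratio(10, 400))
```

Under these assumptions, a group with a ratio above one has its pages' scores raised, and a group with a ratio below one has them lowered.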
Groups Rather Than Pages
Instead of targeting specific pages or sites as many ranking algorithms do, the patent tells us that it looks at “groups” of resources. A group might be defined in several ways, and a resource can only be included in one group.
A group might be address-based, so that all the resources within the group are in the same domain name, such as http://www.example.com, or all in the same hostname on a domain, such as http://host1.example.com or http://host2.example.com.
Groups of resources might get partitioned by a count of reference queries for each of the groups “so that each partition includes groups of resources whose counts of reference queries are within a respective range of counts of reference queries.”
Under this approach, one website might be broken into more than one group or might be part of a group that contains more than one website. To rank the pages within these groups, the ratio of independent links to reference queries might be multiplied by a score involving navigational signals to determine a final rank.
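The two grouping ideas in this section, address-based groups and partitions by reference query count, can be sketched in a few lines of Python. This is only an illustration: the helper names, the naive "last two labels" domain rule, and the partition boundaries are assumptions, not anything taken from the patent.

```python
from bisect import bisect_right
from collections import defaultdict
from urllib.parse import urlsplit


def group_key(url: str, by: str = "hostname") -> str:
    """Return the group a URL belongs to, by hostname or by domain."""
    host = urlsplit(url).hostname or ""
    if by == "hostname":
        return host
    # Assumption: a naive domain rule that keeps the last two labels.
    # A real system would need something like the Public Suffix List.
    return ".".join(host.split(".")[-2:])


def group_resources(urls, by: str = "hostname") -> dict:
    """Place every URL in exactly one group."""
    groups = defaultdict(list)
    for url in urls:
        groups[group_key(url, by)].append(url)
    return dict(groups)


def partition_index(reference_query_count: int, boundaries=(10, 100, 1000)) -> int:
    """Pick a partition for a group from its reference query count.

    The boundaries are hypothetical range edges chosen for this sketch.
    """
    return bisect_right(boundaries, reference_query_count)


urls = [
    "http://host1.example.com/a",
    "http://host2.example.com/b",
    "http://host1.example.com/c",
]
print(group_resources(urls, by="hostname"))
print(group_resources(urls, by="domain"))
```

Grouping by hostname splits example.com into two groups here, while grouping by domain keeps all three pages together.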
Independent Links in Ranking Search Results
If the purpose behind this patent is to rank pages higher that are higher quality, one way to do that is to look at the number of independent links to those pages or groups of pages.
For each group of resources, the patent might count the number of links to those groups, but not all the links. And not only express links, which are links that you can click upon to get to another page. It might also count implied links, which sound more like what we often refer to as citations. An express link can be used to navigate to a page, whereas an implied link can’t be clicked on to bring a person to the target of that link.
Why doesn’t the patent mention PageRank? This is a different metric. PageRank is a quality signal, but not every ranking approach from Google includes PageRank. This reliance on independent links eliminates the benefit of having a site with lots and lots of pages linked to from the same site, from sites under the same ownership or control, or linked to sitewide from other sites.
An independent link is one where the source and the target are determined to be independent of each other. The source group that a link is in and the target group can be checked to see if they are separate.
Determining that links from one group to another are not independent can also involve determining that those resources are likely to be related, such as owned, hosted, or created by the same entity.
If the resources have similar or identical content, images, formatting, or so on, their similarity is another sign that the resources are not independent.
There may be many links from one resource to a targeted group, and only one link might get counted as an independent link. Though it’s not said in the patent, this could keep sitewide links from getting counted more than once.
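Here is a minimal sketch of how independent links might be counted under the rules described above. The function name, the (source group, target group) link tuples, and the owner_of lookup are hypothetical; the patent does not say how ownership or similarity is detected, so a simple dictionary stands in for that step.

```python
def count_independent_links(links, target_group: str, owner_of: dict) -> int:
    """Count distinct, unaffiliated source groups linking to target_group.

    links is an iterable of (source_group, destination_group) pairs.
    owner_of maps a group to a known owner (a stand-in for whatever
    affiliation signals a real system would use).
    """
    independent_sources = set()
    target_owner = owner_of.get(target_group)
    for source_group, destination_group in links:
        if destination_group != target_group:
            continue
        # Links from inside the same group are not independent.
        if source_group == target_group:
            continue
        # Links from groups under the same known owner are not independent.
        source_owner = owner_of.get(source_group)
        if source_owner is not None and source_owner == target_owner:
            continue
        # A set means many links from one source count only once,
        # which keeps sitewide links from inflating the total.
        independent_sources.add(source_group)
    return len(independent_sources)


links = [("blog.example.org", "www.example.com")] * 500  # a sitewide link
links += [
    ("news.example.net", "www.example.com"),
    ("shop.example.com", "www.example.com"),  # same owner as the target
    ("www.example.com", "www.example.com"),   # internal link
]
owners = {"shop.example.com": "ExampleCo", "www.example.com": "ExampleCo"}
print(count_independent_links(links, "www.example.com", owners))
```

In this sketch, 503 raw links collapse to two independent ones: the sitewide link counts once, and the internal and co-owned links are dropped.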
Besides analyzing the links pointing to the different reference groups, this process looks at the site’s pages and the queries that each might target. How well do those pages satisfy those queries?
If a page includes the term “example.com,” it might refer to the home page. If it contains words that searchers use to refer to the pages of a site, it might be said to involve referring queries that refer to those pages. The patent provides another example by telling us that:
…if the terms “example sf” and “esf” are often used by searchers to refer to the resource whose URL is “http://www.sf.example.com,” queries that contain the terms “example sf” or “esf,” e.g., the queries “example sf news” and “esf restaurant reviews,” can be counted as reference queries for the group that includes the resource whose URL is “http://www.sf.example.com.”
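The patent's example can be turned into a small Python sketch. It assumes a simple whole-phrase match between a query and a list of terms already known to refer to a resource; the function names are hypothetical, and the patent describes queries as being "classified" without saying how, so the matching here is only a stand-in.

```python
def is_reference_query(query: str, referring_terms) -> bool:
    """True if the query contains any referring term as a whole phrase."""
    # Normalize case and whitespace, then pad with spaces so that
    # "esf" matches as a word but not inside a longer word.
    padded = " " + " ".join(query.lower().split()) + " "
    return any(f" {term.lower()} " in padded for term in referring_terms)


def count_reference_queries(queries, referring_terms) -> int:
    """Count how many queries refer to the group, under this sketch's rule."""
    return sum(is_reference_query(q, referring_terms) for q in queries)


# Terms searchers are assumed to use for http://www.sf.example.com.
terms = ["example sf", "esf"]
queries = [
    "example sf news",
    "esf restaurant reviews",
    "sf weather",
    "best restaurants",
]
print(count_reference_queries(queries, terms))
```

Only the first two queries contain one of the referring terms, so they are the only ones counted for the group.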
In the post, How Google May Identify Navigational Queries and Resources, I wrote about how Google used a document classification approach to determine whether a page was one that searchers entered a query for, expecting to find a specific page, such as the official homepage of the product or service included within the query.
To a degree, this kind of inquiry isn’t too different from the set of questions raised in Amit Singhal’s Official Google Blog post, More guidance on building high-quality sites. Such questions could be worked into an analysis of a site at a stage like this one, though the patent doesn’t refer to them specifically.
The Ranking Search Results patent is:
Ranking search results
Invented by Navneet Panda and Vladimir Ofitserov
Assigned to Google
United States Patent 8,682,892
Granted: March 25, 2014
Filed: September 28, 2012
Methods, systems, and apparatus for ranking search results, including computer programs encoded on computer storage media.
The methods include:
- Determining, for each of a plurality of groups of resources, a respective count of independent incoming links to resources in the group
- Determining, for each of the plurality of groups of resources, a respective count of reference queries
- Determining, for each of the plurality of groups of resources, a respective group-specific modification factor, wherein the group-specific modification factor for each group is based on the count of independent links and the count of reference queries for the group
- Associating, with each of the plurality of groups of resources, the respective group-specific modification factor for the group, wherein the respective group-specific modification for the group modifies initial scores generated for resources in the group in response to received search queries
Ranking Search Results Observations
The chances are that Google tweaked and changed the Panda algorithm in the weeks and months after it was first applied, and may have made many changes to the process described in the patent after an initial testing period.
I’ve seen several denials from people about this particular patent describing the Panda update since I wrote about finding it last week in Google’s Panda Granted a Patent on Ranking Search Results. These denials were based upon the existence of a link analysis described within the patent, without looking more closely at the actual process involved, and claimed that the patent more likely detailed the Penguin approach than the Panda approach.
But the link analysis here involving independent links and referring queries is more of an attempt to gauge the quality of a site than the backlink profile of that site. The “navigational” query analysis that could involve issues such as the example 23 questions that Amit Singhal provided us with also attempts to understand the quality of pages.
I changed the title of this post to stress that it is the Panda patent. I’m open to the possibility that the Panda Updates followed a somewhat different course as they were implemented and tested.
There have been several patents co-authored by Navneet Panda. He hasn’t been prolific, but he has created some interesting inventions:
- 3/25/2014 – Google’s Panda Granted a Patent on Ranking Search Results
- 4/14/2015 – Early Panda and Concept Templates
- 5/12/2015 – How Google May Calculate a Site Quality Score (from Navneet Panda)
- 6/27/2017 – A Panda Patent on Website and Category Visit Duration
- 6/28/2017 – Click a Panda: High Quality Search Results based on Repeat Clicks and Visit
Last Updated March 4, 2022.
61 thoughts on “Is Ranking Search Results the Panda Patent?”
Thanks for sharing such a detailed article on the Panda patent, Bill. You have put such great analysis into this blog post. Thanks again for the great share of your thoughts!
Very interesting Bill.
The first few parts of the patent, as you describe them, seemed aimed at Demand Media. Removing ‘controlled’ links (from other properties under corporate ownership) and then the links to referring queries ratio, which would target those sites that were ‘sharding’ keywords (i.e. – how to boil water, how to boil hot water, how to boil cold water).
And the groups of resources certainly dovetails with Panda too, since we all saw that it was generally a site based demotion and not done on the document level.
I’m guessing this was a material part of the Panda update. And it would make sense since it was undertaken because of the chatter you reference. Hence they started with a target group – a learning set so to speak.
Not only that, but it underscores the fact that links were (and still are) an important way to determine quality. That doesn’t mean PageRank but simple link graph analysis.
A small detail caught my attention so I’ve checked it in the official doc you’ve mentioned – in regard to reference queries the patent says: “For example, a term that refers to a resource may be all of or a portion of a resource identifier, e.g., the URL, for the resource.” But isn’t this statement in conflict with the keyword-stuffed domains issue, which Google has said it reduced in value as a ranking factor?
For instance if you are reaching to a given resource thanks to a keyword query and it matches the domain of a given relevant website (with keyword stuffed url) then it should not be taken so much into account as to boost the given website in the SERPs … or I am missing something. Anyway – perfect explanation, nicely done:)
Stellar stuff as always Bill.
Like AJ I find the “Groups Rather Than Pages” section a compelling possible Panda paw print. Not just because sites were hit, but because even on hard hit domains, parts of a site were hit harder than others.
Which makes sense in the context of sites that were targeting “highly commercial queries and multiple variations of those queries.” What one would typically find on content farms were what might be described as “target query page clusters” which, ironically, exposed themselves for classification as low quality by virtue of the nature of their targeting. Pretty clever of Google to leverage this particular signal.
Very in-depth. Thanks for the explanation! 😀
People should not discount the link analysis aspect of the patent.
BTW — it’s probably prudent to suggest that Google might have applied for more than one patent to protect the Panda process.
Illuminating post. Could you clarify the meaning of ‘reference queries’ for me in the context of “Determining, for each of the plurality of groups of resources, a respective count of reference queries”.
I’m surprised to see you refer to PageRank as a quality signal, instead of a popularity signal. I’m guessing you mean that quality is inferred by popularity.
I agree with Michael M that there must be many other patents that cumulatively protect Panda.
BTW, I like the site’s new header image and design.
Given changes to US patent law in the last couple of years, and the changes those have made to prior art, I don’t know for certain. Would the use of the Panda Update so much earlier than the filing of this patent, and the manner in which it was used, have made it count as prior art? I am not a patent attorney, and I can’t give you a clear answer on that question.
The claims within the patent, and the description do seem like they would fit the Panda algorithm, though the algorithm is one that appears to have been evolving over time as well, and this patent might cover a process that was adopted along the path of its development rather than at the start of it.
And, as Michael Martinez notes earlier, we don’t know if there might or might not be other patents that are possibly related that have been filed as well. If so, they could cover parts of the Panda algorithm as well.
The patent does seem like it was written with the kinds of sites that were talked about in some of the criticisms of Google’s search results, and the approaches described do seem to address aspects of what we’ve seen with Panda. I would suspect that there may be other ways to accomplish some of those types of targeted goals, so this might not necessarily be the first iteration of how the algorithm might work. But, I do like seeing an approach spelled out in a way that would address some of the issues involving boosting higher quality resources.
Thank you. There do seem to be some similarities with identifying referring queries – pages that seem to be targeted at specific queries based upon content within them. What we don’t know is how Google addressed that keyword stuffed domain issue. Google did publish one patent, which they originally filed in 2003, which wasn’t granted until 2011. I wrote about it in Google’s Exact Match Domain Name Patent (Detecting Commercial Queries), and it presented a number of different ways that Google might determine whether or not terms in queries might be commercial, and if so, whether or not they should be devalued.
Doing an analysis to determine whether or not specific pages could be associated with referring queries because those terms might be in the domain name isn’t the same as discontinuing to give that same page a slight boost in search results for the query terms because the query terms are in the domain name.
I agree on the importance of the link analysis part of the patent. I also agree that there may be one or more additional relevant patents out there involving the Panda process. I mentioned the phrase-based indexing patents in the post, and there are a number of those that apply to different aspects of how phrase-based works, with different filing dates for many of them.
Thanks. I do like the approaches described in the patent, and it makes sense to use them in a manner such as this. I am expecting to see more on the Panda algorithm at some point, and I’m hoping that we don’t have to wait too long.
My understanding is that a U.S. patent must be filed within 12 months of the invention’s first sale, use, or publication.
This patent was filed in September 2012, more than 18 months after the initial Panda update in February 2011.
Wouldn’t that indicate that this patent describes something other than the Panda algorithm?
If you visit a page on the Web that has been optimized for a specific query, chances are that you will see keyword terms or phrases in page titles, in headings, in page content, and so on. You can get a sense of what query or queries someone might have attempted to optimize a page for.
PageRank is based in part on popularity, but the search engines usually refer to it as a “quality” measure. It’s not based solely upon a sheer number of links, but instead is based upon a concept that is analogous to academic citations in scientific papers. The notion that important pages tend to link to important pages isn’t just a vote for the page with the most links, and those votes aren’t equal. A link from the home page of the NY Times is much more important than a link from the Fauquier Times (my local paper that comes out 2 times a week). A link from the NY Times Homepage might be worth hundreds or thousands of links from the home page of a small town newspaper online.
It is definitely quite possible that there are other related patents out there that haven’t been made public yet involving Panda.
Thanks regarding the new look for the site.
Solid read here Bill! I’ve got to admit though, that a lot went over my head. I’m a noob in this space compared to some of y’all mavericks! But man, I completely enjoy reading and learning from your articles. Always thorough and “complete”! I’m especially fond of your aversion to speculation, unlike some other folks in the SEO space, but enough of that, nobody likes folks that name names.. Keeping doing this great work! Also, like one of the other guys up above, gotta congratulate you on the new look even though I do miss the “big fish” :-]! Great job. Modern and clean. Nice.
Thanks for posting this Bill, it makes a lot of sense.
Shortly after Panda was rolled out there was speculation if Google is using a ratio of the number of links / number of pages on a site. A number of links / number of referring queries though seems much more interesting.
You could think of it this way – if a group of pages is receiving a disproportionately high amount of search referrals while having a disproportionately low number of links, what does this say about user satisfaction?
It could also explain why many traditional recommendations of dealing with Panda were unsuccessful – removing duplicate, empty or even low content pages wouldn’t necessarily reduce the number of referring queries. Maybe SEOs shouldn’t have been removing pages that don’t have content, but rather pages that were ranking for large numbers of queries.
As soon as I hit the section of the patent involving “navigational queries”, that nailed it for me that the patent was about the Panda update. There was some talk from Google about the Panda update being a change to a document classifier, and the patents Google has published about navigational queries are all about classifications of documents.
The “reference queries” section isn’t about what queries people are or were finding pages on a site for, but rather what queries the pages appear to be targeting – like the multiple keyword variations that content farms were targeting, such as “How to tie a shoe,” “How to knot a shoe,” “How to put on a shoe.” That’s Panda, and not Penguin, regardless of the name Navneet Panda being on the patent.
Thanks. It’s hard to leave some speculation out, and given the impact that both the Panda and Penguin updates have had on many people and their sites, there definitely are going to be some strong feelings and emotional attachments to specific ideas and theories about how each of those updates work.
Chances are that if there are any other patents that might be involved in the Panda and the Penguin updates, that we won’t see those until they are granted, and sometimes it can take a while for a patent to make it through the prosecution process and become public, so we don’t know what else is out there and when it might become public knowledge.
Thanks, regarding the new look for the site as well. I’m still getting used to it myself. 🙂
Here’s an interesting thought experiment: forget for a moment that this patent has Navneet Panda’s name on it. If you were to read it without knowing the inventor’s name, what Google update would you most associate it with?
A lot of the mechanics described in the patent — especially those regarding the application of a group (site) modification factor to individual search results — fit well with the conventional wisdom about the Panda algorithm.
However, the basic value judgement embedded in the patent, based on the ratio of independent links to reference queries, doesn’t seem to line up well with Google’s public statements about Panda, nor with individual stories of Panda penalty and recovery. Most of those have centered around content quality rather than link factors.
The patent seems to make the assumption that there is a “natural” ratio between the number of independent links to a site and the number of reference queries to that site. That is, for every person who naturally creates a link to any given site, some number of people also perform searches referencing that site.
This makes intuitive sense: people are going to link to a high-quality site, and they’re also going to search for it. Not as many people will link to a low-quality site; nor will they search for it. Regardless of a site’s size, how much traffic it receives, or even its quality, the ratio of links to reference queries probably remains fairly consistent.
Google knows what this ratio is. While it may vary somewhat by niche (“plurality of groups of resources”), the ratio for most sites falls within bounds that are well-understood by Google. This algorithm seeks to penalize those sites that fall outside those bounds.
So what sorts of factors might affect the ratio of independent links to reference queries?
Site quality seems unlikely to affect this ratio. A high-quality site is going to attract both links and reference queries (people searching for the site by name). A low-quality site will not attract many links, but is also unlikely to be searched for by many people. While the absolute numbers of links and reference queries would both vary with quality, the ratio between the two is likely to stay fairly consistent, regardless of site quality.
What sorts of things would affect the ratio of links to queries? The most obvious one is unnatural link building to a low-quality site. Let’s say a site owner launches a massive link-building campaign and doubles the number of independent links to the site. This effort is unlikely to affect the number of people searching for that same site, and would thus result in a big change in the site’s ratio of links to reference queries, which would in turn affect the site’s group modification factor.
What I take away is that this specific algorithm is better-suited for detecting and penalizing unnatural linking than it is for site quality. It would also seem to open the door wide for negative SEO attacks.
In those regards, the core value judgement embedded in the patent seems to be more closely aligned with Penguin than with Panda.
That all said, Navneet Panda’s name is on this patent. I wouldn’t discount the possibility that he is responsible for both of those algorithms, nor that this patent describes an amalgamation of the two.
Thanks. I do like the approach described in the patent.
Pages that are created in a way so that they are purposefully responsive to specific queries isn’t anything unusual, but it seems to have been a characteristic of content farm type sites that the Panda Update seems to be targeting to have groups of pages that reference content in a way that targets lots of specific queries.
As the New York Times article, Google’s War on Nonsense puts it:
Mr. Miller’s job, as he made clear in an article last week in The Faster Times, an online newspaper, was to cram together words that someone’s research had suggested might be in demand on Google, position these strings as titles and headlines, embellish them with other inoffensive words and make the whole confection vaguely resemble an article. AOL would put “Rick Fox mustache” in a headline, betting that some number of people would put “Rick Fox mustache” into Google, and retrieve Mr. Miller’s article.
Getting rid of the low quality or substantially duplicate content pages probably wasn’t a bad idea, and would be something that I would recommend as part of a site audit regardless of Panda. But, it does seem like the target should be on the content aimed not at being quality content, but rather aimed more at inducing people to come to a site for specific queries being targeted by pages – not referring queries, but rather reference queries.
Again, it’s not unusual for people to create pages that seem to be about something that people might search for, but pages that are set up like that, and yet have few or no independent links to them are probably going to rank lower.
If you read the second part of that first sentence, it says “…and has been classified as referring to a resource in the group.”
It doesn’t say that the pages of a site being looked at have to rank for those query terms, though they possibly could. It tells us that Google will look at queries that have been submitted to it, and see if the pages on the site in question appear to be targeted at any of those queries, or “refer” to those queries. From the passage you quoted, the classification of a referral query is the important thing here.
The patent seems pretty clear that a “reference query” is one in which the user is trying to reach a specific site or page:
I’m having a hard time squaring that paragraph with your definition and the shoe-tying example.
The query refers to a resource on the site. So the question is, how does that happen?
There are a few different ways. One would be that specific pages on the site are optimized for those query terms that “refer to the resource.”
Yes, this could be done through on-page optimization, or it could be done by pointing specific anchor text at pages within the resource, or both.
Unfortunately, that doesn’t tell us that this is Penguin or Panda, though.
I agree that whether or not the site ranks for a given query term is unmentioned and irrelevant to this section of the patent.
But I don’t see anywhere in the patent that it discusses the concept of a page referring to or targeting a query. (In fact, I see no mention of on-page factors at all.)
The paragraph I quoted above discusses (and fairly explicitly for a software patent) the concept of a user search query that refers to a specific website, usually by name or URL:
Replacing some of the cumbersome patent-speak with more common terms makes it even clearer:
The query refers to the page or site, not the other way around.
The examples provided in the same paragraph further emphasize that they’re talking about a query referring to a site, rather than a site targeting a query:
I’m not trying to pick a fight here, but the definition of “reference query” seems pretty central to understanding the patent, and I don’t think that your interpretation is supported by the content of the patent.
I am basing what I am saying on a knowledge of how pages rank for something. The patent’s description provides an example of a reference query that does more than just contain the URL or the name of the specific site:
“…queries that contain the terms “example sf” or “esf”, e.g., the queries “example sf news” and “esf restaurant reviews,” can be counted as reference queries for the group that includes the resource whose URL is “http://www.sf.example.com.”
That second example of a reference query, “esf restaurant reviews” doesn’t contain the name of the URL or the name of the site.
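To make the quoted example concrete, here is a minimal sketch of counting reference queries for a group by matching known alias terms. The two aliases come from the patent passage quoted above; the function name, the sample query log, and the whole-word matching rule are assumptions for illustration. The patent does not describe how the aliases themselves are learned, which is exactly the point under discussion here.

```python
def is_reference_query(query: str, aliases: set[str]) -> bool:
    """Return True if the query contains any alias as a whole-word phrase."""
    # Lowercase and collapse whitespace, then pad with spaces so that
    # "esf" matches "esf restaurant reviews" but not "chesfield reviews".
    normalized = " ".join(query.lower().split())
    padded = f" {normalized} "
    return any(f" {alias} " in padded for alias in aliases)


# Alias terms taken from the patent's example for http://www.sf.example.com.
aliases = {"example sf", "esf"}

# Hypothetical query log.
queries = [
    "example sf news",
    "ESF restaurant reviews",
    "restaurant reviews san francisco",
]

reference_count = sum(is_reference_query(q, aliases) for q in queries)
print(reference_count)
```

Under these assumptions the first two queries count as reference queries for the group and the third does not, even though it may be about the same topic.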
I think Google just ignores them when it’s making this calculation, so that the number of independent links isn’t prejudiced by the existence of some site-wide links. The idea seems to be to gauge the quality of pages being examined without the influence of dependent links (under the same control and/or ownership) and without counting multiple links from the same independent sources more than once. In other words, how likely is it that people will link to pages on the site (or groups of resources on the site) without having some kind of relationship with the site? Counting multiple links from the same site can throw that off.
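One way to picture that counting rule is the sketch below. It assumes, purely for illustration, that we already know who controls each linking site; the `Link` record, the owner labels, and the function name are hypothetical, and the patent describes other signals for judging independence that are not modeled here.

```python
from typing import NamedTuple


class Link(NamedTuple):
    source_site: str   # site the link appears on
    source_owner: str  # who controls that site (assumed to be known)
    target_page: str   # page on the site being evaluated


def count_independent_links(links: list[Link], target_owner: str) -> int:
    """Count each independent linking site once, ignoring same-owner sites."""
    independent_sources = {
        link.source_site
        for link in links
        if link.source_owner != target_owner
    }
    return len(independent_sources)


# Hypothetical link data for a site controlled by "owner-a".
links = [
    # A site-wide link from one independent blog: three links, counted once.
    Link("blog.example.net", "owner-b", "/"),
    Link("blog.example.net", "owner-b", "/"),
    Link("blog.example.net", "owner-b", "/news"),
    # A link from a sister site under the same control: not independent.
    Link("partner.example.org", "owner-a", "/"),
    # A single link from another independent site.
    Link("news.example.edu", "owner-c", "/reviews"),
]

independent_count = count_independent_links(links, target_owner="owner-a")
print(independent_count)
```

Here five raw links collapse to two independent sources, so neither a site-wide link nor a network of sister sites inflates the count.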
Could you please cite the portions of the patent upon which you’re basing your conclusions?
I don’t see anything in there to suggest that a reference query is anything other than a query that contains the URL or name of a specific site.
Just a quick thought
“this could keep site-wide links from being counted more than once.”
Do you think it’s more likely Google simply ignores these links, or that sites they determine to not be independent with mass links to each other may actually be caught up in an algo “penalty”?
ESF is definitely an abbreviation for “Example SF”, but it’s not the domain, domain and host name, or site name. Some kind of actual processing had to happen for Google to understand that. If we go back to the first part of that sentence and look at it again, we get a hint of where Google learned that “ESF” refers to the site:
There are a few different places where Google could learn what might be “commonly used by users to refer to the resource.” Those could be query sessions that include searches for the domain, or searches that one or more pages of a site might be optimized for and returned for, such as “esf restaurant reviews.”
Saying that a referring query can also be a navigational query is a little like saying that a square can be a rectangle. Not all referring queries are going to be navigational queries. In the post, I did link to an earlier post I wrote about navigational queries, and a Google patent for navigational queries.
So, what does any of this have to do with the Penguin update? How do you see that fitting in with the process described in the patent?
In that example, “ESF” is an abbreviation for “Example SF” — the name of the site http://www.sf.example.com.
The next paragraph of the patent drives home the point, equating reference queries with navigational queries:
A search engine will not make the association that “national football league” is nfl.com automatically. It needs to do some kind of analysis, and apply some level of confidence that the longer name is the same as nfl. The patent refers to user data as being one source of information about what queries might be classified as referring to resource groups that make up a site. That kind of user data can be helpful in making such a determination.
The patent does refer to classifying queries as being referring queries for a site. When I stated that a website might be used to help determine what those queries were, I was saying that the words on the pages of the site and the words that the pages of the site might be optimized for can be determining factors in that classification of queries as referring queries for a resource.
As for Penguin, I don’t see how this process can be used to determine unnatural linking patterns when it’s easy enough just to analyze the link graph to find things like dense bipartite graphs to find unnatural linking patterns. The 2003 Google patent I mentioned does just that.
It’s not a matter or a question of a specific ratio, but rather just that more independent links make it more likely that a “resource” will be higher quality since they aren’t nepotistic. The more reference queries for a “resource,” the less likely that it’s higher quality. If the count is close to equal for the two (independent links/reference queries), then this initial score is pretty close to one. This really doesn’t act to ferret out unnatural linking patterns in any way – it seems instead to be trying to find good links to determine a final score for a resource.
For all intents and purposes “ESF” is a name for the site in question, in the same way that “National Football League” and “NFL” are both names for www.nfl.com.
As you correctly point out, there’s necessarily a classifier (whose details are not described in this patent) that determines whether or not a term refers to a specific site.
The important point is that it’s classifying certain queries as referring to specific web sites. It’s not classifying web sites as referring to specific queries.
What does all this have to do with Penguin? Not necessarily anything. But comparing the number of people linking to a site (“independent links”) to the number of people searching for it (“reference queries”) seems like a great way to detect unnatural linking patterns. And that seems much more Penguin than Panda. See my earlier comment for details.
I’m fine with agreeing to disagree. Thanks.
Bill, I respect your experience in this area, but in this case I think it’s doing you a disservice and leading you to attribute properties to this process that aren’t supported by the text of the patent.
Of course there’s a classifier, and obviously it’s going to have to look at a variety of factors to determine that “national football league” and “NFL” both refer to nfl.com. But for the purposes of the process described in this patent, how the classifier reaches its conclusions is irrelevant. (If the full classification process was a part of the invention, the patent would have to describe it, which it does not.)
As defined in this patent, “NFL” is a reference query to nfl.com, and only to nfl.com. It does not refer to any other sites, regardless of how much they have tried to optimize for the term. A plain reading of the patent should make that abundantly clear.
If you want to imbue “reference query” with all sorts of properties that aren’t mentioned in the patent, so be it. We can agree to disagree. But the simplest explanation is usually the right one, so I’ll choose to accept the term as it’s actually defined in the patent.
I’m not sure that you could have “always” viewed reference queries the way that Dave had, since there really hasn’t been a reference to them like in this patent anywhere else before that I can locate. The only time you had to begin viewing them that way was in reading the patent.
The definition from the patent is a definition of examples rather than a definition of what a “reference query” actually is, but unfortunately, it’s easy to ignore the part of one example that clearly states that Google would look at things like user data, to equate something like “ESF Restaurant Reviews” with the site in the example.
If reference queries were only queries that included the domain name or host name and domain name or site name, then there just wouldn’t be many reference queries for any one site. That was an example, not a full definition.
Interesting post, very detailed. I’m not sure, to be honest, I don’t particularly study Google in such detail.
There are some things about Google which do interest/bother me, one of which cropped up yesterday on Matt Cutts’ video section. A guy asked him, “How do you separate simple popularity over true authority?” And his response was pretty vague. I remember the Panda update rolling out and the widespread sense of panic which seemed to be about the place. The next one is on the way, no doubt, and I hope they call it Walrus.
Bill, a very interesting exchange of thoughts between you and Dave on “reference queries.” I always understood the term the way Dave does. I truly believe it is about how users use queries to reference resources within a group, and not the other way round.
The problem with content farms was that they had links, but unlike Wikipedia or NYT, people never used as many reference queries for them as they did while searching for resources on Wikipedia or NYT. This is also confirmed by how some sites like wiki.answers.com still rank very well, though they seem to have been created like a content farm as you had described above.
But there are some content farms which still rank well because of their brand (reference queries). Just do a search for “Smallest Bird in the World” and see how ask.com and wiki.answers.com rank on the first page. And if you look at those pages, they actually do the same as what you describe a content farm doing. But they still rank well because of their brand (reference queries) strengths.
Thanks. I did see the video you’re referring to, responding to a question from AJ Kohn (Blind Five Year Old). The question was:
“As Google continues to add social signals to the algorithm, how do you separate simple popularity from true authority?”
Matt purposefully ignored the first part of that question about social signals, but focused upon the “simple popularity vs. true authority” part of the question.
I thought his answer was pretty interesting, especially in light of some of the discussion above, especially when he starts talking about algorithms working to find evidence that some sites might be good matches for specific topical queries.
I don’t know if we will see another big update like Panda or Penguin soon, but we’ll definitely see new ranking signals and approaches. Walrus? 🙂
Bill, I used the word “always” to mean how I understood the term “reference queries” from the time I read it in the patent.
Since “Group” as defined in that patent is address based and is defined as having the same domain name or host name, “reference queries” would mean any phrase that includes terms that refer to a resource in the group. Such terms can only be domain names, host names, or anything widely used by searchers to refer to that group, and this could include abbreviations like NFL. So in the context of the patent, reference queries are determined by search users.
Another interesting development immediately after Panda was introduced was people complaining about sites that copied their content ranking ahead of the pages on their own site when they did a quoted search for longer phrases or sentences unique to their pages. This was said to happen (and I have seen it myself) even if the copying site linked back to the originator. The following gives a clue as to why this problem could have occurred.
“As another example, the system may have access to data that indicates how similar two resources are in one or more aspects, e.g., based on whether the two resources have identical or similar content, identical or similar images, identical or similar formatting, e.g., identical or similar Cascading Style Sheets (CSS), and so on.”
From the above, it is clear that links back from sites copying a substantial portion of the originating site’s content weren’t being considered independent links, and hence the value of such links has been zeroed out for the target site by the “independence” factor of incoming links in the Panda algo.
First of all, thank you Bill for drawing our attention to this patent. I’ve been studying Panda and trying to figure it out since April 2011 – so this was a huge find.
Having analysed the patent in detail – I must confess that my understanding was that “reference queries” was pretty much synonymous with “navigational queries” – i.e. queries intended to find a particular site.
So when you said (above):
The more reference queries for a “resource,” the less likely that it’s higher quality. If the count is close to equal for the two (independent links/reference queries), then this initial score is pretty close to one.
I was surprised – because my understanding was the exact opposite: For a given number of backlinks, the more reference queries a site has, the better quality (or actually the bigger brand) it’s likely to be.
So for example – searchers are more likely to type:
“New York Times Malaysia Airways Flight”
“CNN Barack Obama Speech”
“WebMD Blood Pressure”
rather than “Noname spammy site blood pressure”
and that’s because the big brand sites have built a reputation for quality that searchers recognise – and they have a good experience on those sites that they want to repeat – and hence search for them by name. Hence the correlation Google has found between navigational searches and higher quality.
Remember Matt Cutts said here: https://www.wired.com/2011/03/the-panda-that-hates-farms/
And we actually came up with a classifier to say, okay, IRS or Wikipedia or New York Times is over on this side, and the low-quality sites are over on this side. And you can really see mathematical reasons
So he specifically mentions big brand sites that people are more likely to do navigational searches for.
And that is what we saw after Panda – a huge boost for the major brands.
For all Panda’s shortcomings – this IS quite a clever way of distinguishing between sites people would actively look for – and those who are merely good at SEO.
I’ve also done some research in my own market – and saw some indication that people who did well after Panda have a better ratio of navigational searches to links than those who did badly – (but it’s a very small sample, and certainly not conclusive).
My plan for my Panda-hit site having studied the patent is to try to get more natural navigational searches by boosting our brand recognition. But you’ve read a LOT more patents than me, so I hope I’m not heading in precisely the wrong direction…
I really don’t see how this could possibly be the initial Panda patent. Panda was very clearly about Google making a judgement call on site quality, and I don’t see enough of that described in this patent to justify calling it ‘the’ Panda patent.
Yes, I think this is probably part of the Panda algorithm as it currently stands, but I don’t agree that this describes the initial update.
Do you think it is possible that this was part of an update to the full Panda algorithm (e.g. http://www.seroundtable.com/google-panda-20-15789.html)?
This patent starts off with pages that already rank in response to a query based upon things like relevance and PageRank.
It is completely and totally about site quality and about page classification. Given that, I think it could easily describe the initial update. I don’t know if search engine watch or search engine roundtable had any reasoning behind how they numbered the Panda Updates, but I would guess that their numbers mean little to nothing to Google. I would not call the version of Panda linked to by search engine roundtable in that post the “full” Panda algorithm.
Thanks for another in-depth article (makes my head spin every time, but it’s always a welcome one). I’m on the fence with this, although I’m leaning slightly more toward this actually being Panda than not.
Once again, you’ve shed light for the likes of me who take Google’s patents to be hieroglyphics.
I’d like to cite this as a resource and post this a step simpler in my blog. This has definitely shed some light on co-citation: that there are more specific ways to do it than just having your brand cited in a website.
This has nothing to do with co-citation, so I’d warn you about doing that. It also doesn’t mention the word “brand” at all, and I’ve seen a couple of blog posts written about that which overdo it a lot.
So wait, did I understand this phrase wrong:
“It might also count implied links, which sound more like what we often tend to refer to as citations. An express link can be used to navigate to a place, where an implied link can’t be clicked upon to bring a person to the target of that link.”
Because how I understand it is that if you co-cite in such a way that it is a navigational indication, it affects rankings somewhat.
Correct me if I’m wrong.
I don’t understand what you are trying to say when you write: “Because how I understand it is that if you co-cite in such a way that it is a navigational indication, it affects rankings somewhat.”
I would also seriously suggest not using “co-cite” or “co-citation,” especially if you mean something that might have been written about at moz or seomoz on those things because there were some confusingly written things posted about co-occurrence that used the term “co-citation” incorrectly instead.
That clarifies things. Knowing this, I definitely have to look deeper into the differences between “Co-citation” and “Co-occurrence” so I don’t make the mistake of using one when I mean the other.
Would you be so kind as to help me understand what this phrase means in layman’s terms:
“It might also count implied links, which sound more like what we often tend to refer to as citations. An express link can be used to navigate to a place, where an implied link can’t be clicked upon to bring a person to the target of that link.”
An “implied” link on a page is one where the person who created the page didn’t insert an actual link to that page, but might have instead included the URL for the page (in a non-clickable fashion) or referred to “the homepage of ESPN” or referred to a site in a manner which isn’t clickable. An example is a citation (citation, not “co-citation”) like the kind that we often refer to in local search, where we see the name of a business and some additional geographical information such as a phone number or part or all of a street address.
There have been a couple of articles posted which have made a big deal of express links versus implied links, and some kind of ratio between the two (which is not part of this patent at all, but if you’re not careful when reading, you might think you see it). There is a reference to a ratio, but it’s a ratio of “independent links” to “referring queries.”
It’s really a waste of your time to work out the distinction between co-citation and co-occurrence for the sake of this patent, since the patent has nothing to do with either. If you are going to research the topic, please ignore anything written about co-citation or co-occurrence at Moz though, since Rand just confused a lot of things about it the last couple of times he wrote about it (I hate having to write that, but it’s true). Hopefully Rand will try again, and straighten that out (I can hope).
I see. That explains it.
Knowing this, has there ever been a patent that tells us that co-citation and co-occurrence are factors when determining rankings? Or is it just baseless hype that the big blogs have predicted or, worse, concocted?
Thanks a lot for your time clarifying things Bill.
There are a handful of patents from Google that mention co-citation, but they primarily focus upon how Google might use co-citation to find sites that are similar, like when you see in search results a message that similar sites to one you are looking up might be example.com, example1.com, example2.com, etc. This is because those sites tend to be linked to by many of the same sites (or “co-cited”).
There are over 100 granted patents assigned to Google that discuss co-occurrence. They cover query re-writing and other topics, which can include boosting some results in a set of SERPs for a particular query. I’ve written blog posts on some of them (there are a number of other posts in which I’ve written about co-occurrence as well):
HOW GOOGLE MAY REFORM QUERIES BASED ON CO-OCCURRENCE IN QUERY SESSIONS
RANKING WEBPAGES BASED UPON RELATIONSHIPS BETWEEN WORDS (GOOGLE’S CO-OCCURRENCE PATENT)
HOW GOOGLE MAY SUBSTITUTE QUERY TERMS WITH CO-OCCURRENCE
Thanks a bunch for this.
Stay brilliant 🙂
Great article Bill, I am a bit of a newb when it comes to understanding the way Google works but articles like this really help gain a better understanding of this update.
Thanks for another in-depth article. Thanks again.
I’m a bit late to the party here and I’m not even clear why this is even a question because it was fully and contemporaneously answered by Matt and Amit in the interview here: https://www.wired.com/2011/03/the-panda-that-hates-farms/
For us geeks, the algorithm is pretty clearly explicated as a feature based machine learning algorithm that uses (chiefly) link text, link topology, and agreement with site content as the prime features.
I am pretty certain that this is the Panda patent. I was working with a network of sites that had most of the sites slammed by Panda 1.0. I was always baffled as to why a few of the sites, with just as much problematic content, were untouched and even benefited from the update.
The fact that network links were taken into account in the algo update now makes perfect sense to me. The entire network crosslinked the hell out of each other, but the handful of sites that survived the update actually had a significant amount of good non-network links.
Appreciate your feedback on this one. I’ve started on a followup to this post, and I’ll be looking forward to seeing what you think about that one.
Thanks a lot. I was not clear on Panda at all after reading so many articles around the web, but this is really good. Thanks.
Late to the party, as usual.
Having read through all the discussion, it’s tough to square the patent interpretation presented here. The confusion centers on the definition of “reference query.” Bill points out that “reference query” hasn’t been used previously, so it appears key to understanding what the patent is about.
Starting with the commenters: “Dave” (and someone else seconding him) finds “reference query” synonymous with brand related searches (and navigational queries), and sees a “natural” relationship between the volume of links and the search volume for a particular brand (or brand related terms). A high number of links is only natural if brand related search volume is also high. A site with no brand recognition would be penalized as unnatural if it had a high number of links (which pushed it to the top), but a paltry volume of brand searching or awareness.
The depressing aspect of this interpretation is that the strong get stronger. Strong brands are allowed a high number of links and thus high search visibility. Seems like a horrible search engine to me.
In contrast, Bill reads “reference query” in a broader, more elastic context. It’s the breadth of terms a page is seeking to target. An inverse relationship between broad targeting and a narrow “independent” link base signifies low quality (that is, a ratio of less than 1). The brand related queries used as examples in the patent obscure this, because they do suggest brand specificity, but reference queries need not only be about brand specificity.
Anywho, Leslie Rohde, my optilink won’t update, you’re killing me smalls 😉 😉
Take care everyone.