Early Google Panda Patents

Might Google lower the rankings of a page in search results if it detects unusual patterns related to clicks on advertisements on that page, or might Google use a ranking algorithm that can be tested against such unusual click patterns to lower the rankings of pages in search results? A Google patent granted today is the first that I can recall seeing that suggests that information about clicks on ads might cause pages to be lowered in web search rankings or removed from search results altogether:

Once the document engine 146 determines the likelihood that an article is a manipulated article, the method 400 ends. The likelihood that an article is a manipulated article can be used in a variety of ways. For example, the information that an article is likely a manipulated article can be used to lower a ranking associated with that article such that the article will be displayed lower in a listing of search results or not displayed at all*.

Alternatively, the information that an article is likely a manipulated article can be used to test ranking algorithms.*

For example, it may be desirable to use ranking algorithms that function independently of the click-through data associated with an article, but that nevertheless attempt to lower manipulated articles within a listing of search results. The information obtained from the method 400 that an article is likely to be a manipulated article based on the click-through data can be used to test the effectiveness of a ranking algorithm that functions independently of the click-through rate.

For example, if the method 400 determines that articles A, B and C are associated with high click-through rates and therefore are likely to be manipulated articles, this information can be compared to the ranking determined by an algorithm independent of the click-through data associated with the articles for the articles A, B and C. If the articles A, B and C are similarly ranked lowly by an algorithm independent of the click-through rate, this can be an indication that the independent algorithm effectively identifies manipulated articles.

* My emphasis
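
The testing idea in the quoted passage can be sketched in a few lines. This is only an illustration under stated assumptions, not the patent's implementation: the function names, the click-through threshold, and the "bottom half" cutoff are all hypothetical. It flags articles with unusually high ad click-through rates, then checks how many of them a ranking algorithm that ignores click data already places low.

```python
from typing import Dict, List, Set


def flag_by_ctr(ctr_by_article: Dict[str, float], threshold: float) -> Set[str]:
    """Flag articles whose ad click-through rate exceeds a (hypothetical) threshold."""
    return {article for article, ctr in ctr_by_article.items() if ctr > threshold}


def fraction_ranked_low(flagged: Set[str], ranking: List[str],
                        bottom_share: float = 0.5) -> float:
    """Share of flagged articles that the independent ranking puts in its bottom portion."""
    if not flagged:
        return 0.0
    cutoff = int(len(ranking) * (1 - bottom_share))  # first index of the bottom portion
    bottom = set(ranking[cutoff:])
    return len(flagged & bottom) / len(flagged)


# Made-up click-through rates for six articles; A, B and C are the outliers.
ctr = {"A": 0.40, "B": 0.35, "C": 0.30, "D": 0.02, "E": 0.03, "F": 0.01}

# Ranking (best first) from an algorithm that never looked at click data.
independent_ranking = ["D", "F", "E", "A", "C", "B"]

flagged = flag_by_ctr(ctr, threshold=0.25)
agreement = fraction_ranked_low(flagged, independent_ranking)
```

A high agreement value would suggest that the click-independent algorithm is already pushing down the same articles that the click data marks as likely manipulated, which is the kind of check the patent describes.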

The claims from the patent don’t include specific language about manipulated articles, but they describe methods that could be used to identify two kinds of situations. In one, site owners publish low quality content that might rank well in search engines, alongside advertisements that may appear to lead to higher quality content. In the other, site owners practice search arbitrage: they pay for lower cost advertising to send people to pages with low quality content, anticipating that a number of visitors will click upon higher paying advertisements on those pages. The patent description elaborates on methods that might be used to identify manipulative pages.

For instance, Google may look at the paths that people follow to arrive at an advertisement that they click upon, such as clicking upon an ad in search results, and then quickly clicking upon a more expensive ad on the landing page that the first ad sent them to. The patent defines manipulated articles as including articles designed to rank artificially high in search results, often for popular query terms. The publishers of these manipulated articles may also automatically create links to those articles from other articles to have them rank more highly, or sometimes show different content to a web crawler than they show to other visitors.

These manipulated articles may contain content designed to help those pages rank well for web search and to have them generate content ads associated with those key terms, without really providing substantive information. Because of the lack of quality content on those pages, visitors to those pages will frequently choose an advertisement on the page to find more useful pages.
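
The path pattern described above (arrive through a cheap ad, then quickly click a more expensive ad on the landing page) could be checked with something like the following. It is a minimal sketch with assumed data: the `AdClick` fields, the cost values, and the ten-second dwell limit are hypothetical and do not come from the patent.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AdClick:
    page: str         # page on which the ad was clicked
    cost: float       # cost of the click to the advertiser (hypothetical field)
    timestamp: float  # seconds since the start of the visitor's session


def looks_like_arbitrage(arrival: Optional[AdClick], on_page: AdClick,
                         max_dwell_seconds: float = 10.0) -> bool:
    """True when a visitor arrived via a cheaper ad and quickly clicked a pricier one."""
    if arrival is None:
        # The visitor did not arrive through an ad, so there is no ad-to-ad path.
        return False
    dwell = on_page.timestamp - arrival.timestamp
    return 0 <= dwell <= max_dwell_seconds and on_page.cost > arrival.cost


cheap_ad = AdClick(page="search results", cost=0.05, timestamp=0.0)
quick_pricey_click = AdClick(page="landing page", cost=1.20, timestamp=4.0)
slow_pricey_click = AdClick(page="landing page", cost=1.20, timestamp=95.0)
```

A single path like this proves little on its own; a page would presumably only look suspicious if a large share of the paths through it followed the same pattern.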

The patent is:

Methods and systems for establishing a keyword utilizing path navigation information
Invented by Pavan Desikan
Assigned to Google Inc.
US Patent 8,005,716
Granted August 23, 2011
Filed June 30, 2004

Abstract

Systems and methods for determining and utilizing path navigation information. In one aspect, a method includes determining an article containing at least one item, determining a path associated with the article, and identifying at least one term associated with the at least one item based at least in part on the path.

While this patent describes a way to identify low quality content pages such as content farms through higher than expected clickthroughs or through navigational paths that might start with a click on a lower cost ad to a page with low quality content and higher earning advertisements, it doesn’t mention the kind of “quality scores” that we’ve learned to associate with Google’s Panda. In my quote from the patent above, it does say that the clickthrough data might be used to test algorithms that might identify manipulated articles.

If we look at a Google patent filed around a year later which shares an inventor with the above patent, Pavan Desikan, we do see the concept of quality scores being assigned to pages where advertisements might be published. That patent is:

Reviewing the suitability of Websites for participation in an advertising network
Invented by Pavan Kumar Desikan, Lawrence Ip, Timothy James, Sanjeev Kulkarni, Prasenjit Phukan, Dmitriy Portnov, and Gokul Rajaram
Assigned to Google, Inc.
US Patent 7,788,132
Granted August 31, 2010
Filed June 29, 2005

Abstract

The way in which Websites are reviewed for use in an advertising network may be improved by

(a) accepting a collection including one or more documents,
(b) determining whether or not the collection complies with policies of an advertising network, and
(c) approving the collection if it was determined that the collection complies with the policies.

The collection may be added to the advertising network if the collection is approved such that (e.g., content-targeted) advertisements may be served in association with renderings of documents included in the collection. The collection may be a Website including one or more Webpages. The policy may concern

(A) content of the one or more documents of the collection,
(B) usability of a Website wherein the collection of one or more documents is a Website including one or more Webpages, and/or
(C) a possible fraud or deception on the advertising network or participants of the advertising network by the collection.

Policy compliance violations or low quality scores might be used to flag a page for manual review or remove it from the advertising network.

The patent provides a number of general classifications of policy violations, such as violations related to:

  • Content of the Website (is the content too general to provide specific targeting, is there not enough content, is the content bad or controversial, etc.),
  • The publisher or source of the Website,
  • Usability of the Website (is it under construction, or does it contain broken links, or slow loading pages, or improper use of frames, etc.) and
  • Fraud (e.g., attempting to defraud advertisers and/or the ad network).

The patent includes a number of examples of the kinds of content that might not be acceptable under the policies of the advertising network, and many of those are very similar to those included in the section on “Content Guidelines” on the Google AdSense Program Policies page.

More interestingly, we are told that a manual list of websites with “a given type or class of policy violation may be used to train an expert system (e.g., neural networks, Bayesian networks, support vector machines, etc.) to classify other Websites as having the policy violation or not.” That classification system could look for certain words or phrases and images that might indicate policy violations.
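
To make the idea of training on a manually compiled list concrete, here is a small sketch. It swaps in a plain naive Bayes word model written from scratch as a stand-in for the expert systems the patent lists; the training texts, function names, and the use of words alone (no images) are assumptions for illustration, and it expects at least one example of each label.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple


def train(examples: List[Tuple[str, bool]]) -> Dict:
    """Count words per label from manually labeled (text, violates_policy) pairs."""
    word_counts = {True: Counter(), False: Counter()}
    doc_counts = Counter()
    for text, violates in examples:
        doc_counts[violates] += 1
        word_counts[violates].update(text.lower().split())
    vocab = set(word_counts[True]) | set(word_counts[False])
    return {"words": word_counts, "docs": doc_counts, "vocab": vocab}


def predict(model: Dict, text: str) -> bool:
    """Return True if the text looks more like the policy-violating examples."""
    scores = {}
    total_docs = sum(model["docs"].values())
    for label in (True, False):
        counts = model["words"][label]
        total_words = sum(counts.values())
        score = math.log(model["docs"][label] / total_docs)  # class prior
        for word in text.lower().split():
            # Laplace smoothing so an unseen word does not zero out a class.
            score += math.log((counts[word] + 1) / (total_words + len(model["vocab"])))
        scores[label] = score
    return scores[True] > scores[False]


# Made-up seed list standing in for a manually reviewed set of Websites.
seed_list = [
    ("click here click ads win prizes", True),
    ("click ads earn cash click now", True),
    ("recipe for sourdough bread with flour and water", False),
    ("guide to repairing a bicycle tire", False),
]
model = train(seed_list)
```

With this toy seed list, `predict(model, "click ads now")` comes out as a violation and `predict(model, "bread recipe with flour")` does not. A real system would need far more examples and richer features than word counts.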

Usability and other website violations that might be detected upon pages could include:

  • Websites with domain name server (DNS) errors (such as, for example, URL does not exist, URL down, etc.)
  • Websites with broken links,
  • “Offline” Websites,
  • “Under construction” Websites,
  • Websites with excessive pop-ups/pop-unders (e.g., more than N (e.g., one) pop-up or pop-under ad on any given webpage load),
  • Chat Websites,
  • Non-HTML Websites (e.g., directory indexes, flash, WAP),
  • Spy ware Websites,
  • Homepage takeover Websites,
  • Websites that try to install malicious software on the user’s computer,
  • Websites that affect usability by disabling the back button,
  • Excessive popups/pop-unders

We’re told that “pay-to-click” Websites (e.g., those whose main purpose is to get people to select pay-per-click ads), may also be policed under the advertising policy guidelines, and that those might also be identified by a machine learning process trained upon well known “pay to click” websites. For example, we are told that “click-spam Websites typically have content generated using a template and/or a high ads to text ratio.”
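
The “high ads to text ratio” signal is easy to picture as a feature. The sketch below is an assumption about how such a ratio might be computed (ads per word of text) and thresholded; the patent does not give a formula, and the 0.05 cutoff is invented for the example.

```python
def ads_to_text_ratio(ad_count: int, word_count: int) -> float:
    """Ads per word of visible text; a page with ads and no text gets an infinite ratio."""
    if word_count == 0:
        return float("inf") if ad_count else 0.0
    return ad_count / word_count


def is_click_spam_candidate(ad_count: int, word_count: int,
                            max_ratio: float = 0.05) -> bool:
    """Flag a page for review when its ratio exceeds a hypothetical cutoff."""
    return ads_to_text_ratio(ad_count, word_count) > max_ratio


thin_page = is_click_spam_candidate(ad_count=12, word_count=100)   # 0.12 ads per word
long_page = is_click_spam_candidate(ad_count=3, word_count=1200)   # 0.0025 ads per word
```

In practice a ratio like this would more likely be one input among many to a trained classifier than a rule applied on its own.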

The patent also provides some other things they might use as quality criteria when looking at a site, such as:

  • Usage data from the advertising network or other sources such as impressions, selections, user geolocation, and conversions
  • Whether the site is in the advertising network, or not
  • Popularity, possibly as measured by Google toolbar for example
  • Website spam (“link farms”, hidden text, etc.)

Conclusion

If the aim of Google’s Panda upgrades was to reduce the search rankings of pages that are created to rank highly in search results to show off low quality content and high earning advertisements, then the use of quality scores to identify those manipulative articles fits very well into the processes behind these patents.

The questions that Google Fellow Amit Singhal poses in the Google Webmaster Central Blog article More guidance on building high-quality sites are probably the best place to go to when trying to identify the kinds of things that might negatively impact a page or site under Panda, and how to improve the rankings of those pages. The aim behind them is to convince publishers to provide high quality content on their pages.

I suspect that Google doesn’t use clickthrough data directly to determine which pages might be manipulative pages, but may be using that information to test a machine learning-based Panda algorithm, as the path navigation information patent suggests as a possibility.

24 thoughts on “Early Google Panda Patents”

  1. Interesting Bill – publishing low quality content which encourages higher rankings in search but encourages a user to click away to another site earning the provider of low quality content an incentive.

    With my cynical hat on it sounds a bit like most financial aggregators’ business model…

    I quickly scanned the patent and this bit interested me:

    In another approach, the abovementioned types of violations may be discovered by measuring a deviation of the actual click-through rate on the examined Website from the average click-through rate for all Websites. Moreover learning algorithms (naive Bayes, SVMs, etc.) may be employed. For example, such algorithms may be used to train networks using the statistics of some known “pay-to-click” Websites. Such trained networks may then be used to identify new Websites.

    The idea of using the universal click through rate of all websites as well as using historical data for previously discovered “pay-to-click” websites seems somewhat devious, but also somewhat obvious…

    I think I’m having a slow day today but this is a good find!

  2. It looks like the first patent describes an attempt to filter out arbitragers. If your guess that the second patent improves the process by assigning quality scores to sites with advertising is correct, then it makes perfect sense that this could be one of the signals the Panda algorithm uses.

  3. I agree with the conclusions you’ve drawn here regarding Google’s use of the patents. They’re all about Google protecting its results from low quality content farms that exist to produce revenue through arbitrage of PPC visitors. Not only do these farms compromise the results, but they are extracting revenue that presumably Google would like to capture by removing the middle-man.

    Bottom line, we’ve all been annoyed by these shallow content pages/farms that consist of mostly ad-words and as a Google user, it would be an improvement to at least push them to the second page of the SERPS, if not further back. Hopefully they aren’t mistaking too many quality sites for farms.

  4. I have always seen a flaw in google ads and this latest idea to potentially penalise sites for excessive click throughs is one of them.

    Ok …

    1. site A has google ads..

    2. a major competitor does underhand stuff and outsources( from 100s of different IPs) for major click throughs on site A ads..

    3. google penalizes site A

    4. and BINGO the task is completed by major competitor to get site A out of the running in search

    It’s a MAJOR flaw so let’s hope google can see this.

  5. Content farms are a pain in the a** and digital media is the biggest one of them. I would love to see Google take them down…

  6. Hi Tom,

    The aspect of the patent involving using a seed set of known “pay-to-click” websites to identify features related to those sounds very much like an aspect of how Google’s Panda works.

    In the Amit Singhal blog post on the Google Webmaster Central blog, More guidance on building high-quality sites, we’re given a set of questions upon which features could be weighed on sites to determine a quality score for those pages, based upon a machine learning approach.

    The path navigation information patent is the first I’ve seen from Google ever that suggests that patterns around clicks upon ads on pages might be used to lower those pages in search rankings. I suspect that Google isn’t using that information directly though, and as the patent suggests is using it to test the algorithms that they’ve come up with as a feedback mechanism.

  7. Hi Michael,

    The first patent definitely includes a way to identify search arbitrage, by tracking the path that someone followed to their click on an ad, and whether they arrived at that ad through another lower cost advertisement and a low quality content page.

    The second patent provides a way to determine whether or not a publisher’s page is an appropriate one to show advertisements at all, and the quality score for publishers’ pages that it describes sounds very much like the approach that Panda is using, after adding an additional set of features to consider that might be helpful in further distinguishing the quality of pages by looking at what higher quality pages might contain that lower quality pages don’t.

  8. Hi Jonathan,

    Ultimately, I believe that Google wants to show the best results that it can for searchers’ queries, and if it can find a way to show higher quality results higher in search results, then it makes sense for them to do so, even if the lower quality pages might result in more clicks and revenue for Google.

    If Google’s search results are dominated by high ranking and low quality pages aimed solely at getting people to click upon advertisements, then fewer people are going to use Google to search with. Unfortunately, it seems that Google’s attempts to do this include some sites that provide high quality content, and there is some collateral damage going on.

  9. Hi Karen,

    I’d like to think that Google has come up with some ways to tell that some clicks are motivated by bad intentions and have developed ways to ignore some clicks as attacks on advertisers.

    But, I also think that it’s more likely that Google is using information about clicks upon advertisements as a way of testing other algorithms, like the Panda algorithms, to determine the quality of content found on pages that carry ads. I think they realized that the potential problem that you describe exists, and that just looking at those clicks wouldn’t be a good idea.

  10. Hi Steph,

    I think that there’s room for sites on the Web that provide useful and helpful information on a wide variety of topics, and I’d love to see the kinds of sites that we often refer to as content farms evolve into places that focus more upon the quality of their content rather than the potential for their pages to earn advertising dollars.

  11. To paraphrase an SEO teacher I used, if the approach does not feel right then it probably is not right. Trying to trick Google or any search engine is certainly a “not right” approach that will eventually send the site to the trash can.

  12. Hi Allen,

    Demand Media had a recent IPO based upon a business model that could collapse at any point in time, providing less than high quality content to searchers so that they might click upon advertisements found on those pages. There are many other sites using similar approaches, and I’m not sure that you could call what they are doing morally wrong, but it also isn’t all that useful to people who might land on their pages. Unfortunately, many of those sites have been profiting from their approach for years, and will continue to do so until the well runs dry. Then, I suspect, they’ll look for a new well.

  13. Hi Bill

    I hope you’re right about google being able to differentiate the clicks motivation on sites. It seems to me though that this is all about closing the gate after the horses have bolted. IF google had been more thorough in the first instance about checking all sites BEFORE allowing them to subscribe to ads then all of this could have been prevented. If you think about a quality ethical business who has an affiliate program … they check all the sites to assure themselves that their ads( affiliate link ads) are placed on respectable sites.

    Maybe it’s easier for them to go back to the drawing board and employ lots of people to check every site one by one and close down the adsense accounts? It’s one way of doing things and the spam sites would soon give up as they can no longer make any money. Then all this scraping content and spamming genuine sites would come to an end also.

  14. This makes me feel a bit scared how Google can change everything that we are used to. I can say that they are already monopolizing the search industry – they are the ones in control of SEOs/web-admins.

  15. Hi Guys
    Firstly, I cannot really see how google can get it right when filtering spammy websites with no content. Even after the recent update there are still hundreds of thousands of websites on the first or second page of results with no content and full of affiliate links. The only thing they have is a good volume of incoming links, which of course they have paid for (which again is against google’s rules).
    Secondly, what about the first 3 PPC results? Normally they are just spammy sites with no content, only playing probability games that the commissions they will potentially earn from the visitor are less than the click will cost them. Yes, of course they pay google for the privilege of being on top, but if google is so much after relevant results for users, should this not also be part of the relevancy?

  16. It’s funny to observe how signals reach out to elements outside of the content itself.
    In other words, Google determines low quality content by looking at anything but the stuff itself.
    Dead end for a search engine to understand what it reads ?

  17. You nailed it LaurentB, the Goog has everything so quantified that they fail to look at the actual subjects (sites) themselves. The same goes for their AdSense partners, they need to filter out the blatantly, low quality sites already. I’ve nearly lost confidence with their content network.

  18. Hi Karen,

    How do you know or predict what people might publish on their sites when their site is fairly new, or anticipate that they might change over time? I’m not sure. For example, I believe that eHow was a subscription only site that focused upon providing higher quality content until it was purchased. After the sale, the new owners removed the subscription requirement and added advertisements. If Google judged them upon the older quality content, then having them show advertisements wouldn’t be a problem.

    Approaches like Panda are an attempt to automate the process of reducing rankings for pages focusing upon earning advertisement dollars while showing lower quality content. I don’t think that Google could run AdSense if they manually checked each publisher’s site.

  19. Hi Mel,

    Google doesn’t control what I do as an SEO, though they do influence decisions that I make.

    Definitely one thing to do to alleviate those types of concerns is make sure that a site doesn’t rely too much upon being found in the search engines, and that there are other ways to get visitors to come to a web site.

  20. Hi John,

    I agree that I still see some pretty bad sites showing up in search results. I can’t say that I click upon many advertisements, so I have no idea of the quality of the landing pages attached to those.

    I have seen a lot of people writing about how their sites have been impacted by Panda in search results, and some of the traffic charts and analysis that I’ve seen people publish indicate that a good number of sites someone might refer to as content farms have lost a fair amount of traffic as well.

    I do tend to be fairly satisfied with a lot of the searches that I do on Google.

  21. Hi Laurent

    There are two different ways that this patent describes how Google can do things. One of them is to look at clickthroughs on advertisements or at the paths that lead to those clicks to see if they are search arbitrage in action, and possibly use that information to lower pages in rankings. I don’t prefer that approach.

    The other method is to come up with a ranking algorithm that might predict those clickthroughs, and then use the click information to test how well the algorithm works. I prefer that approach because I think it stands a much better chance of resulting in higher quality pages.

    It really looks like Google is following the second approach, and I see nothing wrong with it. It focuses upon the content found on the pages rather than how people might use those pages. If you read all of the Google Blog posts about Panda, and the Wired interview with Amit Singhal and Matt Cutts, it does appear that they are focusing upon the content, and not the user behavior associated with that content.

  22. Hi Isaac,

    By all appearances, what you suggest appears to be what Google is trying to do when it comes to Panda and low quality sites. They seem to be moving in the right direction.

  23. well, good i had a little read here, never thought about this myself, but I still believe that 2 elements:
    1) google needs to show quality result to keep their position
    2) common sense
    explains about 99% of what google is up to

    for example; they were getting under attack for showing scam sites, so they introduced more brands … in other words; simply creating a quality site and genuine link building still works

    all google tries to do is to keep the cheaters out … so far; I’m still trusting google

    although it’s true and sad that whenever they make significant changes there’s always collateral damage

  24. Hi Michiel,

    All good points; I agree with you.

    Some kind of effort like Panda was necessary for Google. They had been targeted in some really high profile places about the quality of search results that they were showing.

    As for brands being more prevalent, sites that take the time and make an effort to build strong brands also seem to think about signals involving things like trust, credibility, quality, etc., and find ways to incorporate those into their sites.

    Unfortunately, there does seem to have been some collateral damage that happened with Panda.
