How Google Might Filter Out Duplicate Pages from Bounce Pad Sites

I hadn’t heard the term “bounce pad” applied to websites before, but it’s useful to know the language of search engines, and the things they might look for when crawling and indexing webpages and serving results to searchers. Determining whether a site is a bounce pad involves an analysis of the redirects appearing on the site, as in the image below from a Google patent granted this week:

Screen shot from the bounce pad patent showing calculation of redirect score and spam score to determine whether a site is a bounce pad.

One of the mysteries associated with Google’s search results is how it determines which pages to show when there are duplicate or substantially duplicated documents within its index. A search engine doesn’t want to show searchers a list of search results that contains substantially the same pages, so when it finds pages that are pretty close to being the same, it will create a “cluster” of those pages and choose a representative page to display.

That kind of duplication can happen for a number of reasons: someone copying content from another page (with or without permission or license to do so); most of the content on a page being a manufacturer’s or publisher’s description; a content management system set up so that the same page gets published more than once at different URLs; content being republished on a mirror site, or on sites set up so that the others can handle overflow if one of them gets too much traffic; and more.

The Google patent describes one way the search engine may filter out some choices of pages, based upon whether or not those pages appear upon sites that the search engine considers bounce pads, meaning sites that contain a high number of redirects that tend to redirect to multiple other pages, usually on other sites. Bounce pad sites are usually considered to contain documents belonging to spammers, especially since many spammers will copy document content and frequently attempt to pass it off as their own.

Search engines don’t necessarily even want to index all of the duplicate documents that they find, since doing so wastes space in their indexes. But they also strive to index content from the legitimate sources of that content, while not showing duplicated content documents from spammers.

Types of Redirects and Uses

When I analyze a website, one of the things that I look at is how the site might be using redirects. It’s usually a good idea to remove as many unnecessary internal redirects from a site as possible to speed up a site, reduce the work that a search engine crawling program needs to do when crawling the pages of a site, and get a better picture of how the pages of a site relate to one another. I have seen at least one site that had chains of redirects at least 8 levels deep, which caused search engine crawlers to balk at indexing the pages of that site.
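To make the idea of redirect chains concrete, here is a minimal sketch that measures how many hops a URL goes through before it settles. It assumes you already have a crawl export turned into a plain dictionary of source URL to target URL; the `redirects` mapping, the function names, and the `max_hops` cutoff are all illustrative choices of mine, not anything a search engine documents.

```python
def redirect_chain(start_url, redirects, max_hops=20):
    """Return the list of URLs visited when following redirects from start_url.

    `redirects` is a hypothetical dict of {source_url: target_url}.
    Following stops at a URL with no redirect, after max_hops hops,
    or as soon as a URL repeats (a redirect loop).
    """
    chain = [start_url]
    seen = {start_url}
    current = start_url
    while current in redirects and len(chain) - 1 < max_hops:
        current = redirects[current]
        chain.append(current)
        if current in seen:  # loop detected, stop following
            break
        seen.add(current)
    return chain


def chain_depth(start_url, redirects, max_hops=20):
    """Number of hops taken from start_url (0 means no redirect at all)."""
    return len(redirect_chain(start_url, redirects, max_hops)) - 1


# Example data: /a redirects to /b, which redirects to /c, then /d.
example_redirects = {"/a": "/b", "/b": "/c", "/c": "/d"}
depth = chain_depth("/a", example_redirects)
```

Pages with a depth greater than 1 are candidates for pointing the first URL straight at the final destination.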

The patent itself defines redirects very broadly as a way of sending a browser (or search engine) from one web address to another, and provides a number of classifications for redirects:

Manual redirects – a document explicitly requests that a visitor follow a link to another page.

Status code redirects – A browser (or search engine user agent) receives an HTTP status code telling it that the document at the address it is trying to visit has moved to a new address, either temporarily (a 302 status code) or permanently (a 301 status code).

Meta refresh redirects – a Meta tag that tells a web browser to replace a source document with a target document after a delay specified in the meta tag (the amount of delay can be listed as “0”).

Frame redirects – A source document includes an HTML frame that contains a target document.

JavaScript redirects – A document contains JavaScript that causes a web browser to redirect to a target document when the JavaScript is executed by the web browser.

Full screen pop-ups – A source document causes a full screen popup of a target document to be displayed.
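A few of the classifications above can be spotted from a fetched response alone. The sketch below is only a rough illustration, assuming you already have the HTTP status code and the HTML body of a page in hand. The function and class names are hypothetical, it covers only the 301 and 302 codes mentioned above, and it makes no attempt to detect JavaScript redirects or pop-ups, which would require actually executing the page. A frame on its own is also only a hint, since plenty of framed pages aren't redirects.

```python
from html.parser import HTMLParser

# Only the two status codes discussed in the list above.
REDIRECT_STATUS_CODES = {301: "permanent", 302: "temporary"}


class RedirectHintParser(HTMLParser):
    """Records whether a page has a meta refresh tag or an HTML frame."""

    def __init__(self):
        super().__init__()
        self.meta_refresh = False
        self.has_frame = False

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        attr_map = {name: (value or "") for name, value in attrs}
        if tag == "meta" and attr_map.get("http-equiv", "").lower() == "refresh":
            self.meta_refresh = True
        elif tag == "frame":
            self.has_frame = True


def classify_redirect(status_code, html=""):
    """Return a label for the kind of redirect detected, or None."""
    if status_code in REDIRECT_STATUS_CODES:
        return "status code redirect (%s)" % REDIRECT_STATUS_CODES[status_code]
    parser = RedirectHintParser()
    parser.feed(html)
    if parser.meta_refresh:
        return "meta refresh redirect"
    if parser.has_frame:
        return "frame redirect"
    return None
```

For example, `classify_redirect(200, '<meta http-equiv="refresh" content="0; url=https://example.com/">')` is labeled as a meta refresh redirect, while a plain page with a 200 status returns `None`.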

Some Reasons for Using Redirects

Redirects might be used for legitimate reasons, and they may be used for malicious reasons.

For instance, if you change the domain that your content appears upon to a new domain, you’ll want to include the proper redirects from the addresses at the old domain to the new one (while actually changing the on-site URLs to reflect the new domain on the pages of your site). This is like informing the post office that you’ve moved to a new address so that the mail will be sent to that address.
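As a small illustration of the domain-move case, the sketch below works out where each old address should point, keeping the path and query string intact so that every old URL lands on its direct counterpart rather than on the new home page. The host names and the function name are made up for the example, and actually issuing the 301 responses is left to whatever server or CMS you use.

```python
from urllib.parse import urlsplit, urlunsplit


def new_domain_url(old_url, old_host, new_host):
    """Map a URL on old_host to the same path and query on new_host.

    Returns None when the URL is not on old_host, so unrelated
    addresses are left alone.
    """
    parts = urlsplit(old_url)
    if parts.hostname != old_host:
        return None
    return urlunsplit(
        (parts.scheme, new_host, parts.path, parts.query, parts.fragment)
    )


# Hypothetical move from old.example to new.example.
target = new_domain_url(
    "https://old.example/widgets/blue?ref=nav", "old.example", "new.example"
)
```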

If you notice in your error log files that people are trying to reach a certain page on your site, but are following a wrong address because of a typo or common misspelling, you could set up a redirect so that those visitors find the correct pages.

If your domain name is commonly misspelled, you might want to register the misspelling of the address and redirect traffic to the correct domain.

The patent tells us that some people may also use redirects for malicious reasons, such as attempting to fool a search engine into serving a spammer’s page, or confusing visitors as to which web page they are on to try to get them to reveal private information as part of a phishing attack.

Implications of Being Seen as a Bounce Pad Site

This patent doesn’t actually look at whether the content of the pages of a site includes actual web spam, but rather looks at how frequently redirects are used on a site, and how frequently they point to other domains, and may classify sites as bounce pads based upon that analysis. If the content on a page from a site identified as a bounce pad is substantially similar to content found on another page that isn’t on a bounce pad site, the page from the bounce pad site would be filtered out of search results.
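The filtering step described above can be sketched in a few lines. This is only my reading of the behavior, not code from the patent: it assumes a cluster of URLs already judged to have substantially the same content, and a set of host names already classified as bounce pads. Both inputs and the function name are hypothetical.

```python
from urllib.parse import urlsplit


def filter_duplicate_cluster(cluster_urls, bounce_pad_hosts):
    """Drop pages on bounce pad hosts from a cluster of duplicate pages.

    A bounce pad page is only filtered when the same content exists on
    at least one page that is NOT on a bounce pad site; if every page
    in the cluster is on a bounce pad, the cluster is returned unchanged.
    """
    kept = [
        url for url in cluster_urls
        if urlsplit(url).hostname not in bounce_pad_hosts
    ]
    return kept if kept else list(cluster_urls)
```

Choosing which of the remaining pages to actually display is a separate decision that this sketch doesn't attempt.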

In simple terms, a site might be considered a bounce pad site based upon the following analysis:

(1) The content of a site, including information associated with the pages such as addresses, metadata (e.g., refresh meta tags, refresh headers, status codes, etc.), may be analyzed to determine whether or not the documents are redirects.

(2) A redirect score might be given to the site based upon the number of pages that are redirects compared to the number that aren’t.

(3) A spam score might be given to the site based upon how many different targets the site might have, with those targets broken into head targets (the three sites most commonly redirected to) and tail targets. For instance, Site A includes 100 redirects to Site B, 60 redirects to Site C, and 40 redirects to Site D. It also includes 20 redirects to Site E, 20 to Site F, 10 to Site G, and 5 each to Sites H, I, J, and K. The head targets (B, C, D) amount to 200 redirects, and the tail targets (E, F, G, H, I, J, K) amount to 70 redirects.

The patent tells us that it might not count the redirects between sites that are associated with the same organization, so if Site A is “example.com” and Site B is “example.co.uk” it might not include the redirects to Site B in its analysis.

It might then calculate a ratio of tail targets over head targets to come up with a spam score. So, if we don’t count the redirects to Site B and treat Sites C and D as the remaining head targets, the spam score might be 70/100, or 0.7. As an alternative, which isn’t explained well by the patent, the process might instead use a ratio of head targets over tail targets to come up with a spam score, or 100/70.

(4) This combination of redirect score and spam score might be used to come up with a determination of whether or not a site is a bounce pad.
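The arithmetic in steps (2) and (3) can be worked through in code. What follows is a sketch under several assumptions of mine rather than the patent’s actual formulas: the redirect score is expressed as the fraction of all pages that are redirects, the head targets are simply the top three by redirect count, and same-organization targets are dropped after the head and tail have been chosen, which is the ordering that reproduces the 70/100 figure above. All function names are hypothetical, and the final bounce pad decision is left out because the patent doesn’t spell out how the two scores are combined.

```python
def redirect_score(redirect_pages, total_pages):
    """Fraction of a site's pages that are redirects (an assumed definition)."""
    if total_pages == 0:
        return 0.0
    return redirect_pages / total_pages


def head_tail_totals(target_counts, head_size=3, same_org=()):
    """Split redirect targets into head and tail and total each group.

    target_counts maps a target site to the number of redirects to it.
    Targets listed in same_org are ignored when totalling.
    """
    ranked = sorted(target_counts.items(), key=lambda item: item[1], reverse=True)
    head, tail = ranked[:head_size], ranked[head_size:]
    head_total = sum(count for site, count in head if site not in same_org)
    tail_total = sum(count for site, count in tail if site not in same_org)
    return head_total, tail_total


def spam_score(target_counts, head_size=3, same_org=()):
    """Ratio of tail redirects over head redirects."""
    head_total, tail_total = head_tail_totals(target_counts, head_size, same_org)
    if head_total == 0:
        return 0.0
    return tail_total / head_total


# The Site A example from step (3).
site_a = {"B": 100, "C": 60, "D": 40, "E": 20, "F": 20,
          "G": 10, "H": 5, "I": 5, "J": 5, "K": 5}

totals = head_tail_totals(site_a)                       # head 200, tail 70
score_without_b = spam_score(site_a, same_org={"B"})    # 70 / 100
```

Swapping the division in `spam_score` gives the head-over-tail alternative the patent also mentions.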

The patent doesn’t provide much in terms of describing why the top 3 targets (or some other number) might be chosen as head targets, and the remainder as tail targets, or why it might prefer to look at tail targets over head targets, or vice versa. It also doesn’t directly describe the function used with the redirect score and the spam score to determine if a site is a bounce pad.

The patent is:

Detection of bounce pad sites
Invented by Rupesh Kapoor, David Michael Proudfoot, Joachim Kupke
Assigned to Google Inc
US Patent 8,037,073
Granted October 11, 2011
Filed: December 29, 2008

Abstract

A system may identify a set of related documents, identify one or more documents in the set of related documents that are sources of redirects, and identify organizations that are targets of the redirects. The system may also determine a redirect score based on the number of the identified documents that are sources of the redirects, determine a spam score based on a number of the organizations that are targets of the redirects, determine whether to classify the set of related documents as a bounce pad based on the redirect score and the spam score, and storing information associated with the result of the determination of whether to classify the set of related documents as a bounce pad.

Conclusion

A website that contains a large number of redirects pointing to pages on a number of other domains appears to be somewhat suspicious to Google (and probably for good reasons). A site that has been classified as a bounce pad, and whose pages have content substantially similar to content found on other pages on the Web, might not be the best source of pages to show searchers.

I guess it’s possible that some sites that aren’t web spam sites may include a lot of redirects to different sites, and also be the originators of content found on their pages, with that content possibly copied elsewhere on the Web, but it does sound like an unusual setup.

As I wrote at the start of this post, I hadn’t heard of the term “bounce pad” being used to refer to a website before.

There are a number of other methods that a search engine might use to try to identify web spam that can include analyzing the actual content on pages, looking at links between pages and so on, but this one takes on the subject from a different perspective by focusing upon redirects.

23 thoughts on “How Google Might Filter Out Duplicate Pages from Bounce Pad Sites”

  1. A website that contains a large number of redirects pointing to pages on a number of other domains appears to be somewhat suspicious to Google (and probably for good reasons).

    A web directory resembles that statement and Google’s deprecation of directory referrals is in line with this thinking.

  2. I would imagine that article directories (I have many) might be considered to be “bounce pad sites”.

    Additionally, I syndicate heavily and always link republished versions to the original as per Google Webmasters recommendations.

    Hopefully, if your analysis is correct, Bill, this means that Google will show my original above all duplicates in search results.

    I have noticed that this does not happen on rare occasions…hence, my “article assassination” post.

    It will be interesting to see how this particular technology develops…especially for syndicators…:)

    Mark

  3. Hi Bill,

    Most of us accept that Google and other search engines keep refining the way that they rank sites to help improve the results that our search queries throw up, but hey! wouldn’t it be good if the goalposts could move a little less often? No sooner have I digested the latest advice about techniques to improve my website than there is another change that renders part of my previous work as useless.
    My knowledge about using redirects is very limited but I do have a business associate that has warned me about their use from a security point of view. Google suggests as you do that there are several legitimate reasons to use them but with Google trying to identify ‘Bounce Pads’ and redirects left open to any arbitrary destination likely to be abused by spammers, are they ever a good idea?

  4. Interesting article Bill.

    At my previous gig we had 3 large websites (75 million pages each) and we redirected most of the duplicate pages of 2 of the sites (from years ago when they thought it would be a good idea to have the same content on 3 highly authoritative domains *shaking my head) into the other because we noticed they were getting filtered on a mass level after the Mayday Update.

    Prior to doing this we sent a re-consideration request to Google letting them know what we were doing to basically give them a “heads up” and asked them to let us know if this was not a good idea (knowing we would not get a response).

    Not sure if it had any effect, but it was more of a precaution to minimize the sites being seen as bounce pad sites.

  5. @ash and @mark – I don’t think directories or article directories would count because they only have links and not redirects. This article is talking about redirects and only explicitly requesting a visitor to follow a link to another page. Directories and article directories don’t explicitly request you to follow a link.

  6. I read the description of a Manual redirect to mean any link on a page: “a document explicitly requests that a visitor follow a link to another page.”

    e.g. “Please visit our sponsors”.

    For a 301, 302 or meta refresh, the visitor has no choice but to bounce.

    Not complaining – happy to see directories still having a tiny role to play.

  7. I think it’s important to do solid keyword research and be able to plan your link structures well from the start, limiting the number of redirects you’ll need to use in the future. Simple and clean is the way you want, fast loads and good clear segmenting. 301 is 100x better than rel="canonical" though, Matt Cutts has said Google does not like to see the rel="canonical" tag anymore.

  8. I almost agree that this patent “almost” describes web and niche directories, article directories and syndications, but the difference is about redirection. They’re pointing/linking to these huge numbers of sites, but not redirecting.
    Thanks for the info Bill.

  9. Hi Ash,

    I thought it was interesting that the patent would consider links that promote people visiting another page to be a “manual redirect” as well. Directories do that, but a good directory wouldn’t necessarily include a lot of links that could be considered “head” redirects to a small number of target pages. But let’s say that those manual redirects are something that Google might consider under this patent. Imagine a directory that contained 100 in-main-content links to domain A, 75 similar links to domain B, and 75 similar links to domain C, and then 100 in-main-content links to 100 other domains. Something might seem somewhat odd about a setup like that, with so many of the directory links pointing to three domains.

    I don’t think the patent itself is saying that directories are bad, but when unusual linking patterns might be found in those directories, there just might be something up.

  10. Hi Mark,

    It’s possible that some article directories might be seen as bounce pads, but I don’t think that this patent is specifically focusing upon directories themselves, but rather any site that might have some unusual linking/redirect patterns associated with them.

    While some of Google’s patents and papers describing which pages they might show when pages have substantially the same content have mentioned PageRank as one possible factor they might look at, they’ve also hinted at other factors as well, and things like the redirect patterns described in this patent might be one of a number of things they may be looking at.

    It might not be a bad idea to spend a little time looking at how an article directory is set up before deciding whether or not to syndicate an article there. I’ve pretty much stayed away from article syndication sites myself (I’d rather spend time writing something for here if possible).

  11. Hi Bill,

    That’s a lot of pages. :)

    I wish there was a way to actually get some constructive feedback from the search engines when taking steps like the massive amounts of redirects that you created – there really isn’t any instruction manual when it comes to doing that type of thing, or an intelligent way of gauging what the impact might be when you do.

    It’s not only hard to guess how Google might react to so much content duplicated across multiple sites, but also how they might handle that many redirects. If I read this patent right, a large number of redirects like that might not be the approach that they necessarily want to see either.

  12. Hi Brett and Ash,

    The part of the patent that describes “manual redirects” is hard to interpret in a meaningful way because it’s described early on in the description of the patent in an area that tells us more about redirects, but isn’t discussed in too much detail and isn’t mentioned in the remaining part of the patent filing.

    I don’t think that this patent was specifically aimed at directory sites as much as it was at sites that might include unusual patterns involving status code redirects, meta refreshes, and JavaScript redirects, where there might be an unusually high number of redirects that point to a few pages.

    The “spam score” part of the patent isn’t very clear though on whether it’s good (or bad) to have a large number of redirects pointing to a few pages, or a large number of redirects pointing to a large number of pages, and that’s where this patent sort of falls down for me.

  13. Hi Mike,

    Planning a great structure for your site and where you link to is definitely ideal. I’d also urge that people limit the number of internal redirects to as few as possible and only when necessary (such as in the situation where there is subscription based content, and a 302 redirect will bring someone to a log-in page on the way to that content when they aren’t logged in).

    I’m not sure that Matt Cutts was saying that Google doesn’t want to see the “canonical” tag used as much as he was trying to clarify how best to use it, and why 301s are often a better approach. Interesting video from Matt on this page from his blog:

    http://www.mattcutts.com/blog/rel-canonical-html-head/

  14. Hi Francis,

    I agree – this patent doesn’t seem to be pointing at directories and syndication as much as at sites that might be using a lot of redirects to other sites, while providing content duplicated in other places on the web. It’s possible that pattern might fit some directories or article syndication sites, but it could potentially cover other sites as well.

  15. One other implication of this is affiliate sites that use Gocodes or similar-type redirects for their affiliate links. That’s pretty commonly done. If you have too many of those redirect affiliate links on your site, you could be considered spammy, and that would just be another hurdle to overcome for affiliates in this post-Panda world (assuming Google does something similar to the above patent).

  16. Hi Bill,

    I was wondering, is it still OK to add a site to link directories these days or just do old-fashioned linkbuilding for it? I have a new site which I have started about a month ago and it is still fresh but I am doing a little bit of linkbuilding for it – is it now safe to add it to these redirect pools?

    Thanks,
    Angus.

  17. You really analyzed the search engine crawler very deeply. I have never thought that Google might be using some redirection signal in its actual determination. It is a great patent that I have to observe and think about that way.

  18. Hi Brett,

    I was thinking of the potential impacts of this approach on sites that provide affiliate links when reading through this patent the first time. I think there’s something to that. Of course, the quality evaluator guidelines that recently escaped to the public from March 2011 still emphasized the differences between thin and thick affiliate sites, stating that a site that provides value beyond the affiliate links that it presents is one that can still potentially rank well.

    But in the context of a site that might provide a lot of redirects to affiliate offers and that uses duplicate or near duplicate content, I could see Google possibly filtering those pages out of search results.

  19. Hi Angus,

    I think that there’s still a role for adding your site to link directories, especially if they appear to be ones that people actually use to find information and resources. I tend to stay away from directories that focus upon their ability to help you rank better in search results rather than focusing upon being a useful resource for people looking for information. If those directories tend to just copy content from other places on the Web, and they use a lot of redirects, there is a possibility that they might end up being filtered out of search results.

  20. Hi Alex

    Thanks. That’s one of the reasons why I like looking at search related patents. Sometimes they point out things that the search engines might be doing that may not be so obvious.

  21. The amusing thing about this is that Google itself is a “bounce pad” for AdWords ads. That’s how they track and bill. The HREF link for Google ads goes to the destination site, but clicking on the ad triggers JavaScript and a trip through a Google ad tally server. On top of that, many Google ads then redirect a second time to sites like “xg4ken.com”, which is operated by an Israel-based ad network.

    Unnecessary redirection turns out to be a good indicator of ad content, and is helpful in ad blocking. We’ve started doing ad blocking in one of our Firefox add-ons. Old-style ad blockers have a huge list of things to block, and that list needs constant updating. There are better approaches today.

  22. Hi John,

    The analogy is an interesting one.

    I like patent filings like this because they provide a window into some of the possible processes that a search engine like Google might be using, but of course it’s likely that Google is looking at and considering many other approaches as well to try to identify things like duplicate content and possible web spam.