How Google Might Disassociate Webspam from Content

Manipulative repetitive anchor text, blog comments filled with spam, Google bombs, and obscene content could be the targets of a system described in a patent granted to Google today. The patent provides arbiters (human and possibly automated) with ways to disassociate some content found on the Web, such as web pages, from other content, such as links to that content.

A couple of images of screens from a content management system that allows someone to make judgments on comments associated with a video and on whether search results for a particular query appear to be manipulated.

In an Official Google Blog post, Another step to reward high-quality sites, Google’s Head of Webspam Matt Cutts wrote about an update to Google’s search results targeted at webspam that they’ve now started calling the Penguin update. The day after, I wrote about some patents and papers that describe the kinds of efforts Google has made in the past to try to curtail web spam in my post Google Praises SEO, Condemns Webspam, and Rolls Out an Algorithm Change.

The patent doesn’t describe in detail an algorithmic approach to identifying practices that might have been used to manipulate the rankings of pages in search results. Instead, it tells us about a content management system that people engaged in identifying content impacted by such practices might use to disassociate certain content from webpages and other types of online content.

The patent is:

Content entity management
Invented by Mayur Datar and Ashutosh Garg
Assigned to Google
US Patent 8,176,055
Granted May 8, 2012
Filed: March 27, 2007

Abstract

A first content entity and one or more associated second content entities are presented to one or more arbiters. Arbiter determinations relating to the association of at least one of the second content entities with the first content entity are received.

A determination as to whether the at least one of the second content entities is to be disassociated from the first content entity based on the arbiter determinations can be made.

We’re told that some types of “content entities” such as video and/or audio files, web pages, search queries, news articles, and others might have associated second content entities such as “user ratings, reviews, tags, links to other web pages, a collection of search results based on a search query, links to file downloads, etc.”

That association could be created by user input, such as someone entering a review, or by a relevance determination by a search engine.

Like many patents, this one describes the problem it was intended to resolve. In this case, that problem is that sometimes those types of associations can produce results that could negatively impact how a search engine might work:

Frequently, however, the second content entities associated with the first content entity may not be relevant to the first content entity, and/or may be inappropriate, and/or may otherwise not be properly associated with the first content entity.

For example, instead of providing a review of a product or video, users may include links to spam sites in the review text, or may include profanity, and/or other irrelevant or inappropriate content. Likewise, users can, for example, manipulate results of search engines or serving engines by artificially weighting a second content entity to influence the ranking of the second content entity.

For example, the rank of a web page may be manipulated by creating multiple pages that link to the page using a common anchor text.
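To make the anchor text example concrete, here is a minimal sketch of how someone might measure how concentrated the anchor text pointing at a single page is. The patent gives no formula or threshold for this, so the function name `anchor_text_shares` and the input shape (a list of source URL and anchor text pairs) are assumptions made purely for illustration.

```python
from collections import Counter


def anchor_text_shares(inbound_links):
    """Return {anchor_text: share of inbound links} for one target page.

    `inbound_links` is a list of (source_url, anchor_text) pairs. This
    input shape is hypothetical; the patent does not define one.
    """
    # Normalize case and surrounding whitespace so trivial variants count together.
    counts = Counter(text.strip().lower() for _, text in inbound_links)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {text: n / total for text, n in counts.items()}


links = [
    ("http://a.example/1", "cheap widgets"),
    ("http://b.example/2", "cheap widgets"),
    ("http://c.example/3", "Cheap Widgets"),
    ("http://d.example/4", "Acme Widget Co."),
]
shares = anchor_text_shares(links)
```

The sketch only reports shares. It deliberately applies no cutoff, since nothing in the patent says what share, if any, would count as excessive.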

The patent describes how those associations between content entities might be presented to arbiters or reviewers who could decide to disassociate those content entities.

Some other content associations described in the patent include:

  • Items offered for sale in an online retail store, and user comments related to the item
  • An entry in a blog, and readers’ comments on the blog post
  • An image or a video clip, and annotations or comments on that media file
  • A search query at a search engine, and search results for the query

It’s possible that the system described in this patent may have been intended to be used by the human evaluators who Google hires to evaluate search results during testing, or by search engineers who evaluate the real time quality of search results, or by both.

I did write about a couple of patents last November that also focused upon the evaluation of search results. One focused upon the use of human evaluators, while the other described an automated evaluation approach. Both of those described some of the signals that those approaches might look at when deciding upon the quality of search results.

This patent is less about specific signals that might indicate something about the quality of search results, and more about a content system that could be used to act upon associations that might have problems.

For example, if there’s an image of Mount Kilimanjaro on the Web, and it has been tagged as Mt. Fuji, that tag might be disassociated from the image using this content system. The disassociation might take more than one arbiter determination or vote.
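The idea that a disassociation might need more than one vote can be sketched as a small data structure. This is an illustration under stated assumptions, not the patent's implementation: the `Association` class, its field names, and the `min_votes` default of 2 are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Association:
    """Link between a first content entity and a second one (hypothetical model)."""
    first_entity: str            # e.g. an image
    second_entity: str           # e.g. a tag attached to that image
    votes: dict = field(default_factory=dict)  # arbiter id -> True means "disassociate"

    def record(self, arbiter_id, disassociate):
        # One vote per arbiter; a later vote from the same arbiter replaces the earlier one.
        self.votes[arbiter_id] = disassociate

    def should_disassociate(self, min_votes=2):
        # min_votes is an assumed parameter; the patent only says more than
        # one determination might be required.
        return sum(1 for v in self.votes.values() if v) >= min_votes


tag = Association("kilimanjaro.jpg", "Mt. Fuji")
tag.record("arbiter-1", True)
after_one_vote = tag.should_disassociate()
tag.record("arbiter-2", True)
after_two_votes = tag.should_disassociate()
```

With a single vote the tag stays attached; once a second arbiter agrees, the sketch reports that the tag should be disassociated.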

The patent also tells us that arbiters might be given points for their determinations under this system, as well as time limits for making decisions about content associations.

We’re also told that an arbiter might not necessarily be human either, but could be “implemented as a software agent.”

Such a software agent could make determinations to disassociate content according to algorithms aimed at making such decisions.

For example, one algorithm might determine a relevance measure for a second content entity compared to a first one, and disassociate the two when the relevance measure falls below a certain threshold. Another algorithm could detect profanity and possibly disassociate something like a comment from a blog post, or an annotation from a video.
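A software-agent arbiter along these lines could look something like the sketch below. The patent does not say how relevance is measured, so a simple word-overlap (Jaccard) score stands in for it here; the `PROFANITY` word list, the `threshold` value, and all function names are assumptions for illustration only.

```python
import re

# Placeholder word list; a real system would use something far more complete.
PROFANITY = {"darn", "heck"}


def tokens(text):
    """Lowercased set of word tokens in the text."""
    return set(re.findall(r"[a-z0-9']+", text.lower()))


def relevance(first, second):
    """Jaccard overlap of word sets, used here as a stand-in relevance measure."""
    a, b = tokens(first), tokens(second)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def agent_determination(first, second, threshold=0.05):
    """Return (disassociate, reason) for a second entity attached to a first one."""
    if tokens(second) & PROFANITY:
        return True, "Obscene"
    if relevance(first, second) < threshold:
        return True, "Unrelated"
    return False, None


post = "Python list comprehensions make loops shorter and easier to read"
on_topic = agent_determination(post, "Great tips on list comprehensions in Python")
off_topic = agent_determination(post, "Buy discount watches at my store today")
rude = agent_determination(post, "What the heck is this")
```

The reason strings reuse two of the rationale categories the patent mentions ("Obscene" and "Unrelated"), so an automated determination could be reviewed the same way as a human one.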

We’re also told that an arbiter might be chosen based upon certain factors, such as location:

In another implementation, the content management engine can select the arbiters based on an arbiter location. For example, arbiters in the San Francisco area may be more often selected to review comments related to a video of local landmark, e.g., the Golden Gate Bridge.
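One way to read "more often selected" is as a weighted random choice, where arbiters near the content get a higher weight but are not the only ones eligible. The sketch below assumes that reading; `select_arbiters`, the `local_weight` value, and the plain string comparison of locations are hypothetical choices, not details from the patent.

```python
import random


def select_arbiters(arbiters, content_location, k=2, local_weight=3.0, rng=None):
    """Pick up to k distinct arbiters, favoring those located near the content.

    `arbiters` is a list of (arbiter_id, location) pairs. `local_weight` is an
    assumed multiplier; the patent gives no numbers.
    """
    if rng is None:
        rng = random.Random()
    pool = list(arbiters)
    chosen = []
    while pool and len(chosen) < k:
        # Local arbiters get a larger weight, so they are picked more often.
        weights = [local_weight if loc == content_location else 1.0 for _, loc in pool]
        index = rng.choices(range(len(pool)), weights=weights, k=1)[0]
        chosen.append(pool.pop(index)[0])
    return chosen


arbiters = [("sf-1", "San Francisco"), ("ny-1", "New York")]
reviewers = select_arbiters(arbiters, "San Francisco", k=1, rng=random.Random(0))
```

Passing a seeded `random.Random` makes a run repeatable, which is handy when checking that local arbiters really do come up more often over many draws.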

When an arbiter makes a decision to disassociate content, they might provide a reason for doing so:

The rationale can, for example, be predefined, e.g., check boxes for categories such as “Obscene,” “Unrelated,” “Spam,” “Unintelligible,” etc. Alternatively, the rationale can be subjective, e.g., a text field can be provided which an arbiter can provide reasons for an arbiter determination.

The rationale can, for example, be reviewed by administrators for acceptance of a determination, or to tune arbiter agents, etc. In another implementation, the rationale provided by the two or more arbiters must also match, or be substantially similar, before the second content entity is disassociated from the first content entity.
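The requirement that rationales match can be sketched for the predefined check box case. This is a simplified illustration: it treats rationales as matching only when they are identical after lowercasing, which is narrower than the patent's "substantially similar", and it assumes each arbiter submits one determination. The name `matching_rationale` and the `min_agreeing` parameter are hypothetical.

```python
from collections import Counter


def matching_rationale(determinations, min_agreeing=2):
    """Return the rationale shared by enough disassociating arbiters, else None.

    `determinations` is a list of (arbiter_id, disassociate, rationale) tuples,
    one per arbiter, with rationale values such as "Spam" or "Obscene".
    """
    reasons = Counter(
        rationale.strip().lower()
        for _, disassociate, rationale in determinations
        if disassociate and rationale
    )
    if not reasons:
        return None
    rationale, count = reasons.most_common(1)[0]
    return rationale if count >= min_agreeing else None


agreed = matching_rationale([
    ("a1", True, "Spam"),
    ("a2", True, "spam"),
    ("a3", True, "Obscene"),
])
split = matching_rationale([
    ("a1", True, "Spam"),
    ("a2", True, "Unrelated"),
])
```

In the second case both arbiters voted to disassociate, but for different reasons, so under this rule the content would stay associated.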

The content management system could also help identify irrelevant content and assign a low association score between two different content items. For example, on an article about a programming language, a comment relating to a website about marketing may be considered to not be very relevant.

A ranking freshness signal could also be used by the content management system to identify search results for a specific query that suddenly moved up highly in search results.
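The patent does not spell out how such a signal would be computed. As a rough sketch of the idea, the function below compares two ranking snapshots for one query and lists the URLs that climbed sharply. The snapshot format, the `min_jump` threshold, and the choice to skip URLs missing from the earlier snapshot are all assumptions.

```python
def sudden_risers(previous, current, min_jump=20):
    """List (url, positions_gained) for results that moved up sharply.

    `previous` and `current` map url -> rank for the same query, where 1 is
    the top result. `min_jump` is an assumed threshold.
    """
    risers = []
    for url, rank in current.items():
        old_rank = previous.get(url)
        # URLs absent from the earlier snapshot are skipped in this sketch.
        if old_rank is not None and old_rank - rank >= min_jump:
            risers.append((url, old_rank - rank))
    # Biggest climbers first, so an arbiter could review them in that order.
    return sorted(risers, key=lambda item: item[1], reverse=True)


last_week = {"a.example": 50, "b.example": 3, "c.example": 40}
this_week = {"a.example": 2, "b.example": 4, "c.example": 35, "d.example": 1}
flagged = sudden_risers(last_week, this_week)
```

A list like this would only be a starting point for review; a page can climb quickly for legitimate reasons, which is presumably why the patent routes such cases to arbiters rather than acting on them directly.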

Takeaways

If you’re interested in the processes that Google might use to enforce its webmaster guidelines and how it might do things like devalue links to a website that might be used to manipulate the rankings of search results, you might be interested in spending some time with this patent. The patent itself is fairly long, and while I’ve described some of what it covers, it goes into much more detail.

It makes sense for Google to have some kind of interface that could be used to both algorithmically identify webspam and allow human beings to take actions such as disassociating some kinds of content from others. This patent presents a framework for such a system, but I expect that whatever system Google is using at this point is probably more sophisticated than what the patent describes.

While I’ve seen a few people state that Google doesn’t like excessive links pointed to a particular page with the same anchor text, I can’t recall seeing an explicit statement from Google about that before the mention in this patent. We’re not given any “magical” percentage of how many links pointed to a page might start being considered excessive though.

53 thoughts on “How Google Might Disassociate Webspam from Content”

  1. Bill, so you believe Google might use such a system to create a dashboard for webmasters to be able to disassociate content (irrelevant and/or negative)?

    P.S. The “arbitration” system described sounds very similar to what Google is currently using for Map Maker. I am not sure if you’ve looked into that, or if you’ve seen a patent describing it?

  2. I believe they will incorporate human filters into the algorithm more & more in the future. New startup engines will quickly have an advantage in SERP cleanliness, thus, leaving Google with all the best bells & whistles in the world but constantly focusing on the quality of their results will require more logic. Spammers hide footprints, good spammers hide them well. Google simply can’t keep up. “Software agent” is a good term, I like it!

    It would be good if they do process reviews on products, sentiment on blog post comments and do everything they could do if they all conformed to an open graph..

  3. Sorry, this was a little hard to follow. Are you saying that this latest update is automated rather than manual but that manual spam detection is going to be incorporated more in the future?

  4. I think Brent could be right Bill.
    There will always be spammers and they will always look for and find ways to keep one step ahead of Google.
    Ultimately, the very best way to find out if the results returned from a query are what the user is looking for, is to ask the user.
    When Penguin arrived, I was all for it but since I have read some real horror stories.
    Businesses that hire SEO companies don’t and wouldn’t know if they are spamming and many good intentioned people have been hit hard. It also seems to have increased the market for negative SEO and that in my opinion is worse than spam.
    Maybe it’s time Google were more reliant on human filtering but it’s up to the humans to engage as well.

  5. It’s usually most interesting to look at the reason and I believe that’s one of the most important things here as well. The “Likewise, users can, for example, manipulate results of search engines or serving engines by artificially weighting a second content entity to influence the ranking of the second content entity.” part seems to admit that comment & profile spam and similar has been causing a lot of trouble. From my point of view this is absolutely true, I’ve been outranked on some decently competitive keywords with nothing but that.

    It seems this changed at least a bit with the penguin (this mostly on other markets than US, at least for me). So, it seems to me you have a good point here. Will take the time to read through the patent.

  6. I have a lot of sites/blogs in different niches and I will admit I practiced some “bad” techniques on these and they actually ranked very high. Google punished these rankings in the last update, but at the same time really rewarded a few sites I did a lot of work on, taking special note on good content and social media.
    Google is loving social media a lot now, especially when it is relevant to their Google Plus network. Google really stated in a certain way that they are trying to have their engine think like a human and not a piece of software.

  7. I haven’t yet had a time to read the patent but I’m wondering if this system could also help to dissociate negative SEO from otherwise ‘good’ content?

  8. That system *might* be used internally but I doubt that it will be made available to the public.
    There is just too much chance of misuse.
    Also the patent is 5 years old. If this were to be incorporated into Google’s tool set, it would have been done by now.

    In a perfect organic SEO world, anchor text is presented by the linker and is not in the hands of the website owner or agent.

    Organic links are placed by 3rd parties and the anchor text presented is open to their wishes.

    I would think that instead of a percentage of the total, Google would look at patterns in the placement of new links.
    This is confirmed in their statement:
    “Likewise, users can, for example, manipulate results of search engines or serving engines by artificially weighting a second content entity to influence the ranking of the second content entity. Fox example, the rank of a web page may be manipulated by creating multiple pages that link to the page using a common anchor text.”

    best,
    Reg

  9. The one thing that clearly comes out in regards to the latest Webspam Update is that natural is the way to go. Google is trying to tell webmasters to go back to the basics i.e. when providing quality information came first and ranking came in second.

  10. I am very confused at this Penguin update
    I used to spend a lot of time going over my on-page and off-page SEO
    now Google are not rewarding me for putting in the extra work so that their bots can spider my site better.
    I realise there is a lot of manipulation going on but it seems the big G is doing a blanket clamp down on us website owners who try and optimise our sites with Seo, well hopefully by producing good content on a regular basis I might survive who knows?
    Cheers Mick

  11. Great – I’m sure that the Team from Google is already aware of the problem and has a nifty and surprising solution to it. Hopefully the spammers won’t be aware of this – or they’ll already be adapting their spam strategies. I also have the same question in mind as “AJ Kohn”… I guess only time will tell.

  12. It’s exactly the same as the piracy issue – no matter how many torrent sites get blocked by ISP’s, someone comes up with a way to get back in.

    Spam is just a fact of life; the sooner people get used to this fact and just take sensible precautions the better. I used to freelance for a web design company that spent so much time obsessing over this one detail, that all the rest of the work suffered as a consequence.

  13. Don’t know exactly how the Penguin update works but basically Google says this: write quality content, make websites for users not for search engines. So, Google says that if you follow these instructions your website will succeed. However, this is not the case for many webmasters. I for one have a website filled with hundreds of quality and UNIQUE articles and traffic dropped by 50% in the last two weeks…

  14. I recently got into internet marketing, and with all of these updates that google is putting out (panda, penguin), it seemed that once again.. I jumped on the bandwagon a day late. On second thought, however, if google truly is looking for quality, then perhaps I do have a chance. I am not using all of the automated software, article spinners, or posting spammy comments on any site with do-follow links. I write all articles myself, and I do my best to include information with value. I will continue to build up my sites (only one so far) and see what happens. I always liked penguins – hopefully they like me too!

  15. The “repetitive anchor text punishment” made by Penguin update is real. I have over 70 websites and the badly hit ones which have money keywords with lots of same URL backlink texts. And, I have only made white hat backlink building. I am trying to increase the more natural looking backlink count to balance the situation and hope that the ranks will come back with use of “website”, “click here” texts. I think Google should also patent something like “Google dancing” because it is a never-ending fight between us. :)

  16. Web spam has definitely changed, and I think that moderated comments are a good thing. I understand the importance of moderating comments, and leaving only valuable information on websites/blogs. However, it’s also important to weed out what’s really spam, and what is people’s real thoughts and feelings with a link attached. Even if the name isn’t always genuine, read the comment…if it’s valid/valuable/informative, then why not keep it. I guess backlinks are the new web gold, and people are looking for a way to find places to put their links all the time, but the spam bots are what really need to be stopped. Great post, and thanks for sharing!

  17. I agree with the Penguin update and all the sentiments expressed by others about webspam etc; but in a perfect world, we would have none. However, this isn’t a perfect world and like the guy who sticks a sticker on your car advertising something, website owners are going to get a link stuck to their site by a spammer every now and then. Point is to moderate, if you get a good comment a link is the reward for that person’s patronising your site….

  18. I really haven’t noticed any disruptions on my front. In general, I try to build my links in an ethical manner, but use tools to make the process faster. Too many people favor the idea of throwing all the links that they can at a site and see what happens lol. Sucks for them!

  19. I’m cool with it.

    Honestly, anybody who thought that robospamming the web with thousands of links of identical anchor text was going to last forever could have learned a lesson or two from the financial world over the last few years.

    Diversify!

    I’m still amazed that that has been able to still work at all in recent years, even more amazed at just how well I see it work for people sometimes. It kind of annoys me to be working my butt off and get outranked by somebody who did nothing more than hit a button.

  20. Hi Steve,

    To put what I wrote in context, here’s my statement from the post:

    While I’ve seen a few people state that Google doesn’t like excessive links pointed to a particular page with the same anchor text, I can’t recall seeing an explicit statement from Google about that before until the mention in this patent. We’re not given any “magical” percentage of how many links pointed to a page might start being considered excessive though.

    There are a lot of assumptions floating around about what the Penguin update might have been. My post really wasn’t about that, but it does point out a process and interface that Google might use to identify web spam and manually disassociate some pages from search results. It doesn’t rely upon any kind of mathematical calculation about a percentage of backlinks that might use a certain amount of exact anchor text. It also may have little or nothing to do with the Penguin update.

  21. Hi Nyagoslav,

    I never stated that this interface was for webmasters to use. :)

    It looks like it might be used by human evaluators hired by Google or Google employees.

    Some Google employees have published a paper on their crowdsensus algorithm that covers corrections to Google Maps. I wrote about it in Are You Trusted by Google?. In that, I also link to a patent application from Google that you might find interesting called Trusted Maps: Updating Map Locations Using Trust-Based Social Graphs.

    I have looked at Mapmaker and have used it.

  22. Hi Brent,

    Google has been doing both automated and manual review of web spam for years, and it’s likely that they’ve been working harder to automate more and more of their efforts.

  23. Hi Marissa,

    No. I’m saying that Google was granted a patent that shows a manual interface that could be used by people to evaluate search results and comments for spam. The patent mentions that it might incorporate more automated approaches into identifying spam.

    I’m not saying that this patent describing that interface is directly tied to the update that Google has released under the name Penguin. Instead, I’m sharing with you and others a look at a patent that describes some of the approaches that Google may be using to fight web spam.

  24. Hi Steve,

    Chances are that people are going to continue to try to manipulate search results regardless of what Google or Bing or Yahoo do. The response of the search engines is to try to make it harder and harder to do that effectively, and raise the cost of doing so (in terms of time, effort, and money) as much as possible.

    What is raising the market for negative SEO is the people who want to engage in negative SEO.

  25. Hi Magnus,

    The purpose behind this patent is to provide a system to help make it easier for human evaluators to identify and disassociate web spam when they see it.

    But I think we might be reading too much into it if we take it as a signal from Google that they are failing miserably in their fight against web spam. This is a known problem, and something that Google has told people to avoid doing in their Webmaster Guidelines for years.

    What I’m seeing as I look at it is an approach that Google started developing more than 6 years ago that may now have a number of automated processes built into it to make it easier for them to find web spam. It’s definitely not a solved problem, but this patent gives us a glimpse at a framework for working towards addressing the problem.

  26. Hi Lex,

    Social media does appear to be a whole new layer that could be used to rank search results at some point, and the cost of attempting to manipulate those signals means doing things like creating profiles that look like real people, which I think could potentially be a lot more work than spamming blog comments and forum signatures.

  27. Hi AJ,

    I suspect that many of the potential targets of negative SEO are sites that are already ranking fairly well, and will continue to do so regardless of how many spammy links are pointed at them and subsequently disassociated from them. That in itself is a sign the search engines may use to ignore such efforts.

  28. Hi Reg,

    Good points.

    Nothing within the patent says or implies that it’s something that would be released to the public.

    It’s likely that some type of tool like this has been incorporated into Google’s toolset, and it’s also likely that Google has worked to improve upon it significantly during the past 5-6 years.

    I agree with you that it’s likely Google would be looking at a number of different patterns related to the placement and frequency of links pointed to multiple pages and sites to understand how they were created, rather than some simple metric like a percentage of links using the same anchor text.

  29. Hi Mick,

    There are lots of people who do “SEO” that isn’t necessarily SEO, from participating in blog networks to acquire links from pretty spammy sites, to blog comment spamming, participating in link exchanges aimed solely at increasing PageRank, to trying to calculate and follow some “keyword density,” and so on.

    It’s possible that there are sites that were hit by the Penguin update that shouldn’t have been. I can’t tell you that for certain, though. I did see a site that claimed very publicly to have been affected by Penguin recently which has some significant canonical problems in terms of accessing the pages of the site that could have caused the same problems.

  30. Hi Anton

    If the problem is that people are attempting to manipulate rankings via spammy links, it’s a problem that Google has been aware of for more than a decade. My post isn’t about the Penguin update. It isn’t about negative SEO.

    It’s about a patent that was granted recently that gives us a peek at a content management system that Google developed over 6 years ago that provides them with an easy way to manually disassociate some pages from search results upon a finding that those pages are web spam.

    The patent describes how some of the processes it might use to find web spam might be automated in some ways, and gives us a look at the beginning stages of a tool that is likely much more sophisticated now.

  31. Hi Eloquentlunacy,

    It’s not a surprise that Google is building tools to combat web spam, and to try to improve the quality of their search results. This patent appears to describe some of their sensible actions to overcome the manipulation of their search results.

  32. Hi Josh,

    Sorry to hear that your site was negatively impacted. Have you done a complete and detailed SEO analysis of your site to see if there are other problems that might have caused the loss in traffic? Are there other sites that have scraped and republished your content that are showing up in search results instead of your pages? Hopefully you’re looking at these types of things.

  33. Hi Carlos,

    One constant in the SEO field is that changes happen. There have been other updates that have impacted websites significantly in the past. It sounds like you’re headed in the right direction with the focus you’re following.

  34. Hi Murat,

    Maybe it’s not necessarily just a “repetitive anchor text” punishment. It’s possible that it’s a “repetitive commercial anchor text” punishment where repetitive anchor text showing a high commercial intent (money keywords) are being punished. It’s also possible that other signals may be looked at as well.

  35. Hi Molly,

    If someone comments on a blog, and uses anchor text in the name field for the purpose of spamming search engines, I don’t blame anyone for deleting the comment regardless of the quality of the content in the comment.

  36. Hi Chris,

    I suspect that there are people who have learned that lesson and are still benefiting from their practices, though even those cases have the potential to leave footprints that stand out. Google’s been learning that lesson as well, and is increasing the kinds of signals that it uses for the rankings of pages.

  37. I’m a little confused about the function of the human “arbiters” mentioned in the story. Would these be Google employees or are they planning to crowd-source this with trustworthy editors/good web-citizens ala Wikipedia? Also, regardless if it were run internally or crowd-sourced, wouldn’t the system of evaluating ‘second content entities’ for “obscene” or “unrelated” material turn into a form of censorship.

    For example, if I were to post an extremely graphic and obscene tirade in the comments section of a Justin Bieber YouTube video, these human arbiters would identify and flag it for removal – even though, in my opinion, my obscene tirade is extremely relevant to the video at hand. This automated arbitration system of deciding whether other people’s content is obscene or of good enough quality makes me extremely uncomfortable.

  38. Hey Bill,

    Is it Okay to start another blog in the SAME NICHE as the primary domain in a sub-directory (or folder) using the same theme, settings, logo etc. but with unique content?

    Currently I’m not buying content or hiring writers for my blog but was thinking like this so that all the articles written by other writers come under another directory so that I can link to them from primary domain if it’s related so as to reduce bounce rate.

    - Mahesh

  39. Hi Alex,

    In my post, when I refer to “arbiters,” it’s because that’s what the patent filing calls them. They could be Google employees, but nothing within the patent indicates any intent on Google’s part to turn this task over to the public at large.

    Google has obscenity filters built into its safe search, and I’m thankful that I don’t have to wade through pornography to get to the search results I want to see. Yes, it is a form of censorship, but I don’t see any problem with that.

    If you decide to leave a comment of an obscene and pornographic nature at YouTube for an artist who creates material aimed at teenagers, I don’t see any problem with Google removing your comment, or only showing it to people who are over 18, and logged into YouTube.

  40. Hi Mahesh,

    If you add a subdirectory to your blog, and have other people post on the same topics, but with unique content, that would still likely be seen as the same blog as the one that you are posting upon. Google might possibly see it as a different blog if it were on a subdomain instead.

  41. Pingback: Why Content Creation Is The New Link Building |
  42. I am really impressed by the efforts Google is putting in regarding security, keeping in mind that people are really being affected badly by these spammers and wanted Google to do something in order to control it.