How a Search Engine Might Crowdsource Web Spam Identification

The term crowdsourcing was coined by Wired correspondent Jeff Howe, in a 2006 article titled The Rise of Crowdsourcing, where he described how a crowd of people might use their spare time to help in solving problems or creating content, or in addressing other issues that a single person or organization might have difficulties addressing on their own. Could a search engine effectively rely upon searchers to help clean up web spam in search results?

A crowd of people milling about, waiting on Lincoln's second inauguration speech.

What if search engines added a “feedback” button to every page that they showed in search results where searchers could report pages in those results as web spam? Or, if they added a spam button to their toolbar that searchers could click to identify pages they found through a search as spam?

Would such a system help search engines provide better search results? Would people abuse such a system by identifying pages as web spam when they really aren’t? Could the search engines use other information in addition to aggregated spam reports from searchers to identify web spam?

A patent filing from Microsoft describes a way that they might identify web spam by combining information from searchers about pages with information gathered from an automated system for identifying web spam. The combined information could be used to penalize pages in search results that have been identified as spam.

Search Engine Spam Reporting

Before digging into the approach described in Microsoft’s patent application, I thought it might be interesting to look at how each of the search engines presently allows searchers to identify pages that they might consider web spam.

Google has a “Dissatisfied? Help us improve” link at the bottom of search results pages that you can use to report web spam. The link isn’t devoted to spam, though. Instead, the page asks people who click on it for feedback on a wider range of topics, listing the following as choices:

  • I couldn’t find the desired page.
  • I couldn’t find the desired information.
  • The results included spam. (Spam is explained in the Help Center)
  • The results contained a page that was irrelevant or off-topic.

There is a text box on the page where you can explain why you were dissatisfied with the search results that you viewed, as well as another text box that allows you to identify the URL for a specific page. There’s also a link to a short page discussing reporting web spam. A link on that page leads you to a login for your Google Account, where after logging in you can fill out a full spam report that allows you to identify problems like the following:

  • Hidden text or links
  • Misleading or repeated words
  • Page does not match Google’s description
  • Cloaked page
  • Deceptive redirects
  • Doorway pages
  • Duplicate site or pages
  • Other (specify)

Yahoo’s page to report web spam is something that you have to hunt down in the Yahoo Help pages. There’s no link to it from Yahoo’s search results.

Bing has a “help” link in the bottom left of their pages of search results that you can click to answer the question, “Did we find what you were looking for? (required).” It doesn’t specifically mention web spam, and there’s no detailed spam report that you can visit like the one on Google’s pages.

How likely is it that most searchers will go through the steps involved in reporting web spam at any of these search engines? Probably not too likely.

What if there was a “feedback” link next to each page listed in a search result?

Would it be helpful to the search engines? Would it be abused? Google seems to be leaning towards exercising more and more care in the reporting of web spam by searchers. Google’s Matt Cutts has stated a few times over the past couple of years that Google prefers spam reports that are made when someone is logged into their Google Account. In a March 2010 post on his blog requesting Link Spam Reports, he tells us that “We’re [Google] moving away from using the anonymous spam report form.”

The Microsoft patent filing tells us that it might be helpful if spam reporting were more accessible to searchers:

The user base of searchers will generally be the best source for information pertaining to whether results are spam results. However, requests to end users to provide more feedback data have been met with limited success. The limited success stems from the fact that providing feedback is often cumbersome and time consuming for users. Furthermore, pre-configured feedback formats are often inadequate.

Additionally, in considering user feedback, a system must be able to identify feedback from spammers in order to prevent such feedback from artificially lowering rankings of competitors’ websites.

User satisfaction is a critical success factor for a search engine. Spam results significantly decrease the quality of the user experience. Accordingly, a solution is needed that facilitates identification and filtering of spam results.

Crowdsourcing and Automated Web Spam Detection Used Together

The patent application points to the possibility of a toolbar button or a user interface mechanism on a search results page that allows searchers to report web spam in response to a particular query. That information would be aggregated and merged with data from an automated spam analysis system to identify spam pages, and possibly penalize identified pages in future rankings.

It provides details on how information might be collected from searchers and combined with an automated system for identifying web spam, and tells us a little about the kind of information that an automated system might consider when deciding whether a web page is web spam or not. An interesting feature in that automated system is how it might pull in information from Microsoft’s advertising system in deciding whether a page ranking for a specific query term might be web spam.

The patent filing is:

System and Method for Spam Identification
Invented by Brett D. Brewer and Eric B. Watson
Assigned to Microsoft
US Patent Application 20100100564
Published April 22, 2010
Filed: December 24, 2009

Abstract

A system and method are provided for improving a user search experience by identifying spam results in a result set produced in response to a query. The system may include a user interface spam feedback mechanism for allowing a user to indicate that a given result is spam. The system may additionally include an automated spam identification mechanism for implementing automated techniques on the given result to determine whether the given result is spam. The system may further include a merging component for merging the determinations of the user interface spam feedback mechanism and the automated spam identification mechanism for deriving an indicator of the likelihood that a given result is spam.

Automated Spam Analysis

Microsoft might include a number of different signals to use to automatically identify whether certain pages that appear in search results are spam. Their system would include specific modules that would consider different factors, such as the following:

A Characteristic analyzer – may look at features of a page in search results to see things such as the number of advertisements on a website, whether pages of the site engage in keyword stuffing, and whether the page appearing in the search results seems to be a member of a group of results with the same IP address that tend to be spammer pages.

A Query independent rank analysis mechanism – may look at the query independent rank for each page, such as numbers of links to the page or other factors that may indicate the quality of a page. Presumably, the higher the rank, the less likely the page is spam.

A Monetization analysis mechanism – The query term used to find the page in a set of search results might be examined using monetization data from the advertising system, such as clickthrough rates on sponsored sites and bid rates. If the query is a non-commercial one, such as “Carnegie Mellon University”, the automated spam analysis module might be less likely to consider a page as web spam. If the query term is highly commercial, such as “hotel”, then the cost to bid on that term might be much higher, and the search engine might be more inclined to filter out spam.

A Popularity analysis mechanism – Toolbar information, or some other way of measuring traffic to a page, might be used to see how popular that page is. If data collected from multiple toolbars shows that many people visit a particular page, then the automated system might decrease the probability that the page is spam.
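To make the interplay between those four modules a little more concrete, here is a minimal Python sketch of how their outputs might be folded into one score. The patent filing doesn't publish a formula, so the field names, thresholds, and weights below are all assumptions chosen for illustration. Only the direction of each signal follows the descriptions above.

```python
from dataclasses import dataclass


@dataclass
class PageSignals:
    """Hypothetical inputs; none of these names come from the patent filing."""
    ad_count: int              # advertisements found on the page
    keyword_stuffing: bool     # characteristic analyzer flag
    shares_ip_with_spam: bool  # page sits on an IP address known for spam
    static_rank: float         # query independent rank, scaled 0.0 to 1.0
    query_bid_cost: float      # how commercial the query is, scaled 0.0 to 1.0
    popularity: float          # toolbar-measured traffic, scaled 0.0 to 1.0


def automated_spam_score(s: PageSignals) -> float:
    """Return a rough spam likelihood between 0.0 and 1.0 (assumed weights)."""
    # Characteristic analysis: each suspicious feature adds suspicion.
    score = 0.0
    score += 0.2 if s.ad_count > 10 else 0.0
    score += 0.3 if s.keyword_stuffing else 0.0
    score += 0.3 if s.shares_ip_with_spam else 0.0

    # A higher query independent rank and more visitors reduce suspicion.
    score *= 1.0 - 0.5 * s.static_rank
    score *= 1.0 - 0.5 * s.popularity

    # Commercial queries make the system more inclined to filter spam.
    score *= 0.5 + 0.5 * s.query_bid_cost

    return max(0.0, min(1.0, score))
```

With these made-up weights, the same keyword-stuffed page would score higher when it ranks for a commercial query like “hotel” than for a non-commercial one, and lower if toolbar data shows that many people visit it.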

Analyzing user feedback

One concern about searchers providing feedback about web sites is that some reports might be people reporting their competitors’ pages as web spam so that those competitors’ pages might be penalized, and they might move ahead of them in search rankings.

A search engine may look at things such as the IP address where feedback came from, and record that information. If excessive feedback originates from a single person or address, it may be a signal that it is from a spammer who is trying to “spam vote a result negatively.”
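A minimal sketch of that kind of check might look like the following. The function names and the cutoff of 20 reports per address are hypothetical, since the filing doesn't say what would count as “excessive.”

```python
from collections import Counter


def suspicious_reporters(reports, max_reports_per_ip=20):
    """Return the set of IP addresses that sent more reports than the cutoff.

    `reports` is an iterable of (ip_address, url) tuples. The cutoff is an
    assumed value, not one taken from the patent filing.
    """
    counts = Counter(ip for ip, _url in reports)
    return {ip for ip, n in counts.items() if n > max_reports_per_ip}


def filter_reports(reports, max_reports_per_ip=20):
    """Drop every report that came from an address flagged as suspicious."""
    reports = list(reports)
    flagged = suspicious_reporters(reports, max_reports_per_ip)
    return [(ip, url) for ip, url in reports if ip not in flagged]
```

A check this simple only catches floods from a single address. Reports spread across many addresses would pass straight through it.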

User feedback might also be viewed to see if a page is being marked as spam for more than one query term. If a page is marked as web spam for more than one search term or phrase, it indicates a higher level of confidence that the page is spam, regardless of which query term is used to find the page.
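As a rough illustration, the number of distinct queries a page has been reported under could be counted and turned into a confidence value. The names here are hypothetical, and the assumption that confidence tops out at five distinct queries is arbitrary.

```python
from collections import defaultdict


def distinct_query_counts(reports):
    """Count the distinct queries each URL was reported under.

    `reports` is an iterable of (url, query) tuples. Queries are normalized
    so that "Hotel" and " hotel " count as the same term.
    """
    queries_by_url = defaultdict(set)
    for url, query in reports:
        queries_by_url[url].add(query.strip().lower())
    return {url: len(queries) for url, queries in queries_by_url.items()}


def report_confidence(distinct_queries, saturation=5):
    """Map a distinct-query count to a confidence between 0.0 and 1.0.

    Confidence grows with each additional query and is capped at 1.0 once
    `saturation` distinct queries are reached (an assumed cutoff).
    """
    return min(1.0, distinct_queries / saturation)
```

Under this sketch, a page reported for “hotel,” “cheap hotel,” and “hotel deals” would earn more confidence than one reported three times for “hotel” alone.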

Conclusion

It would be interesting to see a search engine adopt a system that makes it very easy for searchers to mark pages as web spam, but would such a system be prone to abuse?

Would searchers possibly mark pages as web spam when they weren’t, for other reasons such as consumer responses to business practices or some other issues?

The patent filing tells us that under this system, Microsoft might use information from the automated system alone to identify web spam without reports from searchers. Presumably, reports of web spam from searchers without corroborating information from the automated system wouldn’t result in pages being marked as web spam and penalized by a search engine.
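One way to picture a merging step with those two properties is sketched below. This is a guess at a formula that fits the behavior described above, not the method from the patent filing, and the function names and the 0.8 threshold are assumptions.

```python
def merged_spam_likelihood(automated_score, user_confidence):
    """Merge an automated score and a user-report confidence, both 0.0 to 1.0.

    User reports can raise a score the automated system already finds
    suspicious, but they cannot create suspicion on their own: when
    automated_score is 0.0, the result stays 0.0.
    """
    for value in (automated_score, user_confidence):
        if not 0.0 <= value <= 1.0:
            raise ValueError("scores must be between 0.0 and 1.0")
    # The boost is scaled by the automated score, so it vanishes at 0.0,
    # and by the remaining headroom, so the result never exceeds 1.0.
    boost = (1.0 - automated_score) * automated_score * user_confidence
    return automated_score + boost


def should_penalize(automated_score, user_confidence, threshold=0.8):
    """Decide whether a page crosses the (assumed) penalty threshold."""
    return merged_spam_likelihood(automated_score, user_confidence) >= threshold
```

Here a page the automated system scores at 0.9 would be penalized with no searcher reports at all, while a page it scores at 0.0 would never be penalized no matter how many reports come in.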

The patent filing doesn’t tell us how detailed such a feedback system might be for searchers, and whether it might include detailed questions like the list Google provides in its spam report page.

Will Microsoft add this kind of spam reporting system to Bing? Will crowds help make Bing a better search engine?

37 thoughts on “How a Search Engine Might Crowdsource Web Spam Identification”

  1. Seems like it might work for pages with enough traffic to allow a valid statistical analysis. That would weed out fraudulent clicking.

  2. Abuse of “report as spam” links is one of the risks in things such as these. But, to be able to move forward and improve we have to take chances even if there are risks. And if the results are not favorable, then other, better ways can be thought of. It is a continuous improvement, a continuous process towards betterment.

  3. Really a lot of unanswered questions about marking pages as web spam and allowing users to easily spot them. Well, I don’t think that Microsoft will add such a feature to Bing, because attempting that kind of innovation would bring many problems that Microsoft would have to face.

  4. I haven’t noticed the link that Google provides to report spammy results in the searches. Anyway, I expect that the algorithms in the search engines could analyze the content and determine if content on a site is spam or not without our direct participation.

  5. I wonder exactly how long would it be before site owners are hiring people to simply mark their competitors as ‘spam’.

  6. Really interesting thoughts on spam. Google does a pretty good job at keeping their indexes free of spam, as their standards are quite a bit higher than most.

  7. I think this will help, since a lot of people are now more aware of what is SPAM versus a legit offer. I believe that allowing the end user to use this technology is an advantage. Allowing the user to report SPAM easily will give them the sense and responsibility of owning the SERP. Ideally this should make the results more valuable and relevant.

  8. Interesting article!

    Well, I have been hearing some ppl complaining lately about Google not being as responsive to SPAM complaints as they used to. And they DEFINITELY prefer complaints f/a registered Google Account, rather than anonymous (which is what this approach would be). So this idea seems to be going in the opposite direction than where Google is heading at the moment.

    On another front, Craigslist has been using crowdsourcing for spam control, and last I checked, IMHO it was a dismal failure! (Although this may have been more a result of how it was implemented, rather than a fault of the basic concept itself).

    “Would people abuse such a system by identifying pages as web spam when they really aren’t?”

    The spammers will, of course! Imagine the mischief a spammer could cause with this using automated means!

    “A search engine may look at things such as the IP address where feedback came from…”

    Using the anonymous reporter’s IP Address to prevent reporting abuse would IMHO be a failed approach. Looking back at the Craigslist example, the spammers there long ago figured-out how to use proxy servers to evade Craigslist’s IP-based detection & filtering schemes. It is difficult to believe that the blackhat SEOs would be less technically sophisticated than Craigslist users!

    “User feedback might also be viewed to see if a page is being marked as spam for more than one query term. If a page is marked as web spam for more than one search term or phrase, it indicates a higher level of confidence that the page is spam…”

    Again, all it would take is for the spammer to determine more than one SERP in which his honest competitor comes-up for (which he likely knows already, if their rankings are strong enough for him to want to go after them!), and submit a report (perhaps at different times / IP addresses) for each.

    I think that these kinds of issues may be why Google is moving away from anonymous reporting?

    “…merging the determinations of the user interface spam feedback mechanism and the automated spam identification mechanism…”

    Rather than a user+automated scan combo score method, perhaps the better approach would be to have a user report INITIATE a detailed automated scan of the site? (i.e. – a more detailed/process-intensive test than might be practical to do for every site in the index?)

    I do think that more information than just a button would be useful, if for no other reason than to determine which types of automated checks would be best to run; and perhaps more importantly, to give Microsoft feedback on situations where their spam reporting / auto detection system is not working properly, so they can improve upon it.

  9. My concern wouldn’t be so much commercial competitors “spam-bombing” pages – that could be mostly controlled by IP checks and such, but political, etc. “spam-bombing” that would be performed by larger unorganized groups that would be difficult to track.

    There’s also the concern of what a spam page is – I’ve seen some extremely nicely designed spam pages (you don’t realize they’re useless until you start reading the details) and some very poorly designed, but useful pages (of course there are some pages that just scream “spam” – gibberish with links). Not sure a really good solution can be put together.

  10. Hi Dave,

    I’m not sure if there’s some threshold level of reports that a page being reported as spam would have to reach before this approach might kick in, but the automated part of the system might use the manual reports to check regardless of the number of reports. The patent filing tells us that the automated system might make a determination that a web page is spam without considering input from the manual system.

  11. Hi Andrew,

    As it is right now, the search engines don’t make it easy for users to report spam. I don’t know if they will make it easier like this patent filing suggests, but it’s a possibility. It may be better to receive more reports, even though a good percentage of them may be fraudulent, than to not receive many at all. It doesn’t appear that a page would be harmed by abusive reports if the automated system it is combined with doesn’t perceive that page as web spam.

  12. Hi cnpzoomla,

    If Microsoft can see this process as one that can help improve the search results they provide, and they can account for possible abuses of the system, then it might be something that they do try out. I suspect that it’s something they would spend a fair amount of time testing to uncover potential problems.

  13. Hi Mike,

    The Google link doesn’t look like it would be where you would begin reporting web spam, does it?

    Right about the search engines attempting to identify web spam through analyzing content alone. But, adding a feedback feature like this might help them decide which pages to look at first.

  14. Hi Charles,

    It’s possible that someone might hire people to mark pages as spam – which is why the approach is a combined one, with manual feedback and an automated spam identification process.

  15. Hi Keith,

    It’s still possible to see web spam in Google’s index if you look at queries that might be more commercial and competitive.

  16. Hi Vic,

    Unfortunately, some web spam is getting harder to identify, from blogs that have been taken over by link spammers to other sites that present pages that are very similar to legitimate sites (but aren’t).

  17. Hi Dave,

    A lot of great points. Thanks.

    I would think that more than a button would be helpful as well, and that it’s most likely that feedback from searchers would be used more to initiate a scan from the search engine than to make a final determination.

  18. Hi Felix,

    Those are some very good points as well. It probably would be in the search engines’ best interest to have more than a button that searchers could click on to mark a page as spam. Having them explain why they think a page is spam would help filter some of those reports.

    It is harder to distinguish between some pages that are spam and others that aren’t. Looking at the content of a page alone may not be sufficient, even by the search engine. I did like the descriptions of the different things that the search engine might look at in its automated approach, and that it combines things such as a check of the content and advertisements on a page, an analysis of links pointing to and from the page, and data about how many people use the page and how they use it, what kinds of query terms are involved, and more.

    That may help keep pages from being identified as spam solely on the basis of user feedback.

  19. You have a good point that the analysis of links / advertisements would help filter spam pages (I suspect that Google is already doing a lot of that). Combining that with more of a report than a button would also probably keep “button-happy” people from over clicking.

    Overall, I don’t think there’s a perfect solution – I imagine spam web sites will keep adapting the same way spam email senders do (hmmm… the grammar there doesn’t seem quite right!) :)

  20. As always – the question then becomes …

    How long until those pesky SEO’s and website owners abuse that too?

    ‘Tis a shame though … I’m sure we could all do with a bit less spam in our results — or our Inboxes for that matter. :)

  21. Crowd sourcing is really taking off. I think using the crowd to control web spam is a great idea. It seems like if “x” amount of people flag a particular site as spam that it would allow google..bing..yahoo.. to quickly take down these sites.

  22. Hi Felix,

    It does appear that the crowdsourcing aspect of this is to use that feedback more as an indication of pages search engines should look at first, rather than as the deciding factor as to which pages are spam or not. The other things like link analysis do make sense, and I agree with you that Google is probably doing those types of things.

    I also agree that spammers will likely adapt over time, but if the cost of spamming can become more expensive, it might become less attractive to the people doing it.

  23. Hi Bottomless,

    Nice article and points. Another recent patent filing that I haven’t written about comes even closer to what you describe – a social network enabled search that uses social networks to help rerank search results. Definitely an interesting idea.

  24. Hi Dave,

    These days, I expect that most search engineers build algorithms and approaches to removing spam with that question in mind – how might this be abused. It’s a reasonable question – any algorithm is only as good as the cost of preventing people from attacking and abusing it. Would love to see less spam everywhere. :)

  25. Hi Chad,

    Maybe not take those sites down as much as review them – I expect a system like this will have its share of abuse. But the great thing about getting the crowd involved is that it can provide more information and details, and help the search engines decide what pages to look at first.

  27. Personally, I think this is an excellent idea, if only the search engines could figure out a way to implement it. With the community providing constant feedback, they could easily filter out the majority of spam and disinformation floating around out there. But as you said, it is becoming increasingly difficult to separate spam from legitimate content, and the risk of legitimate content being removed from false flagging by the community would become a prime concern. The idea is sound, however the implementation (namely the amount of “pull” given to the community) would have to be somewhat limited in order to avoid revenge/rival flagging of good content. It would place an additional burden on the search provider, as monitoring the community feedback would be essential, even though it could be largely automated.

  28. Hi John,

    Good points. The patent does appear to provide searchers with more of an ability to flag content for review rather than the power to remove content from search results. I would expect that some kind of quality control over the automated process would be in place for manual reviews of a number of sites that were both determined to be web spam or not web spam, so that people from the search engine could monitor how effective (or ineffective) this approach might be.

  29. Would such a voting system be prone to abuse? Absolutely! Black Haters will use anything and everything to knock a competitor down. I can easily see trojans being embedded in an army of computers with a million different IP addresses secretly voting a website as spam over a long, slow period. As a relevancy tool, I think it is a great idea, but should it be enabled? I think not. Those results would be manipulated way too easily and it would look completely natural. What is more, these services would be monetized and sold to the highest bidder on the black hat market. Now that would be a money-maker…unfortunately.

  30. Hi Mark,

    Most algorithms that attempt to identify web spam or rank web pages have aspects of them that are prone to abuse, which is sometimes referred to as a “cost of attack.”

    I’ve seen a number of whitepapers from Microsoft mention that they are aware that in publishing information about some of the algorithms they might use to fight web spam, they are trying to “raise” the cost of attack that people attempting to spam the search engines have to follow, so that at some point it becomes cheaper and easier to use tactics that don’t involve spam.

  31. Crowdsourcing is not a new thing. The DMOZ directory works through community-based effort, and I remember that captcha is a crowdsourcing project too.

    Automated tools like the Toolbar can be cheated easily by changing IP addresses.

    I really appreciate new spam detection techniques because today spam is very aggressive.
    Thank you for your interesting article, I read this blog everyday.
