Classifying Web Spam by Looking at Query and Page Features

Why do search engines care about spam pages that show up in search results? What does a search engine consider web spam? How can a search engine identify web spam?

Should someone who publishes information on the Web be concerned that a search engine might label their pages as spam?

Might the best way to keep a search engine from mislabeling your web pages as search engine spam be to focus upon building quality content on the pages of your site?

It probably is, but it doesn’t hurt to look at what the search engines say on these topics, which is a good reason to keep up with patent filings and papers that are published by the search engines.

Good SEO and Bad SEO Techniques

Ideally, Search Engine Optimization (SEO) involves building a strong technical and informational foundation for a Web site so that all visitors, including search engines, can easily navigate through web pages and find content that fulfills the informational and other needs of visitors who want to find what the site has to offer.

This means that effective SEO work on the pages of such a site would aim at improving the quality and content of those pages, focusing upon words and information that searchers would expect to see upon the pages, and that searchers would use in queries to find the pages of the site when searching at a search engine.

It also means using a strong linking and information architecture for the site, descriptive page titles and meta descriptions, meaningful headings, relevant images, intelligent and helpful link text in links to pages, and useful content that might be relevant to those queries and helpful to those searchers.

Unfortunately, there are site owners who will use other methods that try to take advantage of aspects of search algorithms that may lead to having pages returned in search results that aren’t very relevant for queries used by searchers. Pages that use these kinds of methods are referred to as web spam or search engine spam pages.

A Classification System for Search Engine Spam

A recent patent application and a recent whitepaper from Microsoft explore a classification system for search engine spam – methods of optimizing web pages that aren’t relevant for the queries that they target – based upon comparing features of search queries and features of web pages that may rank well for those queries.

Web Spam Classification Using Query Dependent Data
Invented by Krysta Svore and Chris Burges
Assigned to Microsoft Corporation
US Patent Application 20080270376
Published October 30, 2008
Filed April 30, 2007

Abstract

A web spam page classifier is described that identifies web spam pages based on features of a search query and web page pair. The features can be extracted from training instances and a training algorithm can be employed to develop the classifier. Pages identified as web spam pages can be demoted and/or removed from a relevancy ranked list.
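
As a rough illustration of the last sentence of that abstract, here is a minimal sketch of how a relevancy ranked list might be adjusted once pages have spam scores. The function name, the score format, and both thresholds are assumptions of mine; the patent filing doesn't give specific values.

```python
def apply_spam_scores(ranked, spam_scores, demote_at=0.5, remove_at=0.9):
    """Return a new ranking with likely spam demoted or removed.

    ranked      -- list of (url, relevance) pairs, best first
    spam_scores -- dict mapping url to a spam probability in [0, 1]
    Both thresholds are made-up values for illustration.
    """
    kept, demoted = [], []
    for url, relevance in ranked:
        score = spam_scores.get(url, 0.0)  # unscored pages are treated as not spam
        if score >= remove_at:
            continue                       # drop the page from the results
        if score >= demote_at:
            demoted.append((url, relevance))
        else:
            kept.append((url, relevance))
    # Demoted pages keep their relative order but move below everything else.
    return kept + demoted


results = [("a.example/1", 0.95), ("b.example/2", 0.90), ("c.example/3", 0.80)]
scores = {"a.example/1": 0.95, "b.example/2": 0.60}
reranked = apply_spam_scores(results, scores)
```

In this toy run the first result is removed, the second is pushed below the third, and the third moves to the top.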

The 2007 Microsoft Whitepaper, presented at the 3rd International Workshop on Adversarial Information Retrieval on the Web, in Banff, Alberta, Canada, shares a couple of authors with the patent, and also explores how to classify Web spam based upon features found in queries and web pages: Improving Web Spam Classification using Rank-time Features (pdf)

The paper provides more technical details on web spam classification than the patent filing, and is worth a look if you want to delve into the topic fairly deeply. Here, I’m going to go over some of the features that a search engine might look at when trying to classify pages as spam. Keep in mind that it is possible for a spam classification program at a search engine to mislabel a page as spam, as you read through these features.

Ranking Features in Queries and Pages

When a search engine ranks pages, it may look at a number of features that may appear on those pages and within queries that people use to search for those pages at a search engine.

It may look for the appearance of query terms in page titles, in headings that appear upon those pages, in anchor text of links pointing to the pages, and in meta tags, and may use any other information related to a particular web page and the domain that it appears upon.

The patent filing tells us that a search engine could look at hundreds of different features to rank pages.

Rankings can be driven by algorithms that identify features of a search query, web pages that may match that query, and how the terms that appear in the query and those pages may be related.

Example features may include:

  • The most frequent term in the web page,
  • A number of times a particular term appears in the web page,
  • A domain name associated with the web page (e.g., www.example.com),
  • A number of links pointing to the page,
  • Whether a query term appears in a title of the web page,
  • Other features

A search engine would look at features like those to determine how relevant a page is to a query.

Web Spam Pages

While the features listed above might be helpful in ranking pages, so that those pages can be shown in an order based upon which pages are most relevant to a search, some pages that show up in search results might be web spam pages.

The authors of the patent filing tell us that there are a couple of very good reasons to keep web spam pages from showing up in search results.

The first is that if searchers keep on getting web spam pages as part of the results they receive from a search, they may switch to a different search engine.

The second is that legitimate sites may start utilizing spamming techniques to improve ratings over spam pages.

Web Spam Classification

There are so many web pages, and so many potential spam pages, that search engines can’t manually identify all spam pages. Instead, people from a search engine may label some web pages as spam manually, and then the search engine may try to use programs to look at information taken from pages, or the domains they appear upon, or the links that point to those pages to label other pages as web spam.

One example cited is of a web page stuffed with keywords to try to increase its relevancy ranking. Such a page will likely have more keywords than a legitimate page targeting the same keyword. By training a web spam classifier to recognize situations like that, web spam pages can be identified.

Features used to compare queries and pages might be broadly classified as:

  • Spam based features
  • Rank-time query independent features, and;
  • Rank-time query dependent features.

A web spam classifier program might begin to label a given web page as “spam” or “not spam” by looking at decisions made by human judges who determined whether certain pages were spam when looking at those pages in search results for specific queries.

It may then continue to learn to judge pages associated with results for specific queries, based upon looking at certain aspects or features related to the queries, and related to the pages.

To create a set of training data for a spam classification program to identify spam pages, a search engine might grab representative queries from search engine query log files, and from toolbar data collected by the search engine, to learn how to tell the differences between “spam” pages and “not spam” pages from features of the queries and the pages.

The patent filing tells us about how a human reviewer might identify web spam pages:

When using human labels, a human judge is given the list of queries and issues each query to a search engine. A returned list of 10 results with snippets is shown to the judge. For each URL appearing in the top 10 returned search results, the judge labels the URL as spam, not spam, or unknown. The judgment is made based on the quality of content, the use of obvious spam techniques, and whether or not the result should appear in the top 10.
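
To make the idea of learning from those human labels a little more concrete, here is a minimal sketch. It turns judgments into training examples (dropping the "unknown" ones) and fits a simple perceptron. The perceptron is only a stand-in of mine for whatever training algorithm a search engine would actually use, and the two features per page (a keyword density percentage and a count of spammy in-links), the URLs, and all of the numbers are invented for illustration.

```python
def build_training_set(judgments, features):
    """Pair feature vectors with labels, skipping 'unknown' judgments."""
    examples = []
    for (query, url), label in judgments.items():
        if label == "unknown":
            continue
        target = 1 if label == "spam" else -1
        examples.append((features[(query, url)], target))
    return examples


def train_perceptron(examples, epochs=20):
    """Learn weights w and bias b so that w.x + b > 0 means 'spam'."""
    size = len(examples[0][0])
    weights, bias = [0.0] * size, 0.0
    for _ in range(epochs):
        for vector, target in examples:
            activation = sum(w * x for w, x in zip(weights, vector)) + bias
            if target * activation <= 0:   # misclassified: nudge the weights
                weights = [w + target * x for w, x in zip(weights, vector)]
                bias += target
    return weights, bias


def predict(weights, bias, vector):
    score = sum(w * x for w, x in zip(weights, vector)) + bias
    return "spam" if score > 0 else "not spam"


# Hypothetical judgments keyed by (query, url), as a human judge might record them.
judgments = {
    ("cheap pills", "spam1.example"): "spam",
    ("cheap pills", "pharmacy.example"): "not spam",
    ("car insurance", "spam2.example"): "spam",
    ("car insurance", "insurer.example"): "not spam",
    ("car insurance", "odd.example"): "unknown",
}
# Hypothetical features: [keyword density in percent, spammy in-links].
features = {
    ("cheap pills", "spam1.example"): [30, 12],
    ("cheap pills", "pharmacy.example"): [3, 0],
    ("car insurance", "spam2.example"): [25, 8],
    ("car insurance", "insurer.example"): [4, 1],
    ("car insurance", "odd.example"): [10, 2],
}
examples = build_training_set(judgments, features)
weights, bias = train_perceptron(examples)
```

With only four labeled examples this proves nothing about real-world accuracy; it just shows the shape of the process, with labels going in and a reusable scoring function coming out.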

How well can a computer program do the same thing?

Web Spam Features

Values of web spam features can be determined by mining feature information for each URL in the testing and training sets. Examples of such features include:

  1. The number of spammy in-links to the top level domain of the site – The number of in-links coming from labeled spam pages.
  2. The quality of phrases in the document – a score that indicates the quality of terms on the page.
  3. Density of keywords (spammy terms) – a score that indicates how many terms on the page are spam terms.
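
Two of those features are simple enough to sketch. The code below is my own guess at how they might be computed, not something taken from the patent filing; the function names, the sample links, and the tiny list of "spam terms" are all hypothetical.

```python
def spammy_inlink_count(inlinks, labeled_spam):
    """Count in-links whose source page was labeled spam."""
    return sum(1 for source in inlinks if source in labeled_spam)


def spam_term_density(text, spam_terms):
    """Fraction of terms on the page that appear in a spam-term list."""
    terms = text.lower().split()
    if not terms:
        return 0.0
    return sum(1 for term in terms if term in spam_terms) / len(terms)


inlinks = ["spam1.example/a", "news.example/story", "spam2.example/b"]
labeled_spam = {"spam1.example/a", "spam2.example/b"}
inlink_count = spammy_inlink_count(inlinks, labeled_spam)

page_text = "buy cheap pills now cheap pills shipped fast"
density = spam_term_density(page_text, {"cheap", "pills"})
```

Here two of the three in-links come from labeled spam pages, and half of the eight terms on the page are on the spam-term list.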

Query Independent Features

A query independent feature is a kind of ranking feature used in identifying spam that doesn’t look at a specific query when deciding if a page is web spam; the spam determination can be made by looking at other information. Because of that, this kind of feature is considered to be “query-independent.”

These query-independent features can be grouped into page-level features, domain-level features, anchor features, popularity features, and time features.

Page-level features can be determined by looking just at a page or URL, and can include such things as:

  • The count of the most frequent term,
  • The count of the number of unique terms,
  • The total number of terms,
  • The number of words in the URL, and;
  • The number of words in the title.
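
The page-level features in that list can be sketched in a few lines. This is a simplified illustration under my own assumptions: the tokenizer is crude (it counts pieces of the URL such as "http" and "www" as words), and the feature names are made up.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def page_level_features(url, title, body):
    counts = Counter(tokenize(body))
    return {
        "most_frequent_term_count": max(counts.values(), default=0),
        "unique_terms": len(counts),
        "total_terms": sum(counts.values()),
        "words_in_url": len(tokenize(url)),      # crude: includes scheme and host parts
        "words_in_title": len(tokenize(title)),
    }


page = page_level_features(
    "http://www.example.com/cheap-blue-widgets",
    "Cheap Blue Widgets",
    "Widgets widgets widgets! Buy blue widgets today.",
)
```

Notice that none of these values depend on what a searcher typed, which is what makes them query independent.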

Domain-level features can be computed as averages across all pages in a domain (such as all pages with the domain www.example.com). Examples of domain-level features include:

  • The rank of the domain,
  • The average number of words (on each page), and;
  • The top-level domain.
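
A domain-level average can be sketched by grouping pages by host name. The helper names and sample pages below are hypothetical, and the top-level domain helper simply takes the last label of the host name, so it ignores multi-part suffixes such as co.uk.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def average_words_per_domain(pages):
    """pages maps a URL to the text of that page; returns domain -> mean word count."""
    totals = defaultdict(lambda: [0, 0])      # domain -> [word total, page count]
    for url, text in pages.items():
        domain = urlsplit(url).hostname
        totals[domain][0] += len(text.split())
        totals[domain][1] += 1
    return {domain: words / count for domain, (words, count) in totals.items()}


def top_level_domain(url):
    """Last label of the host name, e.g. 'com'."""
    return urlsplit(url).hostname.rsplit(".", 1)[-1]


pages = {
    "http://www.example.com/a": "one two three four",
    "http://www.example.com/b": "one two",
    "http://blog.example.org/x": "just five words right here",
}
averages = average_words_per_domain(pages)
tld = top_level_domain("http://blog.example.org/x")
```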

Popularity features are features that measure the popularity of pages through the collection of user data, such as collected toolbar data, where the user has agreed to provide access to data collected during a logged session. These can include:

  • The number of hits within a domain,
  • The number of users of a domain,
  • The number of hits on a URL, and;
  • The number of users of a URL.
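
Those four counts could be tallied from a visit log along the following lines. The log format, a list of (user, URL) pairs, is an assumption of mine; the patent filing doesn't describe how toolbar data is actually stored.

```python
from collections import Counter, defaultdict
from urllib.parse import urlsplit


def popularity_features(visits):
    """visits is an iterable of (user_id, url) pairs from a hypothetical toolbar log."""
    url_hits, domain_hits = Counter(), Counter()
    url_users, domain_users = defaultdict(set), defaultdict(set)
    for user_id, url in visits:
        domain = urlsplit(url).hostname
        url_hits[url] += 1
        domain_hits[domain] += 1
        url_users[url].add(user_id)        # sets keep each user counted once
        domain_users[domain].add(user_id)
    return {
        "url_hits": dict(url_hits),
        "url_users": {u: len(users) for u, users in url_users.items()},
        "domain_hits": dict(domain_hits),
        "domain_users": {d: len(users) for d, users in domain_users.items()},
    }


log = [
    ("u1", "http://www.example.com/a"),
    ("u1", "http://www.example.com/a"),
    ("u2", "http://www.example.com/b"),
    ("u2", "http://other.example.org/"),
]
popularity = popularity_features(log)
```

Hits count every visit, while users count distinct people, so a page visited twice by one person has two hits but one user.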

Time features include:

  • The date the URL was crawled,
  • The date the page last changed, and;
  • The time since the page was crawled.
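
The time features are mostly date arithmetic. In this small sketch the field names and the dates are invented for illustration.

```python
from datetime import date


def time_features(crawled_on, last_changed_on, today):
    """Bundle the crawl date, last-change date, and days elapsed since the crawl."""
    return {
        "crawl_date": crawled_on,
        "last_changed_date": last_changed_on,
        "days_since_crawl": (today - crawled_on).days,
    }


timing = time_features(
    crawled_on=date(2008, 10, 1),
    last_changed_on=date(2008, 9, 15),
    today=date(2008, 10, 30),
)
```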

Other features can be used, such as:

  • Frequent term counts (how often different words appear on pages),
  • Anchor text features,
  • etc.

Query Dependent Features

Unlike query independent features, query dependent features depend upon looking at the actual terms from a search query and where and how those terms appear upon a page, and there can be several hundred query dependent features.

Query dependent features (or ranking signals) can be found in queries, in the content of a document, and in the content of a URL.

Query dependent features can depend just on the query used to search, or on a relationship between that query and the properties of a document.

Examples of query-dependent features include:

  • The number of query terms in a web page title,
  • How frequently a query term appears on a page,
  • How often a query term occurs in all pages of the search engine index,
  • The number of documents on the Web that contain the query term, and;
  • N-grams shared between the query terms and a document.
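
Here is a minimal sketch of how the query dependent features in that list might be computed for one query and one page. The three-document "index" is a toy stand-in for a real search index, and the function and feature names are my own.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def ngrams(tokens, n):
    """Set of all runs of n consecutive tokens."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def query_dependent_features(query, title, body, index):
    """index is a list of page texts standing in for the search engine's index."""
    q_terms = tokenize(query)
    title_terms = set(tokenize(title))
    body_terms = tokenize(body)
    docs = [tokenize(text) for text in index]
    return {
        "query_terms_in_title": sum(1 for t in q_terms if t in title_terms),
        "term_frequency": {t: body_terms.count(t) for t in q_terms},
        "collection_frequency": {t: sum(d.count(t) for d in docs) for t in q_terms},
        "document_frequency": {t: sum(1 for d in docs if t in d) for t in q_terms},
        "shared_bigrams": len(ngrams(q_terms, 2) & ngrams(body_terms, 2)),
    }


body = "We sell blue widgets and red widgets"
index = [body, "red gadgets", "widgets widgets everywhere"]
qd = query_dependent_features("blue widgets", "Blue Widgets for Sale", body, index)
```

Unlike the query independent features, every one of these values changes when the query changes, even if the page stays exactly the same.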

Conclusion

There are so many pages on the web that may show up for particular search results that the process of identifying spam web pages, as described in the patent filing, is a largely automated one.

While a number of features or signals are described in the patent application that might be looked at to create and continuously update a program that can identify web spam pages, one thing to keep in mind is that the impact of a page being labeled as “spam” is that the page may be pushed down in search results, or removed completely.

The authors tell us in the whitepaper that they are very much concerned that pages which aren’t spam may be labeled as spam, and try to take care not to have too many “false positives” where non-spam pages are identified as spam.
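
The false positive concern can be expressed as a simple rate: of the pages that really aren’t spam, what share did the classifier flag anyway? The sketch below uses made-up labels, scores, and thresholds to show how a stricter threshold can lower that rate; it isn’t the evaluation method from the whitepaper.

```python
def classify(scores, threshold):
    """Label a page spam when its spam score reaches the threshold."""
    return ["spam" if s >= threshold else "not spam" for s in scores]


def false_positive_rate(labels, predictions):
    """Share of truly non-spam pages that the classifier flagged as spam."""
    non_spam = [guess for label, guess in zip(labels, predictions) if label == "not spam"]
    if not non_spam:
        return 0.0
    return sum(1 for guess in non_spam if guess == "spam") / len(non_spam)


labels = ["spam", "not spam", "not spam", "spam", "not spam", "not spam"]
scores = [0.92, 0.55, 0.10, 0.70, 0.30, 0.05]

fpr_loose = false_positive_rate(labels, classify(scores, 0.5))
fpr_strict = false_positive_rate(labels, classify(scores, 0.6))
```

With the looser threshold one of the four legitimate pages is wrongly flagged; with the stricter one none are, though in practice a stricter threshold also tends to let more real spam through.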

Without focusing on the specific features listed above (and many others that aren’t listed), the best way to try to avoid having any of your pages mislabeled as spam, as a web page publisher, is to build pages that focus upon quality and content.

12 thoughts on “Classifying Web Spam by Looking at Query and Page Features”

  1. Hi,

    another really nice and complete post, thank you. I think that systems are not so bad at picking out good from bad these days, so I wouldn’t panic too much about my site coming up as a spam page (unless it actually was).

    I love the web spam area of research, it really goes hand in hand with understanding search engines, and something any SE engineer needs to be aware of when designing them.

    I wrote about a paper dealing with the use of link structure for fighting web spam, which was “Improving Web Spam Classifiers Using Link Structure” by Qingqing Gan and Torsten Suel from the Polytechnic University in Brooklyn, NY. I liked it.

    There’s also “Link analysis for web spam detection”, recently published with Baeza-Yates also listed as an author.

    There’s also “Identifying web spam with user behavior analysis” and also “Tracking Web spam with HTML style similarities”, and plenty of others.

    The conference to watch is AIRWeb. It’s worth getting an ACM membership to access the digital library if you haven’t already. It’s my “bread and butter” :)

    CJ

  2. I wonder how well the search engines are doing with spam detection. I think of the fact that recently, while searching on Google, I was informed that my inquiry was typical of a spammer (I was looking for a website or phone number for some businesses that I needed to contact). It made me wonder why typing in a business name into a search engine would be considered the action of a spammer.

    I also noticed that one site which has the hallmarks of a splog has been sending traffic to my site, and it appears to be ranking well for its targeted keywords. When I saw the site, splog was my immediate thought, but obviously this is not going through other people’s minds. I wonder then if search engines can learn from human actions. Well, I have to read the whitepaper to have a deeper understanding.

  3. Thank you CJ,

    The AIRWeb papers are a great source of information on Web Spam topics.

    Hi Frank,

    A couple of great examples of problems surrounding web spam. Splogs do still show up in search results, and efforts by search engines to fight spam sometimes leave you scratching your head.

    With billions of pages on the Web, search engines do try to find ways to tackle spam programmatically, and the possibility of false positives and false negatives does exist, so that splogs sometimes rank well, and searches that might be from spammers looking for places to spam aren’t always. Here’s a snippet from the Wikipedia article that I linked to:

    Spam filtering
    A false positive occurs when “spam filtering” or “spam blocking” techniques wrongly classify a legitimate email message as spam and, as a result, interferes with its delivery. While most anti-spam tactics can block or filter a high percentage of unwanted emails, doing so without creating significant false-positive results is a much more demanding task.

    A false negative occurs when a spam email is not detected as spam, but is classified as “non-spam”. A low number of false negatives is an indicator of the efficiency of “spam filtering” methods.

    I’ve received that same message from Google on a search or two – and I suspect that the way I formatted those searches was similar to searches from automated programs that were being done to find places where spam comments could be dropped – not my intention, but I did get the same message. Statistically, other searches like ours were found frequently enough to be from spammers that we were shown that message, even though we had no intention of spamming.

  4. I wanted to know what unsolicited mail means. I got several messages saying that my mail is spam, though obviously it’s not. What should I do in this case?

  5. Hi Steve,

    The term “unsolicited commercial email” is one that has been used by the US government and other organizations to define email spam. Some resources that discuss it further are:

    It looks like the Australian Government also defines email spam as unsolicited commercial emails:

    Many internet service providers (ISPs) provide spam filtering programs to their customers that may block some spam messages, though they may also block a small number of emails that aren’t spam.

    Over the past few years, I’ve sent a handful of emails to people I know which were returned to me with a message from my ISP stating that the domains that the emails were to go to were blocked as spam sites. They weren’t spam sites, but the spam filtering software being used kept them from going through.

    You might try to contact the internet service provider that provides you with email services to see if they can help you keep your emails from being incorrectly returned as spam. They may not be able to help, but they may be able to tell you whom to contact.

  6. Web spam detection has come a long way, but has a long way to go. The issue is that as the engines evolve, spammers are evolving quicker. It’s going to be a battle like this for the near future, then in the distant future I really suspect that search has to evolve into something very different than it is today if we’re ever truly going to get rid of spam in the indexes.

  7. Hi Greg,

    I agree with you. :)

    The evolution of search engines and of spammers does likely mean that search will evolve into something very different. I suspect that personalization is one step in that direction.

  8. Hi,

    Is there any possibility that non-competitive keywords get ranked well and quickly even though the pages containing those non-competitive terms are spam? I am not sure though if there is any relationship between the ranking of non-competitive keywords and spam techniques. Just want to be sure of that, William.

    Great post William, as ever. Loved it.

  9. Hi sham,

    Thanks. It is possible that a page that you or I might consider to be web spam might rank well and quickly for query terms that aren’t very competitive. If a search engine identifies a page as web spam, and penalizes or reduces the “importance” measure of that page, but there is no other competition for the query term or phrase, it may be still possible for the page to show up in search results. It’s also possible for a search engine to remove pages that it identifies as spam from its index completely, too.

  10. Defining SPAM is changing very rapidly with all the Google Free apps. As they collect data from Chrome, docs, gmail and about 30 other products their ability to filter on verified user data will only increase their advantage over competitors.

    Imagine being another search engine and knowing the price of entry into the market is a few hundred million worth of free products to build a data stream!

    More than once I have spoken to the local Tech School and always stump them with the question: “Why does Google give all that stuff away for free?”

  11. Hi Mark,

    Google’s main source of income is the advertisements that are shown with many of the services that they do provide. Those include ads in Gmail and in search results, in results for custom search engines, and in other places as well. The better the services they offer, the more likely it is that people will come back and use those services. The better the services they offer, the harder it is for others to compete with them.

    They do use the data they collect to try to improve those services as well.
