A new patent application from Microsoft describes ways to identify some of the spam pages that show up in search engine results. The research that led to the application started off by looking at something else completely, but a chance discovery turned up some interesting results.
The initial research began with something Microsoft calls Pageturner. Pageturner is a project that looks at how often web pages update, and how frequently they might need to be crawled. It also looks at identifying duplicate and near duplicate content on web pages.
The Microsoft researchers on that project found themselves being drawn to some very different research after looking at some of their results, especially from some pages located in Germany, which changed too quickly. Here are a couple of papers that describe some of the results of the original research:
A presentation from 2003, PageTurner: A large-scale study of the evolution of web pages (powerpoint) (no longer available), provides a little more insight into aspects of Pageturner, and some differences based upon different top level domains. There’s a lot of interesting information from this study. Here are some conclusions noted in the presentation:
- Pages don’t change much from week to week
- Pages have predictable change rate
- Markup-only changes often due to
  - Session IDs
  - Banner ads
- Large changes due to
  - Log files, weblogs, and crafty porn
That last one, crafty porn, was the key to some further research tackling web spam. Microsoft came out with a patent application this past week that includes some of the research initiated during the Pageturner project, and the follow-up research that it inspired. The title of the document is Content evaluation, and it was filed on September 30, 2004, and published on March 30, 2006.
Here’s the abstract from the patent application:
Evaluating content is described, including generating a data set using an attribute associated with the content, evaluating the data set using a statistical distribution to identify a class of statistical outliers, and analyzing a web page to determine whether it is part of the class of statistical outliers.
A system includes a memory configured to store data, and a processor configured to generate a data set using an attribute associated with the content, evaluate the data set using a statistical distribution to identify a class of statistical outliers, and analyze a web page to determine whether it is part of the class of statistical outliers.
Another technique includes crawling a set of web pages, evaluating the set of web pages to compute a statistical distribution, flagging an outlier page in the statistical distribution as web spam, and creating an index of the web pages and the outlier page for answering a query.
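To make the idea of "statistical outliers" a little more concrete, here is a minimal sketch in Python. It is not the method claimed in the patent application, which does not commit to one statistic in the abstract. The function name `flag_outliers`, the z-score test, and the threshold of 3 standard deviations are all assumptions chosen for illustration.

```python
from statistics import mean, pstdev


def flag_outliers(values, threshold=3.0):
    """Return the indices of values that lie more than `threshold`
    standard deviations from the mean of the data set.

    `values` could be any per-page attribute, such as host name length
    or word count. The z-score test and the default threshold are
    illustrative assumptions, not taken from the patent application.
    """
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # Every value is identical, so nothing stands out.
        return []
    return [i for i, v in enumerate(values) if abs(v - mu) / sigma > threshold]


# Twenty ordinary pages and one page with an extreme attribute value.
attribute = [10] * 20 + [200]
suspects = flag_outliers(attribute)
```

A real system would more likely look at several attributes at once and pick thresholds from the observed distributions, but the flow is the same: build a data set from an attribute, describe its distribution, and check whether a page falls in the tail.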
The three have also worked together on the following papers, which look at web spam more closely.
While conducting research for the Pageturner project, looking at more than a few hundred million web pages, our researchers noticed a very large number of machine-generated spam pages from a handful of servers in Germany. Those pages were generated by assembling “grammatically wellformed German sentences drawn from a large collection of sentences.” As a researcher, when you find something interesting like this, it’s difficult not to act upon it. The three Microsoft team members started looking for more:
This discovery motivated us to develop techniques for finding other instances of such “slice and dice” generation of web pages, where pages are automatically generated by stitching together phrases drawn from a limited corpus.
We applied these techniques to two data sets, a set of 151 million web pages collected in December 2002 and a set of 96 million web pages collected in June 2004. We found a number of other instances of large-scale phrase-level replication within the two data sets.
This paper describes the algorithms we used to discover this type of replication, and highlights the results of our data mining.
One clue that set off the discovery was that a number of these German pages were likely to change much more quickly than pages elsewhere. Delving deeper, they found a million pages from 116,654 hosts all sharing the same IP address, and operated by the same organization.
The paper describes how to locate phrases that are duplicated across pages, which may have been taken from other sites and joined together into grammatically correct sentences. It discusses a way of identifying those pages through a process called shingling, and it describes some other characteristics of automatically generated pages that are intended to lure people to sites from search engine results.
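For readers who haven’t run into shingling before, the basic idea can be sketched in a few lines of Python. This is a simplified illustration rather than the algorithm from the paper: it compares full sets of word shingles, while large-scale systems typically work with fingerprints or samples of those sets. The function names and the shingle width `w` are assumptions.

```python
def shingles(text, w=5):
    """Return the set of all runs of `w` consecutive words in `text`."""
    words = text.lower().split()
    if len(words) < w:
        # Text shorter than one shingle: treat the whole text as a shingle.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + w]) for i in range(len(words) - w + 1)}


def resemblance(a, b, w=5):
    """Jaccard overlap of the two shingle sets, from 0.0 to 1.0."""
    sa, sb = shingles(a, w), shingles(b, w)
    if not sa and not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)


# Two short "pages" that share one three-word phrase.
score = resemblance("a b c d", "a b c e", w=3)
```

Pages stitched together from a limited pool of phrases end up sharing many shingles with each other, even when no two pages are exact copies, and that overlap is what makes them detectable.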
Spam web pages that are machine generated tend to differ in a number of ways from most other web pages, and can possibly be identified through statistical analysis. This paper looks at some ways of finding those pages. The types of things that the paper notes as predictive of automatically generated spam pages include:
- Pages with long “host” names and a large number of “characters, dots, dashes and digits” in them tend to be spam pages. A “host” name is the section of a URL before a domain name. On many sites, this is often a “www,” but some sites use a subdomain name (or different host name) in its place.
- Host name resolutions may help point to spam pages. These are pages that all point to the same IP address and share the same domain name, but have different host names. Example: “http://some-host-name.example.com” could be one of 20,000 addresses that all point to the same IP address.
- Linkage properties: looking at the number of links embedded on a page compared to the number of links pointing to those pages. Are they similar to what is seen on other pages on other sites?
- Content properties: A large number of automatically generated pages contain the exact same number of words, though individual words will differ from page to page. (Amongst the pages they were looking at, they found 944 such hosts serving 323,454 pages which all had “no variance in word count.”)
- Content evolution properties: Spam pages tend to change every time they are downloaded, which stands out from much more slowly changing pages on other sites.
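Two of the signals in the list above, host name shape and unvarying word counts, are easy to sketch in Python. The function names, the feature set, and the `min_pages` cutoff below are assumptions made for illustration. The paper does not publish code, and it uses these measurements as inputs to statistical analysis rather than as hard rules.

```python
from collections import defaultdict
from statistics import pvariance
from urllib.parse import urlsplit


def host_features(url):
    """Count the host name properties the paper calls out:
    overall length, dots, dashes and digits."""
    host = urlsplit(url).hostname or ""
    return {
        "length": len(host),
        "dots": host.count("."),
        "dashes": host.count("-"),
        "digits": sum(ch.isdigit() for ch in host),
    }


def zero_variance_hosts(pages, min_pages=3):
    """Given (url, text) pairs, return the hosts whose pages all have
    exactly the same word count. `min_pages` is a hypothetical cutoff
    so that hosts with only a page or two are not flagged."""
    counts = defaultdict(list)
    for url, text in pages:
        host = urlsplit(url).hostname or ""
        counts[host].append(len(text.split()))
    return sorted(
        host for host, c in counts.items()
        if len(c) >= min_pages and pvariance(c) == 0
    )


features = host_features("http://some-host-name.example.com/page")
```

Neither signal proves anything alone. A long host name or a run of identical word counts is only a reason to look more closely, which is why the paper treats them as predictive features rather than verdicts.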
Our researchers aren’t strangers to trying to locate spam pages on the web. A patent with Marc Najork named as the inventor, when he worked at Hewlett-Packard, also looks interesting:
A search engine receives from a client a representation of a first object that was returned by a web server to the client in response to a request from the client.
The search engine receives from the web server a second object in response to an identical request from the search engine, and compares the representation of the first object to a representation of the second object.
The web server is determined to be cloaked if the representation of the first object does not match the representation of the second object. Typically, the client receives a URL embedded in a response to a search request submitted to the search engine.
A toolbar operating in conjunction with the web browser on the client processes the URL. The processing includes:
directing the web browser to obtain an object corresponding to the URL from a web server addressed by the URL;
converting the object to a feature vector; and
delivering the feature vector and the URL back to the search engine.
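The comparison at the heart of that abstract can be sketched roughly as follows, in Python. The patent text quoted here does not say what the feature vector contains, so this sketch stands in a SHA-256 digest of the whitespace-normalized page text as the “representation.” That choice, and the function names, are assumptions for illustration only.

```python
import hashlib


def representation(content):
    """Reduce a page to a compact representation. A digest of the
    normalized text is used here as a stand-in for the patent's
    unspecified feature vector."""
    normalized = " ".join(content.split()).lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def looks_cloaked(client_representation, crawler_content):
    """True when what the toolbar reported for a URL does not match
    what the search engine's own crawler fetched for the same URL."""
    return client_representation != representation(crawler_content)


# The toolbar on the client reports what the browser received ...
reported = representation("Buy cheap pills now")
# ... and the search engine compares it against its own fetch.
cloaked = looks_cloaked(reported, "Helpful gardening tips for spring")
```

An exact-match test like this would misfire on any page that legitimately changes between requests, such as one with rotating ads or session IDs, so a workable version would presumably need a representation that tolerates small differences.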
I recall a few forum posts from Ammon Johns (here’s one) which mentioned that a toolbar from a search engine could easily be used to help compare what a search engine sees when it spiders a site to what a human visitor sees when visiting the same page. If Hewlett-Packard has devised a means of doing this, it’s likely that the major search engines are capable of something similar.
If you prefer the movie version, instead of reading through a lot of this text, Dennis Fetterly gave a presentation at Purdue University on December 8, 2004, about using some of these processes to identify spam:
The presentation is 36 minutes long, and includes a lot of examples, and a nice question and answer session.
In May, the WWW2006 conference includes a presentation of papers, including some on search spam. One of those will be Detecting Spam Web Pages through Content Analysis, in which our three collaborators are joined by Alexandros Ntoulas of the UCLA Computer Science Dept. Should be an interesting presentation.