Sorting Commercial Pages from Informational Pages
Does sorting commercial pages from informational pages provide value to searchers?
That’s one of the notions behind Yahoo’s Mindset (no longer available), which allows searchers to use a slider bar (see image at left) to rerank search results based upon whether a site is more commercial or informational. Yahoo! Mindset was released in May of 2005, and uses machine learning for text classification to sort top results for a query.
I included the kind of reorganization of search results done by Mindset in my post on 20 Ways Search Engines May Rerank Search Results back in October, but I didn’t have a link then to a patent or paper that might describe some of the processes behind how it might work. Since then, I’ve come across some interesting criticism of the use of sliders from Greg Linden, and his comparison of Yahoo Mindset with Yahoo Personalized Search.
I’ve also seen a newly released patent application from Yahoo which discusses ways to sort commercial from informational pages in search results:
Method and apparatus for categorizing and presenting documents of a distributed database
Invented by Daniel C. Fain, Paul T. Ryan, and Peter Savich
US Patent Application 20060265400
Published November 23, 2006
Filed: April 28, 2006
Described herein are methods for creating categorized documents, categorizing documents in a distributed database and categorizing Resulting Pages. Also described herein is an apparatus for searching a distributed database. The method for creating categorized documents generally comprises: initially assuming all documents are of type 1; filtering out all type 2 documents and placing them in a first category; filtering out all type 3 documents and placing them in a second category; and defining all remaining documents as type 4 documents and placing all type 4 documents in a third category. The apparatus for searching a distributed database generally comprises at least one memory device; a computing apparatus; an indexer; a transactional score generator; and a category assignor; a search server; and a user interface in communication with the search server.
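The filtering cascade in that abstract can be sketched roughly as follows. The abstract doesn't say what the document types actually are, so the two predicates here (`is_type2` and `is_type3`) are hypothetical placeholders that a caller would have to supply, and the function name is mine rather than the patent's.

```python
def categorize(documents, is_type2, is_type3):
    """Split documents into three categories with a two-step filter.

    is_type2 and is_type3 are hypothetical predicates; the abstract
    does not define how a document's type is decided.
    """
    first, second, third = [], [], []
    for doc in documents:
        if is_type2(doc):        # filtered out first
            first.append(doc)
        elif is_type3(doc):      # filtered out second
            second.append(doc)
        else:                    # everything left over is "type 4"
            third.append(doc)
    return first, second, third
```

Anything that neither filter catches falls into the third category, which mirrors the abstract's "defining all remaining documents as type 4."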
The document describes a method of classifying pages which involves:
- - Determining whether or not a page is spam,
- - Creating a quality score for pages,
- - Providing a transactional score for pages,
- - Deriving a propagation matrix (authorities and hubs score), and;
- - Assigning a commercial score.
Determining Whether a Page is Spam
A spam score may be assigned manually by a human reviewer, though that approach doesn’t lend itself well to classifying the large number of pages on the web.
Automated techniques are much more likely, and the patent application points to a couple of sources that describe more about automated processes:
- - A white paper by Alan Perkins entitled “The Classification of Search Engine Spam,” and;
- - A paper by Danny Sullivan entitled “Search Engine Spamming” from the 2002 Boston SES Conference which appears to be available to subscription members of Search Engine Watch.
Assigning a Quality Score
A quality score may be given to pages based upon such things as:
- - Quality of the content,
- - Reputation of the author or source of information,
- - Ease of use of page, and;
- - Other such criteria.
This score could be assigned manually, or determined automatically, with a default value given to pages not explicitly evaluated.
Creating a Transaction Score
This score would represent whether, and how strongly, a page might lead to transactions, such as a sale, lease, rental or auction. Here are some examples from the patent application of the types of things that might be looked for:
- - A field for entering credit card information;
- - A field for a username and/or password for an online payment system such as PayPal™ or BidPay™,
- - A telephone number identified for a “sales office,” a “sales representative,” “for more information call,” or any other transaction-oriented phrase;
- - A link or button with text such as “click here to purchase,” “One-Click™ purchase,” or similar phrase,
- - Text such as “your shopping cart contains” or “has been added to your cart,” and/or;
- - A tag such as a one-pixel GIF used for conversion tracking.
In addition to determining whether or not a page is transactional, the process also produces a score for how strongly transactional the page is.
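Here is a minimal sketch of how signals like those might be turned into a score. The patterns are drawn from the examples in the list, but the weights, the threshold, and the function names are my own assumptions, not anything stated in the patent application.

```python
import re

# Hypothetical signals and weights, loosely based on the patent's examples.
TRANSACTION_SIGNALS = {
    r"credit card": 3.0,
    r"click here to purchase": 2.0,
    r"your shopping cart contains": 2.0,
    r"has been added to your cart": 2.0,
    r"sales (office|representative)": 1.0,
    r"for more information call": 1.0,
}


def transactional_score(page_text):
    """Sum the weights of every signal pattern found in the page text."""
    text = page_text.lower()
    return sum(
        weight
        for pattern, weight in TRANSACTION_SIGNALS.items()
        if re.search(pattern, text)
    )


def is_transactional(page_text, threshold=2.0):
    """Treat a page as transactional once its score reaches the threshold."""
    return transactional_score(page_text) >= threshold
```

A real system would presumably look at form fields and tracking tags in the HTML as well, not just visible text. This sketch only covers the text phrases.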
Deriving a Propagation Matrix
This involves a mix of creating an authority score and hub score for each page, and looking at user behavior on those pages.
Relative importance of each page is determined by looking at the number of links to each page from other pages, and the number of links from that page to other pages, to create an authority score and a hub score.
In general, a hub is a page with many outgoing links and an authority is a page with many incoming links. The hub and authority scores reflect how heavily a page serves as a reference or is referenced itself.
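As a rough illustration of that idea, the sketch below counts outgoing and incoming links from a list of (source, target) pairs. This is plain link counting, not the iterative hubs-and-authorities computation a search engine would more likely use, and the function name and input format are assumptions of mine.

```python
from collections import Counter


def link_counts(links):
    """Return (hub, authority) counters from (source, target) link pairs.

    hub[page] is the number of distinct pages it links out to;
    authority[page] is the number of distinct pages linking in to it.
    """
    hub = Counter()
    authority = Counter()
    for source, target in set(links):  # ignore duplicate links
        if source == target:           # ignore self-links
            continue
        hub[source] += 1
        authority[target] += 1
    return hub, authority
```

With links from page "a" to "b" and "c", and from "b" to "c", page "a" would have a hub count of 2 and page "c" an authority count of 2.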
User behavior involved in creating this matrix includes looking at numbers of page views and calculating a “transition count” where the number of times a person views one page, and then goes to another page without visiting any other intervening pages is counted.
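The transition count could be tallied along these lines, assuming we have each visitor's page views as an ordered list. The session format and the function name are hypothetical; the patent application doesn't spell out how the log data is stored.

```python
from collections import Counter


def transition_counts(sessions):
    """Count page views and direct page-to-page transitions.

    sessions: a list of sessions, each an ordered list of pages viewed.
    A transition (a, b) is counted when b is viewed immediately after a,
    with no other page in between.
    """
    page_views = Counter()
    transitions = Counter()
    for session in sessions:
        page_views.update(session)
        transitions.update(zip(session, session[1:]))
    return page_views, transitions
```

Pairing each page with the one that follows it in the session is what captures the "without visiting any other intervening pages" condition.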
Calculating a Commercial Score
The patent application provides a fairly complex calculation for the commercial score, using the other factors mentioned above: the propagation matrix, the transaction rating, the spam score, and the quality score. Note that all pages are considered informational pages until they go through this process and are determined to be commercial or not.
Some Other Issues Discussed in the Patent Application
Here’s a quick look at some of the other things discussed in the document about this process:
Further categorization may be done, and a person may choose before the search how much more categorization they would like to see.
More information for commercial sites may be displayed, taken from places like a domain name registry.
Paid placement advertising may be shown on results pages.
The patent application doesn’t specifically state that it is related to Yahoo Mindset, yet it does describe a process that appears to be very similar to what was happening on the Mindset pages. Yahoo Mindset was in beta, and the Frequently Asked Questions page for the service was worth looking at for more information on how it worked.
I found it interesting that the patent filing referred to documents from Alan Perkins and Danny Sullivan when discussing automated methods for detecting spam, since neither is a search engineer. Yet I’m not surprised, since both are very knowledgeable about search, and are authority figures on many topics related to search.
The use of quality scores and determination of hubs and authorities in this reranking process was also something interesting to see.
In building ecommerce sites, there is often a value to users of those sites in building pages that make it very easy for people to conduct transactions, and pages that help provide people with enough information for them to make an informed purchasing decision. The transactional pages and the informational pages can be the same page, but they don’t have to be. There can be some value in providing both types of pages to potential customers. If a process like this, that sorts commercial and informational pages, is elevated from demo to part of the normal search functions at Yahoo (or another search engine), then there may even be more value in having purely informational pages on an ecommerce site, in addition to transactional pages.