Sorting Commercial Pages from Informational Pages

slider from Mindset

Does sorting commercial pages from informational pages provide value to searchers?

That’s one of the notions behind Yahoo’s Mindset (no longer available), which allowed searchers to use a slider bar (see image at left) to rerank search results based upon whether a site was more commercial or informational. Yahoo! Mindset was released in May of 2005, and used machine learning for text classification to sort the top results for a query.

I included the kind of reorganization of search results done by Mindset in my post on 20 Ways Search Engines May Rerank Search Results back in October, but I didn’t have a link then to a patent or paper that might describe some of the processes behind how it might work. Since then, I’ve come across some interesting criticism of the use of sliders from Greg Linden, and his comparison of Yahoo Mindset with Yahoo Personalized Search.

I’ve also seen a newly released patent application from Yahoo which discusses ways to sort commercial from informational pages in search results:

Method and apparatus for categorizing and presenting documents of a distributed database
Invented by Daniel C. Fain, Paul T. Ryan, and Peter Savich
US Patent Application 20060265400
Published November 23, 2006
Filed: April 28, 2006

Abstract

Described herein are methods for creating categorized documents, categorizing documents in a distributed database and categorizing Resulting Pages. Also described herein is an apparatus for searching a distributed database. The method for creating categorized documents generally comprises: initially assuming all documents are of type 1; filtering out all type 2 documents and placing them in a first category; filtering out all type 3 documents and placing them in a second category; and defining all remaining documents as type 4 documents and placing all type 4 documents in a third category. The apparatus for searching a distributed database generally comprises at least one memory device; a computing apparatus; an indexer; a transactional score generator; and a category assignor; a search server; and a user interface in communication with the search server.
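
The filtering cascade in the abstract can be pictured as two successive filters, with everything left over falling into a final category. Here is a minimal sketch of that idea. The abstract doesn’t say what the document types are, so the function name `categorize`, the two predicates, and the example URLs are all hypothetical and only there for illustration.

```python
from typing import Callable, Dict, Iterable, List


def categorize(
    documents: Iterable[str],
    is_type_2: Callable[[str], bool],
    is_type_3: Callable[[str], bool],
) -> Dict[str, List[str]]:
    """Split documents into three categories with two successive filters."""
    first: List[str] = []
    second: List[str] = []
    third: List[str] = []
    for doc in documents:
        if is_type_2(doc):      # first filter
            first.append(doc)
        elif is_type_3(doc):    # second filter, applied only to what is left
            second.append(doc)
        else:                   # everything remaining is treated as type 4
            third.append(doc)
    return {"first": first, "second": second, "third": third}


# Hypothetical predicates and URLs, for illustration only.
pages = ["buy-widgets.example", "cheap-pills-spam.example", "widget-history.example"]
result = categorize(
    pages,
    is_type_2=lambda url: "spam" in url,
    is_type_3=lambda url: url.startswith("buy-"),
)
```

The order of the filters matters in a cascade like this: a page caught by the first filter is never examined by the second.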

The document describes a method of classifying pages which involves:

  • - Determining whether or not a page is spam,
  • - Creating a quality score for pages,
  • - Providing a transactional score for pages,
  • - Deriving a propagation matrix (authorities and hubs score), and;
  • - Assigning a commercial score.

Determining Whether a Page is Spam

A spam score may be assigned manually by a human reviewer. That is possible, but it doesn’t lend itself well to classifying the large number of pages on the web.

Automated techniques are much more likely, and the patent application points to a couple of documents that describe automated processes in more detail:

  • - A paper by Danny Sullivan entitled “Search Engine Spamming” from the 2002 Boston SES Conference which appears to be available to subscription members of Search Engine Watch.

Assigning a Quality Score

A quality score may be given to pages based upon such things as:

  • - Quality of the content,
  • - Reputation of the author or source of information,
  • - Ease of use of page, and;
  • - Other such criteria.

This score could be assigned manually, or determined automatically, with a default value given to pages not explicitly evaluated.

Creating a Transaction Score

This score would represent whether, and how strongly, a page might lead to transactions, such as a sale, lease, rental or auction. Here are some examples from the patent application of the types of things that might be looked for:

  • - A field for entering credit card information;
  • - A field for a username and/or password for an online payment system such as PayPal™ or BidPay™;
  • - A telephone number identified for a “sales office,” a “sales representative,” “for more information call,” or any other transaction-oriented phrase;
  • - A link or button with text such as “click here to purchase,” “One-Click™ purchase,” or similar phrase;
  • - Text such as “your shopping cart contains” or “has been added to your cart,” and/or;
  • - A tag such as a one-pixel GIF used for conversion tracking.

In addition to determining whether or not a page is transactional, the process produces a score indicating how strongly transactional the page is.
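
To make the idea of a transactional score concrete, here is a rough sketch that looks for a few of the phrases listed above and adds up a weight for each one found. The patterns come from the examples in the patent application, but the weights, the threshold, and the function names are my own assumptions. The filing may well score pages quite differently.

```python
import re

# Hypothetical weights: the patent application lists the kinds of cues,
# but the numbers here are invented purely for illustration.
SIGNALS = {
    r"credit card": 3.0,
    r"click here to purchase": 2.0,
    r"your shopping cart contains": 2.0,
    r"has been added to your cart": 2.0,
    r"sales (office|representative)": 1.0,
    r"for more information call": 1.0,
}


def transactional_score(page_text: str) -> float:
    """Sum the weights of every signal pattern found in the page text."""
    text = page_text.lower()
    return sum(
        (weight for pattern, weight in SIGNALS.items() if re.search(pattern, text)),
        0.0,
    )


def is_transactional(page_text: str, threshold: float = 2.0) -> bool:
    """Treat a page as transactional once its score reaches the threshold."""
    return transactional_score(page_text) >= threshold


shop = "Your shopping cart contains 2 items. Enter your credit card number."
essay = "A short history of the shopping cart."
```

With these made-up weights, the `shop` text matches two signals while the `essay` text matches none, even though both mention a shopping cart.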

Deriving a Propagation Matrix

This involves a mix of creating an authority score and hub score for each page, and looking at user behavior on those pages.

The relative importance of each page is determined by looking at the number of links to that page from other pages, and the number of links from that page to other pages, to create an authority score and a hub score.

In general, a hub is a page with many outgoing links and an authority is a page with many incoming links. The hub and authority scores reflect how heavily a page serves as a reference or is referenced itself.
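
In its simplest form, that description amounts to counting incoming and outgoing links. The sketch below does only that, using a hypothetical list of (source, target) link pairs. Better-known approaches such as the HITS algorithm refine hub and authority scores iteratively, and this post doesn’t cover whether the filing weights links in a more elaborate way.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def link_counts(
    links: Iterable[Tuple[str, str]],
) -> Tuple[Dict[str, int], Dict[str, int]]:
    """Return (hub, authority) counts from (source, target) link pairs."""
    hub: Counter = Counter()
    authority: Counter = Counter()
    for source, target in links:
        hub[source] += 1        # an outgoing link raises the source's hub count
        authority[target] += 1  # an incoming link raises the target's authority count
    return dict(hub), dict(authority)


# Hypothetical link graph: a directory links to a shop and a guide,
# and the guide also links to the shop.
links = [("directory", "shop"), ("directory", "guide"), ("guide", "shop")]
hub, authority = link_counts(links)
```

Here the directory ends up with the highest hub count and the shop with the highest authority count.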

The user behavior involved in creating this matrix includes numbers of page views and a “transition count,” which is the number of times a person views one page and then goes directly to another page without visiting any other pages in between.
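
A transition count like that could be tallied from browsing sessions by counting each pair of consecutive page views. This is a small sketch under the assumption that sessions are available as ordered lists of page names; the data format and the function name are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple


def transition_counts(sessions: Iterable[List[str]]) -> Dict[Tuple[str, str], int]:
    """Count direct moves from one page to the next across all sessions."""
    counts: Counter = Counter()
    for pages in sessions:
        # zip pairs each page view with the one immediately after it,
        # so pages visited with others in between are never paired.
        for current, following in zip(pages, pages[1:]):
            counts[(current, following)] += 1
    return dict(counts)


# Hypothetical sessions, each an ordered list of pages one person viewed.
sessions = [
    ["review", "product", "checkout"],
    ["review", "product"],
    ["home", "review"],
]
counts = transition_counts(sessions)
```

Note that the first session does not add to a "review" to "checkout" count, because the "product" page was viewed in between.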

Calculating a Commercial Score

The patent application provides a fairly complex calculation for the commercial score, using the other factors mentioned above: the propagation matrix, the transaction rating, the spam score, and the quality score. Note that all pages are considered informational pages until they go through this process and are determined to be commercial or not.
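
I haven’t reproduced the formula from the filing here, so the sketch below is not the patent’s calculation. It is only one plausible way that the four inputs might be combined: a page inherits part of the transactional score of the pages its visitors go to next, and the result is scaled by quality and discounted by spam. The `damping` value, the 0.3 cutoff, the default scores, and every name in the code are assumptions made for illustration.

```python
from typing import Dict, Tuple


def commercial_scores(
    transactional: Dict[str, float],
    transitions: Dict[Tuple[str, str], int],
    spam: Dict[str, float],
    quality: Dict[str, float],
    damping: float = 0.5,
) -> Dict[str, float]:
    """Illustrative only: NOT the formula from the patent application."""
    # Total outgoing transitions per page, used to normalize each row.
    totals: Dict[str, int] = {}
    for (source, _target), n in transitions.items():
        totals[source] = totals.get(source, 0) + n

    scores: Dict[str, float] = {}
    for page, own in transactional.items():
        # Share of this page's visitors who move on to each next page,
        # weighted by how transactional that next page is.
        inherited = sum(
            n / totals[source] * transactional.get(target, 0.0)
            for (source, target), n in transitions.items()
            if source == page
        )
        raw = own + damping * inherited
        # Pages with no explicit rating get default quality 1.0 and spam 0.0.
        scores[page] = raw * quality.get(page, 1.0) * (1.0 - spam.get(page, 0.0))
    return scores


transactional = {"review": 0.0, "checkout": 1.0, "essay": 0.0}
transitions = {("review", "checkout"): 3, ("review", "essay"): 1}
scores = commercial_scores(transactional, transitions, spam={}, quality={})

# Every page starts out informational and is relabeled only if it clears the cutoff.
labels = {
    page: "commercial" if score >= 0.3 else "informational"
    for page, score in scores.items()
}
```

In this toy example the review page has no transactional cues of its own, but it is still labeled commercial because most of its visitors go straight to the checkout page.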

Some Other Issues Discussed in the Patent Application

Here is a quick look at some of the other things the document discusses about this process:

Further categorization may be done, and a person may choose before the search how much more categorization they would like to see.

More information for commercial sites may be displayed, taken from places like a domain name registry.

Paid placement advertising may be shown on results pages.

Conclusion

The patent application doesn’t specifically state that it is related to Yahoo Mindset, yet it does describe a process that appears to be very similar to what was going on in Mindset. Yahoo Mindset was in beta, and the Frequently Asked Questions page for the service was worth looking at for more information on how it worked.

I found it interesting that the patent filing referred to documents from Alan Perkins and Danny Sullivan when discussing automated methods for detecting spam, since neither is a search engineer. Yet I wasn’t surprised, since both are very knowledgeable about search and are authority figures on many topics related to it.

The use of quality scores and determination of hubs and authorities in this reranking process was also something interesting to see.

In building ecommerce sites, there is often a value to users of those sites in building pages that make it very easy for people to conduct transactions, and pages that help provide people with enough information for them to make an informed purchasing decision. The transactional pages and the informational pages can be the same page, but they don’t have to be. There can be some value in providing both types of pages to potential customers. If a process like this, that sorts commercial and informational pages, is elevated from demo to part of the normal search functions at Yahoo (or another search engine), then there may even be more value in having purely informational pages on an ecommerce site, in addition to transactional pages.

6 thoughts on “Sorting Commercial Pages from Informational Pages”

  1. You know, when there are commercial websites with plenty of educational content, it may be hard to identify a site as commercial or informational. What if the algorithm does it wrong? Then one of the best websites won’t be getting targeted visitors.

    Then again, there are multiple instances between an ecommerce website without content and ecommerce websites with plenty of informational content (which may be called informational websites and a shop). I doubt all of them will be given a correct ‘commercialized’ score.

    That being said, it is always useful to make sure the visitors finish the transaction (with or without reading additional pages), even without this patent :)

  2. I found this extra setting kind of useless when I tried it. I think they need to work a little more on it.

    As for SEOs, if such a tool were implemented, it would pose another challenge for optimization. But I don’t see this being very popular in the future. Maybe the way to go would be to allow better interpretation of the search query, like Ask Jeeves tried to do (poorly, but at least they tried).

    When machine translation becomes very good, the search engines will change a great deal. That’s my prediction anyway.

  3. What if the algorithm does it wrong?

    I wonder that myself, but not just for this patent, and the assumptions behind it. Determining relevance isn’t a black and white endeavor, and the assumptions that are made in trying to do so for an incredibly large volume of sites are going to lead to mistakes and inaccuracies.

    It’s possible that sites with high transactional scores could also be the most informative pages on a subject too, because the owners of the site want to not only convert sales, but also make the shopper feel that they are making a fully informed shopping decision.

    I like to point out patents like this one because doing so allows us to ask how it might affect us if it stopped being a beta project in the search engines’ research areas and was folded into the normal search engine. What kind of effect would it have? There’s a mention in the patent application that something like this could be part of a browser helper, like a toolbar feature. Would more people use it if it was? I don’t know.

    Hi Carfeu,

    I’m not completely satisfied with the results, but then again, I do a lot of specialized searches that give me pretty good results with the search engines. For example, doing a search for a query phrase and adding “site:.edu” is a good way of only getting informational type pages from university sites. But I don’t expect many people to do that.

    I think that you’re referring to the additional categories that Ask.com shows, which allow people to refine a query by choosing a category or a query that might be closer to what the searcher intended. You may be right. I did enjoy moving the slider back and forth, and seeing the changes in what Yahoo Mindset was serving for the few top pages. The middle of the page query refinements (or sometimes top of page refinements) that Google offers are sometimes pretty good. I do like the fact that there’s some experimentation going on, and the search engines are trying some different things.

    The search engines will likely change a great deal over the next few years, and I think that many searchers are getting more sophisticated, too.

  4. The other thing it doesn’t evaluate is there may be commercial pages for things like lead generation which may not have a shopping cart or a tag. Or other pages made to essentially be affiliate pages that are essentially spam but look like information. From all the spam I’ve seen in competitive areas, Yahoo needs to do a better job of determining what is “quality” information.

    Perhaps a solution could be some sort of authentication program. A true commerce site could authenticate themselves as a commerce site with Google, Yahoo, etc. It might involve certain standards, an application process, some sort of human review. The same could be done for people who are essentially publishing content.

    There is a risk people could abuse the system, but at the same time having a “trusted vendor” system or some sort of quality control process could help to control spam. In a sense, a form of paid inclusion but one with more controls and the search engines could mark it that way and highlight these results for commercial queries.

    I think ultimately what searchers are looking for is some assurance of quality for the item or vendor they are looking for. If the search engines could provide more information and verify that a website is actually a store, has a physical location, and is not just another spam site posing as a real site, I think it would provide value to searchers and the search engines.

    Michael @ SEOG.net

  5. Some interesting ideas, Michael. Definitely worth exploring.

    I noticed, doing a couple of sets of searches, that the default results in Yahoo Mindset are different than normal Yahoo results. I’m wondering if Mindset is indeed doing some spam filtering that normal Yahoo isn’t.

    I see the value in authentication, too. But there are so many ways of validation, such as Better Business Bureau or Chamber of Commerce memberships, acquisition of a valid security certificate, and others. Should one be more valid than others? Will some businesses fail to validate for one reason or another? If so, will that harm the quality of the search engine?

    I see Google trying to get businesses to validate their business information in the Local Business Center. I don’t know how many businesses are actually doing that, though. It is often to their benefit to do so, and it provides an added level of trust and authentication, like you say.
