Microsoft Explains Duplicate Content Results Filtering

On Wednesday…

Me: “What should I include in my presentation on duplicate content at Webmaster World Pubcon in two weeks?”

A Friend: “How does a search engine decide which duplicate to show in search results, and which ones not to show?”

Me: “Good one.”

A Friend: “Yep. How do they choose? PageRank? First one published?”

Me: “There are white papers and patent filings describing ways a search engine might discover duplicate content. They look at URLs and linking structures of mirrored sites, or examine consecutive word sequences in the snippets returned with results.”

A Friend: “Right. But that doesn’t answer the question.”

Me: “I’ve seen more than a couple of duplicate content filtering issues in the past. I’ve explored the topic in detail. But I’ve never seen anything in writing on the subject from someone connected with a search engine.”

A Friend: “And?”

Me: “It doesn’t seem to be any one signal. It’s not PageRank alone, or distance from root directory. It’s probably not the first one published, because many sites are dynamic, and the time stamp on the original may be later than on the copy, and the first copy spidered might be the one the search engines think is the oldest. It doesn’t appear to be perceived authority. It could have something to do with the number and quality of inbound and outbound links from a page. It could be a mix of all of those things and others.”

A Friend: “It’s still a good question.”

Me: “Yes it is. I’ll work on it.”

On Thursday morning…

Me, looking through new patent applications: “Sweet!”

Collapsing Equivalent Results

Thanks, Microsoft.

A new patent application published Thursday discusses some of the signals that may be used to determine which results to show and which to filter, possibly in Windows Live Search.

It may not include all of the signals being looked at – some of those might be trade secrets.

The practices at Google and Yahoo and Ask.com may be different.

But, all of the major search engines are striving to create good user experiences for people who search using their services. And all of them want to avoid duplicative results filling up the early spots on search results pages. The patent application does provide some insight into what search engines consider in choosing which pages to show, and which to hide.

I was surprised by a couple of the factors, and by the appearance of something I believe I’ve seen Matt Cutts refer to as “Pretty URLs.”

System and method for optimizing search results through equivalent results collapsing
Invented by Brett D. Brewer
Assigned to Microsoft
US Patent Application 20060248066
Published November 2, 2006
Filed: April 28, 2005

Abstract

A system and method are provided for optimizing a set of search results typically produced in response to a query. The method may include detecting whether two or more results access equivalent content and selecting a single user-preferred result from the two or more results that access equivalent content.

The method may additionally include creating a set of search results for display to a user, the set of search results including the single user-preferred result and excluding any other result that accesses the equivalent content. The system may include a duplication detection mechanism for detecting any results that access equivalent content and a user-preferred result selection mechanism for selecting one of the results that accesses the equivalent content as a user-preferred result.

The Duplicate Content Problem

1. A search engine finds documents that match queries and assigns them scores to determine the order in which they should be displayed.

2. Pages that may be very relevant as results may also be duplicates, or near duplicates, of each other.

3. Example: www.ymca.net and www.ymca.net/index.jsp lead to the same content with the first URL redirecting to the second one. And, www.ymca.com and www.ymca.com/index.jsp could be mirrors of www.ymca.net.

4. A search engine might include all four results in the top ten results of a search for the query “ymca”.

5. This is a bad user experience, because it keeps the searcher from seeing other results that might also be relevant, on the first page of results.
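
To make the effect concrete, here is a minimal sketch of collapsing equivalent results. It assumes each ranked result already carries some identifier for the content it leads to; the `content_id` values, the function name, and the extra example.org URL are all hypothetical, and the sketch simply keeps the first result it meets for each piece of content rather than choosing a user-preferred one the way the patent application describes.

```python
def collapse_equivalent(ranked_results):
    """Keep one result per piece of content, preserving rank order.

    `ranked_results` is a list of (url, content_id) pairs, best first.
    `content_id` is a hypothetical stand-in for whatever a search engine
    uses to decide that two results access equivalent content.
    """
    seen = set()
    collapsed = []
    for url, content_id in ranked_results:
        if content_id in seen:
            continue  # an equivalent result is already being shown
        seen.add(content_id)
        collapsed.append(url)
    return collapsed


ranked = [
    ("www.ymca.net", "ymca-home"),
    ("www.ymca.net/index.jsp", "ymca-home"),
    ("www.ymca.com", "ymca-home"),
    ("www.ymca.com/index.jsp", "ymca-home"),
    ("www.example.org/ymca-history", "history-page"),  # made-up extra result
]

print(collapse_equivalent(ranked))
# ['www.ymca.net', 'www.example.org/ymca-history']
```

Four of the five slots collapse into one, which frees up room on the first page for other results.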

Choosing One Result

The system described would include:

* A crawler that visits web pages, and indexes and stores results in an index/storage system.

* Ranking components that may rank located results in response to searchers’ queries.

* Results storage components which may have a cache for recently stored results and an index system for storage of additional results.

* A duplication detection mechanism which would detect results having duplicate content. A technique for detecting duplicates referenced in the patent application involves using “shingleprints” as described in another Microsoft U.S. patent application, Method for duplicate detection and suppression.

* A result selection module decides which result to display to searchers, regardless of whether shingleprints or other methods are used to determine which are duplicates.
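
The patent application points to “shingleprints” for the detection step without my covering the details here, so the following is only a rough sketch of the general idea behind shingling: break each page into overlapping word sequences and compare the sets. This is plain word-shingle Jaccard similarity, not the method from the Microsoft filing; the shingle size, the threshold, and the sample text are assumptions chosen for illustration.

```python
def shingles(text, k=3):
    """Return the set of overlapping k-word sequences in `text`."""
    words = text.lower().split()
    if len(words) < k:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Share of shingles two sets have in common (1.0 means identical)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def are_near_duplicates(text_a, text_b, k=3, threshold=0.8):
    """Treat two texts as near duplicates above an assumed threshold."""
    return jaccard(shingles(text_a, k), shingles(text_b, k)) >= threshold


page_a = "the ymca offers swimming lessons and fitness classes for all ages in your community"
page_b = "the ymca offers swimming lessons and fitness classes for all ages in your neighborhood"
page_c = "non-stick pans need gentle cleaning to protect the coating"

print(are_near_duplicates(page_a, page_b))  # True: only the last shingle differs
print(are_near_duplicates(page_a, page_c))  # False: no shingles in common
```

A production system would hash and sample the shingles rather than compare full sets, but the comparison above is enough to show why a mirror with a slightly different footer can still be caught.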

Result Selection Module

Some parts which may be included in the result selection module:

  • A query independent ranking component (something like PageRank, or a page quality score, or others, or combinations of all),
  • A result analysis component,
  • A navigation model selection mechanism,
  • A click through rate determination component,
  • A user-preferred result selection mechanism, and
  • Result storage.

Upon finding that results are duplicates, or very near duplicates, those results would be placed in Result Storage, but the search engine would not display them all.

The Result Selection Module would determine (through the result analysis component) which was the “user preferred selection” (via the user-preferred result selection mechanism) to show in response to the query.

A different URL might be chosen as the URL that the search engine actually uses to navigate to the page (chosen via the navigation model selection mechanism).
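
One way to picture the split between what is shown and what is linked is a small record holding both URLs. The sketch below is an assumption-laden illustration: the class name, the field names, and the redirect counts are mine, not the patent application's, and it supposes the redirect count for each duplicate is already known.

```python
from dataclasses import dataclass


@dataclass
class CollapsedResult:
    display_url: str     # the URL the searcher sees in the results
    navigation_url: str  # the URL the link actually points to


def pick_navigation_url(redirect_counts):
    """Return the URL that reaches the content with the fewest redirects."""
    return min(redirect_counts, key=redirect_counts.get)


# Assumed values: the bare domain redirects once, index.jsp is served directly.
redirect_counts = {
    "www.ymca.com": 1,
    "www.ymca.com/index.jsp": 0,
}

result = CollapsedResult(
    display_url="www.ymca.com",
    navigation_url=pick_navigation_url(redirect_counts),
)
print(result.display_url, "->", result.navigation_url)
# www.ymca.com -> www.ymca.com/index.jsp
```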

Some Factors the Results Analysis Component Might Consider

* Extension – .com might be a better choice than .net – it “appeals” to users because they understand it

* Shorter URLs – In the YMCA example above, the user-preferred version of the URL may be www.ymca.com both because “.com” is more common than “.net” and because the www.ymca.com URL is shorter than the two “index.jsp” results.

* The Navigational Model Selection might choose a different URL – while the searcher is shown www.ymca.com, the link might actually go to www.ymca.com/index.jsp, which is selected by the navigation model selection mechanism and is stored in the result storage area, in order to save the user a redirect. Eliminating redirects leads to the fastest result.

* The URL might contain keywords that appear in the query. In that case, the URL acts as a document summary. So, www.sfgiants.com might be a better choice than www.mlb.com/sf/id1223/xyx.com when the query is “sf giants”

* Searcher Location or language – A different duplicate might be chosen based upon where the person searching is from. So a London-based searcher might see www.example.co.uk where a New York searcher would get www.example.com

* Popularity – how well the page is linked to by other sites might be determined by the query independent ranking component.

* Click through rates might be tested, and the version of the URL with the highest rate may be determined by the click through rate determination component, acting upon the assumption that high click-through rates indicate that users find the result satisfactory.

* Fewest redirects – as determined by the navigation model.

The user-preferred result selection mechanism uses input from the query independent ranking component, the result analysis component, and the click through determination component to select a user-preferred result. (That sounds much better than the technical term I’ve seen Matt Cutts use regarding displayed URLs in results in the context of redirects – the “prettiest URL.”)
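
To show how a few of these signals could combine, here is a toy scoring sketch. Every weight in it is invented for illustration, the function names are hypothetical, and the patent application does not disclose a formula like this; it only lists the kinds of signals that may be considered. The sketch covers the extension, URL length, query words in the URL, and an optional click-through rate.

```python
def url_preference_score(url, query, click_through_rate=0.0):
    """Score one duplicate URL; higher means more likely to be shown.

    All weights below are made-up values for illustration only.
    """
    host, _, _path = url.partition("/")
    score = 0.0
    if host.endswith(".com"):
        score += 1.0                      # assumed bonus for the familiar extension
    score -= 0.01 * len(url)              # shorter URLs score slightly higher
    lowered = url.lower()
    for term in query.lower().split():
        if term in lowered:
            score += 0.5                  # query word appears in the URL
    score += click_through_rate           # observed rate, if any (0.0 to 1.0)
    return score


def select_user_preferred(urls, query, click_through_rates=None):
    """Pick the highest-scoring URL among a group of duplicates."""
    rates = click_through_rates or {}
    return max(urls, key=lambda u: url_preference_score(u, query, rates.get(u, 0.0)))


ymca = [
    "www.ymca.net",
    "www.ymca.net/index.jsp",
    "www.ymca.com",
    "www.ymca.com/index.jsp",
]
giants = ["www.mlb.com/sf/id1223/xyx.com", "www.sfgiants.com"]

print(select_user_preferred(ymca, "ymca"))         # www.ymca.com
print(select_user_preferred(giants, "sf giants"))  # www.sfgiants.com
```

With these made-up weights the two examples from the patent application come out the way it describes, but a strong enough click-through rate on another duplicate would change the winner, which fits the idea that no single signal decides.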

Conclusion

So, something like PageRank does matter when it comes to filtering equivalent results, as do searcher location, click-through rates, number of redirects, words used in URLs, length of URL, choice of TLD, and possibly other signals.

The other interesting thing here is that a search engine may display one URL for searchers, and use a different one for navigation – Pretty URLs for people, and more direct URLs to navigate to the page.


13 thoughts on “Microsoft Explains Duplicate Content Results Filtering”

  1. Bill,

    Great post, it gives some insight into what the engines may be using. However, it seems to leave more questions than answers… Typically, this is the way of the search world.

    I would like to hear your opinion on differing domains on the same topic. More specifically, what if your content is taken by a competing site with approximately the same value? Let’s just say: potsnpans.com & potsandpans.com. Same content on both, say an article on non-stick surfaces.

    What do you think it would come down to?

  2. Thanks.

    Sometimes finding some new, and maybe better questions can be useful.

    I think that different domains on the same topic would follow along with the examples above pretty well – ymca.com and ymca.net are different domains by virtue of the different tlds.

    Query independent ranking factors, and tested click through rates might make a difference here.

    At Microsoft, query independent factors might mean some type of link popularity measure, but it could also mean a page quality score measure along the lines of RankNet.

    If Google were using something like this, PageRank might be a factor. However, I’ve seen at least one instance where it didn’t seem to be the deciding factor, where a duplicate page with little to no PageRank was appearing in results, and a URL with a PageRank of 5 was being filtered.

    Was a different query independent factor an issue? Maybe.

    Could reranking measures get looked at first, even though they might be dependent upon a query, like a reranking based upon local interconnectivity? The factor above about a preferred country or language is a reranking of results. Maybe.

    But I’m feeling good about having a new set of questions to ask.

  3. Great post although that patent makes my eyes bleed.

    With all of the search engines tracking clickthrough on the URLs, it might lead to more optimization that SEOs have to do in the future. That is, they could have to start doing a bit of copywriting for their meta descriptions and titles.

    G-Man

  4. Thanks, G-Man.

    Sadly, that was one of the more clearly written and straight-forward patent applications I’ve read in a while. :(

    Copywriting should be attaining a larger role in SEO. I’ve seen a lot of page titles and snippets that wouldn’t be harmed by some rewriting. If “time spent on a page” and “distance scrolled down a page” are other measures that may be looked at now or in the future (and there’s some reason to believe that they might be), then the copy on a page is even more important now than it has been in the past.

  5. Nice one again.

    One of the things you may want to cover at the PubCon is whether the search engines will consider a copied article on a trusted website as an original, if the original article was on a lesser established website.

    For instance, I am now being re-published by the iEntry network (WebProNews, etc.) and pondering whether the search engines will trust the re-published posts more than mine. I’m thinking I’ll be getting more links to my own posts, but the trust of the domain plays a role, too, especially when the search engines may find the re-published post sooner than my own post.

    What do you think about it?

    Another interesting thing is the user-experience factors that affect the “User score”, or something. Click-throughs are one, but not an objective factor, I surmise. I remember reading somewhere about the user returning to the SERP after clicking on the result, but I think it was in relation to Google, if anything. Have you seen any info on this, or have I missed it in the RSS reader?

    Overall, nice overview, too :)

  6. Thanks. :)

    A very good question, Yuri. One that I’ve been considering for a while, especially in the context of syndicating articles and of the use of full RSS feeds that people may present on their pages using a server side script.

    I’m not sure that there’s a consistent rule applied across the board. Query independent factors and measured user behavior may make a difference in those instances.

    That’s part of the reason why this patent application is so interesting.

    As for the objectiveness of click-throughs, there was a paper earlier this year on eye-tracking co-authored by a Google employee, which was presented at a conference in San Diego. I haven’t read that one yet, but an earlier paper co-authored by the same person (and two of her co-authors) explores aspects of click throughs in Google results – Accurately Interpreting Clickthrough Data as Implicit Feedback

    The authors of that second document note that they identified at least two potential sources of bias:

    1. A “trust” bias – the more highly a page is ranked in results, the more likely someone will click upon that result, even if the abstract shown in the results is less relevant.

    2. A “quality” bias – The decision to click is based not only on the relevance of the abstract of a result, but also on the quality of the abstracts of other results. Because of that, they note:

    This shows that clicks have to be interpreted relative to the order of presentation and relative to the other abstracts.

    I don’t think that’s what you are referring to, but the paper is worth a look.

  7. Just wish I was going to the PubCon but not sure the wife would let me travel to the US from the UK without her going too!

    Would love to see your presentation, Bill, I don’t suppose anyone is going to film it and pop it on YouTube are they?

    Daz

  8. Hi Daz,

    Good to see you. Bring her along? :)

    There have been past pubcons in London, but it appears not this year. If there’s one next year, I might consider visiting for it. Be nice to meet you, and many of the people from the UK whom I’ve talked with on the forums but haven’t met in person.

    I hadn’t heard anything about filming any of the presentations. Though, if they filmed a few of them, and posted them on YouTube, that might inspire the growth of future Pubcons. It would be fun.

  9. I have been looking for source info like this for awhile. Thanks for the post. Looking forward to your presentation as well and picking your brain in the bar if possible. I’ll buy.

  10. Hi Bill

    Yeah I could but last time we went to Vegas (on our honeymoon almost a year ago now) she cleared me out…

    honey if you’re reading this I’m only joking

    Bill she won’t be reading this but just in case I am covering my backside here… LOL

    I do intend to visit one PubCon in the US and the UK at some time but not “knowing” that many people in the industry this side of the pond or the US tends to put me off…. I don’t want to look like those that chase the big guns in Web Marketing around the conference… but I don’t want to look like “Billy No Mates” either!

    Would be great to meet you in person at some point though :)

    Daz

  11. Would be nice to meet you, Daz.

    This is my first Pubcon, but I’ve met a number of folks at the Search Engine Strategies Conferences, and it is nice to be able to talk with folks in person whom you’ve only spoken with online before, and to meet new folks in the industry.

    I suspect that if they have a pubcon on the east coast next spring that I’ll be attending that one.
