Solving Different URLs with Similar Text (DUST)

Different URLs, Similar Pages

There are sites where (substantially) the same page may be found under different Uniform Resource Locators (URLs) or addresses.

For example:

  • http://www.google.com/news = http://news.google.com/
  • http://www.nytimes.com = http://nytimes.com

When this happens, there can be some negative results from the perspectives of both search engines and site owners, such as:

  • Search engines have to spend time trying to visit each version of the page.
  • Search engines may treat each version as a separate page that duplicates the others.

It’s recommended that this type of duplication of pages under different addresses be avoided, if at all possible. Site owners can try to reduce or limit the possibility that these different URLs with the same (or very similar) content appear on their sites. What might search engines do to limit or stop this kind of problem?

A Possible Solution for Search Engines?

My examples are simple ones, but there are more complex situations where multiple addresses may exist for the same page. An algorithm to help search engines understand when the same (or a very similar) page is being exhibited under different URLs was the focus of a poster presented at the WWW2006 Conference this past May.

The extended abstract of that poster, Do not Crawl in the DUST: Different URLs with Similar Text, looks at some of the more complex versions, and describes an algorithm that might help search engines recognize those pages before visiting them, so that only one is crawled and possibly indexed. The authors are Uri Schonfeld, Ziv Bar-Yossef, and Idit Keidar. (Note: Ziv Bar-Yossef joined Google last month.)

Here’s a snippet from the introductory paragraphs to that document:

Many web sites define links, redirections, or aliases, such as allowing the tilde symbol (“~”) to replace a string like “/people”, or “/users”. Some sites allow different conventions for file extensions – “.htm” and “.html”; others allow for multiple default index file names – “index.html” and “default.html”. A single web server often has multiple DNS names, and any can be typed in the URL. As the above examples illustrate, DUST is typically not random, but rather stems from some general rules, which we call DUST rules, such as “~” → “/people”, or “/default.html” at the end of the URL can be omitted.

Moreover, DUST rules are typically not universal. Many are artifacts of a particular web server implementation. For example, URLs of dynamically generated pages often include parameters; which parameters impact the page’s content is up to the software that generates the pages. Some sites use their own conventions; for example, a forum site we studied allows accessing story number “num” on its site both via the URL “http://domain/story?id=num” and via “http://domain/story_num”. In this paper, we focus on detecting DUST rules within a given web site. We are not aware of any previous work tackling this problem.
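To make the idea of a DUST rule concrete, here is a minimal Python sketch of how a crawler might apply a few such rules to rewrite a URL into one preferred form before deciding whether to fetch it. The rule list, the choice of which side counts as "canonical", and the function name are my own assumptions for illustration; they are not taken from the poster.

```python
# Hypothetical rules in the spirit of the examples quoted above.
# Each rule is (alias substring, preferred substring); which side is
# "preferred" is an arbitrary choice made for this sketch.
DUST_RULES = [
    ("/story_", "/story?id="),
    ("/~", "/people/"),
]
# Assumed default index file name that may be omitted from the end of a URL.
DEFAULT_SUFFIX = "/default.html"


def canonicalize(url: str) -> str:
    """Rewrite a URL into one preferred form using simple substring rules."""
    for alias, preferred in DUST_RULES:
        url = url.replace(alias, preferred)
    if url.endswith(DEFAULT_SUFFIX):
        # Keep the trailing slash so the directory URL stays valid.
        url = url[: -len(DEFAULT_SUFFIX)] + "/"
    return url


# Two addresses for the same story collapse to a single URL.
seen = {canonicalize(u) for u in ("http://domain/story_42",
                                  "http://domain/story?id=42")}
print(seen)
```

Because the rules are site-specific, a real crawler would need a separate rule list per site, which is exactly why the authors focus on detecting rules within a given web site.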

Other pages that might be determined to be similar are ones where the main content is available at one URL, and the same content with some additional information (such as blog comments) can be seen at another URL.

Identifying DUST

The poster notes that search engines do attempt to identify DUST with some simple and some complex approaches, for example:

  1. “http://” may be added to links found during crawling, where it is missing.
  2. Trailing slashes used in links (http://www.example.com/) may be removed.
  3. Hash-based summaries of page content (shingles) may be compared after pages are fetched.
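The first two items on that list are easy to picture in code. The sketch below is only an illustration of those two normalizations using Python's standard library; the function name is made up, and it assumes the links are absolute addresses that may simply be missing their scheme (relative links would need different handling).

```python
from urllib.parse import urlsplit, urlunsplit


def normalize(link: str) -> str:
    """Apply two simple, site-independent normalizations to a link."""
    # 1. Add "http://" when the link has no scheme at all.
    if "://" not in link:
        link = "http://" + link
    parts = urlsplit(link)
    # 2. Drop trailing slashes so ".../news/" and ".../news" compare equal.
    path = parts.path.rstrip("/")
    return urlunsplit((parts.scheme, parts.netloc, path,
                       parts.query, parts.fragment))


print(normalize("www.example.com/"))
print(normalize("http://www.example.com/news/"))
```

Normalizations like these work on any site, but they cannot catch conventions that only one site follows.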

What the paper introduces is an algorithm, which the authors refer to as DustBuster, that looks at individual sites and tries to see if there are rules being followed on the site where similar content is being shown under different URLs.

For example, in the site where “story?id=” can be replaced by “story_”, we are likely to see in the URL list many different pairs of URLs that differ only in this substring; we say that such a pair of URLs is an instance of “story?id=” and “story_”. The set of all instances of a rule is called the rule’s support. Our first attempt to uncover DUST is therefore to seek rules that have large support.
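Here is a rough sketch of what counting support could look like. It is not the DustBuster implementation: it naively compares every pair of URLs (far too slow for a real URL list), strips their shared prefix and suffix, and treats the leftover substrings as a candidate rule. The function names and the sample URLs are assumptions made for the illustration.

```python
from collections import defaultdict
from itertools import combinations


def differing_substrings(u: str, v: str) -> tuple[str, str]:
    """Strip the longest common prefix and suffix; return what is left."""
    p = 0
    while p < min(len(u), len(v)) and u[p] == v[p]:
        p += 1
    s = 0
    # The suffix may not overlap the prefix already matched.
    while s < min(len(u), len(v)) - p and u[-1 - s] == v[-1 - s]:
        s += 1
    return u[p:len(u) - s], v[p:len(v) - s]


def rule_support(urls):
    """Map each candidate rule to the set of URL pairs that are instances of it."""
    support = defaultdict(set)
    for u, v in combinations(sorted(set(urls)), 2):
        a, b = differing_substrings(u, v)
        rule = (a, b) if a <= b else (b, a)  # treat the rule as undirected
        support[rule].add((u, v))
    return support


urls = [f"http://domain/story?id={n}" for n in (7, 8, 9)]
urls += [f"http://domain/story_{n}" for n in (7, 8, 9)]
support = rule_support(urls)
best = max(support, key=lambda rule: len(support[rule]))
print(best, len(support[best]))
```

On this toy list the pair of substrings “?id=” and “_” comes out with the largest support. Large support on its own says nothing about whether the paired pages really have similar text, so it can only be a first filter.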

It also tries to understand possible exceptions to those rules. The poster defines those in more detail, and it’s worth trying to understand the examples, exceptions, and approaches that they use.

Letting the Search Engine Decide Which URL is Good

I have one problem with the approach: the algorithm decides which pages to keep and index, and which to skip without ever fetching them for indexing.

This could be a problem, for instance, for a news story page which is available at different URLs, with one displaying comments and the other not showing them. Or a product page, which might be shown twice – once with, and once without user reviews. Or a set of dynamic pages where some small portion of the page changes in response to which link is clicked upon.

But those pages might have trouble being indexed anyway, or might be filtered out when search results are served, if a shingling approach is used and determines that they are the same or substantially similar pages.
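For readers who haven't seen shingling, here is a small Python sketch of the general idea: break each page's text into overlapping runs of words and measure how many runs two pages share. It is a simplification under my own assumptions (word-level shingles of length four, compared as plain sets rather than as hashed summaries), not a description of how any particular search engine does it.

```python
def shingles(text: str, k: int = 4) -> set[tuple[str, ...]]:
    """Return the set of overlapping k-word sequences in the text."""
    words = text.lower().split()
    if len(words) < k:
        # Very short texts become a single shingle (or none if empty).
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def resemblance(a: str, b: str, k: int = 4) -> float:
    """Jaccard similarity of the two texts' shingle sets (0.0 to 1.0)."""
    sa, sb = shingles(a, k), shingles(b, k)
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


story = "the quick brown fox jumps over the lazy dog"
with_comments = story + " great post thanks"
print(resemblance(story, story))
print(resemblance(story, with_comments))
```

A page and the same page with a few comments appended share most of their shingles, which is why the version with comments could end up treated as a near duplicate of the version without them.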

Either way, if an algorithm like DustBuster were used, or another approach, it’s still the search engine deciding which of the similar pages it might include in its index, and which it wouldn’t. If you can avoid DUST, it’s not a bad idea to try.

10 thoughts on “Solving Different URLs with Similar Text (DUST)”

  1. This is one of the age-old questions, though, isn’t it?

    I remember when you had to pay your Webhosting service an extra fee to set up the DNS record that resolved all w w w. whatever references to their www-less version (or vice versa).

    Once upon a time, most “technically savvy” people assumed you’d either use one variation or the other, and that was that. But then we got personalized home pages, which evolved into all sorts of clever scripts that could produce dynamic content on the fly.

    I think what they need is a CANONICAL statement for robots.txt that Webmasters can implement. Something like:

    Canonical: www .*.TLD !*.TLD

    The first term says all “ w w w . whatever ” URLs are canonical. The second term says anything lacking a “www.” is not canonical. The terms could easily be reversed, using the bang (!) character to denote undesired expressions. “regexp” die-hards might choke on that example, but I’m sure they could devise something cryptic to satisfy their bizarre rules.

    Canonical: !ID=*

    In this example, the expression would mean “ignore everything from ‘ID=’ onwards”.

    Canonical: topic=* . cat=*

    In this example, the expressions would mean “any term starting with these values can be used (in this order) to define a unique, persistent URL that does not require other elements you may find in the URL”.

    These may be crude examples, but if the search engines can ask Webmasters to implement something as non-standard as rel=nofollow, they can ask us to add a canonical nomenclature to our robots.txt files.

    I think that would make life a whole lot simpler for many people (except the guys who actually define the robots exclusion standard, but they have totally dropped the ball on this issue — we need to fire them).

  2. Some interesting points, Michael.

    I wish hosts wouldn’t automatically assume that you want both the “www” and “non-www” versions to point to a site, but unfortunately many of them do these days.

    A canonical statement there might not be a bad idea.

    It’s been more than 10 years since the development of the robots exclusion standard, and there haven’t been many changes to it. It still confuses some folks, but I think that is possibly because a few parts of it are poorly written – a little more plain English in the explanation of the standard wouldn’t hurt things.

    I don’t know if the folks who wrote the exclusion standard will want to step up and try to address these types of issues. It would be interesting to see if the search engines would.

  3. In fact, the Russian engine Yandex is already using something like this. The engine understands a “Host” directive for the “www” or “non-www” version in the robots.txt file. But there’s a problem: it takes about 1-6 months to teach the engine which version you prefer, because only one Yandex bot understands the directive the way you want. All the other bots of the engine understand the directive as “/” for the version that you don’t want displayed in the SERPs. So, using the directive, you may lose traffic, because all the pages with URLs you don’t want in the SE index will be out of it in 1-2 weeks. And you have to wait for months…
    Using a 301 redirect is not so useful, because you lose both Yandex rank (named “thematic index of citing”, or TIC; it depends on the theme and number of links, is important if the site is listed in the Yandex directory, and plays no role in SERP ranking) and link ranking (unfortunately I don’t know the English translation; I mean the beneficial effect of the text of inbound links on your site’s place in the SERPs).
    But you’re right! Exclusion standards are very, very old.

  4. Great comment – thanks!

    It is bad to have such different results for a directive from different bots from the same search engine. I can see that being a problem.

    The best translation for “link ranking” might be “link popularity.” That’s the term I use.

    Who would make new Exclusion Standards? The search engines? :)

  5. I think the greatest problem is the different SE opinions. Google lets you point to the main host via SiteMaps, Yandex (not important for you, but important for us) goes its own way… Yahoo, MSN… crawl delay, index/noindex, follow/nofollow… And we know nothing about Baidu :) All the “social search” projects, where people will decide which host to place in the SERPs…
    In fact, there are no standards at all. And it’s a PROBLEM.
    People in the SEO/SEM market know about it. But most webmasters don’t…
