A Yahoo Approach to Avoid Crawling Advertisement and Session Tracking Links

A newly published Yahoo patent application describes a couple of ways to filter out some of the URLs that it might crawl, to keep those pages from being indexed and presented to searchers.

Those URLs are referred to in the patent filing as “transient” links because they change from visit to visit, often because they are advertisements that have URLs with tracking codes included within them, or contain session IDs to track visitors.

An approach is provided for identifying transient links on a Web page. The approach ensures that transient links are not crawled and archived, thereby saving resources for crawling valid links leading to useful information.

Outgoing links on a web page are identified, and after a period of time, a new copy of the web page is obtained and the outgoing links identified. The respective sets of links are compared and links which do not appear in both sets of links are identified as transient.

Consecutive crawling to identify transient links
Invented by Dmitri Pavlovski, Vladimir Ofitserov, and Alexander Arsky
US Patent Application 20070226206
Published September 27, 2007
Filed: March 23, 2006

There are three major stages to how a search engine works. The first involves a search engine sending out programs that are commonly referred to as crawlers or spiders or robots. Those crawlers identify pages to be indexed on the Web, and the addresses of those pages in the form of URLs.

The other stages involve indexing information found on pages in a crawl, and the presentation of results found in that index in response to a query made by a searcher. If the crawling stage can become more efficient, then the other stages may have less work to do, and will also be more efficient.

Making Web Crawling More Efficient

How crawling programs from the major search engines actually work is something search engines often don’t share much about.

We have some hints, like a Stanford page listing resources used during the early stages of work upon Google, which included a document titled Efficient Crawling Through URL Ordering. That paper discusses how a search crawling program might prioritize which URLs a spider visits next when it finds addresses to documents while crawling a page.

The inventors of this Yahoo process describe some factors of a crawling process in the patent filing:

Web crawlers use a wide variety of crawl algorithms to determine the order in which Web pages are crawled. For example, a first-in-first-out by link approach may be used. With this approach, links are crawled based upon the order in which they are located on a Web page.

As another example, a “best first” approach may be used where the order in which links are to be crawled is selected based upon link relevancy, i.e., the links considered to be the more relevant are crawled before links that are considered to be less relevant.
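The two orderings quoted above can be sketched in a few lines of Python. This is only an illustration, not code from the patent filing: the function names and the `relevance` scoring callback are hypothetical, and a real crawler would compute relevance in its own way.

```python
from collections import deque
import heapq


def fifo_order(links):
    """Return links in the order they were found on the page (first in, first out)."""
    queue = deque(links)
    order = []
    while queue:
        order.append(queue.popleft())
    return order


def best_first_order(links, relevance):
    """Return links with the highest relevance score first.

    `relevance` is a hypothetical callback mapping a URL to a number.
    """
    # heapq is a min-heap, so the score is negated; the index keeps
    # page order for links that tie on score.
    heap = [(-relevance(url), i, url) for i, url in enumerate(links)]
    heapq.heapify(heap)
    order = []
    while heap:
        _, _, url = heapq.heappop(heap)
        order.append(url)
    return order
```

With made-up scores such as `{"/a": 0.2, "/b": 0.9, "/c": 0.5}`, the first function keeps page order while the second visits `/b`, then `/c`, then `/a`.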

They also tell us that it is pretty common for advertisers to include information within URLs that helps to identify users and track where those visitors are coming from. This kind of information may appear in the form of session IDs, tracking URLs, and other techniques that cause a URL to change from one visitor to another.

Because of the changes, if those URLs were indexed, the search engine’s index might contain a lot of pages at different URLs that were duplicates of each other or that shouldn’t have been crawled in the first place. We are told that:

Because the purpose of a Web crawler is to discover pages that contain useful information for web users, it would be inefficient and wasteful of resources to crawl and index every transient link whose only significance is being used as a unique tracking or session identifier.

The processes in this patent filing are aimed at avoiding those types of transient links.

Identifying Transient Links

On a web page, you may find text, links to other pages, and advertisements. Those links to other pages have URLs that point to pages with useful information to be crawled and archived. An advertisement might be an image with an embedded tracking URL. When a web crawling program follows the advertisement’s tracking URL, it is brought to another Web page, quite possibly located on a different Web server.

A crawler requests the web page from the server hosting it, and is provided the HTML from the page. It parses the HTML, extracts a list of all the URLs from the page, and stores them. After a minute or so, it issues a “refresh” command for a new copy of the page (the patent filing tells us that “while one minute has been found to give the best results, any length of time may be used”).

The refreshed copy of the page may differ from the first copy. It’s possible that the Web server may insert into the new copy a new advertisement with a new embedded tracking URL, replacing the old advertisement. The crawler makes another list of all URLs from the page, and stores that list.

The list of originally extracted URLs is compared with the newly extracted URLs. The URLs which were in the first crawl of the Web page but have disappeared in the second crawl of the Web page are considered transient, and not useful for crawling or inclusion in a searchable index.

In one embodiment, all links that appear in both of the consecutive crawls of the same page are marked as suitable for crawling and inclusion in an index, and are indeed crawled.
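The comparison of two consecutive crawls can be sketched with Python’s standard library. This is a minimal illustration, not Yahoo’s implementation: the names `LinkExtractor`, `extract_links` and `split_links` are made up for the example, the two HTML strings are assumed to have been fetched about a minute apart, and only `<a href>` links are considered. It follows the wording of the abstract, where any link that does not appear in both copies is treated as transient.

```python
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = set()

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.add(href)


def extract_links(html):
    """Return the set of outgoing link URLs found in an HTML string."""
    parser = LinkExtractor()
    parser.feed(html)
    return parser.links


def split_links(first_html, second_html):
    """Compare two copies of the same page.

    Returns (stable, transient): links present in both copies, and
    links that are missing from one of the two copies.
    """
    first = extract_links(first_html)
    second = extract_links(second_html)
    stable = first & second
    transient = first ^ second  # in one copy but not the other
    return stable, transient
```

If the two copies differ only in an advertisement’s tracking code, the navigation links land in `stable` and both versions of the tracking URL land in `transient`.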

Segmenting Pages to Make Future Comparisons Faster

Instead of comparing all links on future crawls of a page, it might be easier to only view parts of pages where transient links were found on previous crawls. The patent describes how it might break down the page into parts:

One approach for the identification of portions of HTML can be performed using Document Object Model Tree (DOM) decomposition. A DOM tree is a representation of a portion of HTML using a tree of HTML tags where group tags like <table> have sub-tree tags <tr> and in turn <tr> tags have leaf tags <td>.

In general, a DOM tree contains tags and their text and attributes. To identify transient links using fewer crawls of the page, the crawler can initially fetch a page several times, decompose the HTML comprising the page into a DOM tree, identify transient links and identify transient DOM sub-tree elements that contain only transient links.

When crawling the same page in the future, if the crawler discovers that page has a DOM tree identical to previously crawled instances, then the crawler may consider the new links originating from the same transient DOM sub-tree as transient without additional fetches of same page.
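A rough sketch of that idea follows, again using only the standard library. It is an approximation under several assumptions rather than the patent’s method: a “sub-tree” is represented by the path of enclosing tags (labelled with their `id` attribute when present), the page “shape” is just the ordered list of those paths, and all class and function names are hypothetical.

```python
from collections import defaultdict
from html.parser import HTMLParser

# Elements that never have a closing tag, so they are not kept on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class PathLinkParser(HTMLParser):
    """Records each link together with the path of tags that encloses it."""

    def __init__(self):
        super().__init__()
        self.stack = []   # labels of the currently open elements
        self.shape = []   # path of every element, in document order
        self.links = []   # (enclosing path, href) pairs

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        label = tag + ("#" + attrs["id"] if attrs.get("id") else "")
        path = tuple(self.stack)
        self.shape.append(path + (label,))
        if tag == "a" and attrs.get("href"):
            self.links.append((path, attrs["href"]))
        if tag not in VOID_TAGS:
            self.stack.append(label)

    def handle_endtag(self, tag):
        # Pop back to the nearest open element with this tag name, if any.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i].split("#")[0] == tag:
                del self.stack[i:]
                break


def parse(html):
    parser = PathLinkParser()
    parser.feed(html)
    return parser


def learn_transient_paths(first_html, second_html):
    """From two fetches, return the page shape and the paths holding only transient links."""
    first, second = parse(first_html), parse(second_html)
    stable = {h for _, h in first.links} & {h for _, h in second.links}
    by_path = defaultdict(list)
    for path, href in first.links + second.links:
        by_path[path].append(href)
    paths = {p for p, hrefs in by_path.items()
             if not any(h in stable for h in hrefs)}
    return tuple(first.shape), paths


def classify_links(html, known_shape, transient_paths):
    """Classify links from a single later fetch; None means the structure changed."""
    page = parse(html)
    if tuple(page.shape) != known_shape:
        return None  # fall back to consecutive fetches
    transient = {h for p, h in page.links if p in transient_paths}
    crawlable = {h for _, h in page.links} - transient
    return crawlable, transient
```

On a later crawl, a single fetch is enough as long as the page shape matches what was learned earlier; new links found under a path previously marked transient are skipped without a second request.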

This kind of segmentation of web pages isn’t unique to Yahoo.

Both Google and Microsoft have published patent filings and papers that describe how they might segment parts of web pages for different purposes. I wrote about some ways that Google might do something like that in Google and Document Segmentation Indexing for Local Search.

Microsoft has written about a few different approaches to segmentation of pages, and their most well known document on the subject is probably VIPS: a Vision-based Page Segmentation Algorithm (pdf).

Since many pages on a website share the same template, this kind of segmentation may help the crawler ignore transient links from the same area on other pages of the same site.

Identifying Sites that Are Frequent Targets of Transient Links

The URLs of the transient links may also be identified and collected, so that the crawler can ignore them in the future:

According to an embodiment, to reduce the number of consecutive fetches, a crawler can attempt to identify websites that are frequently used as targets of transient links.

An approach that can be used involves identification of transient links by using the techniques described above, and further aggregating all links by target websites and identifying websites for which most of the links are transient.

The crawler may later use a list of such websites to identify all future links to them as transient links without performing additional fetches of the same page.
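The aggregation step might look something like the sketch below. The patent filing only says “most of the links”, so the 0.5 `threshold` and the `min_links` cutoff are assumptions made for illustration, as are the function names; the target website is taken to be the URL’s hostname, and relative URLs are ignored.

```python
from collections import Counter
from urllib.parse import urlsplit


def transient_target_hosts(stable_links, transient_links,
                           threshold=0.5, min_links=1):
    """Return hosts for which more than `threshold` of the observed links were transient.

    `threshold` and `min_links` are illustrative assumptions, not
    values taken from the patent filing.
    """
    total = Counter()
    transient = Counter()
    for url in stable_links:
        host = urlsplit(url).hostname
        if host:
            total[host] += 1
    for url in transient_links:
        host = urlsplit(url).hostname
        if host:
            total[host] += 1
            transient[host] += 1
    return {host for host, count in total.items()
            if count >= min_links and transient[host] / count > threshold}


def is_transient_target(url, hosts):
    """True if the link points at a host already flagged as a transient target."""
    return urlsplit(url).hostname in hosts
```

Once a host is on the list, `is_transient_target` lets the crawler skip a new link to it without refetching the page the link was found on.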

I’m not sure if this would have any impact upon non-advertisement links to pages on sites that also use advertisements.


9 thoughts on “A Yahoo Approach to Avoid Crawling Advertisement and Session Tracking Links”

  1. The references to Yahoo spidering links by perceived relevance and duplicate DOM trees are pretty interesting. I noticed today that I am suffering heavily from duplicate content on the home page, and wondered if search engines used HTML structure at all to find duplicate content.

    It’s a shame that not so many web developers are keyed up on adding some blocking method to sites when tracking is used. Although I guess some of these URLs might be optimised by affiliates, so it could be a bad idea.

    PS – Lyndon recently wrote a post you might enjoy calling for more scholarly seo posts.

  2. Hi David,

    The hints in here about how Yahoo may be crawling pages were nice to see. Getting a glimpse into how a search engine may be doing something a little differently in an area like this is exciting.

    I did write about a method that Yahoo may be using to identify whether a single page might be using a template, so that the content areas might be indexed and checked for duplicate content a little differently than the templated areas. It’s at – Yahoo Research Looks at Templates and Search Engine Indexing. Not sure that it fits into the situation you are describing with your homepage, but it may be worth a look.

    There’s been a lot of discussion around the Web about the identification of paid links. While this doesn’t specifically get into that debate, it does show one easy way of determining that some links are likely paid for.

    Nice post/rant from Lyndon. Thanks for your kind words in the comments. :)

  3. This seems to be at odds with the notion that regular content updates encourage spiders to return more frequently – and makes me wonder what impact this could have on sites which use RSS feeds and other dynamic content rotation to ensure pages are fresh…

  4. Hi Jason,

    Interesting question. The approach in the patent application is an attempt to make crawling websites more efficient, by allowing search crawling spiders to make smarter decisions when choosing which URLs to follow.

    It appears, from some of the statements in the document, that some significant testing of this process has happened, and at one point they mention that it may work best when the second crawl is around a minute later.

    A site that regularly updates content possibly may not update content that quickly, but there are some sites that may see such quick changes (for instance, the front page of Digg) that wouldn’t work well with this process.

    The popular notion that you mention is something that probably happens – if a site updates at a very quick rate, and the search engine deems it to be an important enough site (through something like pagerank, or number of links pointing to it, or through traffic estimates based upon user activity as measured through toolbar or ISP logs or search results visits), it is likely that they will send spiders more often.

    I think that for most sites, and most links, this method will probably work fine, and won’t interfere with the rate of revisits of spiders to index new content. For a much smaller number of sites that update very rapidly, there may need to be some kind of special treatment so that links on rapidly updated sites aren’t confused with pages where the links are transient, as described in this patent filing.
