Yahoo Research Looks at Templates and Search Engine Indexing

Over the past few years, there has been tremendous growth in the number of web sites built on content management systems, including blogs, ecommerce shopping sites, wikis, and others. How might that affect how search engines index the pages of those sites?

A new Yahoo Research paper, Page-level Template Detection via Isotonic Smoothing (pdf), discusses some of the problems that arise when so many sites use templates, and describes a method for detecting whether a page is using a template. Here’s a snippet from the paper:

The increased use of content-management systems to generate webpages has significantly enriched the browsing experience of end users; the multitude of site navigation links, sidebars, copyright notices, and timestamps provide easy to access and often useful information to the users.

From an objective standpoint, however, these “template” structures pollute the content by digressing from the main topic of discourse of the webpage.

Furthermore, they can cripple the performance of many modules of search engines, including the index, ranking function, summarization, duplicate detection, etc.

With templated content currently constituting more than half of all HTML on the web and growing steadily, it is imperative that search engines develop scalable tools and techniques to reliably detect templates on a webpage.

Issues Around Templates

The paper looks at the HTML underneath pages to learn which features might indicate that a page is using a template.

One reason to do this is so that a search engine can focus on indexing the “content” area of a page rather than the sections that repeat from page to page across a site.

Templates also cause a problem for duplicate detection: pages on different sites that carry the same content, but different template features such as navigation, header, and footer sections, might not be identified as duplicate content.

Two pages that have the same templated areas, but different main content might also be viewed as duplicates even though they possibly shouldn’t be.

Templates can make classification of the content of pages more difficult than it should be, because the classification may take into account the content found in templated areas of pages. This is especially true when looking at more than one site with content that may be within the same category, but where the information from the templates is very different. Take, for instance, a review of a camera on CNet and a review of the same camera on PCConnection.

Template Features

While collecting a training set from a number of web sites (3,700 websites from the Yahoo! search engine index, each with at least 100 webpages), the researchers found that snippets of HTML such as navigation sidebars and copyright notices shared some common characteristics when they looked at things like:

  • Closeness to the margins of the webpage,
  • Number of links per word,
  • Fraction of text within anchors,
  • The size of the anchors,
  • Fraction of links that are intra-site, and
  • The ratio of visible characters to HTML content
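
To make the link and text features in the list above more concrete, here is a minimal Python sketch that computes a few of them for a fragment of HTML. It is an illustration only, not the method from the paper: the class and function names (`BlockFeatureParser`, `block_features`) and the feature names in the returned dictionary are my own assumptions. Closeness to the page margins is left out because it requires rendering the page, and the sketch does not filter out script or style text.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class BlockFeatureParser(HTMLParser):
    """Collects the raw counts needed for a few template-like features."""

    def __init__(self):
        super().__init__()
        self.anchor_depth = 0   # greater than 0 while inside an <a> element
        self.hrefs = []         # href value of every link seen
        self.text_chars = 0     # visible characters overall
        self.anchor_chars = 0   # visible characters inside anchors
        self.words = 0          # whitespace-separated words

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.anchor_depth += 1
            self.hrefs.append(dict(attrs).get("href") or "")

    def handle_endtag(self, tag):
        if tag == "a" and self.anchor_depth > 0:
            self.anchor_depth -= 1

    def handle_data(self, data):
        tokens = data.split()
        if not tokens:
            return
        chars = len(" ".join(tokens))  # collapse runs of whitespace
        self.words += len(tokens)
        self.text_chars += chars
        if self.anchor_depth > 0:
            self.anchor_chars += chars


def block_features(html, page_url):
    """Return a dict of hypothetical feature names for one HTML fragment."""
    parser = BlockFeatureParser()
    parser.feed(html)
    parser.close()

    site = urlparse(page_url).netloc
    links = len(parser.hrefs)
    # A link is intra-site when it resolves to the same host as the page.
    intra = sum(
        1 for href in parser.hrefs
        if urlparse(urljoin(page_url, href)).netloc == site
    )
    return {
        "links_per_word": links / parser.words if parser.words else 0.0,
        "anchor_text_fraction": (
            parser.anchor_chars / parser.text_chars if parser.text_chars else 0.0
        ),
        "avg_anchor_chars": parser.anchor_chars / links if links else 0.0,
        "intra_site_link_fraction": intra / links if links else 0.0,
        "visible_to_html_ratio": parser.text_chars / len(html) if html else 0.0,
    }


nav = '<ul><li><a href="/about">About</a></li><li><a href="/contact">Contact us</a></li></ul>'
body = '<p>This camera takes sharp photos. See the <a href="http://other.example.org/specs">specs</a>.</p>'
nav_features = block_features(nav, "http://example.com/post")
body_features = block_features(body, "http://example.com/post")
```

In this toy example the navigation fragment is all anchor text with only intra-site links, while the body paragraph has far fewer links per word, which is the kind of contrast the features are meant to capture.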

14 comments to Yahoo Research Looks at Templates and Search Engine Indexing

  • Two pages that have the same templated areas, but different main content might also be viewed as duplicates even though they possibly shouldn’t be.

    Hmmm, are they saying that is a current problem, or that it is a problem they would need to overcome if they paid more attention to the HTML structure of a webpage?

    It would imply a much bigger reliance on the coding of a page for ranking than previously seen.

  • Hi Adrian,

    It’s difficult to tell how they are handling duplicates from this paper, but they seem to be implying that if they did pay more attention to the structure of a page, and were able to understand what parts of pages were from a template, and what parts weren’t, that they would have a better handle determining which pages had duplicate content, and which pages didn’t.

    Yes, the coding of a page would be more important under an approach like this one.

  • Is this something new? I’ve thought all the search engines were already pretty competent at figuring out what are navigational (and therefore templated) elements.

  • There are ways that search engines can attempt to identify whether a page is using a template, such as looking at a number of pages on the site that the page appears upon, and seeing that elements on the page are consistent across those pages.

    This approach aims at being able to understand whether a page uses a template without having to make that analysis across multiple pages.

    There’s some overlap with attempting to perform a block level analysis of a page which Microsoft has a number of papers about, or a visual segmentation of a page which Google has described in a patent application and a paper or two. Those approaches will work regardless of whether pages of a site use a template or not, and might do some similar types of analysis to try to understand what different parts of a page are doing.
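
The multi-page comparison mentioned in this comment can be sketched roughly as follows. This is a simplified illustration under my own assumptions, not something taken from the Yahoo paper: each page is assumed to be already split into text blocks, and the function name `repeated_blocks` and the `min_fraction` threshold are hypothetical.

```python
from collections import Counter


def repeated_blocks(pages, min_fraction=0.8):
    """Return the text blocks that appear on at least min_fraction of pages.

    pages is a list of pages, each given as a list of text blocks.
    """
    counts = Counter()
    for blocks in pages:
        counts.update(set(blocks))  # count each block once per page
    threshold = min_fraction * len(pages)
    return {block for block, n in counts.items() if n >= threshold}


site_pages = [
    ["Home | About", "Review of camera X", "Copyright Example Site"],
    ["Home | About", "Review of lens Y", "Copyright Example Site"],
    ["Home | About", "Tripod buying guide", "Copyright Example Site"],
]
template_blocks = repeated_blocks(site_pages)
```

The cost of this kind of check is that several pages from the same site have to be fetched and compared before any one of them can be judged, which is the step a page-level approach tries to avoid.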

  • I have to admit the web designer in me would be perfectly happy if they could never sort this out and using templates led to duplicate content issues. Nice selling point for me and anyone else doing custom design.

    But realistically templates, themes, skins, and content management systems aren’t going away and there’s no reason they should given how easy they can make it to set up a site.

    One question when it comes to duplicate content is it was my impression search engines were stripping out the code prior to making the determination of what is and isn’t duplicate. This might indicate that they aren’t.

    I can see difficulty in understanding templated designs too, since an automated approach might only be able to tell so much. Take SEO by the Sea. It’s a completely unique look to you and me, but will the underlying structure be very different from other two-column WordPress themes?

    My own blog is on WordPress and I customized it to match the look of the rest of my site and so it’s unique, but still the underlying code isn’t going to be very far off from many other WordPress themes.

  • Those are excellent points and questions, Steve.

    One question when it comes to duplicate content is it was my impression search engines were stripping out the code prior to making the determination of what is and isn’t duplicate. This might indicate that they aren’t.

    This paper, and another one from Google that I wrote about on the Cre8asite blog (New Google Paper on Near Duplicate Documents), raise some interesting questions about how the search engines are handling “near duplicate” pages. That paper, Detecting Near Duplicates for Web Crawling (pdf), describes some other approaches to detecting duplicate and near duplicate documents, and the difficulties in doing so in a timely manner.

    In this templated approach, understanding that some content may be better not looked at when making a decision regarding duplication issues or categorization is an interesting step, especially if it can be done without having to compare many pages on a site to decide if a template is in use.

    Also, keep in mind that search engines are exploring different parts of pages. I mentioned these above, but I’ll link to them – for instance Microsoft with VIPS, and Block Level Link Analysis, and Google with Document segmentation based on visual gaps. It’s good to see that Yahoo is exploring similar grounds in a different manner.

  • Beginners Guide to SEO

    Hi Bill, I got to your site through my bloglines feeds. Interesting article. The sites at my Company all use templates, and this reminds me that yesterday I saw one of our sites in Google with most of the pages reported in the supplemental index. Could this be because the SE is looking at our pages as duplicate content?

    Is there a relation text/code then to determine duplicate content?

    Should nofollow tags be used for pages we don’t consider important to be indexed by SE to avoid duplicate content issues?

  • Hi

    Having pages in a supplemental index for certain terms or phrases, and having pages that are being filtered in results because of duplication are different problems, though they can sometimes appear to be very similar – your pages don’t appear in searches for certain phrases at the top of the results.

    Pages that show up as supplemental results for searches are pages that are in a secondary index that Google maintains. One possible description of this supplemental index is in the Phrase Based Indexing patent applications that I’ve written about. See: Google Aiming at 100 Billion Pages?. Another might be an extended index, as described in a Google patent that I wrote about in Google Patent on Extended Search Indexes.

    I’ve written a little about duplicate content issues. A page that is seen as a duplicate might still be in Google’s main index, or it may be in this secondary or supplemental index.

    Regardless of whether your problem is due to duplicate content or being placed in the supplemental index or both, there may be some things that you can do to make sure that those pages start appearing in responses to queries. Templates could possibly be part of the issue, but they aren’t necessarily to blame.

    Lots of sites use templates. If the content from the templates makes up most of the content on the pages (so that your pages are extremely similar to each other), and the data used to provide unique content to those pages amounts to inserting only a few words, then you could have a problem because of your use of templates.

    I don’t think that there’s an exact percentage of unique content that I can tell you that needs to appear on a page, but if most pages are very similar to each other then that’s a problem, templates or no templates.

    There are some great suggestions from Adam Lasnik on duplicate content over at the Google Blog – Deftly dealing with duplicate content.

    I believe that Adam, and Vanessa Fox, from Google, also suggested that one way to get pages out of the supplemental index is to get some links pointed to them.

    If you have the same, or extremely similar page titles, and meta description tags on each page, changing those so that they are unique, and descriptive of the content of the pages that they are titles and descriptions to may also be helpful when it comes to having pages come out of the supplemental index.

    If the content management system (CMS) that you are using (I’m assuming that you are using one because you are using templates), doesn’t enable you to have unique page titles and unique meta descriptions, you may want to have it changed so that it does. Some CMS’ don’t come with those capabilities out of the box, but have had people write modifications that enable them to do that.

  • [...] * Advanced SEO topics for real this time: Bill had a couple of posts that caught my attention this week. The first is on clustering users for personalized search which is something I really think Google is going to move towards. Secondly, Yahoo research posted a paper on detecting templates inside a web page. I went through this because it could also give you a little better idea on how they detect paid link footprints and so forth. [...]

  • Bill,

    I know that I am way behind on this post, I have spent most of the day catching up, but while reading this article, I specifically remember Shari Thurow speaking of “shingles,” and saying that the crawlers would only index the unique content on a given page from a domain when pages have a consistent template. This would give them the ability to minimize the amount of storage space and isolate the topic of each page.

    I will have to dig deeper to find my notes on this topic and the context. Great post, thanks.

    Cheers

  • Hi Stephen,

    Good questions. The heart of your question is the phrase “consistent template.” It takes time and effort to compare all of the pages of a site, to see if they are using a “consistent template.” If you could reduce the time and effort by recognizing templated pages without having to do the comparisons, you may make duplication detection and classification of content use less time and resources.

    Since shingles are a little like fingerprints, in that they only take measurements from a certain fraction of the available information on a page to compare against other pages (or to compare snippets from pages against other snippets), being able to identify templates, even different kinds of templates on different domains, can be helpful.

    One problem that this paper attempts to address is when a search engine might be seeing two pages with the exact or almost exact same content but different templates as nonduplicates:

    The presence of templated content of webpages can foil duplicate detection algorithms whenever the shingling process retains shingles from the templated regions. Two pages that have absolutely the same content, say the exact same AP news story repeated across two different news websites, might be considered non-duplicates if the shingling process retains shingles from the template regions of the webpages as this portion of the webpages is different. This can lead to false negatives and cause us to return duplicate results for queries.

    One of the aims of this paper is to point out a way of eliminating, as much as possible, the need to compare pages of a site to identify templates, so that when a method like shingles is used to identify duplicates, the templated parts aren’t being compared.
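
The false-negative problem described in the quote above can be illustrated with a small Python sketch. It is a simplification under stated assumptions: it keeps every word-level shingle and compares the sets with Jaccard similarity, whereas production systems typically hash and sample shingles. The function names (`shingles`, `jaccard`) and the sample text are made up for the example.

```python
def shingles(text, k=3):
    """Return the set of k-word shingles (word k-grams) in text."""
    words = text.lower().split()
    if len(words) < k:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two shingle sets (1.0 means identical sets)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


# The same story published on two sites with different template text.
story = "the city council voted on tuesday to approve the new budget for road repairs"
site_a = "home news sports weather contact us " + story + " copyright site a all rights reserved"
site_b = "top stories world business opinion subscribe " + story + " terms of use privacy policy site b"

with_templates = jaccard(shingles(site_a), shingles(site_b))
content_only = jaccard(shingles(story), shingles(story))
```

With the template text included, the two pages share only part of their shingles, so the similarity falls below the perfect score that the story text alone produces. Removing the templated regions before shingling is what lets the identical stories be recognized as duplicates.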

  • [...] I’ll be keeping an eye open for those last two patent applications. The approach involving templates has me wondering how much it might be like a recent Yahoo patent application involving understanding whether a page is using a template or not, without looking at other pages on the same site to see if there is a pattern that would indicate the use of templates. [...]

  • [...] Yahoo Research Looks at Templates and Search Engine Indexing [...]

  • [...] paper that I wrote about from Yahoo in my post Yahoo Research Looks at Templates and Search Engine Indexing explored how Yahoo might look at the features found on a page to see if that page was using a [...]
