Yahoo Research Looks at Templates and Search Engine Indexing

Over the past few years, there has been tremendous growth in the number of websites built on content management systems, including blogs, ecommerce shopping sites, wikis, and others. How might that affect how search engines index the pages of those sites?

A new Yahoo Research paper, Page-level Template Detection via Isotonic Smoothing (pdf), discusses some of the problems that arise when so many sites use templates, and describes a method for trying to determine whether a page uses a template. Here’s a snippet from the paper:

The increased use of content-management systems to generate webpages has significantly enriched the browsing experience of end users; the multitude of site navigation links, sidebars, copyright notices, and timestamps provide easy to access and often useful information to the users.

From an objective standpoint, however, these “template” structures pollute the content by digressing from the main topic of discourse of the webpage.

Furthermore, they can cripple the performance of many modules of search engines, including the index, ranking function, summarization, duplicate detection, etc.

With templated content currently constituting more than half of all HTML on the web and growing steadily, it is imperative that search engines develop scalable tools and techniques to reliably detect templates on a webpage.

Issues Around Templates

The paper focuses upon looking at the HTML underneath the pages, to learn how to identify features that might indicate a page is using a template.

One reason to do this might be to focus indexing on the “content” area of a page rather than on sections that repeat from page to page across a site.

Detecting templates could also help with duplicate detection during indexing. Pages on different sites that carry the same content, but have different template features such as navigation, header, and footer sections, might otherwise not be identified as duplicate content.

Two pages that have the same templated areas, but different main content might also be viewed as duplicates even though they possibly shouldn’t be.
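
To make these two failure modes concrete, here is a minimal Python sketch. It assumes duplicates are judged by comparing sets of word shingles with Jaccard similarity, which is a common approach but not necessarily the exact one any search engine uses. The function names, the shingle length of 3, and the sample strings are all hypothetical.

```python
def shingles(text, k=3):
    """Return the set of k-word shingles (overlapping word windows) in text."""
    words = text.lower().split()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets: size of intersection over size of union."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def similarity(page_a, page_b, k=3):
    """Shingle-based similarity between two pages given as plain text."""
    return jaccard(shingles(page_a, k), shingles(page_b, k))


# Hypothetical sample data: two site templates and two unrelated stories.
template_a = "home news sports weather contact us copyright site a all rights reserved"
template_b = "start world business opinion about this paper terms of use site b"
story = "the city council voted on tuesday to approve the new budget for public parks"
other_story = "a local bakery won a national award for its sourdough bread this weekend"

# Same story wrapped in different templates: the template shingles pull the
# score below 1.0 even though the main content is identical.
same_content = similarity(template_a + " " + story, template_b + " " + story)

# Different stories wrapped in the same template: the shared template shingles
# push the score above 0.0 even though the main content has nothing in common.
same_template = similarity(template_a + " " + story, template_a + " " + other_story)
```

If the templated text could be removed before shingling, the first pair would compare as identical and the second pair as unrelated, which is the outcome template detection is meant to make possible.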

Templates can make classification of the content of pages more difficult than it should be, because the classification of pages may take into account the content found in templated areas of pages. This is especially true when looking at more than one site with content that may be within the same category, but where the information from the templates is very different – say for instance a review of a camera on CNet and a review of the same camera on PCConnection.

Template Features

Some snippets of HTML and page features, such as navigation sidebars or copyright notices, that the researchers identified while collecting data for a training set (3,700 websites from the Yahoo! search engine index, each with at least 100 webpages) shared some common characteristics when they looked at things like:

  • Closeness to the margins of the webpage,
  • Number of links per word,
  • Fraction of text within anchors,
  • The size of the anchors,
  • Fraction of links that are intra-site, and
  • The ratio of visible characters to HTML content
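
A few of these features can be approximated from raw HTML with Python's standard library. The sketch below is only an illustration under stated assumptions: the class, function, and key names are made up here, it covers four of the six features (closeness to the margins and anchor size are left out), and it is not the feature extraction used in the paper.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class FeatureParser(HTMLParser):
    """Collects text, anchor text, and link targets from an HTML fragment.

    Simplification: all text nodes count as visible, including any
    <script> or <style> content.
    """

    def __init__(self):
        super().__init__()
        self.text_chars = 0     # characters of text, whitespace-normalized
        self.anchor_chars = 0   # characters of text found inside <a> tags
        self.words = 0
        self.hrefs = []
        self._anchor_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._anchor_depth += 1
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)

    def handle_endtag(self, tag):
        if tag == "a" and self._anchor_depth > 0:
            self._anchor_depth -= 1

    def handle_data(self, data):
        text = " ".join(data.split())
        if not text:
            return
        self.text_chars += len(text)
        self.words += len(text.split())
        if self._anchor_depth:
            self.anchor_chars += len(text)


def block_features(html, page_url):
    """Return a dict of rough template-indicating features for an HTML block."""
    parser = FeatureParser()
    parser.feed(html)
    parser.close()
    site = urlparse(page_url).netloc
    # A link is intra-site when it resolves to the same host as the page.
    intra = sum(
        1 for h in parser.hrefs
        if urlparse(urljoin(page_url, h)).netloc == site
    )
    n_links = len(parser.hrefs)
    return {
        "links_per_word": n_links / parser.words if parser.words else 0.0,
        "anchor_text_fraction": (
            parser.anchor_chars / parser.text_chars if parser.text_chars else 0.0
        ),
        "intra_site_link_fraction": intra / n_links if n_links else 0.0,
        "visible_to_html_ratio": parser.text_chars / len(html) if html else 0.0,
    }
```

Run on a navigation sidebar made up entirely of internal links, the first three values come out at their maximum of 1.0, while a paragraph of article text with a single outbound link scores much lower on each. That contrast is the kind of signal a template classifier could learn from.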

16 thoughts on “Yahoo Research Looks at Templates and Search Engine Indexing”

  1. Two pages that have the same templated areas, but different main content might also be viewed as duplicates even though they possibly shouldn’t be.

    Hmmm, are they saying that is a current problem, or that it is a problem they would need to overcome if they paid more attention to the HTML structure of a webpage?

    It would imply a much bigger reliance on the coding of a page for ranking than previously seen.

  2. Hi Adrian,

    It’s difficult to tell how they are handling duplicates from this paper, but they seem to be implying that if they did pay more attention to the structure of a page, and were able to understand what parts of pages were from a template, and what parts weren’t, that they would have a better handle determining which pages had duplicate content, and which pages didn’t.

    Yes, the coding of a page would be more important under an approach like this one.

  3. I have to admit the web designer in me would be perfectly happy if they could never sort this out and using templates led to duplicate content issues. Nice selling point for me and anyone else doing custom design.

    But realistically templates, themes, skins, and content management systems aren’t going away and there’s no reason they should given how easy they can make it to set up a site.

    One question when it comes to duplicate content is it was my impression search engines were stripping out the code prior to making the determination of what is and isn’t duplicate. This might indicate that they aren’t.

    I can see difficulty in understanding templated designs too since an automated approach might only be able to tell so much. Take SEO by the Sea. It’s a completely unique look to you and me, but will the underlying structure be very different from other 2 column WordPress themes?

    My own blog is on WordPress and I customized it to match the look of the rest of my site and so it’s unique, but still the underlying code isn’t going to be very far off from many other WordPress themes.

  4. Those are excellent points and questions, Steve.

    One question when it comes to duplicate content is it was my impression search engines were stripping out the code prior to making the determination of what is and isn’t duplicate. This might indicate that they aren’t.

    This paper, and another one from Google that I wrote about on the Cre8asite blog – New Google Paper on Near Duplicate Documents – raise some interesting questions about how the search engines are handling “near duplicate” pages. That paper, Detecting Near Duplicates for Web Crawling (pdf), describes some other approaches to detecting duplicate and near duplicate documents, and the difficulties in doing so in a timely manner.

    In this templated approach, understanding that some content may be better not looked at when making a decision regarding duplication issues or categorization is an interesting step, especially if it can be done without having to compare many pages on a site to decide if a template is in use.

    Also, keep in mind that search engines are exploring different parts of pages. I mentioned these above, but I’ll link to them – for instance Microsoft with VIPS, and Block Level Link Analysis, and Google with Document segmentation based on visual gaps. It’s good to see that Yahoo is exploring similar grounds in a different manner.

  5. Hi Bill, I got to your site through my bloglines feeds. Interesting article. The sites at my company all use templates, and this reminds me that yesterday I saw one of our sites in Google with most of the pages reported in the supplemental index. Could this be because the SE is looking at our pages as duplicate content?

    Is there a relation text/code then to determine duplicate content?

    Should nofollow tags be used for pages we don’t consider important to be indexed by SE to avoid duplicate content issues?

  6. Hi

    Having pages in a supplemental index for certain terms or phrases, and having pages that are being filtered in results because of duplication are different problems, though they can sometimes appear to be very similar – your pages don’t appear in searches for certain phrases at the top of the results.

    Pages that show up as supplemental results in searches are pages in a secondary index that Google maintains. One possible description of this supplemental index is in the Phrase Based Indexing patent applications that I’ve written about. See: Google Aiming at 100 Billion Pages?. Another might be in an extended index as described in a Google Patent that I wrote about in Google Patent on Extended Search Indexes.

    I’ve written a little about duplicate content issues. A page that is seen as a duplicate might still be in Google’s main index, or it may be in this secondary or supplemental index.

    Regardless of whether your problem is due to duplicate content or being placed in the supplemental index or both, there may be some things that you can do to make sure that those pages start appearing in responses to queries. Templates could possibly be part of the issue, but they aren’t necessarily to blame.

    Lots of sites use templates. If the content from the templates being used includes most of the content on the pages (so that your pages are extremely similar to each other), and the data being used to provide unique content to those pages involves only inserting a few words, then you could have a problem because of your use of templates.

    I don’t think that there’s an exact percentage of unique content that I can tell you that needs to appear on a page, but if most pages are very similar to each other then that’s a problem, templates or no templates.

    There are some great suggestions from Adam Lasnik on duplicate content over at the Google Blog – Deftly dealing with duplicate content.

    I believe that Adam, and Vanessa Fox, from Google, also suggested that one way to get pages out of the supplemental index is to get some links pointed to them.

    If you have the same, or extremely similar, page titles and meta description tags on each page, changing those so that they are unique and descriptive of the content of the pages they belong to may also be helpful when it comes to having pages come out of the supplemental index.

    If the content management system (CMS) that you are using (I’m assuming that you are using one because you are using templates) doesn’t enable you to have unique page titles and unique meta descriptions, you may want to have it changed so that it does. Some CMSs don’t come with those capabilities out of the box, but have had people write modifications that enable them to do that.

  7. Bill,

    I know that I am way behind on this post, I have spent most of the day catching up, but while reading this article, I specifically remember Shari Thurow speaking of “shingles,” and saying that the crawlers would only index the unique content on a given page from a domain when pages have a consistent template. This would give them the ability to minimize the amount of storage space and isolate the topic of each page.

    I will have to dig deeper to find my notes on this topic and the context. Great post, thanks.

    Cheers

  8. Hi Stephen,

    Good questions. The heart of your question is the phrase “consistent template.” It takes time and effort to compare all of the pages of a site, to see if they are using a “consistent template.” If you could reduce the time and effort by recognizing templated pages without having to do the comparisons, you may make duplication detection and classification of content use less time and resources.

    Since shingles are a little like fingerprints, in that they only take measurements from a certain fraction of the available information on a page to compare against other pages (or to compare snippets from pages against other snippets), being able to identify templates, even different kinds of templates on different domains, can be helpful.

    One problem that this paper attempts to address is when a search engine might be seeing two pages with the exact or almost exact same content but different templates as nonduplicates:

    The presence of templated content of webpages can foil duplicate detection algorithms whenever the shingling process retains shingles from the templated regions. Two pages that have absolutely the same content, say the exact same AP news story repeated across two different news websites, might be considered non-duplicates if the shingling process retains shingles from the template regions of the webpages as this portion of the webpages is different. This can lead to false negatives and cause us to return duplicate results for queries.

    One of the aims of this paper is to point out a way of eliminating, as much as possible, the need to compare pages of a site to identify templates, so that when a method like shingles is used to identify duplicates, the templated parts aren’t being compared.

  9. I think these issues are largely irrelevant now for many WordPress users due to the number of plugins now available to overcome duplicate content issues (i.e. ‘All in One SEO Pack’, canonical redirect plugins). I see this post was originally written in 2007 and I am sure even Yahoo has changed its algorithm since then to address this issue.

  10. Hi Carl,

    This paper is intended to cover a lot more sites than just WordPress blogs, including many different content management systems, homegrown ecommerce sites, static HTML pages, and more. There’s a lot of diversity on the Web, and even a lot of diversity in different WordPress templates too.

    There’s more to understanding whether or not a site is using a template than just overcoming duplicate content issues as well. For instance, if a search engine can visit a page and understand based upon the structure of that page which area contains the main content, it might give more weight to the main content section’s content than other words on the page when it indexes that content.

    While this paper was originally published in 2007, that doesn’t mean that the approach isn’t in use these days. Yahoo has been using Microsoft’s database for search, and we don’t know how much of Yahoo’s processes and techniques for crawling and indexing pages might be in use by Microsoft, but Microsoft has spent a lot of energy coming up with their own web page segmentation process as well. So even if they aren’t using this particular approach to segmenting the different types of content found on a page, there’s a chance that they are using something similar that was developed by Microsoft.

  11. I have never liked using a template, trying to keep it as simple as possible. My main reason has always been to make sure the reader’s attention will not stray from the content of the particular page. Seems like this attitude actually made it so much easier for Google and Yahoo to index my pages. I think one should also avoid using too many plugins, which may improve the look of a page, but will unnecessarily complicate the matter of indexing. I also have a number of webpages at Squidoo. They encourage using as many widgets as possible and I have always resisted this. My approach has always been that you cannot hide poor content with a lot of widgets and plugins. On the other hand too many widgets could obscure good content.

  12. Hi Jamie,

    Using some kind of template when you create a page, regardless of whether you are using a content management system (CMS) or not can make it a lot easier to author a page and have a consistent look and feel across the pages of your site. A similar masthead and heading/main navigation section and a consistent footer (with copyright notice, etc.) can make it look like the pages of a site are related to each other. Those don’t need to stand out in a way that draws attention from the main content of a page.

    I also don’t think that most plugins that present information in a sidebar or on a page will harm your main content from being indexed.

    But, the point of this approach from Yahoo is to try to understand what might be boilerplate on a page, and what might be the main content of that page.

    Using too many widgets and plugins and adblocks and so on may possibly harm the indexing of your pages if they harm the experience of visitors to a page, and as you note, “obscure good content.”

    But I don’t think that using a template or a CMS in itself will make it less likely that your content will be indexed. A template that is labeled well from a semantic perspective may make it easier for a search engine to index your main content. For instance, using a class named “header” for the heading section, and one named “footer” for the footer section might make it easier for a search engine to index more important content on your pages.
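
As a rough illustration of why well-labeled sections could help, here is a small Python sketch that keeps only the text outside elements whose class is “header” or “footer.” Everything in it is an assumption for the sake of the example: the class and function names are made up, it expects well-nested markup, and it is not a description of how any search engine actually processes pages.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not change nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class MainTextExtractor(HTMLParser):
    """Collects text that is not nested inside a boilerplate-labeled element."""

    def __init__(self, boilerplate_classes=("header", "footer")):
        super().__init__()
        self.boilerplate_classes = set(boilerplate_classes)
        self.skip_depth = 0   # > 0 while inside a header/footer element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        classes = (dict(attrs).get("class") or "").split()
        # Start skipping at a labeled element, and keep counting nested
        # elements so the matching close tag can be found again.
        if self.skip_depth or self.boilerplate_classes & set(classes):
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if not self.skip_depth and data.strip():
            self.chunks.append(" ".join(data.split()))


def main_text(html):
    """Return the page text with header/footer sections left out."""
    parser = MainTextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)
```

With clear class names the filtering is a simple lookup. Without them, a search engine would have to infer the same boundaries from other signals, which is the harder problem the Yahoo paper is working on.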
