Computer programmers will sometimes use the term “boilerplate” to refer to standard stock code that they often insert into programs. Lawyers use legal boilerplate in contracts – often the small print on the back of a contract that doesn’t change regardless of what the contract is about.
A lot of web pages and documents reuse the same text in sidebars and footers, like copyright notices and navigation links.
It might be a good step for a search engine to ignore boilerplate text when indexing pages, or when using the content of pages to create query suggestions for someone using personalized desktop search. Ignoring text repeated across documents could also be helpful when using those documents to rerank search results in personalized search.
New York Times Boilerplate
Is Google ignoring boilerplate on pages when it indexes those pages and tries to understand what the pages are about? Does it disregard the words in your copyright notice, or the use of “home” in your link to your homepage?
Do the words appearing as anchor text in links to other blogs in your blogroll get ignored when the search engine tries to understand what one of your blog posts is about?
Wikipedia Boilerplate
It’s difficult to tell how much attention Google might pay to your copyright notice, or an introductory blurb or disclaimer that might appear on all of your pages. If Google is paying attention to those words now, it might not pay as much attention to them in the future.
Google’s Next Generation Search Engine?
Google’s next generation search engine may look a little like a hybrid of their present Web search and their desktop and intranet search, with a number of additional features. I wrote about two patent applications that seem to be part of that search in Google on Desktop Search and Personal Information Management.
I also noted in that post that there are at least 50 patent applications in total that may be part of that future search engine, which are listed as “related applications” in a patent application published at WIPO titled Methods and Systems for Information Capture and Retrieval.
Many of those patent applications were originally filed with the US Patent and Trademark Office in 2003 and 2004. Google may follow the direction they describe, or go another way completely, but many of the ideas behind them may make their way into whatever path Google takes.
Google and Boilerplate
I’m keeping an eye open for the publication of those 50 patent applications. One of them came out this week, focusing on ignoring boilerplate, which could be something useful for today’s Google. The patent filing is:
Systems and methods for analyzing boilerplate
Invented by Stephen R. Lawrence
US Patent Application 20080040316
Published February 14, 2008
Filed March 31, 2004
Abstract
Systems and methods for analyzing boilerplate are described. In one described system, an indexer identifies a common element in a plurality of related articles. The indexer then classifies the common element as boilerplate. For example, the indexer may identify a copyright notice appearing in a plurality of related articles. The copyright notice in these articles is considered boilerplate.
Text in documents (web pages, Word documents, PDFs, and so on) on your hard drive, in your browser cache, or in your web surfing history or favorites might be used to create queries based upon what you’ve been doing recently with those documents. Those queries might be shown in a sidebox on your computer screen, as an information resource that you can use if you want to find out more about the topic that you are writing about, reading, or browsing.
Or that document information might be used to rerank search results when you perform a search, to surface material related to what you were recently doing on your computer, if that might be helpful.
Boilerplate language could be identified by the search engine in a few different ways when it looks at the text or other elements on a page. An example from the patent application is that “any text following the word ‘copyright’ is boilerplate.”
Other types of boilerplate might include navigational text, disclaimers, and text that appears on every page of a web site.
Important Terms and Concepts
There are two different types of queries that may be used by this search system, which looks at recently used and viewed pages to grab keywords for searches:
Implicit queries – the indexing program looks for boilerplate elements and content elements on pages, and creates an implicit search query built from terms found in the content area.
Explicit queries – the query system might remove or downweight boilerplate when someone performs a search.
With both implicit and explicit queries, actual content is given higher weight than boilerplate language. An article might be indexed only after the boilerplate has been removed, which would mean that only the non-boilerplate language influences those queries. A rough sketch of this weighting appears just below.
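To make that weighting concrete, here’s a minimal sketch in Python of how an implicit query might be built from a recently viewed document, with terms from areas flagged as boilerplate given far less say. The term lists, weights, and function names are my own assumptions for illustration; the filing doesn’t spell out an algorithm at this level.

```python
from collections import Counter

# Assumed weights: the filing only says content gets "higher weight"
# than boilerplate, not by how much.
CONTENT_WEIGHT = 1.0
BOILERPLATE_WEIGHT = 0.1

def implicit_query(content_terms, boilerplate_terms, num_terms=3):
    """Build an implicit query from a viewed document by picking the
    highest-weighted terms, downweighting boilerplate terms heavily."""
    scores = Counter()
    for term in content_terms:
        scores[term.lower()] += CONTENT_WEIGHT
    for term in boilerplate_terms:
        scores[term.lower()] += BOILERPLATE_WEIGHT
    return [term for term, _ in scores.most_common(num_terms)]

# A blog post about telescopes with a copyright/navigation footer:
content = ["telescope", "aperture", "telescope", "eyepiece"]
boiler = ["copyright", "2004", "home", "contact", "home"]
print(implicit_query(content, boiler))
# ['telescope', 'aperture', 'eyepiece'] - the footer terms never surface
```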
Boilerplate – examples include headers, footers, and navigational elements that may occur on multiple articles. This kind of language could be identified by analyzing a number of related articles, such as multiple web pages within a web site. Boilerplate might also be identified by analyzing a single article.
Identifying boilerplate – the indexer may identify a boilerplate element in a few different ways. One might be to analyze the frequency of terms and phrases in a number of related articles to identify common elements. The indexer could then classify the common elements as boilerplate. For example, a phrase like “Copyright 2004,” appearing in a number of related articles could be seen as boilerplate.
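As a rough illustration of that frequency analysis, the sketch below flags phrases that appear on most pages of a site as candidate boilerplate. The threshold and data structures are assumptions of mine, not anything specified in the filing.

```python
from collections import Counter

def find_common_elements(pages, threshold=0.8):
    """Treat phrases appearing on at least `threshold` of a set of
    related pages (e.g., one site's pages) as candidate boilerplate."""
    counts = Counter()
    for phrases in pages:
        counts.update(set(phrases))  # count each phrase once per page
    cutoff = threshold * len(pages)
    return {phrase for phrase, c in counts.items() if c >= cutoff}

site_pages = [
    {"copyright 2004", "acme corp", "widget reviews"},
    {"copyright 2004", "acme corp", "contact form"},
    {"copyright 2004", "acme corp", "about our widgets"},
]
print(find_common_elements(site_pages))
# {'copyright 2004', 'acme corp'} - the page-specific phrases survive
```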
Spatial location of terms or phrases – common terms or phrases often occur at particular positions in articles, and those might be boilerplate. For example, a common term found at the bottom of an article might be a copyright notice.
Navigational elements as boilerplate – common phrases occurring at the same places at the top, left, or right of an article could be navigational elements. On a web site, navigational elements are links letting visitors go to certain sections of the site, such as the home page, a help page, and other pages on the site.
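A crude positional heuristic covering both of the last two ideas might look like this sketch; the 10% and 15% bands are invented for illustration, since the filing doesn’t give numbers.

```python
def position_hint(term_y, page_height, term_x=None, page_width=None):
    """Guess at a common term's role from where it sits on the page:
    the bottom band suggests a copyright/footer notice, the top band
    or the left/right edges suggest navigation."""
    rel_y = term_y / page_height
    if rel_y >= 0.9:
        return "possible footer/copyright boilerplate"
    if rel_y <= 0.1:
        return "possible top navigation"
    if term_x is not None and page_width is not None:
        rel_x = term_x / page_width
        if rel_x <= 0.15 or rel_x >= 0.85:
            return "possible side navigation"
    return "likely content"

print(position_hint(term_y=950, page_height=1000))
# possible footer/copyright boilerplate
print(position_hint(term_y=500, page_height=1000, term_x=50, page_width=1000))
# possible side navigation
```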
Article markup indicating boilerplate – HTML markup around common terms on pages might indicate that those terms are boilerplate. One example is JavaScript used in navigational links to change the appearance of those links when someone moves a mouse pointer over them. A score might be determined for common terms that have markup near them, with different weights assigned for different kinds of markup – so words in JavaScript links might be given a higher boilerplate weight than words in bold or italic HTML elements.
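Here’s one way such markup-based scoring could work, sketched in Python. The particular weights are pure guesses on my part; the filing says only that different kinds of markup might carry different weights.

```python
# Guessed weights for markup found near a common term; higher means
# more likely boilerplate. The filing doesn't publish actual values.
MARKUP_WEIGHTS = {
    "javascript_rollover_link": 0.9,  # nav links with rollover effects
    "plain_link": 0.5,
    "bold": 0.2,
    "italic": 0.2,
}

def boilerplate_score(markup_near_term):
    """Score a common term by the strongest markup signal near it."""
    return max((MARKUP_WEIGHTS.get(m, 0.0) for m in markup_near_term),
               default=0.0)

print(boilerplate_score(["javascript_rollover_link"]))  # 0.9
print(boilerplate_score(["bold"]))                      # 0.2
```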
Some markup could reliably identify boilerplate even when looking at just one article, instead of checking whether the boilerplate appears on more than one page of the same site. For example, links are markup code, and links going to the home page of a site or to a page ending in “help.html” or “copyright.html” could be considered boilerplate.
Predetermined terms and phrases as boilerplate – boilerplate could also be identified based on a predetermined list of terms and phrases. For example, common navigational or legal terms, such as “Home”, “Help”, “Terms of Service”, and “Copyright” may be used on a page, and the sections of the page where those appear might be considered boilerplate. The sections may be sentences or paragraphs. Text appearing in those areas might not be considered boilerplate on other pages where it appears without the predetermined terms.
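A check against a predetermined list might look like the sketch below; the list itself is a tiny illustrative stand-in, since the filing names example terms but no actual list.

```python
# A tiny stand-in for the predetermined list the filing describes.
PREDETERMINED = {"home", "help", "terms of service", "copyright"}

def flag_boilerplate_sections(sections):
    """Flag a sentence or paragraph as boilerplate if it contains a
    predetermined term; identical text without one stays unflagged."""
    return [(text, any(term in text.lower() for term in PREDETERMINED))
            for text in sections]

sections = ["Copyright 2004 Acme Corp.", "Acme makes fine telescopes."]
for text, is_boilerplate in flag_boilerplate_sections(sections):
    print(is_boilerplate, "-", text)
# True - Copyright 2004 Acme Corp.
# False - Acme makes fine telescopes.
```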
Frequently indexed terms as boilerplate – terms that appear often in many different articles might be more likely to be considered boilerplate than terms that rarely appear. Examples of terms like that are “home” and “contact us.” These terms appear very frequently as hyperlinks on many pages available on the Web.
Common terms and phrases are sometimes not boilerplate – even though some phrases may occur on multiple related pages, that frequency of use is not always an indication that the phrase is boilerplate. For example, a site about astronomy may include the term “astronomy” on many or all pages, where that term is relevant and important.
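One way to reconcile those last two points is to compare a term’s frequency on the site with its frequency across the whole index: “home” is common everywhere, while “astronomy” is common only on the astronomy site. This sketch uses made-up frequencies and thresholds purely for illustration.

```python
def likely_boilerplate(site_df, site_pages, global_df, global_pages,
                       site_cut=0.8, global_cut=0.5):
    """Treat a term common on one site as boilerplate only if it is
    also common across the whole index; thresholds are assumptions."""
    common_on_site = site_df / site_pages >= site_cut
    common_globally = global_df / global_pages >= global_cut
    return common_on_site and common_globally

# "home" appears on 95 of a site's 100 pages and on most Web pages:
print(likely_boilerplate(95, 100, 800_000, 1_000_000))   # True
# "astronomy" appears on 95 of 100 pages of one site, rarely elsewhere:
print(likely_boilerplate(95, 100, 5_000, 1_000_000))     # False
```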
Some Conclusions
1. Keep in mind that a search engine may ignore the text on pages that it may think is boilerplate.
2. If you want a search engine to pay attention to text on pages, pay attention to where that text appears on a page, and how frequently the same text appears on more than one page.
3. Global navigation and site wide links appearing on pages might be viewed as boilerplate when it comes to the content of the pages those links appear upon, but the anchor text within them may still tell the search engine something about the pages that they point towards.
4. Google may or may not be using something like this now, but if they aren’t, they could be in the future.
This definitely makes a lot of sense, as efficiency is key.
Aside from semantics, this is why I don’t like sitewide navigations. In my opinion it is best to have a minimal sitewide nav purely for usability, which is accompanied by a larger semantic nav hand crafted for each individual page.
Great post on the future of Google search.
Some additional things I have learned about Google recently with my own website and blog:
They seem to ignore old versions of WordPress. I found this out recently while using their blogsearch.google.com site.
I had been using a version of WordPress that was over a year old (I don’t remember the exact version number) and Google’s blogsearch would not index any of my new posts and hadn’t since August 2007.
I thought this was odd so I checked out some other blogs (one in particular, a major top ranking blog, was using an old version of WordPress) and, sure enough, their new posts weren’t showing up either.
Just on a whim I updated my blog to the latest version of WordPress (currently 2.3.3) and, wow, my posts are being gobbled up by Google’s blogsearch within minutes of being posted, and many are getting page one placement for top keywords on their web search engine as well, even though the posts are brand new and have absolutely no PageRank credibility.
In addition, as of this comment, Google’s blogsearch seems to be dumping indexed posts that are older than about the middle of January 2008. I found this out when the total number of posts indexed by them for my blog went from hundreds to just the 27 most recent ones, posted since the middle of January 2008.
I did a simple blogsearch using the blogurl:www.name-of-blog.com handle for a number of top blogs and found that Google’s blogsearch does not list any posts from before about the middle of Jan. 2008, including yours, Bill.
Google seems to be moving towards rewarding current and timely information rather than dated information on both their blogsearch and their web search engines. The copyright information on a site may someday help or hurt a page depending on the year that is stated. I also think that web pages that have not been modified in a while may start to get penalized, but I am not aware of this happening currently.
Thanks.
I think G has been using this for at least two years. Searching for common elements using a site: search produces some interesting results.
Only one result returned for a phrase found in your right column:
“Hello. I am Bill Slawski” site:seobythesea.com
http://tinyurl.com/yoqq86
But click “repeat the search with the omitted results included” and the rest of your pages pop back out.
Hi Jordan,
Thanks for the example. Much appreciated.
I just saw another Google patent application last night, that I want to cover in the near future, which talks about the detection of duplicate content within the same site, and on different sites.
A couple of paragraphs from that one discuss the identification of boilerplate on pages first, before comparing pages to see how similar the content is. Having seen this patent application on boilerplate made that a lot more understandable.
I think that the idea of Yahoo looking for templated sections of pages carries some similar ideas, too.
@ Se, Thanks. There’s some value in being able to distinguish between boilerplate and content for indexing, duplicate content detection, and more.
@ Dane, This patent application does potentially hold some implications for sitewide links, but I don’t know how much. Worth exploring.
@ People Finder, Thanks for your observations regarding Google’s blog search. I think that it may be worth exploring the same notion in Google’s Web search. Indexing around events seems to be covered in more than one of the unpublished Google patent applications. Microsoft and Yahoo also seem to be exploring some of these areas, with the Web as a “stream of data,” where fresher pages may sometimes trump older ones in some circumstances.
Hi Michael,
There are some other areas where looking at boilerplate may be important for the search engine.
One would be when it is creating implicit queries based upon documents that someone is using, writing, or viewing. Offering someone search results about copyright, based upon the existence of a copyright notice at the bottom of a page, wouldn’t do.
Letting the same notice influence an explicit query wouldn’t be ideal either. A search for publishing information might end up showing “copyright in publishing” information instead.
Another area where it might be good to ignore boilerplate is in duplicate content detection.
There is no harm in indexing boilerplate text, except for the disk space it requires and the processing time needed to index the text.
However, I can see how some legal challenges might arise if search engines start only partially indexing pages — wait, Google is already doing that with the Supplemental Results Index.
Never mind.
Let’s face it, the trend is to want the search engines to index the most current and relevant content, whilst also rooting out unnecessary, repetitive window dressing.
I’m sorry Michael, but I fail to see how boilerplate content is equally as relevant as non-boilerplate content. If I’m looking for something very specific, “Powered by WordPress” generally isn’t going to be my query.
It is possible to use boilerplate content to narrow down specific searches like
“Powered by WordPress” crowdsourcing resources
Still, the meat of that query is by far “crowdsourcing resources”. In this case I’m not looking for WordPress plugins or anything to do with the term “powered”; these terms may as well be operators to help define my query.
I’m all for relevance, but in all seriousness, indexing boilerplate text makes sense for several reasons.
First, it helps you identify who is publishing through multiple domains.
Secondly, it helps you identify who is sharing information about terms of service, company information, contact information, and related services (all of which are queries that consumers perform every day).
Thirdly, it helps you understand what technology lies behind the page. Most CSS-driven sites have a very specific style for their boilerplate even though they can switch out templates.
Fourth, boilerplate text is part of the public record. Excluding those embedded terms and conditions that are placed in footer text misrepresents the content of the page. If a search engine is claiming to make the Web searchable, then morally it MUST provide some means for searching boilerplate content even if that requires the user to stipulate bypassing a filter.
Boilerplate text is not any less relevant or important to a user’s query simply because people are tired of seeing it. Boilerplate text helps you identify which site you’re actually looking at, and that is always very important to know.
I really don’t think G was thinking about presenting the pages sans boilerplate content. Pages would look like crap, for one 😉
However, I do think they wanted to:
a) reduce the power of links in boilerplate, ie navigation
b) better identify duplicate content
c) and perhaps save a little disc space
Let’s take this site for example:
I can easily see G storing this page as two separate entities, the left and the right.
To a human eye, the left side is what matters. G knows this and wants to weight the post area more than the nav, with respect to content and links.
I think the duplicate content side was more important before phrase based indexing. However, you can see how removing the boilerplate would help reduce false positives when just looking at the difference between pages.
G could store the content from 607 pages and the nav once and cut down the amount of disc space needed by at least 40%. Maybe more, as there’s a lot of markup involved in the right side of the page. This is a bit more far fetched than the ideas above, as the processing power to put the pages back together might outweigh the storage savings.
Great post and a very informative point of view. I would, however, argue as others have that there are still a number of good reasons why boilerplate navigation and text should be indexed and taken into consideration.
Although the boilerplate text is not necessarily as relevant as the rest of the content, that’s certainly no reason to exclude it from indexing.
Basic IR algorithms already take account of such common terms (idf or PPM spring to mind), so the big SE’s are bound to have fully refined algorithms that are quite capable of measuring the relevance of this content within the page/site without labelling it as “boilerplate”.
As for indexing, it is probably more efficient for common terms than for those less frequently seen.
As Michael mentions, the real fun is in what can be done by identifying the boilerplate text in areas such as spam/duplicate detection. As for that fourth point, Michael, it’s the best. There is no way that completely excluding content can be a good thing.
You might also want to read Ji-Rong Wen’s Block Level Link Analysis.
@ David Marx,
Good points. The search engines may have a bias towards the recent and the relevant, though there are times when and where older content still retains its relevancy. I’m not sure that this particular patent application focuses too much upon that, but it is related to other patent applications that do, so you’ve picked up on a related undercurrent behind the patent application.
@ Michael Martinez, I do think that there is some relevance to what we find in boilerplate, and I don’t think that it will cease to be indexed. It’s just that, within the context of the uses described in this patent application, boilerplate material is given less weight, or is ignored, when generating keywords to drive the creation of implicit searches, or when reranking explicit searches.
@ Dane, While I agree with Michael that boilerplate should continue to be indexed, I also agree with you that it is less relevant.
@ Jordan,
Good call. I think one of the most important implications behind this is in the duplicate content detection process. If some of the indexed content upon a page can be labeled boilerplate, and not be considered when trying to identify duplicate content, then that process can be a lot more effective.
@ Chris,
Again, within the context of the patent application, there are specific and reasonable purposes behind defining parts of pages as boilerplate. It’s not a question of whether those parts of pages will be indexed, but rather of creating implicit queries, and influencing the results of explicit queries, from the content of recently viewed, written, or opened documents without giving the boilerplate parts of those pages much weight.
@ Jon
Thanks! I agree; boilerplate content should be indexed.
@ Teddie,
Thanks for the reference. The Microsoft papers involving Block Level analysis have been around for a while. I’d recommend those to readers, too.
Some more recent research that builds upon that research, and refines it involves object level indexing – see Search Objective Gets a Refined Approach
It does seem like Google must be eliminating some boilerplate info now. They are so strong on elimination of duplicate content. Boilerplate stuff rarely adds value in the search results, so it seems like a logical step.
Dane, “powered by wordpress” is a tag, not boilerplate content. Boilerplate content tends to be more substantial. However, if all that you have for boilerplate is one tag, this is a non-issue anyway.
There are plenty of Web sites with extensive boilerplate content that actually changes according to context. Contextual boilerplate is easy to create and it’s useful for site searches. And users do a LOT of site searches.
Yes Bill. Sounds very similar, if not identical, to the VIPS algo. http://www.cad.zju.edu.cn/home/dengcai/VIPS/VIPS.html
@ Jordan, The VIPS approach is worth looking at, and the object level indexing that I linked to a few comments above is definitely worth a look if you want to explore those topics even further.
One main idea is to understand segmentation, like in the VIPS approach, and then see if information from different parts of pages on the same or different sites can be joined together in a meaningful way to understand different objects and entities.
@ Ken, I think that you might be right, or may be right sometime in the near future. My more recent post, New Google Process for Detecting Near Duplicate Content, describes a new Google patent application about duplicate content where they specifically state that they may attempt to locate and not consider boilerplate material on pages before attempting to review pages for duplicate and near duplicate content. If boilerplate can be taken out of the equation, then the processes might be more effective.
@ Michael, there is some boilerplate that doesn’t hold much value, like the “powered by wordpress” tag. There’s other boilerplate that may contain more. It’s difficult to tell from the patent filing if all boilerplate will be ignored on the same level. There are too many “in another embodiment” type statements in the document to make any statements about the process with a great deal of certainty. Regardless, I still think discussing the topic gives us some ideas about what the search engine might consider doing in the future, and how we might consider presenting information on pages so that it isn’t considered boilerplate if it holds some important value and meaning.
Although boilerplate text may look like duplicated content, it makes up very important and useful sections of most web pages: it can communicate with visitors, show them how to use or navigate around a site, and summarize something about the entire website. So I think there’s no reason for Google to ignore it. Maybe Google and most search engines have recognized boilerplate and have been accepting it, possibly even using it for indexing, but only for the single most important page of a site, such as the homepage, while ignoring it on other pages.
@ R. Apichart, Thanks. There can be some value in what we may be calling boilerplate here, but within the context of the method in the patent, that boilerplate may not be helpful in generating implicit queries or reranking explicit queries influenced by keywords taken from non-boilerplate content on recently viewed, read, or written documents (including web pages).
Indexing that content within a search engine index is another matter, but it’s possible that boilerplate content shouldn’t carry as much weight as more useful page content.
Interesting article and obviously a step in reducing the clout advertising and sitewide links have.
Hi Pete,
The primary goal behind uncovering boilerplate, from this patent application, is finding information on pages someone has viewed, when that information might be helpful in informing searches that can be influenced by the documents that someone is viewing, or was recently viewing, when they conduct a search. In other words, a search based upon the context of the searcher’s recent activity.
Being able to identify boilerplate might have some additional benefits or side effects, like lessening the weight or importance of links contained within that boilerplate, as you note. 🙂
It may also be helpful in detecting near duplicate content, by allowing a search engine to ignore boilerplate content when comparing the content of multiple pages to see how similar it is from one page to another.
Nice article Bill!
I just wanted to ask you about a hypothetical case in which the navigation elements are considered boilerplate. If so, why bother with the concept that only the first link counts (placed in the top left corner for usability purposes, or the ‘home’ link), since the search engine may be bypassing it to focus on editorial content?
Hi Augusto,
Thank you.
Keep in mind that the main focus for this particular patent filing is on a searching system, like a desktop search, which may create queries for the person using it, based upon words found in a document or a web page that the person is viewing, or has recently viewed. Those are queries whose results might show up in a sidebar, like the one used by Google Desktop search, without the person using the system actually choosing the query terms, or explicitly performing the search themselves.
In that circumstance, the search system wouldn’t give much weight at all to the text associated with a link in an area of a document that it might consider to be boilerplate.
But, the idea that Google might take the concepts behind this desktop searching system, and apply it to the indexing of Web pages is a real possibility, and Google could segment a page into parts, and give each part different weights. Microsoft described doing something like that in their paper on Block Level Link Analysis, and the idea is one that we have to keep in mind.
When you ask about the idea of “first link only counts,” I believe that you are referring to some blog posts that drew conclusions about how search engines might index links on a page when more than one link from that page pointed to another page. After everything was said and done on that topic, which was discussed in a number of blogs and a few forums too, I seem to recall that most people found the results behind that research inconclusive.
There are usability benefits in having a link from a logo, and other clearly marked links to a home page, like you note. And some sites have multiple links to the same page for other purposes, like Amazon’s two links at the top of their page that link to another page where new users may register for the first time, and returning users may log in – two links for two distinct groups of visitors, pointing to the same page.
The usability benefits by themselves make multiple links worth having. What impact does that have upon something like the calculation of PageRank? It’s hard to say how search engines might weight those links, but even the original PageRank patents provide possibilities for weighing the PageRank value of a link differently. For example:
From Method for scoring documents in a linked database
In that last paragraph, we are told that links that are near the tops of pages, or are emphasized in other ways, could be given more weight. It’s possible that links within the main content sections of pages, as opposed to those in a “boilerplate” section, may also be given more weight. So that kind of idea has been around since the very early days of PageRank.
Usability in having that link is a reason by itself to make a site easy to navigate, regardless of how much or how little weight a link might have under a system like PageRank.
Thanks for your prompt response.
I do agree that these ideas have a real possibility of inclusion in Google’s search technology.
Current or future technology might also make it possible to detect paid links that are placed in blogrolls, footers, or any other section that can be easily identified, as mentioned in the “article markup indicating boilerplate” paragraph.
It’s been tested that links placed at the top of the page (not in terms of the site’s physical appearance or aesthetics, but from the coding perspective) perform better than links located elsewhere. And as you pointed out, these concepts have been covered since the very beginnings of PageRank.
Then, perhaps the idea that the first anchor text counts is geared more towards duplication of links within the editorial content than within the boilerplate, if we consider the fact that Google leans toward building pages for visitors rather than crawlers.
Even though the findings are inconclusive and further research is needed to come up with a theory regarding the first anchor text, at least we know that if the anchor text is the same, they will typically drop the second link (according to Matt Cutts’ comments).
Hi Augusto,
Thank you. Those are some good points. I’ve written recently about some of the approaches that the major search engines may be using when it comes to anchor text.
While it’s likely that there are differences in the approaches of all three search engines to anchor text, there are also likely some similarities.
If Google is indeed creating an “anchor map” of all of the URLs and associated anchor text for each, it’s possible that entries within that map that are substantially duplicates may be treated differently than entries that aren’t. But we can’t tell with any certainty, and often the anchor text isn’t the same, even though the URL being pointed to might be.
If, as Matt notes, duplicate links on the same page using the same anchor text may see one of the links being dropped, of course that’s not the same as saying that links to the same URL using different anchor text will also be dropped. So Matt’s answer doesn’t help us there.
Google’s phrased-based indexing patent filings have some interesting things to say about anchor text too, and it’s possible that Google is using the approach in those. The discussion on outlink and inlink scores, in the section on “Document Annotation for Improved Ranking,” and on “bombing” pages with links, in Phrase-based indexing in an information retrieval system is definitely worth a look.
Hi Bill,
Matt Cutts declared that recent Google algorithm changes included: “Better page titles in search results by de-duplicating boilerplate anchors” on 11/14/2011.
This draws an interesting line between the timing of patent releases and their implementation in Google’s search technology 😉
Sha
Hi Sha,
While I originally wrote this post back in 2008, I was writing about the patent application before it was granted.
Google was granted this patent (US patent number 8,041,713) on October 18, 2011, so yes there is an interesting correlation between the granting of the patent and the statement from Matt Cutts. 🙂