How A Search Engine Might Classify Web Pages as Sensitive

Given the Panda Updates from Google, I’ve been spending a fair amount of time looking at how search engines might use automated programs to classify webpages, and how they use those classifications. If you’re a web publisher, it’s the kind of thing that you might be interested in as well. If you display ads, what does Google think of where and how you present them? How does your choice of colors, font styles and sizes, number of columns, size of headings and footers, inclusion of about pages and privacy policies, and other features on your site influence how Google might perceive and classify and score your pages?

One example of a problem where classification of pages might be helpful to a search engine is described in the book about Google by Steven Levy, In The Plex. The author tells us about some Google AdSense gaffes that show the challenges in automating the matching of advertisements with the pages they are displayed upon. One particularly offensive match was a Google ad for plastic bags showing on a news page about a grisly murder where the victim’s body was disposed of in plastic trash bags. Tickets for air travel might be placed on a page about plane crashes. A coupon offering a free dinner for 2 at a particular chain restaurant appeared on the same page as an article about a number of people who dined at a restaurant in that chain and suffered from food poisoning. The author notes:

Google Engineers started working on ways to mitigate this problem, but it would never be eliminated. It was just too hard for an algorithm trained to discover matches between articles and ads to exercise human good taste.

I don’t believe that I’ve seen a patent or paper from Google directly on this subject, though I did write a post a few years back, How Google Rejects Annoying Advertisements and Pages, that described many of the things that Google might be looking for when using an automated process to review ads.

The patent I wrote about in that post, Detecting and rejecting annoying documents, was granted last week. It looks at a large number of features that might be related to both advertisements and landing pages that influence whether or not an advertisement might be accepted. But it doesn’t discuss whether or not some ads might be considered inappropriate for some web pages that they might be displayed upon.

Microsoft was granted a patent this week on a process they came up with to try to avoid showing inappropriate advertisements on Web pages, though it’s possible that they’ve replaced the process they detail in the patent with something new. In early 2007, you could visit Microsoft AdCenter Labs and see a tool for “Detecting Sensitive Web Pages” amongst the experimental products the search engine offered.

A screenshot showing part of the Detecting Sensitive Webpages tool from Microsoft AdCenter Labs, via the Wayback Machine.

I’m not sure how useful the tool itself might have been for site owners, but I did find a blog post on Webmetrics Guru that shows what the results from the tool looked like on Microsoft AdCenter Labs New and Improved Beta Tools – Sensitive Page Detection.

The goal of the tool was to look at the content of one or more pages of a site to predict a “sensitivity” level associated with that content, and to determine whether or not it fit within certain sensitivity categories. The patent behind the tool is:

Sensitive webpage content detection
Invented by Ying Li, Teresa Mah, Jie Tong, Xin Jin, Saleel Sathe, and Jingyi Xu
Assignee: Microsoft Corporation (Redmond, WA)
US Patent 7,974,994
Granted July 5, 2011
Filed May 14, 2007


Computer-readable media, systems, and methods for sensitive webpage content detection are described. In embodiments, a multi-class classifier is developed and one or more webpages with webpage content are received. In various embodiments, the one or more webpages are analyzed with the multi-class classifier and, in various embodiments, a sensitivity level is predicted that is associated with the webpage content of the one or more webpages. In various other embodiments, the multi-class classifier includes one or more sensitivity categories.

A flowchart from the patent showing the process involved in detecting content that might be inappropriate to show some advertisements alongside.

The database behind a system like this might store specific information about web pages and advertisements, such as:

  • Sensitivity categories,
  • Sensitivity subcategories,
  • Multi-class classifier information,
  • Webpage information,
  • Association information involving webpages and sensitivity categories and subcategories,
  • Advertisement information,
  • Parental control information,
  • Forum information,
  • Blog information
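As a rough illustration only, a few of the kinds of records in that list could be modeled along the following lines. The class names, field names, and the 0.5 threshold are hypothetical and don't come from the patent; this is a sketch of the idea rather than how Microsoft stores anything.

```python
from dataclasses import dataclass, field


@dataclass
class SensitivityLabel:
    # Hypothetical record tying a page to a category and subcategory.
    category: str        # e.g. "accidents"
    subcategory: str     # e.g. "sensitive" or "non-sensitive"
    probability: float   # classifier confidence between 0 and 1


@dataclass
class PageRecord:
    # Hypothetical "webpage information" entry with its associations.
    url: str
    page_type: str                         # e.g. "news", "blog", "forum"
    labels: list = field(default_factory=list)

    def is_sensitive(self, threshold=0.5):
        # Sensitive if any "sensitive" label meets the (assumed) threshold.
        return any(
            label.subcategory == "sensitive" and label.probability >= threshold
            for label in self.labels
        )


page = PageRecord(url="http://example.com/story", page_type="news")
page.labels.append(SensitivityLabel("accidents", "sensitive", 0.8))
```

An ad system could then check something like `page.is_sensitive()` before deciding whether to place an ad on that page.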

In addition to determining whether an ad might be inappropriate for a specific page, this system might be used to specifically target certain pages for time-sensitive advertisements. For example, when the content of a page involves a recent natural disaster, advertisements and public service announcements involving relief efforts might be more easily shown on those pages.

Sensitive and Non-Sensitive Categories and Subcategories

The patent includes a number of examples of categories that might be assigned to pages, and provides “sensitive” and “non-sensitive” examples of each, involving sex, weapons, accidents, crime, terrorism and war. Here is their breakdown from the larger accidents category:

ACCIDENTS: Accidents pages are pages such as news articles, analysis, or commentary on events resulting in fatalities.
Accidents – sensitive: Natural disasters; Vehicle crashes; Household accidents
Accidents – non-sensitive: Minor injuries; Non-fatal, major injuries; Sports injuries; Natural disaster preparedness; Injury prevention and precautions; Injury treatment

Categorization Process

The categorization of web pages might be done by collecting a number of training pages and classifying those, to use to classify other pages in an automated manner. For example, a query involving crime prevention might be submitted to a search engine, and the top 500 web pages returned might be reviewed by humans to find the pages relevant to crime prevention. Those pages may then be placed within a training set of pages for a “crime – non sensitive” category. Other pages then might be identified as being in that category by comparison with those training pages.
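To make "comparison with those training pages" a little more concrete, here is a minimal sketch that labels a new page with the category of its most similar training page, using cosine similarity over word counts. The patent doesn't say that this is the comparison method used; the similarity measure, the function names, and the two tiny training pages are all my own assumptions for illustration.

```python
import math
import re
from collections import Counter


def term_counts(text):
    # Lowercase bag-of-words counts for a page's text.
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    # Cosine similarity between two term-count vectors.
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def classify(text, training_set):
    # Return the label of the most similar human-reviewed training page.
    counts = term_counts(text)
    best = max(training_set, key=lambda item: cosine(counts, term_counts(item[0])))
    return best[1]


# Made-up training pages standing in for human-reviewed search results.
training_set = [
    ("tips to prevent burglary and keep your home safe with locks",
     "crime - non-sensitive"),
    ("police report a violent murder and robbery downtown",
     "crime - sensitive"),
]

label = classify("how to keep your home safe and prevent burglary", training_set)
```

A real system would use many more training pages per category and a proper multi-class classifier, but the basic idea of labeling unseen pages by their resemblance to reviewed ones is the same.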

This machine learning system might look for similar phrases and terms in other pages, as well as how frequently those terms appear, whether the phrases appear near the top of a page or less prominently lower on the page, and whether the terms show up in an alternative font, such as a larger font, or bold, or italic, or underlined.
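Those three signals (frequency, position, and emphasis) could be pulled out of a page with something like the sketch below, which uses Python's standard `html.parser`. The set of tags treated as emphasis, the 50-word cutoff for "near the top", and all of the names are assumptions of mine rather than details from the patent.

```python
import re
from html.parser import HTMLParser

# Assumed set of tags that count as "emphasis" for this sketch.
EMPHASIS_TAGS = {"b", "strong", "i", "em", "u", "h1", "h2", "h3"}


class TermFeatureParser(HTMLParser):
    # Records each occurrence of one term, its word position, and emphasis.
    def __init__(self, term):
        super().__init__()
        self.term = term.lower()
        self.depth = 0       # number of emphasis tags currently open
        self.position = 0    # running word index across the page
        self.hits = []       # (word index, emphasized?) per occurrence

    def handle_starttag(self, tag, attrs):
        if tag in EMPHASIS_TAGS:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in EMPHASIS_TAGS and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        for word in re.findall(r"[a-z]+", data.lower()):
            if word == self.term:
                self.hits.append((self.position, self.depth > 0))
            self.position += 1


def term_features(html, term, top_words=50):
    # Summarize frequency, prominence, and emphasis for a single term.
    parser = TermFeatureParser(term)
    parser.feed(html)
    parser.close()
    return {
        "frequency": len(parser.hits),
        "near_top": any(pos < top_words for pos, _ in parser.hits),
        "emphasized": any(emph for _, emph in parser.hits),
    }


features = term_features(
    "<h1>Plane crash</h1><p>The crash happened at night.</p>", "crash"
)
```

Here the term appears twice, once inside a heading, so all three signals fire for this small page.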

Certain rules may be applied based upon associations found with the human reviewed pages and other pages seen only by the machine learning system. For example, if the word “sex” appears upon a page more than 3 times, and the word “nude” appears more than twice, that may indicate a certain probability that the page belongs to a “sex–sensitive subcategory.” (I better not use the word “sex” or “nude” again on this page. Oops.)
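A rule like that one can be written down in a few lines. In this sketch the word thresholds follow the example above, but the 0.9 probability is a number I made up (the example only says "a certain probability"), and the names are hypothetical.

```python
import re
from collections import Counter

# Counts that must be exceeded, plus an assumed, purely illustrative probability.
RULE = {
    "thresholds": {"sex": 3, "nude": 2},
    "subcategory": "sex - sensitive",
    "probability": 0.9,
}


def apply_rule(text, rule):
    # Return (subcategory, probability) if every word exceeds its count.
    counts = Counter(re.findall(r"[a-z]+", text.lower()))
    matched = all(
        counts[word] > minimum for word, minimum in rule["thresholds"].items()
    )
    return (rule["subcategory"], rule["probability"]) if matched else None
```

A page where both words pass their thresholds would get the subcategory and probability back, while a page that falls short on either word would get `None`.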

Some rules may be applied differently if the pages being classified are news articles, blog posts, online forums, or pages operated by a specific business.
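One simple way that kind of difference might show up is a separate cutoff for each type of page. The page types below follow the ones just mentioned, but the cutoff values and names are invented for illustration and aren't from the patent.

```python
# Hypothetical cutoffs per page type; the numbers are illustrative only.
THRESHOLDS = {"news": 0.5, "blog": 0.6, "forum": 0.7}
DEFAULT_THRESHOLD = 0.6


def is_sensitive(probability, page_type):
    # Compare a predicted probability against the cutoff for this page type.
    return probability >= THRESHOLDS.get(page_type, DEFAULT_THRESHOLD)
```

With these made-up numbers, the same predicted probability of 0.55 would flag a news article as sensitive but not a forum thread.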


This patent paints a fairly broad overview of how web pages might be classified into sensitive and non-sensitive categories, based upon a human review of a sample number of web pages followed up by an automated approach for additional pages that looks at features associated with the use of specific terms found on the manually reviewed pages. Chances are that Google may be doing something similar for ads that they display upon pages as well.

As for Google and their Panda updates, the type of document classification system in the Microsoft patent is aimed at determining when it might be appropriate to show certain advertisements on certain pages rather than reviewing pages to try to determine the “quality” of those pages. Chances are that a document classification system used to determine the quality of pages looks at a much larger set of features, but many of the ideas behind the approach are likely similar, including the use of human reviewers to manually identify a number of “high” quality pages.

What kind of features might Google be looking at on your pages to determine what level of quality they might have?

The answer to that question might best be served by looking at the questions that Google Fellow Amit Singhal raised in the Google Webmaster Central blog post More guidance on building high-quality sites. In addition to looking at how your site might fit those questions, it may not hurt to find “high quality” sites that rank well for similar or related queries as the pages on your site and see how those sites address the issues raised in those questions.

21 thoughts on “How A Search Engine Might Classify Web Pages as Sensitive”

  1. Thanks Bill – another great article!

    This reminds me a bit of an experiment that I remember reading about.

    It was noticed that when typing an email into Gmail there’s a paragraph that can be written which ensures that no keyword based adverts appear on the page. I don’t recall the exact phrase but I believe it used a traumatic event to trigger the removal of the ad.

    The patent looks like it develops the occurrence of a traumatic keyword theory further than a repeated mention which would surely be better than relying on single instances or a human based tagging solution.

    In addition – I suppose the inventory that could be freed up by the removal of ads could be used for charitable advertisers in the same way that TV programmes feature the “if you were affected by the storyline of this programming” message.

    Great article as always Bill!


  2. Bill,

    Although I can see how an addition of this type of scripting would indeed be beneficial to advertising on say, something like sensitive news items written about taboo subjects, just think of how enabling this type of functionality would make the net less fun for those of us who frequent sites like LOL!!!

    Great and informative post as always. Honestly, I think that this would be a great addition to advertising. I have seen more than one occasion where an advertiser would have been embarrassed to see their ad on a particular page.


  3. Does affiliation have to do with back links and social graph now?

    Just wondering if we have to “watch our backs” more in the future as far as who our advocates and fans are. You can’t always control who is linking to you. I’ve seen many times where unsavory sites give you sitewides through crappy widgets or just because they are trying to make their site more official by using good sites as their OBL’s.

  4. Sensitivity is so important for marketing. A poorly-timed or -placed ad can make the company it advertises look insensitive without them even trying, which can reflect badly on potential clients. I’ll be interested to see where this goes.

  5. Hi Mark,

    The unintended consequences of contextual advertising systems that match up content with ads may bring up some unusual and unfortunate results. I still remember Altavista offering me a discount on Dalai Lamas a number of years ago.

    Chances are that something like this has been happening for more than a couple of years now, and I think that it’s the kind of thing that advertisers really want to know that you might be doing. It’s easy enough for human beings to make PR mistakes, but you don’t want your automated ads to be doing the same thing too.

  6. Hi Tom,

    Thanks. I hadn’t heard of that approach to filtering Gmail of ads, but it makes sense that they would do something like that. There does seem to be more than just the frequency with which particular words are used on a page; a few of those seem to look to see if those words are being emphasized in a manner which might make them the focus of the page.

    Google will show some public service announcements if you let them, when there aren’t appropriate ads to show. Chances are, those would go through a classification and filtering system as well.

  7. Hi Donnie,

    For many sites, where advertising might distract from people focusing upon your pages and what you offer, advertisements might not be a good idea, and I would recommend against them. In addition to the potential distractions that they can cause, the ads based upon the content on your pages might be for people who may be competing with you for your visitors’ attention.

    But many sites have a business model where the advertisements are the way that they make money on the sites, from news and media sites to blogs that focus upon providing information on different subjects, to forums and other sites that don’t necessarily offer goods or services. Showing appropriate ads for those pages makes sense.

  8. Hi Kentaro,

    It is important. The patent was Microsoft’s but chances are that Google has developed some kind of process to attempt to make sure that advertisements they show on content pages are appropriate fits. There is some potential for harm if they don’t, and if those types of mistakes happen on a regular basis, they could lead to fewer advertisers, so there’s definitely an incentive there to be able to identify the kinds of sensitive pages described in the patent.

  9. Hi Brent,

    I’m not sure how backlinks might influence the classification of a page for purposes of determining how “sensitive” that page might be regarding displaying advertisements. Most of the analysis seems to be focused upon content appearing upon a page rather than whether or not the page might be affiliated with some other sites, even if those other sites might cover topics that are sensitive.

    I did write about a Google patent that was granted last year in a post titled Google’s Affiliated Page Link Patent that gives us some hints about how the search engine might determine whether or not some sites might be affiliated in some way. While one of the potential impacts of that type of affiliation might be a limit or cap on how much link weight might be passed along by links, it’s possible that a document classification system like the one used in the Panda updates might also be looking at backlinks from sites that appear to be affiliated.

    Supposedly Google has sent out at least a couple of notifications through Google Webmaster Tools that the recipients of those notices have a number of manipulative links pointed toward their site. On the bad side, it’s not good to see that competitors might be able to hurt someone by pointing links at them. On the good side, that kind of notification may just give people a chance to have a conversation about those types of links, especially if they aren’t responsible for them.

  10. While I don’t think that an algorithm, no matter how advanced, could ever avoid more subtle advertising errors, it could very likely get rid of the vast majority of obvious advertising blunders.

    And I suppose that’s what Google might really be after: to avoid being associated with hilarious mistakes in general. Because when people see an ad that is inappropriate for the context it’s found in, they don’t get angry; they laugh. And I think Google just wants to avoid being laughed at.

  11. Hi Blake,

    Good points. With the volume of advertisements that the search engines receive, it really does help to have some kind of filter. Having to manually check every advertisement that might be displayed upon a page to see if there might be a problem with placing it on that page is just too much work, especially to make a service like that affordable for advertisements.

    Google definitely wants to avoid the kind of bad publicity that an egregious error might cause, like the examples that I mentioned in my post from In the Plex. Not only does it make them look bad, but it also might be the kind of news that would make it less likely that businesses would advertise through Google.

  12. This really shows how important AdWords is to Google’s business model. To go through the trouble of the patent process to display some human sensitivity (if that’s even possible for an algorithm) is telling.

    Also, I tend to agree with Donnie on not putting adverts on most sites.

  13. Hi everyone,

    I think this is an extremely important matter. I was talking some time ago with a client that had his advert for a Villa on sale displayed on a website about a soccer player called David Villa…

    That should never happen, especially when you are paying for those adverts.

  14. It’s hard to believe how advanced advertising is getting. I think if it prevents major advertising blunders then it is a good thing.

  15. Hi Jonathan, Mel, Dave, and DrD

    AdWords is the primary way that Google makes money, so any way that they can improve it is a good step. This patent is from Microsoft rather than Google, and chances are that Google has come up with their own automated process to try to make sure that combinations of content pages and advertisements are good matches. Chances are that some matches, like the one that Rafa describes, will still happen, though it would be in Google’s best interest that they don’t.

    The process described in this patent has nothing to do with why some low quality pages might rank well for certain queries, but the kind of document classification that works behind it might be similar in some ways to processes behind Google’s Panda updates, which it might be kind to say are still in their infancy. It’s likely that Panda does include some grammar and spelling check like you might find in a word processing script.

  16. This mechanized analysis also helps explain how pages in my realm of plastic surgery that Google ranks highly are often spam farms. Google needs to write an algorithm that reads proper English and exclude pages full of keyword scrabble in favor of those that read coherently. Could they not “borrow” such scripts from the makers of Word Processors?

  17. I guess at the end of the day, only humans manually editing where each ad is allowed would come close to making good matches 100% of the time. Since this is how DMOZ works, and I and many other people have not been graced with an entry, I think that many people are going to accept automation. And surely if there is an inappropriate match, a feedback mechanism will help Google clean up their algorithm. Lucky I don’t use AdWords at the moment!
