How a Search Engine Might Fight Googlebombing

The first known appearance of the word “googlebomb” was in an online magazine article by Adam Mathes, who asked readers to help him play a joke on a friend by making the friend’s website rank highly for the term “talentless hack.”

You’ve possibly noticed that some pages rank well in Google search results for terms or phrases that don’t actually appear on those pages, because other pages link to them using those words as the text of the links. For example, search for “click here” and the top search result at Google is the Adobe Reader download page, which millions of links across the Web point to using “click here” as the link text.

I’ve used the word “Googlebomb” in this post, but this is something that happens at Yahoo and Bing as well. Given enough links from enough pages using the same text to point to a specific page, there’s a chance that the page being linked to might rank very well in search results from any of the major search engines, even if the content of the page has nothing to do with the text in those links.

Usually, when people link to pages, the text used in those links is descriptive of what people might find at the pages being linked to. This can help a search engine understand what the page being pointed to is about. Search engines have been associating the text in links with the pages that they refer to since the early days of the Web. As Google’s founders, Sergey Brin and Lawrence Page, note in one of the first white papers about Google, The Anatomy of a Large-Scale Hypertextual Web Search Engine, the idea is something that they incorporated in Google, but it didn’t start with them:

This idea of propagating anchor text to the page it refers to was implemented in the World Wide Web Worm [McBryan 94] especially because it helps search non-text information, and expands the search coverage with fewer downloaded documents.

We use anchor propagation mostly because anchor text can help provide better quality results. Using anchor text efficiently is technically difficult because of the large amounts of data which must be processed. In our current crawl of 24 million pages, we had over 259 million anchors which we indexed.
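To make the idea concrete, here’s a minimal sketch in Python of what anchor propagation might look like in a toy index. The function names, the scoring (a simple count of term occurrences), and the example URLs are my own assumptions for illustration, not anything described in the Google paper.

```python
from collections import defaultdict


def build_index(pages, links):
    """Index page text, then propagate anchor text to link targets.

    pages: dict mapping URL -> body text
    links: iterable of (source_url, anchor_text, target_url) tuples
    """
    index = defaultdict(lambda: defaultdict(int))  # term -> URL -> count
    for url, body in pages.items():
        for term in body.lower().split():
            index[term][url] += 1
    # Anchor propagation: credit each anchor term to the page being linked to,
    # not to the page that contains the link.
    for _source, anchor, target in links:
        for term in anchor.lower().split():
            index[term][target] += 1
    return index


def search(index, query):
    """Return URLs matching every query term, highest toy score first."""
    terms = query.lower().split()
    if not terms:
        return []
    candidates = set(index.get(terms[0], {}))
    for term in terms[1:]:
        candidates &= set(index.get(term, {}))
    scores = {url: sum(index[t][url] for t in terms) for url in candidates}
    return sorted(scores, key=lambda url: -scores[url])


# Hypothetical data: the download page never uses the words "click here".
pages = {
    "reader.example/download": "get the free document viewer",
    "news.example/signup": "click here to subscribe",
}
links = [(f"site{i}.example", "click here", "reader.example/download")
         for i in range(3)]
index = build_index(pages, links)
print(search(index, "click here"))
```

In this toy setup, the download page outranks the page that actually contains “click here,” purely because of the text in the links pointing to it. That is the same mechanism a Googlebomb exploits.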

Trying to understand the relevance of a page from the links pointing to it is often referred to as hypertext relevance, and while it’s been employed by search engines for almost as long as there have been search engines on the Web, it’s also been manipulated by people for personal, political, and commercial purposes.

The “talentless hack” Googlebomb was intended as a joke, but one of the most famous Googlebombs was inspired by political activism, with a large number of people linking to George Bush’s biography page on the White House website using the phrase “miserable failure” in the anchor text of their links. A September 2005 statement on the Official Google Blog, Googlebombing ‘failure’, explained why that page was showing up for that query:

Google’s search results are generated by computer programs that rank web pages in large part by examining the number and relative popularity of the sites that link to them. By using a practice called googlebombing, however, determined pranksters can occasionally produce odd results. In this case, a number of webmasters use the phrases [failure] and [miserable failure] to describe and link to President Bush’s website, thus pushing it to the top of searches for those phrases.

In January 2007, a post on the Google Webmaster Central blog, A quick word about Googlebombs, told us that Google had solved the “Miserable Failure” Googlebomb:

We wanted to give a quick update about “Googlebombs.” By improving our analysis of the link structure of the web, Google has begun minimizing the impact of many Googlebombs. Now we will typically return commentary, discussions, and articles about the Googlebombs instead. The actual scale of this change is pretty small (there are under a hundred well-known Googlebombs), but if you’d like to get more details about this topic, read on.

The post doesn’t tell us how the problem was solved, other than mentioning an improvement to the way that they analyze links, and that the solution was an algorithmic one. How did Google solve the problem?

The only patent or whitepaper reference to Googlebombs that I’ve seen from Google appears in the Google patents on Phrase-Based Indexing. I hadn’t seen any references from the other search engines about how they may have attempted to solve the problem until today, when Yahoo was granted a patent that describes how they fight “search engine hijacking,” and uses the example of a query for “miserable failure” showing the Presidential biography page.

The Google phrase-based indexing approach that I mentioned may or may not be the method referred to in the Google Webmaster Central post above. But it may account for the President’s bio page starting to show up in search results again a few months later, when the word “failure” was added to that bio page. Here’s a snippet from the first phrase-based indexing patent:

[0156] This approach has the benefit of entirely preventing certain types of manipulations of web pages (a class of documents) in order to skew the results of a search. Search engines that use a ranking algorithm that relies on the number of links that point to a given document in order to rank that document can be “bombed” by artificially creating a large number of pages with a given anchor text which then point to a desired page.

As a result, when a search query using the anchor text is entered, the desired page is typically returned, even if in fact this page has little or nothing to do with the anchor text. Importing the related bit vector from a target document URL1 into the phrase A related phrase bit vector for document URL0 eliminates the reliance of the search system on just the relationship of phrase A in URL0 pointing to URL1 as an indicator of significance or URL1 to the anchor text phrase.

Once the White House staff added “failure” to the bio page, it suddenly became relevant under a phrase-based indexing approach for all of those links pointing to it that used “miserable failure” as anchor text.
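Here’s a loose sketch of that idea in Python: anchor text only earns credit for a target page to the extent that the page itself contains the anchor phrase or phrases related to it. The related-phrase table, the function names, and the fraction-based score are all hypothetical. The patent describes bit vectors of related phrases, but I’m not claiming this is how it computes anything.

```python
# Hypothetical related-phrase table, hand-written here for illustration.
RELATED_PHRASES = {
    "failure": ["failed", "setback", "defeat"],
}


def related_bits(phrase, text):
    """Bit vector: 1 where the text contains the phrase or a related phrase.

    Uses plain substring matching, which is crude but enough for a sketch.
    """
    text = text.lower()
    candidates = [phrase] + RELATED_PHRASES.get(phrase, [])
    return [1 if candidate in text else 0 for candidate in candidates]


def anchor_credit(anchor_phrase, target_text):
    """Fraction of the anchor phrase's related set found on the target page."""
    bits = related_bits(anchor_phrase, target_text)
    return sum(bits) / len(bits)


bio_before = "Biography of the 43rd President of the United States."
bio_after = bio_before + " He spoke about learning from failure."
print(anchor_credit("failure", bio_before), anchor_credit("failure", bio_after))
```

Under these assumptions, links using “failure” earn the page no credit while the page has nothing to do with the phrase, and start to earn some once the word appears on the page.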

Yahoo and Bing are also subject to Googlebombs, and a search at both Yahoo and Bing for “miserable failure” shows the George Bush White House bio in the top four results. The Yahoo patent describes a way of defusing Googlebombs using sentiment analysis, and if it works, it’s possible that Microsoft might want to license the approach from Yahoo.

The patent is:

Mitigation of search engine hijacking
Invented by Shanmugasundaram Ravikumar and Bo Pang
Assigned to Yahoo!
US Patent 7,870,131
Granted January 11, 2011
Filed: December 13, 2007


The subject matter disclosed herein relates to mitigation of search engine hijacking. In one example implementation, a sentiment value associated with anchortext in a search engine result may be determined.

Similarly, a sentiment value of one or more web pages referenced by the anchor text may also be determined. A divergence between sentiment values associated with the anchortext and a web page may then determined.

Here’s the technical language on how the Yahoo method works, straight from the patent:

More specifically, given an anchortext-page pair (q, p), a sentiment classifier may be applied to the anchortext and the web page separately, resulting in the sentiment of the anchortext (C(p)) and the sentiment of the web page (C(q)). In the case where C(p) U C(q)={acceptable, unacceptable}, a determination may be made to see whether the anchortext q is trying to hijack web page p. Where Pq is the set of all pages with anchortext q, and Qp is the set of all anchortexts for page p, hijacking may be indicated where C(p)={acceptable} and C(q)={unacceptable}. This may correspond to a case in which an invalid anchortext tries to hijack a valid web page. In this case, anchortext q may be declared as hijacking page p if the multi-set Pq, treated as a distribution, has low entropy and if most of the anchortext in the set Qp are “acceptable”. Such a result may indicate that the goal of the anchortext q is to slander web page p as web page p is also indicated as having a significant amount of other “labelings”(in the form of diverse, and mostly “acceptable” anchortexts).

Likewise, for example, hijacking of search engine may be indicated in cases where anchortext has a sentiment value that is acceptable and the web page has a sentiment value that is unacceptable. Ranking component may determine that such hijacking is occurring if a set of anchortexts referencing the web page has a distribution with low entropy, and if a majority of web pages within a set of web pages containing the anchortext have an acceptable sentiment value. In such a case, the acceptable anchortext sentiment value diverges from the unacceptable web page sentiment value, and such divergence may be shown not to be a normal occurrence due to the low entropy of the set of anchortexts referencing the web page.

In other words, if anchor text used to point to a page has a negative sentiment value, and the text on the web page being pointed to has a positive sentiment value, then the relevance of that anchor text may not be used by the search engine to analyze what the page is about. Likewise, if the anchor text has a positive sentiment value, and the text on the page linked to has a negative sentiment value, then the anchor text also may not be applied to the page pointed towards.

A link using “miserable failure” as anchor text expresses a negative sentiment, and the bio page of the former president expresses positive sentiments. Under this system, presumably, the “miserable failure” text wouldn’t be applied to the bio page.
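A rough sketch in Python of the first case the patent describes (unacceptable anchor text pointing at an acceptable page) might look like the following. The word-list “classifier,” the entropy threshold of 1.0 bits, the simple-majority rule, and the example URLs are all my own assumptions. The patent talks about a sentiment classifier, low entropy, and “most” anchor texts being acceptable without those specifics.

```python
import math
import re
from collections import Counter

# Hypothetical stand-in for a trained sentiment classifier.
NEGATIVE_WORDS = {"miserable", "failure", "worst", "liar"}


def sentiment(text):
    """Label text 'unacceptable' if it contains any word from the toy lexicon."""
    words = set(re.findall(r"[a-z']+", text.lower()))
    return "unacceptable" if words & NEGATIVE_WORDS else "acceptable"


def entropy(items):
    """Shannon entropy, in bits, of a multiset treated as a distribution."""
    counts = Counter(items)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def is_hijacking(anchor, page_url, page_text, links, max_entropy=1.0):
    """links: list of (anchor_text, target_url) pairs observed on the web."""
    # Only handle the case of negative anchor text on a positive page.
    if sentiment(anchor) != "unacceptable" or sentiment(page_text) != "acceptable":
        return False
    pages_for_anchor = [t for a, t in links if a == anchor]    # multiset Pq
    anchors_for_page = {a for a, t in links if t == page_url}  # set Qp
    if not pages_for_anchor or not anchors_for_page:
        return False
    # Low entropy: nearly every use of this anchor text targets the same page.
    concentrated = entropy(pages_for_anchor) <= max_entropy
    # Most of the distinct anchor texts for the page should be acceptable.
    acceptable = sum(sentiment(a) == "acceptable" for a in anchors_for_page)
    mostly_acceptable = acceptable > len(anchors_for_page) / 2
    return concentrated and mostly_acceptable


links = (
    [("miserable failure", "gov.example/bio")] * 50
    + [("miserable failure", "blog.example/post")] * 2
    + [("president biography", "gov.example/bio"),
       ("official biography", "gov.example/bio"),
       ("george bush", "gov.example/bio")]
)
bio_text = "George W. Bush is the 43rd President of the United States."
print(is_hijacking("miserable failure", "gov.example/bio", bio_text, links))
```

A search engine using something like this could then ignore the flagged anchor text when deciding what the page is about. The mirror-image case from the patent (acceptable anchor text, unacceptable page) would swap the roles of the two sets and isn’t shown here.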

I’m not sure whether Yahoo ever tried this out, and with Bing now powering Yahoo’s search results, it’s impossible to test how effective it would have been if Yahoo had implemented it. At this point, Yahoobombing and Bingbombing still seem to work.


Is Phrase-Based indexing responsible for the disappearance of Googlebombs at Google?

There are two different sets of phrase-based indexing patents that were published by Google. The first set described a number of ways that it could be used by the search engine. The second set described how the system could be incorporated into a large-scale search engine index like Google’s. Phrase-based indexing would stop the “miserable failure” query from showing George Bush’s bio, and would explain why the bio started appearing again for a query using just “failure” once the White House added that word to the page after the “miserable failure” Googlebomb was defused.

Is Google using some kind of sentiment analysis approach to solve Googlebombing, like the one described in the Yahoo patent?

It’s possible, but it’s hard to say whether or not the Yahoo approach even works, at this point.


Author: Bill Slawski
