Google’s Second Most Important Algorithm? Before Google’s Panda, there was Phil

They named the project Phil, because it sounded friendly. (For those who required an acronym, they had one handy: Probabilistic Hierarchical Inferential Learner.) That was bad news for a Google Engineer named Phil who kept getting emails about the system. He begged Harik to change the name, but Phil it was.

- Steven Levy, In The Plex: How Google Thinks, Works, and Shapes Our Lives.

How does Google decide which Adsense advertisements to show on which Web pages? How do they avoid showing inappropriate advertisements on those content pages? How does the document classification system they use to power those decisions work, and has its use been expanded beyond Google’s advertising system?

A screenshot of an interface from the patent Categorizing objects, such as documents and/or clusters, with respect to a taxonomy and data structures derived from such categorization, that shows how someone might discover which categories a website might be included within.

Steven Levy’s In the Plex describes the early days of Google’s Adsense program, where web publishers could sign up to display contextually relevant blocks of advertisements on their pages, and would receive a percentage of the advertising fee if someone clicked through an ad being shown. The name of the program, Adsense, originated with a company called Applied Semantics which Google acquired around the time that they were putting the finishing touches on this content-based advertising system.

According to Levy, the timing of the Applied Semantics acquisition and the use of the Adsense name caused many to believe that the advertising technology developed by Applied Semantics was at the heart of the Adsense program, when in reality the technology that powered the system was developed in-house, and was known as “PHIL.”

Note, I’m having a hard time finishing In the Plex. I keep on getting sidetracked by mentions of people and algorithms and events that I want to learn more about.

The mentions of PHIL in the book led me to dig in deeper, and see if I could uncover more information about PHIL. I came across the original provisional patent filing “Methods and apparatus for probabilistic hierarchical inferential learner,” application number 60/416,144, which was filed on October 3, 2002 and expired on July 19th, 2004 without having been published. I had never seen it before, and I suspect that very few outside of Google or the patent office have either, though I recognized a good amount of what I read in it.

PHIL was invented by Georges Harik and Noam Shazeer, and many of the ideas about how words and concepts associated with them might be clustered and classified can be found in the following patents where they are listed as co-inventors:

The following patent application describes how the Phil clustering system might be used to categorize many different types of documents:

Categorizing objects, such as documents and/or clusters, with respect to a taxonomy and data structures derived from such categorization

Invented by David Gehrking, Ching Law, and Andrew Maxwell
US Patent Application 20060242147
Published October 26, 2006
Filed: April 22, 2005

Abstract

A Website may be automatically categorized by:

  • Accepting Website information,
  • Determining a set of scored clusters (e.g., semantic, term co-occurrence, etc.) for the Website using the Website information, and
  • Determining at least one category (e.g., a vertical category) of a predefined taxonomy using at least some of the set of clusters.

A semantic cluster (e.g., a term co-occurrence cluster) may be automatically associated with one or more categories (e.g., vertical categories) of a predefined taxonomy by:

  • Accepting a semantic cluster,
  • Identifying a set of a one or more scored concepts using the accepted cluster,
  • Identifying a set of one or more categories using at least some of the one or more scored concepts, and
  • Associating at least some of the one or more categories with the semantic cluster.

A property (e.g., a Website) may be associated with one or more categories (e.g., vertical categories) of a predefined taxonomy by:

  • Accepting information about the property,
  • Identifying a set of a one or more scored semantic clusters (e.g., term co-occurrence clusters) using the accepted property information,
  • Identifying a set of one or more categories (e.g., vertical categories) using at least some of the one or more scored semantic clusters, and
  • Associating at least some of the one or more categories with the property.

The present invention concerns organizing information. In particular, the present invention concerns categorizing terms, phrases, documents and/or term co-occurrence clusters with respect to a taxonomy and using such categorized documents and/or clusters.
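The flow in the abstract (property information, then scored clusters, then taxonomy categories) can be sketched in a few lines. This is only a toy illustration under loud assumptions: the cluster contents, the cluster-to-category weights, and the function names `score_clusters` and `categorize` are all hypothetical, and the simple term-overlap scoring stands in for whatever probabilistic model Google actually uses.

```python
from collections import defaultdict

# Hypothetical term co-occurrence clusters (cluster id -> member terms).
CLUSTERS = {
    "c_autos": {"car", "engine", "sedan", "dealer"},
    "c_travel": {"flight", "hotel", "airport", "booking"},
    "c_finance": {"loan", "mortgage", "dealer", "rate"},
}

# Hypothetical association of each cluster with weighted taxonomy categories.
CLUSTER_CATEGORIES = {
    "c_autos": {"Automotive": 1.0},
    "c_travel": {"Travel": 1.0},
    "c_finance": {"Finance": 0.8, "Automotive": 0.2},
}


def score_clusters(text):
    """Score each cluster by the fraction of tokens in the text it contains."""
    tokens = text.lower().split()
    scores = {}
    for cluster_id, terms in CLUSTERS.items():
        hits = sum(1 for token in tokens if token in terms)
        if hits:
            scores[cluster_id] = hits / len(tokens)
    return scores


def categorize(text, top_n=1):
    """Turn scored clusters into a ranked list of (category, score) pairs."""
    totals = defaultdict(float)
    for cluster_id, score in score_clusters(text).items():
        for category, weight in CLUSTER_CATEGORIES[cluster_id].items():
            totals[category] += score * weight
    ranked = sorted(totals.items(), key=lambda item: item[1], reverse=True)
    return ranked[:top_n]


if __name__ == "__main__":
    print(categorize("cheap flight and hotel booking near the airport"))
    print(categorize("car dealer loan rate", top_n=2))
```

Keeping the cluster-to-category mapping separate from the page scoring mirrors the abstract, which describes associating clusters with categories as its own step, apart from categorizing a website.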

Some of the capabilities behind PHIL were listed in one of the slides from a presentation by Google’s Ruchira S. Datta, PHIL: The Probabilistic Hierarchical Inferential Learner, given at the Tenth Annual Bay Area Discrete Mathematics Day on April 9th, 2005:

What Can We Do With Phil?

  • We can compare the concepts occurring in queries, documents, and ads.
  • We can compare the concepts between two documents, and form a distance metric for clustering similar documents.
  • For a query with ambiguous meaning, we can present results corresponding to the alternate meanings, in proportion to their likelihood.
  • We can use the concepts as features in order to classify texts.
  • We can guess whether words are misspellings of each other based on the concepts they induce.
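The second bullet, comparing the concepts in two documents to form a distance metric, might look something like the following. It is a minimal sketch and not the method from the slides: it assumes each document has already been reduced to a dictionary of concept scores, and it uses cosine distance as a stand-in. The function name `concept_distance` and the sample concept names are hypothetical.

```python
import math


def concept_distance(doc_a, doc_b):
    """Cosine distance between two concept-score dictionaries.

    Returns 0.0 for documents with the same concept profile and 1.0 for
    documents that share no concepts (or when either one is empty).
    """
    shared = set(doc_a) & set(doc_b)
    dot = sum(doc_a[concept] * doc_b[concept] for concept in shared)
    norm_a = math.sqrt(sum(score * score for score in doc_a.values()))
    norm_b = math.sqrt(sum(score * score for score in doc_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


if __name__ == "__main__":
    # Hypothetical concept scores for three pages.
    page_1 = {"jaguar_car": 0.9, "luxury": 0.4}
    page_2 = {"jaguar_car": 0.7, "dealership": 0.5}
    page_3 = {"jaguar_animal": 0.8, "rainforest": 0.6}
    print(concept_distance(page_1, page_2))
    print(concept_distance(page_1, page_3))
```

A distance like this could then be handed to any standard clustering routine to group similar documents.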

The provisional patent was refiled with slightly modified language in September of 2003, and was granted under the name Method and apparatus for characterizing documents based on clusters of related words. One of the major changes was the removal of references to the system as “PHIL”.

I wrote about that patent back in January, before I began reading In the Plex and before I knew of the Google code name PHIL, in the post Why a Search Engine Might Cluster Concepts to Improve Search Results. At the time, I hadn’t realized that Google was not only using the type of classification described in the patent to classify web pages for its Adsense program, but had been doing so for years.

Conclusion

We know that PHIL played an important and integral role in how Google classifies web pages to decide upon which ads to show on pages through Adsense. Some of the classifications provided as examples in the Categorizing objects patent include “sensitive” topics like the ones that I wrote about in my last post on How A Search Engine Might Classify Web Pages as Sensitive.

PHIL classifies pages into different categories by learning about terms and concepts that co-occur upon pages, and clustering similar pages together. Google’s Phrase-Based Indexing also focuses upon identifying good phrases and co-occurrence of terms on pages, but it uses a different system and focuses more upon ranking web pages. There have been two generations of Phrase-Based Indexing patents, with the second group focusing upon implementing the system in a large scale indexing system, so there’s a possibility that it plays a similar role to PHIL on the web search side.

There is the possibility that the kind of taxonomy that PHIL places pages and sites within could be used in combination with other types of document classification systems, perhaps even one such as Panda, so that sites that cover similar types of topics might be compared against one another.


25 thoughts on “Google’s Second Most Important Algorithm? Before Google’s Panda, there was Phil”

  1. Well, all I can say is “Thank goodness for PHIL” because Adsense is the only ad program that pulls in half-way targeted ads on my sites.

    I have tried others before, but nothing has matched up so far in terms of conversions.

    Mark

  2. Wow, that’s some heavy detective work.

    You mentioned
    “I came across the original provisional patent filing “Methods and apparatus for probabilistic hierarchical inferential learner,” application number 60/416,144, which was filed on October 3, 2002 and expired on July 19th, 2004 without having been published.”

    How the heck did you find a provisional patent that was never published? Not through the normal PTO patent search process, presumably!

    - Ted

  3. Ruchira Datta is a woman. Coincidentally, her time at Caltech overlaps with that of Adam Weissman and Gil Elbaz (who founded Applied Semantics).

  4. Crikey Bill – that’s dedication!

    The capabilities of PHIL mentioned at the Tenth Annual Bay Area Discrete Mathematics Day back in 2005 are interesting, which got me thinking about the integration of other document classification systems. Personally I’d hazard a guess that the goal based targeting system, classification and attribution modelling of Dart Doubleclick (floated 2005, acquired by Google 2008) would have been integrated prior to Panda.

    Then again though – I was under the impression that some of the TangoZebra functionality would have been integrated into YouTube by now…

    Tom

  5. Google Adsense is indeed a very effective way to advertise and earn money, but the problem is that they recently added some restrictions on the addresses of Asian bloggers. I tried several times to get my accounts approved, but I often failed.

    They should give equal rights to everyone.

  6. It’s easy to see how well PHIL works, if you have Adsense on your site. Just change the title tag, h1 tag and a few terms on a page and when you refresh, you should get new ads in your Adsense space focused on the new topic you’ve given the page.

    I think the “Website” box on the Google Keyword Tool is based on Phil, also. Enter any URL to get a good idea of what Google thinks that page is about. Very valuable insight.

  7. I’m not keen on Adsense and the keyword tool is hit and miss depending on topic. I also have a few minor grumbles about how the other 60M people in the UK search for sites within the Google search bar – it is beyond belief! How can we cope for that?

  8. Hi Mark,

    I’ve heard some people very satisfied with AdSense and others underwhelmed. To a degree, I guess it depends upon whether or not visitors to your site might be interested in visiting the ads shown on your page, regardless of how relevant they might be.

  9. Hi Jeremy,

    I mentioned Ruchira S. Datta pretty much just because of my link to her presentation on the topic, and I don’t know what role if any she might have had in the development of PHIL. Georges Harik and Noam M. Shazeer seem to get most of the credit for it, though it’s very likely that others at Google may have played a role as well.

  10. Hi Curt,

    Thanks for the additional information. You have me wondering if there were some interesting discussions going on at Caltech back then about semantic search.

  11. Hi Tom,

    Thanks. I’m still wondering at this point how the Applied Semantics technology might have been incorporated into what Google does since PHIL was used in Adsense, and a couple of Google patents from Adam Weissman and Gil Elbaz focus more upon Web search than sponsored search. I can’t say that I’ve delved too deeply into the technology that came with Doubleclick into Google.

  12. Hi Mar,

    Great points and information. I’ve explored using the URL box in the same way. It is interesting seeing what Google might have to say about particular pages.

  13. Hi SF,

    I’d love to see someone do a large scale study on both AdSense (and really how relevant ads are to the content they are shown with), and on the keyword tool as well. I don’t know what your issues are with searches on the Google toolbar, so I can’t really respond to your comment on that point.

  14. Hi Ted,

    I looked up the patent application number in the USPTO Patent Application Information Retrieval, where I was able to access it since it expired. It had never been published as a pending patent application under the normal patent process, but it is available through the database because of its expiration. While it’s a publicly accessible document, it’s not an easily accessible one.

  15. I believe PHIL and the AdSense program are separate from SEO. PHIL effectively (for the most part) targets publisher sites based on content and the “relational dimensions” of keywords on the page, but I don’t think it has any effect on SEO.

    Re: AdSense program – once you’re banned, it is very hard to get back in without some sort of new domain and even mailing address.

  16. Hi Vince,

    One of the things that we’ve often heard from representatives from Google is that the organic and the paid sides of search are separate from each other. But that doesn’t mean that some of the ideas that circulate on one side of search might not influence the other side as well.

    For instance, when Google researchers came up with a decision tree process that could work with very large data sets using Google’s MapReduce, as described in the paper PLANET: Massively Parallel Learning of Tree Ensembles with MapReduce, they experimented with how effective that process might be at predicting the bounce rates for Adwords advertisements and landing pages. They note in the paper that success in that experiment indicates that the process they developed could be used with other very large data sets, like those involved with organic search. There’s a decent possibility that process may be behind Google’s Panda updates.

    I’m not insisting that some aspect of PHIL is used in Google’s organic results, but as a document classification system that pays special care and attention to the words and concepts used on pages to classify those pages, and one that Google has definitely been using for years, it merits attention from anyone interested in how Google works to classify pages. As an added note, patents from the researchers who came to Google from Applied Semantics are also mentioned in this patent, and it’s possible that they had some influence on later refinements and developments to the PHIL system. Looking at the patents that they developed while at Google, some of them focus more upon organic search than paid search.

  17. The original Applied Semantics patent is one of the least turgid, and most apt papers I have read via a post on this site. Patents are not the easiest to understand or decipher, so thanks for the valuable input you give.

  18. Hi JC,

    You’re welcome. The Applied Semantics papers and patent filings are pretty interesting, focusing more upon understanding concepts on web pages than upon matching keywords in documents. They are definitely worth reading for anyone interested in search engines or SEO.

  19. So far I have been quite impressed with adsense placements on my blogs. Depending on the location of the user and the keywords surrounding the placement (alt title for pictures, meta tags…) I’d say 75% of the ads are really relevant.
    As a publisher, it is really important you optimize your pages/articles well to get the most relevant ads on your site. I should add that I prefer to limit the number of ads, as Google will show the highest bidding ads first (resulting in higher income per click for me).

  20. Hi Eric,

    I’ve tried out adsense, and there are times when I do see very relevant ads, and times when I wonder how some ads were chosen because they just aren’t good fits.

    I agree with you on the idea of limiting the number of ads shown, so that it’s more likely that higher bidding ads, and more relevant ads are shown.
