“All mushrooms are edible; but some only once.” ~ Croatian proverb
Google was granted a patent today that could be used to collect a seed set of data about features associated with different types of mushrooms, to “determine whether a specimen is poisonous based on predetermined features of the specimen.” The patent also describes how that process could be used to help filter email spam based upon the features found within the email, or to determine whether images on a page are advertisements, or to determine categories of pages on the Web on the basis of textual features within those pages. The image below, from the patent, shows how features of a picture, such as its height, width, placement on a page, and caption, might be examined when determining whether or not it is an advertisement:
This machine-learning approach can be trained with data that produces known outcomes, and could then be applied to very large data sets to classify data according to patterns identified within the seed set of data. When Google published Finding more high quality sites in search in February of 2011, they introduced what would become known as the Panda update. The approach was further elaborated upon by Google’s Matt Cutts and Amit Singhal in an interview with Wired Magazine around a week later, in TED 2011: The “Panda” That Hates Farms: A Q&A With Google’s Top Search Engineers.
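To make that seed-set idea a little more concrete, here is a minimal sketch, assuming a simple decision tree and the sorts of image features mentioned above. The feature names, values, and use of scikit-learn are my own illustration, not anything taken from the patent:

```python
# A minimal sketch of training on a labeled "seed set" and classifying new items.
# The features and values are invented for illustration; this is not Google's system.
from sklearn.tree import DecisionTreeClassifier

# Seed set: each row describes an image on a page as
# [width_px, height_px, distance_from_top_px, has_caption (0/1)]
seed_features = [
    [728, 90,  10, 0],   # banner-shaped, near the top, no caption
    [300, 250, 40, 0],   # rectangle in the sidebar area
    [640, 480, 500, 1],  # large photo with a caption, mid-page
    [800, 600, 700, 1],  # large captioned photo, lower on the page
]
seed_labels = ["ad", "ad", "not_ad", "not_ad"]  # known outcomes for the seed set

model = DecisionTreeClassifier(max_depth=3, random_state=0)
model.fit(seed_features, seed_labels)

# Apply the learned patterns to items that haven't been labeled yet.
unknown = [[320, 50, 5, 0], [1024, 768, 650, 1]]
print(model.predict(unknown))  # e.g. ['ad' 'not_ad']
```

The point is only that a small set of labeled examples can be turned into a classifier that is then applied to a much larger set of unlabeled items.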
In early May of last year, Amit Singhal followed up with a post aimed at helping webmasters focus upon the kinds of efforts they should take to avoid being targeted by the Panda update with a series of questions in More guidance on building high-quality sites.
After reading the Wired interview, and finding out that the Panda update was named after a Google engineer, I tried to identify who that engineer might be, and wrote the post Searching Google for Big Panda and Finding Decision Trees. I identified Biswanath Panda as a person of interest behind the update, based upon both his surname and a paper he co-wrote with Joshua S. Herbach, Sugato Basu, and Roberto J. Bayardo, titled PLANET: Massively Parallel Learning of Tree Ensembles with MapReduce (pdf).
The paper describes how the researchers involved were able to perform some fairly complex classification processes on features associated with advertisements and landing pages to predict which of those would earn more click-throughs and longer stays from visitors. It also tells us that the processes involved could be used in other ways, possibly including classifying Web pages based upon features within a known seed set, to determine which pages visitors would tend to spend more time upon. See also, Predicting Bounce Rates in Sponsored Search Advertisements, which provides additional details on how an examination of different features related to sponsored ads and landing pages could be used to predict bounce rates.
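As a rough sketch of the kind of prediction task those papers describe, here is a toy tree-ensemble example. The feature names and data are invented, and the PLANET paper’s real contribution is learning ensembles like this at massive scale on MapReduce rather than on a handful of rows:

```python
# Toy sketch of predicting bounces from ad / landing-page features with a tree
# ensemble. This only illustrates the general technique; it is not the PLANET
# system, whose point is training such ensembles on very large data sets.
from sklearn.ensemble import GradientBoostingClassifier

# Invented features: [ad_title_word_count, landing_page_load_secs,
#                     query_landing_page_word_overlap, num_images_on_page]
visits = [
    [4, 1.2, 0.80, 12],
    [9, 6.5, 0.05,  1],
    [5, 2.0, 0.60,  8],
    [8, 7.1, 0.10,  0],
]
bounced = [0, 1, 0, 1]  # 1 = visitor left almost immediately

ensemble = GradientBoostingClassifier(n_estimators=50, max_depth=2, random_state=0)
ensemble.fit(visits, bounced)

# Estimated probability that a new ad / landing-page pair produces a bounce.
print(ensemble.predict_proba([[6, 3.0, 0.4, 5]])[0][1])
```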
I included a question mark at the end of the title to this post because I honestly don’t know if it should be considered Google’s “Panda” patent. It doesn’t have a co-inventor with a surname of “Panda,” though that could possibly be a nickname of one of the co-inventors. It doesn’t specifically mention that the process involved could be used to “noticeably” impact 11.8% of queries at Google when applied to classify web pages. It doesn’t have a list of questions that webmasters should ask themselves about their pages, like the ones that Google’s head of Search Quality, Amit Singhal, published in his More Guidance post.
But, when you read the Wired interview with Matt Cutts and Amit Singhal, you get the sense that they are comparing sites that they find to be good quality with sites that aren’t, and that to do so, they are looking at specific features on those pages to make those decisions:
Wired.com: But how do you implement that algorithmically?
Cutts: I think you look for signals that recreate that same intuition, that same experience that you have as an engineer and that users have. Whenever we look at the most blocked sites, it did match our intuition and experience, but the key is, you also have your experience of the sorts of sites that are going to be adding value for users versus not adding value for users. And we actually came up with a classifier to say, okay, IRS or Wikipedia or New York Times is over on this side, and the low-quality sites are over on this side. And you can really see mathematical reasons.
This patent presents a way of examining features on a seed set of known pages, and developing comparisons of those features with features found on an unknown set, to determine a classification of those pages based upon the examined features.
It also allows for the introduction of new features while the classification process is ongoing. The patent is:
Feature selection for large scale models
Invented by Sameer Singh, Eldon S. Larsen, Jeremy Kubica, Andrew W. Moore
Assigned to Google
US Patent 8,190,537
Granted May 29, 2012
Filed: October 31, 2008
Abstract
Disclosed are a method and system for receiving a plurality of potential features to be added to a model having existing features. For each of the potential features, an approximate model is learned by holding values of the existing features in the model constant.
The approximate model includes the model having existing features and at least the potential feature. A performance metric is computed for evaluating the performance of the approximate model. The performance metric is used to rank the potential feature based on a predetermined criterion.
There is a whitepaper from Google, authored by three of the four inventors listed on the patent, that covers substantially similar territory, titled Parallel Large Scale Feature Selection for Logistic Regression. The paper leads off by telling us about some of the problems it is intended to address:
High-dimensional data sets with a large number of features are used increasingly more often in real-world machine learning tasks. Text mining problems such as classification and spam detection rely on features that describe the occurrence of specific combinations of words and therefore the numbers of potential features can grow up to billions.
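Here is a minimal sketch of the single-feature evaluation idea the abstract and paper describe, assuming a logistic regression model and log-loss as the performance metric. Reducing “holding the existing features constant” to fixed per-example scores is my simplification, not the patent’s exact formulation:

```python
# Sketch of evaluating candidate features by holding the existing model fixed
# and learning only one new weight per candidate (a simplification of the
# "approximate model" idea; not the patented implementation).
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def log_loss(probs, labels):
    eps = 1e-12
    return -sum(y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps)
                for p, y in zip(probs, labels)) / len(labels)

def score_candidate(base_scores, candidate_values, labels, steps=200, lr=0.5):
    """Fit a single weight w for the candidate feature while the existing
    model's per-example scores (base_scores) stay constant, then return the
    log-loss of the resulting approximate model."""
    w = 0.0
    for _ in range(steps):
        grad = 0.0
        for s, x, y in zip(base_scores, candidate_values, labels):
            p = sigmoid(s + w * x)
            grad += (p - y) * x
        w -= lr * grad / len(labels)
    probs = [sigmoid(s + w * x) for s, x in zip(base_scores, candidate_values)]
    return log_loss(probs, labels)

# Toy data: per-example scores from the existing model, known labels,
# and two candidate features to evaluate.
base_scores = [0.2, -0.4, 0.1, -0.3, 0.5, -0.6]
labels      = [1,    0,   1,    0,   1,    0]
candidates = {
    "feature_A": [1, 0, 1, 0, 1, 0],   # lines up well with the labels
    "feature_B": [0, 1, 1, 0, 0, 1],   # mostly noise
}

# Rank candidates by how well the approximate model performs, best first.
ranked = sorted(candidates, key=lambda name:
                score_candidate(base_scores, candidates[name], labels))
print(ranked)  # ['feature_A', 'feature_B']
```

The attraction of this setup is that each candidate feature can be evaluated independently, which is what makes it practical to screen very large numbers of potential features in parallel.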
Takeaways
I have been keeping a careful eye out for a patent that would describe the process behind Google’s Panda updates, and based upon the nature of those updates, my expectation was that I might not necessarily recognize it once I came across it. I didn’t expect it to provide details upon specific features that might be seen as positive or negative when it comes to determining the quality of web pages. I didn’t expect it to provide hints about what a webmaster might do if he or she was impacted by it.
I did expect that a patent about the Panda update would involve very large data sets, that it would include a machine learning approach that might determine positive features from known websites considered to be high quality, and that it could expand upon the features being used during the process of classifying a large set of pages. The process described in this patent does seem to fit those expectations.
The processes described in this patent are likely similar in many ways to the algorithm that classified documents under the Panda updates and could be the actual framework for the updates.
Regardless, it doesn’t provide any answers to ranking better under Panda, or any specific solutions to regaining rankings that might have been lost. It doesn’t focus upon any one feature or signal that could potentially be tweaked to change around the fortunes of pages that might have been affected, but rather considers a wide range of factors.
For pages that have been negatively impacted by Panda, the solution is more likely in removing or replacing low-quality content upon pages, and creating the kind of experience on the remaining pages that are pointed at by the questions that Amit Singhal mentions in his More Guidance post.
And as for mushrooms, I’m told that my great grandmother used to boil them with a silver coin included in the pot. If the coin blackened during the process, it was supposedly a sign that the mushrooms were poisonous. Researching that approach, I see lots of articles indicating that it just doesn’t work. My family lucked out. Be careful what you’re consuming, regardless of whether it’s about mushrooms or solutions to algorithmic updates. 🙂
“Be careful what you’re consuming, regardless of whether it’s about mushrooms or solutions to algorithmic updates.” – bwaaah ha ha ha ha. Sigh. Love that one brother.
The patent application is too early (2008) and is not associated with either of the Panda engineers (Navneet and Biswanath) who have been put forth as candidates for the “Panda breakthrough”. I’ve been told that Matt Cutts confirmed it was Navneet Panda’s work but I’ve never found a citation of that.
Given the Panda naming, I wonder if there’s anyone named Penguin at Google 🙂
Hi David,
Thanks. Figured since I started the post off with mushrooms, I had to get them back in at the end. 🙂
Hi Michael,
As I was reading through the patent this morning, those were some of my concerns as well.
The patent doesn’t focus specifically upon pages and organic search results, but it definitely presents ways that a method like the one described within it could be used to classify web pages. The PLANET and Predicting Bounce Rates papers originally focus upon sponsored search and being able to use a decision tree classification system with very large data sets. Both mention that the processes described within them could be used with other large data sets at Google.
One of the things I found in common between them is that both papers include an acknowledgments section listing a number of people from Google who contributed to and provided feedback on the papers. One of those is Andrew Moore, who is one of the listed inventors on the patent, and the head of engineering at Google’s Pittsburgh office. He also seems pretty well versed in decision trees, as seen by this tutorial he offers on his site:
http://www.autonlab.org/tutorials/dtree18.pdf
The wording of the Wired Interview is interesting in a number of ways. For example:
Just what was that breakthrough? The approach in the first paper from Biswanath Panda and others made it possible to use a decision tree process on MapReduce for a very large data set, which is a significant breakthrough in itself. Those papers were published in 2009, more than a few months before the Panda update took place, but it’s possible that adapting an approach initially developed to predict sponsored search bounce rates to the classification of web pages for organic search could be considered a breakthrough.
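To make the MapReduce-and-decision-trees connection a bit more concrete, here’s a toy sketch of the general pattern (my own simplification, not PLANET’s actual implementation): mappers emit label counts for candidate splits over their shard of the data, and a reducer sums those counts so the tree-growing step can pick the best split without any one machine seeing all of the examples.

```python
# Toy sketch of the map/reduce pattern for evaluating decision-tree splits on
# partitioned data (a simplification of the idea behind PLANET, not its code).
from collections import Counter
from itertools import chain

def map_shard(rows):
    """Emit ((feature_index, side_of_split, label), 1) for every row in one shard."""
    for features, label in rows:
        for i, value in enumerate(features):
            yield ((i, value >= 0.5, label), 1)  # one illustrative threshold

def reduce_counts(emitted):
    """Sum the counts emitted by all mappers."""
    totals = Counter()
    for key, count in emitted:
        totals[key] += count
    return totals

# Two "shards" that would live on different machines.
shard_1 = [([0.9, 0.1], 1), ([0.2, 0.8], 0)]
shard_2 = [([0.7, 0.3], 1), ([0.1, 0.9], 0)]

totals = reduce_counts(chain(map_shard(shard_1), map_shard(shard_2)))
# The tree grower would compare candidate splits using only these totals, e.g.
# how many positive vs. negative examples land on the high side of feature 0.
print(totals[(0, True, 1)], totals[(0, True, 0)])  # 2 0
```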
As for the person who came up with the questions used behind the approach, that doesn’t necessarily have to be Panda either.
But those questions could enable the creation of an initial set of features that a system like the one described in this patent could use.
I have been keeping an eye open for a pending or granted patent from Google to surface with one of the Pandas’ names on it. If this “breakthrough” is something that Google decides to patent, a patent application that needs to be published would have to be filed within a year of any public disclosure of the approach, and would then have to be published within 18 months of its filing date.
But it’s also possible to file a nonpublication request along with a patent application, which allows the application to remain unpublished until a patent is granted. I’ve seen patents granted within a year at the quickest, and a handful that took more than 10 years.
It’s also possible that whatever that “breakthrough” might be, Google might treat it as a trade secret instead, and we might not see a patent.
I have been doing some more research on the Pandas at Google. It’s possible that “Panda” is the surname of the person mentioned in the interview, but it could also be a nickname, or a shortened version of a longer name. I have seen at least one Google engineer with a last name that starts with p-a-n-d-a…
I hadn’t heard that Matt Cutts had confirmed that Navneet Panda was the Panda being referred to, but I’ll be digging deeper.
None with any pending or granted patents. 🙂
Great find, even though it mostly tells us what we already know. Guessing it would be too much to ask that they would actually hide some real factors in there. On the mushroom topic, though, the problem is that anyone with conclusive data on the silver coin theory never bothered to blog about it (for good reasons) 🙂
As Michael Martinez above says, the patent application is too early and doesn’t cite Panda’s name; however, I think the Panda update is a collection of many algorithmic changes, and this could be a small part of it. Even if it isn’t, Google introduces many other changes to its search algorithm, so it’s useful to get an idea of the technologies they feel are worthy of patenting, and a small understanding of how they think.
Cheers for this interesting article; however, I’ve decided to stop building sites specifically for Google and to investigate other methods of traffic generation instead 🙂
Hi Magnus,
I actually really like running across patents from Google that confirm and reinforce things that we might have guessed or gleaned from Google blog posts, interviews, and our experiences. The patent introduces us to a framework for a system like Panda, which might not have been possible without something like it in place.
My great grandmother was using that mushroom test long before blogging or the Web was around, before Vannevar Bush even wrote about the memex, and during a time when coins were being minted that actually contained silver in them. 🙂
Hi Walnut,
Thank you. The Panda update pretty much seems to be a process in and of itself rather than a collection of many algorithmic changes. Google has confirmed that they are presently running this Panda process approximately once a month, and that the aim behind it wasn’t to penalize sites, but rather to improve the quality of search results that they display to searchers.
While the engineer “Panda” may have provided a “breakthrough” that helped the search quality team launch the update, that doesn’t mean that the engineer the update is named after was one of the people who initiated the process, nor does it necessarily give us any indication of when Google first started working upon it. Google has patented processes that have taken years to integrate into its search engine. For example, the instant search results that Google shows us today were first mentioned as a possibility in a patent that was filed in 2005.
I’m not sure that it was ever a good idea to build sites specifically for Google. It’s much better to build sites that can attract attention from many different sources, and it really always has been.
Aren’t “penalizing a website” and “improving the quality of search results” the same thing?
Hi Alex,
They aren’t the same. Penalizing a website is an action that might be taken against a specific site manually, or in an automated fashion because there’s a perception that the site is in some way attempting to manipulate Google’s search results. Improving the quality of search results aims at trying to rank pages returned in search results based upon some combination of relevance and quality.
That is the best ending to a post, ever. I agree, it is good seeing patents that reinforce what we already believe. Adds a few lighted corners to the Google black box.
Hi Bianca,
Thank you. True story.
Sometimes a patent describes something I’ve been observing for a while, and they even have a name for the phenomenon. That’s even better.