Machine Learning Inside Google

By OSX - Own work, Public Domain,

Understanding Systems

When I was in high school, one of the classes I was required to take was a shop class. I had been taking mostly what the school called “enriched” courses, which were academic classes that featured primarily reading, writing, and arithmetic. A shop class had more of a trade focus. I was surprised when the first lesson on the first day of my shop class was a richer academic experience than any of the enriched classes I had taken.

The instructor started talking about systems, and how many manufacturing processes involved breaking products down into different systems. We were going to start off the class by building an electric motor, an important part of the electrical systems within automobiles. This idea of looking at the internal functions of vehicles, classifying their parts, and understanding how they fit together was an exciting and interesting perspective. I’m reminded of that approach to understanding systems by a newly granted Google patent that uses a machine learning algorithm to classify and understand how support pages might fit together.

Google Refocusing Upon Machine Learning

Steven Levy, author of In The Plex, which covers the earliest days of Google, has been sharing more with us, including a look at how Google has started relying upon machine learning approaches. He tells us about that in a recent post titled How Google is remaking itself as a machine learning first company.

I thought the machine learning article was interesting after reading a recently granted Google patent that attempts to understand what pages on the Web are about using a classification approach. The patent had me asking myself, “Is Google going to say goodbye to PageRank for a new way of ranking Webpages that doesn’t rely upon links from other sites?” They have used PageRank to rank pages from their earliest days.

This new patent focuses upon a way of classifying data that uses an approach based upon “a hierarchical taxonomy of clustered data.”

The patent starts off by using an example of how information for a support center works. The patent tells us that keywords might be extracted from documents about providing support to users in a way that generates clusters of documents with similar keywords.

A classification algorithm might be used where classifications are based upon a taxonomy and documents are classified with a confidence level.

This is an interesting way of looking at the Web, and understanding its different parts and how they fit together.

The Patent

It wasn’t until I looked at the LinkedIn profile for Nadav Benbarak that I gained a sense of why this patent came about. In the section on his experience at Google, we are told about one project he worked upon:

Product manager for new product development effort. Managed product vision, roadmap, design, and implementation. Led 15-person team of engineers and operations specialists.

• Created a new project to develop a suite of tools and data sources for reporting on the quality and effectiveness of Google’s customer service. Secured buy-in from senior management and garnered funding for 10 engineers.

• Launched a new machine learning algorithm for summarizing customer feedback from millions of users. This information drove significant product and operations improvements for the AdWords business.

• Core member of internal consulting team advising Google’s President of Sales on customer service strategy. Drove thought leadership for analysis plan. Team’s recommendations led to a reorganization of Google’s service team and a new initiative to increase customer satisfaction

The project that he worked on as a product manager appears to have been the inspiration behind this patent and how it works:

Methods and systems for classifying data using a hierarchical taxonomy
Inventors: Glenn M. Lewis, Kirill Buryak, Aner Ben-Artzi, Jun Peng, Nadav Benbarak
Assignee: Google
US Patent 9,367,814
Granted June 14, 2016
Filed: June 22, 2012


A method and system for classifying documents is provided. A set of document classifiers is generated by applying a classification algorithm to a trusted corpus that includes a set of training documents representing a taxonomy. One or more of the generated document classifiers are executed against a plurality of input documents to create a plurality of classified documents. Each classified document is associated with a classification within the taxonomy and a classification confidence level. One or more classified documents that are associated with a classification confidence level below a predetermined threshold value are selected to create a set of low-confidence documents. The low-confidence documents are disassociated from each of the associated classifications. A user is prompted to enter a classification within the taxonomy for at least one low-confidence document. The low-confidence document is associated with the entered classification and with a predetermined confidence level to create a newly classified document.
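The review loop in the abstract can be illustrated with a minimal Python sketch. This is not the patent's implementation: the keyword scorer that stands in for trained classifiers, the names `toy_classifier`, `classify_with_review`, and `ask_user`, the 0.5 threshold, and the 1.0 confidence given to human labels are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# Hypothetical two-node taxonomy with hand-picked keywords (illustration only).
TAXONOMY_KEYWORDS = {
    "billing": {"refund", "invoice", "payment"},
    "policy": {"policy", "disapproved", "appeal"},
}


@dataclass
class ClassifiedDocument:
    text: str
    label: str
    confidence: float


def toy_classifier(text: str) -> Tuple[str, float]:
    """Stand-in for a trained classifier: confidence is the share of a
    label's keywords that appear in the text."""
    words = set(text.lower().split())
    scores = {
        label: len(words & keywords) / len(keywords)
        for label, keywords in TAXONOMY_KEYWORDS.items()
    }
    best = max(scores, key=scores.get)
    return best, scores[best]


def classify_with_review(
    texts: List[str],
    classify: Callable[[str], Tuple[str, float]],
    ask_user: Callable[[str], str],
    threshold: float = 0.5,          # assumed cut-off for "low confidence"
    manual_confidence: float = 1.0,  # assumed confidence for human labels
) -> List[ClassifiedDocument]:
    results = []
    for text in texts:
        label, confidence = classify(text)
        if confidence < threshold:
            # Low confidence: discard the automatic label and ask a person.
            label = ask_user(text)
            confidence = manual_confidence
        results.append(ClassifiedDocument(text, label, confidence))
    return results
```

In this sketch, a document the toy classifier cannot place confidently is handed to `ask_user`, and the answer is stored with the fixed manual confidence, mirroring the "newly classified document" step in the abstract.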

Take Aways

We have an idea of how the process described in this patent was used at Google, to help build a customer support taxonomy. It focused upon classifying customer support issues involving things such as “account management, billing, campaign management, performance, policy, etc.”

The patent tells us how useful it would be to collect information and make it available to customer support representatives, by exploring the details of topics such as billing, which includes things such as “payment processing, credits, refunds, etc.”

For instance, the patent tells us that in the subcategory of payment processing, there are issues such as:

1) Customer has questions on activation fee;
2) Customer’s account is marked delinquent;
3) Customer has questions on account cancellation; and
4) Customer has questions on forms of payment and/or invoicing.

The patent provides a rich look at how this taxonomy may have been helpful when having to supply information to advertising customers.

The patent shows us information about how the classification algorithm it uses might work to cluster documents and organize them in a hierarchical manner, like this:

In the above example, the clustering module may define a cluster that contains documents (or references to documents) having both the words “inbox” and “capacity” in their text. Another cluster may include documents having both the words “drop” and “call,” and so on. In some implementations, one or more rules can specify, e.g., what words may be used for clustering, the frequency of such words, and the like. For example, the clustering module can be configured to group together documents where a given word or synonyms of the given word are present more than five times. In another example, the clustering module can be configured to group together documents where any of a pre-defined set of words is present at least once.
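The three grouping rules in that passage can be sketched in a few lines of Python. This is only an illustration under assumptions: the function names, the simple regular-expression tokenizer, and the sample sentences are mine, and the patent does not say how its clustering module is actually built.

```python
import re
from collections import Counter
from typing import Iterable, List


def tokenize(text: str) -> List[str]:
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def cluster_by_all_words(documents: List[str], required: Iterable[str]) -> List[str]:
    """Documents that contain every required word, e.g. "inbox" and "capacity"."""
    required = set(required)
    return [doc for doc in documents if required <= set(tokenize(doc))]


def cluster_by_frequency(
    documents: List[str], word_and_synonyms: Iterable[str], more_than: int = 5
) -> List[str]:
    """Documents where a word or its synonyms appear more than `more_than` times."""
    targets = set(word_and_synonyms)
    cluster = []
    for doc in documents:
        counts = Counter(tokenize(doc))
        if sum(counts[word] for word in targets) > more_than:
            cluster.append(doc)
    return cluster


def cluster_by_any_word(documents: List[str], words: Iterable[str]) -> List[str]:
    """Documents where any word from a pre-defined set appears at least once."""
    targets = set(words)
    return [doc for doc in documents if targets & set(tokenize(doc))]
```

For example, `cluster_by_all_words(docs, ["inbox", "capacity"])` keeps only the documents that mention both words, which matches the first cluster described in the quoted passage.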

Google has started using machine learning processes to solve problems like customer support. This approach aims at making it easier for people inside of Google to help solve customer problems by better understanding those problems and organizing information about how to solve them.

As an SEO, it had me a little excited to see a section that described how Google may rank solutions to problems. This doesn’t appear to be a replacement for PageRank, at least not quite yet. But the roots of organizing a web full of information may be found by starting with solving smaller tasks. This is the passage about ranking documents from the patent. It feels to me like there are some hints in there as to how documents on the Web might be ranked in response to queries from searchers:

In some implementations, the document clusters may be ranked using the ranking module, which may also be executed on the server.

In some implementations, the ranking module ranks document clusters according to one or more metrics. For example, the ranking module may rank the clusters according to the quantity of documents in each cluster, as a cluster with many documents may represent a relatively significant topic (e.g., product issue).

As another example, the ranking module may rank the clusters according to an estimated time to resolution of an issue represented by the cluster (e.g., issues represented by a cluster “software update” may typically be resolved faster than issues represented by a cluster “hardware malfunction”), a label assigned to the cluster, a number of documents in a cluster, a designated importance of subject matter associated with a cluster, identities of authors of documents in a cluster, or a number of people who viewed documents in a cluster, etc.

In an example, a cluster representing an issue that historically has taken a longer time to resolve may be ranked higher than a cluster representing an issue with a shorter historical time to resolution.

In another example, several metrics are weighted and factored to rank the clusters. The ranking module can be configured to output the rankings to a storage device (e.g., in the form of a list or other construct).
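The idea of weighting several metrics into one ranking can be sketched as follows. The patent does not give a formula, so everything here is an assumption for illustration: the `Cluster` fields, the choice to scale each metric by its largest value, the weighted sum, and the sample numbers.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Cluster:
    label: str
    num_documents: int
    avg_days_to_resolve: float
    num_viewers: int


def rank_clusters(
    clusters: List[Cluster], weights: Dict[str, float]
) -> List[Tuple[str, float]]:
    """Return (label, score) pairs, highest score first.

    Each metric is scaled by its largest value so metrics with different
    units can be combined in a weighted sum (an assumed scoring scheme).
    """
    if not clusters:
        return []
    scores = [0.0] * len(clusters)
    for metric, weight in weights.items():
        values = [getattr(cluster, metric) for cluster in clusters]
        top = max(values) or 1  # avoid dividing by zero when all values are 0
        for i, value in enumerate(values):
            scores[i] += weight * value / top
    order = sorted(range(len(clusters)), key=lambda i: scores[i], reverse=True)
    return [(clusters[i].label, round(scores[i], 3)) for i in order]


clusters = [
    Cluster("software update", num_documents=120, avg_days_to_resolve=1.0, num_viewers=300),
    Cluster("hardware malfunction", num_documents=40, avg_days_to_resolve=10.0, num_viewers=100),
]
ranking = rank_clusters(clusters, {"num_documents": 0.5, "avg_days_to_resolve": 0.5})
```

With equal weights on document count and time to resolution, the slower-to-resolve "hardware malfunction" cluster comes out ahead in this made-up data, in line with the patent's example of ranking longer-running issues higher. Weighting document count alone would put "software update" first.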

We’ve heard about a machine learning approach from Google used on Web pages called RankBrain, which appears to be focused upon rewriting queries in a way that helps produce relevant search results for those queries. We’ve been told by Google that there is no RankBrain score and you don’t optimize for it.

What role may machine learning play in how information on the Web might be returned in response to queries? We don’t know at this point, and it’s possible that there’s a lot of learning about machine learning going on at Google these days, as in the building of the automated customer support taxonomy algorithm described in this patent. It appears to have helped solve some problems they were experiencing. We’ll see if it helps fulfill their mission to “organize the world’s information and make it universally accessible and useful.”

16 thoughts on “Machine Learning Inside Google”

  1. Hi Bill,
    Very informative post. I was knowing about the Google rank brain and its machine learning but through your post i was better able to understand it.
    Thanks for sharing.

  2. Great article, I studied machine learning at Uni 20+ years ago but never had a use for it back then, really interesting to see how Google is utilizing it – cheers

  3. Recent changes in Yoast’s plugin are checking for passive voice, transition words, sentence length and general readability. I know this is somewhat tangential to this post, but have you seen anything in the patents that indicates an increased focus by Google on these types of basic readability measures?

  4. Thanks for this… very helpful having already have worked on machine learning projects and neural networks prior to working in a Digital Marketing company.

    I think I already have ideas what the “trusted corpus” is, it’s pretty obvious. In addition, feedback loops using user input is important in feedback mechanisms in neural networks. I’d be interested in how iPython Notebooks are being deployed to servers contributing to the search ranking systems at Google based on TensorFlow or components of TensorFlow.

    What I caution most Search folks on is that neural networks are more than state machines — more than algorithms. My guys ask me about what works to influence Google — well, it’s like asking what works to influence a relationship. Quality. What’s a quality experience? Now, you’re asking what does Google consider a Quality Experience? Well to answer that you have to know the machine learning systems, their bias and their training — and well we know what is used as a “baseline” for training, it’s pretty well implied by the “trusted corpus”.

  5. Hi Roy,

    Thank you for sharing your experience and thoughts regarding this post and what it covers. This particular patent and the process it covers didn’t affect search, but rather how customer service was assisted by machine learning at Google. It’s possible that people working at Google gained experience doing machine learning from a project like this one. I appreciate you sharing your thoughts.

  6. Hi Glenn,

    I’m guessing it’s possible that the changes at Yoast have been made to try to help people improve the quality of their blog posts, but I can’t say that they’ve made such changes based upon patents that suggest to increase readability of pages. I’ve not seen Yoast claim to make changes to his plugins based upon evidence found in patents. It is possible that he may have, but I haven’t seen such a claim from him (and I haven’t looked for one.) I’m guessing it’s possible to predict that doing things to improve quality of web pages, can mean that people might spend more time on those pages, and may be more likely to refer others to those pages, and to share them socially or by linking to them.

  7. Hi Nigel,

    It does seem that more people are now using machine learning than in the past. From what I understand, faster computers have been making a difference in how effective such an approach can be. It is interesting seeing machine learning playing such an increased role at Google, as described in the BackChannel article.

  8. Hi Robin,

    I’m happy to hear that you liked this post. It’s about a different machine learning approach at Google than RankBrain, and this particular patent doesn’t provide a lot of insights into how RankBrain works, though it does give us an idea of how one approach that involves classifying web pages does work. It does show how Google has started relying upon machine learning for things like improving customer service for Google’s paid search.

  9. Great article!
    I love A.I. i even have bought robot Nao.
    It is amazing!
    Write some article about it

  10. Hi Pete,

    Google hasn’t shared a lot with us on how RankBrain works, but we are seeing Google actively learning more about machine learning, and I thought it was interesting seeing how else they might be using machine learning inside of Google itself. I think if we watch carefully, we might learn more. 🙂

  11. Thanks for this article Bill. Machine learning is something that has been taking the world by storm, and it is definitely something worth reading more into. You can definitely see that they are getting much better at it, and I’m excited to embrace the new changes that come along. Thanks for keeping us updated, and I always appreciate your posts.

  12. I really enjoyed this , it also remind me of my school days , when i was young i use to love the shop class because i love machine and love to know about and how it works , Thanks for sharing lovely post .
