Google’s Word Vector Approach
In October of 2015, a new algorithm was announced by members of the Google Brain team, described in this post from Search Engine Land, Meet RankBrain: The Artificial Intelligence That’s Now Processing Google Search Results. One of the Google Brain team members who gave Bloomberg News a long interview on RankBrain, Gregory S. Corrado, was a co-inventor, along with other members of the Google Brain team, on a patent that was granted this August.
In the SEM Post article, RankBrain: Everything We Know About Google’s AI Algorithm, we are told that RankBrain uses concepts from Geoffrey Hinton involving Thought Vectors.
The summary in the description from the patent tells us about how a word vector approach might be used in such a system:
Particular embodiments of the subject matter described in this specification can be implemented so as to realize one or more of the following advantages. Unknown words in sequences of words can be effectively predicted if the surrounding words are known. Words surrounding a known word in a sequence of words can be effectively predicted. Numerical representations of words in a vocabulary of words can be easily and effectively generated. The numerical representations can reveal semantic and syntactic similarities and relationships between the words that they represent.
By using a word prediction system having a two-layer architecture and by parallelizing the training process, the word prediction system can be effectively trained on very large word corpuses, e.g., corpuses that contain on the order of 200 billion words, resulting in higher quality numeric representations than those that are obtained by training systems on relatively smaller word corpuses. Further, words can be represented in very high-dimensional spaces, e.g., spaces that have on the order of 1000 dimensions, resulting in higher quality representations than when words are represented in relatively lower-dimensional spaces. Additionally, the time required to train the word prediction system can be greatly reduced.
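To make the idea of numerical representations revealing "similarities and relationships" a little more concrete, here is a minimal sketch in Python. The five four-dimensional vectors are hypothetical values I picked by hand for illustration; they are not learned from text and do not come from the patent, which talks about spaces with on the order of 1000 dimensions. The function names `cosine` and `nearest` are my own as well.

```python
import math

# Hypothetical, hand-picked vectors for illustration only.
# A trained system would learn these values from a large corpus.
vectors = {
    "king":  [0.9, 0.8, 0.1, 0.0],
    "queen": [0.9, 0.1, 0.8, 0.0],
    "man":   [0.1, 0.9, 0.1, 0.0],
    "woman": [0.1, 0.1, 0.9, 0.0],
    "apple": [0.0, 0.0, 0.1, 0.9],
}

def cosine(a, b):
    """Cosine similarity: near 1.0 for vectors pointing the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def nearest(target, exclude=()):
    """Return the vocabulary word whose vector is most similar to target."""
    candidates = [w for w in vectors if w not in exclude]
    return max(candidates, key=lambda w: cosine(vectors[w], target))

# Similarity: "king" is closer to "queen" than to "apple" in this toy space.
print(cosine(vectors["king"], vectors["queen"]))
print(cosine(vectors["king"], vectors["apple"]))

# Relationship: king - man + woman, then look for the closest remaining word.
analogy = [k - m + w for k, m, w in
           zip(vectors["king"], vectors["man"], vectors["woman"])]
print(nearest(analogy, exclude=("king", "man", "woman")))
```

With these made-up numbers the arithmetic lands closest to "queen", which is the kind of relationship the Mikolov and Zweig paper cited below explores with vectors that were actually learned from text.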
So, an incomplete or ambiguous query that contains some words could use those words to predict missing words that may be related. Those predicted words could then be used to return search results that the original words might have difficulties returning. The patent that describes this Word Vector Approach prediction process is:
Inventors: Tomas Mikolov, Kai Chen, Gregory S. Corrado and Jeffrey A. Dean
Assignee: Google Inc.
US Patent: 9,740,680
Granted: August 22, 2017
Filed: May 18, 2015
Methods, systems, and apparatus, including computer programs encoded on computer storage media, for computing numeric representations of words. One of the methods includes obtaining a set of training data, wherein the set of training data comprises sequences of words; training a classifier and an embedding function on the set of training data, wherein training the embedding function comprises obtained trained values of the embedding function parameters; processing each word in the vocabulary using the embedding function in accordance with the trained values of the embedding function parameters to generate a respective numerical representation of each word in the vocabulary in the high-dimensional space; and associating each word in the vocabulary with the respective numeric representation of the word in the high-dimensional space.
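The steps in that abstract can be sketched in a few lines of Python. This is only a toy under stated assumptions, not the implementation described in the patent: the four training sentences, the vector size, the window size, the learning rate, and names such as `embed`, `clf`, and `train_epoch` are all my own choices for illustration. It averages the vectors of the surrounding words and asks a softmax classifier to predict the word in the middle, and it runs on a single thread, with none of the parallel training the patent describes.

```python
import math
import random

random.seed(0)

# Training data: sequences of words (a tiny hypothetical corpus).
sentences = [
    "the cat sat on the mat".split(),
    "the dog sat on the rug".split(),
    "the cat lay on the rug".split(),
    "the dog lay on the mat".split(),
]
vocab = sorted({w for s in sentences for w in s})
index = {w: i for i, w in enumerate(vocab)}
V, DIM, WINDOW, LR, EPOCHS = len(vocab), 8, 2, 0.1, 300

# Embedding function parameters: one DIM-sized vector per vocabulary word.
embed = [[random.uniform(-0.5, 0.5) for _ in range(DIM)] for _ in range(V)]
# Classifier parameters: a softmax layer that scores every vocabulary word.
clf = [[0.0] * DIM for _ in range(V)]

def examples():
    """Yield (context word ids, target word id) pairs from the sequences."""
    for s in sentences:
        for i, target in enumerate(s):
            lo, hi = max(0, i - WINDOW), min(len(s), i + WINDOW + 1)
            ctx = [index[s[j]] for j in range(lo, hi) if j != i]
            yield ctx, index[target]

def predict(ctx):
    """Average the context embeddings, then apply the softmax classifier."""
    h = [sum(embed[c][d] for c in ctx) / len(ctx) for d in range(DIM)]
    scores = [sum(clf[w][d] * h[d] for d in range(DIM)) for w in range(V)]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return h, [e / total for e in exps]

def train_step(ctx, target):
    """One gradient step on the cross-entropy loss; returns that loss."""
    h, probs = predict(ctx)
    grad_h = [0.0] * DIM
    for w in range(V):
        err = probs[w] - (1.0 if w == target else 0.0)
        for d in range(DIM):
            grad_h[d] += err * clf[w][d]   # uses the value before the update
            clf[w][d] -= LR * err * h[d]
    for c in ctx:
        for d in range(DIM):
            embed[c][d] -= LR * grad_h[d] / len(ctx)
    return -math.log(probs[target])

def train_epoch():
    """Train on every example once and return the summed loss."""
    return sum(train_step(ctx, t) for ctx, t in examples())

first = train_epoch()
for _ in range(EPOCHS):
    last = train_epoch()
print(first, last)

# Associate each word in the vocabulary with its numeric representation.
word_vectors = {w: embed[index[w]] for w in vocab}

# Use the surrounding words to guess a missing middle word.
_, probs = predict([index[w] for w in ["the", "cat", "on", "the"]])
print(vocab[probs.index(max(probs))])
```

The summed loss printed for the last pass should be lower than for the first, which is all this toy is meant to show. A corpus of four sentences is far too small to produce vectors that mean anything.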
One of the things that I found really interesting about this patent was that it includes a number of citations from the applicants for the patent. They looked worth reading, and many of them were co-authored by inventors of this patent, by people who are well-known in the field of artificial intelligence, or by people from Google. When I saw them, I started hunting for copies of them on the Web, and I was able to find them. I will be reading through them and thought it would be helpful to share those links, which was the idea behind this post. It may be helpful to read as many of these as possible before tackling this patent. If anything in them stands out to you, let us know what you’ve found interesting.
Citations behind this Word Vector Approach:
Bengio and LeCun, “Scaling learning algorithms towards AI,” Large-Scale Kernel Machines, MIT Press, 41 pages, 2007. cited by applicant.
Bengio et al., “A neural probabilistic language model,” Journal of Machine Learning Research, 3:1137-1155, 2003. cited by applicant.
Brants et al., “Large language models in machine translation,” Proceedings of the Joint Conference on Empirical Methods in Natural Language Processing and Computational Language Learning, 10 pages, 2007. cited by applicant.
Collobert and Weston, “A Unified Architecture for Natural Language Processing: Deep Neural Networks with Multitask Learning,” International Conference on Machine Learning, ICML, 8 pages, 2008. cited by applicant.
Collobert et al., “Natural Language Processing (Almost) from Scratch,” Journal of Machine Learning Research, 12:2493-2537, 2011. cited by applicant.
Dean et al., “Large Scale Distributed Deep Networks,” Neural Information Processing Systems Conference, 9 pages, 2012. cited by applicant.
Elman, “Finding Structure in Time,” Cognitive Science, 14, 179-211, 1990. cited by applicant.
Huang et al., “Improving Word Representations via Global Context and Multiple Word Prototypes,” Proc. Association for Computational Linguistics, 10 pages, 2012. cited by applicant.
Mikolov and Zweig, “Linguistic Regularities in Continuous Space Word Representations,” submitted to NAACL HLT, 6 pages, 2012. cited by applicant.
Mikolov et al., “Empirical Evaluation and Combination of Advanced Language Modeling Techniques,” Proceedings of Interspeech, 4 pages, 2011. cited by applicant.
Mikolov et al., “Extensions of recurrent neural network language model,” IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5528-5531, May 22-27, 2011. cited by applicant.
Mikolov et al., “Neural network based language models for highly inflective languages,” Proc. ICASSP, 4 pages, 2009. cited by applicant.
Mikolov et al., “Recurrent neural network based language model,” Proceedings of Interspeech, 4 pages, 2010. cited by applicant.
Mikolov et al., “Strategies for Training Large Scale Neural Network Language Models,” Proc. Automatic Speech Recognition and Understanding, 6 pages, 2011. cited by applicant.
Mikolov, “RNNLM Toolkit,” Faculty of Information Technology (FIT) of Brno University of Technology [online], 2010-2012 [retrieved on Jun. 16, 2014]. Retrieved from the Internet: < URL: http://www.fit.vutbr.cz/~imikolov/rnnlm/>, 3 pages. cited by applicant.
Mikolov, “Statistical Language Models based on Neural Networks,” PhD thesis, Brno University of Technology, 133 pages, 2012. cited by applicant.
Mnih and Hinton, “A Scalable Hierarchical Distributed Language Model,” Advances in Neural Information Processing Systems 21, MIT Press, 8 pages, 2009. cited by applicant.
Morin and Bengio, “Hierarchical Probabilistic Neural Network Language Model,” AISTATS, 7 pages, 2005. cited by applicant.
Rumelhart et al., “Learning representations by back-propagating errors,” Nature, 323:533-536, 1986. cited by applicant.
Turian et al., “MetaOptimize / projects / wordreprs /” Metaoptimize.com [online], captured on Mar. 7, 2012. Retrieved from the Internet using the Wayback Machine: < URL: http://web.archive.org/web/20120307230641/http://metaoptimize.com/projects/wordreprs>, 2 pages. cited by applicant.
Turian et al., “Word Representations: A Simple and General Method for Semi-Supervised Learning,” Proc. Association for Computational Linguistics, 384-394, 2010. cited by applicant.
Turney, “Measuring Semantic Similarity by Latent Relational Analysis,” Proc. International Joint Conference on Artificial Intelligence, 6 pages, 2005. cited by applicant.
Zweig and Burges, “The Microsoft Research Sentence Completion Challenge,” Microsoft Research Technical Report MSR-TR-2011-129, 7 pages, Feb. 20, 2011. cited by applicant.