Google acquired the company Wavii for a little more than $30 million in April. There was some speculation that the purchase was an effort to match Yahoo's acquisition of Summly, which summarizes news from the Web.
A Wavii app did just that: it acquired and summarized news from the Web. When Wavii emerged from stealth mode, it was touted as a personalized news aggregator built around topics rather than keywords. The app closed down with Google's acquisition of the company, and instead of providing news aggregation services, it appears that the technology will help fuel Google Now, Google's Knowledge Base, and Google Glass, according to the TechCrunch article linked above.
So what is that technology?
Oren Etzioni hinted at Wavii being so much more in an article he published in Nature in 2011, where he wrote about the limitations (pdf) of Google, Bing, and Wolfram Alpha, and the future of search. What does the future of search mean? This video provides a short introduction:
I love the comparisons in the video to Google and Google's Knowledge Graph, and the statement:
Our goal is to build the next generation of search engines.
I checked to see which patents Wavii held at the time of the Google acquisition, and it appears that one describing Open Information Extraction on the Web (pdf) was assigned to Wavii.
The granted patent and a follow-up patent application are:
Open information extraction from the Web (granted original version)
Open information extraction from the Web (follow-up continuation patent application, with a new claims section)
Invented by Michael J. Cafarella, Michele Banko, and Oren Etzioni
Assigned to: University of Washington through its Center for Commercialization
United States Patent 7,877,343
Granted January 25, 2011
Abstract
To implement open information extraction, a new extraction paradigm has been developed in which a system makes a single data-driven pass over a corpus of text, extracting a large set of relational tuples without requiring any human input. Using training data, a Self-Supervised Learner employs a parser and heuristics to determine criteria that will be used by an extraction classifier (or another ranking model) for evaluating the trustworthiness of candidate tuples that have been extracted from the corpus of text, by applying heuristics to the corpus of text.
The classifier retains tuples with a sufficiently high probability of being trustworthy. A redundancy-based assessor assigns a probability to each retained tuple to indicate a likelihood that the retained tuple is an actual instance of a relationship between a plurality of objects comprising the retained tuple. The retained tuples comprise an extraction graph that can be queried for information.
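To make the abstract a little more concrete, here's a rough, self-contained sketch of the three stages it names: a single pass that extracts candidate tuples, a classifier that keeps only the trustworthy ones, and a redundancy-based assessor that assigns probabilities. The toy corpus, the regular-expression "parser," and the scoring rules are my own illustrations, not code from the patent:

```python
# A toy sketch of the three stages named in the abstract:
# (1) a single data-driven pass that extracts candidate tuples,
# (2) a classifier that retains only tuples it scores as trustworthy, and
# (3) a redundancy-based assessor that turns repeated extractions into
#     probabilities. The "parser", corpus, and scoring rules are
#     illustrative assumptions, not code from the patent.

import re
from collections import Counter

CORPUS = [
    "Google acquired Wavii.",
    "Google acquired Wavii in April.",
    "Wavii summarized news.",
]

# Crude stand-in for a parser: the first three words as (arg1, relation, arg2).
PATTERN = re.compile(r"^(\w+) (\w+) (\w+)")

def extract_candidates(sentences):
    for sentence in sentences:
        match = PATTERN.match(sentence)
        if match:
            yield match.groups()

def trustworthy(candidate, threshold=0.5):
    # Stand-in for the learned extraction classifier: trust tuples whose
    # relation word looks like a past-tense verb.
    score = 0.9 if candidate[1].endswith("ed") else 0.2
    return score >= threshold

def redundancy_probabilities(tuples):
    # More independent extractions of the same tuple -> higher probability
    # that it reflects a real relationship between the objects.
    counts = Counter(tuples)
    return {t: 1 - 0.5 ** n for t, n in counts.items()}

retained = [t for t in extract_candidates(CORPUS) if trustworthy(t)]
print(redundancy_probabilities(retained))
# {('Google', 'acquired', 'Wavii'): 0.75, ('Wavii', 'summarized', 'news'): 0.5}
```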
Rather than breaking down the patent filings in detail, I'm going to leave you with the following resources to get more depth on how this Open Information Extraction system might work.
First is the video:
Open Information Extraction at Web Scale
(It’s a long one, but definitely worth watching)
These papers and pages also provide more details:
- Open Information Extraction
- Open Information Extraction: the Second Generation (pdf) by Oren Etzioni, Anthony Fader, Janara Christensen, Stephen Soderland, and Mausam
- Open Information Extraction Software
- Open Language Learning for Information Extraction (pdf), by Mausam, Michael Schmitz, Robert Bart, Stephen Soderland, and Oren Etzioni
Take Aways
Wavii isn’t bringing a news aggregator app to Google like the one they were offering before the acquisition. Instead, the Open Information Extraction approach that they are bringing to the search engine is aimed at reading through text on the Web, without predefined templates or supervision.
The extraction approach identifies nouns and how they might be related to each other by the verbs that create a relationship between them, and rates the quality of those relationships. A “classifier” determines how trustworthy each relationship might be, and retains only the trustworthy relationships.
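To give a flavor of what that trust rating might look at, here is a small hypothetical sketch of the kind of lightweight features a classifier could compute for a candidate (noun, verb, noun) tuple. The specific features are my own illustration, loosely inspired by the cues described in the Open Information Extraction papers linked above rather than taken from the patent:

```python
# Hypothetical features for scoring a candidate (noun, verb, noun) tuple.
# The feature set is illustrative only and is not taken from the patent.

def tuple_features(tokens, arg1, relation, arg2):
    """Return simple cues describing how arg1 and arg2 are connected."""
    i, j = tokens.index(arg1), tokens.index(arg2)
    between = tokens[min(i, j) + 1:max(i, j)]
    return {
        "relation_looks_like_verb": relation.endswith(("ed", "es", "ing")),
        "relation_sits_between_args": relation in between,
        "tokens_between_args": len(between),
        "arg1_is_capitalized": arg1[:1].isupper(),
        "arg2_is_capitalized": arg2[:1].isupper(),
    }

tokens = ["Google", "acquired", "Wavii", "in", "April"]
print(tuple_features(tokens, "Google", "acquired", "Wavii"))
# {'relation_looks_like_verb': True, 'relation_sits_between_args': True, ...}
```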
The terms within these relationships (each set considered a "tuple") are stored in an inverted index that can be used to respond to queries. Here's an example of relationships that might be identified during a crawl of the Web and become part of this index:
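The tuples below are hypothetical stand-ins rather than actual entries from Wavii's index, but they show the general shape of what gets stored and how an inverted index over the terms could answer a query:

```python
# Hypothetical tuples of the form (argument1, relation, argument2), using
# facts mentioned in this post; the tiny inverted index is illustrative only.

from collections import defaultdict

tuples = [
    ("Google", "acquired", "Wavii"),
    ("Yahoo", "acquired", "Summly"),
    ("Wavii", "summarizes", "news"),
    ("Summly", "summarizes", "news"),
]

# Map each term to the set of tuples that mention it.
index = defaultdict(set)
for t in tuples:
    for term in t:
        index[term].add(t)

def query(*terms):
    """Return the tuples that contain every query term."""
    matches = [index.get(term, set()) for term in terms]
    return set.intersection(*matches) if matches else set()

print(query("Wavii", "acquired"))
# {('Google', 'acquired', 'Wavii')}
```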
An example of this open information extraction approach applied to a limited amount of data is RevMiner, which can be used to search for information about restaurants in Seattle.
There is a lot of potential for a system like the one acquired with Wavii to improve Google's knowledge base and Google Now, with predictive queries answered based upon context. Open Information Extraction is still a work in progress, but it may be a big part of the future of search.
Nice write-up Bill. Seems like a key ingredient in getting from 10 blue links to “Computer, coffee black.”
Thanks, Gyi,
I stared at the patent filings for a week or so, trying to figure out how to best put them into context, and ultimately decided that the context was more important than a breakdown of the patent filings. Both videos are worth watching, and the longer one on Open Information Extraction does a good job of explaining how the technology behind the patent works.
The days of the 10 blue links look like they’re going to be disappearing quicker than we may know.
I recently spent some time in the patent filings trying to find information on transactional vs. informational queries. I never found what I was looking for, but your site came up when I was Googling! Thanks Bill, love your work and analysis!
Thanks for this writeup, it helps to clarify the patents that Wavii had. I wonder how this might relate to Google Glass and the searching that users might do on it.
Thanks, Gyi!
Search is ever evolving, and the idea of a system that "reads" the web and classifies as it goes is really interesting. The fact that they were able to read through enough to capture real-time data on timely topics makes me wonder what they can do with a lot more computing power. Google has a lot of projects going on, and some approaches that seem like they might overlap a little, and it's tough to guess which approaches might be most effective, but this Open Information Extraction approach is pretty interesting. Imagine if a system like this were limited to just Google Plus and provided Google Now with a social element based upon Google Plus, so that you would have an idea of posts on the social network on topics of interest to you that you might otherwise have missed. Not quite "Coffee black," but definitely closer than what we had before.
Hi Anne,
Thanks for your kind words. The paper (rather than a patent) that I would definitely recommend starting with is Andrei Broder's A taxonomy of web search at http://www.sigir.org/forum/F2002/broder.pdf
He was at Yahoo for a fair amount of time before moving to Google, where I believe he still is. Definitely one of the pioneers of web search.
Hi Jameson,
I had heard about Google buying Wavii, but suspected it was about a lot more than a news aggregator app.
I’m thinking that it might be really helpful in making Google’s knowledge base a lot richer in terms of collecting facts about entities and facts associated with them. I also think that it has the potential to make Google Now a lot richer, by uncovering useful information at the appropriate times to deliver to people who might find it useful and aren’t necessarily searching for it.
I heard of Summly when it was purchased by Yahoo and was intrigued by the idea. When you're killing time on your phone, small bits of information are a much better experience than a large article. Of course, it would be nice to have the option to send the full article to your PC for full reading at a later time.
It will be interesting to see how Google will be utilizing this.
Hi Chris,
Wavii's ability to collect, summarize, and aggregate news related to topics you specify was definitely a great proof of concept of what they can do, but I do think there's a whole lot more behind their open learning fact extraction approach than a diversion while "killing time." It appears that the Wavii news app has closed, but how the ideas behind it might be used in Google's knowledge base and Google Now is really interesting.