Google and Metaweb: Named Entities and Mashup Search Results?

Google’s recent purchase of Metaweb, which runs the Freebase directory, left many wondering about the motivations behind the acquisition. Did Google buy the company for its technology, for its Freebase directory, or for the expertise of its employees?

A Google patent application published today hints at one reason behind the deal, with a mention of Metaweb’s Freebase, and how it could be used by Google in a process that may expand the amount of information that the search giant shows us about specific people, places, and things (including ideas and concepts such as democracy) in search results.

It might also result in search results that are mashups of different information relating to queries involving named entities, such as seen in the image below:

A mashup search result on a search for Mount Bachelor, showing different page segments including one for weather, one for Mount Bachelor Community College, another for lodging and hotels, and an additional one listing other mountains.

Google’s new patent application describes a process that identifies references to specific people, places, or things, referred to in the document as “named entities,” when they appear in queries, and then expands the amount of information that it might look up to include concepts or aspects related to those named entities. It might do this by looking at what a knowledge base such as Wikipedia or Freebase contains about those entities. It would also look at previous queries submitted to Google that include the named entities, to broaden the information returned to a searcher.

Why Broaden Results for Named Entities?

When someone searches for information, they often just want to find an answer to a specific informational need, such as the date a specific event happened, or a transactional need such as downloading some software or making a purchase. But a searcher may want more information than just a single search result, and may be trying to explore a topic that they don’t know much about.

Microsoft recently published a patent filing about how they might present the first page of search results in categories related to a query, based upon a similar notion of helping searchers see a range of concepts and categories related to their query. Google has also described in the past how they might look at Wikipedia to find concepts related to queries and group search results based upon those concepts.

For example, if I search for “Hawaii,” or “Gandalf,” or “bicameralism,” or some other entity, whether real or fictional, place or person or idea, I may be interested in learning more about that person, place, or thing. My query may include an entity as well as some additional terms. So, if I search for Hawaii travel, I may want to find a range of information related to “Hawaii,” as well as the property “travel” that is included in my query. The search engine may look at previous searchers’ queries involving Hawaii in the log files it uses to keep track of queries, and it might look up Hawaii on Wikipedia or Freebase as well.

While looking at those query log files and knowledge bases, it might identify related aspects of entities found in a query, or as the patent filing defines them, “different axes of information along which additional information about an entity can be obtained.” For an entity such as “Hawaii”, some possible aspects might include “beaches,” “hotels,” and “weather.”

Search results shown by the search engine might not be presented in a single list of web pages, news, videos, etc., but could instead be shown as a set of categorized lists. For my “hawaii travel” search, it might show multiple sets of search results which combine my query with aspects related to it. These might include a set of results for “Hawaii beaches,” another for “Hawaii Hotels,” and yet another for “Hawaii weather.”

Those results might also include a summary of information about each of the different sets of results before listing links to other pages and resources on the Web.

The patent filing tells us that it might decide which aspects to display information about based upon both the popularity of those aspects and a diversity score that would help provide a range of aspects that might be missed if popularity were the only thing considered.
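
To make the popularity and diversity tradeoff a little more concrete, here is a rough Python sketch. It is not the patent's formula, which I haven't reproduced here. It is a simple greedy selection (similar in spirit to maximal marginal relevance) that I'm assuming for illustration. The function names, the `weight` parameter, the word-overlap similarity measure, and the query counts are all hypothetical.

```python
def jaccard(a, b):
    """Word-overlap similarity between two aspect labels (0.0 to 1.0)."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0


def select_aspects(popularity, k, weight=0.5):
    """Greedily pick k aspects, trading popularity against diversity.

    `popularity` maps an aspect label to a (made-up) query count.
    `weight` is an assumed knob: 1.0 means popularity only.
    """
    if not popularity:
        return []
    top = max(popularity.values())
    chosen = []
    candidates = set(popularity)
    while candidates and len(chosen) < k:
        def score(aspect):
            pop = popularity[aspect] / top
            # Diversity: 1 minus similarity to the closest aspect already chosen.
            div = 1.0 - max((jaccard(aspect, c) for c in chosen), default=0.0)
            return weight * pop + (1 - weight) * div
        # sorted() keeps tie-breaking deterministic.
        best = max(sorted(candidates), key=score)
        chosen.append(best)
        candidates.remove(best)
    return chosen


# Hypothetical counts for aspects of "Hawaii".
counts = {
    "hotels": 900,
    "cheap hotels": 850,
    "luxury hotels": 800,
    "beaches": 700,
    "weather": 600,
}
by_popularity = sorted(counts, key=counts.get, reverse=True)[:3]
balanced = select_aspects(counts, k=3)
print(by_popularity)  # three hotel variants
print(balanced)       # hotels, beaches, weather
```

With these made-up numbers, ranking by popularity alone fills all three slots with hotel variants, while the balanced selection surfaces beaches and weather as well, which is the kind of range the filing seems to be after.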

The patent filing is:

Identifying Query Aspects
Invented by Fei Wu, Jayant Madhavan, and Alon Halevy
Assigned to Google
US Patent Application 20100198837
Published August 5, 2010
Filed: July 30, 2009

Abstract

Methods, systems, and apparatus, including computer program products, for generating aspects associated with entities. In some implementations, a method includes:

  • Receiving data identifying an entity;
  • Generating a group of candidate aspects for the entity;
  • Modifying the group of candidate aspects to generate a group of modified candidate aspects comprising combining similar candidate aspects and grouping candidate aspects using one or more aspect classes each associated with one or more candidate aspects;
  • Ranking one or more modified candidate aspects in the group of modified candidate aspects based on a diversity score and a popularity score; and
  • Storing an association between one or more highest ranked modified candidate aspects and the entity.

The aspects can be used to organize and present search results in response to queries for the entity.

About the Inventors

It’s not surprising that one of the inventors listed on the patent is Alon Halevy, who came to Google with the acquisition of Transformic, and has worked on projects involving Google’s efforts to extract and organize data about different named entities as described in Uncovering the Relational Web and WebTables: Exploring the Power of Tables on the Web. He seems to be one of the forces at Google behind extracting facts and information from unstructured data on the Web.

Co-inventor Jayant Madhavan also joined Alon Halevy in co-authoring an Official Google Blog post a couple of years ago on how Google has begun experimenting with Crawling through HTML forms, which could provide access to knowledge bases other than just Wikipedia or Freebase. The two have also collaborated on papers on Google Fusion Tables and other ways to extract information from web pages.

Fei Wu was an intern with Alon Halevy and Jayant Madhavan at Google, and according to his resume (no longer available), he developed the process that finds aspects related to named entities described in this patent filing. Some of his previous work before joining Google included a paper on Open Information Extraction using Wikipedia (pdf) and significant work on The Intelligence in Wikipedia Project at the University of Washington.

Extracting Information about Entities

Entities can have properties associated with them. For example, “travel” can be a property associated with “Vietnam” because people travel to Vietnam. A property can be used to limit or refine a search for information about an entity.

Aspects that might be identified about entities might be based upon the entity itself, or upon a class that the entity belongs to. For example, “Daffodil” may be associated with the class “flower,” because a daffodil is a type of flower. Looking at aspects associated with classes that entities may belong to can be helpful, especially when there isn’t much information about a specific named entity, and expanding the gathering of aspects to classes that an entity belongs to may provide additional candidate aspects that might be included in search results.
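
As a rough illustration of that fallback, here is a small Python sketch. The lookup tables, the `candidate_aspects` function, and the `minimum` threshold are all assumptions of mine rather than anything specified in the filing. The idea is simply to use an entity's own aspects first and top them up from its classes when there are too few.

```python
def candidate_aspects(entity, entity_aspects, entity_classes, class_aspects, minimum=3):
    """Return the entity's own aspects, topped up from its classes when there are few.

    All three lookup tables are hypothetical stand-ins for data that might
    come from query logs or a knowledge base.
    """
    aspects = list(entity_aspects.get(entity, []))
    if len(aspects) < minimum:
        for cls in entity_classes.get(entity, []):
            for aspect in class_aspects.get(cls, []):
                if aspect not in aspects:  # skip duplicates, keep order
                    aspects.append(aspect)
    return aspects


# Made-up data: little is known about "daffodil" itself,
# but more is known about the class "flower".
entity_aspects = {"daffodil": ["bulbs"]}
entity_classes = {"daffodil": ["flower"]}
class_aspects = {"flower": ["planting", "bulbs", "meaning"]}

result = candidate_aspects("daffodil", entity_aspects, entity_classes, class_aspects)
print(result)  # ['bulbs', 'planting', 'meaning']
```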

Aspects can be located in a number of places. For example, search histories might be viewed in search query log files to see how people have refined queries in the past. If I search for popcorn, and then follow that up with a search for “microwave popcorn,” I’ve refined my query. Microwave popcorn would be considered a query refinement. A query refinement doesn’t have to include the original query term – for example, if I search for “computer,” and then “laptop,” I’ve also refined my query.

Query refinements can be used to identify potential aspects of named entities. In my search for “Hawaii” above, I may then go on to search for “Hawaii Beaches,” “Hawaii hotels,” and “Hawaii weather.” Each of those searches identifies a possible aspect of the named entity Hawaii: beaches, hotels, and weather.
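
Here is a minimal Python sketch of how refinements within a session might be turned into candidate aspects. It assumes a session is just an ordered list of query strings, and it only handles the simple case where the refinement starts with the entity (so it would miss a refinement like “computer” followed by “laptop”). The function name and the sample sessions are hypothetical.

```python
from collections import Counter


def aspects_from_sessions(sessions, entity):
    """Count the terms searchers added after first searching for the entity alone."""
    entity = entity.lower()
    prefix = entity + " "
    counts = Counter()
    for session in sessions:
        seen_entity = False
        for query in session:
            # Lowercase and collapse whitespace so "Hawaii  Beaches" matches.
            query = " ".join(query.lower().split())
            if query == entity:
                seen_entity = True
            elif seen_entity and query.startswith(prefix):
                counts[query[len(prefix):]] += 1
    return counts


# Made-up query sessions, one list per searcher.
sessions = [
    ["hawaii", "Hawaii Beaches", "hawaii hotels"],
    ["hawaii", "hawaii weather", "hawaii beaches"],
    ["popcorn", "microwave popcorn"],
]
aspect_counts = aspects_from_sessions(sessions, "Hawaii")
print(aspect_counts.most_common())
```

The counts that come out of a step like this could serve as a crude popularity signal for each aspect.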

Query superstrings may also be found in a search engine’s query logs, and unlike query refinements, they don’t have to appear during a query session where one person is modifying their searches to find more information.

So, if someone searches for “vietnam travel packages,” they’ve included a named entity (Vietnam), a property of the entity (travel), and a possible aspect of that entity and property (packages).

A query superstring can include just an entity and an aspect rather than an entity, property, and an aspect. So, if a number of people search for “hawaii beaches” as standalone searches instead of as refinements during a query session, “hawaii beaches” could be considered a query superstring.
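
A simple way to picture this in Python is shown below. This is a sketch under my own assumptions: the log is a flat list of standalone queries, the entity (and optional property) must appear at the start of the query, and whatever follows is treated as the aspect. The function name and the log contents are made up.

```python
from collections import Counter


def aspects_from_superstrings(queries, entity, prop=None):
    """Count trailing terms in standalone queries that start with the entity
    (and, if given, the property)."""
    prefix_words = entity.lower().split() + (prop.lower().split() if prop else [])
    counts = Counter()
    for query in queries:
        words = query.lower().split()
        rest = words[len(prefix_words):]
        # Keep only queries that extend the prefix with at least one more word.
        if words[:len(prefix_words)] == prefix_words and rest:
            counts[" ".join(rest)] += 1
    return counts


# Made-up query log entries (no session information needed).
log = [
    "vietnam travel packages",
    "Vietnam travel tips",
    "vietnam travel packages",
    "vietnam war",
    "hawaii beaches",
    "vietnam travel",
]
with_property = aspects_from_superstrings(log, "Vietnam", "travel")
entity_only = aspects_from_superstrings(log, "Hawaii")
print(with_property.most_common())
print(entity_only.most_common())
```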

Aspects can also be located by analyzing knowledge bases such as Wikipedia and Freebase.

The patent provides a fair amount of detail on how aspects could be identified, and how very similar aspects might be combined.
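
I won't try to reproduce the patent's method for combining aspects here, but a toy Python example can show the general idea. Everything in it is an assumption of mine: the crude normalization (lowercasing and dropping a trailing “s”), the choice to keep the most common variant as the label, and the sample counts.

```python
from collections import defaultdict


def normalize(aspect):
    """Crude normalization: lowercase and strip a trailing 's' from longer words."""
    words = aspect.lower().split()
    return " ".join(w[:-1] if w.endswith("s") and len(w) > 3 else w for w in words)


def merge_similar(counts):
    """Combine aspects that normalize to the same key, summing their counts."""
    groups = defaultdict(list)
    for aspect, count in counts.items():
        groups[normalize(aspect)].append((aspect, count))
    merged = {}
    for variants in groups.values():
        # Label the merged aspect with its most frequent variant.
        label = max(variants, key=lambda v: v[1])[0]
        merged[label] = sum(c for _, c in variants)
    return merged


# Made-up counts for near-duplicate aspects.
merged = merge_similar({"hotels": 500, "hotel": 120, "Hotels": 30, "weather": 200})
print(merged)  # {'hotels': 650, 'weather': 200}
```

A real system would need something much more robust than this (synonyms, misspellings, and so on), but the sketch shows why merging matters: without it, near-duplicates would split their popularity and crowd out other aspects.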

Conclusion

My description of the patent filing is from a fairly high level. The document goes into much more detail on how different aspects involving named entities might be identified, and used to capture much more information than might be presented in a single list of links and documents that might be relevant to a query that includes one of those named entities.

I cited the following quote from a recent Microsoft paper in my post on Google’s acquisition of Metaweb, and it bears repeating here:

According to an internal study of Microsoft, at least 20-30% of queries submitted to Bing search are simply name entities, and it is reported 71% of queries contain name entities.

If Google decides to start using the processes described in this patent filing, we might start seeing search results broken down by categories, or aspects, related to named entities on Google’s search results pages in the future. At least when those queries include named entities – which seems to happen frequently.

Google’s acquisition of Metaweb seems to point us in that direction.

Added: 8/9/2010 – if you haven’t seen this video about Metaweb, and what they do, it’s a nice way of learning more about them:

http://www.youtube.com/v/TJfrNo3Z-DU&hl=en_US&fs=1


19 thoughts on “Google and Metaweb: Named Entities and Mashup Search Results?”

  1. Categorized search results are not a bad idea at all. In fact, the idea is a good one: all options are presented, and one would just have to go to the category that’s most suited to one’s need.

  2. So Google is trying to corner the market, so to speak, on named entities? Yes?

    Google is so big now, it’s a wonder anyone can fathom what they are up to at all. I guess it all boils down to search world domination. Cornering the market on the best search technology and software. Honestly the whole thing is beyond most of us. They move in a world of their own design, far removed from the humble user.

  3. Google are so huge and acquire so many companies (a lot of which fail to work out as stand alone products) that trying to second guess them is just about impossible. Having said that, it was a very good read!

  4. The use of query logs in such a manner can really improve the user’s search; however, I wonder if Google could use other data that they collect. For example, I may not search for the latest science news, but I may go to the Seed magazine site and pick certain articles, or go look at my twitter account to see which article is being referenced by a few scientists that I follow. Now Google may not be tracking my actions specifically to see which articles I am reading to suggest to others, but what if Google Analytics is showing that a certain article about exploration of life on other planets is becoming popular, even though users are not searching for it (they see the link on twitter, facebook, or their favorite science site)? Could that become an entity, added as a trending piece under the category of science, evolution, or space exploration?

  5. Hi Edoardo,

    There are a number of patent filings and papers from Google that suggest they’ve been working towards this kind of information extraction on the web for years, even before Wolfram Alpha’s launch. We can see this in places like the following paper from Google Founder Sergey Brin back in 1998:

    Extracting Patterns and Relations from the World Wide Web

    The abstract from that paper:

    The World Wide Web is a vast resource for information. At the same time it is extremely distributed. A particular type of data such as restaurant lists may be scattered across thousands of independent information sources in many different formats. In this paper, we consider the problem of extracting a relation for such a data type from all of these sources automatically. We present a technique which exploits the duality between sets of patterns and relations to grow the target relation starting from a small sample. To test our technique we use it to extract a relation of (author, title) pairs from the World Wide Web.

  6. Hi Andrew,

    Good point. I think the challenge is in being able to find categories that may be appropriate for different queries and different named entities. For instance, take two different actors, and come up with categories for them. It would be easy to generalize and say that appropriate categories for all actors might be: biography, films appeared in, awards won, etc. But you have actors like Ronald Reagan or Arnold Schwarzenegger who have also had careers in politics, and to not include that as a category for them would be a bad search result.

  7. Hi Joella,

    It’s not just Google that is trying to take advantage of the idea of “named entities.” For instance, Bing often shows categorized results for named entities when you search for them there. Yahoo will show pages that are rich in images, news, videos, etc., for well-known people. Ask has an interesting feature in their news that analyzes relationships between named entities, so that if you search for one in the news, they might show you news results for related named entities as well.

  8. Hi Steve,

    Thank you. You’re right that there are so many moving parts, and so many different directions that a company like Google follows that it can be hard to guess where they might be moving next. Guess that’s why I like to try to keep an eye on the patents that they file, and the whitepapers that they produce – I think it helps.

  9. Hi Frank,

    I think that’s a good point. We know that Google tracks how popular different queries might get, and I would guess that they also track how frequently those terms might show up in places like twitter, in news articles, in blog posts, and on other pages and documents that appear on the web.

    I wrote a post a couple years ago called How Burstiness of Search Queries Could Increase Page Rankings which discussed how a search engine might look at sources other than just queries to see if certain terms might be becoming more popular.

  10. It concerns me what Google is now doing with their privacy policy. Nothing that I have read has raised a “red flag,” but something just feels fishy about it. I know that the reason is to compress all of their policies into one, but I feel like they are also using this so that they have more freedom with our search data.

  11. Hi Neal,

    Not sure what that has to do with Google’s acquisition of Metaweb or this patent about identifying different aspects of entities in queries.

    Having Google’s privacy policy all in one place instead of spread out across many different services makes a lot of sense. I’m not sure that it’s going to impact how they use data that they collect in any meaningful way, but I understand your apprehension about it. Google does collect a lot of data about us, and that’s something to be concerned about by itself.
