Millions of searches stream into Google every day as people try to meet their informational and situational needs. But those queries don’t disappear once they have been answered. They provide Google with some very interesting and useful information in return. For instance, they tell Google what people are interested in, in real time, right at this moment.
Those queries can help Google populate its knowledge base with more information as well.
When Google collects information about entities (people, places, and things, including products and brands), it might also collect information about the attributes associated with those entities.
A couple of days ago, the Google Research Blog told us how Google might include that kind of factual information in search results, in what they called Structured Snippets. In that post, Google told us that it finds information like this in tables across the Web.
I hadn’t purposefully set out to find a snippet that actually included a table in it, but I asked for “all” the provinces in Canada, and that’s what I got – a snippet that included an actual table.
Information about entities is all over the Web, and it also fills the queries people submit when they search the Web. Since these queries represent the things people look for when they search, they seem like a good source of information about entities and the facts associated with them.
The patent describes how Google might treat entities and attributes in queries that include proper names or somewhat generic attributes (it might ignore them). But what I found interesting was how Google might use linguistic patterns to find, identify, and extract entities and the attributes associated with them.
This process can be done without human oversight or intervention. It involves using search query logs to see what queries people searched for.
Given the number of searches that people perform at Google every day, there’s no shortage of queries to use.
I took the following examples of linguistic patterns that Google might use to identify entities and attributes related to them from the patent.
The patent is:
Inferring attributes from search queries
Invented by Alexandru Marius Pasca and Benjamin Van Durme
Assigned to Google
US Patent 8,812,509
Granted August 19, 2014
Filed November 2, 2012
Systems, techniques, and machine-readable instructions for inferring attributes from search queries. In one aspect, a method includes receiving a description of a collection of search queries, inferring attributes of entities from the description of the collection of search queries, associating the inferred attributes with identifiers of entities characterized by the attributes, and making the associations of the attributes and entities available.
Linguistic patterns can be used to infer entity attributes from search queries. This can be done for a log of search queries to identify attributes for entities.
One extract pattern can be used to scan keyword-based queries for text that matches the format “what is the <attribute> of <entity>.”
- What is the capital of Brazil?
- What is the airspeed velocity of an unladen swallow?
Another extract pattern can be used to scan keyword-based queries for text that matches the format “who is the <attribute> of <entity>.”
- Who is the mayor of Chicago
- Who is the CEO of Google
A third extract pattern might look through queries for text that matches the format “the <attribute> of <entity>.”
- the capital of France
- the manager of the Yankees
A different extract pattern may look for queries that match the format “who is the <entity>’s <attribute>.”
- who is the Yankees’ manager
- who is the airplane’s pilot
An extract pattern can also scan keyword-based queries for text that matches the format “<entity>’s <attribute>.”
- Rosemary’s baby
- Michelangelo’s David
This isn’t an exhaustive list of extract patterns, but it should give you an idea of how effective this approach could be.
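To make those patterns a little more concrete, here is a rough sketch in Python of how extract patterns like the ones above could be run over a log of queries. This isn’t code from the patent. The regular expressions, the function names (`extract` and `infer_attributes`), the lowercasing and "the" stripping, and the sample queries are all my own assumptions for illustration.

```python
import re
from collections import Counter, defaultdict

# Hypothetical regexes approximating the extract patterns listed above.
# The more specific patterns come first so they win over the generic ones.
EXTRACT_PATTERNS = [
    # "what is the <attribute> of <entity>" / "who is the <attribute> of <entity>"
    re.compile(r"^(?:what|who) is the (?P<attribute>.+?) of (?P<entity>.+)$"),
    # "who is the <entity>'s <attribute>"
    re.compile(r"^who is the (?P<entity>.+?)['’]s? (?P<attribute>.+)$"),
    # "the <attribute> of <entity>"
    re.compile(r"^the (?P<attribute>.+?) of (?P<entity>.+)$"),
    # "<entity>'s <attribute>"
    re.compile(r"^(?P<entity>.+?)['’]s? (?P<attribute>.+)$"),
]


def extract(query):
    """Return an (entity, attribute) pair for a query, or None if no pattern matches."""
    # Assumed normalization: lowercase and drop a trailing question mark.
    text = query.strip().lower().rstrip("?").strip()
    for pattern in EXTRACT_PATTERNS:
        match = pattern.match(text)
        if match:
            entity = match.group("entity")
            # Assumed cleanup so "the yankees" and "yankees" count as one entity.
            if entity.startswith("the "):
                entity = entity[4:]
            return entity, match.group("attribute")
    return None


def infer_attributes(query_log):
    """Count how often each attribute is seen for each entity across a query log."""
    counts = defaultdict(Counter)
    for query in query_log:
        pair = extract(query)
        if pair:
            entity, attribute = pair
            counts[entity][attribute] += 1
    return counts


if __name__ == "__main__":
    # A made-up query log, not real search data.
    sample_log = [
        "What is the capital of Brazil?",
        "the capital of brazil",
        "Who is the mayor of Chicago",
        "who is the Yankees’ manager",
        "the manager of the Yankees",
        "weather tomorrow",
    ]
    for entity, attributes in infer_attributes(sample_log).items():
        print(entity, dict(attributes))
```

Counting how often an attribute shows up for an entity is one simple way to associate the two. Patterns this loose will also pick up noise, which is presumably why the patent talks about ignoring some kinds of matches.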
It’s interesting that Google isn’t trying to extract this information from pages published to the Web, with the kinds of patterns shown above, but instead from that vast stream of data from many searchers.