In creating a knowledge base, there seem to be a number of approaches that can be used to supply entities and facts from sources like web pages and query logs.
In my last post, I wrote about how search queries might be used, along with linguistic patterns, to extract attributes about facts from those search queries, as described in a patent titled Inferring attributes from search queries.
A Microsoft paper from 2009, Named Entity Recognition in Query, tells of a manual analysis the authors performed of 1,000 queries, and tells us that 70% of those queries contained named entities.
So entities do appear in queries, and Google receives a lot of queries a day (as do Microsoft and Yahoo).
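To make the idea of pairing queries with linguistic patterns a little more concrete, here is a minimal sketch in Python. The three patterns, the function names, and the sample queries are my own assumptions for illustration; they are not taken from the patent, which may work quite differently.

```python
import re
from collections import Counter, defaultdict

# Hypothetical linguistic patterns, chosen only for illustration.
# Each one names an entity and an attribute that a query asks about.
PATTERNS = [
    re.compile(r"^what is the (?P<attr>.+?) of (?P<entity>.+?)\??$"),
    re.compile(r"^(?P<attr>[\w ]+?) of (?P<entity>.+?)\??$"),
    re.compile(r"^(?P<entity>.+?)'s (?P<attr>[\w ]+?)\??$"),
]


def extract(query):
    """Return an (entity, attribute) pair if the query fits a pattern, else None."""
    normalized = query.strip().lower()
    for pattern in PATTERNS:
        match = pattern.match(normalized)
        if match:
            return match.group("entity").strip(), match.group("attr").strip()
    return None


def collect_attributes(queries):
    """Count how often each attribute is asked about for each entity."""
    counts = defaultdict(Counter)
    for query in queries:
        pair = extract(query)
        if pair is not None:
            entity, attribute = pair
            counts[entity][attribute] += 1
    return counts


if __name__ == "__main__":
    sample_queries = [
        "What is the population of Scotland?",
        "capital of France",
        "Tom Cruise's height",
        "cheap flights",
    ]
    for query in sample_queries:
        print(query, "->", extract(query))
```

An attribute that many different searchers ask about for the same entity is a more plausible candidate for a knowledge base than one that shows up once, which is why the sketch counts occurrences instead of just collecting them.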
Millions of searches stream into Google every day as people try to meet their informational and situational needs. But those searches don’t disappear once they have been performed. They provide Google with some very interesting and useful information in return. For instance, they tell Google what people are interested in, in real time – right at this moment.
Those queries can help Google populate its knowledge base with more information as well.
When Google collects information about entities – people, places, and things, including products and brands – it might collect information about those entities as well as about the attributes associated with them.
A couple of days ago, the Google Research Blog told us about how it might include that kind of factual information in search results, in what they called Structured Snippets. In that post, Google told us that it finds information like this in tables across the web.
Entities change all the time, and facts about them do as well. Imagine that when Derek Jeter retires from playing baseball, he decides to become a coach. Or that Tom Cruise acts in a new movie, and decides to try directing and producing it as well. Or that Scotland decides whether or not it should be independent of the UK after 300 years.
What we think of entities can change over time, when it comes to the type of entity they are, and the facts associated with them. When populations of places change, and they do on a regular basis, how does that information get updated? And unfortunately, sometimes some information never quite makes it to Google’s knowledge base.
A patent application published last week looks at some ways that a knowledge base might be updated when a question-answering query is asked of it and the search system notices that some information is missing.
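As a rough illustration of that idea, here is a minimal sketch in Python of a fact store that notes what it could not answer, so the gap can be filled later. The class, its method names, and the `lookup` callable are all hypothetical; the patent application's actual process is not described in this excerpt and may differ considerably.

```python
from collections import deque


class KnowledgeBase:
    """A toy fact store that remembers which questions it could not answer."""

    def __init__(self):
        self.facts = {}         # (entity, attribute) -> value
        self.missing = deque()  # gaps noticed while answering queries

    def answer(self, entity, attribute):
        """Return the stored value, or record the gap and return None."""
        key = (entity, attribute)
        if key in self.facts:
            return self.facts[key]
        if key not in self.missing:
            self.missing.append(key)
        return None

    def fill_gaps(self, lookup):
        """Try each recorded gap once.

        `lookup` is a hypothetical callable standing in for whatever
        outside source might supply a value; it returns None on failure.
        Unresolved gaps are put back in the queue for a later attempt.
        """
        filled = 0
        for _ in range(len(self.missing)):
            key = self.missing.popleft()
            value = lookup(*key)
            if value is None:
                self.missing.append(key)
            else:
                self.facts[key] = value
                filled += 1
        return filled


if __name__ == "__main__":
    kb = KnowledgeBase()
    kb.facts[("scotland", "capital")] = "Edinburgh"
    print(kb.answer("scotland", "capital"))     # stored fact
    print(kb.answer("scotland", "population"))  # gap is recorded
    print(list(kb.missing))
```

The point of the sketch is only that the question itself is a signal: a query the system cannot answer tells it which entity and attribute are worth looking for.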
When Google introduced us to the knowledge graph, it also introduced us to pictures and the possibility of other kinds of rich content (video, audio, etc.) in those knowledge panels, and pictorial lists displayed in carousels at the top of pages in response to a query, such as “What is the tallest building in the World?”
A Google patent granted a couple of weeks ago describes how Google processes search system queries, and might display knowledge graph answers to questions that include images. Here’s where they introduced carousels, in their page on the Knowledge Graph:
This is officially part of the story I’m telling in a presentation I prepared for SMX East, in a couple of weeks in New York. The name of the session I’m in is “Hummingbird and the Entity Revolution,” which reminds me of a Prince song from the 1980s.
The story starts off with a student given a tour by another student whom he gets into a fight with. They liked fighting with each other, and ended up becoming close friends. They studied together, and when their supervising professor went away to Japan for a year, they stopped working on their advanced degrees, and played on the internet instead. They created something they called Backrub. It later had its name changed to Google, and many people in the present day think it is the internet.
On March 10, 1999, Sergey Brin filed a “Miscellaneous Incoming Letter” (this is what it is described as in the USPTO’s PAIR database). It’s a provisional patent titled Extracting Patterns and Relations from Scattered Databases Such as the World Wide Web (pdf) (Skip quickly past the first couple of pages. It becomes much more legible from the third page on.)
I recently found a patent with two Google search engineers, Joshua Ain and Justin Boyan, listed as two of the three inventors. Last summer, at Google I/O in San Francisco, they joined together to talk about some tools that can help webmasters more easily add markup for structured data on the Web. The patent appears to be for Google’s Data Highlighter, which was one of those tools.
It inspired me to try to add structured data markup to my website, a task likely to fail for a few reasons.
I hadn’t read the patent yet last night, and I hadn’t done anything to improve the patterns found on my site to make them more consistent. In other words, I learned the hard way, much as most non-developers and non-programmers would.
The video below is an introduction to a number of Google tools, including the Google Data Highlighter.
Google has been answering queries with its search engine for over 15 years, and has been showing us it can answer questions with facts from its Browsable Fact Repository and/or the Google Knowledge Graph.
Might Google at some point bring the two together?
To a degree, Google has been merging some results, showing a set of search results (from the search engine) and a knowledge panel (from the Knowledge Graph) on the same results page. But you could say that those are separate and unique entities on search results pages.
In earlier days at Google, when you asked a question, you could sometimes get a response that answered questions such as:
“When was George W. Bush’s birth-date?”
We knew that Google could answer some questions like that, even if it might have been challenging, but we didn’t have much of a clue regarding the existence of something like Google’s Knowledge Graph until 2011. The answers we would see would sometimes be regular snippets where a word such as “birth-date” might be bolded.
A set of 17 “related patents,” which I first saw mentioned in a patent I wrote about this past Tuesday (granted on August 19th), appears to have been created by a team under Andrew Hogue, who was tasked with creating “an annotation framework” to index more objects, and the facts associated with them, on the web. He discusses this more deeply in the presentation The Structured Search Engine, which is highly recommended.
He also oversaw Google’s acquisition of Metaweb and the move of 25 former Metaweb staff members into Google.