Category Archives: Fact Extraction and Knowledge Graphs

Techniques and approaches that search engines might use to extract facts and information from the Web, as uncovered in search-related patents and whitepapers.

Open Data Commons Opportunities

Many government Web sites have made the data that they collect and compile freely available to the public. The licenses that data has been released under are described on the following pages:

ODC Public Domain Dedication and License (PDDL)
Open Data Commons Open Database License (ODbL)
Open Data Commons Attribution License

If you are considering starting a project using that kind of data, you should read the Open Data Handbook, which goes into a lot of detail. Much more information is available on Data.gov, including a broad overview of the topics that data is available about:

  • Agriculture,
  • Business,
  • Climate,
  • Consumer,
  • Ecosystems,
  • Education,
  • Energy,
  • Finance,
  • Health,
  • Local Government,
  • Manufacturing,
  • Ocean,
  • Public Safety,
  • Science & Research.

Continue reading Open Data Commons Opportunities

How Google May Identify Central Entities from Resources

A Google patent granted this week describes how Google might try to understand Entities that appear on Web pages, and how that awareness might influence the results that the search engine shows.

An Entity is a specifically named person, place, or thing (including ideas and objects) that could be connected to other entities based upon relationships between them. Some pages may make certain Entities the main subject of the page, while others may include additional information about entities that are related in some manner to those first entities. When some entities appear on pages, they may be presented in an ambiguous manner that doesn’t make them the main topic of the page they appear upon.

Entities are said to exist in a graph that connects them to other entities based upon relationships between them. For instance, Google and Bing are both search engines, both internet domains, and both employers of many search engineers, and both have CEOs, Vice Presidents, Marketing staff, headquarters, data centers, and Web indexes. There are a lot of related entities that might show up on Web pages about both.
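To make the idea of an entity graph more concrete, here is a minimal sketch in Python. It is only an illustration of entities connected by named relationships, not how Google or Bing store their data. The `EntityGraph` class, the relation names (`is_a`, `has`), and the sample facts are all assumptions made up for this example.

```python
from collections import defaultdict


class EntityGraph:
    """Hypothetical toy graph: each entity maps to a set of (relation, entity) pairs."""

    def __init__(self):
        self._edges = defaultdict(set)

    def add_relation(self, subject, relation, obj):
        # Record a directed, labeled edge from subject to obj.
        self._edges[subject].add((relation, obj))

    def related(self, entity):
        # All entities directly connected to the given entity.
        return {obj for _, obj in self._edges.get(entity, set())}

    def shared(self, a, b):
        # Relationships that two entities have in common.
        return self._edges.get(a, set()) & self._edges.get(b, set())


graph = EntityGraph()
for engine in ("Google", "Bing"):
    graph.add_relation(engine, "is_a", "Search Engine")
    graph.add_relation(engine, "is_a", "Internet Domain")
    graph.add_relation(engine, "has", "Web Index")
# An illustrative fact added for only one of the two entities.
graph.add_relation("Google", "has", "Data Center")

common = sorted(graph.shared("Google", "Bing"))
print(common)
```

The relationships that two entities share are one simple signal that pages about one of them might also mention the other.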

This view of Entities being related to each other, and belonging to an “Entity Graph,” is very similar to the view in the Microsoft patent I wrote about recently in How Bing May Expand Queries Based upon Finding Entities Within them. The ideas behind that patent and this one are similar in that some knowledge about an entity might cause a search engine to display information about related entities.

Continue reading How Google May Identify Central Entities from Resources

Google’s Knowledge Cards

The Google patent “Providing Knowledge Panels With Search Results” includes a reference to an earlier Google patent filing describing Knowledge Cards in depth. The provisional patent is titled, “Apparatus and Method for Supplying Search Results with a knowledge Card”, and it is identified as being Patent Application No. 61/515,305, filed on Aug. 4, 2011.

This provisional patent is not linkable from the Web; otherwise I would provide a link to it.

It is supposedly “incorporated fully” into that later patent filing, but a lot of details about what a knowledge card is have been left out of the later patent filing. I wrote about that later patent in a post titled, How Google Decides What to Know in Knowledge Graph Results, but the patent specifically about knowledge cards contains information not in the later patent.

Knowledge Panel results are part of Google’s Semantic Web search results, which include a mix of result types such as Direct Answers, Structured Snippets, and Rich Snippets. They are part of an evolution of search results happening at Google and at Microsoft’s Bing that goes well beyond yesterday’s 10 blue links. I’ll be following this post with one about the rich search results that show up in response to queries at Bing.

Continue reading Google’s Knowledge Cards

Google on Crawling the Web of Data

A patent granted to Google this past fall explores how the search engine looks for patterns on Web pages that it can use to find facts on the Web and fill up Google’s data repository (Knowledge Base).

An image from a local park in Carlsbad symbolizing the Sun.

I recently wrote a series of posts about Google collecting data to enable them to provide direct answers, starting with one titled Direct Answers – Natural Language Search Results for Intent Queries.

In one of those posts, I wrote about a paper (pdf) that the inventors of that patent co-authored, which describes ways that Google was finding and extracting facts from pages to include in a repository of facts.

Continue reading Google on Crawling the Web of Data

How Google was Corroborating Facts for Direct Answers

When someone searches the web, and asks a question such as “what is the capital of Poland” or “what is the birth date of George Washington” a web search engine such as Google may not be very helpful in providing an answer if it provides a list of web pages that might answer that query instead of an actual answer. People in the SEO community have been referring to such answers as “direct answers.”

Google answering a direct question with a factual answer.

A patent granted to Google this week describes how Google indexes data across the web, and may look to a large collection of facts (in a fact repository such as a knowledge graph) to check upon and verify such answers, so that it can deliver them with more confidence and certainty, like in the answer to the question about George Washington’s birthday shown above.
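As a rough illustration of what corroborating an answer could look like, here is a minimal Python sketch. It is not the method described in the patent. It assumes a simple rule in which a candidate answer is accepted only when enough distinct sources agree on it. The `corroborate` function, the `min_sources` threshold, and the example source names are hypothetical.

```python
def corroborate(candidates, min_sources=2):
    """Return (answer, support) for the best-supported answer, or None.

    candidates is a list of (source, answer) pairs. An answer counts as
    corroborated only if at least min_sources distinct sources state it.
    This agreement rule is an assumption made for illustration.
    """
    sources_by_answer = {}
    for source, answer in candidates:
        key = answer.strip().lower()  # crude normalization of the answer text
        sources_by_answer.setdefault(key, set()).add(source)

    if not sources_by_answer:
        return None

    # Pick the answer that the largest number of distinct sources agree on.
    best, sources = max(sources_by_answer.items(), key=lambda kv: len(kv[1]))
    if len(sources) < min_sources:
        return None
    return best, len(sources)


# Hypothetical candidate answers for "what is the capital of Poland".
found = [
    ("site-a.example", "Warsaw"),
    ("site-b.example", "warsaw "),
    ("site-c.example", "Krakow"),
]
result = corroborate(found)
print(result)
```

An answer that only one source supports is rejected here, which reflects the concern about relying on a single source for an answer.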

The patent tells us that some efforts to build a search engine that can “provide quick answers to factual questions have their own shortcomings.” One of these is that the answers may come from a single source, such as “a particular encyclopedia.” Why this is perceived as a shortcoming is that it is:

Continue reading How Google was Corroborating Facts for Direct Answers

Google Queries for Instances of Data Help Reveal the Classes Where They Belong

You are cloxacillin, a kind of medication and an entity that some people may not know a lot about, but part of a bigger class of medicines that people are familiar with. And you’re taking a trip through a search engine, as someone has recently been prescribed you, and they want to know more about you.

cloxacillin molecule diagram
Ball-and-stick model of oxacillin molecule. The structure is taken from ChemSpider, ID 5873. Author: MarinaVladivostok, 2013-07-22.

They copy your spelling from the bottle they got at the pharmacy. They couldn’t read the handwriting of the doctor who initially prescribed it. Good thing pharmacists are trained in reading doctors’ writing. Your name is spelled out, and with a press of the search box button, knowledge is on its way.

A Google Knowledge panel for cloxacillin

Continue reading Google Queries for Instances of Data Help Reveal the Classes Where They Belong

Rich Snippets and Patterned Queries

Revisiting the Subscribed Links Patent Five Years Later and Finding the Rich Snippets Patent

I first looked at this patent five years ago, but called it the Subscribed Links Patent.

At the time, Google had a Subscribed Links program, where site owners could create specialized search results based upon certain patterns of queries that would show additional content to a searcher. For some of those, you had to log into your Google Account and subscribe to certain links to be shown special content.

Oddly, some of those specialized search results didn’t require subscriptions, and didn’t require logging in, much like these NFL scores from this weekend:

A Football Score Rich Snippet

Continue reading Rich Snippets and Patterned Queries

How Google May Answer Fact Questions Using Entity References in Unstructured Data

A Google patent application explores how Google may answer factual questions from unstructured Web pages and results rather than from more structured sources such as Freebase or Wikipedia. The processes described in the patent are pretty interesting, and they might be more familiar to an SEO-trained audience than to a Semantic Web one, much like a result that ranks well because of a “query deserves freshness” approach.

They also avoid a problem for the search engines that I’ve been thinking about for weeks.

The problem was one that came to me when I attended The Semantic Web Business and Technology 2014 conference around a month or so ago. In a presentation by Yahoo!’s Nicolas Torzec, he discussed Yahoo!’s relatively new Knowledge Graph, and was asked a question by someone from the audience about

Continue reading How Google May Answer Fact Questions Using Entity References in Unstructured Data