Category Archives: Fact Extraction and Knowledge Graphs

Techniques and approaches that search engines might use to extract facts and information from the Web, as uncovered in search-related patents and whitepapers.

Finding Entity Names in Google’s Knowledge Graph

Most of us searchers and site owners and search engine optimizers are familiar with Google’s Link Graph, and how Google uses the connections between websites to help in ranking pages on the Web. In part, Google looks at the relevance of the content of a page compared to a query that a searcher enters at the search engine.

In addition to “relevance”, Google also uses the patented method of PageRank, in which the quality and quantity of links pointed to a page are used as a proxy for the quality of the page being linked to. The higher the quality of a page (and the higher PageRank it possesses), the more PageRank it likely passes along.

links between pages, from the reasonable surfer patent

The link graph is one example of how Google ranks and measures and possibly sorts web pages. Another that Google might look at is the attention graph – how Google might use topics and concepts that may be searched upon frequently to change rankings of pages based upon freshness and hot topics.

Continue reading

How Google Might Identify Synonyms for Entities Using Anchor Text

When Google indexes the Web, it’s often been convenient to think about the search engine running two different methods or approaches that seem to run in parallel. One of those involves the crawling and indexing and ranking of pages on the web (and images, videos, news, podcasts, and other documents).

The other approach doesn’t look at pages as much as it indexes objects it finds on the Web, or what we often refer to as named entities, which are specific people, places, or things – real or fictional. We see this second kind of crawling often referred to as fact extraction and see the results of such extraction as Knowledge Panel results or even things like Google’s OneBox Question & Answer results.

Not a web-based robot, but this image is from a Boston Robotics patent.

When SEOs talk about Google and the programs it uses to crawl and index pages on the Web, we usually refer to those crawlers as robots or spiders or even Googlebot, and don’t differentiate these crawling programs much. Not the kind of robot above (which is a new twist from Google), but it’s probably time to start thinking of Googlebot differently.

Continue reading

Entity Associations with Websites and Related Entities

When we talk about how web sites are related, it’s not unusual for us to talk about links between sites and pages. Google pays a lot of attention between such links, and they are at the heart of one of its most well known ranking signal – PageRank. PageRank is now more than 15 years old, predating the origin of Google itself in the BackRub search engine.

Google is exploring other signals that may be used to rank pages in search results, including social signals that may result in reputation scores for authors, in relationships between words that might appear together on pages ranking for the same queries, and in relationships between pages that show up in the same search results and in the same search sessions. The Google paper presented at an October 2013 natural language processing conference, Open-Domain Fine-Grained Class Extraction from Web Search Queries (pdf), provides some interesting hints at a possible Google of the future.

Google also seems to be very interested in building a knowledge base of concepts that better understands things like what different businesses or entities are ‘Known for’ or by defining entities better in ‘is a’ relationships. Sometimes pages for specific entities show up at the top of search results because they seem to be the page that people are looking for when they include that entity within a query, like the first two results on a search for [Roald Dahl], as seen in the image below:

Search results showing authoritative results for Roald Dahl and then results for books he wrote.

Continue reading

How Google Finds ‘Known For’ Terms for Entities

Google finds terms and phrases to associate with entities that can be considered terms of interest for businesses, locations, and other entities. These terms can influence what shows up in search results and in knowledge panels for those entities. Consider it part of a growing knowledge base of concepts, entities, attributes for entities, and keywords that shape the new Google after Hummingbird. Semantics play a role as things that specific entities are known for are identified.

The Red Truck Bakery in Warrenton, Virginia

For example, the Warrenton, Virginia, Red Truck Bakery (local to me) is known for:

Continue reading

How Google Decides What to Know in Knowledge Graph Results

A transformation was triggered at Google with their announcement of the Knowledge Graph in the Official Google Blog post, Introducing the Knowledge Graph: things, not strings. That transformation was one less concerned with matching keywords, and more concerned with matching concepts, understanding entities, and bringing knowledge about entities to searchers in knowledge panels next to search results.

Google published a patent application last week that describes the knowledge panels that appear next to search results as part of the new knowledge graph. Here’s the video that accompanied the post (note the reference to a “panel” in the presentation):

Continue reading

Building Google’s Knowledge Base and Identifying Locations in Web Pages

When we talk about indexing and crawling content on the Web, it’s usually within the context of pages being ranked on the basis of a number of signals found on Web pages that might be ranked in response to queries. Google has told us that the future of search involves Knowledge Bases, and the indexing of Things, Not Strings. Gianluca Fiorelli explored Google’s ideas of Search in the Knowledge Graph Era earlier this week.

A few years back, I wrote some posts about some Google Patents that explored how Google might be extracting and visualizing facts, and using Data Janitors to process that information and clean it up and sort it. Google was granted another patent this week that’s very much related, looking at how Google might understand locations for places collected from Web pages. One of the inventors, Andrew Hogue, gave this Google Tech Talk presentation last year:

Continue reading

Search Engines and the Most Popular Search Terms

When you walk into the lobby of Building 42 at the Googleplex, you can see a display that shows you queries entered into the search engine at any one time. It’s a mesmerizing sight, and I found myself wondering about the people and motivations behind some of the search terms I saw flowing down the screen.

Imagine that instead of seeing one query at a time, that search information was analyzed, and queries were bundled together, to maybe provide us with more meaning.

Can search engines be used to tell us what the world is thinking at anyone time? Would looking at the most popular keywords or queries that people type into a search engine provide us with some insights?

Popular Search Information from Search Engines

Continue reading

How Google Sets Works

A tool from Google that is often overlooked is Google Sets (no longer available), which allows you to “automatically create sets of items from a few examples.”

Google Sets was one of the first applications in the Google Labs (no longer available) pages.

Those pages are “Google’s Technology Playground,” and contain a number of programs that may or may not be tomorrow’s useful applications from the search engine. As Google tells us,

Google labs showcases a few of our favorite ideas that aren’t quite ready for prime time. Your feedback can help us improve them. Please play with these prototypes and send your comments directly to the Googlers who developed them.

Google was granted a patent this week on the process behind Google Sets, and the patent document provides some details on how the program finds additional words based on “items from a set of things” that you enter.

Continue reading