The World Wide Web is a vast resource for information. At the same time it is extremely distributed.
A particular type of data such as restaurant lists may be scattered across thousands of independent information sources in many different formats. In this paper, we consider the problem of extracting a relation for such a data type from all of these sources automatically.
We present a technique which exploits the duality between sets of patterns and relations to grow the target relation starting from a small sample. To test our technique we use it to extract a relation of (author, title) pairs from the World Wide Web.
A few years ago, I presented at SES San Jose and someone asked me what they should be keeping an eye upon in SEO. I told them “named entities.” I was reminded of that conversation as I gave a talk today about named entities and other semantics.
I presented this morning at San Jose McEnery Convention Center at the Semantic Technology and Business Conference (#SemTechBiz2014).
Barbara Starr and I gave a 3 hour Tutorial on Semantic Search to an enthusiastic and engaged audience. We also discussed which might be a better name for the tutorial, “Semantic Search” (the name it had) or Semantic SEO (what do you think?).
Here’s Barbara’s presentation, which is the first half of the tutorial Thanks, Barbara – totally brilliant stuff:
I’ve been saying for at least a couple of years that Google’s local search is a proof of concept for the search giant to use on how to find and understand entities.
With local search, Google goes out and looks for a mention of a business on the Web, especially when it it accompanied by geographic location information. It collects and gathers facts related to businesses (entities are people, places, and things) and then it clusters information about the objects it finds to make sure that those mentions across the Web are all referring to the same places.
If you start reading about local search, you’ll see people referring to the importance of consistency in how you present address information for a business, and the same thing is true for entities.
When Google crawls the Web to collect information about objects or entities, it also collects facts about those entities. These facts are separated into different categories or attributes associated with those entities. For example, a book may have attributes such as an author, a publisher, a year published, a web site it can call home , a genre, and more.
Identifying Entities by their Attributes
A search that includes those attributes can be used to identify the entity the attributes might be associated with.
Google was granted a patent recently that describes how those attributes could be searched within an attribute data store to find the entity. The patent shows how the process described within it might be used to answer some complex queries, and some interactive Answerbox type queries. The issue that this patent addresses can be summed up in a single question:
Most of us searchers and site owners and search engine optimizers are familiar with Google’s Link Graph, and how Google uses the connections between websites to help in ranking pages on the Web. In part, Google looks at the relevance of the content of a page compared to a query that a searcher enters at the search engine.
In addition to “relevance”, Google also uses the patented method of PageRank, in which the quality and quantity of links pointed to a page are used as a proxy for the quality of the page being linked to. The higher the quality of a page (and the higher PageRank it possesses), the more PageRank it likely passes along.
The link graph is one example of how Google ranks and measures and possibly sorts web pages. Another that Google might look at is the attention graph – how Google might use topics and concepts that may be searched upon frequently to change rankings of pages based upon freshness and hot topics.
When Google indexes the Web, it’s often been convenient to think about the search engine running two different methods or approaches that seem to run in parallel. One of those involves the crawling and indexing and ranking of pages on the web (and images, videos, news, podcasts, and other documents).
The other approach doesn’t look at pages as much as it indexes objects it finds on the Web, or what we often refer to as named entities, which are specific people, places, or things – real or fictional. We see this second kind of crawling often referred to as fact extraction and see the results of such extraction as Knowledge Panel results or even things like Google’s OneBox Question & Answer results.
When SEOs talk about Google and the programs it uses to crawl and index pages on the Web, we usually refer to those crawlers as robots or spiders or even Googlebot, and don’t differentiate these crawling programs much. Not the kind of robot above (which is a new twist from Google), but it’s probably time to start thinking of Googlebot differently.
When we talk about how web sites are related, it’s not unusual for us to talk about links between sites and pages. Google pays a lot of attention between such links, and they are at the heart of one of its most well known ranking signal – PageRank. PageRank is now more than 15 years old, predating the origin of Google itself in the BackRub search engine.
Google is exploring other signals that may be used to rank pages in search results, including social signals that may result in reputation scores for authors, in relationships between words that might appear together on pages ranking for the same queries, and in relationships between pages that show up in the same search results and in the same search sessions. The Google paper presented at an October 2013 natural language processing conference, Open-Domain Fine-Grained Class Extraction from Web Search Queries (pdf), provides some interesting hints at a possible Google of the future.
Google also seems to be very interested in building a knowledge base of concepts that better understands things like what different businesses or entities are ‘Known for’ or by defining entities better in ‘is a’ relationships. Sometimes pages for specific entities show up at the top of search results because they seem to be the page that people are looking for when they include that entity within a query, like the first two results on a search for [Roald Dahl], as seen in the image below:
Google finds terms and phrases to associate with entities that can be considered terms of interest for businesses, locations, and other entities. These terms can influence what shows up in search results and in knowledge panels for those entities. Consider it part of a growing knowledge base of concepts, entities, attributes for entities, and keywords that shape the new Google after Hummingbird. Semantics play a role as things that specific entities are known for are identified.
For example, the Warrenton, Virginia, Red Truck Bakery (local to me) is known for: