Entity Associations with Websites and Related Entities
When we talk about how web sites are related, we usually talk about links between sites and pages. Google pays a lot of attention to such links, and they are at the heart of one of its best-known ranking signals, PageRank. PageRank is now more than 15 years old, predating the origin of Google itself in the BackRub search engine.
Google is exploring other signals that may be used to rank pages in search results, including social signals that may result in reputation scores for authors, relationships between words that might appear together on pages ranking for the same queries, and relationships between pages that show up in the same search results and in the same search sessions. A Google paper presented at an October 2013 natural language processing conference, Open-Domain Fine-Grained Class Extraction from Web Search Queries (pdf), provides some interesting hints at a possible Google of the future.
Google also seems to be very interested in building a knowledge base of concepts that better understands things like what different businesses or entities are ‘known for,’ and that defines entities more precisely through ‘is a’ relationships. Sometimes pages for specific entities show up at the top of search results because they seem to be the pages that people are looking for when they include that entity within a query, like the first two results on a search for [Roald Dahl], as seen in the image below:
A Google patent application published earlier this year also explores drawing connections between different named entities (specific people, places, or things) by looking more closely at how certain entities might be associated with specific websites, and by understanding “related entities” for those original entities.
For example, on a search for “John Wayne,” the official John Wayne website shows up as the top result in Google and the second result is the John Wayne Wikipedia page. It’s possible that those rank well not necessarily because of what we might think of as traditional ranking signals such as PageRank and information retrieval scores based upon relevance, but rather because they are pages that have been identified as authoritative on the entity “John Wayne,” and great responses to those queries as navigational results.
While the Roald Dahl search result from the patent application shows books authored by Roald Dahl, the Knowledge Panel result for John Wayne shows movies that he has starred in, and other people whom searchers also look for when they search for John Wayne.
How similar are the processes for including related entities within a set of search results, and including related entities within a knowledge panel in Google results? This patent application tells us that it looks at search results to try to identify related entities, while the knowledge panel results appear to look at query log files as well, to find things that people also search for when they search for an entity that triggers a knowledge panel result. The patent filing is:
Invented by Peter Jin Hong, Pravir K. Gupta, Nathaniel J. Gaylinn, Ramakrishnan Kazhiyur-Mannar, Kavi J. Goel, Omer Bar-or, Jack W. Menzel, Christina R. Dhanaraj, Jared L. Levy, Shashidhar A. Thakur, Grace Chung, and Benson Tsai
US Patent Application 20130238594
Published September 12, 2013
Filed: February 22, 2013
Methods, systems, and apparatus, including computer programs encoded on computer storage media, for identifying entities that are related to an entity to which a search query is directed. One of the methods includes:
- Receiving a search query, wherein the search query has been determined to relate to a first entity of a first entity type, and wherein one or more entities of a second entity type have a relationship with the first entity;
- Receiving search results for the search query;
- Determining that a count of search results identifying a resource containing a reference to the first entity satisfies a first threshold value;
- Determining that a count of search results identifying a resource having the second entity type as a relevant entity type satisfies a second threshold value; and
- Transmitting information identifying the one or more entities of the second entity type as part of the response to the search query.
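The claimed steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Google's implementation: the `SearchResult` fields, the function name, the default threshold values, and the reading of "satisfies a threshold" as "greater than or equal to" are all hypothetical choices made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SearchResult:
    """Hypothetical view of one search result (field names are assumptions)."""
    url: str
    entities_mentioned: frozenset      # entities referenced in the resource
    relevant_entity_types: frozenset   # entity types judged relevant to the resource


def related_entities_for_query(first_entity, second_type, related_entities,
                               results, first_threshold=3, second_threshold=2):
    """Return the related entities to show, or an empty list.

    Assumes the query was already matched to `first_entity` and that
    `related_entities` are the known entities of `second_type` that have
    a relationship with it.
    """
    # Count results whose resource contains a reference to the first entity.
    mention_count = sum(
        1 for r in results if first_entity in r.entities_mentioned
    )
    # Count results whose resource has the second entity type as relevant.
    type_count = sum(
        1 for r in results if second_type in r.relevant_entity_types
    )
    # Assumption: "satisfies a threshold" means "meets or exceeds it".
    if mention_count >= first_threshold and type_count >= second_threshold:
        return list(related_entities)
    return []


if __name__ == "__main__":
    # Made-up results for an author query; URLs are placeholders.
    results = [
        SearchResult("https://example.com/a", frozenset({"Roald Dahl"}), frozenset({"book"})),
        SearchResult("https://example.com/b", frozenset({"Roald Dahl"}), frozenset({"book"})),
        SearchResult("https://example.com/c", frozenset({"Roald Dahl"}), frozenset()),
    ]
    print(related_entities_for_query(
        "Roald Dahl", "book", ["Matilda", "The BFG"], results))
```

Both counts have to clear their thresholds before any related entities are sent back, which matches the idea that the results as a whole must look like they are "about" the entity and its related type.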
Here’s an abbreviated look at the process described in the patent filing, using images from the patent application:
Search results from a query are explored to see whether or not there are authoritative resources for an entity within them. If so, then those results are said to be targeted towards that entity.
If the search result titles and snippets also contain related entities, they may be identified and included within a database of related entities.
The patent does tell us that these related entities might be presented in a ranked order, and provides some of the signals that could be used to order the related entities. (Note that there’s not a link involved at all.)
Ranking scores for related entities can be based at least in part on:
- How often someone searches for the related entity after submitting a query for the first entity.
- How globally popular the related entity might be (sounds like search volume).
- How often a recognized reference to the related entity co-occurs in a same previously submitted query as a recognized reference to the original entity.
- Whether there is data indicating that two or more of the related entities of the second entity type are members of a set of entities that has a specified order, in which case the ranking matches that order (for example, if the entity is a person with children and the children are usually listed in birth order).
- Whether there is data indicating that two or more of the related entities are better known as being part of a broader entity, in which case they are replaced with the broader entity in the ordering of the related entities.
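To make the list above more concrete, here is a rough Python sketch of how those signals might be combined. The patent application does not give a formula, so everything here is an assumption for illustration: the `Signals` fields, the weights, the weighted-sum scoring, and the way ordered sets and broader entities are handled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Signals:
    """Hypothetical per-entity signals, each normalized to the range 0..1."""
    follow_up_rate: float   # searched for after a query for the first entity
    popularity: float       # global popularity (perhaps search volume)
    co_query_rate: float    # appears in the same query as the first entity


# Assumed weights; the filing does not say how signals are weighted.
WEIGHTS = (0.5, 0.2, 0.3)


def score(s):
    """Weighted sum of the three query-based signals (an assumption)."""
    return (WEIGHTS[0] * s.follow_up_rate
            + WEIGHTS[1] * s.popularity
            + WEIGHTS[2] * s.co_query_rate)


def rank_related(candidates, broader=None, known_order=None):
    """Rank related entities.

    candidates:  dict mapping entity name -> Signals
    broader:     optional dict mapping an entity to the broader entity
                 it is better known as part of
    known_order: optional list giving a specified order for some entities
    """
    broader = broader or {}
    # Replace entities with their broader entity, keeping the best score.
    best = {}
    for name, signals in candidates.items():
        key = broader.get(name, name)
        best[key] = max(best.get(key, 0.0), score(signals))

    ranked = sorted(best, key=lambda n: best[n], reverse=True)

    if known_order:
        # Members of an ordered set keep the slots they earned by score,
        # but are rearranged within those slots to match the known order.
        slots = [i for i, n in enumerate(ranked) if n in known_order]
        members = sorted((ranked[i] for i in slots), key=known_order.index)
        for i, n in zip(slots, members):
            ranked[i] = n
    return ranked


if __name__ == "__main__":
    # Made-up signal values for three placeholder entities.
    candidates = {
        "child_2": Signals(0.4, 0.6, 0.3),
        "child_1": Signals(0.3, 0.5, 0.2),
        "child_3": Signals(0.2, 0.4, 0.1),
    }
    print(rank_related(candidates,
                       known_order=["child_1", "child_2", "child_3"]))
```

A weighted sum is just one simple way to blend these signals; a learned model or a different treatment of ordered sets would fit the description in the filing equally well.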
When Google decides to associate an entity with a particular query, it may also identify whether or not there are “related” entities showing up in those search results within places like titles and snippets, and include those entities within the search results as well. This wouldn’t require matching keywords with the original query or a PageRank analysis.
The patent application shows how this would work within search results, but it seems to be applicable to knowledge panel results as well.
As Google’s knowledge base grows, things like relationships between entities will likely be a part of it.