When we talk about how web sites are related, it’s not unusual for us to talk about links between sites and pages. Google pays a lot of attention between such links, and they are at the heart of one of its most well known ranking signal – PageRank. PageRank is now more than 15 years old, predating the origin of Google itself in the BackRub search engine.
Google is exploring other signals that may be used to rank pages in search results, including social signals that may result in reputation scores for authors, in relationships between words that might appear together on pages ranking for the same queries, and in relationships between pages that show up in the same search results and in the same search sessions. The Google paper presented at an October 2013 natural language processing conference, Open-Domain Fine-Grained Class Extraction from Web Search Queries (pdf), provides some interesting hints at a possible Google of the future.
Google also seems to be very interested in building a knowledge base of concepts that better understands things like what different businesses or entities are ‘Known for’ or by defining entities better in ‘is a’ relationships. Sometimes pages for specific entities show up at the top of search results because they seem to be the page that people are looking for when they include that entity within a query, like the first two results on a search for [Roald Dahl], as seen in the image below:
When specific people, places, and things show up in queries or in web pages, that can be a signal to search engines to do something special in the results that they show. How prepared are you to understand and anticipate how the search engines treat them? Do you have a strategy in place?
Named entities show up in a lot of queries – they may even be one of the kinds of things that people look for most online. In a 2010 white paper from Microsoft, Building Taxonomy of Web Search Intents for Name Entity Queries (pdf), we are told how large of a role that “named entities” play in search:
According to an internal study of Microsoft, at least 20-30% of queries submitted to Bing search are simply name entities, and it is reported 71% of queries contain name entities.
Within the announcement Google made earlier this year about the Hummingbird update is the search engine might rewrite queries, substituting some terms within them, when they think doing so might improve the results that searches see, and a very recent Google patent describes how Google might use a data driven approach to explore how effective those substitutions might be.
There is a history of Google making changes to queries and results to try to provide better search results.
Titles – In January of 2012, a Google Webmaster Central blog post told us that Google might sometimes change the title of a page in search results if they thought the new title might lead to more clicks and views of a page. While that might not be what the author of a page intended, it shows that Google is trying to make it easier for people to find the information they are searching for. I’ve run across sites where all the pages had the same titles, but unique main headings, and saw Google add the text for the main headings to those titles for each page.
Is Hummingbird the key to understanding the expertise of an author for things like In-Depth articles, and a possible future Author Rank? With content from an author considered using a concept-based knowledge base, it’s quite possible.
The Google Hummingbird rewrite of Google’s search engine wasn’t just aimed at providing a way to better understand long and complex queries, like the type that someone might speak into their phone. It was also likely aimed at better understanding the concepts and topics written about and discussed on Web pages, and in social signals such as posts at Google+ and comments on those posts, in Tweets, in Status Updates, and other short text based messages where there might not be a lot of additional context to go with those messages.
The following screenshot shows the concepts that might appear for Tweets when they are analyzed using the Probase Concept-Based knowledge base (from Short Text Conceptualization using a Probabilistic Knowledgebase):
There are a few different parts to this story, though I’m not sure how many there will be because I’m still in the middle of writing them. I started with a prologue, titled Are You,Your Business, or Products in a Knowledge Base?, which introduced Microsoft’s Conceptual Knowledge Base Probase.
Microsoft’s Probase Knowledge Base
Sometime between when Microsoft acquired semantic search company Powerset and now, the software company began work on one of the largest knowledge bases in the world, Probase. Why Bing doesn’t use it now is a mystery, but it doesn’t appear to. There are a few papers about Probase, including one titled, Concept-Based Web Search. Here’s a snippet from the paper, which might evoke some recent memories of Google’s Hummingbird update:
It is important to note that the lack of a concept-based search feature in all main-stream search engines has, in many situations, discouraged people from expressing their queries in a more natural way. Instead, users are forced to formulate their queries as keywords. This makes it difficult for people who are new to keyword-based search to effectively acquire information from the web.
Added 2013-11-10 – Google was granted a continuation version of this same patent (Search queries improved based on query semantic information) on November 5th, 2013, where the claims section has been completely re-written in some interesting ways. It describes using a substitute term for one of the original terms in the query, and using an inverse document frequency count to see how many times that substitute term appears in the result set for the modified version of the query and for the original version of the query. The timing of this update of the patent is interesting. The link below points to the old version of the patent, so if you want you can compare the claims sections.
Back in September, Google announced that they had started using an algorithm that rewrites queries submitted by searchers which they had given the code name “Hummingbird.” At the time, I was writing a blog post about a patent from Google that seemed like it might be very related to the update because the focus was upon re-writing long and complex queries, while paying more attention to all the words within those queries. I called the post, The Google Hummingbird Patent because the patent seemed to be such a good match.
Google introduced a new algorithm by the name of Hummingbird to the world today at the garage where Google started as a business, during a celebration of Google’s 15th Birthday. Google doesn’t appear to have replaced previous signals such as PageRank or many of the other signals that they use to rank pages. The announcement of the new algorithm told us that Google actually started using Hummingbird a number of weeks ago, and that it potentially impacts around 90% of all searches.
It’s being presented as a query expansion or broadening approach which can better understand longer natural language queries, like the ones that people might speak instead of shorter keyword matching queries which someone might type into a search box.
When you search, especially for topics that you know little about, chances are that you might not include the most relevant terms in your query, or you might use words that may have ambiguous meanings.
One of the areas where search engines focus a lot of attention upon is in reformulating queries through query suggestions and query expansion to help searchers better meet their situational and informational needs quickly.
When you search, you might see a number of query suggestions at the bottom of the results that were first returned, like the ones above on a search for [find airedale terrier puppies]. Or a search engine might include synonyms or substitute queries to expand your original query.