Google identifies terms and phrases to associate with entities such as businesses, locations, and other named things, and these terms can influence what shows up in search results and in knowledge panels for those entities. Consider this part of a growing knowledge base of concepts, entities, attributes for entities, and keywords that shape the new Google after Hummingbird. Semantics play a role here, as the things that specific entities are known for are identified.
For example, the Warrenton, Virginia, Red Truck Bakery (local to me) is known for:
With Google’s Penguin update, it appears that the search engine has been paying significantly more attention to link spam, such as attempts to manipulate links and anchor text pointing to a page. The Penguin update was launched by Google on April 24, 2012, and was accompanied by a blog post on the Official Google Webmaster Central Blog titled Another step to reward high-quality sites.
The post tells us about efforts that Google is undertaking to decrease Web rankings for sites that violate Google’s Webmaster Guidelines. The post is written by Google’s Head of Web Spam, Matt Cutts, and in it Matt tells us that:
…we can’t divulge specific signals because we don’t want to give people a way to game our search results and worsen the experience for users, our advice for webmasters is to focus on creating high quality sites that create a good user experience and employ white hat SEO methods instead of engaging in aggressive webspam tactics.
Is Hummingbird the key to understanding the expertise of an author for things like In-Depth articles, and a possible future Author Rank? With content from an author considered using a concept-based knowledge base, it’s quite possible.
The Google Hummingbird rewrite of Google’s search engine wasn’t just aimed at better understanding long and complex queries, like the type someone might speak into their phone. It was also likely aimed at better understanding the concepts and topics written about and discussed on Web pages, and in social signals such as posts at Google+ and comments on those posts, in tweets, in status updates, and in other short text-based messages where there might not be much additional context to go with them.
The following screenshot shows the concepts that might appear for Tweets when they are analyzed using the Probase Concept-Based knowledge base (from Short Text Conceptualization using a Probabilistic Knowledgebase):
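The core idea behind short text conceptualization can be sketched in a few lines. This is only a toy illustration of the approach described in the Probase papers: the isA table and its probabilities below are invented for the example, and real systems use a web-scale knowledge base with far richer scoring.

```python
# Toy sketch of concept-based conceptualization of short text,
# loosely modeled on the Probase idea described above.
# The ISA table and probabilities are invented for illustration only.

ISA = {
    "python": {"programming language": 0.6, "snake": 0.4},
    "java": {"programming language": 0.7, "island": 0.2, "coffee": 0.1},
    "ruby": {"programming language": 0.5, "gemstone": 0.5},
}

def conceptualize(terms):
    """Score candidate concepts by summing P(concept | term) over all terms."""
    scores = {}
    for term in terms:
        for concept, p in ISA.get(term, {}).items():
            scores[concept] = scores.get(concept, 0.0) + p
    # A concept shared by several terms accumulates a higher score,
    # so it better explains the short text as a whole.
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

print(conceptualize(["python", "java"]))
# "programming language" dominates, because both terms share that concept.
```

The point of the sketch is disambiguation: "python" alone is ambiguous between a language and a snake, but seen next to "java" the shared concept wins, which is exactly the kind of context a bare keyword match misses.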
There are a few different parts to this story, though I’m not sure how many there will be because I’m still in the middle of writing them. I started with a prologue, titled Are You, Your Business, or Products in a Knowledge Base?, which introduced Microsoft’s conceptual knowledge base Probase.
Microsoft’s Probase Knowledge Base
Sometime between when Microsoft acquired semantic search company Powerset and now, the software company began work on one of the largest knowledge bases in the world, Probase. Why Bing doesn’t use it now is a mystery, but it doesn’t appear to. There are a few papers about Probase, including one titled, Concept-Based Web Search. Here’s a snippet from the paper, which might evoke some recent memories of Google’s Hummingbird update:
It is important to note that the lack of a concept-based search feature in all main-stream search engines has, in many situations, discouraged people from expressing their queries in a more natural way. Instead, users are forced to formulate their queries as keywords. This makes it difficult for people who are new to keyword-based search to effectively acquire information from the web.
Added 2013-11-10 – Google was granted a continuation version of this same patent (Search queries improved based on query semantic information) on November 5th, 2013, where the claims section has been completely re-written in some interesting ways. It describes using a substitute term for one of the original terms in the query, and using an inverse document frequency count to see how many times that substitute term appears in the result set for the modified version of the query and for the original version of the query. The timing of this update of the patent is interesting. The link below points to the old version of the patent, so if you want you can compare the claims sections.
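The inverse document frequency count mentioned in the updated claims can be illustrated with a small sketch. To be clear, this is my own simplified reading, not the patent's actual formula: the example terms, result sets, and the log-based IDF variant below are assumptions for illustration.

```python
import math

def idf(term, documents):
    """Inverse document frequency: terms appearing in fewer documents score higher."""
    df = sum(1 for doc in documents if term in doc.lower().split())
    # Add 1 to the denominator to avoid division by zero for unseen terms.
    return math.log(len(documents) / (1 + df))

# Hypothetical result sets for an original query ("running shoes")
# and a modified query where "sneakers" was substituted for "shoes".
original_results = [
    "best running shoes for trails",
    "trail running shoes reviewed",
]
modified_results = [
    "best sneakers for trails",
    "trail sneakers reviewed",
]

# Comparing how common the substitute term is in each result set gives
# a rough signal of whether the substitution changed what was retrieved.
print(idf("sneakers", original_results))
print(idf("sneakers", modified_results))
```

If the substitute term is frequent in the modified result set but absent from the original one, its IDF drops sharply between the two sets, which is one way a system could gauge how much a substitution shifted the results.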
Back in September, Google announced that they had started using an algorithm that rewrites queries submitted by searchers which they had given the code name “Hummingbird.” At the time, I was writing a blog post about a patent from Google that seemed like it might be very related to the update because the focus was upon re-writing long and complex queries, while paying more attention to all the words within those queries. I called the post, The Google Hummingbird Patent because the patent seemed to be such a good match.
This week, Google was awarded a patent that describes how it might score content on how much gibberish it contains, a score that could then be used to demote pages in search results. Gibberish content here refers to content that is likely representative of spam.
The patent describes gibberish content as pages that might contain a number of high-value keywords but may have been generated through:
- Using low-cost untrained labor (from places like Mechanical Turk)
- Scraping content and modifying and splicing it randomly
- Translating from a different language
Gibberish content also tends to include text sequences that are unlikely to represent the natural language text strings, structured in conversational syntax, that typically occur in resources such as web documents.
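One plausible way to score text along these lines is with a language model: spliced or machine-generated text produces word sequences that a model trained on natural text finds improbable. The bigram sketch below is my own illustration, not the patent's method, and the tiny reference corpus is invented; a real system would use a large language model trained on web-scale text.

```python
import math
from collections import Counter

# Tiny "natural language" reference sample, invented for illustration.
REFERENCE = "the quick brown fox jumps over the lazy dog the dog sleeps".split()

bigrams = Counter(zip(REFERENCE, REFERENCE[1:]))
unigrams = Counter(REFERENCE)

def gibberish_score(text):
    """Average negative log-probability of each bigram; higher = more gibberish-like."""
    words = text.lower().split()
    if len(words) < 2:
        return 0.0
    total = 0.0
    for a, b in zip(words, words[1:]):
        # Laplace smoothing so unseen bigrams are penalized, not infinitely so.
        p = (bigrams[(a, b)] + 1) / (unigrams[a] + len(unigrams))
        total += -math.log(p)
    return total / (len(words) - 1)

print(gibberish_score("the quick brown fox"))            # low: matches the reference
print(gibberish_score("buy cheap keywords now fast"))    # higher: improbable sequences
```

Text stitched together from scraped fragments or keyword lists yields many bigrams the model has never seen, pushing its average score up, which matches the patent's intuition that gibberish fails to look like conversational syntax.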
Google introduced a new algorithm by the name of Hummingbird to the world today at the garage where Google started as a business, during a celebration of Google’s 15th birthday. Google doesn’t appear to have replaced previous signals such as PageRank or many of the other signals that it uses to rank pages. The announcement of the new algorithm told us that Google actually started using Hummingbird a number of weeks ago, and that it potentially impacts around 90% of all searches.
It’s being presented as a query expansion or broadening approach that can better understand longer natural language queries, like the ones people might speak, rather than the shorter keyword-matching queries someone might type into a search box.
In my college days, I cooked at some local restaurants (free meals made it an attractive option for a starving college student). One of the restaurants was in the center of town, at one end of Main Street, and it was a popular place for local residents who returned over and over. It had a great reputation, and word-of-mouth advertising propelled the place. Another dining venue I worked at was outside of the center of town, near an interstate highway. It didn’t have a great reputation and had very few regular customers, aside from people who would stop during mealtime off the busy interstate. The “food” sign on the highway attracted most of the traffic to its dining room.
Funny thing is that most of the regulars who frequented the first restaurant rarely had to look up its location, because it was so well known. Most of the people who visited the second restaurant had never been there before, and relied upon the highway sign. There’s another restaurant in that location now, and I have no doubt that many people find it via maps or navigation on their mobile devices or in-car navigation. I mention this because I have some issues with a recently granted Google patent.