There are a few different parts to this story, though I’m not sure how many there will be, because I’m still in the middle of writing them. I started with a prologue, titled Are You, Your Business, or Products in a Knowledge Base?, which introduced Microsoft’s conceptual knowledge base, Probase.
Microsoft’s Probase Knowledge Base
Sometime between Microsoft’s acquisition of the semantic search company Powerset and now, the software company began work on one of the largest knowledge bases in the world, Probase. Bing doesn’t appear to use it, though, and why not is a mystery. There are a few papers about Probase, including one titled Concept-Based Web Search. Here’s a snippet from the paper, which might evoke some recent memories of Google’s Hummingbird update:
It is important to note that the lack of a concept-based search feature in all main-stream search engines has, in many situations, discouraged people from expressing their queries in a more natural way. Instead, users are forced to formulate their queries as keywords. This makes it difficult for people who are new to keyword-based search to effectively acquire information from the web.
Added 2013-11-10 – Google was granted a continuation version of this same patent (Search queries improved based on query semantic information) on November 5th, 2013, in which the claims section has been completely rewritten in some interesting ways. It describes using a substitute term for one of the original terms in the query, and using an inverse document frequency count to see how many times that substitute term appears in the result set for the modified version of the query and for the original version of the query. The timing of this update of the patent is interesting. The link below points to the old version of the patent, so if you want you can compare the claims sections.
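The claims don’t spell out the mechanics, but the core idea, comparing how often a substitute term shows up in the result sets for the original and the modified query, can be sketched roughly. Everything below (the helper names, the toy documents) is my own illustration, not language from the patent:

```python
import math

def doc_frequency(term, documents):
    """Count how many documents contain the term (case-insensitive)."""
    term = term.lower()
    return sum(1 for doc in documents if term in doc.lower())

def inverse_document_frequency(term, documents):
    """Smoothed IDF: rarer terms score higher."""
    df = doc_frequency(term, documents)
    return math.log((1 + len(documents)) / (1 + df))

# Toy result sets for an original query [car repair] and a
# modified query where "auto" substitutes for "car".
original_results = [
    "Find a car repair shop near you",
    "Car repair estimates and reviews",
    "How to fix a flat tire",
]
modified_results = [
    "Auto repair services and pricing",
    "Certified auto repair mechanics",
    "Car and auto repair guides",
]

substitute = "auto"
# If the substitute term is common in the modified result set but rare
# in the original one, its IDF is lower there, which could be one signal
# of how much the substitution shifted the results.
idf_original = inverse_document_frequency(substitute, original_results)
idf_modified = inverse_document_frequency(substitute, modified_results)
print(idf_original > idf_modified)  # True: "auto" is rarer in the original set
```

This is just one plausible reading of “an inverse document frequency count” applied to the two result sets; the patent doesn’t say how the two counts are combined.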
Back in September, Google announced that they had started using an algorithm, code-named “Hummingbird,” that rewrites queries submitted by searchers. At the time, I was writing a blog post about a patent from Google that seemed very related to the update, because its focus was upon rewriting long and complex queries while paying more attention to all the words within those queries. I called the post The Google Hummingbird Patent because the patent seemed to be such a good match.
Google introduced a new algorithm by the name of Hummingbird to the world today at the garage where Google started as a business, during a celebration of Google’s 15th Birthday. Google doesn’t appear to have replaced previous signals such as PageRank or many of the other signals that they use to rank pages. The announcement of the new algorithm told us that Google actually started using Hummingbird a number of weeks ago, and that it potentially impacts around 90% of all searches.
It’s being presented as a query expansion or broadening approach that can better understand longer natural language queries, like the ones people might speak, rather than the shorter keyword-matching queries someone might type into a search box.
When you search, especially for topics that you know little about, chances are that you might not include the most relevant terms in your query, or you might use words that may have ambiguous meanings.
One area where search engines focus a lot of attention is reformulating queries, through query suggestions and query expansion, to help searchers better meet their situational and informational needs quickly.
When you search, you might see a number of query suggestions at the bottom of the results that were first returned, like the ones above on a search for [find airedale terrier puppies]. Or a search engine might include synonyms or substitute queries to expand your original query.
When I talk or write about entities, it’s normally in the context of specific people, places, or things. Google was recently granted a patent that discusses a different, more narrowly defined type of entity. These entities are referred to as “search entities,” and the patent uses them to predict probabilities and to better understand the relationships between them. This kind of analysis might result in some pages ranking higher than they otherwise would because of their similarities to other sites, and in some sets of search results favoring fresher results as well.
These search entities can include:
But I’m a substitute for another guy
I look pretty tall but my heels are high
The simple things you see are all complicated
I look pretty young, but I’m just backdated, yeah
– Pete Townshend
When you search at Google, how easy is it to find what you’re looking for? Do you search again, but try different but related words if your first attempt doesn’t uncover pages that you find useful?
If I search for “car repair” and follow it up with a search for “auto repair,” I would suspect that I would see a lot of the same pages, but perhaps not in the same order. I would also expect to see local search results for both, and I do. The local search results aren’t in the exact same order either. Some words or phrases do make good substitutes for others though, as can be seen in the image below:
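You can measure that kind of overlap yourself. Here’s a minimal sketch using Jaccard similarity over two hypothetical top-result lists (the URLs are made up, and nothing here comes from the patent itself):

```python
def jaccard_similarity(results_a, results_b):
    """Overlap between two result sets: |A ∩ B| / |A ∪ B|."""
    a, b = set(results_a), set(results_b)
    return len(a & b) / len(a | b)

# Hypothetical result URLs for [car repair] and [auto repair].
car_repair = ["example.com/car-fix", "shop.example/repairs", "diy.example/cars"]
auto_repair = ["shop.example/repairs", "diy.example/cars", "pro.example/auto"]

# A high score suggests the engine treats the two queries as near-substitutes.
print(round(jaccard_similarity(car_repair, auto_repair), 2))  # 0.5
```

Comparing overlap at several points in time could also hint at when a search engine starts or stops treating one phrase as a substitute for another.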
Somewhere out there is a universe that looks exactly like this one, and appears to run exactly like this one. Except something’s a little different. A little off. It’s as if search engines took a left turn instead of a right turn, back in the early 2000s. Instead of using only meta descriptions, and possibly body text from web pages, as the descriptive text, or snippets, for those pages in search results, they learned a new trick. Imagine that the content surrounding anchor text in a link to a page was collected and evaluated based upon a quality score, and that this associated and usually descriptive text was used to generate snippets instead.
My thought on the possibility is that anchor text often doesn’t do the best job of describing a page, and links to a page often come from a third party who might not have much interest in writing text that would make a good snippet for that page. But Google filed a patent for such an approach back in 2003, and it was granted this week, so they pursued what was described within the patent for over a decade. The patent does mention that headings on pages might also be used as potential snippets, and provides the following example: “Computers > Algorithms > Compression”. But that’s a small part of the patent. They don’t limit the approach to anchor text that a site might provide itself, like the breadcrumb trail navigation for a page.
There’s also a part to this approach that recognizes that many pages have more than one link to them, so a choice would need to be made as to the best “snippet” to show.
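A naive version of that choice, scoring each candidate anchor-context passage and keeping the best one, might look like the sketch below. The scoring heuristic (query-term coverage with a length penalty) is purely my guess at a stand-in for whatever quality score the patent has in mind:

```python
def score_snippet(snippet, query_terms):
    """Toy quality score: reward query-term coverage, penalize extreme lengths."""
    words = snippet.lower().split()
    term_hits = sum(1 for t in query_terms if t.lower() in words)
    length_penalty = abs(len(words) - 20) / 20  # prefer roughly 20-word passages
    return term_hits - length_penalty

def best_snippet(candidates, query_terms):
    """Pick the highest-scoring text collected from links to a page."""
    return max(candidates, key=lambda s: score_snippet(s, query_terms))

# Hypothetical anchor-context passages gathered from three inbound links.
candidates = [
    "click here",
    "a detailed guide to lossless compression algorithms for large text corpora",
    "Computers > Algorithms > Compression",
]
print(best_snippet(candidates, ["compression", "algorithms"]))
# -> the "a detailed guide ..." passage wins
```

The interesting part isn’t the arithmetic, it’s that low-quality anchors like “click here” lose to richer surrounding text, which is exactly why a quality score would be needed before showing third-party text as a snippet.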
Google was granted an updated version of a patent this week that looks at how the search engine might use directories in URL structures to help it better understand the categories on a Web site, and to categorize new pages and directories that might be added to a site. The patent tells us that this might enable the search engine to add supplemental information to pages, such as advertisements that fall within the categories displayed upon the site.
Some other patents I’ve written about in the past show that the search engines might be doing more with categories than just deciding which ads to show on a page.
Imagine that you have a site about car parts, and you decide to organize the pages of the site first by car make, so the main categories on your site are different brands, and your second level of directories is organized by car model. You might then have sub-subcategories organized by different systems within cars, such as “electrical,” “transmission,” “cooling,” “suspension,” and so on. URLs for a couple of your pages might look like:
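Reading a category hierarchy out of such URLs is straightforward to sketch. The example URL below is my own assumption following the make > model > system layout just described, not one from the patent:

```python
from urllib.parse import urlparse

def categories_from_url(url):
    """Treat each directory in the URL path as a category level."""
    path = urlparse(url).path
    segments = [s for s in path.split("/") if s]
    if not segments:
        return []
    # Drop the final segment if it looks like a filename rather than a directory.
    return segments[:-1] if "." in segments[-1] else segments

# Hypothetical URL following the make > model > system structure above.
url = "http://example.com/ford/mustang/electrical/alternator.html"
print(categories_from_url(url))  # ['ford', 'mustang', 'electrical']
```

A search engine doing something like this could then compare the extracted hierarchy across pages of the same site to categorize newly added pages and directories, as the patent describes.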