Concept-Based Web Search

There are a few different parts to this story, though I’m not sure how many there will be, because I’m still in the middle of writing them. I started with a prologue, titled Are You, Your Business, or Products in a Knowledge Base?, which introduced Microsoft’s conceptual knowledge base, Probase.

Microsoft’s Probase Knowledge Base

Sometime between its acquisition of the semantic search company Powerset and now, Microsoft began work on one of the largest knowledge bases in the world, Probase. Why Bing doesn’t appear to use it now is a mystery. There are a few papers about Probase, including one titled Concept-Based Web Search. Here’s a snippet from that paper, which might evoke some recent memories of Google’s Hummingbird update:

It is important to note that the lack of a concept-based search feature in all main-stream search engines has, in many situations, discouraged people from expressing their queries in a more natural way. Instead, users are forced to formulate their queries as keywords. This makes it difficult for people who are new to keyword-based search to effectively acquire information from the web.

In my last post on Hummingbird, Google’s Hummingbird Algorithm Ten Years Ago, I included the suggestion that people learn how to do concept-based keyword research, instead of keyword-based keyword research.

That post on concept research was about Bing, and how a knowledge base might be used to improve Bing’s search results. The patent involved a two-step approach: (1) associate a query with entities in a knowledge base, and (2) infer the probable user intent behind the query. While we’ve seen Bing add a knowledge panel to its search results, with entity information displayed, which it called Snapshots, the knowledge base Bing uses now doesn’t seem to be the concept knowledge base Probase. Instead, those knowledge base results seem like an addition to Bing’s search results that doesn’t affect the results themselves.

In March of 2011, ReadWrite Web wrote about some of Microsoft’s “Research Projects,” including Probase, in their post Microsoft Research Watch: AI, NoSQL and Microsoft’s Big Data Future. In my prologue post, I provided a link to the release page for Probase, which is described in the ReadWrite article. On the release notes page linked from that page, there’s a note saying that “this release is only available for internal use for now.” There is a manual (pdf) for it that provides more details on how it might work.

The Death of Powerset?

Has Probase stalled, like Bing’s implementation of Powerset’s semantic search apparently has? I do remember that in the early days of Bing, there were links to Wikipedia pages that appeared to have been annotated by Powerset. The ReadWrite article refers to them as Bing Reference pages. The Bing Powerset blog looks like a ghost town, with broken images instead of broken windows. The post Learn more with Bing Reference hints at what could have been. The Bing Powerset blog lasted one more post, which notes that future Powerset announcements would be made in the main Bing Blog:

From time to time, our PMs and developers will post about a new feature we launched, or what we have in store for the next generation search snippets or answers, two of the areas where the Powerset team is making direct contributions to the quality of Bing. Let me assure you, there is a ton of cool stuff coming, and we can’t wait to be able to share it with you all. So, as we say goodbye to the Powerset blog, I hope you will all still be eager to know what makes Bing special and follow the Bing community blogs, to keep up to date with everything Bing

The post author, and former CEO of Powerset, Lorenzo Thione, appears to have left Bing to be involved in a startup in the art world named Artify. I didn’t check to see if there were any further updates in the Bing Blog.

Was Powerset a dead end for Bing? Was Probase? I don’t know. The quote above about a “concept-based search feature” is from a Microsoft white paper, Concept-Based Web Search (pdf) (a second link to make it easier to visit, and more likely that you will), which hints at the demise of Powerset at Bing:

There have been some attempts to support semantic web search using some form of a knowledge base. A well-known example is PowerSet which maintains huge indexes of entities and their concepts. This approach, however, will not scale because upates on the entities can be extremely expensive.

The paper does make it seem like Probase doesn’t have that limitation, and the quote about it (above) makes it sound like the type of query re-writing we see in Google’s Hummingbird is one of the features that a concept-based web search engine would bring. Since Microsoft hasn’t made a Hummingbird-like announcement about re-writing queries, it’s likely that Probase hasn’t been folded into Bing yet, if it ever will be.

Concept-Based Search at Google and Bing

I have found a few recently granted Google patents that describe how concepts might be gathered from Web content, but the Microsoft white papers on Probase illustrate an approach to this kind of search that is worth discussing. The one on Probase concludes with the statement:

In this paper, we propose an idea to support concept-based web search. Such search queries contains (sic) abstract or vague concepts in them and they are not handled well by traditional keyword based search.

We use Probase, an automatically constructed general-purpose taxonomy to help understand interpret (sic) the concepts in the queries. By concretizing the concepts into their most likely entities, we can then transform the original query into a number of entity-based queries which are more suitable for traditional search engines.

In short, Probase works by taking a searcher’s query and sorting it into possible term sequences that might include concepts, entities, attributes, and keywords. It may then identify the intent (or the semantics) of those term sequences based upon a set of query patterns. (Keep that in mind for when we discuss the Google patents in a later post.)

The queries might then be re-written into a set of candidate queries, ranked by their likelihood as estimated by word association probabilities (see my post on the Hummingbird patent for a description of one way that word associations, or co-occurrences, might be used to re-write queries). These candidate queries might then be submitted to Probase or Bing to obtain a set of search results.
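To make that pipeline a little more concrete, here is a minimal Python sketch of the two steps described above: parse a query into concept terms and plain keywords, then concretize each concept into its most likely entities to produce ranked candidate queries. The tiny taxonomy and its probabilities are invented purely for illustration; the real Probase mines millions of concept–entity pairs, with probabilities, from web text.

```python
# Illustrative sketch of concept-based query rewriting, loosely following the
# pipeline described in the Probase paper. The taxonomy below is a made-up toy.

# A toy concept -> [(entity, probability)] taxonomy.
TAXONOMY = {
    "database conference": [("SIGMOD", 0.4), ("VLDB", 0.35), ("ICDE", 0.25)],
    "tech company": [("Microsoft", 0.5), ("Google", 0.5)],
}

def parse(query):
    """Split a query into concept terms (found in the taxonomy) and keywords."""
    terms = []
    words = query.lower().split()
    i = 0
    while i < len(words):
        matched = False
        # Greedily try the longest multi-word span first.
        for j in range(len(words), i, -1):
            span = " ".join(words[i:j])
            if span in TAXONOMY:
                terms.append(("concept", span))
                i = j
                matched = True
                break
        if not matched:
            terms.append(("keyword", words[i]))
            i += 1
    return terms

def rewrite(query, top_k=3):
    """Concretize each concept into likely entities, yielding ranked candidates."""
    candidates = [("", 1.0)]
    for kind, term in parse(query):
        if kind == "concept":
            # Expand each partial candidate with every entity for this concept.
            candidates = [
                ((prefix + " " + entity).strip(), p * prob)
                for prefix, p in candidates
                for entity, prob in TAXONOMY[term]
            ]
        else:
            # Keywords pass through unchanged.
            candidates = [((prefix + " " + term).strip(), p)
                          for prefix, p in candidates]
    return sorted(candidates, key=lambda c: -c[1])[:top_k]

print(rewrite("papers from database conference"))
# [('papers from SIGMOD', 0.4), ('papers from VLDB', 0.35), ('papers from ICDE', 0.25)]
```

Each rewritten candidate contains only concrete entities and keywords, so it can be handed to a traditional keyword-based engine, which is exactly the transformation the paper's conclusion describes.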

Conclusion

The paper goes into much more detail on these steps, but at this point it’s possible that Probase will never be released publicly. We’ll get into the Google patents in future posts.

I’ve been watching the videos from a Stanford class on Natural Language Processing and reading The Big Book of Concepts to dig deeper into the methods behind concept-based web search. Both are recommended, and both seem like good starting points for learning about search that relies on understanding concepts to deliver results.

18 thoughts on “Concept-Based Web Search”

  1. Awesome article, I really appreciate the scholarly references (and the fact that there are people this smart doing such important research).

  2. There’s obviously a lot more to search queries than meets the eye. Appreciate all the relevant links – a lot of good information.

  3. Thanks, Sprigley

    The Hummingbird update from Google took a lot of people by surprise, but perhaps it shouldn’t have – if what Microsoft was trying to do with Probase had been better known.

    I love that there are articles out there like the ones that I linked to in my post.

  4. Hi Phil,

    Thank you. The ultimate goal of search is in trying to understand best how to respond to a query. I liked the quote from the Concept-Based Web Search paper a lot when it discussed how people new to the web might find searching with the use of keywords difficult. I’ve seen that first hand, with people I know looking for information that they are sure is online, but not knowing the right “magical” keywords that might bring that information to them. Search shouldn’t work like that.

  5. Hi Bill,

    Great post! Thanks for highlighting the study on contextual search – it’s a very compelling read. If search engines are becoming more proficient at decoding the concepts behind search terms, does this also hint that they are becoming more proficient at understanding the concepts behind the websites that are ranking? For example, the idea of co-citation, where mentions of a website are treated much the same as a traditional link, even if the physical link is not present? Have there been Google patents that hint towards this kind of website analysis?

    Thanks

  6. Hi Ryan,

    Yes, if we are learning to understand queries better, we are also learning to understand the content of web pages better, and the concepts that they contain.

    I don’t recall seeing co-citation mentioned in any of the patents, or even hinted at. It may have come up in previous patents, but not for the analysis of either queries or the content of pages. What you’re describing (Rand’s concept of how co-citation works, from a Whiteboard Friday of a few months ago) isn’t really how it works.

  7. Hi Bill

    Thanks for your reply. I understand that it’s a big topic, but it would be great to hear your opinion on the co-citation theory and how passing relevance from one site to another really works.

  8. Hi Bill,

    In short my way of looking at search.
    In some sense, search is finding a match between (a sequence of) spoken or written words and some index. If the index is built of the words in web pages, you get keyword search :-(. If the index is made of properties of concepts (names, e.g.), you will find concepts (and the pages in which they occur).
    I like to see the index as the knowledge base of a search.
    The real question is then: how do I build the index? Information extraction is partly well on its way, but we are not yet where we would want to be. And that is, of course, needed for building concept-based indexes.
    A few days ago I wrote a post in my blog about that (http://ronaldpoell.blogspot.com/2013/11/information-extraction-yes-but-right-way.html)

    Kind regards

    Ronald

  9. Hi Ryan,

    The co-citation theory is based upon Rand using the word “co-citation” when he should have used “co-occurrence” in a Whiteboard Friday post he did. He has since changed the name of that Whiteboard Friday to replace “co-citation” with “co-occurrence.” It didn’t really capture the idea of the use of co-occurrence quite right, but it is close.

    See:

    Not All Anchor Text is Equal and other Co-Citation Observations
    It’s Not Co-Citation.. but it’s still awesome! (Or what’s really going on in the SERPs?)
    How Google May Substitute Query Terms with Co-Occurrence

  10. Hi Ronald,

    Thank you for sharing your blog post. I’ll be revisiting your thoughts on information extraction and your blog. Appreciate your sharing them.

    I agree with you about the index being the knowledge base of search, and that information extraction isn’t quite where we would want it to be.

    The information extraction approach described in the white papers around Probase calls for the extraction of concepts, entities, attributes, and keywords from content on web pages, and I think that’s a step in the right direction. It’s hard to say from this perspective what the present state of Probase is, and whether it’s been abandoned or is moving forward.

    The Google patents, which I haven’t written about yet but will sometime very soon, describe how they may go about extracting conceptual relationships from content on the Web as well. Not all keywords within a query should be searched upon – they don’t necessarily provide search results that are better in any way than if they were skipped. I did feel like I had to spend some significant time looking at how Probase works before digging into how Hummingbird works more fully though, and what its relationship is with Google’s knowledge base.

  11. Hi Bill,
    I’m not in the mood for patent analysis and will read your post on that with interest.
    Imho, traditional NLP techniques (even with future enhancements) will show their limits. My post on a possible leap in artificial intelligence is not dissociated from the one on information extraction, but you’ll have to read between the lines. In other words, combine the two posts into one.
    If we really want to achieve high quality information extraction we might very well need a whole new solution (sort of Solve for X: not 10 % but 10 x).

    Ronald

  12. Superb post…. The main goal of search is how to respond to a query. In some sense, search is finding a match between spoken or written words and some index. If the index is built of words in web pages, you can have keyword search. If the index is made of properties of concepts, you will find concepts. I like your post.

  13. Traditional web search engines are keyword-based. Such a mechanism is effective when the user knows exactly the right words in the web pages they are looking for. However, it doesn’t produce good results if the user asks about a concept or topic that has broader, and sometimes ambiguous, meanings.
