Search Taxonomies and Search Engines: Answering Questions vs. Indexing Webpages

If you were to search for [Ronald Reagan Movies] at Google or Yahoo or Bing, would you expect to see a list of movies that the former President and actor appeared in?

It’s more likely that you would see a set of web pages that contain the words “Ronald” and “Reagan” and “Movies,” which might contain the names of films starring the former politician and thespian.

A patent application from Yahoo published last week explores ways to return information directly to searchers, based upon building taxonomies of information about specific people, places, and things, gathered from information found on web pages, rather than having searchers look through multiple web pages to find answers to queries such as “Ronald Reagan movies.”

Both Yahoo and Google do some question answering when faced with certain queries that involve “named entities,” or the names of well-known people, places, and things. For example, search at either search engine for [Babe Ruth birthplace], and above the web pages on the search results pages appears an answer to that question:

Google search result showing Babe Ruth's Place of Birth.

Yahoo search result showing Babe Ruth's Place of Birth.

But neither search engine provides more detailed sets of information, such as lists of quotes from certain people, or movies that they appeared in, or political offices they held. Would these be the kinds of things that people might like to see in search results? Yahoo’s patent filing explores how they might build taxonomies of information about someone like Ronald Reagan, and extract information from web pages to build answers to questions like those for display on a search result page. The patent filing is:

Creation and Enrichment of Search Based Taxonomy for Finding Information from Semistructured Data
Invented by Sudharsan Vasudevan, Rohan Monga, Hemanth Sambrani, and N S Sekar
US Patent Application 20090282010
Assigned to Yahoo!
Published November 12, 2009
Filed June 18, 2008

Abstract

Techniques are provided for creating and updating a entity hierarchy (taxonomy) based on information captured about user interaction with a system. Techniques are also provided for using the taxonomy to determine the nature of entities represented by terms submitted to a search engine. Search logs analyzed for related sets of entities, and used to improve the taxonomy for storing information.

Once the taxonomy is created, information across data sources are fetched and aggregated based on the taxonomy. When the system is queried, the query is modified to a predefined template, and the best fit result is promptly returned. A feedback mechanism is also provided to enhance taxonomy and entity data based on search volumes. This system enables search engines to provide accurate answers when entities, their attributes and relationships are involved.

The inventors of the patent use Ronald Reagan as one example because he can fit into more than one “main” category in a taxonomy or classification system, with a history as both a movie star and a politician. Under the “movie stars” category might be attributes such as “date of birth” as well as “movies acted.” Under the “politicians” category, we might also see “date of birth,” but other attributes such as “offices held” may also be included.
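One way to picture an entity that sits in more than one category is a small data structure. This is only a minimal sketch: the category and attribute names come from the example above, but `CATEGORY_ATTRIBUTES`, `Entity`, and `attributes()` are illustrative names and are not taken from the patent filing.

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical category definitions: each category lists the attributes
# that entities in that category are expected to have.
CATEGORY_ATTRIBUTES = {
    "movie stars": ["date of birth", "movies acted"],
    "politicians": ["date of birth", "offices held"],
}


@dataclass
class Entity:
    name: str
    categories: List[str] = field(default_factory=list)

    def attributes(self) -> List[str]:
        """Return the union of attributes across all categories, in first-seen order."""
        seen: List[str] = []
        for category in self.categories:
            for attribute in CATEGORY_ATTRIBUTES.get(category, []):
                if attribute not in seen:
                    seen.append(attribute)
        return seen


reagan = Entity("Ronald Reagan", ["movie stars", "politicians"])
print(reagan.attributes())
# ['date of birth', 'movies acted', 'offices held']
```

The shared “date of birth” attribute shows up once, while each category contributes the attributes that are specific to it.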

What’s kind of interesting about this is that it reminds me of the structure at the roots of Yahoo’s origin as a directory. We’re told that Yahoo would build taxonomies using a combination of feedback from search query logs and from manual human intervention. The human editing aspect of building a taxonomy would help with making sure information is correct, and automated feedback from query logs would help make sure that the taxonomy was up-to-date and included very recent information from search trends.

Many of the examples included in the patent description involve well-known people, places, or things, often referred to as “named entities,” such as Johnny Depp or the Empire State Building. But the patent filing tells us that a taxonomy might also include broad and specific categories that don’t involve named entities, from “humans” to “11th grade teachers.”

For many of the taxonomies that a system like this might create, the search engine might start with pre-existing data sources, such as the Internet Movie Database (not specifically named in the patent filing) or yellow page directories.

When it comes to certain types of categories, such as those that might list people, there may be default attributes associated with those that are defined by human editors, such as a “date of birth,” or “place of birth,” or a “date of death.”

Other attributes that could be applied to categories might be learned by looking at search logs to see what people are looking for. An example might be that people often look for quotes from someone like “Mark Twain.” If those kinds of searches tend to be common, it might be reasonable for a search engine to collect Mark Twain quotes to show to searchers when there are queries for [mark twain quotes].
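As a rough illustration of learning attributes from search logs, the sketch below counts what searchers type after an entity’s name and keeps the terms that recur. The patent filing doesn’t spell out a method like this; the function name, the `min_count` threshold, and the toy log are all assumptions made up for the example.

```python
from collections import Counter


def suggest_attributes(queries, entity, min_count=2):
    """Suggest attributes by counting the text that follows the entity name in queries."""
    prefix = entity.lower() + " "
    counts = Counter(
        query.lower()[len(prefix):].strip()
        for query in queries
        if query.lower().startswith(prefix)
    )
    # Keep only non-empty terms seen at least min_count times, most frequent first.
    return [term for term, n in counts.most_common() if term and n >= min_count]


# A made-up query log, for illustration only.
SAMPLE_LOG = [
    "mark twain quotes",
    "Mark Twain quotes",
    "mark twain books",
    "mark twain quotes",
    "mark twain",
]

print(suggest_attributes(SAMPLE_LOG, "Mark Twain"))
# ['quotes']
```

A real system would need to handle word order, misspellings, and much larger logs, but the idea is the same: frequent query patterns hint at which attributes are worth collecting.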

Some people, places, and attributes have common aliases or alternative names. For example, when you search for a date of birth for someone, you might use the words “birthday,” or “born,” or “d.o.b”. When someone searches for Johnny Depp, they might also search for “Johnny D.”, “J. Depp”, and “Jack Sparrow”. A search query including “United States” might call the country “US,” or “USA” or “United States of America.” A search engine may learn to associate these alias names automatically from search logs.
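Once aliases like these are known, mapping them back to a single canonical name can be as simple as a lookup table. The sketch below uses the aliases mentioned above. The table is hand-built here as an assumption, whereas the patent describes learning such associations from search logs, and `canonicalize` is a hypothetical helper name.

```python
# Canonical names mapped to the aliases mentioned in the examples above.
ALIASES = {
    "date of birth": ["birthday", "born", "d.o.b"],
    "Johnny Depp": ["Johnny D.", "J. Depp", "Jack Sparrow"],
    "United States": ["US", "USA", "United States of America"],
}

# Reverse index: lowercased alias (or canonical name) -> canonical name.
CANONICAL = {}
for canonical_name, alias_names in ALIASES.items():
    CANONICAL[canonical_name.lower()] = canonical_name
    for alias in alias_names:
        CANONICAL[alias.lower()] = canonical_name


def canonicalize(term):
    """Return the canonical name for a term, or None if the term is unknown."""
    return CANONICAL.get(term.strip().lower())


print(canonicalize("usa"))       # United States
print(canonicalize("J. Depp"))   # Johnny Depp
print(canonicalize("Born"))      # date of birth
```

With both the entity and the attribute normalized this way, a query such as [J. Depp birthday] and a query such as [Johnny Depp date of birth] would point at the same slot in the taxonomy.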

Sources for categories and attributes to display to searchers might be identified by editors as well as from click-through information from searchers as seen in search logs. Some sources might be given higher “confidence levels” by those human editors, or by the search engines, though the patent doesn’t tell us the qualities that might be used to determine those confidence levels. Presumably, if a particular web page was used to provide information and answers to queries, the search engine would link to that page, as in the examples above about Babe Ruth’s birthplace from Google and Yahoo.
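To make the idea of confidence levels a little more concrete, here is a minimal sketch that picks the answer from the most trusted source and keeps its URL so the result can link back to it. Since the patent doesn’t say how confidence would be calculated, the scores, the `Candidate` type, and the example.com URLs below are all hypothetical.

```python
from typing import List, NamedTuple, Optional


class Candidate(NamedTuple):
    value: str         # the extracted answer
    source_url: str    # the page the answer came from, kept for attribution
    confidence: float  # hypothetical score between 0 and 1


def best_answer(candidates: List[Candidate]) -> Optional[Candidate]:
    """Return the candidate from the highest-confidence source, or None if there are none."""
    if not candidates:
        return None
    return max(candidates, key=lambda candidate: candidate.confidence)


# Made-up candidates for the query [Babe Ruth birthplace].
candidates = [
    Candidate("Baltimore", "https://example.org/ruth-bio", 0.6),
    Candidate("Baltimore, Maryland", "https://example.com/babe-ruth", 0.9),
]

answer = best_answer(candidates)
print(answer.value, "(source:", answer.source_url + ")")
# Baltimore, Maryland (source: https://example.com/babe-ruth)
```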

Conclusion

The patent does go into more detail on how it might build taxonomies, and how it might decide what information might be shown in response to certain kinds of queries.

While that is worth spending some time with, what is most interesting about this patent application is that it shows a desire to answer queries directly, rather than presenting searchers with pages that may or may not provide those answers. Of course, the search engines will likely continue to show web pages that might be good results for queries from searchers after displaying answers.

If you’re a searcher looking for information on a topic, and search engines answered your questions directly like this, how comfortable would you feel with those answers?

If you’re a site owner, would it bother you that a search engine might mine your web site to display answers and potentially keep visitors from coming directly to your web site for those answers?


23 thoughts on “Search Taxonomies and Search Engines: Answering Questions vs. Indexing Webpages”

  1. Hi Bill,

    Great post – thanks for bringing this to my attention – well worth a tweet :)
    BTW I’m now following you via @headup and @pop_art…

    Cheers,
    Mike

  2. This is great … since the focus is on people, places and things it right away made me think of Wikipedia, and that’s where typically you would expect to find all needed information about famous people, places or things, so I wonder how Wikipedia feels about this … as a searcher I can see this being valuable for not yet famous people, places, things where they may not even have their own website but particles of content about them are spread all over and this brings them together

  3. So does this mean that search engines like Yahoo will take other people’s content ( hard work ) from their sites and repackage it for their own purposes, keeping searchers on their sites and potentially clicking on their own ads? —

    “A patent application from Yahoo published last week explores ways to return information directly to searchers, based upon building taxonomies of information about specific people, places, and things, gathered from information found on web pages, rather than having searchers look through multiple web pages to find answers to queries such as ‘Ronald Reagan movies.'”

  4. Hi Nick,

    Nice post. I do think that the presentation of information in a manner like this, with a link back to the source does present an opportunity to sites and sources that provide information. The question I have is, will people stop once they’ve found their answer, or will they be likely to click through to the source of that information? I’m leaning towards the possibility that they will click through a fair amount of the time – which would be positive for the site that is the source.

  5. Hi Shiva,

    I thought of Wikipedia as I was reading the article as well. It’s not surprising that Bing is using Powerset technology on Wikipedia articles when they display results involving “named entities.” Google and Yahoo both do a limited amount of question answering already – the question is how much more of it would they consider doing. This patent application shows that they could possibly give us much more detailed answers to questions. Will they?

    Will they also start answering questions about people or places or things that aren’t very well known? I don’t know. The patent mentions the possibility of that as well, but not in much depth.

  6. Hi People Finder,

    To some degree it does mean exactly that – the search engines extract facts from web pages about people, places, and things, and provide information about them, with links to the sources of that information. Which is why, in my post, I ask – should a search engine be somewhere that people search for answers, or a place where people search for pages that provide answers?

  7. Wow this is interesting. Bill, you seem to know a lot about patents, is Yahoo going to have to prove that they were the first to ever do this? I thought you had to prove that you originally developed something in order to patent it? Maybe this is different enough from what was done before?

    This is actually a fantastic idea for Yahoo. I’m sure I’m not the only internet user who is annoyed during those somewhat rare situations where you are searching for something very simple but have difficulty finding it through a search engine because of all the false positive hits. I’m not sure that Google is doing a good enough job of staying ahead of SEO tricks, and this could be a chance for Yahoo to grab back a bit of market share.

    BTW Bill, great content here – I just found your blog recently and I am very impressed.

  8. It’s an interesting topic Bill. As a searcher there are times when I don’t need to see the web page. I just want a quick answer. The Babe Ruth birthplace query is a good example. I’d rather a search engine give me the answer than having to click through to a page and scan the results.

    Of course as a site owner I want people clicking through and visiting and scanning my site. I wouldn’t be bothered if a search engine presented an answer as simple as Baltimore, Maryland, but as they expand how much of an answer they give it would bother me.

    Take the listing of Ronald Reagan movies. As a searcher I think it would be great to have the list right there in the search results. But is it fair that someone compiled the list only to have a search engine steal the effort? Perhaps if the search engine added a link below the information, something like “compiled from” or “according to” it might be a good compromise. Credit where credit’s due.

  9. Hi Buzzlord,

    Thank you. I’ve been spending a lot of time digging through patents over the past few years.

    A general (and generalized) rule of thumb for patents is that they need to describe an actual process that is new, non-obvious, and useful. They don’t have to describe something so innovative that they will transform the world completely, and many patent filings cover something that is fairly small in scope – a new way of doing one thing or another.

    Given that Google, Yahoo, Microsoft, and Ask all have very similar goals in mind when it comes to delivering search results, it’s possible to see patents from each that cover some very similar ground. As I mentioned in the post, both Google and Yahoo already do something like this in response to queries like birthdates. Is this patent different enough to be granted? We really won’t know until it actually is.

  10. Hi Steven,

    Good questions and points. Quick answers can be good, and it’s possible that attribution in the form of a link does make it worth being the “source” of information shown.

    There are other questions that can be asked though. Is there a point where taking the answers to a question from a site raises copyright issues? Are there fair use issues involved? As you point out, how much information is too much?

    Web publishers want visitors, and they want people to visit their pages. Search engines want to provide answers to searchers, whether directly or by leading them to pages that provide those answers. At some point there is the potential for a conflict when a search engine starts providing direct answers, taken from sites that they’ve crawled on the Web, and a link to the source may not be enough to make site owners happy.

    If I wanted a list of Ronald Reagan movies, it would be nice to see them listed for me in the search results – but I might feel like I’m given better information visiting the Internet Movie Database (IMDB.com). Not everyone will feel that way, though. Does IMDB lose out because Yahoo might take that information from them?

  11. Taking Google as an example, I already see when searching that organic results are being pushed down the page to make more room for paid results. I see other moves that are designed to keep the searcher on the site too. From my industry, I am making note that Google now has property listings on their maps, so a searcher does not need to go to Zillow, Trulia, or their Realtor Association. Moreover, Google is bringing its mortgage app from the UK over to US searchers, causing users to stay on their site. I think in a way, we are already seeing this argument playing out with newspaper/magazine publishers arguing about newsfeeds on Google. With this approach, website owners may see nothing wrong at first, but then they will rail against it at some point. If a searcher from California found an answer on my site, but the search engine did not send them through, I might be fine with that, because I do not do business there. However, I would want a searcher from Houston to come through, in the hopes of making a conversion. Maybe some line of code in our .htaccess files, connected with a Creative Commons license and telling the search engine which action we would like them to take, would be in order.

  12. Hi Frank,

    Is there a comfortable balance or equilibrium that a search engine can reach between being a portal that tries to keep visitors on its own pages as long as possible, and shows them advertising, and being a search index that gets visitors to return on a regular basis by delivering the best results on other sites as effectively as possible?

    Would it make a difference if the services provided by a search engine are more innovative or useful than those found on sites offering property listings or mortgage applications?

    In viewing Google’s news listings, I’m not completely convinced that Google’s news feeds are bad for news sites – if a story that I see on Google News is interesting to me, I usually want to find out more, and the snippet that Google shows is more likely to have me click through to find out more. Does fair use cover those listings? I’m not sure that cutting Google off from showing those snippets is a wise move on behalf of newspapers and magazines.

    Having robots.txt evolve so that it can incorporate a machine readable copyright license like creative commons might be worth considering, and it would be something that would be worth seeing. But legally there has never been a need for defining copyright rights when it comes to fair use.

    When it comes to aggregation of data, fair use, search engines, APIs, and machine readable licenses, we are in a fast-changing and evolving area of law and property rights. That includes how much information a search engine can aggregate and display (even with attribution in the form of links), before the use transforms from fair use to copyright infringement. These are interesting times.

  13. In the light of News Corp’s announcement, I’ve been mulling this over.
    First, if the feed is interesting enough, I would click through, so I think that places an onus on site owners to work on their headlines and excerpts.
    Second, as for a search engine providing answers, maybe we can develop some rules that are similar to the expectation of a research paper? Common knowledge, like a birthplace, does not need to be cited. When clearly expressing someone else’s work, a more appropriate citation is in order.
    The fuzzy area for me comes in items that may be common knowledge, but maybe only for a specified group. For example, the smart grid is obtaining some attention now publicly. For those of us who have been following these developments since the eighties, we have a common knowledge of terms associated with this idea. If I write a post about a mid grid, which few sites have done, I am discussing a topic that is common knowledge, but not in the general pool. Should my post then receive a better citation?
    In the end, site owners have to follow the advice of not placing all of their marketing hopes with the search engines, as you have stated previously.

  14. Hi Frank,

    Thoughtful points. Thank you.

    I’m not sure that we’ve seen enough of this question answering approach from the search engines to get a handle on how to present this kind of information in a way that will be both engaging enough for people to want to visit the source, and plain enough that the search engines would want to use it as an answer. Maybe if we look more closely at the way that Google treats definitions, we might get some ideas from them (do some searches for define:*keyword phrase* to see examples).

    Facts, or common knowledge, by themselves should be outside of copyright, but the expression of those facts may remain within copyright protection. It might be safer in many instances to provide a citation as a source. When you talk about specific terms of art that may be well known within an industry, but not as well known to the mainstream population, it makes a lot of sense to provide a decent attribution or citation and a link.

    Many people do turn to search engines, but they aren’t the only way that people find sites. That’s part of what makes marketing on the Web interesting.
