A new patent filing from Yahoo raises the question, “How much has social media influenced the expectations of searchers, and forced search engines to change?”
Before I can begin to even think about that, I have to ask if looking at Yahoo patents is even a good idea after their 2009 deal with Microsoft to have Bing power their search results.
The Yahoo patent application was filed after the agreement between Yahoo and Microsoft, and was published last week. Are Yahoo patents still worth spending time with? After reading through the Yahoo patent application about how the search engine might use information from social media platforms to discover recently hot topics and webpages that are relevant to those topics, I would say that they are. The terms of the agreement between Yahoo and Microsoft include a 10-year exclusive right for Microsoft to use search technologies developed by Yahoo, and don’t stop Yahoo from applying those technologies itself.
The patent filing explores “recency-sensitive” queries, where searchers are looking for resources that are both topically relevant and fresh, such as novel information about an earthquake. If you’ve been watching Twitter streams, Facebook updates, and other social media, you’ve seen that sometimes these sources are the best and fastest places on the Web to find that kind of information.
It’s possible that a search engine that ignores sources like those isn’t going to be able to return any relevant results for those types of queries – what the patent’s inventors call a “zero recall” problem.
Whether it’s Shaquille O’Neal first announcing his retirement on Twitter, or news of an earthquake traveling across the social network faster than its shockwaves might spread across the firmament, or another event that someone eyewitnessed and reported upon before the media had time to file a story to be edited and published, social media has increased the speed with which news travels across the globe. These time-sensitive stories are forcing the search engines to look to social media to find information that people are interested in hearing about, as close to the time they happen as possible.
The freshness of content found on the Web is influenced by the crawl policies imposed upon a web crawler. Googlebot, Yahoo’s Slurp, or bingbot may visit a particular page and then not return for a while, based upon their specific schedules for crawling a site. That means a search engine that relies upon crawling won’t always pick up breaking news as quickly as the sites reporting on it publish their stories.
And even if a search engine were able to capture such fresh information, the ranking signals that most search engines use tend to be based upon features that relate to “long-term popularity and usage that can be used for ranking such as in-link statistics, Web page rank, click-based statistics, or the like.”
This patent application introduces a crawler that monitors microblog data streams, which include things like tweets and status updates, to discover and index fresh content that might be used in response to recency-sensitive queries.
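To make the discovery step a little more concrete, here is a minimal sketch of how a stream monitor might pull links out of posts and set aside the ones a search engine hasn’t indexed yet. It is only an illustration under my own assumptions, not the method from the patent filing: the function name `discover_fresh_urls`, the post format, and the `indexed` set are all hypothetical.

```python
import re
from collections import Counter

# Simple pattern for links in post text (an assumption; real posts often
# use shortened URLs that would need to be resolved first).
URL_PATTERN = re.compile(r"https?://\S+")


def discover_fresh_urls(posts, indexed_urls):
    """Count mentions of URLs that are not yet in the (hypothetical) index."""
    counts = Counter()
    for post in posts:
        for url in URL_PATTERN.findall(post["text"]):
            url = url.rstrip(".,;:!?)")  # drop trailing punctuation
            if url not in indexed_urls:
                counts[url] += 1
    return counts


# Made-up sample posts standing in for a microblog data stream.
posts = [
    {"user": "alice", "text": "Big quake just hit! Details: http://example.com/quake-report"},
    {"user": "bob", "text": "Felt it too http://example.com/quake-report."},
    {"user": "carol", "text": "Old news: http://example.com/archive"},
]
indexed = {"http://example.com/archive"}

fresh = discover_fresh_urls(posts, indexed)
for url, mentions in fresh.most_common():
    print(url, mentions)
```

Counting mentions is one simple way to decide which newly discovered pages to fetch first. A real system would also need to handle duplicates, spam, and redirects.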
Last year, in the post What Makes a Good Seed Site for Search Engine Web Crawls?, I wrote about a Yahoo patent that described how they might decide which pages to use as “seed sites” to start webcrawls with to identify quality pages. There was a lot of discussion in the comments about the potential value of looking at social media sites as starting points for crawls. The biggest value in using those does appear to be in finding very up-to-date content, and using a data stream directly from those sources means less work in actually crawling those pages.
The patent filing is:
Ranking of Search Results based on Microblog data
Invented by Anlei Dong, Pranam Kolari, Ruiqiang Zhang, Jing Bai, Yi Chang, Zhaohui Zheng
Assigned to YAHOO! INC.
US Patent Application 20110246457
Published October 6, 2011
Filed: March 30, 2010
An information retrieval system is described herein that monitors a microblog data stream that includes microblog posts to discover and index fresh resources for searching by a search engine. The information retrieval system also uses data from the microblog data stream as well as data obtained from a microblog subscription system to compute novel and effective features for ranking fresh resources which would otherwise have impoverished representations.
An embodiment of the present invention advantageously enables a search engine to produce a fresher set of resources and to rank such resources for both relevancy and freshness in a more accurate manner.
The patent points at the following as reasons to look at microblogging information to help respond to recency-sensitive queries:
(1) Microblog posts are likely to contain URLs of important documents that have not yet been indexed by a search engine via conventional Web crawling;
(2) Documents linked to from a microblog post may be relevant to recency-sensitive queries;
(3) Text found in microblog posts can be used to expand the text used to find resources involved in such a query; and
(4) Other aspects of a social network can be used to rank search results.
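The third point is easier to picture with a small example. The sketch below is my own illustration, not code from the patent: it treats the words of every post that links to a page as extra text describing that page, so a brand new page with no anchor text or click history can still be matched to a query. The names `expanded_terms` and `match`, and the sample posts, are hypothetical.

```python
import re
from collections import defaultdict

URL_PATTERN = re.compile(r"https?://\S+")
WORD_PATTERN = re.compile(r"[a-z0-9]+")


def expanded_terms(posts):
    """Map each linked URL to the set of words from posts that mention it."""
    terms = defaultdict(set)
    for text in posts:
        urls = [u.rstrip(".,;:!?)") for u in URL_PATTERN.findall(text)]
        # Remove the links themselves so only the surrounding words remain.
        remaining = URL_PATTERN.sub(" ", text).lower()
        words = set(WORD_PATTERN.findall(remaining))
        for url in urls:
            terms[url] |= words
    return terms


def match(query, terms):
    """Return URLs whose post-derived text contains every query word."""
    wanted = set(WORD_PATTERN.findall(query.lower()))
    return sorted(url for url, words in terms.items() if wanted <= words)


# Made-up posts for illustration.
posts = [
    "Earthquake in Virginia felt in DC http://example.com/a",
    "Virginia earthquake photos http://example.com/b",
    "Lunch specials http://example.com/c",
]
terms = expanded_terms(posts)
results = match("virginia earthquake", terms)
print(results)
```

Requiring every query word to appear is a deliberately strict simplification. A ranking system would more likely weigh partial matches alongside other signals.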
There’s been some discussion on the Web about how search engines like Google might or might not use social signals like tweets to rank pages found through those types of resources. This patent filing points at how those resources might be discovered. Rather than presenting the tweets or status updates themselves to searchers, the process described in the patent aims to find pages through microblog posts and present those pages to searchers.
The patent does provide a fair amount of discussion about how those discovered URLs might be ranked as well, and describes using a number of the approaches that have been developed by Microsoft in ranking pages. This tie-in with Microsoft approaches is one of the things that leads me to the conclusion that Yahoo is working to develop ways to find and rank content that will fit in with Microsoft’s indexing of content.
Things like tweets or status updates might be ranked as sources of information based upon a combination of textual features related to them as well as social networking features associated with people who provided the posts.
Some of those social networking features may include the number of followers the poster has, the number of posts they have made, the average number of responses their posts receive, the number of people who shared or retweeted their posts, and others.
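As a rough illustration of how features like those could be combined, here is a minimal sketch that turns a poster’s counts into a single score. The weights, the log scaling, and the names `Poster`, `poster_features`, and `poster_score` are my own assumptions for the sake of the example. The patent filing lists the kinds of features, but this is not its formula.

```python
import math
from dataclasses import dataclass


@dataclass
class Poster:
    followers: int
    posts: int
    avg_replies: float
    reshares: int


def poster_features(poster):
    """Turn raw counts into features; log scaling dampens very large counts."""
    return {
        "log_followers": math.log1p(poster.followers),
        "log_posts": math.log1p(poster.posts),
        "avg_replies": poster.avg_replies,
        "log_reshares": math.log1p(poster.reshares),
    }


# Hypothetical weights chosen only for illustration; a real system would
# presumably learn these from data.
WEIGHTS = {
    "log_followers": 0.4,
    "log_posts": 0.1,
    "avg_replies": 0.2,
    "log_reshares": 0.3,
}


def poster_score(poster, weights=WEIGHTS):
    """Weighted sum of the poster's features."""
    features = poster_features(poster)
    return sum(weights[name] * features[name] for name in weights)


newsroom = Poster(followers=250_000, posts=12_000, avg_replies=8.5, reshares=40_000)
newcomer = Poster(followers=12, posts=30, avg_replies=0.1, reshares=1)
print(round(poster_score(newsroom), 2), round(poster_score(newcomer), 2))
```

A score like this could be one input among many when deciding how much trust to place in a page that a poster links to.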