The Oracle at Yahoo: Using Yahoo News to Search the Future

Imagine exploring millions and millions of news pages and other documents to find information about events that are scheduled to happen in the future, to help predict the future.

The oracle Sibyl at Delphi

This kind of future search, or future retrieval, might be able to support the making of decisions in many different fields.

News coverage could be mined for details about possible future events, and those details could be made searchable to help people plan ahead.

The Yahoo patent application is:

Techniques for Searching Future Events
Invented by Ricardo Alberto Baeza-Yates
Assigned to Yahoo
US Patent Application 20080040321
Published February 14, 2008
Filed August 11, 2006

Under this process, time would become a standard part of information collected about documents. A ranking model would be built based on time segments.

Much of the news does contain information about future events. The authors of the patent application tell us:

An exemplary sample from a web-based news service on Dec. 1st, 2003, included more than one-hundred thousand references to years 2004 and beyond. About 80% of the references related to the immediate future (e.g., within days, weeks, or a few months) and, on average, more than one future reference was included per article.

We estimated that there were at least half a million references to future events in the sample. Assuming that there is a ten-fold repetition redundancy (i.e., similar articles in different newspapers), this yielded an estimate of about fifty thousand unique articles about the future. A similar analysis only on headlines gave around 10% of that number.
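The arithmetic behind that estimate can be checked in a few lines. This is only a sketch of the calculation as I read it: the variable names are mine, and I'm assuming that "10% of that number" refers to the fifty thousand unique articles.

```python
# Figures quoted from the patent application; variable names are illustrative.
future_references = 500_000   # "at least half a million references"
redundancy_factor = 10        # assumed ten-fold repetition across newspapers
headline_share = 0.10         # headlines gave "around 10% of that number" (my reading)

# Dividing out the repetition leaves the estimated count of unique articles.
unique_articles = future_references // redundancy_factor
headline_articles = round(unique_articles * headline_share)

print(unique_articles)    # 50000
print(headline_articles)  # 5000
```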

They also looked closely at future event information for a date in 2005:

In a sample taken from the same news service on Jul. 15th, 2005, the number of references to years 2006 or later was over 250 thousand. For example, for the year 2034, news items relating to the following topics were included in a sample of almost 100 news items:

(1) The license of nuclear electric plants in Arkansas and Michigan will end;

(2) The ownership of Dolphin Square in London must revert to an insurance company;

(3) Voyager 2 should run out of fuel;

(4) Long-term care facilities may have to house 2.1 million people in the USA; and

(5) A human base in the moon would be in operation.

So, when searching for “energy” or “health” in the future, a future retrieval system should return, for example, items 1 and 4, preferably classified by year. On the other hand, when searching for “2034” and “space,” the system should return items 3 and 5.

This kind of future search would include an information extraction system that would recognize expressions about time, dates, and durations, and the probabilities that certain events will happen.
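To make the extraction step a little more concrete, here is a minimal sketch of how future year references might be pulled out of a sentence. It is not the method described in the patent application: the year pattern, the `MODAL_CONFIDENCE` scores, and the function name are all assumptions of mine, and a real system would need to handle far richer date and duration expressions.

```python
import re
from datetime import date

# Matches four-digit years from 1900 to 2099 (a deliberately narrow pattern).
YEAR_PATTERN = re.compile(r"\b(19\d{2}|20\d{2})\b")

# Hypothetical mapping from modal wording to a rough confidence score.
MODAL_CONFIDENCE = {
    "will": 0.9,
    "must": 0.9,
    "should": 0.7,
    "would": 0.5,
    "may": 0.4,
    "might": 0.3,
}

def extract_future_references(text, published):
    """Return (year, confidence) pairs for years after the publication year."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    # Use the weakest modal verb present; default to 0.5 when none is found.
    scores = [MODAL_CONFIDENCE[w] for w in words if w in MODAL_CONFIDENCE]
    confidence = min(scores) if scores else 0.5
    years = sorted({int(y) for y in YEAR_PATTERN.findall(text)})
    return [(y, confidence) for y in years if y > published.year]

refs = extract_future_references(
    "Voyager 2 should run out of fuel in 2034.", date(2005, 7, 15)
)
print(refs)  # [(2034, 0.7)]
```

The word "should" in the Voyager 2 item is what lowers the score here, which mirrors the way the sample items hedge with "will," "must," "should," "may," and "would."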

It would also include an information retrieval system, so that people can search using text queries, and possibly specify time segments during their searches. So, if you search for the year 2034, you might find the most important topics or likely events or both, associated with that year.
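A toy version of that retrieval step, using the five sample items for 2034 from above, might look like the following. The topic tags and the `search` function are hypothetical and added only for illustration; the patent application describes a ranking model built on time segments, which this sketch does not attempt.

```python
from collections import defaultdict

# The five sample items for 2034, with hypothetical topic tags added for illustration.
ITEMS = {
    1: ("Nuclear plant licenses in Arkansas and Michigan end", 2034, {"energy"}),
    2: ("Dolphin Square ownership reverts to an insurance company", 2034, {"property"}),
    3: ("Voyager 2 runs out of fuel", 2034, {"space"}),
    4: ("Long-term care facilities house 2.1 million people", 2034, {"health"}),
    5: ("A human base on the moon is in operation", 2034, {"space"}),
}

def search(topics, year=None):
    """Return item ids grouped by year that match any topic, optionally limited to one year."""
    results = defaultdict(list)
    for item_id, (_title, item_year, tags) in sorted(ITEMS.items()):
        if year is not None and item_year != year:
            continue
        if tags & set(topics):
            results[item_year].append(item_id)
    return dict(results)

print(search(["energy", "health"]))   # {2034: [1, 4]}
print(search(["space"], year=2034))   # {2034: [3, 5]}
```

Grouping the results by year is one simple way to get the "classified by year" behavior the patent application asks for.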

In addition to providing information about possible futures that might be used to help support decision making in many fields, the same system could be turned backwards to look at past events, and perhaps understand them better.

Some related publications from the inventor listed on the patent application, Ricardo Alberto Baeza-Yates:

26 thoughts on “The Oracle at Yahoo: Using Yahoo News to Search the Future”

  1. A Yahoo Future Search sounds extremely interesting and, on some basic level, workable. I just hope it doesn’t degenerate into some sort of “Yahoo Tarot Card” or “Yahoo Palm Reader” scenario.

    However, the general idea of a future search engine for news and events could be very practical. The whole concept seems like something right out of two recent books I have read: Fooled By Randomness and The Black Swan – The Impact of the Highly Improbable, both by author Nassim Nicholas Teleb.

    At the very least, a future search like this could prove to be an interesting novelty that offers up some basic insights into future events.

  2. Thanks.

    I think the approach they are taking towards it is pretty interesting, and I’d love to see it in action. I hope it’s something that Yahoo is serious about developing and sharing with us.

    Thanks for the Nassim Nicholas Teleb references, too. I’ll be looking out for those.

  3. Sorry Bill,

    My bad on the author’s last name. It is Taleb, not Teleb.

    The Black Swan: The Impact of the Highly Improbable

    Fooled by Randomness: The Hidden Role of Chance in Life and in the Markets

    Obviously, from the titles, Taleb deals a lot with chance and randomness as well as the impact of Black Swans (wildly unpredictable statistical outliers) on the future.

    However, he totally debunks much of traditional statistics, economics and other academic disciplines for their false sense of security and their smug, holier-than-thou ivory tower tendencies to explain away things from the perspective of hindsight. Think: hindsight isn’t 20/20, it only appears to be. LOL.

    Both books are mind boggling and amazing, albeit somewhat long-winded and self-referential in parts. Still, both are as mind altering as The Wisdom of Crowds by James Surowiecki. Another great book, especially as it relates to our current collaborative web 2.0 culture.

    Thanks again for your unique blog posts.

  4. It reminds me of Google’s recently announced timeline-based search, but extended into the future. Interesting idea, would be curious to see it in use. Just want to note that results from the future will generally have much lower probability of happening than results from the past (which are usually at 100% probability barring mistakes and attempts to rewrite history). So the user’s perception of the “future” results should probably be guided. Thanks Bill for sharing this.

  5. Bill,

    Don’t pay for them. I am finished reading both of them and would probably just end up selling them back to Half Price Books for nothing. I will send them to the address on this blog. Consider it payback for all the great information you post.

    Thanks.

  6. Thanks for the links.

    I may run down the street to my local bookstore during lunch to see if they have either of them. They sound interesting. Much appreciated.

    You’re welcome.

  7. Ahh yes the force is strong with this one! Maybe I am understanding this wrong…

    But anything algorithmic can be manipulated. So if that is the case, can I go see the Yahoo oracle and get my future changed?

    I am going to have to check into this further. Great find. Quick question for you though. Do you just hang out around the US patent office waiting for information to come in like this? I have seen a lot of really fresh SE patent info coming across lately.

  8. @ Dmitri, I was reminded a little of Google’s timeline based search, too. I’d guess that there are some similarities.

    I like that they gave us some specific examples of the kind of future looking information in the patent application.

  9. @ Chris, Thanks. I guess that predictions work like that. Hopefully, a future search like this one will give you more information to act upon than you had access to before. Not so much changing your future as providing a useful decision making tool.

    I mostly get my information from the online US Patent and Trademark Office database, though sometimes I’ll find something interesting on the WIPO site. No hanging around government offices here. :)

  10. I think that this is more important than you have picked up on.

    One of the best IR based ideas I have seen come from Yahoo rather than Google is to basically design your index so it spans the searcher’s query space better than the underlying document space. This changes in time, so with the ability to predict what topics are going to be hot in the next month comes the ability to build your index so it targets the areas that users are going to be searching in the next month.

    This can not only improve precision of search, but it can improve performance and the cost of running the search service – when you hash by this data you can improve performance significantly (and thus power) by predicting the data you are going to be hashing better.

    Tim Wintle

  11. Hi Tim,

    There is a lot of value to this process in many ways, but I think that you’ve expressed one of the most important aspects of it very well – being able to understand and perhaps focus upon events that will happen in the future is something that can be very powerful for a search engine, and for an information portal (we can’t ignore that about Yahoo either).

  12. However, he totally debunks much of traditional statistics, economics and other academic disciplines for their false sense of security and their smug, holier-than-thou ivory tower tendencies to explain away things from the perspective of hindsight. Think: hindsight isn’t 20/20, it only appears to be. LOL.

  13. Wow! Love it! Trying to predict the future trends by analysing what is currently being said about it? Did I get this right? More or less….soooo, where does this leave us?

    Great awareness can be created via this medium of really serious, relevant potential vectors or outcomes….all based on what people are talking about right now. Interesting concept that can also be used to reinforce certain social programming, and suppress other non-desirable inputs, especially in a more censored web 2 environment. Am very interested to see where this is going!

  14. @Bill: Good point, I think I have a tendency to focus on search and IR, and completely forget about the directory aspect to Yahoo! (I only go there every six months or so to check my rankings). It’s certainly a revenue-maker on several fronts.

    @lory au: As someone who spends most of my time around an “ivory tower” or doing research of some kind, I’d like to say that statisticians and (research) economists have always been completely aware of the nature of hindsight. A friend of mine did his undergraduate project on showing that the best model available for simulating the stock market is statistically no better than taking a random guess (he’s now working in the stock market).

    The problem normally comes with the people who the statistics get passed onto – mathematicians are aware of what figures mean (and what they don’t mean – which is always far more than what they do), but the people who then view the data, e.g. reporters (and increasingly the general public with more websites giving statistics on demand), don’t always know enough mathematics to know what you can’t take from the results.

    Take the Google “result 1-10 of about 100,000” result:

    A computer scientist would know that that is *probably* correct to the order of magnitude +- 1 (so 10,000 – 1,000,000), and that the number shows the number of publicly crawlable pages that Google thinks are in the ? [m/b]illion pages most worth indexing (not necessarily best), with some duplicates, and for this month. They also know that is all that you know about the data.

    Newspapers often run stories that show they think this means there are almost exactly 100,000 webpages about your query online. They also seem to think that these numbers can be directly compared, running articles like “more cat lovers than dog lovers online”. This type of article is obviously flawed on so many levels once you start to think about it.

    Similarly, when creating statistical models, researchers will normally say the past data fits their model with …% accuracy, and this model shows interesting behaviour in the future. They generally focus on this future behaviour because it is the point where the model will be tested for accuracy the most. It’s very easy for reporters to run off with this interesting behaviour and say “Scientists predict that in the future … “. Unfortunately there is little reason for scientists to discourage this, since news coverage helps to increase funding and interest in the research.

  15. Hi Bill, interesting post.

    Yahoo’s 2034 example sounds great now, but one thing I wonder is how useful the feature will become as the date gets closer, and the data surrounding the year grows exponentially?

    Instead of the 100-odd stories that the query returned in 2005, the same query in 2030 might return many hundreds of thousands. The search engine’s ability to associate date with the items it’s finding will be no less useful, but the user would still have to refine their search much further.

    On a tangentially related note, not sure if you’ve seen Hubdub, which seems to be trying community-based news forecasting. Got covered in the Guardian tech blog

  16. @ lory au, I’m looking forward to digging into those books, which just arrived yesterday.

    @ Jacques Snyman, It’s an exciting idea, without a doubt. With more relevant information taken from the news, can we predict aspects of the future, or are we just capable of making better decisions armed with better sources of information?

    @ Tim Wintle, One of the difficulties about data is that it needs to be interpreted, and that it can often be interpreted in ways that were completely unintended. We see what you are talking about often with public polls, too. What people rarely ask when they see poll results are questions like, “who paid for this poll” and “what reason did they have to commission its creation?”

    @ Simon, information overload can make the use of a predictive tool difficult. The patent application does indicate that information will be searchable and sortable like normal search results, with results shown in response to search queries. I hadn’t seen Hubdub – looks interesting. May be spending a little time over there figuring out what might be going on in the future. Thanks.

  17. Hey Bill,
    Can’t believe I missed this posting… Seems to me the coming expansion of access to the “deep” or “semantic” web data would support this type of predictive search in a big way. For example, having improved ability to correlate data from multiple databases such as just-published research papers and time-to-market industrial data could allow predictions of new drugs or products.

    Of course, we all know that change often happens disruptively or algorithmically, rather than on a smooth, incremental path (making predictions difficult or impossible). And the human factor (interest, will, political clout, funding, etc.) has a huge impact on whether things that are possible become actual.

    But to me predictive search seems like a natural outgrowth. I wouldn’t be surprised to find it gets tied in with 3D modeling already used for govt. and military applications, especially for close-term predictions…

    Thanks for this, made my day!
    –Sandra

  18. You’re welcome, Sandra.

    I agree. Using the news is one step, but expanding to information that might be found on the deep Web may open up a lot of additional possibilities.

    An approach like this one may bring some interesting things to us in the future. I’m hoping that a group like the Nature Conservancy can make as much use of it (or more) as industry.

  19. Agreed, Bill. There are many very worthwhile causes that could benefit from advance knowledge. Hopefully this type of functionality will become commonly available.

  20. Bill, ran across this related article today, about ‘surprise modeling’ and Microsoft’s SmartPhlow application, used to predict traffic patterns:

    http://www.technologyreview.com/read_article.aspx?ch=specialsections&sc=emerging08&id=20243&a=

    …”The question is how wide a range of human activities can be modeled this way. While the algorithms used in SmartPhlow are, of necessity, domain specific, Horvitz is convinced that the overall approach could be generalized to many other areas.”

  21. Isn’t that what Bing declared it was? A Decision Engine? Or was “decision engine” just a catch phrase when Microsoft launched Bing.

    I wonder where Yahoo will get trusted, reliable, investigative news stories from. As it is it would seem 80% of all news content publishers will lock up their content behind paywalls.

    How will the search engine separate facts from fiction in stories that refer to the future? Can it tell gossip and rumours from truth? The Internet is a place where gossip and rumours run amok and spread like wildfire, so I wonder how relevant the search results from this search engine would be?
