If you search for news at Google News, you’ve probably noticed that you can view news articles by date or by relevance.
Many of the news articles that you find in Google News come from sources like wire services, where the same story is shared amongst many newspapers. Reporters can add information of their own, but wire service articles at different papers often contain little more than the original material, and sometimes less.
So there may be many articles that are substantially the same, and if those are the most relevant results for a search at Google News, it’s likely that Google doesn’t want to show all of those articles in its results.
How does Google decide which articles on the same subject to show in Google News, and how to rank those news articles?
You may sometimes see a short list of blog posts at the bottom of search results at Google in response to a particular query. If Google comes across a number of blog posts about very similar topics, and it wants to rank them so that the ones presenting the most novel information come first, how does it choose amongst them? How might it decide which to show along with web page results?
A newly granted patent from Google describes a method of finding unique content amongst documents, which may play a role in determining what results we see in cases like these:
Detecting novel document content
Invented by M. Bharath Kumar and Krishna Bharat
Assigned to Google
Filed March 20, 2006
A system determines an ordered sequence of documents and determines an amount of novel content contained in each document of the ordered sequence of documents. The system assigns a novelty score to each document based on the determined amount of novel content.
Determining whether or not documents are novel first requires that the search engine find enough similarities between news articles, blog posts, or other kinds of documents. It might do this by looking at some different aspects of those documents, which it refers to as “information nuggets.” Information nuggets are small pieces of information that might be shared between different sources. Some examples might be:
1) Named entities – Do the articles mention specific people, places, or things? For example, news articles or blog posts that are published within a certain period of time that mention Walt Disney might be related.
2) Sequences of words in titles – News articles or blog posts that might share some sequences of words in their titles may also be related.
3) Numbers appearing in documents – Documents about Mount Everest might include the height of the mountain, or a number that is close. The patent notes that while one document might give the mountain’s height as 29,000 feet, and others as 29,028 feet, the different values may be “determined to be equivalent information nuggets.”
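To make the title and number nuggets a little more concrete, here is a minimal Python sketch. It is not the patent's method: the function names, the three-word window for title sequences, and the 1% tolerance for treating two numbers as equivalent are all assumptions chosen for illustration. Named entity detection is left out, since it would need a proper language-processing tool.

```python
import re

def title_shingles(title, n=3):
    """Return the set of n-word sequences in a title, lowercased.
    The window size n=3 is an arbitrary choice for this sketch."""
    words = re.findall(r"[a-z0-9']+", title.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def numbers_in(text):
    """Extract numbers such as 29,028 or 8848.86 from text as floats."""
    matches = re.findall(r"\d+(?:,\d{3})*(?:\.\d+)?", text)
    return [float(m.replace(",", "")) for m in matches]

def equivalent_numbers(a, b, rel_tol=0.01):
    """Treat two numbers as the same nugget if they differ by at most
    rel_tol (an assumed 1%) of the larger value."""
    return abs(a - b) <= rel_tol * max(abs(a), abs(b))

height_a = numbers_in("Mount Everest rises 29,000 feet")[0]
height_b = numbers_in("The summit stands at 29,028 feet")[0]
print(equivalent_numbers(height_a, height_b))  # True
```

With a tolerance like this, 29,000 and 29,028 count as the same nugget, while a clearly different figure would not. A real system would presumably tune how close is close enough.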
There are possibly other information nuggets that could be compared from one document to another that aren’t covered in this patent, but the basic idea is to find enough similarities so that news articles or blog posts or other kinds of documents could be clustered together. Once they are, differences between the documents may then be considered to determine how novel each may be.
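As a rough illustration of scoring novelty across an ordered sequence of documents, the sketch below walks through a cluster in order and scores each document by the share of its nuggets that no earlier document contained. The scoring formula, the function name, and the sample data are assumptions made for this example. The patent abstract only says that a score is based on the amount of novel content.

```python
def novelty_scores(ordered_docs):
    """ordered_docs: list of (doc_id, set_of_nuggets) in sequence order.
    Returns {doc_id: fraction of its nuggets not seen in earlier documents}.
    The fraction-based score is an assumption, not the patent's formula."""
    seen = set()
    scores = {}
    for doc_id, nuggets in ordered_docs:
        new_nuggets = nuggets - seen
        scores[doc_id] = len(new_nuggets) / len(nuggets) if nuggets else 0.0
        seen |= nuggets  # later documents are compared against everything so far
    return scores

# Hypothetical cluster: a wire story, a trimmed reprint, and a follow-up.
docs = [
    ("wire", {"mount everest", "29028 feet", "edmund hillary"}),
    ("reprint", {"mount everest", "29028 feet"}),
    ("followup", {"mount everest", "29028 feet", "new route", "sherpa interview"}),
]
print(novelty_scores(docs))
# {'wire': 1.0, 'reprint': 0.0, 'followup': 0.5}
```

The reprint that adds nothing scores zero, while the follow-up that brings two new nuggets scores higher, which matches the intuition about wire stories described earlier.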
Time also plays a role in this process. Similar documents published more than a certain number of days later are not figured into the relevance determination for the rest of the documents, even if they share a substantial number of information nuggets.
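One possible reading of that time limit is a simple publication window measured from the earliest document in a cluster. The sketch below follows that reading. The seven-day window, the function name, and the sample dates are assumptions for illustration, and the patent may define the window differently.

```python
from datetime import date, timedelta

def within_window(docs, window_days=7):
    """docs: list of (doc_id, publication_date).
    Keep the ids of documents published within window_days of the earliest
    one, in date order. The 7-day default is an assumed value."""
    if not docs:
        return []
    start = min(published for _, published in docs)
    cutoff = start + timedelta(days=window_days)
    ordered = sorted(docs, key=lambda item: item[1])
    return [doc_id for doc_id, published in ordered if published <= cutoff]

# Hypothetical cluster: the last post shares nuggets but arrives weeks later.
cluster = [
    ("wire", date(2006, 3, 20)),
    ("reprint", date(2006, 3, 22)),
    ("late_post", date(2006, 4, 30)),
]
print(within_window(cluster))  # ['wire', 'reprint']
```

Documents that fall outside the window would simply be left out before any comparison of nuggets takes place.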