If you search for news at Google News, you’ve probably noticed that you can view news articles by date or by relevance.
Many of the news articles that you find in Google News come from sources like wire services, which share the same story with many newspapers. Reporters have the option of adding information, but wire service articles at different papers often contain little more than the original material, and sometimes less.
So it’s possible that many articles are substantially the same, and if those are the most relevant results for a search at Google News, it’s likely that Google doesn’t want to show all of them in its results. Google likely prefers to show searchers novel content. How might it identify that novel content, and might that make a difference in how pages are ranked in search results?
How does Google decide which articles on the same subject to show in Google News, and how to rank those news articles?
You may sometimes see a shortlist of blog posts at the bottom of search results at Google in response to a particular query. If Google comes across a number of blog posts about very similar topics, and it wants to rank them by how much novel information each presents, how does it choose amongst them? How might it decide which to show along with web page results?
A newly granted patent from Google describes a method of finding novel content amongst documents, which may play a role in determining what results we see in cases like those:
Detecting novel document content
Invented by M. Bharath Kumar and Krishna Bharat
Assigned to Google
Filed March 20, 2006
A system determines an ordered sequence of documents and determines an amount of novel content contained in each document of the ordered sequence of documents. The system assigns a novelty score to each document based on the determined amount of novel content.
Determining whether documents contain novel content first requires that the search engine find enough similarities between news articles, blog posts, or other kinds of documents. It might do this by looking at different aspects of those documents, which the patent refers to as “information nuggets.” Information nuggets are small pieces of information that might be shared between different sources. Some examples might be:
1) Named entities – Do the articles mention specific people, places, or things? For example, news articles or blog posts that are published within a certain period of time that mention Walt Disney might be related.
2) Sequences of words in titles – News articles or blog posts that might share some sequences of words in their titles may also be related.
3) Numbers appearing in documents – documents about Mount Everest might include the height of the mountain or a number that is close. The patent notes that while one document might mention the mountain’s height as 29,000 feet, and others as 29,028 feet, the different values may be “determined to be equivalent information nuggets.”
There are possibly other information nuggets that could be compared from one document to another that aren’t covered in this patent, but the basic idea is to find enough similarities so that news articles, blog posts, or other kinds of documents can be clustered together. Once they are, the differences between the documents may then be considered to determine how novel each may be.
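To make the idea of information nuggets a little more concrete, here is a minimal sketch in Python. It is not the method from the patent. The function names (`extract_nuggets`, `nugget_overlap`), the regular expressions, and the choice to round numbers to the nearest thousand so that 29,000 and 29,028 count as the same nugget are all assumptions made for illustration.

```python
import re


def extract_nuggets(title, body):
    """Return a set of crude 'information nuggets' for one document.

    Everything here is a simplifying assumption for illustration.
    """
    nuggets = set()

    # Named entities, approximated as runs of two or more capitalized words.
    for entity in re.findall(r"(?:[A-Z][a-z]+ )+[A-Z][a-z]+", body):
        nuggets.add(("entity", entity))

    # Sequences of words in titles, approximated as lowercase word pairs.
    words = re.findall(r"[a-z0-9]+", title.lower())
    for pair in zip(words, words[1:]):
        nuggets.add(("title", pair))

    # Numbers, rounded to the nearest thousand so that close values
    # such as 29,000 and 29,028 are treated as equivalent (an assumption).
    for number in re.findall(r"\d[\d,]*", body):
        value = int(number.replace(",", ""))
        nuggets.add(("number", round(value, -3)))

    return nuggets


def nugget_overlap(a, b):
    """Jaccard overlap between two nugget sets, from 0.0 to 1.0."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


first = extract_nuggets(
    "Climbers reach Mount Everest summit",
    "The team climbed Mount Everest, which is 29,028 feet high.",
)
second = extract_nuggets(
    "Mount Everest summit reached",
    "Mount Everest stands about 29,000 feet tall.",
)
shared = first & second
```

Two documents whose overlap passes some threshold could be placed in the same cluster. Where to set that threshold is a tuning question this sketch doesn’t try to answer.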
Time also plays a role in this process: similar documents published more than a certain number of days later are not figured into a relevance determination for the rest of the documents, even if they share a substantial number of information nuggets.
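One way to picture how an ordered sequence of documents, a novelty score, and a time window might fit together is the sketch below. It is only one reading of the description, not Google’s implementation. The function name `novelty_scores`, the seven-day default window, and the scoring rule (the fraction of a document’s nuggets not seen in earlier documents inside the window) are assumptions. Each document is represented as a publication date plus a set of nuggets that has already been extracted.

```python
from datetime import date, timedelta


def novelty_scores(docs, window_days=7):
    """Score documents by the share of nuggets not seen earlier.

    docs is a list of (published_date, nugget_set) pairs. Scores are
    returned in chronological order. The window length and the scoring
    rule are illustrative assumptions.
    """
    ordered = sorted(docs, key=lambda doc: doc[0])
    window = timedelta(days=window_days)
    scores = []

    for i, (published, nuggets) in enumerate(ordered):
        # Collect nuggets from earlier documents inside the time window.
        seen = set()
        for earlier_date, earlier_nuggets in ordered[:i]:
            if published - earlier_date <= window:
                seen |= earlier_nuggets

        novel = nuggets - seen
        scores.append(len(novel) / len(nuggets) if nuggets else 0.0)

    return scores


docs = [
    (date(2006, 3, 1), {"a", "b"}),
    (date(2006, 3, 2), {"a", "b", "c"}),
    (date(2006, 3, 20), {"a", "b"}),
]
scores = novelty_scores(docs)
```

In this sketch the second document adds only one new nugget to the two it shares with the first, so it scores lower. The third document falls outside the window, so it isn’t compared with the earlier two at all.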
Note that this patent is co-invented by Krishna Bharat, who started Google News and has filed a number of patents that apply to how Google News results are ranked. Google News often ranks many news stories that are similar, so having novel content make a difference in ranking makes a lot of sense.