If you search for news at Google News, you’ve probably noticed that you can view news articles by date or by relevance.
Many of the news articles that you find in Google News come from sources like wire services, where the same information is shared among many newspapers. Reporters have the option of adding material of their own, but wire service articles at different papers often contain little more than the original story, and sometimes less.
So there may be many articles that are substantially the same, and if those are the most relevant results for a search at Google News, Google probably doesn’t want to show all of them. Google likely prefers to show searchers novel content. How might it identify that novel content, and might it make a difference in how pages are ranked in search results?
How does Google decide which articles on the same subject to show in Google News, and how to rank those news articles?
You may sometimes see a short list of blog posts at the bottom of Google’s search results in response to a particular query. If Google comes across a number of blog posts about very similar topics, and it wants to favor the ones that present the most novel information, how does it choose among them? How might it decide which to show along with web page results?
A newly granted patent from Google describes a method of finding novel content among documents, which may play a role in determining which results we see in instances like those:
Detecting novel document content
Invented by M. Bharath Kumar and Krishna Bharat
Assigned to Google
Filed March 20, 2006
Abstract
A system determines an ordered sequence of documents and determines an amount of novel content contained in each document of the ordered sequence of documents. The system assigns a novelty score to each document based on the determined amount of novel content.
The process of determining whether documents contain novel content first requires that the search engine find enough similarities between news articles, blog posts, or other kinds of documents. It might do this by looking at a few different aspects of those documents, which the patent refers to as “information nuggets.” Information nuggets are small pieces of information that might be shared between different sources (a rough extraction sketch follows the list below). Some examples might be:
1) Named entities – Do the articles mention specific people, places, or things? For example, news articles or blog posts published within a certain period of time that mention Walt Disney might be related.
2) Sequences of words in titles – News articles or blog posts that might share some sequences of words in their titles may also be related.
3) Numbers appearing in documents – Documents about Mount Everest might include the height of the mountain, or a number that is close. The patent notes that while one document might give the mountain’s height as 29,000 feet and others as 29,028 feet, the different values may be “determined to be equivalent information nuggets.”
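Here is a minimal sketch of what extracting those three kinds of nuggets could look like in practice. The function names, the crude patterns, and the one percent numeric tolerance are all my own assumptions for illustration; the patent doesn’t spell out an implementation.

```python
import re

def extract_nuggets(title, body):
    """Reduce a document to a set of small, comparable pieces of information."""
    nuggets = set()

    # 1) Named entities, crudely approximated as capitalized multi-word
    #    phrases such as "Walt Disney" or "Mount Everest".
    for match in re.finditer(r"(?:[A-Z][a-z]+ )+[A-Z][a-z]+", body):
        nuggets.add(("entity", match.group()))

    # 2) Sequences of words from the title (here, word trigrams).
    words = title.lower().split()
    for i in range(len(words) - 2):
        nuggets.add(("title_seq", " ".join(words[i:i + 3])))

    # 3) Numbers appearing in the document, with commas stripped.
    for match in re.finditer(r"\d[\d,]*", body):
        nuggets.add(("number", float(match.group().replace(",", ""))))

    return nuggets

def numbers_equivalent(a, b, tolerance=0.01):
    """Treat nearby values (29,000 ft vs. 29,028 ft) as equivalent nuggets.
    The patent only says close values 'may be determined to be equivalent';
    the relative tolerance here is a guess."""
    return abs(a - b) <= tolerance * max(abs(a), abs(b), 1.0)
```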
There may be other information nuggets that could be compared from one document to another that aren’t covered in this patent, but the basic idea is to find enough similarities that news articles, blog posts, or other kinds of documents can be clustered together. Once they are, differences between the documents may then be considered to determine how novel each may be.
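A toy similarity test for that clustering step, under the same assumptions as the sketch above, might look like this. The Jaccard overlap measure and the 0.5 threshold are stand-ins of my own; the patent only asks for “enough similarities” without naming a measure.

```python
def similar_enough(nuggets_a, nuggets_b, threshold=0.5):
    """True when two documents share a large fraction of their nuggets.
    A fuller version would also match numeric nuggets through
    numbers_equivalent() rather than exact set intersection."""
    union = nuggets_a | nuggets_b
    if not union:
        return False
    return len(nuggets_a & nuggets_b) / len(union) >= threshold
```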
Time also plays a role in this process: similar documents published more than a certain number of days later are not figured into the relevance determination for the rest of the documents, even if they share a substantial number of information nuggets.
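Putting the abstract and that time window together, the novelty scoring might look something like the following. The three-day cutoff and the scoring formula are assumptions of mine; the patent says only that an amount of novel content is determined for each document in an ordered sequence.

```python
from datetime import timedelta

# An assumed window; the patent says only "a certain number of days".
MAX_AGE = timedelta(days=3)

def novelty_scores(documents):
    """documents: a list of (published_datetime, nugget_set) pairs for one
    cluster, ordered oldest first. Returns a novelty score per document,
    or None for late arrivals that aren't figured into the determination."""
    scores = []
    seen = set()
    first_published = documents[0][0] if documents else None

    for published, nuggets in documents:
        if published - first_published > MAX_AGE:
            scores.append(None)  # too late to count for this cluster
            continue
        novel = nuggets - seen  # nuggets no earlier document contained
        scores.append(len(novel) / max(len(nuggets), 1))
        seen |= nuggets
    return scores
```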
Note that this patent was co-invented by Krishna Bharat, who started Google News and has filed a number of patents that apply to how Google News results are ranked. Google News often ranks many news stories that are similar, so having novel content be something that makes a difference in ranking makes a lot of sense.
Notice the use of the words “might” and “possibly”…
They must be using some sort of Mad Libs technique… but we can only wonder. Also, I guess they can keep dates of when content was discovered by the spider, so if two articles set off the radar for having strong content similarities, maybe the one that was crawled first scores heavier?
I think of SEO as a kind of Turing test – like in Blade Runner – only the replicants are web pages, not humanoid robots….
Sooner or later someone may evolve content spinning algos that create content indistinguishable from human made pages…
I think Google uses a “lottery” algorithm which randomizes results pages a bit and keeps anyone from actually figuring out their algos.
Interesting, it sounds like they’re thinking of almost reducing each page to a set of Mad Libs blanks and comparing the blanks (template to template) and what’s in the blanks (similarity) to check for “mostly duplicate” content.
If they do decide to implement an algorithm like this, it’ll be interesting to see how they determine the “best” article to show. Cross-referencing with other trusted sources?
And separately, if they take it too far, how would a smaller news source that might have more accurate hyperlocal information but still use 80% of the copy from an AP article compare to a more popular source that only has generic information but immense popularity?
Google does an awful job of judging editorial quality, and while 29,000 and 29,028 are answers to the same Mad Libs blank, they aren’t equivalent information nuggets; one is clearly more precise than the other.
Hi Steven,
It’s possible that they are using something like this in Google News now, though the techniques they use may differ from what was originally described in the patent, after refining it over time.
They do limit the sites that they show results from in Google News, so perhaps they feel that many of them are already “Trusted” resources. I’m not sure how much “popularity” plays a role in this process, but that’s a good question about a site that provides great local information mixed with a wire source information compared to a more popular site with more generic information.
The numeric information nugget is probably something to be watched carefully, but it’s possible that an article that estimates something like the height of a mountain could be a better article than one that copies the information exactly. The patent made me want to spend a fair amount of time comparing Google News results, to see what was listed on a “relevance” search, and what was hidden as a “duplicate.”
Hi Alex,
I have seen blog entries and news entries make it into Google’s index very quickly – we can’t ignore that the search engine is receiving feeds from sources as well as crawling sites. It’s possible that the site that showed information on a topic first might be given some kind of preference, but we don’t know that with any certainty.
It’s possible that content creating programs may be able to replicate human created pages – which might be a good reason to look at information outside of those pages as well, such as reactions to those pages in the form of links, visits, bookmarking, etc.
I’m not sure that a search engine needs to use some kind of lottery feature to protect algorithms – I think it’s possible that they change and evolve frequently enough so that may not be necessary. But it could be something that they do.
Very interesting idea. Perhaps session data could be used to determine “novelty”. I personally love to digest every bit of NBA news I can (hoops fanatic), and I often find myself reading duplicated content. So, when I run across a duplicate article – or a section of an article with duplicate content – I usually close that tab and move onto the next one.
Of course, this would require quite a bit of monitoring, but it might explain the apparent “lottery” that Google seems to use now to rank identical news articles and decide which to place first.
Yes, duplicate content, along with spam, is the polluting white noise of the internet. The search engine that deals with it effectively will have a real edge in the future of search. Obviously, Google understands this.
Hi Jason,
Good points. I think it’s smart to consider how Google might be incorporating user data into other algorithms that they might use. Session data involved in results viewed for a particular query, amount of time spent at each result, distance scrolled down a page at one of those results, and other user search and browsing data could definitely play a role in determining which results to show also.
Hi People Finder,
It is frustrating to perform a search, and see the same content at different locations on multiple pages. I do think that Google does a decent job of keeping this from happening too often.
Some duplicated content, like that found at news sources drawing in part or in full from wire services, exists for fairly legitimate reasons (as opposed to copyright infringement, for example). But that still doesn’t mean that I want to see the same story over and over again when searching for news. The approach outlined in this patent seems pretty reasonable.
>> They do limit the sites that they show results from in Google News, so perhaps they feel that many of them are already “Trusted” resources.
>> receiving feeds from sources
Psstt… looks like the formatting got messed up on the comment I just did.
Hi Marcia,
I appreciate your stopping by to comment, but I’m not quite sure what point you were trying to make, or question you were asking, with your comment.
Bill, I was trying to quote previous comments and comment on a few points, but messed up the formatting, so what I was trying to write didn’t get through.
About trust, there are sites getting through occasionally that shouldn’t be there at all (blogs, MFA sites); it looks like they’ll have to turn up the knobs on authority factors to prevent Google News from getting gamed.
Example of a page title for a MFA blog post:
Headline for Hot News Topic | Big Money Adsense Keywords
The story is slightly rewritten, and the last paragraph will segue into a call to action, using an anchor text link to the actual underlying money-making topic of the site.
Thanks for following up, Marcia.
Sorry about the formatting problems.
There do seem to be sites that are getting through the Google News screening process that shouldn’t be there.
As you point out, there are sites rewriting the news, for the sole purpose of adding a call to action and a link to a commercial site. I hope that Google figures out how to filter those out of the news they show.