Are large news agencies, with a wide scope of international coverage on multiple topics, with large numbers of reporters, and finely edited articles, better news sources than smaller and more local papers, or narrow niche blogs?
A patent on ranking articles in Google News was granted this week that was originally filed in 2003. It discusses several ranking factors that it might use to present news articles based upon the “quality” of the news sources involved.
What is very interesting about it is that it provides some insight into the assumptions behind those ranking factors. I suspect that Google may have changed its stance on some of the assumptions behind those factors since then.
The patent doesn’t include a full range of signals that Google probably considers in ranking news stories, such as the freshness of the news (as noted in Google’s patent filing on Universal Search), or whether or not a certain source is the original.
As an aside, a fairly technical but interesting paper from Google researchers on finding the origins of content from news articles, blog posts, or web pages in real time, or very near real time, is Detecting the Origin of Text Segments Efficiently (pdf).
The premise behind developing quality signals for news articles is established early on in the patent:
For example, suppose a person wishes to obtain the latest news regarding a particular topic via the Internet. The person accesses a website that includes a conventional search engine. The person enters one or more terms relating to the topic of interest, such as “Iraq,” into the search engine to attempt to locate a news source that has published an article relating to the topic.
Using a search engine in this manner to locate individual websites that provide news articles relating to the desired topic often results in a ranked list of hundreds or even thousands of “hits,” where each hit may correspond to a web page that relates to the search term(s).
While each of the hits in the ranked list may relate to the desired topic, the news sources associated with these hits, however, may not be of uniform quality.
For example, CNN and BBC are widely regarded as high-quality sources of the accuracy of reporting, professionalism in writing, etc., while local news sources, such as hometown news sources, may be of lower quality.
Therefore, there exists a need for systems and methods for improving the ranking of news articles based on the quality of the news source with which the articles are associated.
I question the assumption that sources such as CNN or the BBC are, in many instances, better sources of quality information than hometown news sources. A local reporter at a hometown news source can often provide details, insights, and information that a larger organization may miss. It is worth looking at the signals that are listed in the patent, though.
The patent is:
Systems and methods for improving the ranking of news articles
Invented by Michael Curtiss, Krishna Bharat, and Michael Schmitt
Assigned to Google
US Patent 7,577,655
Granted August 18, 2009
Filed September 16, 2003
Abstract
A system ranks results. The system may receive a list of links. The system may identify a source with which each of the links is associated and rank the list of links based at least in part on the quality of the identified sources.
Source Rank
At the heart of the patent is a method of ranking sources for articles that may be on the same topic, to present those articles in order (or determine which might be shown on the front page of Google News, or in a Google News search result).
The process of coming up with a source rank score for a news source is based upon looking at several metrics for each news source, which measure different attributes of the source.
Here are those metrics:
Number of articles produced by the news source during a given time period
Presumably, the more articles (non-duplicate articles) produced by the source over some time, the better. We’re told that as an alternative, the search engine might consider the number of original sentences published by the news source during that time.
Average length of an article from the news source
It could be measured in words or sentences. If CNN’s articles average 300 words, while a local source averages 150 words per article, CNN might be given a value of 300 for this metric while the local source might be given a value of 150.
Are longer articles better? If a search engine were to look at CNN’s top 100 news stories from the past week, and the top 100 news stories from another source, and compare the length of those, should the source with the longest articles be considered higher quality? If the search engine instead clustered together all articles on a specific story and looked at the length of those, would the longest again be the higher quality story? This metric appears to indicate that it is a signal to consider.
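To make the comparison concrete, here is a minimal Python sketch of how an average-length metric could be computed. The function name and the input layout (a dictionary mapping each source to a list of article texts) are my own assumptions for illustration; the patent does not spell out an implementation.

```python
def average_article_length(articles_by_source):
    """Return the mean article length, in words, for each source.

    articles_by_source is a hypothetical layout: {source name: [article text, ...]}.
    """
    averages = {}
    for source, articles in articles_by_source.items():
        if not articles:
            continue  # Skip sources with no articles to avoid dividing by zero.
        # Whitespace splitting is a rough stand-in for a real word tokenizer.
        word_counts = [len(text.split()) for text in articles]
        averages[source] = sum(word_counts) / len(word_counts)
    return averages
```

Counting sentences instead of words would only change the line that builds word_counts.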
Breaking news score
How soon after an important event happens does the news source publish a story about it? If all of the stories about that event were clustered together, and the publication dates and times were viewed, the sources that responded quickest would have a higher “breaking news score.”
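One way to picture this metric is the rough Python sketch below. It assumes articles have already been clustered by story, and it uses a made-up scoring rule (the first source in a cluster earns 1.0, later sources earn proportionally less). The function name, data layout, and scoring rule are all my assumptions, not details from the patent.

```python
from collections import defaultdict
from datetime import datetime, timedelta

def breaking_news_scores(clusters):
    """Sum a per-cluster speed score for each source.

    clusters is a hypothetical list of story clusters, each a list of
    (source, published_at) tuples.
    """
    totals = defaultdict(float)
    for cluster in clusters:
        # Keep only the earliest publication time per source in this cluster.
        first_seen = {}
        for source, published_at in cluster:
            if source not in first_seen or published_at < first_seen[source]:
                first_seen[source] = published_at
        # Order sources from fastest to slowest.
        ordered = sorted(first_seen, key=first_seen.get)
        n = len(ordered)
        for position, source in enumerate(ordered):
            # Assumed rule: the first source gets 1.0, the last gets 1/n.
            totals[source] += (n - position) / n
    return dict(totals)

# Example with made-up timestamps: "wire" publishes first in both clusters.
t0 = datetime(2009, 8, 18, 12, 0)
example = [
    [("wire", t0), ("local", t0 + timedelta(minutes=30))],
    [("local", t0 + timedelta(hours=2)), ("wire", t0 + timedelta(hours=1))],
]
print(breaking_news_scores(example))
```

In this example "wire" ends up with the higher total because it was first to publish in both clusters.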
Usage pattern
If the search engine were to track how many people followed links to particular news sources when they were presented with links to those sources, which sources did people tend to visit more? This doesn’t measure the “popularity” of news sources as much as it does whether or not people follow links to particular sources when they see those links in search results.
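A simple way to express that idea is a click-through rate per source: of the times a link to a source was shown, how often was it followed? The sketch below is only an illustration; the function name and the two input lists (one entry per impression, one entry per click) are assumptions of mine.

```python
from collections import Counter

def click_through_rates(impressions, clicks):
    """Return clicks divided by impressions for every source that was shown.

    impressions and clicks are hypothetical logs: lists of source names,
    one entry each time a link was displayed or followed.
    """
    shown = Counter(impressions)
    clicked = Counter(clicks)
    # A Counter returns 0 for sources that were shown but never clicked.
    return {source: clicked[source] / shown[source] for source in shown}
```

Because the rate is relative to how often a source was shown, a small source that is rarely displayed but often clicked can score higher than a large source that is displayed everywhere.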
Human opinion of the news source
People who use the search engine may be polled to identify news sources that they enjoy reading or have visited. Other measures may be used as well. For instance, we are told that newspapers can be compared based at least in part on the number of Pulitzer prizes the papers have won. We’re also told that the age of a news source “may be taken as a measure of confidence by the public.” As another alternative, evaluators might be shown a selection of articles from different sources, and be asked to assign a score for their sources.
Circulation statistics of the news source
The circulation statistics of print publications associated with a source, agency usage statistics “such as Media Metrix and Nielsen Netratings,” and other possible ways of measuring traffic to a source might be considered.
The size of the staff associated with the news source
The number of distinct journalist names from articles in the news source might be viewed.
The number of news bureaus associated with the news source
This seems to favor larger and more established news agencies.
Original named entities appearing in articles produced by the news source
A named entity is a specific person, place, organization, or thing.
If all the stories about a particular event were clustered together, and one included mentions of named entities that other articles on the same topic don’t include, it might rank higher than others. This metric is supposed to indicate that news sources are “capable of original reporting.” There are some limitations to using this approach. For example, the publication dates of the articles might be considered to see which article included which named entity when. Variations in spelling and abbreviation might also be examined when determining whether the named entities in articles are unique.
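Here is a rough Python sketch of that idea. It assumes the named entities have already been extracted from each article, credits each entity to whichever source mentioned it first within a story cluster, and uses a very crude normalization step to fold together spelling variants such as "F.B.I." and "FBI". The function names, data layout, and normalization rule are my own assumptions rather than anything specified in the patent.

```python
from collections import Counter

def normalize(entity):
    """Crude normalization: lowercase, drop periods, collapse whitespace."""
    return " ".join(entity.lower().replace(".", "").split())

def original_entity_counts(cluster):
    """Count, per source, the entities it mentioned first in one story cluster.

    cluster is a hypothetical list of (source, published_at, entities) tuples.
    """
    first_mention = {}
    # Walk the articles from earliest to latest publication time.
    for source, _published_at, entities in sorted(cluster, key=lambda article: article[1]):
        for entity in entities:
            # setdefault keeps the earliest source and ignores later mentions.
            first_mention.setdefault(normalize(entity), source)
    return dict(Counter(first_mention.values()))
```

A source that only repeats entities already mentioned by earlier articles gets no credit at all under this rule, which is the point of the metric.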
Number of topics on which the source produces content
Articles from news sources might be categorized into different topics, and the range of those topics might be considered as an indication of the breadth of that source. This seems to favor more general sources over ones focused upon a narrower niche. A more focused source may have higher quality articles about the topics that it specializes in.
International diversity of the news source
This looks at the number of countries from which the news site receives traffic on the Web. The search engine might look at something like the IP addresses of people who click through links to the sources, to see how spread out their audience might be across the globe.
The writing style used by the news source
The search engine might use automated tests to measure spelling, grammar, and reading levels for a news source.
Other signals might also be considered, such as the number of links that might be seen pointing to the news website.
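To show how metrics like these might be rolled up into a single source rank, here is a minimal sketch that uses a weighted sum and then orders articles by the score of their source. The weighted sum, the metric names, the weights, and the assumption that each metric has already been scaled to a comparable range are all mine, for illustration only.

```python
def source_rank(metrics, weights):
    """Combine metric values into one score with a weighted sum.

    metrics and weights are hypothetical dictionaries keyed by metric name.
    Metrics without a weight contribute nothing to the score.
    """
    return sum(weights.get(name, 0.0) * value for name, value in metrics.items())

def rank_articles(articles, ranks):
    """Order (title, source) pairs by the source rank of each article's source."""
    # Sources with no known rank are treated as 0.0 and sink to the bottom.
    return sorted(articles, key=lambda article: ranks.get(article[1], 0.0), reverse=True)
```

Changing the weights is where the assumptions discussed in this post would show up: a heavy weight on staff size or news bureaus favors large agencies, while a heavy weight on original named entities could favor a small source that does its own reporting.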
Conclusion
There have been a few other patent filings from Google about Google News, but none of them have gone into the kind of specific detail on signals that the search engine might look at in ranking sources and articles like this one has.
While this was filed almost 6 years ago, it does provide details for an algorithmic approach to assigning scores for news sources that could be used to rank news articles in Google News, and many of the assumptions behind specific factors in that algorithm. It’s possible that some version of this algorithm is still in use today, and a number of the ranking factors involved may also be in use.
I do question some of the assumptions that are made.
For instance, if a breaking story came out about a discovery in physics, and a reputable and well-respected physics news site published an insightful and detailed article on the discovery, it could be a better source for the topic than a news source that wrote about the discovery first, has many more reporters and much wider circulation, reaches a much more international audience, has many news bureaus, and has been publishing since the 1800s, but whose article was written by someone who doesn’t know much about physics at all.
If you were interested in that discovery, which story would you rather see?
Bill thanks so much I glanced over that a few times. Saw some names that were familiar. It did look like a framework that could work in the current “social” landscape. Agree to some extent on the importance of the supporting article… but… it always comes down to different strokes for… which is always what we deal with.
FWIW, this is “news,” so freshness is an important signal, as seen in the Michael Jackson story and the Iraq elections. SE’s used to own that… now… not so much. IMO, SE’s see the social signal as their “vulnerability” right now. I mean, it is the Internet… trend or fad is always a concern for an internet technology company.
Bill, this is a great post.
I think this information reveals an even greater flaw in Google News: publications that break stories are often not ranked. Instead, URLs like AP or CNN with a higher domain authority are able to leverage the hard work and creative journalism of small publications into huge traffic scores.
I think this really outlines one of the major issues of the notion of domain authority. The link rich will get richer, and the small contributors will get choked off. In the end, this will be astonishingly counterproductive to Google’s original goal, which is to reward and serve unique content.
I have a number of ideas that IMO would fix this problem, but I might save it for my own blog :)
Hi Terry,
You’re welcome. Absolutely. Freshness is a key element of news – the question, though, is which source you point people towards once you’ve identified an important event or newsworthy topic. And how do you do it as quickly as possible, especially on your news page? As they say in the article on Detecting the Origin of Text Segments Efficiently:
The patent focuses upon using signals based upon the “quality” of news sources, but we’ve come a long way in a short period of time since it was filed. And I think people have greater expectations now of news and search. I think the quality measures are still very much real concerns for Google, but what’s a search engine to do if a very newsworthy story is broken on Twitter? :)
Excellent synopsis. I would have to agree, I assume that the quality and authority of the news provider would be designated as more relevant than a small local news station, because whenever I search for popular topics (politics, world news, etc) I often see news results from major agencies… I am curious if they evaluate user experience and time on the news website to find out which news providers offer the best content.
Hi Bill,
Isn’t this “news patent” about Google’s inability to control/own online rich keyword domain names/phrases and companion content – “the narrow niche?” Google is an aggregator of content and in order to maintain its “authority status” with the masses and premium advertisers, it has to aggregate real-time quality information – a news channel – from “authority/branded websites.”
When I search – for any keyword term/phrase – I always hit “news” – and never search “the web” – mainly because “the web” does not have the up-to-date, relevant information that I am seeking. I am seeking “quality information” from “authority sites.” If you take a look at the “advertisers” on any news search result vs web search results, you will begin to see a “premium” tier developing. The advertisers on the news results are advertisers with big budgets. General web search results, because of spam and the lack of relevant results, are going to become Google’s “remnant inventory” for advertisers. Think “parked domains.” If Google cannot “own” the domain name, and the owner of that competitive domain name is a “news authority website” and hence able to be placed on the first page of organic SERPs by writing original content, and not have to use Google’s advertising services to get there, Google needs a “new” channel for SERPs to “compete.” This channel, I believe, is Google News.
AOL’s Tim Armstrong is hiring “journalists” who write original content. Some of these journalists own rich keyword domain names, already in the organic SERPs – content categories – and they are writing original content for AOL’s “engine.” AOL is building and buying brands — taking, in my opinion, keyword rich “domains” and monetizing these keyword phrase/categories with expert content — exactly what advertisers are looking for. And, Tim Armstrong is doing this under AOL’s umbrella. Google is an aggregator of information, and this current “news” patent will determine, by “quality and authority,” who and what category news gets listed in Google’s “news” channel. Will Google’s “news” channel eventually be a listing of “information” in categories featuring content results from AOL’s journalists (among others, of course….)?
I believe AOL is the one universal “brand presence” that has an opportunity to upset the search power of Google. I think Tim Armstrong knows this, and I think Google knows it as well. Google’s news patent is all about determining who said what first – original content. And, that is exactly what Tim Armstrong’s writers and editors are attempting to do – write original content and grab marketshare – first. Google also knows that it will be difficult to compete on “local content” besides the obvious address and telephone number. Enter AOL’s purchase of Patch. Again, real writers of real content in narrow niches. Just like MSNBC’s recent purchase of “Every Block.” So Google will have to aggregate this “quality” content from “authority websites” to stay “credible” with the masses and advertisers. Google is now having to think about “the next click.” Calendar marketing and timing will play a huge role with large advertisers in content marketing of news and hyper-local information on the web (think back to school, Super Bowl, etc.). The word “news” has always been a powerful word to capture our attention. This “news” patent is just the beginning of Google’s “marketing strategy” to demonstrate the online equity of original content.
Bill, do you think Google’s news patent has any bearing on Tim Armstrong’s push to turn AOL into a Newsroom? Do you think Google would buy AOL?
Interesting stuff, and some of it I’m sure I can really use to help grow my own websites. Any time we can get any real info about how Google operates, it seems like something to celebrate.
Hi John,
I’m looking forward to your thoughts on this subject. One of my concerns is what I see as a flaw in the concept of PageRank and Domain Authority.
There are many people who write about topics in which they have a tremendous amount of expertise. For example, Schneier on Security is a great resource for learning about technology-related security issues, but it’s written by a single author, with no international bureaus, no press staff, a very narrow niche (including his Friday squid blog posts), and other issues that may not result in a very high “source rank.” And yet, it’s a great source of news and information on security issues. Should Schneier on Security be considered a news source?
If CNN and the BBC were to cover the same event as Schneier on Security, it’s very possible that they would show up before posts from Bruce Schneier. And we might just miss out on more insightful and informative information from an expert on the subject.
Hi Monkey Joes,
One of the signals that the patent mentions is based upon user experience, so I think that they might try to take that into account. But, I think that you are right – that Google News does seem to favor results from major agencies, and many of the signals listed in the patent back that approach.
Hi Sylvia,
I think one of the reasons behind the development of universal search or blended search by all of the major search engines is to enable them to insert news results into their web results in an effort to try to capture more timely information.
If AOL is working on developing rich fresh content, I think it is because they understand the value of being a portal that attracts visitors, and keeps them on their pages as long as possible, making it more likely that those visitors may see more advertisements and possibly even click through some of them.
I’m not sure that Google wants to be a content producer, or acquire one. Their focus seems to be more on making information and content accessible rather than creating that content.
The patent was filed in 2003, and AOL was a very different place then – more of a walled garden for its own members than the destination hub it seems to want to become now.
Hi Mark,
I agree. Anytime we have a chance to get a glimpse into some of the ideas behind the development of features at Google or Yahoo or Microsoft, it’s worth spending some time looking at and thinking about.
Nice Post Bill.
Again, thank you, CAP.
Hi Bill, interesting topic. I remember once, when I was blogging on a low-PR Blogspot blog, I picked a title that was very hot in the news at that time. Surprisingly, it was the only post cached within seconds of pinging. This made me think about how Google treats any post related to a new hot story, regardless of the source.
Hi Jake,
While the patent is about news search, some of the ideas and assumptions about the ranking of pages might flow over to web search. Blog posts can get discovered pretty quickly as a result of pinging a search engine, and if they are on a topic where there is a burst of searches, they might get a boost in search results for search queries that they may be relevant for.
As you note, that’s in spite of the source – Google News may place more weight on where a story comes from than web search, or blog search, when it comes to including timely information in search results – especially if a post is very relevant for a query term.
I really like real-time search engine OneRiot’s – http://www.oneriot.com – PulseRank system for determining relevant news based on a page’s popularity among social network users on Twitter, Digg etc. I wonder what Google will come up with to compete in the real-time search space.
Hi People Finder,
I hadn’t had a chance to explore OneRiot before. Thanks for the link.
I’ve seen the phrase “real time search” in interviews with executives from Google over the past couple of months, but without any indications of how they might work to achieve those results. We know that Google can find information from blogs within minutes, when those sources ping Google. News articles also can show up in Google fairly quickly, and that’s likely because Google only includes a limited number of news sources.
Freshness is often an important feature for those sources, as well as finding novel information from sources that produce substantially similar results, such as wire stories from the press.
I’m interested in seeing what Google comes up with as well.
I have to agree with John:
“I think this really outlines one of the major issues of the notion of domain authority. The link rich will get richer, and the small contributors will get choked off. In the end, this will be astonishingly counterproductive to Google’s original goal, which is to reward and serve unique content.”
Sad but true…who’s going to reward the new unique content? My only guess to resolve this problem over time would be the ‘readers’ themselves. If they allow some social ranking into the mix it will help avoid completely cutting off the entry of new quality publishing sources.
Once a new publisher hits a threshold, at that point it would likely be able to gain some link-richness.
Hi Mitchell,
That is one of the very real challenges of publishing online, regardless of whether you’re a new blogger, or a mainstream media source that’s using its offline presence to fuel online readership. Don’t know if you’ve read Clay Shirky’s Power Laws, Weblogs, and Inequality from a few years ago, but social ranking has its limitations as well.
Many of the best sources of information online, in blogs and other types of sites, aren’t the biggest or the most well known, and don’t have the most readers or links.
Wow, I’ve never thought that Google’s algorithm considers so many factors.
But as you said, it still has some errors. If I were interested in a discovery you mentioned, I would like to check the story in Physics.
Hi Daniel,
This post focuses upon searches at Google News, but it does show that Google will consider a lot of different factors when it ranks pages, and some of the most important factors for one type of result such as News may not be as important for other types of results, such as web pages.
Isn’t Google developing and ready to launch Realtime Hub? check (404 page) for social search option.
Hi Ethan,
I think Google is firmly behind the development and use of PubSubHubbub for a number of their products, including Google Reader and Google Alerts and a few others.