Google was granted a patent yesterday on blog search, describing how the search engine might filter blog posts out of blog search results based upon a number of factors. The patent was originally filed in 2006, and it’s the first patent filing I’ve seen from Google that uses the term “splog.” The screenshot from the patent below shows some of that potential filtering process.
I’ve written a couple of posts in the past about how Google might be ranking blog posts based upon other patent filings from Google, including Positive and Negative Quality Ranking Factors from Google’s Blog Search (Patent Application) in 2007, and How Google May Rank Blogs in 2010.
The patent application from the first post, Ranking Blog Documents, is still pending, and the patent described in my second post, Indexing and retrieval of blogs, was granted at the time that I wrote about it.
Here’s the newly granted patent:
Providing blog posts relevant to search results
Invented by Kushal Dave, Joshua D. Mittleman, Kevin Scott, Vladislav Shchogolev, and David Alpert
Assigned to Google Inc.
US Patent 8,117,195
Granted February 14, 2012
Filed: March 22, 2006
A device identifies a search result document based on a search query, and searches a blog post repository to identify a blog post relevant to the search result document.
The device also rejects the blog post if the blog post has insufficient length, contains outgoing links located a predetermined distance from the beginning of the blog post, has a large out-degree, was created before or after a predetermined time, or has incoming links with a low link-based score.
The device further provides the blog post in connection with the search result document if the blog post was not rejected.
The patent primarily describes how Google might filter out certain blog posts from being included within its database of posts that might be returned to searchers in a blog search, and also describes how additional information from some posts might be presented by a search engine.
It’s pretty upbeat about blogs themselves, but recognizes that search engines don’t always make it easy for searchers to find blog posts:
Blogs may often provide useful information about a search result, such as honest reviews, contrasting opinions, links to related material, etc.
Unfortunately, search engines do not display blog posts that are relevant to a specific search result, making it difficult to find blog posts containing information useful to a search query.
Undesirable Blog Content
When someone performs a web search at Google, one of the options they have on the search results page is to click on a link in the sidebar to see blog search results. Those results are drawn from a data repository that contains information about blogs. But it doesn’t capture every blog post.
Many posts that are considered “undesirable” might be filtered out. The patent provides the following examples of the kind of content that might cause a blog post not to be included in the blog repository:
- Stolen content
- Chain letters
- Fraudulent solicitations
- Unwanted pop-up advertisements
The patent isn’t just about possibly filtering out posts that might contain certain types of content, though. It also points to certain possible rules that might be used to filter out another type of undesirable post:
One example of an undesirable blog is a spam blog, sometimes referred to by the neologism “splog.” Splogs may include blogs which the author uses only for promoting affiliated documents (e.g., documents linked to by the splog).
The purpose of a splog may be to increase the link-based score of affiliated documents, get advertising impressions from visitors, and/or use the blog as a link outlet to get new documents indexed.
The content on a splog may often be nonsense or text stolen from other documents with an unusually high number of links to documents associated with the splog creator which are often disreputable or otherwise useless documents.
More blog posts might also be filtered out of the repository based upon a review of the outgoing links of the remaining posts, such as links to content filled with profanity or pornography.
Rules for Filtering Undesirable Content
Here are some of the rules that might be used to remove blog posts from blog search:
Number of outgoing links – If a post has more than a certain number of outgoing links, which might be a predetermined number, such as fifty, then it may be removed. Those outgoing links could possibly include advertisements.
The number of links for that initial threshold might not be a predetermined amount, but might instead be determined by a statistical model, based upon a machine learning approach, to find a number of outgoing links that might provide a good tradeoff between accepted blog posts and rejected blog posts.
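To make that tradeoff concrete, here is a minimal sketch in Python of how a cutoff could be fit from examples rather than fixed in advance. This is not the patent's method: the function name `pick_out_degree_threshold`, the hand-labeled sample data, and the choice to simply minimize the number of misclassified posts are all my own assumptions, standing in for whatever statistical model might actually be used.

```python
from typing import Iterable, Tuple


def pick_out_degree_threshold(samples: Iterable[Tuple[int, bool]]) -> int:
    """Return the outgoing-link cutoff that misclassifies the fewest posts.

    samples holds (outgoing_link_count, is_undesirable) pairs. The labels
    are hypothetical; nothing in the patent says how they would be obtained.
    A post counts as rejected when its link count is strictly above the cutoff.
    """
    data = list(samples)
    candidates = sorted({count for count, _ in data})
    best_cutoff, best_errors = 0, len(data) + 1
    for cutoff in candidates:
        # An error is a good post rejected or an undesirable post accepted.
        errors = sum(
            (count > cutoff) != is_undesirable for count, is_undesirable in data
        )
        if errors < best_errors:
            best_cutoff, best_errors = cutoff, errors
    return best_cutoff


# Made-up training data: four ordinary posts and three link-stuffed ones.
labeled = [
    (3, False), (8, False), (12, False), (40, False),
    (75, True), (120, True), (300, True),
]
print(pick_out_degree_threshold(labeled))  # → 40
```

With this toy data, any cutoff from 40 up to 74 separates the two groups perfectly, and the sketch returns the lowest candidate that does so. A real system would presumably weigh wrongly rejected posts differently from wrongly accepted ones.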
If a post doesn’t go past the threshold of outgoing links, it might next be checked to see if it has any incoming links.
Lack of incoming links – If there are no incoming links for a post, it might also be rejected. We’re told:
For example, a blog post may have zero incoming links because the blog post does not contain any useful information and nobody is interested in it. Such a useless blog post may be removed from the repository.
Link score threshold – If there is at least one incoming link to the post, a link-based score might be calculated from the links pointing to it. If that score doesn’t reach at least some minimal level, the post might not be included in the blog repository.
This link-based score might be increased by incoming links to the post, and decreased by outgoing links to other documents.
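The patent doesn't give a formula for this score, so the following is only a toy illustration of the stated behavior: incoming links push the score up, outgoing links pull it down, and a post with no incoming links, or with a score under some minimum, is dropped. The function names, the `outgoing_penalty` weight, and the `minimum` value are all assumptions of mine.

```python
from typing import Sequence


def link_based_score(
    incoming_link_scores: Sequence[float],
    outgoing_link_count: int,
    outgoing_penalty: float = 0.1,  # assumed weight, not from the patent
) -> float:
    """Toy score: rises with incoming links, falls with outgoing links."""
    return sum(incoming_link_scores) - outgoing_penalty * outgoing_link_count


def passes_link_checks(
    incoming_link_scores: Sequence[float],
    outgoing_link_count: int,
    minimum: float = 1.0,  # assumed threshold, not from the patent
) -> bool:
    """Apply the two link checks in order: any incoming links, then the score."""
    if not incoming_link_scores:
        return False  # no incoming links at all
    return link_based_score(incoming_link_scores, outgoing_link_count) >= minimum


# Two decent incoming links and a few outgoing ones: kept.
print(passes_link_checks([0.8, 0.7], outgoing_link_count=3))   # → True
# One weak incoming link and many outgoing ones: rejected.
print(passes_link_checks([0.4], outgoing_link_count=20))       # → False
# Nobody links to the post: rejected before any score is computed.
print(passes_link_checks([], outgoing_link_count=0))           # → False
```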
Lack of Title – If the link-based score is high enough, the next step might be to determine if the post has a title. If it doesn’t have a title, it might be rejected:
For example, a blog post without a title may indicate that the blog post is not trustworthy and/or contains undesirable content. If the blog post has a title, then the blog post may remain in the repository and not be rejected.
Links to self or same domain – Blog posts with links to the same domain, whether to the post itself or other pages on the same domain, might also be removed from the repository, though the patent tells us that those links within the same domain might be ignored instead.
Links to electronic media – Posts with links to electronic media, such as images, movies, or audio, might also possibly be rejected. The patent doesn’t say so, but it’s possible that rejection might be based upon the type of media being linked to, like the kinds of undesirable content listed above.
Sufficient Length – If a post isn’t of a sufficient length, it might also be removed. While that length might be a required number of words, for instance, it might also be an amount determined by a machine learning process.
Distance of links from start of post – If the outgoing links in a post don’t appear within a certain predetermined distance from the start of a post, it might also be rejected. This appears to be intended to avoid posts that might contain too many links.
Recency of posts – Posts that are older than a certain predetermined amount of time, such as 2 weeks, might not be included in search results. Those recent posts might also need to have a certain link-based score to be presented as well.
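Taken together, the rules above read like a cascade of checks, where a post is dropped at the first one it fails. Here is a rough sketch of that idea in Python. Only the fifty-link and two-week figures come from the patent's examples. The `BlogPost` fields, the other limits, the order of the checks, and the decision to leave out the same-domain and electronic media rules (the patent is ambiguous about those) are my own assumptions.

```python
from dataclasses import dataclass, replace
from typing import Optional

# Fifty links and two weeks are the examples given in the patent.
MAX_OUTGOING_LINKS = 50
MAX_AGE_DAYS = 14
# The remaining limits are made-up placeholders.
MIN_LINK_SCORE = 1.0
MIN_WORDS = 100
MAX_FIRST_LINK_OFFSET = 200  # words allowed before the first outgoing link


@dataclass(frozen=True)
class BlogPost:
    title: str
    word_count: int
    age_days: float
    outgoing_links: int
    first_link_offset: Optional[int]  # None when the post has no outgoing links
    incoming_links: int
    link_score: float  # assumed to be computed elsewhere


def rejection_reason(post: BlogPost) -> Optional[str]:
    """Return why a post would be filtered out, or None if it survives."""
    if post.outgoing_links > MAX_OUTGOING_LINKS:
        return "too many outgoing links"
    if post.incoming_links == 0:
        return "no incoming links"
    if post.link_score < MIN_LINK_SCORE:
        return "link-based score too low"
    if not post.title.strip():
        return "no title"
    if post.word_count < MIN_WORDS:
        return "too short"
    # Assumption: a post with no outgoing links passes this check.
    if (post.first_link_offset is not None
            and post.first_link_offset > MAX_FIRST_LINK_OFFSET):
        return "links start too far into the post"
    if post.age_days > MAX_AGE_DAYS:
        return "older than the recency window"
    return None


good = BlogPost(title="Hello", word_count=400, age_days=2, outgoing_links=5,
                first_link_offset=30, incoming_links=3, link_score=2.5)
link_farm = replace(good, outgoing_links=80)
untitled = replace(good, title="  ")
stale = replace(good, age_days=30)

for post in (good, link_farm, untitled, stale):
    print(rejection_reason(post))
```

Returning a reason instead of a bare yes or no makes it easier to see which rule a given post tripped over, which would matter if only some of the rules were switched on.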
Categories of additional filtering rules
While the filters described above might be used, the patent tells us that a search engine might use only some of them, or could consider other heuristics, or rules, as well. Those could fall into categories that might consider: topicality, quality, freshness, and/or significance.
Topicality would involve whether a blog post is really discussing a query that it might be a search result for.
Quality could include whether a blog post is “well written, information rich, and/or generally useful.”
Freshness could be based upon a determination of whether a post is “recent and/or provides timely information.”
Significance involves whether the information provided by a post is important.
Some other heuristics might include:
- How many people subscribe to a blog post
- Whether a post has a particular political slant, such as conservative, liberal, and/or moderate
- If a post “expresses an opinion about a search result,” so that not all positive or negative or indifferent posts are shown to a searcher
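The last heuristic in that list, about not showing a searcher only positive, only negative, or only indifferent posts, could be sketched as a simple interleaving of ranked posts by opinion. The patent doesn't describe a procedure for this. The function name `diversify_by_opinion`, the opinion labels, and the round-robin approach are assumptions for illustration, and the labels themselves would have to come from some classifier that isn't shown here.

```python
from collections import defaultdict
from itertools import chain, zip_longest
from typing import Dict, List, Sequence, Tuple


def diversify_by_opinion(ranked_posts: Sequence[Tuple[str, str]]) -> List[str]:
    """Interleave posts so that each opinion label gets a turn.

    ranked_posts holds (post_id, opinion_label) pairs, best first.
    Within each label the original ranking order is preserved.
    Post ids are assumed to be strings, never None.
    """
    buckets: Dict[str, List[str]] = defaultdict(list)
    for post_id, label in ranked_posts:
        buckets[label].append(post_id)
    # Take one post from each label in turn; zip_longest pads short
    # buckets with None, which is filtered out below.
    rounds = zip_longest(*buckets.values())
    return [post_id for post_id in chain.from_iterable(rounds)
            if post_id is not None]


ranked = [
    ("a", "positive"), ("b", "positive"), ("c", "positive"),
    ("d", "negative"), ("e", "indifferent"),
]
print(diversify_by_opinion(ranked))  # → ['a', 'd', 'e', 'b', 'c']
```

Without the interleaving, the first three results here would all be positive posts. With it, the one negative and one indifferent post move up to the second and third positions.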
The patent also provides some alternatives regarding how blog posts might be displayed in search results.
I find myself wondering if any of the inventors listed on this newly granted patent had been bloggers before they wrote it. I break a few of the rules above myself.
For example, it seems reasonable to link, like I did at the start of this post, to previous posts on this domain about how Google might be ranking posts in blog search.
I did write one post last summer where I provided a little over 1,000 links to patents at the USPTO that had been assigned to Google. It made sense to do so at the time.
In my last post, I embedded a video from YouTube (a Google property) of a presentation from Google’s Director of Research, Peter Norvig.
The idea that Google might explore a number of different heuristics to determine when to filter posts out of blog search makes sense though, and basing them upon categories such as topicality, quality, freshness, and significance feels right.
The last couple of heuristics I wrote about, involving whether posts might have political slants or express opinions, seem more like a decision to include diverse results based upon some kind of sentiment analysis.
I did perform a number of searches at Google’s blog search while writing this post, and I can’t say that I’m really satisfied with the results I was receiving.