Google’s blog search shows results in response to searchers’ queries based upon a combination of relevance scores and quality scores.
The relevance scores are based upon the query terms a searcher enters into the search box, and use traditional information retrieval (IR) scores for documents. These scores could be created by looking at:
- The number of times search terms appear in a blog post,
- The places where search terms appear within the document (such as the title or the text within the body of the post),
- Characteristics of search terms appearing on the pages (such as font, size, and color),
- Different weights for different search terms when multiple search terms are present,
- The proximity of search terms to each other when multiple search terms are present, and;
- Other techniques for determining the IR score for a document.
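To make the list above concrete, here is a toy sketch of an IR score that uses only three of those signals: term counts, a boost for terms in the title, and optional per-term weights. The function names, the `title_boost` value, and the scoring formula are all assumptions made for illustration; the patent application does not give a formula.

```python
import re
from typing import Dict, List, Optional


def tokenize(text: str) -> List[str]:
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def ir_score(
    query: str,
    title: str,
    body: str,
    title_boost: float = 3.0,  # assumed value, not from the patent
    term_weights: Optional[Dict[str, float]] = None,
) -> float:
    """Toy relevance score: weighted term counts, with title hits counted extra."""
    weights = term_weights or {}
    title_words = tokenize(title)
    body_words = tokenize(body)
    score = 0.0
    for term in set(tokenize(query)):
        # A hit in the title counts title_boost times as much as a hit in the body.
        occurrences = title_boost * title_words.count(term) + body_words.count(term)
        # Terms without an explicit weight default to 1.0.
        score += weights.get(term, 1.0) * occurrences
    return score
```

Proximity and visual characteristics such as font size are left out to keep the sketch short.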
In addition to a relevance score, the search engine looks at a quality score. A new patent application from Google discusses possible positive and negative ranking factors that might go into the quality scores used by Google Blog Search, and provides some explanations for each of those factors.
Ranking blog documents
Invented by Andriy Bihun, Jason Goldman, Alex Khesin, Vinod Marur, Eduardo Morales, and Jeff Reynar
US Patent Application 20070061297
Published March 15, 2007
Filed: September 13, 2005
A blog search engine may receive a search query. The blog search engine may determine scores for a group of blog documents in response to the search query, where the scores are based on a relevance of the group of blog documents to the search query and a quality of the group of blog documents. The blog search engine may also provide information regarding the group of blog documents based on the determined scores.
According to this patent application, there are two distinct sets of data that are used to score and determine the ranking of results in a blog search. The first is a relevance score of the post, based upon the query used by a searcher. The second is a blog quality score, which is independent of the query terms used in the search.
So how does Google come up with this quality score for a blog or blog post?
We’re told that a database is maintained by the search engine, and one of the fields in that database contains a quality score for each blog post. The quality score could be used to “promote, demote, or even eliminate a blog document (i.e., blog and/or post) from a set of search results.”
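The patent application doesn't spell out how the two scores are combined. As a minimal sketch, assuming the quality score simply multiplies the relevance score and that documents under some quality floor are dropped, "promote, demote, or even eliminate" could look like this. The `BlogDocument` class, the `min_quality` threshold, and the convention that a quality of 1.0 is neutral are all hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class BlogDocument:
    url: str
    relevance: float  # query-dependent IR score
    quality: float    # query-independent quality score; 1.0 is neutral in this sketch


def rank(docs: List[BlogDocument], min_quality: float = 0.2) -> List[BlogDocument]:
    """Order documents by relevance scaled by quality, dropping very low quality ones."""
    # Eliminate: documents under the assumed quality floor never appear in results.
    kept = [d for d in docs if d.quality >= min_quality]
    # Promote or demote: quality above 1.0 lifts a document, below 1.0 lowers it.
    return sorted(kept, key=lambda d: d.relevance * d.quality, reverse=True)
```

With this combination, a highly relevant post on a low quality blog can end up below a somewhat less relevant post on a high quality one.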
Determining a Quality Score
The first step in deciding a quality score is for the search engine to obtain information about a blog document. That information may come from:
- The blog itself,
- The post,
- Metadata from the blog, and/or;
- One or more feeds associated with the blog document.
Positive Indicators of Blog Quality
The next step is to identify positive indicators of quality:
- Popularity of the blog,
- Implied popularity of the blog,
- Inclusion of the blog in blogrolls,
- Existence of the blog in high quality blogrolls,
- Tagging of the blog,
- References to the blog by other sources,
- A pagerank of the blog, and;
- Other indicators could also be used.
What Google says about each of those:
Popularity could be based upon news aggregator subscriptions:
A blog document having a high number of subscriptions implies a higher quality for the blog document. Also, subscriptions can be validated against “subscriptions spam” (where spammers subscribe to their own blog documents in an attempt to make them “more popular”) by validating unique users who subscribed, or by filtering unique Internet Protocol (IP) addresses of the subscribers.
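The validation step described above might be sketched as follows. This version is an assumption on my part: it applies both checks at once and counts a subscription only if neither its user nor its IP address has been counted before, whereas the patent application mentions the two checks as alternatives. The `(user_id, ip_address)` pair is a hypothetical stand-in for whatever an aggregator actually records.

```python
from typing import Iterable, Set, Tuple


def validated_subscriber_count(subscriptions: Iterable[Tuple[str, str]]) -> int:
    """Count subscriptions after collapsing repeats from the same user or IP.

    Each subscription is a hypothetical (user_id, ip_address) pair.
    """
    seen_users: Set[str] = set()
    seen_ips: Set[str] = set()
    count = 0
    for user_id, ip in subscriptions:
        # Skip a subscription if either the user or the IP was already counted.
        if user_id in seen_users or ip in seen_ips:
            continue
        seen_users.add(user_id)
        seen_ips.add(ip)
        count += 1
    return count
```

Filtering on IP address alone would also discount legitimate subscribers who share an address, such as people in the same office, so a real system would probably need something more forgiving.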
Instead of explicit subscriptions, an “implied popularity” could be calculated from data collected from people searching on Blog Search, by examining the click stream of search results:
For example, if a certain blog document is clicked more than other blog documents when the blog document appears in result sets, this may be an indication that the blog document is popular and, thus, a positive indicator of the quality of the blog document.
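One simple way to express "clicked more than other blog documents" is a click-through rate: clicks divided by the number of times the document was shown. The sketch below is only an illustration. The smoothing terms (`prior_clicks`, `prior_impressions`) are my own assumption, added so that a document shown only once or twice doesn't look wildly popular or unpopular.

```python
def click_through_rate(
    clicks: int,
    impressions: int,
    prior_clicks: float = 1.0,        # assumed smoothing values,
    prior_impressions: float = 20.0,  # not taken from the patent
) -> float:
    """Smoothed share of result-set appearances that led to a click."""
    return (clicks + prior_clicks) / (impressions + prior_impressions)
```

A document with no data at all gets the prior rate of 0.05 here, and moves away from it as real clicks and impressions accumulate.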
Inclusion of the blog in blogrolls
Blogrolls are a dense collection of links to external sites (usually other blogs) in which the author/blogger is interested. A blogroll link to a blog document is an indication of popularity of that blog document, so aggregated blogroll links to a blog document can be counted and used to infer magnitude of popularity for the blog document.
Existence of the blog in high quality blogrolls
A high quality blogroll is a blogroll that links to well-known or trusted bloggers. Therefore, a high quality blogroll that also links to the blog document is a positive indicator of the quality of the blog document.
This is also based upon the assumption that a well-known or trusted blogger would not link to a “spamming blogger.”
Tagging of the blog document
Some existing sites allow users to add “tags” to (i.e., to “categorize”) a blog document. These custom categorizations are an indicator that an individual has evaluated the content of the blog document and determined that one or more categories appropriately describe its content, and as such are a positive indicator of the quality of the blog document.
References to the blog document by other sources
For example, content of emails or chat transcripts can contain URLs of blog documents. Email or chat discussions that include references to the blog document is a positive indicator of the quality of the blog document.
Pagerank of the blog
A high pagerank (a signal usually calculated for regular web pages) is an indicator of high quality and, thus, can be applied to blog documents as a positive indication of the quality of the blog documents.
When a blog post is new, it may not be associated with a pagerank. The new post could possibly inherit the pagerank of the blog with which it is associated until an independent pagerank is determined for the new post.
Negative Indicators of Blog Quality
- Frequency of new posts,
- Content of the posts,
- Size of the posts,
- Link distribution of the blog,
- The presence of ads in the blog, and;
- Other indicators may also be used.
Frequency of new posts
There are timestamps associated with blog posts. They might provide some interesting insights:
Feeds typically include only the most recent posts from a blog document. Spammers often generate new posts in spurts (i.e., many new posts appear within a short time period) or at predictable intervals (one post every 10 minutes, or a post every 3 hours at 32 minutes past the hour). Both behaviors are correlated with malicious intent and can be used to identify possible spammers. Therefore, if the frequency at which new posts are added to the blog document matches a predictable pattern, this may be a negative indication of the quality of the blog document.
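Both posting patterns could be checked from timestamps alone. The sketch below is a rough illustration under assumptions of my own: "predictable intervals" is read as nearly constant gaps between posts (measured with the coefficient of variation), and a "spurt" as too many posts inside a sliding time window. The function names and every threshold are hypothetical.

```python
from statistics import mean, pstdev
from typing import List


def looks_scheduled(timestamps: List[float], max_cv: float = 0.05) -> bool:
    """True when gaps between posts are nearly constant (e.g. one post every 10 minutes)."""
    if len(timestamps) < 4:
        return False  # too few posts to call anything a pattern
    ordered = sorted(timestamps)
    gaps = [b - a for a, b in zip(ordered, ordered[1:])]
    avg = mean(gaps)
    if avg == 0:
        return True  # every post shares the same timestamp
    # Coefficient of variation: spread of the gaps relative to their mean.
    return pstdev(gaps) / avg <= max_cv


def has_burst(timestamps: List[float], window: float = 3600.0, max_posts: int = 10) -> bool:
    """True when more than max_posts posts fall inside any window of `window` seconds."""
    ordered = sorted(timestamps)
    start = 0
    for end in range(len(ordered)):
        # Advance the left edge until the window spans at most `window` seconds.
        while ordered[end] - ordered[start] > window:
            start += 1
        if end - start + 1 > max_posts:
            return True
    return False
```

Timestamps are assumed to be in seconds. A blogger who schedules posts for the same time each morning would trip the first check too, which is presumably why the patent application calls this only a possible negative indication.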
The content of posts
Mismatches between the content of feeds and the actual content on pages may be a signal used as a negative factor:
Spammers may put one version of content into a feed to improve their ranking in search results, while putting a different version on their blog document (e.g., links to irrelevant ads). This mismatch (between feed and blog document) can, therefore, be a negative indication of the quality of the blog document.
Also, in some instances, particular content may be duplicated in multiple posts in a blog document, resulting in multiple feeds containing the same content. Such duplication indicates the feed is low quality/spam and, thus, can be a negative indication of the quality of the blog document.
OK, so does the little copyright notice at the bottom of the posts in my blog’s feed count as a negative quality factor?
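The patent application doesn't say how a mismatch or a duplicate would be measured. One common technique, assumed here purely for illustration, is Jaccard similarity over word shingles (overlapping runs of a few words): a score near 1.0 means two texts are nearly identical, and a score near 0.0 means they share almost nothing. The function names and the shingle size are my own choices.

```python
import re
from typing import Set, Tuple


def shingles(text: str, size: int = 3) -> Set[Tuple[str, ...]]:
    """Return the set of overlapping word sequences of the given length."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}


def jaccard(a: str, b: str, size: int = 3) -> float:
    """Shared shingles divided by all distinct shingles in the two texts."""
    sa, sb = shingles(a, size), shingles(b, size)
    if not sa and not sb:
        return 1.0  # two texts too short to shingle are treated as identical
    return len(sa & sb) / len(sa | sb)
```

Comparing a feed entry against the post on the page would flag a mismatch when the score is low, while comparing posts against each other would flag duplication when the score is high. Under this particular measure, a short notice appended to a long post lowers the score only slightly.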
Words/phrases used in blog posts
This negative factor suggests being careful about what you discuss in your blog posts:
For example, from a collection of blog documents and feeds that evaluators rate as spam, a list of words and phrases (bigrams, trigrams, etc.) that appear frequently in spam may be extracted. If a blog document contains a high percentage of words or phrases from the list, this can be a negative indication of quality of the blog document.
A blog detailing the characteristics of Nigerian spam emails may not rank highly in Google’s Blog Search.
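Here is a minimal sketch of the "high percentage of words or phrases from the list" check, using bigrams by default. The spam phrase list would come from documents that evaluators rated as spam; the tiny list in the usage example is made up, and the function name is hypothetical.

```python
import re
from typing import Set, Tuple


def spam_phrase_ratio(text: str, spam_phrases: Set[Tuple[str, ...]], n: int = 2) -> float:
    """Fraction of the text's n-grams that appear in a known spam phrase list."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    # Build overlapping n-grams by zipping the word list against shifted copies of itself.
    grams = list(zip(*(words[i:] for i in range(n))))
    if not grams:
        return 0.0
    hits = sum(1 for gram in grams if gram in spam_phrases)
    return hits / len(grams)


# Usage with a made-up phrase list:
example_phrases = {("wire", "transfer"), ("urgent", "business")}
example_ratio = spam_phrase_ratio(
    "Urgent business proposal wire transfer today", example_phrases
)
```

In the usage example, two of the five bigrams are on the list, so the ratio is 0.4. A post that merely quotes spam in order to discuss it would score high as well, which is the risk noted above.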
Size of blog posts
If you try to maintain a strict count of words in creating your blog posts, this one seems to indicate that you should mix up the size of your posts a little:
Many automated post generators create numerous posts of identical or very similar length. As a result, the distribution of post sizes can be used as a reliable measure of spamminess. When a blog document includes numerous posts of identical or very similar length, this may be a negative indication of quality of the blog document.
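The "distribution of post sizes" idea can be sketched by grouping post lengths into buckets and asking what share of posts lands in the most common bucket. The bucket width and the function name are assumptions for illustration, and a fixed-width bucket is a crude stand-in for a proper distribution test (two posts of 499 and 500 words fall into different buckets).

```python
from collections import Counter
from typing import Sequence


def dominant_length_share(word_counts: Sequence[int], bucket: int = 10) -> float:
    """Share of posts whose length falls in the single most common length bucket."""
    if not word_counts:
        return 0.0
    # Posts whose word counts differ by less than `bucket` usually share a bucket.
    buckets = Counter(count // bucket for count in word_counts)
    return max(buckets.values()) / len(word_counts)
```

A hand-written blog would usually spread across many buckets and score low, while a generator that produces posts of nearly the same length would score close to 1.0.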
Link distribution of the blog document
It appears that under this quality scoring system, whom you are linking to is considered, too:
As disclosed above, some posts are created to increase the pagerank of a particular blog document. In some cases, a high percentage of all links from the posts or from the blog document all point to either a single web page, or to a single external site. If the number of links to any single external site exceeds a threshold, this can be a negative indication of quality of the blog document.
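Counting links per external site is straightforward to sketch. The function below returns the most-linked external host and its link count, which could then be compared with whatever threshold the search engine uses. The patent application doesn't give a threshold value, and the function name and the host-based definition of a "site" are my assumptions.

```python
from collections import Counter
from typing import Iterable, Optional, Tuple
from urllib.parse import urlparse


def most_linked_external_site(
    links: Iterable[str], own_host: str
) -> Optional[Tuple[str, int]]:
    """Return (host, link_count) for the external host linked most often, or None."""
    hosts = (urlparse(url).netloc.lower() for url in links)
    # Ignore relative links (empty host) and links back to the blog itself.
    counts = Counter(h for h in hosts if h and h != own_host.lower())
    if not counts:
        return None
    return counts.most_common(1)[0]
```

A blog that regularly cites one favorite source would also show a lopsided count, so a threshold would need to be set with some care.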
Ads in a blog
If a blog contains a large number of ads, this may be seen as a negative indication of the quality of the blog.
According to the patent application, blogs typically contain three types of content:
- The content of recent posts,
- A blogroll, and;
- Blog metadata (author profile information and/or other information about the blog or its author).
If ads are present, they usually appear within the blog metadata section or near the blogroll. If ads are seen in the recent posts part of a blog, they could be considered a negative quality factor.
Chances are Google looks at other positive and negative quality factors in deciding upon a quality score.
It’s also possible that some of the ones listed above aren’t in use, but they provide some insight into the kinds of things that the folks at Google might be considering when finding ways to rank web posts and pages in response to a search.