Last December I wrote a blog post titled "Do Search Engines Hate Blogs? Microsoft Explores an Algorithm to Identify Blog Pages." The inventors behind the patent filing described in that post have come out with a new patent application that says some positive things about blogs. Looking back at the original post, it appears that they may not hate blogs at all.
In the new patent document, they ask whether the rankings of web pages in search results would be improved by providing a slight increase in the PageRank of pages linked to by blogs. They tell us that:
This idea is based on the assumption (or hope) that blogs are still mostly human-authored, and that links from blogs generally represent sincere endorsements on the part of the authors.
The December post explored how a search engine might be able to identify blog pages and distinguish them from non-blog pages, and told us that:
Search engines are increasingly implementing features that restrict the results for queries to be from blog pages.
But limiting the number of blogs that show up in search results doesn’t necessarily mean that a search engine doesn’t like blogs. It may mean that search engines would prefer to show a diversified set of search results, including blog pages and other results.
Search engines often look at a couple of different kinds of ranking factors when determining the order in which search results are shown to searchers.
Query-Independent and Query-Dependent
One way to classify ranking algorithms is as either query-dependent (or dynamic) or query-independent (or static).
Query-dependent ranking algorithms rely upon the query terms someone uses to rank pages, while query-independent algorithms look at other factors, such as how important a page appears to be based upon whether or not important pages link to it. PageRank is an example of a query-independent ranking algorithm.
Query-independent ranking algorithms assign a quality score to each document on the web, and can be run ahead of time. Query-dependent ranking algorithms depend upon the query used, and have to be run when a user submits a query.
Content, Usage, and Link Based Ranking Algorithms
It’s also possible to classify ranking algorithms as content-based, usage-based, and link-based.
Content-based ranking algorithms – use the words in a document to rank the document among other documents. For instance, a higher score might be assigned to a document that contains the query terms at the beginning of a document, in a prominent font, or in a certain kind of HTML element.
Usage-based ranking algorithms – may assign a score based on estimates of how often documents are viewed from looking at web proxy logs or looking at click-throughs on search engine results pages.
Link-based ranking algorithms – look at the hyperlinks between web pages to rank those pages, assigning a score to each page based upon the links pointing to it, with each link treated as an endorsement of the page.
PageRank – an example of a query-independent link-based ranking algorithm.
The PageRank formula is often explained as follows. Consider a web surfer who is performing a random walk on the web. At every step along the walk, the surfer moves from one web page to another, using the following algorithm.
With some probability d, the surfer selects a web page uniformly at random and jumps to it; otherwise, the surfer selects one of the outgoing hyperlinks in the current page uniformly at random and follows it. Because of this metaphor, the number d is sometimes called the “jump probability,” namely the probability that the surfer will jump to a completely random page.
If the web surfer jumps with probability d and there are |V| web pages, the probability of jumping to a particular page is d/|V|. Since any page can be reached by jumping, every page is guaranteed a score of at least d/|V|. The PageRank of a particular web page is then the fraction of time that the random surfer will spend at that page.
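The random surfer description above can be turned into a short calculation. What follows is a minimal sketch, not the patent's implementation: the function name `pagerank`, the toy link graph, the default of 0.15 for d, the fixed iteration count, and the handling of pages with no outgoing links (the surfer is assumed to jump uniformly at random from them) are all assumptions made for illustration.

```python
def pagerank(links, d=0.15, iterations=50):
    """Approximate PageRank by repeatedly applying the random-surfer step.

    links: dict mapping each page to the list of pages it links to.
    d: the "jump probability" described in the text.
    """
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}  # start the surfer at a uniformly random page

    for _ in range(iterations):
        # Every page receives d/|V| from random jumps.
        new_rank = {p: d / n for p in pages}
        for p in pages:
            out = links.get(p, [])
            if out:
                # Otherwise the surfer follows one outgoing link, chosen uniformly.
                share = (1 - d) * rank[p] / len(out)
                for target in out:
                    new_rank[target] += share
            else:
                # Assumption: a page with no outgoing links sends the surfer
                # to a uniformly random page.
                share = (1 - d) * rank[p] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical four-page web: "d" has no inbound links at all.
toy_web = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
scores = pagerank(toy_web)
for page in sorted(scores, key=scores.get, reverse=True):
    print(page, round(scores[page], 4))
```

In this toy graph the page with no inbound links ends up with exactly the d/|V| floor, since random jumps are the only way to reach it.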
But what if that surfer started favoring pages that were linked to by blogs a little more?
One of the problems behind using PageRank is that some commercial web sites try to inflate PageRank by creating links that point to a page solely for the purpose of endorsing that page, artificially increasing the value of the page.
This patent filing describes in some detail how a portion of PageRank from a page might be split (or distributed) equally amongst the links found on the pages of a site, and how the distribution of PageRank could be slightly altered to favor (or show a bias towards) pages that are linked to by blogs.
If blogs are, as the authors note in the patent, “still mostly human authored, and generally represent sincere endorsements of their authors,” then this bias might help counteract the artificial inflation of PageRank scores by people who create links solely to boost the PageRank of the pages they point to.
The patent filing is:
Ranking Method using Hyperlinks in Blogs
Inventors: Steve Chien and Dennis Fetterly
Assigned to Microsoft
US Patent Application 20080243812
Published October 2, 2008
Filed March 30, 2007
A method for static ranking of web documents is disclosed. Search engines are typically configured such that search results having a higher PageRank® score are listed first. A modified scoring technique is provided whereby the score includes a reset vector that is biased toward web pages linked to blogs. This requires identifying web pages as either blogs or non-blogs.
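One way to read the "reset vector" in that abstract is as the distribution the random surfer draws from when jumping: instead of picking any page uniformly, the surfer lands a little more often on pages that blogs link to. The sketch below illustrates that reading only. The function name `blog_biased_pagerank`, the `bias` parameter, and the way the uniform and blog-endorsed distributions are mixed are assumptions for illustration, not the formula from the patent filing, and every blog is treated equally here.

```python
def blog_biased_pagerank(links, blog_pages, d=0.15, bias=0.5, iterations=50):
    """PageRank with a reset vector tilted toward pages linked to by blogs.

    links: dict mapping each page to the list of pages it links to.
    blog_pages: set of pages already identified as blogs.
    bias: hypothetical knob; 0.0 gives a uniform reset vector.
    """
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)

    # Pages that receive at least one link from a blog page.
    endorsed = {t for b in blog_pages for t in links.get(b, [])}

    if endorsed:
        # Mix a uniform distribution with one spread over blog-endorsed pages.
        reset = {
            p: (1 - bias) / n + (bias / len(endorsed) if p in endorsed else 0.0)
            for p in pages
        }
    else:
        reset = {p: 1.0 / n for p in pages}

    rank = dict(reset)
    for _ in range(iterations):
        # Jumps now land according to the reset vector, not uniformly.
        new_rank = {p: d * reset[p] for p in pages}
        for p in pages:
            out = links.get(p, [])
            if out:
                share = (1 - d) * rank[p] / len(out)
                for target in out:
                    new_rank[target] += share
            else:
                # Assumption: pages without outgoing links also reset.
                for target in pages:
                    new_rank[target] += (1 - d) * rank[p] * reset[target]
        rank = new_rank
    return rank


# Hypothetical example: "x" is linked from a blog, "y" from a non-blog page.
toy_web = {"blog": ["x"], "site": ["y"], "x": [], "y": []}
print(blog_biased_pagerank(toy_web, blog_pages={"blog"}, bias=0.0))
print(blog_biased_pagerank(toy_web, blog_pages={"blog"}, bias=0.5))
```

With `bias=0.0` the two target pages are treated identically. Raising `bias` shifts score toward the page that a blog links to.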
Some of the kinds of things that a search engine crawling program might look at when deciding whether a page is from a blog might include:
- Whether a page is hosted in a known blog hosting DNS domain such as blogspot or wordpress.com
- What features appear in the words and phrases of the page's non-markup text
- What the targets of outgoing links might be in the page, and
- Whether the string “blog” occurs in the URL
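Two of the signals in that list can be checked from the URL alone. The sketch below shows what such checks might look like. The function name `blog_url_signals`, the host suffix list (which assumes "blogspot" means blogspot.com), and the returned dictionary are hypothetical. It does not attempt the text or outgoing-link signals, which need the page's content, and it is not the classifier described in the patent filing.

```python
from urllib.parse import urlparse

# Illustrative list based on the two hosting examples mentioned in the text.
BLOG_HOST_SUFFIXES = ("blogspot.com", "wordpress.com")


def blog_url_signals(url):
    """Return URL-only hints that a page might belong to a blog."""
    host = (urlparse(url).hostname or "").lower()
    known_blog_host = any(
        host == suffix or host.endswith("." + suffix)
        for suffix in BLOG_HOST_SUFFIXES
    )
    # Case-insensitive check for the string "blog" anywhere in the URL.
    blog_in_url = "blog" in url.lower()
    return {"known_blog_host": known_blog_host, "blog_in_url": blog_in_url}


for example in (
    "http://example.blogspot.com/2008/10/post.html",
    "http://www.example.com/blog/post",
    "http://www.example.com/about",
):
    print(example, blog_url_signals(example))
```

A real classifier would presumably combine hints like these with the content-based signals rather than trusting any single one.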
Experimenting with a Bias Towards Pages Linked to by Blogs
The authors of this patent performed experiments in which they downloaded over 472 million pages, and found links to an additional 6 billion pages within those pages.
They recomputed PageRank for these pages using a bias towards pages that they identified as being linked to by blogs, with a preference for blog pages that had higher PageRanks, which they tell us tend to be “frequently updated, more informational rather than personal, and free of spam.”
They also tell us that some other characteristics of blogs may prove useful in refining this technique, such as looking at the number of subscribers to a particular blog, and associating a higher endorsement value to blogs with greater numbers of subscribers.
Is sending more PageRank to pages that are linked to by blogs something that will increase the relevance and importance of pages that show up in search results? Are links to pages from blogs still actual endorsements from the authors of those blogs?
Do search engines love blogs?