Note: This is an April Fools’ Day post. The post is a play on the fact that the Google PageRank algorithm was named after Google founder Larry Page, and that there isn’t an equivalent algorithm from Google named after Google co-founder Sergey Brin. With the exception of the link to the “Brin Rank” patent below, all of the links in this post are legitimate. The post speculates upon what a ranking algorithm from Sergey Brin might look like, based upon his history of research and upon Google’s increasing use of user-behavior data, as suggested by the whitepapers and patents that the company has published in the past few years.
Google finds itself in an interesting predicament with one of the core aspects of its search technology, PageRank, falling out of exclusive control later this year. Fortunately, Google was granted a new patent this week that looks like it contains a substitute that overcomes some of the weaknesses of Lawrence Page’s search innovation of the 90s.
PageRank was predicated on the assumption that the existence of links between pages on the Web was a signal that could be used to sort and rank pages on the Web, scoring pages on an importance scale based upon the links they received from “important” pages. An “important” page is one that has links to it from other important pages. As inventor Larry Page noted in the first PageRank Patent, Improved Text Searching in Hypertext Systems:
The reasons why my system works so well, is that it decides which documents to return, and in what order, by using an approximation to how well cited, or “important” the matching documents are. I will call this aproximation to importance PageRank from now on. Web pages get a higher PageRank from being mentioned on other pages. But, the PageRank a page gains from a citation is based on the PageRank of the page that cites it. This definition may sound circular because it is in fact circular.
No More Random Surfers
The original PageRank algorithm was based upon the movements of a “Random Surfer” who visits a page and randomly follows any link on that page to another, with a 15% chance that he or she instead types a new address into the browser and goes somewhere else. The PageRank of a page is the probability that someone who starts anywhere on the Web and follows links in this random-surfer style ends up at that page.
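To make the random surfer a little more concrete, here is a minimal sketch of that calculation in Python. It is an illustration of the textbook model only, not Google's implementation: the function name pagerank, the four-page toy_web, and the choice to let a page with no outgoing links send the surfer to a random page are all my own assumptions.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Estimate PageRank by repeatedly moving a random surfer's probability.

    links maps each page to the list of pages it links to (a toy format
    assumed for this sketch). damping=0.85 leaves the 15% chance of
    typing in a new address.
    """
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}  # the surfer may start anywhere
    for _ in range(iterations):
        # Every page gets an equal share of the "typed a new address" jumps.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # The rest of this page's rank is split evenly over its links.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Assumption: a page with no links sends the surfer anywhere.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank


# A made-up four-page web: D links out, but nothing links to D.
toy_web = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(toy_web)
```

In this toy web, page C collects links from every other page and ends up with the highest score, while page D, which nobody links to, is left with only its share of the random jumps.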
But as Google’s Webspam head, Matt Cutts, noted in a blog post on PageRank Sculpting, the Random Surfer Model was in jeopardy even in the early days of the company:
Disclaimer: Even when I joined the company in 2000, Google was doing more sophisticated link computation than you would observe from the classic PageRank papers. If you believe that Google stopped innovating in link analysis, that’s a flawed assumption. Although we still refer to it as PageRank, Google’s ability to compute reputation based on links has advanced considerably over the years.
By 2004, the Random Surfer was likely unemployed, with a Reasonable Surfer taking his place, as described in a Google patent titled Ranking documents based on user behavior and/or feature data. The patent was granted in May of 2010, and I wrote about it in Google’s Reasonable Surfer: How the Value of a Link May Differ Based upon Link and Document Features and User Data.
As we’re told in that patent:
Systems and methods consistent with the principles of the invention may provide a reasonable surfer model that indicates that when a surfer accesses a document with a set of links, the surfer will follow some of the links with higher probability than others.
This reasonable surfer model reflects the fact that not all of the links associated with a document are equally likely to be followed. Examples of unlikely followed links may include “Terms of Service” links, banner advertisements, and links unrelated to the document.
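One way to picture the difference is to give each link on a page a weight and turn those weights into probabilities. The sketch below assumes exactly that; the weights, the link names, and the follow_probabilities helper are hypothetical, and the patent itself describes deriving such values from link and document features and user behavior data rather than from hand-picked numbers.

```python
def follow_probabilities(link_weights):
    """Turn per-link weights into the chance that a surfer follows each link."""
    total = sum(link_weights.values())
    if total <= 0:
        raise ValueError("at least one link needs a positive weight")
    return {target: weight / total for target, weight in link_weights.items()}


# Hypothetical weights for three links on one page (not taken from the patent).
example_links = {"related-article": 8.0, "terms-of-service": 1.0, "banner-ad": 1.0}

# A random surfer treats every link alike; a reasonable surfer does not.
random_surfer = follow_probabilities({target: 1.0 for target in example_links})
reasonable_surfer = follow_probabilities(example_links)
```

Under the random surfer, each of the three links is followed a third of the time. Under these made-up weights, the related article gets most of the clicks, and the "Terms of Service" link and the banner ad pass along much less.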
Kicking the Surfers Out for Good
Sergey Brin, somewhat uncomfortable with the surfer metaphor from the start, began pursuing topics that focused upon data mining and information extraction. His 1999 work, Extracting Patterns and Relations from the World Wide Web, looked for a way to extract structured data from unstructured web pages, draw useful relationships from those pages, and find meaningful data in the chaos of the Web. The process he developed is known as DIPRE (Dual Iterative Pattern Relation Expansion), and it is part of the tree that bore the fruit of his new Brin Rank algorithm.
Another key element of Brin Rank was exposed in a work that Brin co-authored, Beyond market baskets: generalizing association rules to correlations, which defined rules about data, with a footnote example providing a pretty clear encapsulation of the concept:
A Classic Example is the rule that people who buy diapers in the afternoon are particularly likely to buy beer at the same time.
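For readers who haven't met association rules before, here is a rough sketch of how the strength of a rule like "diapers, therefore beer" is commonly measured. The baskets are made up for illustration, and the three measures shown (support, confidence, and lift) are the usual textbook ones, which I'm assuming here rather than taking from the paper; the paper's own argument is for going beyond them to a statistical test of correlation.

```python
def rule_stats(baskets, antecedent, consequent):
    """Return (support, confidence, lift) for the rule antecedent -> consequent."""
    n = len(baskets)
    both = sum(1 for b in baskets if antecedent in b and consequent in b)
    with_antecedent = sum(1 for b in baskets if antecedent in b)
    with_consequent = sum(1 for b in baskets if consequent in b)
    if n == 0 or with_antecedent == 0 or with_consequent == 0:
        raise ValueError("both items must appear in at least one basket")
    support = both / n                       # how often the pair shows up at all
    confidence = both / with_antecedent      # how often beer follows diapers
    lift = confidence / (with_consequent / n)  # above 1: more often than chance
    return support, confidence, lift


# Made-up afternoon shopping baskets, for illustration only.
baskets = [
    {"diapers", "beer"},
    {"diapers", "beer", "chips"},
    {"diapers", "milk"},
    {"beer", "chips"},
    {"milk", "bread"},
    {"diapers", "beer", "milk"},
]
support, confidence, lift = rule_stats(baskets, "diapers", "beer")
```

With these six baskets, half of all shoppers buy both items, three out of four diaper buyers also pick up beer, and the lift comes out a little above 1, meaning the pair turns up together slightly more often than it would if the two purchases were unrelated.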
Brin’s patent Information extraction from a database (US 6,678,684), filed on March 9, 2000 and granted on January 13, 2004, points to research that further led to his Brin Rank innovation. The abstract tells us:
Techniques for extracting information from a database are provided. A database such as the Web is searched for occurrences of tuples of information. The occurrences of the tuples of information that were found in the database are analyzed to identify a pattern in which the tuples of information were stored. Additional tuples of information can then be extracted from the database utilizing the pattern. This process can be repeated with the additional tuples of information, if desired.
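The loop described in that abstract can be sketched in a few lines of Python. This is a heavily simplified illustration under my own assumptions, not the method from the patent or the DIPRE paper: a "pattern" here is nothing more than the text found between the two halves of a known tuple, names are assumed to be runs of capitalized words, and the four-sentence corpus and all function names are invented for the example. The original work also looks at URL prefixes, surrounding text, and how specific a pattern is before trusting it.

```python
import re

# Assumption for this sketch: a name or title is a run of capitalized words.
NAME = r"([A-Z][a-z]+(?: [A-Z][a-z]+)*)"


def find_middles(corpus, tuples):
    """Collect the text that appears between the two halves of known tuples."""
    middles = set()
    for first, second in tuples:
        occurrence = re.compile(re.escape(first) + r"(.{1,30}?)" + re.escape(second))
        for document in corpus:
            for match in occurrence.finditer(document):
                middles.add(match.group(1))
    return middles


def apply_patterns(corpus, middles):
    """Use each middle as a pattern to pull further (author, title) tuples."""
    found = set()
    for middle in middles:
        pattern = re.compile(NAME + re.escape(middle) + NAME)
        for document in corpus:
            for match in pattern.finditer(document):
                found.add((match.group(1), match.group(2)))
    return found


def extract_tuples(corpus, seeds, rounds=5):
    """Alternate between finding patterns and finding tuples until nothing new."""
    tuples = set(seeds)
    for _ in range(rounds):
        new_tuples = apply_patterns(corpus, find_middles(corpus, tuples))
        if new_tuples <= tuples:
            break  # no new tuples, so repeating would change nothing
        tuples |= new_tuples
    return tuples


# A toy "Web" of four sentences, made up for this example.
corpus = [
    "Isaac Asimov, author of Foundation, was born in 1920.",
    "Frank Herbert, author of Dune, won the Nebula Award.",
    "Frank Herbert wrote Dune in 1965.",
    "William Gibson wrote Neuromancer in 1984.",
]
books = extract_tuples(corpus, {("Isaac Asimov", "Foundation")})
```

Starting from the single seed pair, the first pass learns the ", author of " pattern and picks up Frank Herbert and Dune. That new pair then exposes the " wrote " pattern on the second pass, which in turn finds William Gibson and Neuromancer. That is the "process can be repeated" part of the abstract in miniature.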
The turning point, where this type of data mining seems to have become useful, came from combining two things: information mined from pages on the Web, along with the association rules created about that data, and information about how people actually use that data, collected from query logs, query sessions, and users’ web browsing histories.
Brin’s research noted, for instance, that people searching for science fiction novels in the morning also tend to look up information related to their stock portfolios around the same time.
The patent is:
Extracting and Associating Informational Nodes in a Large Scale Database
Invented by Sergey Brin
Assigned to Google
US Patent 75,008,681
Granted April 1, 2011
Filed: April 1, 2007
Techniques for extracting and associating information from a large scale database are provided. Occurrences of tuples of information are explored in an index of the Web, and within searching and browsing patterns of web users to identify associative rules and identify authoritative pages on the web for that information. The strength of relationships identified by these rules can be used to develop a score for pages in response to a query, referred hereinafter as Brin Rank.
The patent provides a very detailed look at how Brin Rank is calculated, and how it improves the ranking of documents on the web by using associative rules about how people behave when searching for different types of information.
It has the benefit of being owned completely by Google, and not subject to an expiring exclusive license like PageRank.
No surfers were involved in the conceptualization of Brin Rank.
Added 3:18, 2011/4/1 If you clicked on the link above to the patent, you’ve probably noticed that you arrive back at this page. While there isn’t an actual Brin Rank patent, I’ve been wondering for a while what Sergey Brin might have come up with if it had been his algorithm at the heart of Google rather than PageRank.
PageRank was an algorithm that provided much more relevant results than its competitors back when it was introduced, and it has likely evolved considerably since those early days. Google has been looking at ways to rerank results and to add signals beyond PageRank to its ranking algorithm, and at present PageRank is only one among hundreds of signals that the search engine uses.
It’s clear that Google has been spending significantly more time looking at user-behavior signals, and some of the things that I alluded to in Brin Rank quite possibly play a role in how Google ranks pages today.
I hope you enjoyed this April 1st post, and I thank everyone who passed it along.