A popular search engine developed by Google Inc. of Mountain View, Calif. uses PageRank.RTM. as a page-quality metric for efficiently guiding the processes of web crawling, index selection, and web page ranking. Generally, the PageRank technique computes and assigns a PageRank score to each web page it encounters on the web, wherein the PageRank score serves as a measure of the relative quality of a given web page with respect to other web pages. PageRank generally ensures that important and high-quality web pages receive high PageRank scores, which enables a search engine to efficiently rank the search results based on their associated PageRank scores.
A continuation patent of an updated PageRank was granted today. The original patent was filed in 2006, and reminded me a lot of Yahoo’s Trustrank (which is cited by the patent’s applicants as one of a large number of documents that this new version of the patent is based upon.)
I first wrote about this patent in the post titled, Recalculating PageRank. It was originally filed in 2006, and the first claim in the patent read like this (note the mention of “Seed Pages”):
What is claimed is:
1. A method for producing a ranking for pages on the web, comprising: receiving a plurality of web pages, wherein the plurality of web pages are inter-linked with page links; receiving n seed pages, each seed page including at least one outgoing link to a respective web page in the plurality of web pages, wherein n is an integer greater than one; assigning, by one or more computers, a respective length to each page link and each outgoing link; identifying, by the one or more computers and from among the n seed pages, a kth-closest seed page to a first web page in the plurality of web pages according to the lengths of the links, wherein k is greater than one and less than n; determining a ranking score for the first web page from a shortest distance from the kth-closest seed page to the first web page; and producing a ranking for the first web page from the ranking score.
The first claim in the newer version of this continuation patent is:
What is claimed is:
1. A method, comprising: obtaining data identifying a set of pages to be ranked, wherein each page in the set of pages is connected to at least one other page in the set of pages by a page link; obtaining data identifying a set of n seed pages that each include at least one outgoing link to a page in the set of pages, wherein n is greater than one; accessing respective lengths assigned to one or more of the page links and one or more of the outgoing links; and for each page in the set of pages: identifying a kth-closest seed page to the page according to the respective lengths, wherein k is greater than one and less than n, determining a shortest distance from the kth-closest seed page to the page; and determining a ranking score for the page based on the determined shortest distance, wherein the ranking score is a measure of a relative quality of the page relative to other pages in the set of pages.
Producing a ranking for pages using distances in a web-link graph
Inventors: Nissan Hajaj
Assignee: Google LLC
US Patent: 9,953,049
Granted: April 24, 2018
Filed: October 19, 2015
One embodiment of the present invention provides a system that produces a ranking for web pages. During operation, the system receives a set of pages to be ranked, wherein the set of pages are interconnected with links. The system also receives a set of seed pages which include outgoing links to the set of pages. The system then assigns lengths to the links based on properties of the links and properties of the pages attached to the links. The system next computes shortest distances from the set of seed pages to each page in the set of pages based on the lengths of the links between the pages. Next, the system determines a ranking score for each page in the set of pages based on the computed shortest distances. The system then produces a ranking for the set of pages based on the ranking scores for the set of pages.
Under this newer version of PageRank, we see how it might avoid manipulation by building trust into a link graph like this:
One possible variation of PageRank that would reduce the effect of these techniques is to select a few “trusted” pages (also referred to as the seed pages) and discovers other pages which are likely to be good by following the links from the trusted pages. For example, the technique can use a set of high quality seed pages (s.sub.1, s.sub.2, . . . , s.sub.n), and for each seed page i=1, 2, . . . , n, the system can iteratively compute the PageRank scores for the set of the web pages P using the formulae:
.A-inverted..noteq..di-elect cons..function..times..fwdarw..times..function..times..function..fwdarw. ##EQU00002## where R.sub.i(s.sub.i)=1, and w(q.fwdarw.p) is an optional weight given to the link q.fwdarw.p based on its properties (with the default weight of 1).
Generally, it is desirable to use a large number of seed pages to accommodate the different languages and a wide range of fields which are contained in the fast growing web contents. Unfortunately, this variation of PageRank requires solving the entire system for each seed separately. Hence, as the number of seed pages increases, the complexity of computation increases linearly, thereby limiting the number of seeds that can be practically used.
Hence, what is needed is a method and an apparatus for producing a ranking for pages on the web using a large number of diversified seed pages without the problems of the above-described techniques.
The summary of the patent describes it like this:
One embodiment of the present invention provides a system that ranks pages on the web based on distances between the pages, wherein the pages are interconnected with links to form a link-graph. More specifically, a set of high-quality seed pages are chosen as references for ranking the pages in the link-graph, and shortest distances from the set of seed pages to each given page in the link-graph are computed. Each of the shortest distances is obtained by summing lengths of a set of links which follows the shortest path from a seed page to a given page, wherein the length of a given link is assigned to the link based on properties of the link and properties of the page attached to the link. The computed shortest distances are then used to determine the ranking scores of the associated pages.
The patent discusses the importance of a diversity of topics covered by seed sites, and the value of a large set of seed sites. It also gives us a summary of crawling and ranking and searching like this:
Crawling Ranking and Searching Processes
FIG. 3 illustrates the crawling, ranking and searching processes in accordance with an embodiment of the present invention. During the crawling process, web crawler 304 crawls or otherwise searches through websites on web 302 to select web pages to be stored in indexed form in data center 308. In particular, web crawler 304 can prioritize the crawling process by using the page rank scores. The selected web pages are then compressed, indexed and ranked in 305 (using the ranking process described above) before being stored in data center 308.
During a subsequent search process, a search engine 312 receives a query 313 from a user 311 through a web browser 314. This query 313 specifies a number of terms to be searched for in the set of documents. In response to query 313, search engine 312 uses the ranking information to identify highly-ranked documents that satisfy the query. Search engine 312 then returns a response 315 through web browser 314, wherein the response 315 contains matching pages along with ranking information and references to the identified documents.
I’m thinking about looking up the many articles cited in the patent, and providing links to them, because they seem to be tremendous resources about the Web. I’ll likely publish those soon.