A couple of days ago, I posted about an Ask.com patent application that considered user behavior in ranking Web pages. Expect to see more of this from the search engines. Perhaps a lot more.
For example, in a recent patent application, the folks at Microsoft tell us about some of the issues involved with ranking algorithms that are based upon link structures:
Techniques which utilize the link structure of a graph (e.g., PageRank) usually make the incorrect assumption that all hyperlinks should be treated equally. In reality this is not true; a page links to many places because they include ads and navigational links which may not be important. In fact there are many links on a page that are never followed, and sometimes there are few links which get a majority of the click-throughs. Plus, many pages are just pages to pass through to get to another more important page.
In response to this problem, they have come up with a query independent method of ranking pages that can look at a mix of factors, including “the number of incoming links, the site traffic, how long the site has been around or the PageRank of the page.” This would be coupled with a relevance factor based upon the actual query searched for by the user.
The patent application is:
UserRank: ranking linked nodes leveraging user logs
Invented by Rangan Majumder
Assigned to Microsoft
US Patent Application 20070112768
Published May 17, 2007
Filed: November 15, 2005
The claimed subject matter provides a system and/or a method that facilitates utilizing transition probability in static rankings associated with at least one document. An interface can receive data related to a query, wherein the query can be associated with a search from a user. A rank component can provide query results that are prioritized utilizing a transition probability based on user activity included within a user log.
The rank component uses a transition probability of documents based upon the user behavior to account for the utility of links within such documents.
In particular, the user behavior and/or activity can be:
- An amount of time on a document (e.g., a user is on document A for X minutes),
- A log on to a document (e.g., a log on signifies a document of interest to a user),
- A log off to a document (e.g., a log off signifies the document contained information located therewith and no further document is of value),
- A document exited (e.g., indicating the information is located on such document and no further document is of value),
- A document request uniform resource identifier (URI),
- A document referrer.
In short, something like PageRank is based upon a probability: if you start at a certain point on the Web and follow links from page to page, how likely is it that you would eventually end up at a specific page? The process in this patent application takes user activity on pages into account to try to make a more informed guess as to that probability.
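To make that idea a little more concrete, here is a rough Python sketch of how click-based transition probabilities might be estimated from a log of (referrer, requested URI) pairs and then fed into a PageRank-style random walk. It is only an illustration, not the method claimed in the patent application: the function names, the damping factor of 0.85, the fixed iteration count, and the way pages with no observed exits are handled are all my assumptions, and it ignores the other signals listed above, such as time spent on a document.

```python
from collections import defaultdict


def transition_probabilities(log):
    """Estimate P(next page | current page) from observed clicks.

    `log` is a list of (referrer, requested) pairs. This is a hypothetical,
    simplified stand-in for the user logs the patent application describes.
    """
    counts = defaultdict(lambda: defaultdict(int))
    for referrer, requested in log:
        counts[referrer][requested] += 1

    probs = {}
    for src, dsts in counts.items():
        total = sum(dsts.values())
        probs[src] = {dst: n / total for dst, n in dsts.items()}
    return probs


def user_weighted_rank(log, damping=0.85, iterations=50):
    """PageRank-style power iteration using click-based transitions.

    Assumptions (not from the patent application): a damping factor of 0.85,
    a fixed number of iterations, and pages with no observed exits spreading
    their score evenly over all pages.
    """
    log = list(log)
    probs = transition_probabilities(log)
    pages = sorted({page for pair in log for page in pair})
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}

    for _ in range(iterations):
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for src in pages:
            out = probs.get(src)
            if out:
                # Pass score along links in proportion to how often
                # users actually followed them.
                for dst, prob in out.items():
                    new_rank[dst] += damping * rank[src] * prob
            else:
                # No observed exits: spread the score evenly.
                share = damping * rank[src] / n
                for page in pages:
                    new_rank[page] += share
        rank = new_rank
    return rank


if __name__ == "__main__":
    # Made-up log: nine of ten visitors to "home" click through to
    # "article", and one clicks the "ads" link.
    sample_log = [("home", "article")] * 9 + [("home", "ads"), ("article", "home")]
    ranks = user_weighted_rank(sample_log)
    for page, score in sorted(ranks.items(), key=lambda item: -item[1]):
        print(page, round(score, 3))
```

The difference from a purely link-based calculation is in `transition_probabilities`: instead of splitting a page's score equally among all of its outgoing links, each link receives a share proportional to how often users followed it, so a rarely clicked ad or navigation link passes along very little.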
6 thoughts on “Microsoft’s UserRank – Query Independent Ranking Based Upon User Logs”
This concept is very interesting, but where do you get user statistics for every site on the web? You have two options: toolbars or user accounts with the major search engines, and neither of them seems accurate enough to extrapolate results from.
Good question, Carfeu.
I think that you use as many different sources as possible if you want something like this to work well.
That would mean looking at information from the user through things like toolbars, bookmark programs, pre-fetching programs, notebook annotations, browser cache and history, and others.
It would mean looking at information from site owners, such as analytics and split testing optimizer programs.
It might mean looking at information purchased from Internet service providers.
Search engine log files would be a good source, too.
This is obviously a better measure of the value of a page, based on the number of visits and the time spent on the page. With Google Analytics and Webmaster Tools you would think Google would implement something like this, but they can’t measure every page on the net. That is the problem.
Google does have a number of patents which describe how they might collect an incredible amount of data about users, queries, and pages to try to rank pages. See my post Google and Large Scale Data Models Like Panda which provides some examples.
Google collects a lot of data in their search query logs, as well as the search and browsing histories of people who travel around the Web while logged into their Google accounts. The Google Toolbar collects information, and there are other services that Google provides that give them access to additional data as well. I’m not sure that Google really ever needs to look at Google Analytics information from any site to provide them with the kind of data that they might find useful.