Web Browsing History vs. PageRank?
There are many ways a search engine may decide upon how important a web page might be. That measure of importance might be used by search engines, along with a determination of relevance, as one of the ranking signals used to decide which pages to show first in lists of results shown to searchers. That importance might also be used to decide which pages a search engine crawling program should crawl and index, and revisit to see if the content on those pages has changed.
A search engine might view the links between web pages and decide that pages linked to frequently are more important than pages that aren’t. It might also determine that web pages linked to by important pages are more important than pages linked to by less important pages. Google’s PageRank is one approach for determining how important pages might be based upon looking at links between pages.
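To make the link-based idea concrete, here is a minimal sketch of a PageRank-style calculation in Python. It is a textbook simplification, not Google’s production system: the toy link graph, the damping value of 0.85, and the fixed number of iterations are all assumptions chosen for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Score pages from a dict mapping each page to the pages it links to."""
    # Collect every page that appears as a source or a target of a link.
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    n = len(pages)

    # Start with importance spread evenly across all pages.
    rank = {page: 1.0 / n for page in pages}

    for _ in range(iterations):
        # Every page gets a small baseline share regardless of its links.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page passes its importance on, split among its links.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outbound links spreads its share evenly.
                share = damping * rank[page] / n
                for other in pages:
                    new_rank[other] += share
        rank = new_rank
    return rank


# Hypothetical four-page site used only for illustration.
toy_links = {
    "home": ["about", "blog"],
    "about": ["home"],
    "blog": ["home", "about"],
    "orphan": ["home"],
}
scores = pagerank(toy_links)
```

In this toy graph "home" ends up with the highest score because every other page links to it, while "orphan", which nothing links to, ends up with the lowest.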
There are other ways that a search engine might decide how important a web page might be, including attempting to see how many people actually visit and use that page.
Google, Yahoo, and Bing all offer browser toolbars that provide many useful features, including a toolbar search. All could collect information about which websites the people who use those toolbars visit. The search engines could also collect information about the pages people visit by accessing information from the Internet Service Providers (ISPs) that people use to connect to the Web.
This web browsing history information might include the history of web pages visited by a user and when those pages were accessed. It might also include demographic information describing the user. A pending patent application from Yahoo, published last week, explores the use of web browsing history as an alternative to looking at links to determine how important web pages might be.
The authors of the patent filing tell us that there are many benefits to using an approach based upon web browsing history information instead of considering links between pages.
1) The data mined about the actual use of a web page may be a more accurate view of that page’s importance than the links pointing to a page.
2) Other ways to rank web pages require constructing a computationally expensive map of web pages and links between those pages, to determine their relative importance instead of just measuring the traffic to a page.
3) Measuring the web browsing history of a page is an incremental approach – new browsing data can be added to known values. Ranking pages based upon links means reconstructing a new map of web pages and links regularly.
4) Approaches to ranking web pages based upon links are subject to deliberate manipulation by those who create additional links solely to get pages to rank more highly. An approach that measures visits through web browsing history can filter out browsing history created by automated programs or by deliberate attempts to boost visits to pages.
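The incremental point in item 3 can be sketched in a few lines of Python. This is only an illustration of the general idea, not the method described in the patent filing: the `VisitTally` class, its method names, and the example URLs are all hypothetical.

```python
from collections import Counter


class VisitTally:
    """Running count of page visits that is updated as new data arrives."""

    def __init__(self):
        self.visits = Counter()
        self.total = 0

    def add_batch(self, visited_urls):
        # New browsing data is simply added to the known values;
        # nothing that was counted earlier has to be recomputed.
        visited_urls = list(visited_urls)
        self.visits.update(visited_urls)
        self.total += len(visited_urls)

    def share(self, url):
        # Fraction of all recorded visits that went to this URL.
        return self.visits[url] / self.total if self.total else 0.0


tally = VisitTally()
tally.add_batch(["a.example/", "b.example/", "a.example/"])  # first day's data
tally.add_batch(["a.example/", "c.example/"])                # next day's data
```

Each new batch costs only as much as the batch itself, whereas a link-based score generally has to be recalculated over the whole link map when that map changes.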
The patent application from Yahoo is:
Web Page and Web Site Importance Estimation Using Aggregate Browsing History
Invented by Gilad Mishne and Guangyu Zhu
Assigned to Yahoo
US Patent Application 20100082637
Published April 1, 2010
Filed: September 30, 2008
Abstract
Particular embodiments of the present invention are related to estimating the importance of websites based on the aggregate browsing history of one or more users.
A search engine might collect more information than just whether or not someone visited a particular page. For instance, a person’s web browsing history might contain information about a particular browsing session, including:
- Pages visited before and after a visit to a specific page
- When a browser window was opened and closed
- When a browser tab was opened and closed
- When a stored bookmark was followed
- When the contents of a page were refreshed
- What kinds of activities took place during a browsing session
- The total number of events that took place during a browsing session
- What time the browsing session took place
- What date the browsing session took place
- Demographic information about the person browsing
- The number of times a particular web site appears within the browsing session
- The total time spent viewing a particular web site
- The total amount of time spent during the browsing session
- The time it took a web page to load
The importance of a site could be calculated for a particular web browsing history session to come up with a “local importance value” for that site and pages on the site. Those visits and importance values could be aggregated for all visitors to that page.
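As a rough sketch of how a per-session score might be rolled up across visitors, the Python below treats a session as a list of (site, seconds viewed) pairs. The filing lists many possible signals and this post does not give its formula, so the scoring rule here (share of session time) and the averaging step are assumptions made purely for illustration, as are the function names and sample sites.

```python
from collections import defaultdict


def local_importance(session):
    """Return each site's share of the time spent in one browsing session.

    `session` is a list of (site, seconds_viewed) pairs. Using time share
    as the local importance value is an assumption for this sketch.
    """
    total = sum(seconds for _, seconds in session)
    if total <= 0:
        return {}
    scores = defaultdict(float)
    for site, seconds in session:
        scores[site] += seconds / total
    return dict(scores)


def aggregate_importance(sessions):
    """Average the local importance values over all sessions."""
    if not sessions:
        return {}
    totals = defaultdict(float)
    for session in sessions:
        for site, value in local_importance(session).items():
            totals[site] += value
    # Sessions that never touched a site contribute zero for that site.
    return {site: value / len(sessions) for site, value in totals.items()}


# Two hypothetical sessions from different visitors.
sessions = [
    [("news.example", 30), ("shop.example", 90)],
    [("news.example", 60)],
]
importance = aggregate_importance(sessions)
```

With these two sessions, "news.example" gets 0.25 in the first session and 1.0 in the second, so its aggregate value is 0.625, while "shop.example" gets 0.375.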
In addition to helping provide an important signal for the ranking of a page in search results, this web browsing history could also be used to help create shortcut links (or query suggestions) to searchers, and to help the search engine define which pages should be crawled more frequently to capture new pages and changes to already indexed pages.
I’ve seen some suggestions that web browsing data might be of limited use to search engines because there are possible ways that people might attempt to manipulate this kind of user-based data, such as by using automated programs like botnets or by hiring people to click on links and visit pages through a system such as Amazon’s Mechanical Turk.
It’s quite possible that approaches like those could be filtered out of the data collected by search engines, looking at a wide range of information about web browsing history, such as the information listed above, instead of just individual visits to specific pages.
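One very simple way to picture that kind of filtering is to look for browsing sessions that are suspiciously identical to each other. The Python sketch below is an illustration only: the event format, the function name, and the `max_copies` threshold are hypothetical, and a real system would presumably rely on fuzzier comparisons and many more signals than an exact match.

```python
from collections import Counter


def discount_repeated_sessions(sessions, max_copies=2):
    """Drop sessions whose exact event sequence shows up too many times.

    Each session is a list of (event, target) pairs. Sessions that appear
    more than `max_copies` times are treated as likely scripted or paid.
    """
    signatures = Counter(tuple(session) for session in sessions)
    return [s for s in sessions if signatures[tuple(s)] <= max_copies]


# Hypothetical sessions: three identical "search, then click" visits
# alongside two sessions that look like ordinary browsing.
scripted = [("search", "blue widgets"), ("click", "spam.example")]
sessions = [
    list(scripted),
    list(scripted),
    list(scripted),
    [("bookmark", "news.example"), ("click", "news.example/today")],
    [("search", "town hall hours"), ("click", "town.example"), ("refresh", "town.example")],
]
kept = discount_repeated_sessions(sessions)
```

Here the three identical sessions are removed and the two varied ones are kept.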
If search engines do start giving web browsing history information more value as a signal to rank web pages, then what does that mean for site owners? Quite possibly, one way to get pages to rank higher in search results is to create better experiences for visitors to those pages that have people spending time on the pages, bookmarking them, and returning to them.
Maybe I missed it, but isn’t there a major hole in the patent? The result would be that sites that are already ranked high will stay in their positions, while new sites will have a hard time getting any traffic at all.
Hi forbrukslan1,
I’m not sure that is really a hole. There are many sites on the Web that are frequently visited by people and have been bookmarked many times, but may not have a high number of links pointing to them. There are also other ways to get people to visit websites besides having those pages rank well in search engines, such as including the URLs in print, on TV and the radio, and in letterhead, invoices, and other correspondence. Some sites may develop traffic because of high visibility in social networks, forums, and advertisements. Links from news media and other sources can bring visitors directly to pages as well, ignoring any possible PageRank value of those links.
The patent isn’t describing a transition from a link-based ranking system to a browser-information based system, but rather an alternative approach to determining the importance of pages. Besides, a number of the factors that are listed in the patent application are ones that are independent of search rankings based upon links, such as the amount of time spent on a page or site, the number of “events” that a visitor sees on a site, how quickly a page loads, and more.
Since it is only an alternative approach, do we have to worry about it as early as now? I think we do. It may make the task of ranking well more difficult, but at least it is harder to manipulate, so the results should be more accurate.
It is interesting that Yahoo has filed this patent, but Google has mentioned (via Matt Cutts) several times that Google Analytics data is not included as a metric within their search rankings. This may be due to possible privacy issues surrounding the use of a website’s analytics data. Do you think that Google is instead using tools such as the Google Toolbar to collect such information and include it instead? I would say that my own browsing behavior is a much better indicator of the sites I find important than the sites I choose to link to from my own website. This was an interesting read, thanks!
The technology to determine “local importance value” could also help with the problem you talked about in an earlier post about local towns and cities. For example, you were talking about Google searching a city looking for a local .gov website, based on the proximity to the search.
I’m not convinced that PageRank is relevant anymore. Spammers and scammers have abused SEO so badly that I think its importance has really been damaged.
Hi Andrew,
It’s hard to tell whether something like this browsing history information ranking system is in use or not, but the improvements that it suggests are ones that have value in themselves, if you work on building a site that people will want to spend more time upon, and visit more pages, and bookmark it so that they can return.
Hi Jonathan,
You’re welcome. Good questions and points.
Google’s search log files, personalized search and web history programs, and toolbar provide the search engine with a considerable amount of information about how people browse and use the Web without it having to dig into information from Google Analytics. So, they possibly could do something very similar without having to use data from Google Analytics.
I agree – browsing behavior is probably a better indication of the sites that I find important than what I might link to on my site as well.
Hi Keith,
Good point, but I think it’s a double-edged sword. If I search for information about my local town, and I keep on visiting and returning to the town’s web site, it definitely does make my town’s site look like an important site related to the queries that I used. Unfortunately, if my search results during my query sessions include many pages that are more informative than my town’s site, and I tend to go to those pages instead, it may not help my town’s site as much in rankings.
Hi Antti,
I do believe that Google still values PageRank as one part of the way that it ranks sites in search results, and in deciding which pages to visit during web crawling. But, Google does look at many other signals to rank pages based upon relevance and importance, and it’s quite likely that PageRank isn’t as important as it once may have been. We are seeing indications from all of the major search engines that they can use data from their search logs and other services (such as the toolbar and personalized search) to learn more about how important different web pages are to people, and it is quite possible that those signals are growing in importance.
I could see the use of such a ranking within intranet or collaboration / EDRM searching where it is unlikely that users would try to manipulate search results. In this case, search ranking is building on the collective experience of a finite group of users within an organisation, and is therefore likely to provide a better rank than one simply based on the content. As such search applications do not have the benefit of PageRank-equivalent features, I could see that this could be a valuable additional ranking mechanism.
It’s a little bit scary, and I think it’s pushing the boundaries as far as privacy is concerned. Does the average Web user really want the engines following their online activity that closely? Still, as you say, it does all get back to developing websites which engage visitors.
Hi Ted,
I’m a little prone to believe that the more user data the search engines have when it comes to a ranking mechanism like this, the better. I’m not sure if the traffic from some intranets might be enough to be helpful, though.
It’s possible that someone might try to attempt to manipulate user-based data received by the search engines, but I suspect that kind of manipulation might be less costly to filter out than some of the manipulation that happens in a link-based ranking system.
Hi Steve,
I’m not sure if most web users realize how much of their online activity is being tracked, or can be traced back to them through their actions on the web. The approach described in this patent application seems to be more concerned with aggregated user data rather than that of individuals, however.
This is why you have to build your website for visitors and not search engines. You build it with search engines in mind.
Hi Mike,
I often like to look at it as search engines being just another visitor.
I am perplexed as to how Yahoo could claim this “invention” as so unique and non-obvious, as to justify a patent. What about the Alexa Toolbar?
I would also agree with the concerns expressed that this might contribute to stagnating search results. Nowadays most new traffic tends to come from search engines, thus creating a feedback loop – sites would be ranked high as a result of the traffic generated because, well, they were ranked high already!
This approach may be more useful not for site ranking, but rather for site DISCOVERY. I do know from my own experience that Google appears to be using their Toolbar in this fashion. Last year, we were working on a new site/domain for a client, and intentionally did not have ANYTHING linking to it, to keep it away from Google until it was completed. Of course, next thing you know, this half-done site starts showing up in the SERPs! We finally traced it to one of the PCs used to develop the site, which had the Google Toolbar installed & activated!
Finally, with regards to the privacy issue, I recently had need to do some packet sniffing to diagnose an Internet problem, and happened across the latest Google Toolbar’s communications, including what specific information was being sent to Google. We all know (or try to ignore) that in order for the Toolbar PageRank Display to function, it needs to send the URL of each of the web pages you visit to Google. But I was rather shocked to see what else it was sending!
In every PageRank request the Toolbar was transmitting in the background to Google for every web page I visited, in addition to the URL, something else was included in this behind-the-scenes communications to Google – the ubiquitous Google “cookie”!
That’s right, the same infamous cookie that privacy advocates have been screaming about, that lets Google keep track of every search you’ve done since the dawn of time, that uniquely identifies your specific records at Google, is also being included in all of the Toolbar’s background transmissions of the exact web pages you are viewing to Google!
Unlike the URL, I can think of no valid reason why sending this information would be necessary to display a web site’s PageRank. Nor would this be needed to collect information “in the aggregate”.
Why would Google feel it appropriate to combine personally identifiable information with every web page you ever visit? 🙁 Boggles the mind.
Aside from the obvious patent concerns, what is clear is that if ‘brand’ is seen as a factor for relevancy, then why not traffic volume? There are obvious flaws to the argument, as mentioned earlier, but then the ‘brand’ concept is flawed as well. What is clear is that a site that loses traffic suddenly tends to see a reduction in position, even if this is because a certain key phrase was turned off in AdWords. So is it the reduction in spend or the reduction in traffic that is the problem?
Hi Dave,
This is still a pending application, so it’s possible that it might not survive to become a granted patent. But, I’m not sure that I see the connection between what it offers and what Alexa offers, since it is more than just a question of the use of a toolbar. Actually, the Toolbar is only a small part of the concepts involved in Yahoo’s filing – a way of collecting data that can be captured in other ways as well.
Toolbars can be useful as a way of collecting information about new URLs that can then be crawled by a crawling program.
Not a surprise that Google would transmit cookie information – that kind of data collection could possibly be used to rank pages. A number of Google patents dating back at least 5 years suggest that Google would collect information in a manner like that to do things such as rerank pages based upon personalization and customize pages based upon aggregated user data.
Hi Garry,
Interesting questions. I really haven’t discussed the concept of a “brand” in this post or in many other places on this site because I’m not really sure how one would measure “brand” in a way that would make it meaningful as a ranking signal.
PageRank itself is a guess at the probability that someone might visit a web page, if they started out anywhere on the Web. Traffic volume could actually be used in a similar manner. It would more likely be seen as a ranking signal about the importance of a page rather than that page’s relevance to a specific term or phrase.
An interesting idea. But prone to abuse/spamming. It is increasingly difficult to find ways to use the “wisdom of crowds” to add a human value to a page/domain.
It seems that only in America you can patent the obvious.
Hi Adrian,
Any ranking algorithm is prone to abuse and spamming. A major part of the value of using a ranking algorithm is in anticipating how people might choose to abuse and spam it, and being able to adjust for those abuses. Many of the techniques that someone might use to abuse a user-data based system would likely leave patterns that would be easy to identify.
For instance, if people were hired to perform a certain search and click on a certain link, their browsing and searching sessions would stand out as being remarkably similar, and easily discountable. To make them dissimilar enough to appear unique would cost considerably more than just getting them to log in, perform a search, and click on a specific link, and would probably cost less than creating something that people would visit and bookmark and link in large numbers.
Regarding patents, and patenting things, I’m not really too concerned with whether this pending patent application gets granted, and I have no dog in the hunt either way. I’m more interested in what we might learn about the search engines and their assumptions about search, search engine users, publishers, and the web than I am in whether this patent will be granted.
Will the search engines move towards rankings that take into account more user data? If they do, what kind of information might they look at? How might they act to keep those rankings from being abused or spammed? Will we see better search results that take into account actual traffic to pages and bookmarking of those pages rather than linking to them – which is likely at least as good as, and probably better than, links as an indication of the quality of a web site?
Whether the methods described in the patent filing are obvious is really immaterial.
Hi Bill!
The idea of ranking a web site based on web browsing history is certainly not new – the Alexa web site has been doing this since the 1990s! The primary practical method for gathering this information is using a “toolbar”, also included in Yahoo’s patent. But again, Alexa has been doing this, via their own “Alexa Toolbar”, for many, many years now.
Nor is Alexa the only one. MSN Search used to use web browsing history as one of their ranking criteria (collected via tracking outbound clicks to specific sites from their SERPs), and Google is now tracking this as well.
Then there are the years of web browsing history that Google has collected via their toolbar. It’s hard to imagine how anyone from Yahoo could claim with a straight face that no one at Google ever even conceived of the possibility that this data might be useful for ranking, let alone that the very idea of Google actually USING this data they had collected these many years was such a revolutionary and non-obvious concept, that YAHOO deserves a patent for thinking it up! ROFLMAO!
Finally, and in a tie-in with one of your other recent articles, Google is now using “web page load time” as one of its ranking criteria, and has also publicly stated that this load time data is being obtained not by requests from Google’s servers, but from actual users’ web browsing history, via the Google Toolbar –
http://googlewebmastercentral.blogspot.com/2009/12/your-sites-performance-in-webmaster.html
both of which appear to be ideas Yahoo now claims to exclusively own?!
The problem for the search industry when an overly broad, already-done or obvious patent is granted is the exclusive rights a patent provides – it prevents Google and others from being able to use that approach to improve the quality of their search results, thus hurting everybody.
Hi Dave,
The idea of ranking a website itself based upon toolbar data isn’t unique, but that’s not really what Yahoo is suggesting here. Instead, they are ranking sites for specific query terms based in part upon user data collected by their toolbar. It’s more than tracking clicks on sites in search results as well.
Chances are that Google, Yahoo, and Bing are all looking at user data in ranking web pages for queries, and my point in writing this post wasn’t to defend Yahoo’s right to be granted a patent on the process described, but rather to look at what kind of user data they might consider using. I’m not even concerned about whether or not this patent filing is describing processes that are novel, nonobvious, or useful. That’s not the point of writing this post. :)
If it creates a better experience for the visitor then I’m certainly all for it. However, I don’t see the link system ever going away.
Hi Dan,
I’m not sure that I see link-based signals going away anytime soon either, but I suspect that they will have less value over time. I think we’re seeing signs of it. For instance, it’s very likely that a link in a sidebar of a page or in a page footer has less weight now than a link in the main content area of a page. And the search engine patents I’ve been seeing have introduced a good number of other signals that a search engine may be using that are independent of links.
I do agree with you. If it creates a better experience for the users of sites, then I’m all for it, too.
Hi Bill,
Don’t you think it’s not going to make any difference, since websites that are already ranked high on search engines or use extensive advertising get most of the traffic and will therefore have an advantage?
A new website with great content may take years to come out on top of the results.
This will encourage people to advertise more to get more traffic and better rankings.
Hi Max,
I’ve seen websites come out of nowhere and rank well because the people behind those sites recognized the value of providing something unique and innovative that attracted visitors to their pages. I think that’s something that everyone who starts a website has to ask themselves: what can I do to stand out and make people visit my pages?
But I’ve also seen many businesses and organizations recognize that it could take a few years for their sites to build an audience and traffic to their sites, and work towards growth in a reasonable and sustainable way. The truth is that it does sometimes take years for a business to grow.
I find it awfully annoying that Google thinks it always knows what we want. When I make a search with private browsing I always get new and interesting results, because Google will not “decide” for me based on what I had opened in the past. If I wanted to see the same results as the last time I searched for the same word, I might as well look in my own browser history. The reason I am searching for the same term again is to find something I didn’t find last time, not to see the same thing again, so taking data from my last search is not useful to me in this case. For some it might be useful: if you found something on page 2, the cookies will tell Google to display it on page one for you next time.
Hi Ron,
I don’t think Google believes that it always knows what we want, but there’s definitely a move at Google and Bing to try to understand the intent of people searching when they enter a few words into a search box, and I expect we will see more and more of that.
The customizations and personalizations that Google does provide are sometimes helpful and sometimes not, but I don’t think they get to the point where they overwhelm the rest of the results that aren’t based upon past searches or our location.