Yahoo on Collecting User Data for Web Site Profiling

Recently I wrote about a Yahoo adaptation of PageRank, called User Sensitive PageRank, which required that a lot of data be collected about visitors to web sites, including their clicking and browsing habits.

A couple of Yahoo patent applications from last week refer to User Sensitive PageRank while describing the collection of user data and Web data, and building profiles for specific web sites based upon that data.

One of them focuses upon how profiles are created for sites, to determine what the sites are about and what kind of traffic they receive, based upon profiles constructed for other sites about which more is known.

The other Yahoo patent filing describes some details on how this information could be used in choosing what kinds of materials to advertise on such sites, where profiles are being used to determine context when little is known about the actual content on some pages of those sites.

What’s most interesting to me is the tie-in to the User Sensitive PageRank patent, and some of the details provided about the collection of user data from toolbars, ISPs, and other programs.

The patent applications are:

System and method for web destination profiling

System and method for population-targeted advertising

Invented by Pavel Berkhin, Shanmugasundaram Ravikumar, Andrew Tomkins, and John Anthony Tomlin
Assigned to Yahoo
US Patent Application 20080028067
Published January 31, 2008
Filed: July 27, 2006

Creating Profiles and Estimates of Traffic

Traffic estimates, clickstream analysis and web traffic destination profiles can be created by looking at data from a variety of sources, including:

a) Other web destination profiles,

b) Link analysis of the connectivity of the website with other websites,

c) Traffic analysis of the traffic between pages of the website and other pages, either on or off the website,

d) Analysis of content of the pages or metadata such as tags to determine pages with similar content or tags elsewhere that may be used for smoothing.
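As one way to picture the smoothing mentioned in item d), here is a small Python sketch of my own. It is not taken from the patent filing. It assumes that each site carries a set of tags, that similarity between sites is measured with Jaccard overlap of those tags, and that a thin estimate for one site is pulled toward the estimates of similar sites. The function names, the `prior_strength` parameter, and the sample numbers are all hypothetical.

```python
def jaccard(tags_a, tags_b):
    """Share of tags two sites have in common (0.0 to 1.0)."""
    a, b = set(tags_a), set(tags_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def smoothed_estimate(observed, sample_size, target_tags, neighbours,
                      prior_strength=100):
    """Blend a site's own estimate with estimates from similar sites.

    observed       -- the estimate measured on the site itself
    sample_size    -- how many observations back that estimate
    neighbours     -- list of (tags, estimate) pairs for other sites
    prior_strength -- hypothetical knob: how many observations the
                      neighbour average is worth
    """
    weighted = [(jaccard(target_tags, tags), est) for tags, est in neighbours]
    total_weight = sum(w for w, _ in weighted)
    if total_weight == 0:
        # No similar sites, so there is nothing to smooth toward.
        return observed
    neighbour_mean = sum(w * est for w, est in weighted) / total_weight
    return ((sample_size * observed + prior_strength * neighbour_mean)
            / (sample_size + prior_strength))


if __name__ == "__main__":
    neighbours = [({"seo", "search"}, 0.5), ({"cooking"}, 0.1)]
    print(smoothed_estimate(0.9, 100, {"seo", "search", "patents"}, neighbours))
```

The fewer observations a site has of its own, the more the result leans on sites that look like it, which is the general idea behind using similar content or tags when little is known about a destination.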

A traffic analysis engine may include:

a) A model generator for generating a model of traffic flow among web destinations and

b) A traffic flow analysis engine for propagating population characteristics to web destination profiles by predicting traffic flow through web destinations.
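To make the idea of propagating population characteristics a little more concrete, here is a minimal Python sketch. It is my own illustration and not the method described in the filing. It assumes traffic is given as visit counts between pairs of sites, that some sites already have known audience profiles, and that an unprofiled site simply inherits the visit-weighted average of the profiles of the sites sending it traffic. The site names, trait names, and numbers are hypothetical.

```python
from collections import defaultdict


def propagate_profiles(flows, known, iterations=20):
    """Spread audience profiles along traffic flows.

    flows -- {(source_site, destination_site): visits}
    known -- {site: {trait: share}} for sites with trusted profiles
    Returns profiles for known sites plus every site reachable from them.
    """
    profiles = {site: dict(profile) for site, profile in known.items()}

    # Index the flows by destination so each site can see who feeds it.
    inbound = defaultdict(list)
    for (src, dst), visits in flows.items():
        inbound[dst].append((src, visits))

    for _ in range(iterations):
        updated = {}
        for dst, sources in inbound.items():
            if dst in known:
                continue  # trusted profiles are never overwritten
            totals = defaultdict(float)
            weight = 0.0
            for src, visits in sources:
                if src in profiles:
                    weight += visits
                    for trait, share in profiles[src].items():
                        totals[trait] += visits * share
            if weight > 0:
                updated[dst] = {t: v / weight for t, v in totals.items()}
        profiles.update(updated)
    return profiles


if __name__ == "__main__":
    flows = {("a", "c"): 30, ("b", "c"): 10, ("c", "d"): 5}
    known = {
        "a": {"under_30": 0.8, "over_30": 0.2},
        "b": {"under_30": 0.4, "over_30": 0.6},
    }
    for site, profile in sorted(propagate_profiles(flows, known).items()):
        print(site, profile)
```

In this toy example, site "c" ends up with a blend of the audiences of "a" and "b", weighted by how much traffic each sends, and site "d" picks up the profile of "c" on a later pass.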

Collection of Clickstream and other Data

Some of the ways and places that information is collected about users and web sites:

a) While people are signed in at Web portals (like Yahoo)

b) From ISP collected information, with attention paid to sign-ins at “trusted systems” where there is profile information (such as social networks).

c) From ISP collected information where there isn’t profile information, but a machine learning based system can gather information about a user’s interests.

d) From toolbar collected data about users (user-based samples of information)

e) From Web site collected data about visitors (location-based samples of information)

f) Gathered from graph data: the ways users may reach a web destination including browsing, searching, or bookmarks.

g) Taken from site structure data: from a global analysis of one or more websites and the structure of Uniform Resource Locators (URLs).

Such an analysis may reveal that two web sites may have the same owner listed in their DNS records, or that the web sites may employ the same template and therefore may likely be managed by the same entity.

Similarly, a single site may have two sub-sites which are owned by distinct individuals. This might be uncovered by particular known URL constructions such as, or by analysis of the inter-linking behavior on a site.

The structure of URLs may also provide a hierarchical view of the proximity of URLs. For example, and have in common the prefix, and may therefore be viewed as quite similar.
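A rough sketch of how URL structure could be turned into a proximity score follows. This is my own illustration in Python, not something spelled out in the filing. It assumes proximity is the fraction of leading components (host name, then path segments) that two URLs share. The `example.com` addresses are hypothetical stand-ins and are not the URLs from the patent application.

```python
from urllib.parse import urlsplit


def url_components(url):
    """Split a URL into its host followed by its non-empty path segments."""
    parts = urlsplit(url)
    segments = [segment for segment in parts.path.split("/") if segment]
    return [parts.netloc.lower()] + segments


def prefix_similarity(url_a, url_b):
    """Fraction of leading components shared by two URLs (0.0 to 1.0)."""
    a, b = url_components(url_a), url_components(url_b)
    shared = 0
    for x, y in zip(a, b):
        if x != y:
            break
        shared += 1
    return shared / max(len(a), len(b))


if __name__ == "__main__":
    # Hypothetical URLs: same host and same user directory.
    print(prefix_similarity(
        "http://example.com/users/alice/photos",
        "http://example.com/users/alice/blog",
    ))  # 0.75
```

Two pages under the same directory score high, while pages on different hosts score zero, which gives a simple hierarchical view of how close two URLs are.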


These patent documents provide a peek at how some data collected about how people use the Web, and about the Web itself, may impact both search and advertising on the Web.

Not long ago, Google gave us a slightly different look at creating profiles for web sites, which could be used, amongst other ways, to combat click fraud. I wrote about it in a post titled: Google at the Crime Scene: Profiling Websites, Estimating Traffic, and Combating Click Fraud

The methods of collecting information in these patent applications, and the possible creation of web destination profiles as described, would make the use of a User Sensitive PageRank more feasible. That’s something to consider.


10 thoughts on “Yahoo on Collecting User Data for Web Site Profiling”

  1. Interesting. I’ve done some research on my sites and found that length of stay, number of page views, and other analytics based on entry page have little correlation to search engine rank. It does seem like a good indicator of the value of a site. If it were used, low back-link sites with great content (like one of mine) could still rank, and spam sites with black-hat SEO that get an immediate bounce back would suffer. I love it.

  2. From my own experience using both Yahoo ads and Google ads to monetize web pages, there is no comparison. Google currently does a much better job than Yahoo at running contextually relevant ads that convert at a better rate than Yahoo. I think this is also proven by past profit results for the two search rivals.

    Yahoo just seems to have a long way to go in building both search and ad relevancy as compared to Google.

    Time will tell what patent innovations Yahoo comes up with to unseat Google and whether or not the possible mixing of Microsoft with Yahoo will create any advantage for Yahoo’s lackluster search and ad performance.

    Right now I think Yahoo is viewed more as a news and entertainment portal than anything else and they haven’t monetized their online properties very well.

  3. Hi Ken,

    I do think that it’s a good move on the part of search engines to try to incorporate searching and browsing behavior from users into rankings, but for it to work well, it’s going to require that they have access to a lot of data, and they are going to have to make smart decisions about what to use, and what not to use.

    Your view of traffic on your site isn’t necessarily the view that Google or Yahoo has, though if you are using something like Google Analytics, Website Optimizer, and Conversion tracking, they (or at least Google) may have some sense of what is happening on your pages.

    If you haven’t given them that kind of access, they may be doing a lot of guessing, based upon profiles of sites that they think are similar to yours. That could be a little troublesome if they aren’t doing it well.

  4. Hi People Finder,

    One of the things that constantly amazes me is the thought of how much data both Google and Yahoo can collect about how people use their search engines, and how they browse the Web.

    These patents hint at ways for them to guess what a new blog post or article is about when they try to serve ads, based upon a profile for the site the content is upon, and upon profiles for similar sites.

    The future will be interesting…

  5. Pingback: links for 2008-02-06
  6. Pingback: Around the SEO world in 5 links
  7. Hi Simlock,

    They do seem to suggest that they are collecting information from a lot of different resources, including purchasing it from internet service providers.
