There are a number of ways a search engine might decide how important a web page is. That measure of importance might be used, along with a determination of relevance, as one of the ranking signals that decide which pages appear first in the lists of results shown to searchers. It might also be used to decide which pages a search engine’s crawling program should crawl and index, and which it should revisit to see whether the content on those pages has changed.
A search engine might view the links between web pages, and decide that pages linked to frequently are more important than pages that aren’t. It might also determine that web pages that are linked to by important pages are more important than pages linked to by less important pages. Google’s PageRank is one approach for determining how important pages might be based upon looking at links between pages.
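To make the link-based idea concrete, here is a minimal sketch of a simplified PageRank-style calculation on a toy graph. It is not Google's production algorithm. The page names, the damping factor of 0.85, and the fixed iteration count are assumptions chosen for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Estimate link-based importance for a small graph by repeated passes.

    `links` maps each page to the list of pages it links to.
    """
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page starts each pass with a small baseline score.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page splits its current score among the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its score evenly.
                share = damping * rank[page] / n
                for other in pages:
                    new_rank[other] += share
        rank = new_rank
    return rank


# Hypothetical four-page site used only for this example.
toy_links = {
    "home": ["about", "blog"],
    "about": ["home"],
    "blog": ["home", "about"],
    "orphan": ["home"],
}
ranks = pagerank(toy_links)
for page, score in sorted(ranks.items(), key=lambda item: -item[1]):
    print(f"{page}: {score:.3f}")
```

In this toy graph, "home" ends up with the highest score because every other page links to it, while "orphan" gets only the baseline score because nothing links to it.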
There are other ways a search engine might decide how important a web page is, including attempting to see how many people actually use that page.
Google, Yahoo, and Bing all offer browser toolbars that provide a number of useful features, including a toolbar search, and all of them could possibly collect information about which web sites the people using those toolbars visit. The search engines could also collect information about the pages people visit by accessing information from the Internet Service Providers (ISPs) that people use to connect to the Web.
This browsing history information might include the history of web pages visited by a user and when those pages were accessed. It might also include demographic information describing the user. A pending patent application from Yahoo, published last week, explores the use of web browsing history as an alternative to looking at links to determine how important web pages might be.
The authors of the patent filing tell us that there are a number of benefits to using an approach based upon browsing history information instead of considering links between pages.
1) The data mined about the actual use of a web page may give a more accurate view of that page’s importance than the links pointing to the page.
2) Other ways to rank web pages require constructing a computationally expensive map of web pages and the links between them to determine their relative importance, instead of just measuring the traffic to a page.
3) Measuring browsing history of a web page is an incremental approach – new browsing data can be added to known values. Using methods of ranking pages based upon links means reconstructing a new map of web pages and links on a regular basis.
4) Approaches to ranking web pages based upon links are subject to deliberate manipulation by those who would create additional links solely to get pages to rank more highly. An approach based upon browsing history can filter out visits generated by automated programs or by deliberate attempts to boost traffic to pages.
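The third benefit above, incremental updating, can be illustrated with a small sketch. It assumes, purely for illustration, that the measured value is a raw visit count per page. The patent filing does not say the measure is that simple, and the URLs and the `add_visits` helper are hypothetical.

```python
from collections import Counter

# Running totals of visits per page (a stand-in for whatever usage
# measure a search engine might actually keep).
visit_counts = Counter()


def add_visits(counts, new_visits):
    """Fold a new batch of page visits into the running totals in place."""
    counts.update(new_visits)
    return counts


# First batch of browsing data.
add_visits(visit_counts, ["example.com/a", "example.com/b", "example.com/a"])
# A later batch is simply added; nothing has to be rebuilt from scratch.
add_visits(visit_counts, ["example.com/a", "example.com/c"])

print(visit_counts.most_common())
```

Each new batch touches only the pages that appear in it, whereas a link-based score generally has to be recomputed over the whole map of pages and links.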
The patent application from Yahoo is:
Web Page and Web Site Importance Estimation Using Aggregate Browsing History
Invented by Gilad Mishne and Guangyu Zhu
Assigned to Yahoo
US Patent Application 20100082637
Published April 1, 2010
Filed: September 30, 2008
Particular embodiments of the present invention are related to estimating the importance of web sites based on the aggregate browsing history of one or more users.
A search engine might collect more information than just whether or not someone visited a particular page. For instance, a person’s browsing history might contain information about a particular browsing session, including:
- Pages visited before and after a visit to a specific page
- When a browser window was opened and closed
- When a browser tab was opened and closed
- When a stored bookmark was followed
- When the contents of a page were refreshed
- What kinds of activities took place during a browsing session
- The total number of events that took place during a browsing session
- What time the browsing session took place
- What date the browsing session took place
- Demographic information about the person browsing
- The number of times a particular web site appears within the browsing session
- The total time spent viewing a particular web site
- The total amount of time spent during the browsing session
- The time it took a web page to load
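One way to picture the kind of record such a list implies is the sketch below. The class names, field names, and event labels are my own assumptions for illustration; the patent filing does not specify a data format.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List, Optional
from urllib.parse import urlparse


@dataclass
class PageEvent:
    """One thing that happened during a browsing session (hypothetical)."""
    url: str
    kind: str  # e.g. "navigate", "bookmark", "refresh", "tab_open", "tab_close"
    timestamp: datetime
    load_seconds: Optional[float] = None  # time the page took to load, if known


@dataclass
class BrowsingSession:
    """A single browsing session with its events and optional demographics."""
    started: datetime
    ended: datetime
    events: List[PageEvent] = field(default_factory=list)
    demographics: dict = field(default_factory=dict)

    def total_seconds(self) -> float:
        """Total amount of time spent during the session."""
        return (self.ended - self.started).total_seconds()

    def visits_to(self, site: str) -> int:
        """Number of events whose URL host matches the given site."""
        return sum(1 for e in self.events if urlparse(e.url).netloc == site)


# Made-up session data used only for this example.
session = BrowsingSession(
    started=datetime(2010, 4, 1, 9, 0, 0),
    ended=datetime(2010, 4, 1, 9, 10, 0),
    events=[
        PageEvent("https://example.com/", "navigate", datetime(2010, 4, 1, 9, 0, 5), 1.2),
        PageEvent("https://example.com/news", "navigate", datetime(2010, 4, 1, 9, 3, 0), 0.8),
        PageEvent("https://other.example/", "bookmark", datetime(2010, 4, 1, 9, 7, 30)),
    ],
)
print(session.total_seconds(), session.visits_to("example.com"))
```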
The importance of a site could be calculated for a particular browsing session to come up with a “local importance value” for that site and pages on the site. Those visits and importance values could be aggregated for all visitors to that page.
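As a rough sketch of how a per-session value might be computed and then aggregated, the example below assumes that a site's local importance is its share of the viewing time in a session, and that aggregation is a simple sum across sessions. Both choices are my assumptions, not formulas from the patent filing, and the site names and timings are invented.

```python
from collections import defaultdict


def local_importance(session):
    """Share of one session's viewing time spent on each site.

    `session` maps a site name to the seconds spent viewing it.
    """
    total = sum(session.values())
    if total == 0:
        return {}
    return {site: seconds / total for site, seconds in session.items()}


def aggregate_importance(sessions):
    """Sum the local importance values for each site across all sessions."""
    totals = defaultdict(float)
    for session in sessions:
        for site, value in local_importance(session).items():
            totals[site] += value
    return dict(totals)


# Two made-up sessions from different visitors.
sessions = [
    {"example.com": 300, "news.example": 100},
    {"example.com": 50, "shop.example": 150},
]
scores = aggregate_importance(sessions)
for site, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(f"{site}: {score:.2f}")
```

Normalizing within each session first means one very long session cannot dominate the totals, which is one reason a per-session value might be computed before aggregating.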
In addition to helping provide an importance signal for the ranking of a page in search results, this method could also be used to help create shortcut links (or query suggestions) for searchers, and to help the search engine decide which pages should be crawled more frequently to capture new pages and changes to already indexed pages.
I’ve seen some suggestions that web browsing data might be of limited use to search engines because people might attempt to manipulate this kind of user-based data, for example with automated programs such as botnets, or by hiring people to click on links and visit pages through a system such as Amazon’s Mechanical Turk.
It’s quite possible that approaches like those could be filtered out of the data collected by search engines, which may look at a wide range of information about browsing sessions, such as the information listed above, instead of just individual visits to specific pages.
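A very rough sketch of that kind of session-level filtering follows. The thresholds, field names, and the `looks_automated` helper are invented for illustration, and a real system would presumably weigh many more signals than these two.

```python
def looks_automated(session, min_seconds_per_page=1.0, max_pages=500):
    """Flag a session whose overall shape suggests a script rather than a person.

    Both thresholds are arbitrary assumptions, not values from the patent filing.
    """
    pages = session["pages_viewed"]
    if pages == 0 or pages > max_pages:
        return True
    # Sessions that average under a second per page are treated as suspect.
    return session["total_seconds"] / pages < min_seconds_per_page


# Made-up session summaries.
sessions = [
    {"id": "s1", "pages_viewed": 12, "total_seconds": 840},
    {"id": "s2", "pages_viewed": 400, "total_seconds": 120},
    {"id": "s3", "pages_viewed": 3, "total_seconds": 95},
]
kept = [s["id"] for s in sessions if not looks_automated(s)]
print(kept)
```

Here the second session is dropped because it averages well under a second per page, while the other two pass through to be counted.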
If search engines do start giving browsing history information more value as a signal to rank web pages, then what does that mean for site owners? Quite possibly, one way to get pages to rank higher in search results is to create better experiences for visitors, so that people spend time on those pages, bookmark them, and return to them.