Can looking at web traffic flowing through internet access points from Internet Service Providers help a search engine crawl the web more effectively?
A patent originally developed by the folks at Fast Search and Transfer, and assigned to Overture, was granted last week on improving the crawling of web pages by looking at that traffic, and it lays out the framework for doing so in fine detail. It also points out some of the limitations of not adopting such a practice, and explains many of the benefits.
Some of these limitations include problems with:
- Starting to crawl the web from seed pages,
- The limited amount of access time crawlers have to servers,
- Difficulties crawlers have in retrieving dynamic objects, and
- Link topology as a source of relevance.
Here are some of the details about the patent:
System and method for enhancing crawling by extracting requests for webpages in an information flow
Invented by Bjorn Olstad and Knut Magne Risvik
Assigned to Overture Services, Inc.
US Patent 7,093,012
Granted August 15, 2006
Filed September 13, 2001
A method for providing searching and alerting capabilities in traffic content at access points in data networks is disclosed. Typical access points for Internet, intranet and wireless traffic are described. Traffic flow through an Internet Service Provider is used as a preferred embodiment to exemplify the data traffic used as the input source in the invention. The invention teaches how proper privacy and content filters can be applied to the traffic source. The filtered data stream from the traffic flow can be used to improve the quality of existing searching and alerting services. The invention also teaches how a cache can be developed optimized for holding fresh searchable information captured in the traffic flow. It is further disclosed how the said cache can be converted to a searchable index and either separately or in cooperation with external search indexes be used as a basis for improved search services. The invention also discloses how the traffic flow can be analyzed in order to derive added information for measuring document relevance, access similarity between documents, personalized ranking of search results, and regional differences in document accesses.
In terms of describing how the various functions of a search engine operate, this is one of the more informative patents that I’ve seen. It’s probably worth struggling through if you would like to know more about the internal workings of a search engine, and how one could be adapted to capture user information to help it decide which pages to crawl and index.
A number of patents and papers that have come out recently mention how information collected from Internet Service Providers could be useful in improving how a search engine functions, without providing details. This patent provides some details.
In addition to addressing the limitations I listed above, it also could enable a search engine to, amongst other things:
- Use request statistics at access points in data networks to build improved relevancy in search and alert services,
- Create location-specific document ranking by using request statistics from users in specific locations,
- Find freshly created and updated pages and determine relevancy based upon ages of documents,
- Locate broken links, and
- Personalize rankings and document selection.
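To make a few of those ideas concrete, here is a minimal Python sketch of how observed page requests might be turned into crawler signals: overall request counts, per-region counts, broken links, and pages the crawler has not seen yet. This is my own illustration and not the method claimed in the patent. The `Request` record, the `is_public` filter (which simply drops anything that is not a plain `http` URL without a query string), and the `summarize` function are all hypothetical names and simplifications.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass
from urllib.parse import urlsplit


@dataclass(frozen=True)
class Request:
    """One observed page request (hypothetical record format)."""
    url: str
    status: int   # HTTP status code of the response
    region: str   # coarse location of the requester, e.g. "NO"


def is_public(url: str) -> bool:
    """Crude privacy filter (an assumption): plain http, no query string."""
    parts = urlsplit(url)
    return parts.scheme == "http" and not parts.query


def summarize(requests, known_urls):
    """Aggregate filtered requests into signals a crawler could use."""
    hits = Counter()                 # request counts per URL
    regional = defaultdict(Counter)  # request counts per URL, keyed by region
    broken = set()                   # URLs that answered with a 404
    for r in requests:
        if not is_public(r.url):
            continue                 # drop anything the filter rejects
        if r.status == 404:
            broken.add(r.url)
            continue
        if r.status != 200:
            continue                 # ignore redirects, errors, etc.
        hits[r.url] += 1
        regional[r.region][r.url] += 1
    unseen = set(hits) - set(known_urls)  # candidates for a first crawl
    return hits, regional, broken, unseen


# Small made-up sample of traffic, for illustration only.
log = [
    Request("http://example.com/news", 200, "NO"),
    Request("http://example.com/news", 200, "NO"),
    Request("http://example.com/news", 200, "US"),
    Request("http://example.com/old", 404, "US"),
    Request("http://example.com/account?user=42", 200, "US"),
    Request("http://example.org/fresh", 200, "US"),
]
hits, regional, broken, unseen = summarize(
    log, known_urls={"http://example.com/news"}
)
```

In this sample, the request counts could feed a popularity signal, `regional` could support location-specific ranking, `broken` lists pages worth dropping or rechecking, and `unseen` holds pages that users are visiting but the crawler has not fetched. A real system would need far more careful privacy filtering than the one-line check used here.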
It also attempts to address privacy concerns in a number of ways.
If you skip past the “claims” section to the description, it is fairly readable as patents go.