Would search engines be better if they started web crawls from sites like Twitter or Facebook? Wikipedia or Mahalo? DMOZ or the Yahoo Directory?
The Web refreshes at an incredible rate, with new pages added, old pages removed, and words pouring out from blogs, news sites, and other genres of pages. Ecommerce sites showcase new products and eliminate old ones. New sites launch and old domains expire.
Search engines attempt to keep their indexes of the Web as fresh as possible, sending out crawling programs to find new pages, record changes, and note what has disappeared. Failing to do so means an outdated search engine that delivers people to deleted pages and overwritten content, and a stale index that misses new sites.
When a search engine starts crawling the Web, it often begins by following URLs from chosen seed sites to explore other pages and other domains. But how does a search engine choose those seed sites?
Seed sites might be domains like the Open Directory Project or the Yahoo Directory, which link to a wide range of sites across different topics and regions, with editorial control over which pages they include.
But a search engine doesn’t necessarily have to use those particular sites as places to begin, and may choose others.
The choice of seed sites can have a dramatic impact upon the quality of a search engine and how it covers different topics and geographical areas in its index. Poorly chosen seed sites could mean low quality search results, and even more web spam showing up in response to searches.
A Yahoo patent application describes how the search engine might choose amongst sites to use as seed sites to discover URLs to other pages on the Web.
Host-Based Seed Selection Algorithm for Web Crawlers
Invented by Pavel Dmitriev
Assigned to Yahoo
US Patent Application 20100114858
Published May 6, 2010
Filed October 27, 2008
A host-based seed selection process considers factors such as quality, importance and potential yield of hosts in a decision to use a document of a host as a seed.
A subset of a plurality of hosts is determined, including some but not all of the plurality of the hosts, according to an indication of importance of the hosts, according to an expected yield of new documents for the hosts, and according to preferences for the markets the hosts belong to.
At least one seed is generated for each host of the determined subset of hosts, wherein each generated at least one seed includes an indication of a document in the linked database of documents. The generated seeds are provided to be accessible by a database crawler.
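The process the abstract describes can be sketched roughly as follows. This is a hypothetical Python illustration, not Yahoo's actual implementation; the `Host` class, its field names, and the threshold parameters are all assumptions made for the sake of the example.

```python
from dataclasses import dataclass

@dataclass
class Host:
    name: str              # e.g. "example.com" (hypothetical field names)
    importance: float      # a host trust score, e.g. a PageRank-like value
    expected_yield: float  # expected number of new URLs per crawl
    market: str            # the market/region the host belongs to
    entry_url: str         # a representative document on the host

def select_seeds(hosts, importance_min, yield_min, market_prefs):
    """Pick a subset of hosts (some but not all), then emit one seed
    URL per selected host, as the abstract describes."""
    seeds = []
    for host in hosts:
        # Keep hosts that look important enough, are expected to yield
        # new documents, and belong to a preferred market.
        if (host.importance >= importance_min
                and host.expected_yield >= yield_min
                and market_prefs.get(host.market, 0) > 0):
            seeds.append(host.entry_url)  # the seed handed to the crawler
    return seeds
```

In practice a crawler would then fetch each seed URL and follow its outlinks; the sketch only covers the selection step the patent claims.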
Revisiting the same seed sites on a regular basis may not result in discovering many new URLs. The pending Yahoo patent application provides a glimpse at how the search engine may compare and choose among potential seed sites.
It tells us that the seed site selection process can be improved if the choice of particular selected seeds results in:
- A relatively large number of previously undiscovered documents being discovered and processed.
- More crawling of important hosts and documents, and less crawling of unimportant ones.
- A desirable distribution among markets or categories of sites.
Candidate seed sites may be judged based upon measures of:

- Quality of hosts,
- Importance of hosts,
- Potential yield of hosts.
Quality (or lack of quality) of a site as a potential seed site could be based upon things such as the site:
- Having few outlinks,
- Being a spam page or having outlinks pointing to spam pages,
- Containing pornography content.
The patent filing tells us that high quality sites are chosen as potential seed sites since, as the starting point of a crawl, a low quality domain would likely lead to many more low quality pages being crawled.
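The three quality signals above could be combined into a simple disqualification check. This is an illustrative sketch only; the signal names and the outlink threshold are assumptions, not values from the patent.

```python
def is_quality_seed_candidate(num_outlinks, is_spam, links_to_spam,
                              has_porn, min_outlinks=10):
    """Apply the patent's three quality signals as a simple filter.
    (Parameter names and the outlink threshold are illustrative.)"""
    if num_outlinks < min_outlinks:  # few outlinks: little discovery value
        return False
    if is_spam or links_to_spam:     # spam, or points into spam neighborhoods
        return False
    if has_porn:                     # pornography content disqualifies a seed
        return False
    return True
```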
Importance of a seed site might be based upon a “host trust” score, rating, or other attribute associated with that host, which generally provides an indication of the relative importance or quality of a host, among other characteristics.
PageRank could be considered one type of host trust score, but other factors could be used as well.
Potential yield of documents, or the potential for the discovery of new URLs, for a host could be calculated based upon statistics gathered from past crawls of that host.
We’re told that markets are typically distinguished by geography, so a seed site selection process looking to yield many new URLs may look for different seed sites based upon geography, helping the search engine find URLs from different countries and regions.
Different thresholds may be chosen for seed sites in different markets (or according to some other characterization), because some markets are less dominant and may have fewer hosts and fewer “important” hosts. This can keep relatively dominant markets from having so much influence that “few or no seeds are selected for hosts in less dominant markets.”
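Per-market thresholds might work like this hypothetical sketch, where each market gets its own importance cutoff so that hosts in less dominant markets can still qualify as seeds. The tuple layout and threshold values are illustrative assumptions:

```python
def select_with_market_thresholds(hosts, thresholds, default_cutoff=0.5):
    """hosts: list of (name, market, importance) tuples.
    thresholds: per-market importance cutoffs -- a lower bar for less
    dominant markets keeps them represented among the selected seeds."""
    selected = []
    for name, market, importance in hosts:
        cutoff = thresholds.get(market, default_cutoff)
        if importance >= cutoff:
            selected.append(name)
    return selected
```

With a lower cutoff for a smaller market, a host that would fail the dominant market's bar can still be selected, which is the behavior the patent describes.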
I’m not sure that I’ve seen a detailed discussion before in a patent or white paper from one of the search engines on what they might look for in choosing a seed site for their crawling process.
Most discussions about web crawling by the search engines provide examples of sites like the Yahoo Directory or DMOZ as entrance points for the crawling and discovery of new pages on the Web.
Because of that, it’s interesting to see some of the criteria discussed that a search engine might use to identify a seed site other than those directories. Would Wikipedia make a good seed site? Possibly. How about Twitter or Facebook? I’m not quite as sure.
We know that the search engines have been placing more emphasis on quickly including content from sites like Twitter in their indexes to give us the feeling of being delivered very timely information. Are they also following links from those services, treating them as seed sites, to discover new pages and new content upon old pages? What does it mean if they are?