The Web is filled with uncrawled web pages, and some of them may be yours.
One of the important first steps in optimizing a web site for search engines is to make sure that the pages of the site are search engine friendly, so that a crawling program from search engines can follow the links on pages to your site, and collect information from those pages. There are three major functions of search engines – crawling pages, indexing content, and displaying results to searchers – and the last two can’t happen until pages are crawled.
The purpose behind crawling web pages focuses upon:
- Discovering URLs for new pages,
- Acquiring content found on those newly discovered pages URLs, and;
- Finding Fresh content on previously crawled pages to keep search indexes fresh
As crawling programs explore the content on web pages, and come across URLs to new pages, they collect information about those URLs before actually visiting the pages they point towards. There may be a vast number of pages that a search engine has discovered a link to but hasn’t had a chance to crawl yet or may not be able to crawl for one reason or another – see Ranking the Web Frontier (pdf).
When a search engine discovers new URLs, it has to make some choices as to which pages to visit and index first. For example, if a search engine is faced with a choice between indexing a million pages of one site, or the home page of a million sites, it might choose to visit the million different sites first.
A recently published patent application from Yahoo attempts to provide a new approach to deciding which pages for crawling programs to visit based upon pages that it guesses might have the greatest search impact, addressing “Needy Queries.” Needy queries are search terms that people use with some level of frequency that have results that could stand being improved.
The patent begins by telling us about some past (and possibly present) approaches to deciding which pages to crawl and index content for based upon how important those pages might be perceived to be, rather than by a need for the content that the pages may contain, citing the following documents:
- Efficient Crawling Through URL Ordering
- Adaptive On-line Page Importance Computation
- Breadth-First Search Crawling Yields High-Quality Pages (pdf)
The fetching of content from URLs may be influenced by query independant aspects of those pages. If a URL has many links pointing to it or has accumlated a certain amount of PageRank, but hasn’t been visited and indexed yet, it might be moved up in a queue to be crawled, regardless of the content it might contain.
There is another approach to crawling URLs that focus upon content, know as focused crawling, in which crawling starts at pages that have been crawled and indexed before, and URLs are followed from those pages, on the assumption that those pages might link to other pages on the same topic. The patent points to another document that describes that kind of approach in more depth – Focused Crawling: A New Approach to Topic-Specific Web Resource Discovery (pdf).
It tells us that this kind of focused crawling focuses upon searching for pages that are relevant to a particular topic or a small set of topics, rather than by . Such focused crawling is guided by topic classification rather than the relevancy of queries issued by user requests.
Are there needy queries, where searches are taking place that your pages might be good matches for that haven’t been crawled and indexed by search engines yet?
In Yahoo’s patent filing, the search engine doesn’t just consider information about the URLs that it might crawl to decide which pages to visit next, but also looks at its query logs to find queries that a number of people are searching for where the quality of the search results that show up in response to those queries may not be doing a good job of meeting searchers’ information needs.
The search engine would look at information that it might have about URLs that it knows about, but hasn’t visited yet to determine which pages to visit to try to improve the search results for those queries. That information could include words within the URL, the anchor text used for those links, how many links are pointing at that URL, and what the domain name is for the page the URL points at.
If that information seems like a good match for needy queries, those URLs may move up in the queue to be crawled sooner rather than later.
The patent filing is:
System and method for crawl ordering by search impact
Invented by Christopher Olston and Sandeep Pandey
Assigned to Yahoo.
US Patent Application 20090164425
Published June 25, 2009
Filed December 20, 2007
An improved system and method for crawl ordering of a web crawler by impact upon search results of a search engine is provided. Content-independent features of uncrawled web pages may be obtained, and the impact of uncrawled web pages may be estimated for queries of a workload using the content-independent features.
The impact of uncrawled web pages may be estimated for queries by computing an expected impact score for uncrawled web pages that match needy queries. Query sketches may be created for a subset of the queries by computing an expected impact score for crawled web pages and uncrawled web pages matching the queries.
Web pages may then be selected to fetch using a combined query-based estimate and query-independent estimate of the impact of fetching the web pages on search query results.
This query-centric approach to determining an order for crawling web pages could lead towards:
Newer pages that might not have many links pointing to them showing up in search results quicker, and;
Long Tail queries that people are searching for with some frequency showing higher quality search results.