Crawling Pages to Address Needy Queries and Search Impact

The Web is filled with uncrawled web pages, and some of them may be yours.

One of the important first steps in optimizing a web site for search engines is making sure that the pages of the site are search engine friendly, so that a search engine’s crawling program can follow the links on pages to your site and collect information from those pages. Search engines perform three major functions – crawling pages, indexing content, and displaying results to searchers – and the last two can’t happen until pages are crawled.

Crawling web pages serves three main purposes:

  • Discovering URLs for new pages,
  • Acquiring content found at those newly discovered URLs, and;
  • Finding fresh content on previously crawled pages to keep search indexes up to date.

As crawling programs explore the content on web pages, and come across URLs to new pages, they collect information about those URLs before actually visiting the pages they point towards. There may be a vast number of pages that a search engine has discovered a link to but hasn’t had a chance to crawl yet or may not be able to crawl for one reason or another – see Ranking the Web Frontier (pdf).

When a search engine discovers new URLs, it has to make some choices as to which pages to visit and index first. For example, if a search engine is faced with a choice between indexing a million pages of one site, or the home page of a million sites, it might choose to visit the million different sites first.
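One simple way to express that breadth-first preference is to order the crawl frontier by how many pages have already been crawled from each host, so that hosts with nothing indexed yet come first. This is my own illustrative sketch, not something taken from the patent:

```python
from collections import Counter
from urllib.parse import urlparse

def order_frontier(frontier_urls, crawled_urls):
    """Order uncrawled URLs so that hosts with fewer already-crawled
    pages come first, favoring breadth across many different sites."""
    crawled_per_host = Counter(urlparse(u).netloc for u in crawled_urls)
    return sorted(frontier_urls,
                  key=lambda u: crawled_per_host[urlparse(u).netloc])
```

With this ordering, the home page of a site the crawler has never visited would jump ahead of the millionth page of a site it already knows well.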

A recently published patent application from Yahoo describes a new approach to deciding which pages crawling programs should visit, based upon which pages it guesses might have the greatest search impact by addressing “needy queries.” Needy queries are search terms that people use with some frequency, whose search results could stand to be improved.

The patent begins by describing some past (and possibly present) approaches to deciding which pages to crawl and index, based upon how important those pages are perceived to be rather than upon a need for the content those pages may contain:

The fetching of content from URLs may be influenced by query-independent aspects of those pages. If a URL has many links pointing to it or has accumulated a certain amount of PageRank but hasn’t been visited and indexed yet, it might be moved up in the crawl queue, regardless of the content it might contain.
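A query-independent crawl queue like the one described above can be sketched as a priority queue keyed on something like in-link count. The class and names below are my own illustration, not the patent’s:

```python
import heapq

class CrawlQueue:
    """Priority queue of uncrawled URLs, ordered by a query-independent
    score such as the number of links pointing at each URL."""

    def __init__(self):
        self._heap = []

    def add(self, url, inlink_count):
        # heapq is a min-heap, so negate the score for highest-first order
        heapq.heappush(self._heap, (-inlink_count, url))

    def pop(self):
        """Return the uncrawled URL with the highest score."""
        _, url = heapq.heappop(self._heap)
        return url
```

Under this scheme a heavily linked page gets fetched first no matter what content it turns out to hold, which is exactly the limitation the patent is trying to move past.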

There is another approach to crawling URLs that focuses upon content, known as focused crawling, in which crawling starts at pages that have been crawled and indexed before, and URLs are followed from those pages on the assumption that those pages might link to other pages on the same topic. The patent points to another document that describes that kind of approach in more depth – Focused Crawling: A New Approach to Topic-Specific Web Resource Discovery (pdf).

It tells us that this kind of focused crawling searches for pages that are relevant to a particular topic or a small set of topics, rather than for pages that match actual user queries. Such focused crawling is guided by topic classification rather than by the relevance of pages to the queries that searchers issue.
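A focused crawler of the kind described in that paper can be sketched in a few lines: follow links only outward from pages a topic classifier accepts. The function signature and toy graph here are my own assumptions for illustration:

```python
def focused_crawl(seed_pages, fetch, extract_links, on_topic, limit=100):
    """Crawl outward from seed pages, following links only from pages
    that the topic classifier `on_topic` accepts."""
    seen = set(seed_pages)
    queue = list(seed_pages)
    results = []
    while queue and len(results) < limit:
        url = queue.pop(0)
        page = fetch(url)
        if not on_topic(page):
            continue  # off-topic pages are not followed further
        results.append(url)
        for link in extract_links(page):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return results
```

Note what this approach can never do: an on-topic page reachable only through an off-topic one stays undiscovered, and nothing here consults what searchers are actually looking for.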

Needy Queries

Are there needy queries – searches taking place for which your pages might be good matches, but where those pages haven’t been crawled and indexed by search engines yet?

In Yahoo’s patent filing, the search engine doesn’t just consider information about the URLs that it might crawl to decide which pages to visit next. It also looks at its query logs to find queries that a number of people are searching for, where the search results that show up in response may not be doing a good job of meeting searchers’ information needs.

The search engine would look at information that it might have about URLs that it knows about but hasn’t visited yet, to determine which pages to visit to try to improve the search results for those queries. That information could include words within the URL, the anchor text used for those links, how many links are pointing at that URL, and what the domain name is for the page the URL points at.

If that information seems like a good match for needy queries, those URLs may move up in the queue to be crawled sooner rather than later.
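Matching those content-independent clues against needy queries might look something like the sketch below. The scoring formula, the 0.1 link boost, and the function name are all my own assumptions; the patent doesn’t spell out a formula like this:

```python
import re

def url_match_score(url, anchor_text, inlink_count, needy_queries):
    """Estimate, from content-independent clues only (URL words, anchor
    text, in-links), whether an uncrawled URL matches a needy query."""
    url_words = set(re.split(r"[^a-z0-9]+", url.lower())) - {""}
    anchor_words = set(anchor_text.lower().split())
    best = 0.0
    for query in needy_queries:
        terms = set(query.lower().split())
        # fraction of the query's terms found in the URL or anchor text
        overlap = len(terms & (url_words | anchor_words)) / len(terms)
        best = max(best, overlap)
    # small boost for well-linked pages, as a query-independent signal
    return best * (1.0 + 0.1 * min(inlink_count, 10))
```

A URL scoring well here would be moved up in the crawl queue, even if it has few links pointing at it.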

The patent filing is:

System and method for crawl ordering by search impact
Invented by Christopher Olston and Sandeep Pandey
Assigned to Yahoo.
US Patent Application 20090164425
Published June 25, 2009
Filed December 20, 2007

Abstract

An improved system and method for crawl ordering of a web crawler by impact upon search results of a search engine is provided. Content-independent features of uncrawled web pages may be obtained, and the impact of uncrawled web pages may be estimated for queries of a workload using the content-independent features.

The impact of uncrawled web pages may be estimated for queries by computing an expected impact score for uncrawled web pages that match needy queries. Query sketches may be created for a subset of the queries by computing an expected impact score for crawled web pages and uncrawled web pages matching the queries.

Web pages may then be selected to fetch using a combined query-based estimate and query-independent estimate of the impact of fetching the web pages on search query results.
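One plausible reading of that expected impact score is: for each query a page might match, count the query’s search frequency if the page’s estimated relevance would lift it into the current top results. The data structures and threshold comparison below are my interpretation, not the patent’s own formulation:

```python
def expected_impact(page_scores_by_query, query_frequencies, current_kth_scores):
    """Expected search impact of fetching an uncrawled page.

    For each query the page may match, add that query's frequency to the
    impact if the page's estimated score would beat the score of the
    current k-th ranked result (i.e., the page would enter the top k)."""
    impact = 0.0
    for query, est_score in page_scores_by_query.items():
        if est_score > current_kth_scores.get(query, float("inf")):
            impact += query_frequencies.get(query, 0)
    return impact
```

Pages would then be fetched in order of this query-based estimate blended with a query-independent one, per the abstract’s last sentence.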

This query-centric approach to determining an order for crawling web pages could lead towards:

  • Newer pages that might not have many links pointing to them showing up in search results more quickly, and;
  • Long Tail queries that people are searching for with some frequency showing higher quality search results.


15 thoughts on “Crawling Pages to Address Needy Queries and Search Impact”

  1. I think that Google and Yahoo generally do a good job of crawling pages, while MSN/Bing is way behind the curve. For instance, in Google & Yahoo I have a couple hundred pages indexed, but in Bing I have 4! I’ve double-checked my site to make sure there were no obstructions (there aren’t) and I even signed up for their version of webmaster tools in an effort to hit them over the head and tell them to index my site’s pages…still no luck!

    As for indexing and ranking, don’t get me started. Yahoo and Bing are way behind Google IMHO as far as the quality of their index/rankings go. I know the algorithms vary and treat a website differently, but I’m literally getting 90% of my organic clicks from Google. Although I know there are differences in ranking algorithms, there shouldn’t be this much of a disparity in the rankings.

    Anyhow great article Bill…

  2. This jibes with the reports that Google is treating blogs with more credibility lately. It makes sense. Plus, it can do better at getting trending topics.

  3. Wow, I can’t believe how in-depth this stuff gets. Do you have any information pertaining to how many servers Google has to maintain this constant crawling of new information & tremendous amount of data? it must be staggering.

  4. Hi Agent SEO,

    Thanks. Google does seem to do a pretty good job of crawling and indexing pages – but I have seen sites where Yahoo and Microsoft have crawled and indexed more pages than Google. Accounting for the disparities between the search engines can be difficult – I do think that the approach described in this patent filing is an interesting one, and could result in pages getting crawled and indexed that might not have been in the past.

    In addition to crawling impediments, there are other things that a search engine may consider, such as how unique the content might be that it finds on the pages of a site – I’ve seen sites that have pages with duplicate content issues, or a substantial amount of duplicate content, have problems getting crawled. XML sitemaps and HTML sitemaps also seem to help somewhat, as does making sure that pages aren’t placed too many directory levels away from the root directory. So it’s not just how crawlable the links are on a site that can help get pages crawled. Other factors may play a role in the crawling of pages as well – and each search engine has its own schedule and things to look for when deciding which pages to crawl.

  5. Hi jlbraaten,

    Bloggers can write about topics that are popular, and that there aren’t many relevant results for in search indexes. I’m not sure that this kind of crawling to address “needy queries” is what is resulting in blog posts showing up well for long tail terms, but it’s a possibility.

  6. Hi Brad,

    It does get complex. This patent filing is from Yahoo, but Google maintains a pretty active rate of crawling and indexing pages as well. I don’t think that I could begin to tell you how many servers they use to maintain that rate, but they have been pretty active in building data centers and have published a number of patent filings on possible ways to improve upon how those data centers work.

  7. Hello Bill,
    I hope that you had a good July 4 celebration. Would submitting a sitemap to the search engine where the URLs for individual pages might indicate a good match to a needy query then be of use? My thinking is that a sitemap does provide useful data to the bots crawling the pages, and if the information in a URL indicates that it might be a good solution to a search inquiry, does this help the search engine list that page higher on the “needy queries” list? Does a submitted sitemap come into play, or is it mainly from other sources, like finding the sitemap on the site itself?

  8. Hi Frank,

    Thank you – I hope your holiday was a good one, too.

    A recent study from Google on XML sitemaps showed that they may cause pages to be crawled faster than just letting search engines discover URLs. So XML sitemaps can be helpful even if a search engine isn’t expressly looking for URLs that might be good matches for needy queries.

    The patent filing doesn’t address XML sitemaps, and it tells us that it may be looking at things like words it finds in URLs, the domain name, and anchor text associated with the URL on pages when it is crawling pages. When a search engine parses an XML sitemap, it’s not seeing clues like anchor text associated with the URL, so it may miss that.

    Submitting an XML sitemap isn’t a bad idea, but shouldn’t be completely necessary. If you include the location of the sitemap in your robots.txt file, that is helpful as well, since a search engine should be looking at your robots.txt file from time to time before crawling your pages.
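For anyone unfamiliar with that robots.txt option, the sitemap location is declared with a single directive, like this (with a placeholder domain):

```
User-agent: *
Allow: /

Sitemap: http://www.example.com/sitemap.xml
```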

    We don’t know if any search engine is actively using this process, but the idea that search engines may be working on ways to incorporate information from their query logs about what people are searching for and from the quality of their search results is encouraging.

    What I like about the approach in this patent filing is that if you provide information on your web site that is under-represented in search results, and there’s a demand for that information in people’s searches, then the pages that information appears upon may get crawled quicker. But that comes with the caveat that you provide search engines with clues about the content of those pages with things like your URLs and the anchor text you use.

    Having your pages crawled is just the first step, though. It’s a helpful and necessary step – being indexed and showing up in search results both depend upon it. It is possible for a page to show up in search results without being indexed, but when that happens you may see the page URL instead of a title, and no snippet for the page.

    Having those pages rank well in search engines still often means creating content for pages that will be seen as relevant for specific queries and having some links pointing to those pages. But this needy query crawling process means that some pages may get into a queue to be crawled earlier because there’s a perceived need for what they contain. If Yahoo isn’t using this process, or one like it, hopefully they will start soon. It does seem like a good idea.

  9. Some of my WP blogs get new pages indexed by Google in a matter of hours after I create them. I’m using an automatic sitemap generator on those particular blogs and I’m not sure what triggers those instant index updates, but Yahoo usually takes days or weeks to index those same pages. Bing surprisingly indexes them quickly, within a day or two usually. It’s good to see someone in Yahoo is thinking about their “algorithm,” especially since the CEO thinks “Yahoo is NOT a search company.” It reminds me of a chicken running around (without its head). It’s no wonder the percentage of searches done using Yahoo has gone down so incredibly low when they could have been contenders.

  10. Hi Mal,

    Thanks – great questions and observations. I know that WordPress blogs are set up by default to ping a number of services, including Google – so that’s one way that the search engine discovers that you have new content, and pinging helps Google get an idea of how frequently your pages are updated. That frequency of updates may be something it considers when deciding how and when to return to your pages to index your content. An XML sitemap probably doesn’t hurt either.

    I’ve seen Yahoo return and index new content at some pages very frequently, and I’ve seen Google slow to return to some sites and index updated pages. Improved crawling when it comes to new content and updated content on pages is one of those areas that can give a search engine an advantage over other search engines. The needy query approach described in this patent filing sounds like a nice step forward.

  11. Hi Joel,

    It does seem like a great idea. It would be a nice way of improving search results for all searches across the board instead of just focusing upon ones that might already be well represented in search results.

  12. Seems to be a smart move. Nice writeup, Bill.

    Mal, I’ve noticed this fast indexing with WP blogs as well.

    Best of luck!
