Next Step After XML Sitemaps: User Assisted Web Crawlers?

Imagine a search engine letting people teach a web crawling program how to navigate the pages of a site filled with JavaScript links, as well as pages usually accessible only by making selections or entering text into forms.

Why would a search engine let users assist its crawling program in exploring the content of pages normally hidden from most crawling programs?

Users Teaching Web Crawlers

Here are three examples of people who might help teach a web crawler how to crawl a site:

Webmasters – Search engines can have difficulty crawling pages because of JavaScript links, links included as options in drop-down forms, and pages that are only accessible through forms.

While search engines have been providing tools to make it easier for webmasters to have the pages of their sites indexed, such as those found in Google Webmaster Tools, Yahoo Site Explorer, and Bing’s Webmaster Center, none of those tools really addresses that problem.

Short of a webmaster providing alternative ways to reach pages behind the JavaScript or forms, it can be difficult to have a search engine index those pages. What if the webmaster tools that search engines provide included a way for a webmaster to train a search engine crawling program to access pages behind forms and JavaScript?

A webmaster might teach a web crawling program the most effective way to crawl his or her site by browsing its pages in a certain order, filling out its forms, and interacting with its pages in the manner the webmaster intends.

Those interactions could be captured, and rules for crawling the site could be learned from them. A web crawler could then use those rules on future crawls of the site (a rough sketch of this idea follows these examples).

Manual Reviewers – Someone manually reviewing the content and structure of web pages, to see whether a search engine could index those pages more effectively, might set up rules for a crawling program to follow links in a logical order or fill out search forms so that it finds the most relevant pages on a site.

Content Subscribers – Technologies like RSS feeds and mashup tools can bring content from a site to someone interested in seeing it, without that person having to visit multiple web pages.

If people interested in that content could train a program to crawl through forms found at places like job search sites, travel pages, or other websites that keep content behind forms, that content could be fetched and delivered to them automatically.
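Any of these three kinds of users would, in effect, be producing the same raw material: a recorded browsing session. As a rough sketch of my own (nothing below comes from an actual search engine or from the patent filing discussed later; every name and format is made up), such a session might be boiled down to a short list of replayable crawl rules:

```python
# Hypothetical sketch: turning a recorded browsing session into replayable
# crawl rules. The record format, rule format, and function name are all
# illustrative, not taken from any patent filing or search engine.

def learn_rules(session):
    """Reduce a recorded browsing session to simple, replayable crawl rules."""
    rules = []
    for action in session:
        if action["type"] == "visit":
            rules.append(("fetch", action["url"]))
        elif action["type"] == "form_submit":
            # Remember which form was filled out, and with which values.
            rules.append(("submit_form", action["form_url"], action["fields"]))
    return rules

# A session recorded while a webmaster browsed the site in the intended order.
session = [
    {"type": "visit", "url": "http://www.example.com/jobs/"},
    {"type": "form_submit",
     "form_url": "http://www.example.com/jobs/search",
     "fields": {"keywords": "", "location": "Anywhere"}},
]

for rule in learn_rules(session):
    print(rule)  # rules a crawler could replay on its next visit to the site
```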

Problems with Focused Crawling

There are two common types of web crawling.

Free crawling – when a crawling program finds a page, it stores the page and its address, or URL, and follows any links it can find on that page to locate other web pages.

Focused crawling – a crawling program tries to crawl only pages which contain a specific type of content, or “relevant” web pages.
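To make the difference concrete, here is a minimal crawl-loop sketch of my own (it is not drawn from any search engine’s actual crawler, and the link extraction is deliberately naive); the only change between the two modes is the relevance test used to decide which pages get stored and followed:

```python
# A toy crawl loop. With a relevance test that always returns True this is a
# "free" crawl; with a meaningful test it becomes a "focused" crawl.
import re
import urllib.request
from urllib.parse import urljoin

def extract_links(base_url, html):
    # Naive link extraction, just for illustration.
    return [urljoin(base_url, href) for href in re.findall(r'href="([^"]+)"', html)]

def crawl(seed, is_relevant, limit=50):
    seen, queue, stored = set(), [seed], {}
    while queue and len(stored) < limit:
        url = queue.pop(0)
        if url in seen:
            continue
        seen.add(url)
        try:
            html = urllib.request.urlopen(url).read().decode("utf-8", "ignore")
        except Exception:
            continue
        if is_relevant(url, html):
            stored[url] = html                       # store the page and its URL
            queue.extend(extract_links(url, html))   # follow the links it contains
    return stored

# Free crawl: every page counts as relevant.
#   pages = crawl("http://www.example.com/", lambda url, html: True)
# Focused crawl: only store and follow pages that mention job listings.
#   pages = crawl("http://www.example.com/", lambda url, html: "job" in html.lower())
```

Notice that the focused version only follows links found on pages it judged relevant, which is exactly what sets up some of the problems described next.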

There are a number of different approaches for focused crawling, but a crawler may end up crawling irrelevant pages or miss relevant pages for a number of reasons:

Diversity of Design and Structure – Web pages vary widely in design and structure, and a crawler that follows a single set of rules when it looks for pages may not be very accurate at judging relevance across a broad spectrum of pages.

Irrelevant Pages in a Chain of Links – Many focused crawlers assume that pages containing a specific type of content tend to link to one another. That assumption can be misguided: if a crawler refuses to follow a link to a page that doesn’t appear to contain the content being looked for, and relevant pages lie further along a chain of links passing through that page, those relevant pages will be missed (see the sketch after this list).

Pages Accessible Only Through Forms – Sometimes it is necessary to fill out a form, such as a search form for job listings, to access relevant web content, such as job listings and descriptions. Forms differ so much from one site to another, and even within the same site, that relevant content can be easily missed if a crawler doesn’t understand how to fill out many different types of forms.

Lack of Access to Restricted Content – A site owner might not want pages indexed that are relevant to the focus of a crawl.
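The chain-of-links problem in that list is often handled with a technique called tunneling: the crawler is allowed to pass through a limited number of apparently irrelevant pages before it abandons a path. Here is a rough sketch along those lines (my own illustration, not the patent’s method; fetch() and extract_links() stand in for the fetching and link-extraction helpers from the earlier sketch):

```python
# Sketch of "tunneling" in a focused crawl: links on an irrelevant page are
# still followed, but only until max_tunnel irrelevant pages in a row have
# been crossed on that path. Illustrative only.
def focused_crawl_with_tunneling(seed, is_relevant, fetch, extract_links,
                                 max_tunnel=2, limit=50):
    seen, stored = set(), {}
    queue = [(seed, 0)]                      # (url, irrelevant pages crossed so far)
    while queue and len(stored) < limit:
        url, crossed = queue.pop(0)
        if url in seen:
            continue
        seen.add(url)
        html = fetch(url)
        if html is None:
            continue
        if is_relevant(url, html):
            stored[url] = html
            next_crossed = 0                 # a relevant page resets the tunnel
        elif crossed < max_tunnel:
            next_crossed = crossed + 1       # keep going, but note the detour
        else:
            continue                         # too deep into irrelevant territory
        for link in extract_links(url, html):
            queue.append((link, next_crossed))
    return stored
```

The trade-off sits in max_tunnel: too small and relevant pages beyond a few irrelevant ones are still missed; too large and the focused crawl drifts back toward a free crawl.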

Things that a Web Crawler Can Learn from Watching Someone Browse a Site

These are some of the things that a web crawler can learn from a user:

  • Which web pages are most likely to be relevant
  • Which web pages are least likely to be relevant
  • How best to fill out forms to access dynamic content
  • How and why to click on particular parts of a page being browsed, such as links, buttons, or tabs
  • How to select values from drop-down menus
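The patent filing discussed below doesn’t spell out how such observations would be stored, but one hypothetical shape for them might look like this (every field name here is mine, chosen only to illustrate the list above):

```python
# Hypothetical records for what a crawler might learn from watching a user.
# The structure and field names are illustrative, not from the patent filing.
from dataclasses import dataclass, field

@dataclass
class PageLabel:
    url: str
    label: str                 # e.g. "relevant" or "irrelevant"

@dataclass
class FormFillHint:
    form_url: str
    text_fields: dict = field(default_factory=dict)       # e.g. {"keywords": ""}
    dropdown_choices: dict = field(default_factory=dict)  # e.g. {"location": "Anywhere"}

@dataclass
class ClickTarget:
    page_url: str
    element: str               # the link, button, or tab the user clicked
    reason: str = ""           # why it mattered, e.g. "leads to more job listings"

# A captured session might then simply be a list of observations like these:
observations = [
    PageLabel("http://www.example.com/jobs/123", "relevant"),
    PageLabel("http://www.example.com/about", "irrelevant"),
    FormFillHint("http://www.example.com/jobs/search",
                 text_fields={"keywords": ""},
                 dropdown_choices={"location": "Anywhere"}),
    ClickTarget("http://www.example.com/jobs/", "Next page link",
                "leads to more job listings"),
]
```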

A patent application from Yahoo explores more deeply how a person can assist the crawling of Web pages:

Automatically Fetching Web Content with User Assistance
Invented by Amit Jaiswal, Arup Malakar, and Binu Raj
Assigned to Yahoo
US Patent Application 20090019354
Published January 15, 2009
Filed September 11, 2007

Abstract

A method for performing activities on a web site is disclosed. A user’s browsing activities on a web site are captured. The user’s browsing activities includes affixing labels to web pages and filling out forms. The captured activities are analyzed for patterns.

Rules for performing activities on a web site are generated based on the patterns. Further activities are performed on the web site according to the rules and content from the web site is fetched. The fetched content is used in various web service applications, including crawlers.

The rules that a web crawling program might learn from watching someone use a particular site could be expanded by the program to perform other activities on those pages that the user may not have performed.

Example:

A web site has many links, which fall into three categories: job listings, non-job-related sections, and links to the site’s homepage.

Someone may have visited some of the job listing pages, but not all of them.

The crawling program may learn rules from those visits and use them to figure out how to visit all of the job listing pages.
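As a guess at the mechanics (the patent application doesn’t give this level of detail), the visited job listing URLs might share a common structure that the program can generalize into a pattern covering job listing pages the user never clicked:

```python
# Sketch of generalizing from a few visited job-listing URLs to a pattern that
# should match the rest. The URLs and the inference step are made up for
# illustration; a real system would need to be far more careful.
import re
from urllib.parse import urlparse

visited_job_pages = [
    "http://www.example.com/jobs/1041",
    "http://www.example.com/jobs/1077",
    "http://www.example.com/jobs/2315",
]

# Infer a rule: a shared path prefix followed by a numeric identifier.
prefix = "/".join(urlparse(visited_job_pages[0]).path.split("/")[:-1])
pattern = re.compile(re.escape(prefix) + r"/\d+$")

def looks_like_job_listing(url):
    return bool(pattern.search(urlparse(url).path))

print(looks_like_job_listing("http://www.example.com/jobs/9999"))   # True
print(looks_like_job_listing("http://www.example.com/about/team"))  # False
```

A rule like that could then feed the kind of focused crawl loop sketched earlier, letting the program reach job listing pages the user never actually visited.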

Conclusion

The Automatically Fetching patent application provides more details on how user/site interaction might help a search engine crawling program support the three kinds of activities I mentioned at the start of this post:

  • a webmaster training a crawler how to find pages that should be indexed on their site,
  • a human reviewer teaching a crawler how to find pages; and
  • a content subscriber showing a crawling program the kind of information that they would like to subscribe to and be sent updates about.

There are a couple of older related patent applications from Yahoo that could be used with the methods described in this one. They are worth a look if you want to find out more about how Yahoo might be trying to index web content that poses some of the problems focused crawling programs often face:

I’ve seen a number of people mention in different places on the Web that search engines might be using toolbars and other tools to learn about new pages that they haven’t yet indexed.

The Yahoo patent application takes that assumption a step further, showing how a search engine could teach its crawling programs to index more pages, and to create site-specific rules about indexing pages, by paying attention to how people browse the web, interact with pages, and fill out forms.

Allowing webmasters and people who want to subscribe to content to explicitly teach crawling programs about pages could shift some of the burden away from the search engines and onto the people who use the services those search engines provide.

This is definitely a step beyond today’s XML Sitemaps.


14 thoughts on “Next Step After XML Sitemaps: User Assisted Web Crawlers?”

  1. Nice to know this, if the day really comes, it would be quite tasky. Good or bad? We’ll have to try using it first. Cheers!

  2. Hi Darren,

    I suspect that we might someday see something like this from the search engines, where site owners can provide more information about their sites, and the ways that people use them. I’m not sure if it will be as transparent as the process that the patent application above describes.

    For example, if you decide to use Google Analytics on your site, you are providing a fair amount of information to Google about how people use your site. If you provide Google with an XML sitemap, you’re telling them which pages should be indexed on your site, and within that sitemap you can even let Google know which pages are the most important and how frequently content changes on those pages. Google also lets you tell them whether you prefer the version of your URLs with or without a “www”.

    It wouldn’t be surprising to see a webmaster providing even more information to help a search crawler index the pages of his or her site. As you say, “Good or Bad? We’ll have to try using it first.” :)

  3. Hi Diamonds,

    The processes described in this patent application aren’t available yet (and we’ll have to wait to see if they do become available).

    But, if you haven’t explored some of the other webmaster tools that Google, Yahoo, and Microsoft provide, those are worth taking a look at. If you verify your site with Google, it does provide a list of what it perceives as crawling errors that can be informative.

  4. Hi William, thanks for your reply. I’m getting a better understanding now on how Google XML works. ;-)

    I only used it as my friend told me that it helps improve one’s ranking. Lol.

    Cheers!

  5. Hi Darren,

    You’re welcome.

    The Google XML sitemaps don’t actually help with the rankings of pages, but rather provide information to Google about pages that exist on your site, to let them know that the pages listed are pages that you would like them to try to index. Google tells us on their sitemaps help page:

    Sitemaps provide additional information about your site to Google, complementing our normal methods of crawling the web. We expect they will help us crawl more of your site and in a more timely fashion, but we can’t guarantee that URLs from your Sitemap will be added to the Google index. Sites are never penalized for submitting Sitemaps.

  6. Pheww** It is one of the posts that’s speaking in-depth info about sitemaps.. And I really took so much time to grab this, William. Great to see someone working (at least exploring) things beyond today’s XML sitemaps.. I’m sure, I’d keep visiting your blog now :)
    Cheers!

  7. Hey William,

    Thanks for the reply

    I do use the webmaster tools for Google, Yahoo and MSN. It’s pretty decent, though, and Google is constantly updating it, which is good for us webmasters.

  8. Hi Diamonds,

    You’re welcome. I do like that the webmaster tools provide some insight into what the search engines perceive as errors on the pages of a site.

    We can see a start on some of the ideas behind this patent application in the Yahoo Site Explorer tools section dealing with Dynamic URLs, where a site owner can give the search engine information about the structure of the URLs that they use on their site. Follow the link to “Dynamic URLs” from this page:

    http://help.yahoo.com/l/us/yahoo/search/siteexplorer/

    It will be interesting to see if they develop those tools further to be more user friendly…

  9. Very insightful article.

    Bill, I think the future of search engines is not in directing the crawler, but in replacing the crawler with human eyes. I think that today Google has the technology to replace the crawler with human eyes (through social media sites such as Digg and Reddit). We just need to wait and SEE 8-|

  10. Hi Romeo,

    I think a crawler is necessary to dig through a site and understand what the pages are about, and where else they might lead, but I agree with you that search engines are paying more attention to how people browse and search on the Web. Some interesting times are ahead of us.

  11. Hi,

    Nice post. But as far as I can tell, Google’s crawlers are more sensible than humans. They can identify the most and least sensible parts of a website. I think these sensible parts can easily be identified in GA.

  12. Some interesting points, CAP…

    Crawling web pages quickly, and trying to understand the differences in coding and structure for millions of web sites is a pretty challenging task.

    Crawlers collect the URLs that they see, grab content from pages, and likely perform many other tasks as well, such as looking for duplicate content, gauging whether pages are web spam, and more. Breaking pages down into parts, such as a main content area, headers and footers and sidebars, and attempting to calculate relationships between multiple pages of a site or of a number of sites is also something that they can do.

    I think the basic premise behind the approach that Yahoo describes in the patent filing is reasonable – let webmasters help if possible.

    Installing something like Google Analytics might help Google understand some of the dynamics of the pages of a site, and that could be somewhat helpful in crawling pages and indexing the content of those pages, but it’s not something that everyone is doing.

  13. Pingback: SEO Daily Reading - Issue 146 « Internet Marketing Blog
