Is There a Future for User-Assisted Web Crawlers?
Why would a search engine let users assist its crawling program in exploring the content of pages normally hidden from most crawlers?
Users Teaching Web Crawlers?
Here are three examples of people who might help teach a web crawler how to crawl a site:
While search engines have been providing tools to make it easier for webmasters to have the pages of their sites indexed, such as those found at Google Webmaster Tools, Yahoo Site Explorer, and Webmaster Center – Bing, none of those tools addresses the problem of content that is hidden from crawlers.
Webmasters – A web admin might help teach a web crawling program about the most effective way to crawl their website by browsing the pages of the site in a certain order, filling out forms on the site, and interacting with the site’s pages in the manner the webmaster intended.
Those interactions could be captured to create rules for crawling the site by learning from those activities. The rules could then be used in the future by a web crawler to crawl the site.
Manual Reviewers – Someone manually reviewing the content and structure of web pages, to see whether a search engine could index those pages more effectively, could set up rules for a crawling program to follow links logically or fill out search forms to best find relevant pages on sites. These reviewers are a second group who could be involved with user-assisted web crawlers.
Content Subscribers – Programs like RSS feeds, and mashup tools can bring content from a site to someone interested in seeing it, without that person having to visit multiple web pages.
If people interested in that content could train a program to crawl through forms found at places like job search sites or travel pages or other websites that keep content behind forms, it could help them have web content from a site automatically fetched and delivered to them. This is a third group that could help with user-assisted web crawlers.
Problems with Focused Crawling
There are two common types of web crawling.
Free crawling – when a crawling program finds a page, it stores the page and the page’s address or URL and follows any links it can find on that page to locate other web pages.
Focused crawling – a crawling program tries to crawl only pages which contain a specific type of content, or “relevant” web pages.
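The difference between the two can be sketched in a few lines of Python. This is only a toy illustration under stated assumptions: the `SITE` dictionary, its URLs, and the keyword-based `is_relevant` test are all made up for the example, and a real crawler would fetch pages over HTTP and use a far better relevance test.

```python
from collections import deque

# Hypothetical toy site: each URL maps to (page text, outgoing links).
SITE = {
    "/": ("Welcome to our company", ["/jobs", "/about"]),
    "/jobs": ("Job listings: open positions", ["/jobs/1", "/jobs/2"]),
    "/jobs/1": ("Job: software engineer", []),
    "/jobs/2": ("Job: technical editor", []),
    "/about": ("Our history", ["/press"]),
    "/press": ("Press releases", []),
}


def is_relevant(text):
    # Stand-in relevance test (an assumption): a simple keyword match.
    return "job" in text.lower()


def crawl(start, focused=False):
    """Breadth-first crawl of SITE.

    A free crawl stores every page and follows every link. A focused
    crawl stores and expands only relevant pages (the start page is
    always expanded so the crawl has somewhere to begin).
    """
    seen, queue, stored = {start}, deque([start]), []
    while queue:
        url = queue.popleft()
        text, links = SITE[url]
        relevant = is_relevant(text)
        if focused and not relevant and url != start:
            continue  # focused crawl: do not store or expand this page
        if not focused or relevant:
            stored.append(url)
        for link in links:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return stored


free_pages = crawl("/")
focused_pages = crawl("/", focused=True)
```

Here the free crawl stores all six pages, while the focused crawl stores only the three job pages and never looks past "/about".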
There are many different approaches for focused crawling, but a crawler may end up crawling irrelevant pages or miss relevant pages for several reasons:
Diversity of Design and Structure – There is much diversity and variation amongst the design and structure of web pages. If the crawler follows a single set of logic or rules when it looks for pages, it might not judge the relevancy of pages accurately across a broad spectrum of sites.
Irrelevant Pages in a Chain of Links – An assumption many focused crawlers follow is that pages that contain a specific type of content are often linked to each other. That can be misguided. If a crawler doesn’t follow a link to a page that seems not to contain the specific type of content looked for, and relevant pages lie further along a chain of links that includes that page, then those relevant pages may be missed.
Pages Accessible Only Through Forms – Sometimes it is necessary to fill out a form, such as a search form for job listings, to access relevant web content, such as job listings and descriptions. Forms differ so much from one site to another, and even within the same site, that relevant content can be easily missed if a crawler doesn’t understand how to fill out many different types of forms.
Lack of Access to Restricted Content – A site owner might not want pages indexed that are relevant to the focus of a crawl.
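The chain-of-links problem above is easy to reproduce. The sketch below is my own illustration, not something taken from the patent filing: the `PAGES` graph is hypothetical, and the `max_irrelevant_hops` budget is just one simple way a focused crawler might tolerate a few irrelevant pages in a row before giving up on a chain.

```python
from collections import deque

# Hypothetical link graph: url -> (is_relevant, outgoing links).
# "/careers" looks irrelevant but is the only route to the job listings.
PAGES = {
    "/": (False, ["/careers", "/news"]),
    "/careers": (False, ["/careers/openings"]),
    "/careers/openings": (True, ["/careers/openings/42"]),
    "/careers/openings/42": (True, []),
    "/news": (False, ["/news/2009"]),
    "/news/2009": (False, []),
}


def focused_crawl(start, max_irrelevant_hops):
    """Collect relevant pages, following at most `max_irrelevant_hops`
    consecutive irrelevant pages after the start page.

    Simplification: each page is visited once, via the first path
    that reaches it.
    """
    seen, queue, found = {start}, deque([(start, 0)]), []
    while queue:
        url, misses = queue.popleft()
        relevant, links = PAGES[url]
        if relevant:
            found.append(url)
            misses = 0  # a relevant page resets the budget
        elif url != start:
            misses += 1
            if misses > max_irrelevant_hops:
                continue  # give up on this chain of links
        for link in links:
            if link not in seen:
                seen.add(link)
                queue.append((link, misses))
    return found


strict = focused_crawl("/", max_irrelevant_hops=0)
tolerant = focused_crawl("/", max_irrelevant_hops=1)
```

With no tolerance, the crawler stops at "/careers" and finds nothing. Allowing one irrelevant hop lets it reach both job pages, at the cost of also fetching "/news".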
Things that a Web Crawler Can Learn from Watching Someone Browse a Site:
These are some of the things that a web crawler can learn from a user:
- Which web pages are most likely to be relevant
- Which web pages are least likely to be relevant
- How to best fill out forms, to access dynamic content
- How and why to click on particular parts of a page being browsed such as URLs or buttons or tabs
- How to select values from drop-down menus
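To make the form-related items on that list more concrete, here is a minimal sketch of how recorded form submissions might be turned into a rule. Everything in it is an assumption for illustration: the field names, the `RECORDED` session, the `OPTIONS` read from the form, and the heuristic itself (keep fields the user never changed fixed, and try every option for fields the user varied). The patent filing does not spell out this particular heuristic.

```python
from itertools import product

# Hypothetical recorded session: submissions a user made on a job search form.
RECORDED = [
    {"category": "engineering", "country": "US", "sort": "date"},
    {"category": "editing", "country": "US", "sort": "date"},
]

# Hypothetical drop-down options read from the form's markup.
OPTIONS = {
    "category": ["engineering", "editing", "sales"],
    "country": ["US", "UK", "IN"],
    "sort": ["date", "relevance"],
}


def learn_form_rule(recorded):
    """Fields the user kept constant are fixed to that value.
    Fields the user varied are marked None, meaning "try every option"."""
    rule = {}
    for field in recorded[0]:
        values = {submission[field] for submission in recorded}
        rule[field] = values.pop() if len(values) == 1 else None
    return rule


def expand_rule(rule, options):
    """Turn a rule into the full list of form submissions to try."""
    fields = list(rule)
    choices = [
        [rule[f]] if rule[f] is not None else options[f] for f in fields
    ]
    return [dict(zip(fields, combo)) for combo in product(*choices)]


rule = learn_form_rule(RECORDED)
submissions = expand_rule(rule, OPTIONS)
```

The user only searched two categories, but the learned rule also submits the form for "sales", a search the user never performed, while leaving the country and sort order as the user had them.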
A patent application from Yahoo explores more deeply how user-assisted web crawlers might work:
Automatically Fetching Web Content with User Assistance
Invented by Amit Jaiswal, Arup Malakar, and Binu Raj
Assigned to Yahoo
US Patent Application 20090019354
Published January 15, 2009
Filed September 11, 2007
A method for performing activities on a website is disclosed. A user’s browsing activities on a website are captured. The user’s browsing activities include affixing labels to web pages and filling out forms. The captured activities are analyzed for patterns.
Rules for performing activities on a website are generated based on the patterns. Further activities are performed on the website according to the rules, and content from the website is fetched. The fetched content is used in various web service applications, including crawlers.
The rules that a web crawling program might learn from watching someone use a particular site could be expanded by the program to perform other activities on those pages that the user may not have performed.
User-Assisted Web Crawlers Example:
The many links on a website are divided into three categories: job listings, non-job-related sections, and links to the site’s homepage.
Someone may have visited some of the job listing pages, but not all of them.
The crawling program may learn rules from those visits to the job listing pages to determine how to visit all of the job listing pages.
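One way to picture that step is as URL pattern generalization. The sketch below is a guess at the flavor of such a rule rather than the method in the patent filing, and the URLs and the `learn_url_pattern` helper are hypothetical. It compares the job listing URLs a person visited, keeps the path segments they share, and turns the segments that differ into a wildcard.

```python
import re


def learn_url_pattern(visited):
    """Generalize visited URLs into one regex by replacing the path
    segments that differ with a wildcard.

    Assumes all visited URLs have the same number of path segments.
    """
    split = [url.strip("/").split("/") for url in visited]
    if len({len(parts) for parts in split}) != 1:
        raise ValueError("visited URLs must have the same depth")
    pieces = []
    for column in zip(*split):
        if len(set(column)) == 1:
            pieces.append(re.escape(column[0]))  # shared segment: keep it
        else:
            pieces.append(r"[^/]+")              # varying segment: wildcard
    return re.compile("^/" + "/".join(pieces) + "/?$")


# Hypothetical pages the user actually visited.
VISITED = ["/jobs/101", "/jobs/205"]

# Hypothetical set of all links found on the site.
ALL_LINKS = [
    "/", "/jobs/101", "/jobs/205", "/jobs/317", "/jobs/422",
    "/about/team", "/contact",
]

pattern = learn_url_pattern(VISITED)
to_crawl = [link for link in ALL_LINKS if pattern.match(link)]
```

From two visits the crawler learns a rule that also covers "/jobs/317" and "/jobs/422", which the user never opened, while skipping the homepage and the non-job sections.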
The Automatically Fetching patent application provides more details on how user/site interaction might be used to help a search engine crawling program address the three kinds of activities I mentioned at the start of this post:
- a webmaster training a crawler how to find pages that should be indexed on their site,
- a human reviewer teaching a crawler how to find pages, and
- a content subscriber showing a crawling program the kind of information that they would like to subscribe to and be sent updates about.
There are a couple of older related patent applications from Yahoo that could be used with the methods described in this one. They are worth a look if you want to find out more about how Yahoo might be trying to index content on the web in ways that address some of the problems that focused crawling programs often face:
I’ve seen many people mention in different places on the Web that search engines might be learning about new pages to index using toolbars and other tools to find new pages that they haven’t indexed.
This patent application from Yahoo takes that assumption about toolbars a step further. It shows how a search engine could teach crawling programs to index more pages, and to create site-specific rules about indexing pages, by paying attention to how people browse the web, interact with pages, and fill out forms.
Allowing web admins and people who want to subscribe to content to teach crawling programs about pages explicitly could take some of the burden of work away from the search engines and move it to the people who use the services those search engines provide.
This is definitely a step beyond today’s XML Sitemaps.
14 thoughts on “Next Step After XML Sitemaps: User-Assisted Web Crawlers?”
Nice to know this, if the day really comes, it would be quite tasky. Good or bad? We’ll have to try using it first. Cheers!
I suspect that we might someday see something like this from the search engines, where site owners can provide more information about their sites, and the ways that people use them. I’m not sure if it will be as transparent as the process that the patent application above describes.
For example, if you decide to use Google Analytics on your site, you are providing a fair amount of information to Google about how people use your site. If you provide Google with an XML sitemap, you’re telling them which pages should be indexed on your site, and you can even let Google know which pages are the most important, and how frequently content changes on those pages within that sitemap. Google also lets you tell them whether you prefer the version of your URLs with or without a “www”.
It wouldn’t be surprising to see a webmaster providing even more information to help a search crawler index the pages of his or her site. As you say, “Good or Bad? We’ll have to try using it first.” 🙂
The processes described in this patent application aren’t available yet (and we’ll have to wait to see if they do become available).
But, if you haven’t explored some of the other webmaster tools that Google, Yahoo, and Microsoft provide, those are worth taking a look at. If you verify your site with Google, it does provide a list of what it perceives as crawling errors that can be informative.
Lots of stuff but important to do them though. Got a couple points, will try them out. thanks
Hi William, thanks for your reply. I’m getting a better understanding now on how Google XML works. 😉
I only used it as my friend told me that it helps improve one’s ranking. Lol.
The Google XML sitemaps don’t actually help with the rankings of pages, but rather provide information to Google about pages that exist on your site, to let them know that the pages listed are pages that you would like them to try to index. Google tells us on their sitemaps help page:
Pheww** It is one of the posts that’s speaking in-depth info about sitemaps.. And I really took so much time to grab this, William. Great to see someone working (at least exploring) things beyond today’s XML sitemaps.. I’m sure, I’d keep visiting your blog now 🙂
Thanks for the reply
I do use webmaster tools for Google, Yahoo and MSN. They’re pretty decent, and Google is constantly updating theirs, which is good for us webmasters.
You’re welcome. I do like that the webmaster tools provide some insight into what the search engines perceive as errors on the pages of a site.
We can see a start in the development of some of the ideas behind this patent application in action in the Yahoo Site Explorer tools section dealing with Dynamic URLs, where a site owner can give the search engine information about the structure of the URLs that they use on their site. Follow the link to “Dynamic URLs” from this page:
It will be interesting to see if they develop those tools further to be more user friendly…
Very insightful article.
Bill, I think the future of search engines is not by directing the crawler, but replacing the crawler by human eyes. I think that today Google has the technology to replace crawler by human eyes (such as social media sites, Digg and reddit). We just need to wait and SEE 8-|
I think a crawler is necessary to dig through a site and understand what the pages are about, and where else they might lead, but I agree with you that search engines are paying more attention to how people browse and search on the Web. Some interesting times are ahead of us.
Nice post. But Google’s crawlers are more sensible than humans. They could identify the most and least sensible parts of a website. I think these sensible parts can be easily identified in GA.
Some interesting points, CAP…
Crawling web pages quickly, and trying to understand the differences in coding and structure for millions of web sites is a pretty challenging task.
Crawlers collect the URLs that they see, grab content from pages, and likely perform many other tasks as well, such as looking for duplicate content, gauging whether pages are web spam, and more. Breaking pages down into parts, such as a main content area, headers and footers and sidebars, and attempting to calculate relationships between multiple pages of a site or of a number of sites is also something that they can do.
I think the basic premise behind the approach that Yahoo describes in the patent filing is reasonable – let webmasters help if possible.
Installing something like Google Analytics might help Google understand some of the dynamics of the pages of a site, and that could be somewhat helpful in crawling pages and indexing the content of those pages, but it’s not something that everyone is doing.