There’s a little park straddling Delaware and Maryland with a monument marking the boundary between the states. Etched across the top of the stone marker is a line indicating the separation between the two states, along with the point where the arc separating Delaware from Pennsylvania begins. If you look at a map of the border, you’ll see that the top of the state of Delaware is an arc drawn with a 12-mile radius from a cupola atop a courthouse in Historic New Castle, Delaware. The arc between Delaware and Pennsylvania was defined in a deed to William Penn from the Duke of York in 1682. Maryland’s territory was also involved in the setting of those borders.
You can hop atop the marker and sit on the state line if you’d like. Woods surround the monument, and you have to travel down a path in the park to reach it.
We take the surveying of such lines for granted, whether between states, between countries, or around towns, cities, and counties, as well as the exploration and discovery of the places where we live. The programs that search engines use to discover new pages on the Web and revisit old ones are a little like those explorers and surveyors, finding material online to add to their indexes so that we can explore those indexes and search for information and pages hosted on servers scattered around the globe.
Those programs are often referred to as crawlers or spiders or robots or bots, and many constraints limit how well they can explore and make sense of the pages that we find online.
Crawlers from the major search engines tend to be fairly simple and don’t view pages the way we do with browsers. They often don’t run the JavaScript that our browsers execute when we visit pages, and they don’t resolve images to read any text that might appear within them.
Simple and Complex Crawling Programs
In April, IBM was granted a patent (originally filed on June 30, 2000) that described a Web crawling program that would see pages on the Web in much the same way that we see them when we browse. The patent, System and method for enhanced browser-based web crawling, looks at the “inline-frames, frames, images, applets, audio, video, or equivalent” on web pages and renders them to get an understanding of the final HTML markup that shows up at a URL when someone visits a page. It even describes using Optical Character Recognition (OCR) software to read text that may appear in images.
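To give a rough sense of what that image-reading step might involve, here’s a sketch of the general idea (my own illustration, not IBM’s implementation), assuming the Python requests, Pillow, and pytesseract libraries plus a local Tesseract install:

```python
# A rough sketch of the general idea, not IBM's implementation: fetch an image
# referenced on a crawled page and run OCR over it to recover any text it shows.
from io import BytesIO

import requests
from PIL import Image
import pytesseract


def text_from_image(image_url):
    """Download an image and return whatever text OCR can read from it."""
    response = requests.get(image_url, timeout=10)
    image = Image.open(BytesIO(response.content))
    return pytesseract.image_to_string(image)


# Hypothetical URL; the extracted text could then be indexed
# alongside the page's regular HTML text.
print(text_from_image("https://example.com/banner.png"))
```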
If a search engine were to follow the detailed exploration process described in IBM’s patent, it would probably be computationally expensive, and it would likely take a fair amount of time and effort to index many pages. The crawlers that the major commercial search engines use seem much simpler and don’t explore the pages on the Web in that much depth. Google’s Webmaster Guidelines describe the simplicity of the crawling programs that they use with this statement:
Use a text browser such as Lynx to examine your site because most search engine spiders see your site much as Lynx would. If fancy features such as JavaScript, cookies, session IDs, frames, DHTML, or Flash keep you from seeing all of your site in a text browser, then search engine spiders may have trouble crawling your site.
Lynx is a very early and very simple web browser that lets you look at the text on pages.
Watching Out for Cookies
One of the efforts that someone performing search engine optimization can and should undertake on a site is to see how search-engine-friendly its pages are. Part of that inquiry is making sure that search engine crawling programs can visit all of the pages that the site owner wants indexed, and that search engines can index meaningful information from the pages they crawl. One stumbling block is when a search engine crawling program must take a “cookie” to see pages.
A cookie is a small string of text that a site might send to be stored on a visitor’s computer. A cookie usually consists of name-value pairs storing information about a visitor’s travels on the site, such as the contents of a shopping cart, preferences for the site, and details that can help track the pages a visitor goes to on that site. A cookie can help a site personalize the experience that a visitor has on its pages. Crawlers don’t usually take cookies, and they may not be able to visit pages where accepting a cookie is required.
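As a small illustration of those name-value pairs (the cookie names and values here are made up), this is how a cookie string breaks down, parsed with Python’s standard library:

```python
# A small, made-up example of the name-value pairs inside a cookie string.
from http.cookies import SimpleCookie

cookie = SimpleCookie()
cookie.load("sessionid=abc123; cart=book42; results_per_page=50")

for name, morsel in cookie.items():
    print(f"{name} = {morsel.value}")
# sessionid = abc123
# cart = book42
# results_per_page = 50
```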
Cookie-Enabled Search Crawlers
A newly published patent filing from Google describes how it might enable crawling programs to accept cookies when visiting the pages of a site. One of the challenges is that a search engine may have more than one crawler, spider, or robot visiting the pages of a site at the same time, and it would be ideal if those crawlers “shared” a cookie. That’s the focus of the patent filing:
Search engine with multiple crawlers sharing cookies
Invented by Anurag Acharya, Michal Louz-On, Alexander C. Roetter
Assigned to Google
US Patent 7,546,370
Granted June 9, 2009
Filed: August 18, 2004
The patent identifies the problems that search crawlers have with sites that require cookies as follows:
Conventional network crawlers have no facility for obtaining such cookies, nor for handling various cookie error conditions. As a result, conventional web crawlers cannot crawl a full set of pages or documents in websites that require cookies, thereby reducing the amount of information available through the use of such search engines.
In addition, conventional network crawlers have no facilities for coordinating the efforts of a parallel set of network crawlers concerning crawling a full set of pages or documents in websites that require cookies. Therefore, there is a need for an improved search engine that uses multiple crawlers to access websites that require cookies.
The patent filing goes into a great amount of detail on cookies and how search crawling programs might share them. There’s no indication that Google has started to crawl pages that can only be visited by accepting cookies, but it might in the future.
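To make the “shared cookies” idea a little more concrete, here’s a minimal sketch of several crawler threads pooling the cookies they receive from a site. This is my own illustration, assuming the Python requests library, and not what Google’s patent actually specifies:

```python
# A minimal sketch of crawlers sharing cookies per site; an illustration only,
# not the approach described in Google's patent.
import threading

import requests


class SharedCookieStore:
    """Cookies keyed by host, shared by every crawler thread."""

    def __init__(self):
        self._cookies = {}            # host -> {cookie name: value}
        self._lock = threading.Lock()

    def get(self, host):
        with self._lock:
            return dict(self._cookies.get(host, {}))

    def update(self, host, response_cookies):
        with self._lock:
            self._cookies.setdefault(host, {}).update(response_cookies.get_dict())


store = SharedCookieStore()


def crawl(url, host):
    # Send whatever cookies other crawlers already collected for this host,
    # then merge any new cookies the site sets back into the shared store.
    response = requests.get(url, cookies=store.get(host), timeout=10)
    store.update(host, response.cookies)
    return response.text


urls = ["https://example.com/", "https://example.com/page2"]  # hypothetical site
threads = [threading.Thread(target=crawl, args=(u, "example.com")) for u in urls]
for t in threads:
    t.start()
for t in threads:
    t.join()
```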
Until then, if you own or work on a website that requires visitors to accept cookies to see certain pages and you want those pages indexed, make sure that search engines aren’t required to accept cookies to reach them.
At some point, we may even start seeing crawling programs like the one described in the IBM patent, which looks at the text in images, information that shows up in frames and iframes, and other parts of pages that are triggered by JavaScript and other applets.
When they do, search engine indexes may be more like the maps we have today than the surveys of geographical borders made by surveyors in years gone by.
Very impressive… it’ll be interesting to see how this might change search rankings in the future.
Interesting article – learned a little bit more about the future of SEO.
Hi jlbraaten,
Definitely interesting. How much might search listings change if Google started showing sites in search results after indexing pages that required visitors to accept cookies? How many more pages would become available in the search index that weren’t in the past? If Google started showing such results tomorrow, and search rankings for pages looked very different, how many people would recognize why?
Hi Jason,
Thanks. I thought that the IBM patent was pretty interesting. Search engine indexes would look very different if crawlers started reading text in images, the contents of frames and iframes (and associating that content with the pages framing them), and text and links that only appear after applets have run their course. Would search indexes contain more relevant information if crawlers could capture that information?
I suspected that one of the reasons why Google acquired Green Border Technologies might have been to allow the search engine to crawl pages in an enhanced fashion like that described in the IBM patent, in a safe “sandbox” type environment, so that if any of the applets on a page triggered some kind of malware, the effects of that malware would be isolated. Interesting that Google is able to detect malware on the pages of a site these days, isn’t it? Are they doing some kind of enhanced crawling of web pages?
Does this mean that Google will be able to crawl people’s sessions? I thought cookies were related to people’s sessions, meaning that sites will become much bigger than Google knew them to be.
Hi Adam,
Cookies can be used for a number of different purposes, such as saving a visitor’s preferences, authenticating the identity of a user, storing information about selections for a shopping cart, or tracking the progress of someone making their way through the pages of a site. Some user preferences saved in a cookie require a person to log in, but others don’t – for instance, you can set Google to show you more than 10 search results at a time without logging in, and that preference is saved in a cookie.
Some sites may require that a visitor accept a cookie before they can navigate the pages of a site. If you want to experiment to see pages that might do so, you can change your browser settings to either ask (prompt) you before accepting cookies, or block cookies.
If a crawler accepts cookies, it may be able to visit pages that require cookies for navigation. But it’s not going to see the personalization and unique content that may be available to specific users based upon their preference selections, or upon authorization to visit some pages through a login.
On Google’s help page titled URLs not followed errors, they tell us that they may have problems visiting some URLs for pages where cookies are required for navigation. Some sites just won’t let you progress through their pages unless you accept a cookie. When a search engine is required to accept a cookie to visit pages and it can’t, then it won’t visit those pages.
Bill,
As you mentioned in this post, the Lynx browser has been a good tool for determining how a search engine views your site.
An additional tool I like is http://browsershots.org/ – which allows SEOs and web designers to see how their HTML markup and CSS layouts are displayed in a multitude of browsers and operating systems, without having to run them all on their PC. It can take a while to process the site, but it is worth the wait.
Hi People Finder,
Thank you. Browsershots is a really nice service. It’s definitely one worth bookmarking and using.
No matter how search engine friendly your site is, ultimately it’s what your visitors do on your site once they get there that’s important. Making sure that they can read your pages, fill out your forms, sign up for your newsletters, subscribe to your feeds, etc., is essential.
I am thinking of implementing session cookies (temporary cookies) to prevent spammers’ crawlers/bots from harvesting emails, and to use them as a replacement for a CAPTCHA, which can be very hard to read sometimes. I had a web gallery once, and the visitor message form was spammed regularly every day until I implemented a very short piece of PHP code on that page; the spam messages disappeared for good (it was very effective and got rid of all the spammers but one). What I did was set a session cookie at the beginning of the page, and then on the message page I checked whether the session was set; if not, the Submit button was disabled. This was some 4 years ago, so I don’t know how clever the robots are now, and if next-generation robots are able to use cookies, then my little trick won’t work. Intelligent search engine robots are welcome, but then it’s inevitable that spammers’ robots will also exploit the newfound functionality.
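Phillip’s check was written in PHP; a rough equivalent of the same idea, sketched here in Python with Flask (the route names and the session flag are just placeholders), might look like this:

```python
# A rough Python/Flask equivalent of the session-cookie check described above
# (the original was PHP); routes and the session flag name are placeholders.
from flask import Flask, request, session

app = Flask(__name__)
app.secret_key = "replace-with-a-real-secret"  # needed for signed session cookies


@app.route("/gallery")
def gallery():
    # Any normal visit sets a flag in the session, which rides along in a cookie.
    session["visited_gallery"] = True
    return "Gallery page"


@app.route("/message", methods=["GET", "POST"])
def message():
    # A bot that never accepted the session cookie won't have the flag,
    # so its form submission is refused.
    if not session.get("visited_gallery"):
        return "Please browse the gallery first (cookies required).", 403
    if request.method == "POST":
        return "Thanks for your message!"
    return '<form method="post"><textarea name="msg"></textarea><button>Submit</button></form>'
```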
Hi Phillip,
I like the approach that you took with your web gallery. It is possible that spammers could do something similar if it made sense for them to try. If you do implement session cookies in that manner, hopefully you will let in the search engines that you do want to crawl and index your pages without requiring them to accept cookies – there’s no telling whether any of the major commercial search engines will implement this for their regular web crawlers any time soon.
Like someone said, cookies can be used for numerous purposes. I wonder if this won’t end up in spammers’ hands.
Hi tarun,
The post isn’t about the use of cookies by search engines, but rather that search engines could possibly accept cookies from sites that use them. There are sites that a conventional search engine crawling program won’t crawl because those pages require a visitor to accept a cookie before they can navigate the pages of a site.
Hiya Bill!
The idea of the search engine spiders seeing web pages in the same fashion as humans, even going so far as to OCR-read text in images, is one that really appeals to me. I feel that the closer a bot can come to seeing what a human sees, allied to software that tracks human behaviour on web pages, the better the final result is going to be for the human looking for relevant information. The bandwidth and resources required will be lots more, as you pointed out, but I think the future of search is going to get really interesting.
What I find really interesting is that IBM already filed this patent in June 2000. Why did it take so long to get granted?
Hi Jacques,
It would be interesting if they did, but possibly very computationally expensive.
In my latest post, about page load and SEO, the inventors of a patent application I wrote about suggest that a search engine crawler might spend more time on a page that people tend to spend a lot of time upon, and look at the contents of those pages in more depth, including possibly trying to see what content exists beyond just text upon those pages. Limiting this kind of expanded crawling to sites that users seem to enjoy using might be something that would limit the expense of crawling all pages with this level of depth. Something to think about.
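As a toy illustration of that idea (purely hypothetical, and not what the patent application describes in any detail), a crawler might map average visitor dwell time to how much effort it spends on a page:

```python
# A toy, purely hypothetical illustration: pages people linger on get a deeper,
# more expensive crawl; everything else gets a cheap text-only pass.
def crawl_budget(avg_dwell_seconds):
    """Map average visitor dwell time to the crawling effort a page receives."""
    if avg_dwell_seconds >= 120:
        return "full render: run scripts, expand frames, OCR image text"
    if avg_dwell_seconds >= 30:
        return "fetch text and follow frames/iframes"
    return "plain text-only fetch"


for url, dwell in [("/long-guide", 240), ("/category-page", 45), ("/tag-archive", 5)]:
    print(url, "->", crawl_budget(dwell))
```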
As for the IBM patent, it did take a long time to get granted. I haven’t taken a look at its history, but some patents really do go a long time before they are granted.