My original title for this post was, “The Yahoo Site Explorer Patent Application,” because the post is about a new patent application from Yahoo that describes some of the information that they would like to receive from webmasters to make their efforts towards indexing the web easier.
The majority of this post does describe what is found in that patent application, but as I was writing the post, I thought about how difficult Yahoo makes it for webmasters to find information about how they can use Yahoo.
This includes how fragmented Yahoo’s FAQs and Help sections are, and how much effort a webmaster has to go through to learn about all of the different services that Yahoo offers that could be helpful to those site owners, from using Site Explorer, to participating in MyBlogLog, to many other tools.
If you have suggestions for Yahoo on how they could improve how they present the services that they offer, what would those be?
Yahoo’s Site Explorer
If you own a web site or work on web sites, it’s possible that you’ve visited the Yahoo Site Explorer pages to see how many pages from your site Yahoo might have indexed, or to see how many links to your site they may have found. That’s useful for webmasters, but it isn’t all that Site Explorer has to offer.
The Site Explorer Help pages provide more details on how you can use Site Explorer to tell Yahoo more about your web site, such as:
- Whether you have XML-based sitemaps listing the pages of your site,
- Whether you have XML-based sitemaps listing the mobile versions of pages of your site,
- Whether your site has RSS feeds that people can subscribe to with an RSS feed reader,
- What language your site is published in,
- Pages, or URLs, from your site that you don’t want included in Yahoo’s index,
- Pages from your site that Yahoo may not have included in their index,
- Information about dynamic parameters in the addresses, or URLs, from your web site, such as session IDs, source trackers, and format modifiers.
Site Explorer is the subject of a new patent application published at the US Patent and Trademark Office, and the filing provides something of an inside look at the motivations behind Site Explorer, and possible directions it might take in its future development.
In many ways, Yahoo Site Explorer is similar in purpose to Google’s Webmaster tools, though each offers slightly different functions and tools.
The Purpose of Site Explorer
Yahoo provides tools at Site Explorer to let webmasters provide it with information about how their web sites are set up, to make it easier for the search engine to index pages.
The main way that search engines collect information about web sites is through programs that “crawl the web,” following hyperlinks from one page to another to discover new links, and revisiting old links to pages that may have changed.
It is possible that this crawling method may mean that pages on the Web can be missed and never get indexed, or that pages that change over time don’t get revisited.
Enabling site owners to submit XML-based sitemaps that list links to their pages, as well as XML-based sitemaps of pages meant expressly for mobile search, means that the search engine doesn’t have to work as hard to discover links on the pages of those sites.
The sitemaps could also include information such as how frequently content on the page changes.
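To make that concrete, here is a minimal sketch of building such a sitemap in Python. It assumes the sitemaps.org XML format with its optional changefreq element. The build_sitemap helper and the example.com addresses are made up for illustration and don’t come from the patent filing or from Site Explorer.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(pages):
    """Build sitemap XML from (url, changefreq) pairs and return it as a string."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for loc, changefreq in pages:
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = loc
        # changefreq is a hint about how often the page's content changes.
        ET.SubElement(url, "changefreq").text = changefreq
    return ET.tostring(urlset, encoding="unicode")


# Hypothetical pages, for illustration only.
sitemap_xml = build_sitemap([
    ("http://www.example.com/", "daily"),
    ("http://www.example.com/about.html", "yearly"),
])
print(sitemap_xml)
```

A real sitemap file would normally also start with an XML declaration and might include other optional elements such as lastmod.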
Other information provided by a site owner could make it easier for a search engine to understand some of the unique characteristics of a site, and make it easier for sites to be indexed.
Duplicate Content and Pages with Multiple Regions
In addition to lists of links in XML sitemaps, the patent mentions a couple of other areas where information from site owners might help a search engine in indexing pages.
One involves identifying duplicate content on a site, where the same content exists at two different web addresses, or URLs. This can happen on dynamic sites when extra data variables show up in URLs, so that you can have two or more different URLs for the same page. The author of the patent tells us:
Showing the same page in search results at two different URLs means that searchers end up seeing redundant pages.
If the search engine can discover, through analysis of the content of two web pages, that they are substantially identical, the search engine can discard one web page.
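The patent filing doesn’t say how that analysis would be done. One common approach, shown here only as an illustration and not as Yahoo’s method, is to break each page’s text into overlapping word sequences (often called shingles) and measure how many the two pages share. The function names and the 0.9 cutoff below are assumptions made for this sketch.

```python
def shingles(text, size=3):
    """Return the set of overlapping word sequences of length `size` in text."""
    words = text.lower().split()
    count = max(len(words) - size + 1, 1)
    return {tuple(words[i:i + size]) for i in range(count)}


def similarity(text_a, text_b):
    """Jaccard similarity: shared shingles divided by all distinct shingles."""
    a, b = shingles(text_a), shingles(text_b)
    return len(a & b) / len(a | b)


def substantially_identical(text_a, text_b, threshold=0.9):
    # The 0.9 cutoff is an arbitrary assumption for this sketch.
    return similarity(text_a, text_b) >= threshold


# Hypothetical page text, for illustration only.
page_one = "Today is the first day of the rest of your life"
page_two = "TODAY is the   first day of the rest of your life"  # same words, different case and spacing
page_three = "Site Explorer lets webmasters submit sitemaps and feeds"

print(substantially_identical(page_one, page_two))
print(substantially_identical(page_one, page_three))
```

When two pages score above the cutoff, a search engine could keep one and discard the other.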
Another area that search engines might want help with is that a web page can be divided into regions, with content of interest to a searcher, such as a news article, and content that may not be of interest, such as an ad banner. If the search engine can ignore the ad banner portion of the web page, and just index the content area, the search results provided to searchers can be improved.
Site Explorer presently has a way for site owners to provide information about some data variables that might show up in the URLs for pages, such as session IDs, that might cause duplicate content problems for the search engines. It doesn’t yet appear to have a way to tell the search engine which regions of pages contain content that should be indexed, and which contain content that shouldn’t be, such as advertising.
The Site Explorer Patent Filing
Yahoo’s patent application includes a list of the kinds of information that a search engine might ask for from a site owner to make it easier for the search engine to index the pages of that site. We’re told that this list is representative, and likely not complete, but it’s definitely worth looking over and thinking about.
Providing system configuration information to a search engine
Invented by Amit Kumar
Assigned to Yahoo
US Patent Application 20080147617
Published June 19, 2008
Filed: December 19, 2006
Providing a search engine with system configuration information. The system configuration information pertains to a system having a web server that provides content.
For example, the content may be web pages associated with a web site, and the system may include hardware and software used to provide the content of the web site to end users.
More particularly, the system can include one or more computer systems, web server software, application server software, and application programs that facilitate providing content. A search engine requests system configuration information from the web server.
In response to the request, the web server provides system configuration information to the search engine. The search engine can use the system configuration information to reference portions of the content in an index. The index can be used to respond to a search query that involves content served by the web server.
Why be Concerned with Configuration Information?
If a search engine can figure out that a site contains multiple parts that are configured differently, such as a shopping section, a news archive, a blog, and some articles, it might approach each of those sections differently when it indexes them.
The patent tells us that it might treat information within those different parts in different manners, but doesn’t provide much in the way of examples.
Here are a couple of examples that I came up with – for a shopping section, the search engine might try to extract information about the products offered, including pictures and prices and other important attributes, and include those in its shopping search. For blog pages which usually have dates attached to them, the search engine might use those dates to see how “fresh” the content of those posts might be.
Some of the kinds of system configuration information that might be sought by a search engine:
— Preferred Canonical URL
An example – does the site prefer to be listed with a “www” or without a “www”?
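As a rough sketch of how a stated preference like that could be applied, the Python below rewrites a URL to use the preferred hostname. The apply_preferred_host function and its prefer_www flag are hypothetical names chosen for this example, and the sketch ignores any username or password embedded in the URL.

```python
from urllib.parse import urlsplit, urlunsplit


def apply_preferred_host(url, prefer_www=True):
    """Rewrite url so its hostname matches the site's stated www preference."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    bare = host[4:] if host.startswith("www.") else host
    preferred = "www." + bare if prefer_www else bare
    # Keep an explicit port if the original URL had one.
    netloc = preferred if parts.port is None else f"{preferred}:{parts.port}"
    return urlunsplit((parts.scheme, netloc, parts.path, parts.query, parts.fragment))


print(apply_preferred_host("http://example.com/page.htm"))
print(apply_preferred_host("http://www.example.com/page.htm", prefer_www=False))
```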
— Other URL Information
An example – whether URLs are case sensitive. Some site owners mix together capital and lower case letters in the way they name pages, and present them on the Web. This may not be a problem for domain names where search engines ignore whether letters in the domain names are capital or lowercase, but can be a problem for the names of folders and files where servers and search engines do pay attention to whether letters are capitalized or not.
www.LiveLifeFully.com and www.livelifefully.com are treated as if they are the same page by search engines.
www.livelifefully.com/Today/IsTheFirstDay.htm and www.livelifefully.com/today/isthefirstday.htm are treated as if they were different pages by the search engines because of the capital letters in the folder and file names.
A webmaster might be able to tell a search engine to treat those two pages as if they were the same page through something like Site Explorer. I haven’t seen this in today’s Site Explorer at Yahoo, so be careful when you name your files and folders. I recommend using all lowercase letters consistently throughout your site.
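The difference between the two cases can be sketched in a few lines of Python. This is only an illustration: normalize_case and its paths_case_sensitive flag are hypothetical, standing in for whatever setting a site owner might report, and the addresses are given an http:// prefix so they parse as full URLs.

```python
from urllib.parse import urlsplit, urlunsplit


def normalize_case(url, paths_case_sensitive=True):
    """Lowercase the domain name always, and the path only if the site says case doesn't matter."""
    parts = urlsplit(url)
    path = parts.path if paths_case_sensitive else parts.path.lower()
    return urlunsplit((parts.scheme, parts.netloc.lower(), path, parts.query, parts.fragment))


mixed = "http://www.LiveLifeFully.com/Today/IsTheFirstDay.htm"
lower = "http://www.livelifefully.com/today/isthefirstday.htm"

# By default the paths still differ, so these count as two different pages.
print(normalize_case(mixed) == normalize_case(lower))
# If the site owner reports that paths are not case sensitive, they match.
print(normalize_case(mixed, paths_case_sensitive=False) == normalize_case(lower, paths_case_sensitive=False))
```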
— Directory or Section-Level Information
If a site is divided up into different sections, such as a blog section, an e-commerce section, etc., those might be set up in a directory structure, using URLs like “acme.com/blog,” and “acme.com/sales.”
Letting the search engine know about these different sections may aid it in indexing that content in an optimal manner.
— Session Information
Providing information to the search engine about how sessions are tracked across a site might help it to avoid indexing problems associated with the tracking of visitors. A search engine may then index pages without including the session IDs that may appear in URLs, for instance.
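Here is a minimal sketch of what dropping session IDs from a URL could look like in Python. The parameter names in SESSION_PARAMS are examples only; in practice the list would be whatever the site owner reports.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical session parameter names, for illustration only.
SESSION_PARAMS = {"sessionid", "sid", "phpsessid"}


def strip_session_params(url, session_params=SESSION_PARAMS):
    """Return url with any session-tracking query parameters removed."""
    parts = urlsplit(url)
    kept = [
        (name, value)
        for name, value in parse_qsl(parts.query, keep_blank_values=True)
        if name.lower() not in session_params
    ]
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(kept), parts.fragment))


print(strip_session_params("http://www.example.com/item?id=7&sid=abc123"))
```

Two visits to the same page with different session IDs would then reduce to a single URL.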
— Contact Information
If the search engine has problems with crawling a web site, having contact information for an administrator of the site could be helpful in resolving those problems.
— Error Page Information
A site owner could provide information about how the web server is set up to handle errors when pages are unavailable on a site, so that the search engine doesn’t have to try to figure that out on its own.
— Duplicate Content Guidance
The same content on a site might be located at different places, such as blog entries in two different directories – “/blog” and “/journal”. A site owner could indicate that the “/blog” content should be indexed, and the “/journal” content shouldn’t be.
— Spider Information
Rules that apply to search engine crawling programs (spiders) could be listed here, such as an indication that a particular directory shouldn’t be crawled by the search engine.
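The patent filing doesn’t spell out a format for these rules. The existing robots.txt convention expresses the same kind of rule, so this sketch uses Python’s standard urllib.robotparser module to show how a crawler might check one. The “ExampleBot” name, the /private/ directory, and the example.com addresses are made up.

```python
from urllib.robotparser import RobotFileParser

# A hypothetical rule set telling all crawlers to stay out of one directory.
rules = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

blocked = parser.can_fetch("ExampleBot", "http://www.example.com/private/report.html")
allowed = parser.can_fetch("ExampleBot", "http://www.example.com/blog/")
print(blocked, allowed)
```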
— Language Information
A site may contain different sections in different languages. Letting the search engine know explicitly which sections use which languages could be better than letting a search engine try to figure that out on its own.
— Bulk Change Information
A site owner might decide to move or delete a whole section or directory on a site. Being able to quickly tell a search engine about this change may be really helpful.
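As a simple illustration of why one bulk notice beats rediscovering each page, this Python sketch rewrites every known URL under an old directory to its new location in a single pass. The apply_bulk_move function and the acme.com addresses are hypothetical.

```python
def apply_bulk_move(indexed_urls, old_prefix, new_prefix):
    """Rewrite every URL that starts with old_prefix so it starts with new_prefix instead."""
    return [
        new_prefix + url[len(old_prefix):] if url.startswith(old_prefix) else url
        for url in indexed_urls
    ]


# Hypothetical index entries, for illustration only.
indexed = [
    "http://acme.com/journal/2008/06/post.html",
    "http://acme.com/sales/widgets.html",
]
moved = apply_bulk_move(indexed, "http://acme.com/journal/", "http://acme.com/blog/")
print(moved)
```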
— Microformat Information
Microformats are a way of annotating a page to provide certain information in a standard format. They could be useful in providing information such as the address of a physical business run by the site owner, or contact information, or a calendar of events.
Letting the search engine know what microformats might be used on particular pages may make it easier for the search engine to find and use that information.
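To show what finding that information might involve, here is a small Python sketch that pulls a few hCard properties out of a page by their class names. It is a simplified illustration, not a full microformats parser: it handles only the handful of properties listed, assumes every tag is properly closed, and the business details in the sample are invented.

```python
from html.parser import HTMLParser

# A few hCard property class names; the full format defines many more.
HCARD_PROPERTIES = {"fn", "org", "tel", "street-address", "locality"}


class HCardParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.open_tags = []  # one entry per open element: a property name or None
        self.card = {}

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        prop = next((c for c in classes if c in HCARD_PROPERTIES), None)
        self.open_tags.append(prop)

    def handle_endtag(self, tag):
        if self.open_tags:
            self.open_tags.pop()

    def handle_data(self, data):
        prop = self.open_tags[-1] if self.open_tags else None
        if prop and data.strip():
            self.card[prop] = self.card.get(prop, "") + data.strip()


def extract_hcard(html):
    """Return a dict of the hCard properties found in html."""
    parser = HCardParser()
    parser.feed(html)
    parser.close()
    return parser.card


# Invented sample markup, for illustration only.
sample_html = (
    '<div class="vcard"><span class="fn">Acme Hardware</span> '
    '<span class="tel">555-0100</span> '
    '<span class="locality">Springfield</span></div>'
)
card = extract_hcard(sample_html)
print(card)
```

If a site owner told the search engine ahead of time which pages carry markup like this, the crawler would know where a parser of this kind is worth running.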
Not all of the configuration information that I listed above can be shared with Yahoo through Site Explorer yet, and it’s possible that the search engine may decide that other information may be helpful in the future, too.
There are no guarantees that providing information to Yahoo through Site Explorer, such as an XML sitemap or information about dynamic URLs on your site, will help with the indexing of the pages of a site. But the information could be useful to the search engine, and if providing it results in Yahoo pointing out problems or errors on the pages of a site that might keep that site from being indexed, then there is value in providing it.
If you are interested in learning more about Site Explorer from the perspective of the search engine, it’s worth spending some time with the patent application itself.
The front page of Yahoo’s Site Explorer does a terrible job of providing people with information about some of the tools that Site Explorer offers. It also should probably have links to other tools offered by Yahoo, such as how to add a business to Yahoo Local, and other Yahoo Webmaster Resources. The Yahoo Everything page also lists some tools and features that Yahoo offers that might be helpful to webmasters.
If someone at Yahoo approached reorganizing that information in a manner that would be friendlier and more helpful to site owners, they might be surprised at the positive response they would receive.