On one level, a search engine indexes a web site by crawling that site one URL at a time, collecting information about what it finds at that address, and indexing the information found so that it can be served to visitors later.
But the process can be more complicated than that.
For instance, a search engine may try to understand more about specific sites by collecting information on a site-wide basis.
Site-Wide Information about Web Sites
At a site-wide level, a search engine might perform tasks such as:
- Detecting multiple possibly-duplicated pages from the same site,
- Determining entry points of a website,
- Identifying spam and porn sites,
- Detecting site-level mirrors,
- Extracting site-wide templates, and
- Visualizing content at the site level.
A search engine might also attempt to classify web sites based upon features found on the site, such as:
- Topics of each page,
- Internal hyperlinks on sites,
- Commonly linked-to entry points in sites, with their anchor-text,
- General external link structures,
- Directory structures of sites,
- Link and content templates present on sites,
- Description, title, and tags on key pages on sites, and so forth.
However, most of these approaches try to come up with an overall topic for a site, or for broad sections of a site, rather than for individual pages, and how those pages might be related to each other within a hierarchy.
Topic labels for web pages
A new patent application from Yahoo reviews these site-wide approaches as background for a finer-grained look at a site, one that explores how a search engine might attempt to understand the different topics on the individual pages and segments of a site.
It might do this by looking at topic labels for specific pages (from places like links to individual pages from the Yahoo Directory, Wikipedia, the Open Directory, or other directories), and seeing how those labels might relate to each other within a topical hierarchy.
We are fortunate that the inventors of the patent filing also wrote a paper that covers a lot of the same ground, which explains the processes involved without a lot of the legal language found in the patent filing – Hierarchical Topic Segmentation of Websites.
System and method for hierarchical segmentation of websites by topic
Invented by Kunal Punera, Shanmugasundaram Ravikumar, and Andrew Tomkins
Assigned to Yahoo
US Patent Application 20080046429
Published February 21, 2008
Filed August 16, 2006
Abstract
An improved system and method is provided for hierarchical segmentation of websites by topic.
To do so, an organization of topics may be determined within directories of a website, the hierarchical arrangement of the web pages in the website may be segmented by topic, and the segments representing regions of coherent topics in the website directory may be output.
In an embodiment, a website directory may be converted into a binary tree and dynamic programming may be applied to iteratively determine whether to add a node of the tree to a segment representing a topic.
A node selection cost may be evaluated to determine whether to add a node of the tree as a segment representing a topic.
And a cohesiveness cost may be evaluated to determine how well a web page of the tree may be represented by its closest ancestral node that may be a segmentation point of a segment representing a topic.
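To make the abstract a little more concrete, here is a minimal Python sketch of the general idea. It is not the method claimed in the filing, and it rests on several simplifying assumptions of my own: it works on the directory tree directly rather than converting it to a binary tree, it charges a flat select_cost for every segmentation point, and it scores cohesiveness as 1 for each page whose topic label differs from the majority label of its closest segmentation point. The Directory class, the segment_site function, the example paths, and the cost values are all hypothetical.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Directory:
    """A directory in a site's URL tree (hypothetical structure for this sketch)."""
    path: str
    page_labels: list = field(default_factory=list)  # topic labels of pages stored directly here
    children: list = field(default_factory=list)     # sub-directories


def segment_site(root, select_cost=1.5):
    """Return (total_cost, segments), where segments is a list of (path, topic) pairs."""
    majority = {}  # id(directory) -> most common topic label in its whole subtree

    def count_labels(node):
        counts = Counter(node.page_labels)
        for child in node.children:
            counts.update(count_labels(child))
        majority[id(node)] = counts.most_common(1)[0][0] if counts else None
        return counts

    count_labels(root)
    memo = {}  # (id(directory), inherited topic) -> best (cost, segments)

    def cost_under(node, topic):
        # Assumed cohesiveness cost: 1 per page whose label differs from `topic`.
        cost = sum(1 for label in node.page_labels if label != topic)
        segments = []
        for child in node.children:
            child_cost, child_segments = best(child, topic)
            cost += child_cost
            segments += child_segments
        return cost, segments

    def cut_here(node):
        # Make `node` a segmentation point labelled with its own majority topic.
        own = majority[id(node)]
        cost, segments = cost_under(node, own)
        return cost + select_cost, [(node.path, own)] + segments

    def best(node, inherited_topic):
        # Dynamic programming step: inherit the ancestor's topic, or start a new segment.
        key = (id(node), inherited_topic)
        if key not in memo:
            keep = cost_under(node, inherited_topic)
            cut = cut_here(node)
            memo[key] = keep if keep[0] <= cut[0] else cut
        return memo[key]

    return cut_here(root)  # the root is always treated as a segmentation point


site = Directory("/", ["sports"], [
    Directory("/sports/", ["sports"] * 4,
              [Directory("/sports/football/", ["sports"] * 2)]),
    Directory("/recipes/", ["cooking"] * 3,
              [Directory("/recipes/desserts/", ["cooking"] * 2)]),
])
total_cost, segments = segment_site(site)
print(total_cost, segments)
```

On this toy site the sketch keeps the root as a "sports" segment and splits off /recipes/ as a separate "cooking" segment. Raising select_cost makes extra segments less attractive, so a large enough value leaves the whole site as a single segment.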
Conclusion
The paper goes into a lot of the reasons why a search engine might want to segment the parts of a web site, and how it can use things like URL structures to help it do so.
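As a rough illustration of how URL structure can expose a site's hierarchy, the following sketch groups a list of URLs into a nested directory tree. It is only my own illustration under stated assumptions, not code from the paper or the patent: the build_directory_tree function, the "dirs" and "pages" keys, the "index" placeholder for directory URLs, and the example.com addresses are all hypothetical.

```python
from urllib.parse import urlsplit


def build_directory_tree(urls):
    """Group URLs into a nested dict that mirrors the directories in their paths."""
    tree = {"dirs": {}, "pages": []}
    for url in urls:
        path = urlsplit(url).path
        # Everything before the last "/" is treated as directories, the rest as the page.
        *dirs, page = path.lstrip("/").split("/")
        node = tree
        for name in dirs:
            node = node["dirs"].setdefault(name, {"dirs": {}, "pages": []})
        # Assumption: a URL ending in "/" is recorded as that directory's index page.
        node["pages"].append(page or "index")
    return tree


tree = build_directory_tree([
    "https://example.com/",
    "https://example.com/sports/scores.html",
    "https://example.com/recipes/",
    "https://example.com/recipes/desserts/pie.html",
])
print(sorted(tree["dirs"]))
```

A tree like this is the kind of hierarchy a segmentation process could then walk, attaching topic labels to the pages in each directory.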
What I found most interesting about the document was the change in focus of a search engine from crawling and understanding individual pages to understanding how pages within a site relate to each other.
How do the topics and parts of your web site relate to each other, and how might a search engine understand those relationships from different features and aspects of the site, and from links pointed to the pages of the site from other places?
It’s good to see Yahoo innovating again – they’ve been playing catch-up with Google for too long.
Interesting that the patent says they convert site structure into a binary tree. I would have thought that N-ary trees were closer to the designer’s intent when choosing URLs. Maybe that wouldn’t be patentable, though.
Hi Andy,
I agree. I enjoyed the patent and paper a lot because I thought it showed some nice innovation from Yahoo in this area. I also enjoyed the discussion on web site classification covered in the prior art area of the patent filing.
In the paper, they note that they’ve limited much of their inquiry into the topical hierarchy of a site by restricting themselves to “segmentations that follow the directory (URL) tree,” but they also note that “our approach can be applied to any hierarchical structure within a website.”
They do tell us that in studying this “segmentation problem,” looking at URLs helps by capturing “the vast majority (85–90%) of websites, and allows us to study how to make use of this key element of site structure.”
They do leave open the possibility of looking in the future at other features that uncover a “latent hierarchical structure” by a “deeper analysis of links, content, or URL [26], but that is beyond the scope of this paper.”
I believe that you are right that many designers do carefully consider their choice of URLs in organizing a site, and the percentages that the inventors of this process provide from their research (85-90% of sites) make looking at that meaningful. But that is only part of the process, and by itself, like you say, might not be patentable.
Having built a number of sites, and spent a considerable amount of time organizing the topics covered within those sites, it’s interesting to see how someone at a search engine might try to understand how the different topical sections of sites might relate to each other in a manner that could be automated, and applied to large numbers of sites.
Oh, now you’ve made me read their paper!
I can’t pretend to understand the maths; it’s been too long since I was at university studying that kind of thing.
It’s interesting stuff – particularly the idea of segments in pages as well as pages as segments. It might make a weapon against spam blogs if the topic on a page seems to be spammy but the rest of the SEO seems legitimate. Then again, it might just be gamed by the spammers.
At least, the paper is more manageable than the patent :)
I am hoping that they carry out some of the additional research that they mention they might follow, to see what kinds of implications this might have in areas like the indexing of pages, or the presentation of information about web sites in search engines.
Fighting spam might be one of the implications of this kind of topical segmentation, but there might be others.
For instance, one of the efforts that Google has undertaken in their search results is to show sitelinks to other parts of a site when some sites appear at the top of search results. Yahoo could use a process like this one to show some directory-like links under certain search results to give people a better idea of what they might find on a site. That could be interesting.
One of the other advantages of such topical segmentation listed in the paper is that:
They don’t go into much detail about those “algorithms” but that could have some interesting implications for how search engines treat sites.
Thank you, shenzhenseo. :)
Glad to hear that you liked my post.