How a Search Engine Might Analyze the Linking Structure of a Web Site

How well do search engines understand the linking structure of a web site? Do they have ways to organize and classify individual links and blocks of links that they see on the pages of a site?

Do they treat links and collections of links that they find on more than one page of a site differently than links and collections of links only on one page? If they find more than one group of links on a page that contain many of the same links, such as at the top and bottom of the page, how might they treat those links?

I came across a patent filing from Microsoft from last summer that explored many of these topics, as well as others. It hadn’t drawn much attention, so I decided to take a closer look at it here.

Segmentation and Link Blocks

In the 2002 paper, SmartView: Enhanced Document Viewer for Mobile Devices (pdf), a couple of Microsoft researchers discussed how web pages might be analyzed and partitioned into smaller logical sections to be viewed on small devices, such as handheld phones. These smaller sections could be selected by a viewer and seen independently from the rest of a web page. One of the authors of that paper is listed as an inventor of the Microsoft patent, and the paper is cited within the patent as an example of how web pages might be segmented in a way that benefits the viewers of a page.

Another web page segmentation process mentioned in the patent filing is a Microsoft process known as VIPS: a Vision-based Page Segmentation Algorithm. The paper describing this process was published in 2003, and explores a way of looking at the HTML of a page, along with a visual inspection of white space, horizontal rules, and other visual aspects of a web page that might indicate that a page is broken down into different logical sections.

Another paper from Microsoft that wasn’t mentioned in the patent filing, but which seems relevant, is one that explores how links from different blocks on a page might be treated differently based upon where they are located on that page. The paper is Block-Level Link Analysis, and it introduces amongst other things the idea of a Block Level PageRank:

Block Level PageRank (BLPR) is similar to the original PageRank algorithm in spirit. The key difference between them is that, traditional PageRank algorithm models web structure in the page level while BLPR models web structure in the block level.

What that paper and other papers from Microsoft don’t explore in much depth is how different blocks of links might be related to each other, or how the pages of a site might be organized based upon the links between them. Looking at link blocks on a site, classifying them, and organizing them may yield some useful benefits.

Once a page is broken down into different segments, such as headers and footers, sidebars, main navigation bars, main content areas, advertisement blocks, etc., the relationship between links in those segments across the site might be explored.

Classifying Links

To classify links and link blocks, a search engine would start by analyzing the layout of individual pages to identify candidate link blocks and see where they occur on pages, and how they might relate to each other. This analysis is used to create what the patent refers to as a Link Structure Graph, or LSG.

There are three main purposes for creating an LSG:

Locality – To identify the global link structure of a site, and the local link structure around individual pages.

Completeness – To understand the complete link structure of the site, including navigational structures and logical structures that are being used to organize the content on a site.

A navigational structure is a consistent and easy to follow arrangement of links enabling visitors to travel to different parts of a site. A high level global navigation structure usually appears on all (or most) pages of a site, and secondary (and even lower level) navigation structures may also allow visitors to navigate through different sections of pages on a site.

In addition to navigational links, a site may include links to pages in structural elements, such as a list of links to “best sellers” on an ecommerce site, or to “most popular posts,” on a blog.

Scalability – This algorithm can run efficiently for large and small web sites. It also looks at link blocks that may appear on more than one page, and relate them to each other instead of treating them as new when it finds them on other pages.

Some link blocks may appear more than once on the same page in different segments, with minor variations, and they may be merged together. For instance, the same or a substantially similar link menu might be shown at the top and bottom of a page in a main navigation area and a footer navigation area.
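To make the merging step a little more concrete, here is a minimal Python sketch. It is not taken from the patent filing: the function names, the 0.8 threshold, and the choice of Jaccard overlap as the similarity measure are all assumptions made for illustration. It treats a link block as a set of URLs and folds a block into an earlier one when it is a subset, a superset, or a near duplicate.

```python
def jaccard(a, b):
    """Share of links two blocks have in common (0.0 to 1.0)."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def merge_similar_blocks(blocks, threshold=0.8):
    """Greedily merge blocks that overlap heavily with an earlier block.

    The threshold and the subset/superset rule are illustrative
    assumptions, not values from the patent filing.
    """
    merged = []
    for block in blocks:
        links = set(block)
        for existing in merged:
            if (links <= existing or existing <= links
                    or jaccard(links, existing) >= threshold):
                existing |= links  # fold this block into the earlier one
                break
        else:
            merged.append(links)  # nothing similar found; keep as a new block
    return merged


page_blocks = [
    ["/", "/products", "/about", "/contact"],  # top navigation
    ["/", "/products", "/about"],              # footer navigation (subset)
    ["/blog/post-1", "/blog/post-2"],          # unrelated content links
]
result = merge_similar_blocks(page_blocks)
print(len(result))  # the two navigation blocks collapse into one
```

A real system would presumably also look at anchor text, layout position, and how often a block recurs across pages, which this sketch ignores.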

After substantially similar link blocks have been merged together, the remaining link blocks, which are considered “unique,” are classified. Classification is based upon the function of a link block, and might be described as being one of the following three types:

S-nodes – These are organizational and navigational link blocks; typically repeated across pages with the same layout and showing the organization of the site. They are often lists of links that don’t usually contain other content elements such as text. These blocks are structural link blocks or s-nodes.

C-nodes – These are content link blocks, grouped together by some kind of content association, such as relating to the same topic or sub-topic. These blocks usually point to information resources and aren’t likely to be repeated across more than one page.

I-nodes – These are isolated links, which are links on a page that aren’t part of a link group, and which may be only loosely related to each other, by virtue of something like their appearing together within the same paragraph of text. Every link that appears on a page that isn’t classified as s-nodes or c-nodes might be seen as a single collection of links and given an i-node classification. Each link on a page might be considered an individual i-node, or they might be grouped together by page as an i-node.
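The three types above can be sketched as a simple rule-based classifier. This is only an illustration under stated assumptions: the `LinkBlock` fields, the `repeat_ratio` and `max_text_chars` thresholds, and the rule that a single link is always an i-node are hypothetical choices, not details from the patent filing.

```python
from dataclasses import dataclass


@dataclass
class LinkBlock:
    links: list           # URLs found in the block
    pages_seen_on: int    # number of pages on which the block appears
    text_chars: int = 0   # non-link text characters between the links


def classify_block(block, total_pages, repeat_ratio=0.5, max_text_chars=20):
    """Return 's-node', 'c-node', or 'i-node' (hypothetical thresholds).

    total_pages is assumed to be greater than zero.
    """
    if len(block.links) < 2:
        return "i-node"  # a lone link is not part of any link group
    repeated = block.pages_seen_on / total_pages >= repeat_ratio
    mostly_links = block.text_chars <= max_text_chars
    if repeated and mostly_links:
        return "s-node"  # repeated, link-only block: structural/navigational
    return "c-node"      # grouped links that are not site-wide navigation


nav = LinkBlock(["/", "/products", "/about"], pages_seen_on=95)
related = LinkBlock(["/guides/brakes", "/guides/rotors"], pages_seen_on=1, text_chars=120)
inline = LinkBlock(["/history"], pages_seen_on=1, text_chars=300)

for block in (nav, related, inline):
    print(classify_block(block, total_pages=100))
```

On a 100-page site, the navigation menu seen on 95 pages comes out as an s-node, the small group of related guides as a c-node, and the single in-text link as an i-node.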

If you were to look at a number of pages on different web sites, you might not find it too hard to do this kind of classification for the links on those pages.

Why Classify Links?

There are a number of reasons for the classification of links on a site. The paper on Block-Level Link Analysis mentioned above tells us that links in different blocks might be given different weights for ranking purposes. Understanding the linking structure of a site can also help when different parts of a site might be displayed on a handheld device with a smaller screen. But there are also other potential benefits that are described in the patent filing:

1) Links to other pages that might be related to a page shown may be more easily uncovered. This patent doesn’t mention the use of quicklinks, but it does tell us that it might present information about pages that are related and make it easier to navigate to those pages on a site. These might be used with a personalization approach to uncover pages that might interest a specific visitor, or be based upon an approach that increases a visitor’s ability to navigate to pages that might not be directly linked to pages offered by search results.

2) Internal linking information collected by the search engine might be offered to the site owners to allow them to optimize their use of links, and to see statistics about visits between pages of a site.

3) The linking information might be useful in the automatic tagging of web pages on a site.

For example, a page about Cars might include category pages about specific brands of cars, then sub categories about specific models, and then specific product pages about car parts. Understanding the linking structure of a site can mean that the higher level link text of parent pages might be used to help tag the lower level pages. So if a category page is pointed to with the anchor text “Ford” and it has a sub-category linked to with the anchor text “mustang parts,” which points to a page about a specific product page for brake pads, the brake pad page might be automatically tagged with the terms “ford,” and “mustang parts.”
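The car parts example can be sketched in a few lines of Python. The URLs, the `inherited_tags` name, and the choice to lowercase anchor text are assumptions made for illustration; the patent filing does not spell out an algorithm at this level. The idea is simply that each page inherits the anchor text of the links that lead down to its parent.

```python
def inherited_tags(links, root):
    """Tag each page with the anchor texts used on the way down to its parent.

    links maps a parent URL to a list of (anchor_text, child_URL) pairs.
    """
    path = {root: []}  # anchor texts on the path from root to each page
    tags = {root: []}  # tags a page inherits from higher-level pages
    stack = [root]
    while stack:
        parent = stack.pop()
        for anchor, child in links.get(parent, []):
            if child in path:
                continue  # keep the first path found; also guards against cycles
            tags[child] = list(path[parent])
            path[child] = path[parent] + [anchor.lower()]
            stack.append(child)
    return tags


# Hypothetical site structure for the example in the text.
site_links = {
    "/cars": [("Ford", "/cars/ford")],
    "/cars/ford": [("mustang parts", "/cars/ford/mustang-parts")],
    "/cars/ford/mustang-parts": [("brake pads", "/cars/ford/mustang-parts/brake-pads")],
}
tags = inherited_tags(site_links, "/cars")
print(tags["/cars/ford/mustang-parts/brake-pads"])  # ['ford', 'mustang parts']
```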

4) Like the automatic tagging above, internal links and anchor text between pages might also be used to create a concept hierarchy for a site, which can then be compared to other sites containing similar concepts.

Using my car part site example from the previous section on automatic tagging of pages, a hierarchy of concepts might be created about a site offering car parts. That site might be compared to other sites that may use similar terminology and which could even have a similar concept hierarchy. Those sites might then be clustered together by a search engine.
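One way such a comparison might work, offered purely as an assumption, is to flatten each site's concept hierarchy into parent and child pairs and measure how many pairs two sites share. The nested dictionaries, function names, and the overlap measure below are hypothetical; the patent filing doesn't say how the comparison would be done.

```python
def hierarchy_edges(tree, parent=None):
    """Flatten a nested dict of concepts into a set of (parent, child) pairs."""
    edges = set()
    for concept, children in tree.items():
        concept = concept.lower()
        if parent is not None:
            edges.add((parent, concept))
        edges |= hierarchy_edges(children, concept)
    return edges


def hierarchy_similarity(tree_a, tree_b):
    """Jaccard overlap of the parent/child pairs in two concept hierarchies."""
    a, b = hierarchy_edges(tree_a), hierarchy_edges(tree_b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


# Two hypothetical car part sites with mostly matching hierarchies.
site_a = {"Cars": {"Ford": {"Mustang parts": {"Brake pads": {}}}}}
site_b = {"Cars": {"Ford": {"Mustang parts": {"Rotors": {}}}}}

score = hierarchy_similarity(site_a, site_b)
print(score)  # 0.5: two of the four distinct parent/child pairs are shared
```

Sites whose scores exceed some chosen cutoff could then be grouped into the same cluster.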

5) Anchor text in links found on a site might be presented to viewers to help them navigate through the pages of a site in a sidebar, or in a kind of sitemap reflecting the linking structure of the site.

The patent application is:

Web Site Structure Analysis
Invented by Natasa Milic-Frayling, Eduarda Mendes Rodrigues, and Shashank Pandit
Assigned to Microsoft
US Patent Application 20080134015
Published June 5, 2008
Filed: December 5, 2006

Abstract

A graph representation of a web site is generated by identifying blocks of links on web pages. Each block of links is represented by a node in the graph representation and connections between the nodes provide information on the re-use of blocks between pages.
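The graph the abstract describes can be sketched roughly as follows. This is a simplified illustration, not the patent's method: it assumes blocks have already been identified and merged so that the same block carries the same hypothetical ID on every page, and it assumes an edge runs from block A to block B when a link in A points to a page on which B appears.

```python
from collections import defaultdict


def build_link_structure_graph(pages):
    """Build a toy Link Structure Graph.

    pages maps a page URL to a dict of {block_id: [target URLs]}.
    Returns (block_pages, edges):
      block_pages: block_id -> set of pages on which the block appears
      edges: block_id -> set of block_ids found on the pages it links to
    """
    block_pages = defaultdict(set)
    for url, blocks in pages.items():
        for block_id in blocks:
            block_pages[block_id].add(url)  # records re-use across pages

    edges = defaultdict(set)
    for url, blocks in pages.items():
        for block_id, targets in blocks.items():
            for target in targets:
                # Connect this block to every block on the target page.
                for target_block in pages.get(target, {}):
                    edges[block_id].add(target_block)
    return dict(block_pages), dict(edges)


# Hypothetical three-page site.
pages = {
    "/": {"main-nav": ["/", "/parts"], "featured": ["/parts/brake-pads"]},
    "/parts": {"main-nav": ["/", "/parts"], "parts-list": ["/parts/brake-pads"]},
    "/parts/brake-pads": {"main-nav": ["/", "/parts"]},
}
block_pages, edges = build_link_structure_graph(pages)
print(sorted(block_pages["main-nav"]))  # the navigation block is re-used on all three pages
print(sorted(edges["featured"]))        # blocks reachable from the "featured" block
```

Because a repeated block is stored once with the set of pages it appears on, the graph stays much smaller than a page-by-page record of every link, which fits the scalability goal mentioned earlier.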

Conclusion

I’ve provided a high level overview of a process described in the patent filing on how a search engine might use a segmentation process to identify link blocks on a site, possibly merge some of those blocks together, and then classify the link blocks that they’ve found. The patent goes into more details on what it might look for in creating those blocks, and merging them together, and then in classifying them.

The patent also provides some possible benefits of segmenting and classifying links into link blocks, but there are likely others that it doesn’t detail, such as whether a search engine might give different link-based ranking values to links found in different kinds of link blocks.

The patent also describes how it might include data collected about the links on a site from monitoring the use of those links from visitors to pages, though it doesn’t go into much depth on the processes behind that approach.

The process in this patent filing is from Microsoft, and it’s possible that Microsoft uses a process like this when they index web pages. It’s also possible that the other major search engines may be doing some similar kind of analysis of different links on a site, and classifying them based upon where they appear in the layout of a site, and the function that they provide.

In my last post, on Creating an SEO Inventory, there is a column for “Navigation Location” where you can list the kinds and locations of links to specific pages from other pages on the same web site, such as a logo link or a main navigation link. You may want to think about which classification the links to those pages might fit within, using a framework like the one described in this Microsoft patent filing.

47 thoughts on “How a Search Engine Might Analyze the Linking Structure of a Web Site”

  1. Gets almost boring to tell you it’s a nice post ;)

    Were there any better indicators of the factors used to determine how link blocks across pages would be analysed to determine how they are ‘perceived’? I’m interested if they gave any info on what signals would constitute each of those node types across a larger volume of docs (e.g. across a given website).

    Rgds from Dublin
    Richard

  2. Thanks, Richard

    There are a number of factors discussed in analyzing and classifying link blocks, and I’m guessing that there probably are a good number that might be used that aren’t included in the patent filing. S-Node blocks are used by the search engine to learn about the structure of a site. C-Node blocks may help a search engine learn more about links on a site that are related in some fashion, perhaps such as a related topic.

    Here are some of the factors mentioned that might be helpful in classifying blocks.

    For S-Node Link Blocks:

    1. A very limited amount of text or non-alphanumeric characters may be allowed between links, such as a pipe (|) symbol.

    2. The link block usually contains all internal links.

    3. These blocks show up on more than one page (above a certain threshold of the total number of pages on the site).

    4. When there are a couple of S-Node type blocks on a page, and one is a subset or superset of the other, they might be merged together. So, for instance, a group of quick links in a footer that all lead to main category pages, also listed in a main navigation bar might be merged into one link block.

    5. Link blocks on different pages that resemble each other very much except, for instance, not containing a link to the page that they appear upon, may be considered the same link block.

    6. Separate link blocks might be created for links in drop-down submenus on main navigation bars, especially if there are lots of links on a site pointed to in the main navigation and sub-menus.

    7. Where all of the links in a link block are internal references to links on the same page (such as questions on a FAQ page that link to answers on the same page), they might be considered as a C-Node block. If that block of links also appears on some other pages of the site (pointing to the links on the other page), then it might be considered an S-Node block.

    8. Some S-Node blocks may be given more weight than others as being indications of the structure of the site, based upon such things as the number of links being pointed towards within the site from the blocks.

    I’m pretty sure that I’m going to be rereading this patent filing a few more times – there’s a lot in there, and I’m sure that I haven’t gotten all of the implications of the use of the methods involved yet.

  3. Hm. This one covers one of my favorite topics: internal linking. :)

    I can see it now, spreading across SEO Forums and Blogs for the next year: “10 Tips On How To Create C-Node Links On Every Page!”

  4. Hi Michael,

    It’s one of my favorite topics too. :)

    I like that this patent filing gives us a framework for discussing internal links, but I hope that we don’t limit ourselves in exploring other possibilities and approaches.

  5. Great post. With links being so important to begin with, it amazes me that so few make proper use of internal links.

    But as your car part example also pointed out, this could be a great way for search engines to get a better understanding of the overall theme of a site, and not just a page.

  6. “10 Tips On How To Create C-Node Links On Every Page!” – mmm catchy!!

    Cheers Bill, in-depth and has given me lots to think about as usual. I think I’ll revisit the article and ponder it some more after it’s sunk in!

  7. Bing seems to be dividing things up inside of their mobile search application. They divide up pages vs just showing you the web page. Looks like they have a lot of this in place, and wouldn’t be surprised if they are using this in many different ways.

  8. Hi Robert,

    I’m surprised that more sites don’t carefully consider their linking structure, and the anchor text used to point to pages within a hierarchy of those links. I believe search engines have been looking at that structure and organization since long before this patent application was filed.

  9. Hi Wes,

    I’ve seen papers and patent filings from Google and even Nokia on how to break apart pages for presentation on web pages. It does look like the Web can be presented very differently based upon what kind of device we use to look at it. That’s an interesting challenge for web designers.

  10. Belting article. I can’t even come to terms with how much data search engines look at to decide where a page should be placed in the SERPs. It’s fascinating and at the same time an up-hill task that I love.

  11. Bill,

    That means an additional layer of quality might be superimposed on the linkage data which makes a site rank well. So despite your PR bar showing a bias towards green, the actual PR is small since the links are originating from comment spamming or footer navigation or something similar.

    How do footer links that go to external sites get treated according to this? Is the intent of linking considered or is it so that just because that block has least real strength, the link doesn’t mean that much?

    Thank you for this brilliant post.

    Ashish

  12. Interesting stuff. One thing it has encouraged me to do, is to go through all my links and make sure they are anchored properly. For eg. after an excerpt of a post, I have something like “read more” as a link. I’ll be looking to change this so that it has a more descriptive anchor. Thanks for posting

  13. Linking pattern is the vital part of the website competence in the page ranking race…. however internal linking is the thing that a bot considers greatly… many web page with the odd and confusing internal linking pattern can get the website penalized by the search engine like Google.

    Thanks for making some doubts clear by this post :)

  14. Hi Lee,

    Good point. While it might (and let me emphasize the word “might”) be helpful to think about things like a list of “ranking factors” that search engines might use, and attempt to come up with weights for those signals, the reality is that the algorithms that search engines use are likely more complex than we might think of when coming up with such lists. Search engines do collect very large amounts of information about the pages that they find on the web through crawling, collecting feed information, and extracting information about facts and objects, and user behavior through search query logs and toolbars and analytics programs and other services that the search engines provide.

  15. Hi Ashish Roy,

    Very good questions. We’ve been told that links in different parts of pages might carry different weights, or amounts of PageRank, without any indication of how that weight might be distributed. From this patent filing, we can see that PageRank might not be a strict block level PageRank, where different blocks might carry different amounts of PageRank, since some link blocks might be merged together. You’re welcome.

  16. Hi Ravi,

    I don’t know about any “penalties” associated with internal linking patterns, but I do find it a good idea to have a navigational structure that makes it more likely for people to find what they are looking for when they visit a site. If that use of internal links and anchor text can also help a search engine better understand the internal structure of a site, that’s for the good as well.

  17. I’ve been considering and testing the “only the first link in the code counts” theory recently and this post kind of confirms what I was thinking, that it is easy to confuse links on a page as being ignored but in reality it is just the context and markup that is having an effect on their “weight”. I’ve also noticed that updating spammy, badly coded rows of footer links to semantically arranged footer link “areas” (see seomoz) has a direct improvement on a site’s performance (as with everything, other factors may have been involved).

    All in all, great post and very helpful for my own investigations!

  18. Very interesting, especially the bit about Block Level Page Rank, it’s certainly the first time I’ve seen it mentioned. Thanks for sharing.

  19. Bill

    C-node internal linking is the most favorite linking that I have observed. Internal linking as a whole is the most important thing that can help you out in getting the desired SERPs however this is a very mind consuming task and also results better.

  20. Hi Matt,

    Thank you very much. There are likely more than a few different issues that could impact the “only the first link in the code counts” theory that haven’t been explored very well. The idea of merging link blocks, as described in this patent filing is just one of them. Another that could skew that experiment would be something like a phrase-based indexing, where links that use anchor text that isn’t related in a meaningful manner to the content on a page might be given little weight or ignored when a page might be re-ranked pursuant to that kind of phrase based indexing. A third could be the use of something like a block level PageRank, where links in different visual blocks on a page might be given different amounts of weight, or possibly ignored.

    I do like the idea of using semantically well coded sections of pages, and agree that it is possible that those can have a positive impact.

  21. Hi Ravi,

    I like that this patent filing gave us a different way of thinking about how search engines might treat the links that they find on a page. I see some potential benefits from the classifications, if it can help a search engine get a better sense of the structure and contents of a site. I’m not sure that I would cite one type of classification as a favorite, or better than the others, since they all have somewhat different purposes. But knowing that a search engine might classify blocks of links in a manner like this, and some of the possible reasons why gives us a perspective that we didn’t have before the patent filing was published.

  22. Hi Stu,

    You’re welcome. I like the block level PageRank paper, because it explores the idea that search engines might base rankings that we see on a level other than just pages. At Bing, I’ve recently seen links to individual blog comments in search results (for example: http://www.seobythesea.com/2009/08/how-a-search-engine-might-analyze-the-linking-structure-of-a-web-site/#comment-190861 ). Is that because they might be using something like block level PageRank? I don’t know, but it’s worth thinking about and exploring.

  23. Brilliant post..

    It made me ask a question that you might answer Bill. Is there a way to know which algorithms are in use?

  24. Hi Vedran,

    Thank you. The workings of commercial search engines are often fueled by a combination of processes that are trade secrets and processes that are protected as intellectual property through patents. Patent filings need to be written in a way that describe some actual potential use of the ideas behind them as a process or method, but don’t have to be written in such fine detail that anyone who reads the patent can sit down and follow a series of steps to come up with the same invention.

    We are given some mentions of specific earlier works in the patent itself to processes like those described in the papers I linked to in the post on SmartView and VIPS, so it’s possible that some algorithms developed for those methods may be involved, and we are also given some descriptions of processes that the search engine might use to identify link blocks, and merge some blocks, and other steps. In essence, the patent filing describes some algorithms that are unique to it, though not at a super fine level of detail.

  25. We have been reviewing our own internal linking structure to determine if we are doing it in the most optimal way. Getting the most value out of the content of your website (from a search engine perspective) is largely dependent on how well you make your content accessible. This post was very helpful and exactly what I was looking for. It will not only be useful when analyzing the links structure of a site the size of ours, but for some of my smaller clients who have a site much smaller in scale.

    I think that people focus so much on off-page and on-page (in terms of content) methods of optimization, that the internal link structure is often a neglected component of a solid SEO strategy.

  26. Hi Juggle,

    Thanks. It can be really helpful to start off with a strong internal link structure for a site to begin with. Unfortunately, I run across many sites where that hasn’t been done, and they need some serious work. I really liked the classification that was described in this patent filing, because it provided a different way to think about the links that you might present on your pages. Glad to hear that you found it useful.

  27. Very in-depth article with some great information. I too believe that it is best to nail your onpage links before even thinking about offpage link building.

  28. Hi Neil,

    I agree with you completely. A site owner has considerably more control over the structure and content of their own sites, and building an intelligent information architecture for a site, making usability and conversion improvements, addressing broken links, and fixing problems with the same pages found at more than one URL are things that you can do without relying upon other people at other sites.

  29. Good article, it’s true you need to make sure everything on your own site is perfect first. You can get loads of tools which will investigate your site for broken links and bits, stay on top of this and sales conversions will increase.

  30. Hi Paul,

    Thank you. It really does help to fix broken links, remove internal and external redirects, and focus upon building up the quality of your pages with an eye towards increasing conversions or other goals that you might have with your site.

  31. Bill, I think too many people are concentrating on get rich internet schemes rather than building an information rich site that Google will love anyway. Off page optimisation is important, but it’s an effort wasted if your site is not well structured for the search engines.

  32. That means an additional layer of quality might be superimposed on the linkage data which makes a site rank well. So despite your PR bar showing a bias towards green, the actual PR is small since the links are originating from comment spamming or footer navigation or something similar.
    How do footer links that go to external sites get treated according to this?

  33. Hi Nigel,

    It does appear that Google is targeting those “get rich” types of sites with updates like Penguin and Panda. The patent that I wrote about in this post is from Microsoft, and it really only discusses some of the efforts that a search engine might take to better understand links from other sites, but it does seem that Google has started to take more actions against paid links, manipulative linking schemes aimed solely at raising the rankings of sites, and other activities that might be against their guidelines.

  34. Hi Jackson,

    The Google PageRank toolbar is only updated 3-4 times a year and may not be an accurate reflection of how much PageRank a page might have. It’s also not necessarily an indication of how much PageRank a link from a page might pass along. You may want to look at this post, and the comments after it for some thoughts on links from footers and sidebars and so on:

    Google’s Reasonable Surfer: How the Value of a Link May Differ Based upon Link and Document Features and User Data

  35. I thought PageRank itself was only updated 3-4 times a year, not just the toolbar, although I can’t remember where I read this.

  36. Hi Rob

    Just the toolbar is updated 3-4 times a year. Google updates actual PageRank much more rapidly. It used to be every 4-5 weeks back in the day when the Google Dance used to take place, but it’s been years since they’ve updated it that slowly.

  37. Thanks for clarifying that Bill. I have no idea where I read that but I suspect I completely misread the article. It makes sense to have more frequent updates.

    On a completely unrelated note, I love the site.