If a search engine were to collect a list of links pointing to a page, and all of the text used in links to those pages (anchor text), it might be possible to learn a lot about the page being pointed towards by looking at the words used in those links. But what if there aren’t many links pointing to that page?
That’s the problem explored in a recent paper, Building Enriched Document Representations using Aggregated Anchor Text (pdf), by Donald Metzler, Jasmine Novak, Hang Cui, and Srihari Reddy, at Yahoo! Labs.
The authors of the paper refer to that problem as the anchor text sparsity problem, and they have come up with an interesting way to try to address the problem.
Here’s a simple example from the paper:
Let’s look at a hypothetical web page at the URL “http://dancing.com/lindyhop.html.”
Imagine that there are a couple of pages on the same domain (dancing.com) that point to that URL.
And, there are a couple of pages from outside of the site that also link to the “lindyhop.html” page.
If we look at the anchor text from the pages outside of the site, otherwise referred to as external links, linking to “lindyhop.html,” we might see terms used in those links such as “Lindy Hop” and “swing dancing.” We’ve learned a little about the page being pointed towards by looking at that anchor text.
Can we learn even more by looking at the anchor text used in the links pointing to the other two pages within the “dancing.com” site that link to our “lindyhop.html” page? It’s possible. If external links pointing to those two pages include anchor text such as “Savoy Ballroom” and “dances in New York,” then we’ve learned something more about the page.
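To make that example concrete, here is a minimal sketch of the idea in Python. The URLs and anchor phrases come from the hypothetical example above; the internal page names (“history.html” and “nyc.html”), the data structures, and the 0.5 down-weighting of internal-page anchor text are my own illustrative assumptions, not details from the paper.

```python
from collections import Counter

# External anchor text observed in links pointing directly at each page.
# The two internal page names (history.html, nyc.html) are hypothetical.
external_anchors = {
    "http://dancing.com/lindyhop.html": ["Lindy Hop", "swing dancing"],
    "http://dancing.com/history.html": ["Savoy Ballroom"],
    "http://dancing.com/nyc.html": ["dances in New York"],
}

# Internal links: which pages on dancing.com link to which other pages.
internal_links = {
    "http://dancing.com/history.html": ["http://dancing.com/lindyhop.html"],
    "http://dancing.com/nyc.html": ["http://dancing.com/lindyhop.html"],
}

def enriched_anchors(page, internal_weight=0.5):
    """Combine a page's own external anchor text with down-weighted
    anchor text from internal pages that link to it."""
    counts = Counter({anchor: 1.0 for anchor in external_anchors.get(page, [])})
    for source, destinations in internal_links.items():
        if page in destinations:
            for anchor in external_anchors.get(source, []):
                counts[anchor] += internal_weight
    return counts
```

Calling enriched_anchors on the “lindyhop.html” URL combines its own external anchor text (“Lindy Hop,” “swing dancing”) with the down-weighted anchor text pointing at the two internal pages (“Savoy Ballroom,” “dances in New York”).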
Aggregating Anchor Text Relevance Along the Web Graph
If you create a visual representation of the Web as a collection of destinations (web pages, videos, images, executable files, etc.), and connections between those files (links), you could come up with a pretty complex looking graph, which you might refer to as the “Web Graph.”
Many of the techniques behind ranking web pages, with PageRank being one example, involve using a graph like that. PageRank looks at the links themselves, and not the anchor text associated with those links. When search engines do look at anchor text in links pointing to a page, they may limit themselves to anchor text in links one step away along that Web graph – links pointing directly to the page.
But, what if you were to aggregate that anchor text, so that you looked at anchor text in connections, or links, two or three steps back along that Web Graph? Might aggregating anchor text from more than one link away help a search engine better understand what a page is about?
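One simple way to picture that kind of aggregation is a breadth-first walk backwards along the links pointing at a page, discounting anchor text by how many steps away it was found. The decay factor and the traversal below are my own assumptions for illustration; the paper proposes several different aggregation methods rather than this particular one.

```python
from collections import Counter, deque

def aggregate_anchor_text(target, inlinks, anchors, max_steps=2, decay=0.5):
    """Aggregate anchor text for `target` from links up to `max_steps`
    back along the web graph, discounting each step by `decay`.

    inlinks[p] lists the pages linking to p; anchors[p] lists the anchor
    text of links pointing directly at p."""
    aggregated = Counter({a: 1.0 for a in anchors.get(target, [])})
    frontier = deque([(target, 0)])
    seen = {target}
    while frontier:
        page, steps = frontier.popleft()
        if steps == max_steps:
            continue
        for source in inlinks.get(page, []):
            if source in seen:
                continue
            seen.add(source)
            # Anchor text pointing at `source` is one more step away.
            weight = decay ** (steps + 1)
            for anchor in anchors.get(source, []):
                aggregated[anchor] += weight
            frontier.append((source, steps + 1))
    return aggregated
```

For a toy chain where page A links to B and B links to C, anchor text pointing directly at C counts fully, anchor text pointing at B counts at 0.5, and anchor text pointing at A counts at 0.25.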
Here’s what the authors of the Yahoo! paper tell us:
Our work has four primary contributions. First, to the best of our knowledge, we are the first to formulate and address the anchor text sparsity problem directly.
Second, we propose many methods for aggregating anchor text across the web graph.
Third, we propose various ways to use the aggregated anchor text to build enriched document representations.
Finally, we show that our enriched document representations, when used in conjunction with a state-of-the-art ranking function, result in significant retrieval effectiveness improvements on a very large web test collection.
In my last post, Search Engines Applying Different Anchor Text Relevance from the Same Site and Related Site Links, I looked at a paper from Microsoft that explored how a search engine might give different weights to anchor text in different links based upon things such as whether or not the links were from the same site, a related site, or an unrelated site.
That paper didn’t explore what a search engine might do when a page just didn’t have many links pointing to it at all. Aggregating anchor text from links further back along the link graph, to address the Anchor Text Sparsity Problem, might sometimes help in that instance.
Several questions are addressed in this paper that make it worth exploring further than my overview, including how much weight anchor text might be given when multiple links from the same site use that anchor text, or when the same anchor text is used from different sites, and how to avoid “spam, unrelated, or simply junk anchor text” when building an understanding of what a page might be about. This method also appears to be more helpful for longer, more informational queries than for shorter queries that tend to be navigational.
If you’re interested in learning about how a search engine might attempt to understand what your pages are about by looking at the anchor text in links pointing to your pages, and by looking at the anchor text in links pointing to other pages within your site that link to your pages, you may want to spend some time with this paper.
19 thoughts on “The Anchor Text Sparsity Problem”
Nice explanation of anchor text and the sparsity problem. Apart from this, how is anchor text considered when nofollow or noindex tags are used? Because Google strictly follows nofollow and noindex tags compared to the other search engines.
If a page linked to has a meta noindex/nofollow tag within its head section, then that page shouldn’t be included in a search engine’s index. That doesn’t mean that a search engine might not keep information that it sees about the link pointing to the page, such as the URL and the anchor text used in the link. We shouldn’t see information about the page with the noindex/nofollow in a search engine’s search results, but as you ask, might it be used in some other way?
It’s possible that the information could be collected to give a search engine an idea of what kinds of content might be contained on a web site, even on pages that aren’t indexed and shouldn’t be. This might include creating a profile of a site for personalization purposes, or categories for pages on the site for local search or the display of adsense, or for other reasons.
You would think that is the case with the nofollow tag…but I think Google still looks at nofollow tags when evaluating page links. It is amazing to find that sometimes there is a page with almost zero links that somehow manages to outrank others on a consistent basis. And sometimes that page has very few visits. On my site I have a few pages that consistently outrank others, and there is really no obvious reason for it. But it seems to happen. And whenever a new company starts in my town…well, their page gets above mine for a few weeks, then typically drops. Not that I am a king of backlinks…most I let slowly build up. But I have seen some crazy text links out there. Text link spam is really a problem also.
I’m not sure if by “nofollow tag” you mean the noindex/nofollow meta robots value (<meta name="robots" content="noindex, nofollow" />) or the nofollow value that one can use with a rel attribute on a link element or an anchor (a) element (<a href="http://www.example.com" rel="nofollow">). I think it was a serious mistake to have named the second one “nofollow,” because of the confusion that it causes.
Google may collect information about all of the links that it finds on pages when it crawls a page, including the destination URL, the anchor text used, and whether or not it might have a nofollow attribute. The blog post from Google that originally announced their use of the nofollow value in links didn’t tell us exactly how it works, and didn’t tell us that it would stop both the passage of PageRank and the use of anchor text in a link to determine the relevance of pages being pointed towards.
It’s possible that use of that nofollow value in links doesn’t stop Google from using the anchor text to determine what the page being linked to is about, but we haven’t heard that directly from the search engine.
As for why a page that doesn’t have many links pointing to it may still rank well, there are a few reasons why it might for certain query terms, regardless of the lack of links:
1. There may just not be a lot of competition for those query terms. Since you don’t receive very much traffic for those pages, even though you rank well for them, that’s a possibility.
2. PageRank depends upon the quality of links more than the quantity – a single link from a highly ranked page could be worth more in terms of PageRank than many thousands of links from pages that have much lower PageRanks.
3. Those pages may be “very” relevant for the query terms used – much more than other pages in the search results – which means that they don’t need as much value from query-independent ranking signals like PageRank to rank highly.
As for new companies (and, I’ll assume, new web pages to go with those companies), one common reason why a site might start out ranking highly and then seem to disappear is that it’s not unusual for a new site to initially rank well for some terms (often while they are indexed in a temporary search index – a stop index, as in “stop the press, we have new content”), and then drop in rankings as the content from that page is moved over to the main search index (quite likely a multi-tiered index).
I agree that text link spam is a problem – I get a lot of spam comments here where the commenters, automated and human, use anchor text in the name field instead of an actual name or a nickname that they commonly go by in many places. I delete most of those.
So, basically, this paper essentially introduces and highlights a concept of themed links and on-site pages, rather than simply sticking to analyzing a one-step set of links. Isn’t this something we argued about years ago – that Google in particular looks at whether a site fits a topic or not?
Now, are you sure Google doesn’t take into account the anchor text of links two or three steps back in the web graph? PageRank may not do this, but the PR weight from other pages on the site does carry over to the examined landing page, so it should still work in a similar way, somehow.
Then again, I’ve neglected chasing the SEO secrets lately, so I’m probably missing something (like your other post on the topic about Google/PR) here.
I’m glad that you asked that, because it might be tempting to think that an approach like this might be an attempt at theming.
It’s not really a question of themed links as much as it is looking at anchor text in:
1. External links pointing directly to a page
2. External links pointing to other pages on the same site as that page that link to the page.
The problem we are told it attempts to solve is when there just aren’t many links pointing to a page, which the paper’s authors refer to as the “Anchor Text Sparsity Problem.”
It’s not a question of whether a site fits a topic as much as it is trying to learn more about one specific page by increasing the amount of anchor text that might be looked at for that page.
PageRank flows along the entire Web Graph, but hypertext relevancy (or anchor text relevancy) appears to have been used only from direct links to a page.
While this paper is from Yahoo, it is possible that other search engines have experimented with similar ideas, including Google.
And here is a moral of a very old SEO story:
“Rotate your anchor link text when you build up your inbound links, and also while interlinking the site (internal links).” I am not sure whether a well-interconnected site with well-rotated anchor link text can get rid of link sparsity, but it can play a major role in what the pages of a site tell us about a particular page on that same site, if not from other sites.
Bill kudos to your excellent post. Hope you continue your efforts for the betterment of SEO.
I was reading a case study the other day at Sphinn that was basically proving that Google is following nofollow links.
By this I just want to say that we can never be sure with Google, and things might change in a flash.
It’s long been a good idea to use anchor text that is both descriptive and varied when pointing to a particular page. Not only for the sake of SEO, but also from a usability stance.
For instance, I really like the User Interface Engineering article on finding The Right Trigger Words. Descriptive anchor text in a link helps visitors find what they are looking for and meet the goals they might have for a visit to a site. And that descriptive anchor text can also help when it comes to a search engine looking at anchor text to understand what a page being linked to is about.
Varying the anchor text that you use in links pointing to a page not only helps search engines, but it’s also useful in presenting that link in an engaging and persuasive manner appropriate to the context in which the link is provided. Making sure that the anchor text is descriptive helps in both instances.
There’s nothing in this paper or my post about nofollow links. The question that was raised in comments was, do search engines look at hypertext from links that have a nofollow rel link value? The answer is, we don’t know for certain, but it is something that I would suggest is worth testing for yourself, case study or not.
Do you think it is valuable to always use your main search terms in the anchor text, or do you think it is smarter to always switch up the anchor text to make it appear more natural, while attempting to always include relevant words in the link text?
I don’t think that it’s a bad idea to use your main search terms in anchor text, but it is probably a good idea to also include terms in other links that might reasonably be considered to be related to those terms (as well as descriptive of the content on the page being linked to). It should “appear” more natural as much as it should be more natural. 🙂
This is a great outline of the anchor text issue. I think if one looks at Google’s first crawl, back to where it all began, and follows links from that out into the web, then using aggregated anchor text for sites that had few links pointing to them would be ideal. This also supports the idea that when getting links from sites, it’s important to keep both the relevancy of your page and of the site linking to you as high as possible, and also to be aware of the links in the same vicinity as your links on the page. Overall, another great post, Bill.
Thank you, Answer Blip.
I’m not sure that a search engine would want to look at the anchor text too many links back – you might start seeing some fairly irrelevant anchor text that way. But I do like the idea of aggregating some anchor text like they describe in the paper. Relevancy between links and sites can be nice, but you don’t always have control over who links to your pages, and what anchor text they might use. That can make things interesting.
Good points and a great post. Honestly, I find it very difficult to use anchor text in a variety of forums without getting your text adjusted or deleted by the site moderator. It is a very difficult art to master indeed. As you said, you don’t always have control over who links to your material, so you can get all kinds of various, and sometimes less profitable, anchor texts. We may be presented with a solution to this as AI evolves over the years and we can have a form of spidering that mimics human-edited directories like DMOZ.
Thank you. There are things that you can control, such as the titles that you give to pages and blog posts, and the content and anchor text that you use on your own site, and things that you can’t control. That’s true about linking and anchor text and many other things we come across in our lives. 🙂
I’m of the belief that if you write engaging, interesting, memorable titles and content, that you’ll find more links created by others interested in what you’ve created, and interested in sharing your writings, pointing to what you write using words that are relevant to what you’ve written about.
Hi Bill and Joel,
I agree with what Bill said on using some keyword variations to make it appear more natural to the search engines. Just make sure that these variations still have some relevance to the key phrase to get the best results in building links to your target pages.
I do it by allocating fifty percent of the links I build to the home page using the actual URL. For example, if we were building links to our website (Melbourne SEO Services) we would use http://www.melbourneseoservices.com.
Next is to allocate about thirty percent of these links and use the actual keyword phrase in the anchor text. To continue the previous example, these are the words “Melbourne SEO Services,” the main key phrase that we’re trying to rank for.
Last is to distribute the remaining twenty percent of links linking to the home page using other variations of the keyword. For this example, you can use “SEO Services,” “search engine optimisation services,” and other relevant keywords.
Please take note that these are just some quick hints to help you get started right away and to keep you on track. Just make sure to evaluate the results as you apply these tips so you can adjust your linking strategies as you go along.
Thanks. Another approach that is often worth pursuing is to look at other pages that may rank well for the term or phrase that you are optimizing for, and see if certain phrases keep appearing upon those pages. It’s possible that Google may see those terms as related because they tend to co-occur on pages for the term you’re targeting with your optimization.