If a search engine were to collect a list of the links pointing to a page, along with all of the text used in those links (anchor text), it might be possible to learn a lot about the page being pointed towards by looking at the words used in those links. But what if there aren’t many links pointing to that page?
That’s the problem explored in a recent paper, Building Enriched Document Representations using Aggregated Anchor Text (pdf), by Donald Metzler, Jasmine Novak, Hang Cui, and Srihari Reddy, at Yahoo! Labs.
The authors of the paper refer to that problem as the anchor text sparsity problem, and they have come up with an interesting way to try to address it.
Here’s a simple example from the paper:
Let’s look at a hypothetical web page at the URL “http://dancing.com/lindyhop.html”.
Imagine that there are a couple of pages on the same domain (dancing.com) that point to that URL.
And, there are a couple of pages from outside of the site that also link to the “lindyhop.html” page.
If we look at the anchor text from the pages outside of the site, otherwise referred to as external links, linking to “lindyhop.html” we might see terms used in those links such as “Lindy Hop” and “swing dancing.” We’ve learned a little about the page being pointed towards by looking at that anchor text.
Can we learn even more by looking at the anchor text used in the links pointing to the two pages within the “dancing.com” site that link to our “lindyhop.html” page? It’s possible. If external links pointing to those two pages include anchor text such as “Savoy Ballroom” and “dances in New York,” then we have learned something more about what the “lindyhop.html” page may be about.
Aggregating Anchor Text Relevance Along the Web Graph
If you create a visual representation of the Web as a collection of destinations (web pages, videos, images, executable files, etc.), and connections between those files (links), you could come up with a pretty complex-looking graph, which you might refer to as the “Web Graph.”
Many of the techniques behind ranking web pages, with PageRank being one example, involve the use of a graph like that. PageRank looks at the links themselves, and not the anchor text associated with those links. When search engines do look at anchor text in links pointing to a page, they may limit themselves to only looking at anchor text in links one step away along that Web Graph, the links that point directly to the page.
But, what if you were to aggregate that anchor text, so that you looked at anchor text in connections, or links, two or three steps back along that Web Graph? Might aggregating that anchor text help a search engine get a better idea of what a page is about by looking at anchor text more than one link away?
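To make the idea concrete, here is a minimal Python sketch of that kind of aggregation, using the dancing.com example from above. It is only an illustration under my own assumptions, not the method from the Yahoo! paper: the linking hosts ending in “.example”, the internal page names, the function name aggregate_anchor_text, and the decay parameter that discounts anchor text found further back are all hypothetical. The sketch counts anchor text only from external links, and follows same-site links backwards to find more external anchor text.

```python
from collections import Counter, defaultdict
from urllib.parse import urlparse

# Hypothetical link data: (source URL, target URL, anchor text).
LINKS = [
    ("http://swingfans.example/links.html", "http://dancing.com/lindyhop.html", "Lindy Hop"),
    ("http://jazzage.example/dance.html", "http://dancing.com/lindyhop.html", "swing dancing"),
    ("http://dancing.com/history.html", "http://dancing.com/lindyhop.html", "more"),
    ("http://dancing.com/harlem.html", "http://dancing.com/lindyhop.html", "click here"),
    ("http://nycguide.example/venues.html", "http://dancing.com/history.html", "Savoy Ballroom"),
    ("http://travel.example/ny.html", "http://dancing.com/harlem.html", "dances in New York"),
]


def host(url):
    """Return the host part of a URL, used here as a stand-in for 'site'."""
    return urlparse(url).netloc


def aggregate_anchor_text(target, links, max_hops=2, decay=0.5):
    """Collect external anchor text for target, up to max_hops links back.

    Assumption: anchor text found at hop h is weighted decay ** (h - 1),
    so directly pointing links count 1.0 and each extra step counts less.
    """
    # Index the links by the page they point to.
    inlinks = defaultdict(list)
    for src, dst, text in links:
        inlinks[dst].append((src, text))

    weights = Counter()
    frontier = {target}
    seen = {target}
    for hop in range(1, max_hops + 1):
        next_frontier = set()
        for page in frontier:
            for src, text in inlinks.get(page, []):
                if host(src) != host(page):
                    # External link: keep its anchor text.
                    weights[text.lower()] += decay ** (hop - 1)
                elif src not in seen:
                    # Same-site link: step back to the linking page instead.
                    seen.add(src)
                    next_frontier.add(src)
        frontier = next_frontier
    return weights


if __name__ == "__main__":
    result = aggregate_anchor_text("http://dancing.com/lindyhop.html", LINKS)
    for text, weight in result.most_common():
        print(f"{weight:.2f}  {text}")
```

With max_hops=1 the function only sees “lindy hop” and “swing dancing.” With max_hops=2 it also picks up “savoy ballroom” and “dances in new york” at a lower weight, which is the kind of extra evidence the lindyhop.html example describes. Discounting the more distant anchor text is a design choice of this sketch, on the guess that text further away is less likely to describe the target page.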
Here’s what the authors of the Yahoo! paper tell us:
Our work has four primary contributions. First, to the best of our knowledge, we are the first to directly formulate and address the anchor text sparsity problem.
Second, we propose a number of methods for aggregating anchor text across the web graph.
Third, we propose various ways to use the aggregated anchor text to build enriched document representations.
Finally, we show that our enriched document representations, when used in conjunction with a state-of-the-art ranking function, results in significant improvements in retrieval effectiveness on a very large web test collection.
In my last post, Search Engines Applying Different Anchor Text Relevance from the Same Site and Related Site Links, I looked at a paper from Microsoft that explored how a search engine might give different weights to anchor text in different links based upon things such as whether the links were from the same site, from a related site, or from an unrelated site.
That paper didn’t explore what a search engine might do when a page just didn’t have many links pointing to it at all. Aggregating anchor text from links further back along the link graph, to address the Anchor Text Sparsity Problem, might sometimes help in that instance.
There are a number of questions addressed in this paper that make it worth exploring further than my overview, including how much weight anchor text might be given when there are multiple links using that anchor text from the same site, or the same anchor text used from different sites, and how to avoid “spam, unrelated, or simply junk anchor text” when building an understanding of what a page might be about. This method also appears to be more helpful for finding pages that fit queries that are longer and more informational in nature than for shorter queries that tend to be navigational.
If you’re interested in learning about how a search engine might attempt to understand what your pages are about by looking at the anchor text in links pointing to your pages, and by looking at the anchor text in links pointing to other pages within your site that link to your pages, you may want to spend some time with this paper.