How Google Might Use the Context of Links to Identify Link Spam

With Google’s Penguin update, it appears that the search engine has been paying significantly more attention to link spam: attempts to manipulate links and anchor text pointing to a page. The Penguin Update launched at Google on April 24th, 2012, and it was accompanied by a blog post on the Official Google Webmaster Central Blog titled Another step to reward high-quality sites.

The post tells us about efforts Google is undertaking to decrease rankings for sites that violate Google’s Webmaster Guidelines. It was written by Google’s Head of Web Spam, Matt Cutts, and in it Matt tells us that:

…we can’t divulge specific signals because we don’t want to give people a way to game our search results and worsen the experience for users, our advice for webmasters is to focus on creating high quality sites that create a good user experience and employ white hat SEO methods instead of engaging in aggressive webspam tactics.

The post does point out examples of the kinds of Web spam it targets, including keyword stuffing on the pages of a site, “unusual linking patterns,” and spun content. Last month, I wrote about a Google patent that described how Google might try to identify spun content, in the post Google Scoring Gibberish Content to Demote Pages in Rankings?

In 2004, Google filed for a patent that describes how the search engine might pay more attention to the context of a link, such as the words that surround it, to better understand that link. In the example of unnatural links from the Webmaster Central blog post, we can see clearly how links might be created in a way where their context makes little sense:

Examples of link spam from the Google Blog

Artificially Inflating Ranks of Documents Through Links

The patent points out a number of “techniques” that it targets, which are used to “artificially inflate the rank of a document, thereby degrading the quality of the search results.” These include:

Link-Based Spamming – This involves obtaining a large number of links to a page to increase the rank of that page. They give an example of link farms, and tell us that “some spammers pay owners of highly ranked documents to include a link to their document so as to increase the rank of their document.”

Anchor Text Spamming – This involves acquiring links from a large number of pages linking to a particular page using the same anchor text, to get that page to rank highly for that text in search results.

Google Bombing – Very similar to anchor text spamming, this approach has its roots more in disrupting search results as a joke or to make a political statement rather than for commercial or economic gain.

On-Site Framing – Many sites “frame” pages on their site with links such as “products” links, “jobs” links, “investor” links, etc., to try to “artificially inflate” the ranks of pages associated with these links.

To combat these techniques, the patent describes a way that the search engine might pay more attention to the “context” of links on a page to either boost or demote the rankings of those pages.

The patent is:

Ranking based on reference contexts
Invented by Anna Patterson and Paul Haahr
Assigned to Google
US Patent 8,577,893
Granted November 5, 2013
Filed: March 15, 2004

Abstract

A system ranks documents based on contexts associated with the documents. The system identifies a reference in a first document, where the reference is associated with a second document. The system analyzes a portion of the first document associated with the reference, identifies a rare word (or words) from the portion, creates a context identifier based on the rare word(s), and ranks the second document based on the context identifier.

Note that one of the inventors behind the patent is Anna Patterson, who is behind Google’s Phrase-Based Indexing patents.

How Ranking Based on the Context of Links Works

An example from the patent of a window around a link, where Google might look to find context identifiers for that link.

As the search engine crawls pages, it might identify the links on a page and capture a window of text around the link, such as a left window of the five words before the link and a right window of five words after the link. In the picture above, we see a link with the anchor text “Saturn,” and the text on the left includes “Beautiful of all the planets” and the text on the right includes the words, “Is surrounded by an elegant.”
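
In code, that window-capture step might look something like the following minimal sketch. This is an illustration only: plain-text tokenization stands in for real HTML parsing, and the example sentence and five-word windows come from the patent’s Saturn example.

```python
import re

def link_context_windows(text, anchor, window=5):
    """Capture the words immediately before and after a link's anchor
    text, mirroring the patent's five-word left and right windows.
    A sketch only: a real crawler would parse HTML, not plain text."""
    tokens = re.findall(r"\w+", text.lower())
    anchor_tokens = re.findall(r"\w+", anchor.lower())
    n = len(anchor_tokens)
    for i in range(len(tokens) - n + 1):
        if tokens[i:i + n] == anchor_tokens:
            left = tokens[max(0, i - window):i]
            right = tokens[i + n:i + n + window]
            return left, right
    return [], []

left, right = link_context_windows(
    "Most beautiful of all the planets, Saturn is surrounded by "
    "an elegant ring system.",
    "Saturn",
)
# left  -> ['beautiful', 'of', 'all', 'the', 'planets']
# right -> ['is', 'surrounded', 'by', 'an', 'elegant']
```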

The next step in this process is for Google to take what it believes is the “rarest” word from each of those segments of text around the link, measured over all of the documents on the Web that it has indexed, using a process such as inverse document frequency (IDF) weighting or a conventional linguistic modeling technique.

In this case, “planet” is the rarest word in the left window and “elegant” is the rarest word in the right window. The patent tells us that the number of words used in these windows might be more or fewer than five, and that other content from the pages where these links appear might also be used.

We are also told that only “real” words are used in this process, and that words might be identified as real by looking at whether they appear on the Web a minimum number of times, such as within 50 different documents. This keeps random blocks of text that might include symbols or numbers from being used.
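
Putting those two ideas together, the rare-word selection might be sketched like this. The document-frequency counts below are hypothetical stand-ins for Google’s index statistics; the 50-document “real word” threshold comes from the patent’s example.

```python
import math

def rarest_real_word(window_words, doc_freq, total_docs, min_docs=50):
    """Pick the highest-IDF ("rarest") word in a context window,
    skipping tokens found in fewer than min_docs documents, which
    the patent suggests as a filter for non-"real" words.
    doc_freq maps word -> number of indexed documents containing it
    (hypothetical counts standing in for a real index)."""
    best_word, best_idf = None, -1.0
    for word in window_words:
        df = doc_freq.get(word, 0)
        if df < min_docs:  # too rare to be a trusted "real" word
            continue
        idf = math.log(total_docs / df)
        if idf > best_idf:
            best_word, best_idf = word, idf
    return best_word

doc_freq = {"beautiful": 9_000, "of": 900_000, "all": 800_000,
            "the": 990_000, "planets": 1_200, "xq7w": 3}
rarest = rarest_real_word(
    ["beautiful", "of", "all", "the", "planets", "xq7w"],
    doc_freq, total_docs=1_000_000)
# rarest -> 'planets'  ('xq7w' is filtered out as not a real word)
```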

Many documents may link to the same page, and this context approach means capturing context information from potentially all of them. Where many pages include the same rare words near a link, a count of those pages is included in the context information. Since Saturn is a planet, chances are good that many links with the anchor text “Saturn” pointed at that page will have the word “planet” nearby. And since Saturn is often referred to as the Elegant Planet, it’s also likely that the word “elegant” will appear near links to a page about Saturn that use that anchor text.

These “context” scores for the rarest words around a link, or “context identifiers” as the patent calls them, are used to create a score for each link, which feeds into a ranking score for each document. Other elements that might be considered in such a score include:

  • The number of links to the document
  • The importance of the documents linking to the document
  • The freshness of the documents linking to the document
  • Other known ranking factors
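
A toy combination of that per-link context evidence with a base ranking signal might look like the following sketch. All of the weights and thresholds here are assumptions for illustration, not values from the patent, and the single `base_rank` input stands in for the other factors listed above.

```python
def document_rank(link_contexts, base_rank):
    """Combine per-link context evidence with a base ranking signal.
    Context identifiers corroborated by at least one other linking
    page contribute to the score; singletons contribute nothing.
    All weights here are illustrative assumptions."""
    context_score = 0.0
    for identifier, count in link_contexts:
        if count >= 2:  # identifier seen on other linking pages too
            context_score += min(count, 100) / 100.0
    return base_rank * (1.0 + context_score)

rank = document_rank(
    [("planet", 80), ("elegant", 40), ("zzqk", 1)],  # (identifier, count)
    base_rank=1.0,
)
# rank -> 2.2  (a 0.8 + 0.4 context boost on top of the base rank;
# the uncorroborated "zzqk" identifier adds nothing)
```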

If you look at the example above about unusual linking patterns from the Google Webmaster Central blog post, the words around those links aren’t words that would likely appear near them on a regular basis.

If there aren’t many matching context identifiers, or there are so many that they might be considered suspicious, any ranking value that the link might pass to the page being linked to might be ignored. The patent doesn’t say whether this value is a PageRank signal or a hypertext relevance signal, but it’s possible that both are involved.

The counts of these context identifiers might be tracked over time, so that a sudden surge of counts of those for a link might be identified. A page that acquires links to it in a short period of time that contain the same context identifiers might be considered suspicious and those links may not count in the ranking of the page linked to. A page that has a variety of valid context identifiers might be boosted in search results.
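
That temporal tracking, counting context identifiers across crawl periods and flagging sudden surges, might be sketched as follows. The surge threshold is an assumption for illustration; the patent only describes the general idea.

```python
def flag_suspicious_identifiers(history, surge_ratio=5.0):
    """Flag context identifiers whose count in the latest crawl period
    surged relative to the average of earlier periods; a sketch of the
    patent's idea that a sudden burst of identically-contexted links
    looks unnatural. surge_ratio is an assumed threshold.
    history: {identifier: [count_period_1, count_period_2, ...]}"""
    suspicious = set()
    for ident, counts in history.items():
        if len(counts) < 2:
            continue
        baseline = sum(counts[:-1]) / len(counts[:-1])
        if baseline > 0 and counts[-1] / baseline >= surge_ratio:
            suspicious.add(ident)
    return suspicious

flags = flag_suspicious_identifiers({
    "planet": [40, 45, 50],        # steady growth: looks natural
    "cheap-loans": [2, 3, 120],    # sudden surge: looks suspicious
})
# flags -> {'cheap-loans'}
```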

Takeaways

This patent was filed almost a decade ago, even though it was only granted recently. There’s no telling whether Google has used the process described within it, used it and replaced it with a different approach, or used it and continues to do so today.

The types of problems it is intended to solve, such as link spamming, anchor text spamming, Google Bombs, and framed on-site references, are issues that Google still seems to be battling. With things like the Penguin Update and the manual penalty notices sent to site owners in Google’s Webmaster Tools, Google has clearly been active in fighting these types of problems. Is this context identifier approach part of the process Google uses to identify unnatural linking? It looks like it would work on the Penguin Web spam example above.

How much attention are you going to be paying to the words that you place around links in the future?


42 thoughts on “How Google Might Use the Context of Links to Identify Link Spam”

  1. I have been saying to our copywriters for months now how important it is to ensure the wording around links is contextually relevant. It makes sense, and this will help a lot with people using content spinners, auto link placement tools, or online software that follows practices where the wording around the links makes no sense.

    Thanks Bill

  2. As always: Top quality content, Bill! And to your question: Not more attention than I’ve been giving for at least a year. This has been an important part of making a contextual link more valuable for quite some time.

  3. Hi Dewaldt,

    The idea that Google is looking at, and paying attention to text surrounding links, is something that has been floating around for a while, but how rare words might be picked out from that text, and other pages that might link to the same page explored to see how often those contextual identifiers are used isn’t something that comes up in those conversations. Good to hear that you’ve been insisting on contextually relevant wording. It definitely does seem like something that would be effective in identifying content spinners, auto link placement tools, or other online software where context is ignored.

  4. Hi Thomas,

    Thank you. I can’t claim that I personally make a conscious effort to always surround a link with contextually relevant words, but usually they seem to come out that way if the link you include really is relevant to what you’re writing about. We don’t have access to Google’s count of the use of words on the Web, or necessarily documents that all link to specific pages, so this is something that would be hard for us to try to game, even though the patent was filed almost a decade ago.

  5. If you write naturally, you don’t have to spend time ensuring that text around a link is contextually relevant. If you write with clarity in mind, then contextual relevance will follow. The only people who really need to spend time worrying about this are content spinners and people trying to link unnaturally within their documents. I’d rather spend my time using my skills to craft excellent content.

  6. Hi Allen,

    There are many people, including copywriters who have written many books on copywriting who believe that writing for the web means writing naturally, and then stuffing keywords into something they’ve written. I remember commenting on the blog of one of those copywriters on a post where he insisted that people do that, and telling him that he shouldn’t have even begun to write those articles until he knew what keywords might have been important in the first place.

  7. And what about comment links, especially when using exact match anchor text? I’m frustrated that Google still hasn’t managed to nip this one in the bud! For example, this page has 6191 comments, most of them spam! “http://blogs.creighton.edu/lai42188/2012/11/12/intro/”

  8. Old – but sounds still good.

    How do you think this relates to the ‘broad match’ or perhaps the ‘modified broad match’ terms in the Google AdWords keyword tool?

    Let’s say a term is in the computer broad match cluster.

    If an inbound link is surrounded by other terms from the same modified broad match cluster – would this be considered ‘no spam’ according to this?

  9. Of course, natural links will be in the right context, naturally. Back in 2006, another SEO told me his firm had tested and concluded that if one wanted to rank well for a term, it was more powerful to use partial match anchor text, but to have the other keywords directly adjacent to the anchor text. So, if my target term is key west seo design, I should just link the word ‘seo’ in that phrase. I did alter my anchor text patterns, and while I can’t say it was more effective than exact match, it certainly meant I was safe from the Penguin update’s effect on exact match anchor text. As always, you provide insight. Thanks, Bill.

  10. I don’t think they are currently using this patent; it is definitely not part of the Penguin algo. As far as the example MC posted in his article, they use the same old anchor context plus probably a few extras like money anchors and exact match, and I would also add a couple of signals like main page, # of eOBLs, and maybe a bad neighborhood signal as well. Here you have a multivariate logit that they can run till they get tired. Always remember they can’t do anything too complicated. Any non-linear optimization is practically off the table at the scale they need to sort through. They are pretty much at the limit of Bayesian methods – they all keep dancing around the same tree, just changing a few things here and there. Now, the knowledge graph may be helpful, and they do take the pain because this would give better contextual ranking than the Hilltop they are using now.

  11. This is very interesting Bill

    When I started copywriting 15 years ago, most of the content I produced was for company brochures. Then more clients wanted website copy, then it morphed into SEO website copy.

    In all that time I’ve followed the same process of writing for the target audience and trying to produce high quality work. In that context, any links that my clients use within the body text of a website page should always have contextual relevance – it’s only when they try to engineer it or keyword stuff etc that things become more complicated.

    I don’t know – all this link building lark is becoming way too complicated for my liking. I say write quality stuff, build a mailing list and a strong social media presence – and share it directly with your audience. Forget what Google is doing and any resultant SEO benefits will be a bonus :-)

    Cheers!

    Loz

  12. Why did it take so many years for the patent to be granted? They would have been able to fight spam better, and earlier, if it had been granted sooner.

  13. Rohit, assuming they use this technique (or have used it in the past) they would not need to wait for the patent to be granted. I can’t offer any insight into why a patent process takes so long.

  14. Hi Rohit and Michael,

    As Michael notes, Google wouldn’t necessarily have to wait until the patent was granted to start to use the methods within it, but there were some serious roadblocks along the path to having the patent being granted that might have caused Google to delay in relying upon the processes described within the patent.

    I looked through some of the documents that were filed in the prosecution of the patent, and it seemed that the patent examiner thought that some of the claims in the patent couldn’t be granted as they were presented originally, because they stepped on claims in another patent that was already granted – System for categorizing documents in a linked collection of documents.

  15. Hi Sam,

    We don’t know if Google pays any attention at all to any of the links within the comments on that post. I put quotation marks around the naked URL that you posted to keep it from becoming a live link, because I didn’t want to have a link to it from here.

    I also didn’t include any discussion of exact match domain names and anchor text that mirrors those names because the patent itself doesn’t really discuss those, and it’s not something that I really wanted to discuss either. That’s something that is outside of the scope of the patent itself, but might be addressed in other ways by Google. It really doesn’t play a role in looking at the context of a link based upon words in text that might surround a link.

    I did write a previous post about a patent that did seem aimed more at exact match anchor text – Google’s Exact Match Domain Name Patent (Detecting Commercial Queries).

  16. Hi Andreas,

    Just because something is old, doesn’t mean that it doesn’t have value. We do get a sense of the fact that Google was digging into how to value links based upon other text on pages where those links are from, and with almost a decade to think about it more, it’s quite possible that engineers at Google have spent some serious time since this patent was filed to explore other options and avenues as well.

    As for “broad match” concepts from paid search at Google, I try not to cross the beams from organic search and paid search too much if possible. The designation of what might be a “broad match” and an “exact match” concept in advertisements that Google might present to searchers might not be helpful when it comes to an analysis of the context of words in the determination of whether or not a link fits into the same or similar contexts from different pages that might link to the same page. It definitely wasn’t something covered within the description of the patent itself.

    The ideas within this patent may have evolved into looking not only for exactly matching (see what I did there) context identifiers (like “planet” or “elegant” in the example above) but also words that are somehow related to those words conceptually as synonyms (within the same context) or substitute terms. Related concepts, if there is a good way to identify them, might be even more useful as a way of understanding the context of links, and it is possible that Google has moved on to those.

  17. Hi M.J.

    Sometimes “natural writing” is more myth than reality. Some natural writing is obtuse and filled with cognitive dissonance, where the parts don’t fit together well, and there are a lot of internal inconsistencies among the different parts. Chances are that a good copywriter can avoid many of those traps, and include relevant and helpful text around a link that helps define it and gives visitors a better sense of where that link might go, and why it was included on a page in the first place. But that doesn’t always happen naturally, and sometimes it takes a lot of sweat and effort to achieve.

    Again, I purposefully avoided discussion of exact match and partial match anchor text and domain names in this post because they aren’t referred to in the patent, and discussing them would take away from the ideas that it does discuss. This post isn’t really about what anchor text is used, or what domain name is used, but instead focuses upon the text around a link, and the context that text might bring to a link.

    Now by including the other words near the “partial” link, you are providing some relevant and meaningful text around the link, but it’s questionable whether that text might be considered the rarest or most important text that surrounds those links. So it’s a little hard to say how helpful they might be.

  18. Hi Blackgold,

    Since I don’t work for Google, I can’t say with any certainty whether or not Google is using any of the ideas or processes behind this patent. It definitely falls into the realm of being obvious after you read it (but not necessarily before).

    As with many patents, the scope of this one is limited to one issue and one process and it doesn’t necessarily provide a roadmap to solving all of the issues or problems that Google might face in the Penguin update. It does seem to address the idea that sometimes links and the text that surrounds them have nothing to do with each other, and that could possibly be an indicator of low quality content that should be flagged and potentially ignored. It doesn’t touch on exact match domains and anchor text, but it doesn’t have to. That’s not the purpose of the patent, or the pain point it was intended to address, and it can work completely independent of that.

  19. Hi Loz,

    If you’re providing contextual relevance in the text surrounding your links, you should probably be treating this patent as a confirmation that what you are doing is a good thing.

    From both a practical stance, and a good user experience stance you are letting your visitors know why you’ve included a link elsewhere, and what it might be about, so that if they want to click on it, they have some idea of what they might see and an idea of why you sent them there. That’s a pretty good practice to follow regardless of whether there’s a patent from Google on the topic or not. Google isn’t telling you to do that with this patent, but us looking at the patent isn’t so much to get a set of guidelines from Google on how to do things, but rather a peek into some of the assumptions that they make.

    I’d rather learn from resources such as the following article from User Interface Design:

    Getting Confidence From Lincoln
    http://www.uie.com/articles/getting_confidence/

    As noted in that article, if there’s a good “Scent of Information” in a link, that it will deliver visitors to where they will actually want to go, it’s more likely that they will click upon it. Having strong and meaningful text around a link can help in that situation.

  20. Thanks for the article. Seems pretty obvious that if you write normally the context will happen naturally.

  21. Hi Bill,

    I don’t know if this patent is in use, but I’m quite sure that Google has for years been working to better understand link context, in part by analyzing the words surrounding an anchor text.

    Sorry again for my English, Bill, but first of all I would like to know if I have understood well what you wrote…

    This patent shows that, after identifying rare words surrounding an anchor text, Google can verify whether these rare terms surround other links that point to that page. This is to identify whether there’s a common context (common, not relevant).

    Why rare words? Does eliminating stop words and common words help to figure out the most significant pieces of text… and consequently the most opportune words to analyze to identify context?

    Assuming I have understood correctly, my opinion is that this patent seems more like a “relevancy first check” than a critical method for identifying spam links.

    Let’s think about someone building a number of links for a spammy keyword (such as P. Loan), but taking care to surround these links with terms like planet, elegant, solar system.

    Will Google consider these links relevant because they are in-context? Am I wrong if I think that phrase-based indexing is the other part of the puzzle, and if I say that “this patent cannot be efficient without phrase-based indexing”? (I think it was a really good idea to note that Anna Patterson is behind this patent.)

    Now, combined with phrase-based indexing, in my opinion it makes more sense, but don’t you think it could work only for old, dirty spam techniques?

    There are many different ways of link manipulation. In my opinion, the example of link spam you showed in your post is so dirty that it’s not hard for Google to identify it as spam.

    But let’s think about other spam link methods.

    Nowadays many spam links appear to be surrounded by some “quite good context terms,” and this makes things harder for Google.

    While I’m quite sure that this patent is in use (maybe for a “relevancy first check”), I think other methods are needed as well to identify spam links.

    Sorry for my long comment

  22. Hi Walid,

    No problems with the length of your comment. I appreciate your thoughtful response to my post.

    If the anchor text used in a link is irrelevant to the text that surrounds it, especially the more meaningful words from that text, it would make sense not to give that link as much weight as other links where the text is more relevant. Looking at how often those rare words appear in other text around links to the same site is something that a computer is capable of, so it isn’t odd to see that as part of this algorithm.

    I did think it was important to point out Anna Patterson’s involvement, as you noted, because of her involvement in coming up with phrase-based indexing, which is in many ways an improvement on the process described in this patent.

    But yes, this could be used as a first check in a more prolonged process. This patent was filed almost a decade ago, but there are still many people who insert completely irrelevant links into pages – see the example I posted above that accompanied the Penguin blog post, where the anchor text for the links has little to nothing to do with the text that surrounds it. If a process like this is used today, it’s probably used as an initial filter to identify a percentage of potentially spammy links. That doesn’t mean it’s the only step in use, but if it could be used to identify some percentage of potentially unnatural linking patterns, it would be worth continuing to use. I suspect it’s a process that has been tested, and if a better process to identify these types of links has been found that is more effective and efficient, chances are it’s been replaced.

  23. Nice post, I kind of never really paid attention to this. Maybe because I’m always making high quality content.
    But thanks for the tips. Will have to pay more attention to the whole context of my posts.

    Thanks

  24. Thanks for the post. The points you raise are ones that are constantly in debate. I’ve made it a point to test multiple variations of anchor text to a number of sites I own. I have even purchased links with particular anchor text pointing to sites I don’t care too much about. In my research, I haven’t seen a negative impact using keyword rich anchor text, but then again, how do you test SEO with all things being equal?

    This was my first time here and I look forward to stopping by again!

    Cheers,
    Mike

  25. The best way I’ve found to understand the importance of having relevant links and surrounding context, is to find examples when the author doesn’t adhere to this concept. G seems to be working hard to reward fresh, relevant content as opposed to keyword stuffing. Thanks for the info.

  26. Hi Bill,

    Good one there, but I have a question regarding anchor text. All my comments come with my name “Gilbert Samuel” or just “Gilbert”. This automatically makes my name an anchor text, although I’ve never used keywords in the name field of a comment before. I’m sure my name will be a hyperlink as soon as this comment gets approved, and that’s exactly how it’s going to be on many other sites, so what are you going to call this? Overusing anchor text?
    ps: If I’m wrong don’t mind me being a dummy

  27. Hi Gilbert,

    I’m going to call it a legitimate use of anchor text that doesn’t attempt to abuse hyperlinks in an attempt to rank for a commercial term. I wouldn’t worry too much about it.

  28. Thank you for the article. I am just starting to learn a bit more about SEO and this has been valuable. While I would like to see other sites link to my business, I did not realize that too much would be detrimental to my site’s ranking.

  29. Hi Bill!

    I am such an avid fan of yours. I really love reading your very informative posts. Yes, I totally agree with you: the search engine has been paying more attention to link spam as attempts to maneuver links and anchor text to a page. Looking forward to more great posts of yours!

  30. Why rare words? Does eliminating stop words and common words help to figure out the most significant pieces of text… and consequently the most opportune words to analyze to identify context?

  31. Hi Dan,

    Exactly. Trying to find words in common that are words that appear frequently isn’t going to be very helpful when it comes to better understanding context.

  32. I don’t know why Google hasn’t implemented something similar to this approach long ago to snag spammers. I just went through a DMCA process to remove a spammer that copied one of my sites and randomly spun “Acai berry” links into it. Surprisingly, maybe since they manually created a domain name very similar to my site’s URL, they ranked high on page one for my key phrases. A number of signals should have kicked in, of course, such as the fact that the content on the spam site was stripped wholesale from the legitimate site (including all the links and pages) and the original content had been there for years. This whole concept seems obvious, and something that should be on a checklist to test before promoting a page to rank highly for a particular phrase. I was pleased to find that both Google and Bing removed the fraudulent site quickly, but I hope they implement more checks. Targeting spammy words or phrases and doing a relevance check on anchor text seem appropriate.

  33. Yes, SEO is changing rapidly. Nowadays it is not just about links. In my opinion, people are still struggling to understand what Google wants from us.

  34. Hi Nathan,

    Definitely – there’s a lot more going on than just putting keywords on a page and attracting/acquiring links to it.

    Part of the reason why I spend a lot of time looking at patents and whitepapers from the search engines is to gain more insight into the kinds of things that Google wants. :)

  35. Thanks for this excellent information. What worries me is that any SEO practice has the potential to become black hat at Google’s whim. We were innocently using the anchor text “kitchen design software” for some time and ranked well, but Panda has hit us badly. It is almost impossible to predict what will happen next; they may choose to penalize those who use keywords around the anchor text!
