Ranking Webpages Based upon Relationships Between Words (Google’s Co-Occurrence Patent)

My last post, Not All Anchor Text is Equal and other Co-Citation Observations, was a response to a Whiteboard Friday video posted a couple of weeks ago at the SEOmoz Blog, Prediction: Anchor Text is Dying…And Will Be Replaced by Co-citation. I didn’t expect my next post (this one) to revisit that post and its observation that the way certain words co-occur on pages might be a ranking signal that Google is using.

Rand noted that first page rankings for three different pages, which didn’t seem very much optimized for the queries they were returned for, might be ranked based upon a ranking signal that looks at how words tend to co-occur on pages related to those queries. My post in response explored some reranking approaches by Google that also might account for those rankings, including Phrase Based Indexing, Google’s Reasonable Surfer Model, Named Entity Associations, Category associations involving categories assigned to queries and categories assigned to webpages, and Google’s use of synonyms in place of terms within queries.

Google’s Phrase-Based Indexing approach pays a lot of attention to words (phrases, actually) that appear together, or co-occur, in the top (10/100/1,000) search results for a query, and may boost pages in rankings based upon that co-occurrence. It seemed like a possible reason why those pages might be appearing on the first page of results. The other reranking approaches that I included also seemed like they might be partly or fully responsible for the rankings. Then I found a patent granted to Google this week that seems like an even better fit.

Word Relationships and Document Rankings

The sign below is in front of an old hotel on the outskirts of town. People used to stop at the hotel on their way to Skyline Drive in the Shenandoah National Park about 30 minutes away. The two words in the sign, “Vacancy” and “Enter” convey a lot of meaning with a minimum of words.

What if you could pick out the most significant words in a document, and based upon how close they might be to each other, determine both how related they are and how significant they might be to the document they appear within together?

An old sign stating that there is a vacancy at the hotel it appears in front of.

Perform a query at Google for a term such as “mockingbird” and take the top 1,000 or so documents that appear in the search results responding to that search. Extract most of the terms from those documents after marking where they appear on the page, and calculate scores for each of the words based upon things such as how many times they occur in a document, and how close to the beginning of the document they might be.
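
To make the term scoring step a little more concrete, here is a minimal sketch. The patent summary above doesn't give an exact formula, so the weighting used here (occurrence count multiplied by an "earliness" factor) and the function name `score_terms` are assumptions for illustration only.

```python
import re
from collections import defaultdict


def score_terms(text):
    """Score each term by how often it occurs and how early it first appears."""
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return {}
    counts = defaultdict(int)
    first_pos = {}
    for i, tok in enumerate(tokens):
        counts[tok] += 1
        first_pos.setdefault(tok, i)  # remember only the first position
    n = len(tokens)
    # Assumed weighting: count times an earliness factor in the range (0, 1].
    return {t: counts[t] * (1.0 - first_pos[t] / n) for t in counts}


scores = score_terms("Mockingbird songs. The mockingbird sings at night.")
```

In this toy example "mockingbird" scores higher than "night" because it appears twice and shows up at the very start of the text.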

Perform a capitalization analysis and a part of speech analysis to determine if the terms might be nouns, proper nouns, named entities, or even nuggets of information such as sentences. These might be scored higher than verbs or other types of terms within the document. Other types of analysis might also be used to determine if a term is a named entity.
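
A real part of speech or named entity analysis would need a trained language model, but the capitalization part can be roughed out on its own. The sketch below is only a heuristic I'm assuming for illustration (the name `likely_proper_nouns` is made up): it treats a capitalized word that doesn't start a sentence as a possible proper noun.

```python
import re


def likely_proper_nouns(text):
    """Return capitalized words that do not start a sentence."""
    found = set()
    # Naive sentence split: break after ., ! or ? followed by whitespace.
    for sentence in re.split(r"(?<=[.!?])\s+", text.strip()):
        words = re.findall(r"[A-Za-z][A-Za-z'-]*", sentence)
        for word in words[1:]:  # skip the sentence-initial word
            if word[0].isupper():
                found.add(word)
    return found


names = likely_proper_nouns(
    "The hotel sits near Skyline Drive. People visit Shenandoah often."
)
```

Terms flagged this way could then be given a higher weight than verbs or other ordinary words, as described above. The heuristic misses proper nouns that open a sentence, which is one reason other kinds of analysis might be used alongside it.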

Filter out the terms that tend to appear pretty commonly on the Web using something like a term frequency–inverse document frequency (TFIDF) score for those documents to see which terms are common. The top 20 or so terms that are above a certain threshold based upon the TFIDF analysis might be kept for a document, and the rest eliminated. These remaining terms are the most significant terms in the document.
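
Here is a small sketch of what that filtering might look like. It's a simplification under a few assumptions: document frequency is measured over the result set itself rather than the whole Web, the TF-IDF variant is the textbook one, and the function name, `keep`, and `threshold` parameters are mine rather than the patent's.

```python
import math
from collections import Counter


def top_terms_by_tfidf(docs, keep=20, threshold=0.0):
    """docs is a list of token lists; return the top TF-IDF terms per document."""
    n_docs = len(docs)
    doc_freq = Counter()
    for tokens in docs:
        doc_freq.update(set(tokens))  # count each term once per document
    result = []
    for tokens in docs:
        tf = Counter(tokens)
        scored = {
            t: (c / len(tokens)) * math.log(n_docs / doc_freq[t])
            for t, c in tf.items()
        }
        # Keep terms above the threshold, best first, ties broken alphabetically.
        kept = sorted(
            (t for t, s in scored.items() if s > threshold),
            key=lambda t: (-scored[t], t),
        )[:keep]
        result.append(kept)
    return result


docs = [
    ["the", "mockingbird", "sings"],
    ["the", "java", "island"],
    ["the", "java", "language"],
]
significant = top_terms_by_tfidf(docs)
```

A word like "the" appears in every document, so its score is zero and it gets dropped, while rarer words survive as the significant terms.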

Then calculate relationship scores for the terms left over in each document. Words that interact in a document by being in close proximity to each other are said to have a relationship. A close proximity might be seen if the words appear in the same sentence, or the same paragraph, or within a certain number of sentences from each other. These are local term relationships. If one of the remaining terms has no local term relationships with any of the other terms, it is disregarded.
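
As a rough illustration of finding local term relationships, the sketch below pairs up significant terms that appear within a set number of sentences of each other. The sentence splitting is naive, and the `max_gap` parameter and function name are assumptions of mine.

```python
import re
from itertools import combinations


def local_relationships(text, significant_terms, max_gap=1):
    """Return (pairs, related): term pairs within max_gap sentences, and the
    set of terms that take part in at least one such pair."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    positions = {}  # term -> indices of the sentences it appears in
    for idx, sentence in enumerate(sentences):
        words = set(re.findall(r"[a-z']+", sentence.lower()))
        for term in significant_terms:
            if term in words:
                positions.setdefault(term, []).append(idx)
    pairs = set()
    for a, b in combinations(sorted(positions), 2):
        if any(abs(i - j) <= max_gap for i in positions[a] for j in positions[b]):
            pairs.add((a, b))
    related = {t for pair in pairs for t in pair}
    return pairs, related


text = (
    "Java is a programming language. It runs on many devices. "
    "Coffee is popular. Tea too. Some like espresso."
)
pairs, related = local_relationships(
    text, {"java", "programming", "coffee", "espresso"}
)
```

Here "java" and "programming" share a sentence, so they form a relationship. "Coffee" and "espresso" are two sentences apart, which is beyond the assumed gap of one, so both would be disregarded.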

Term scores for a document are calculated based upon the first position of the term in the document, and the closest distance between that term and another of the significant terms, measured by the number of sentences apart they might be. If they are in the same sentence, that distance would be zero.
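
The two inputs to a term score (first position, and closest distance to another term) can be combined in many ways. The sketch below uses one made-up combination, 1 / ((1 + first position) × (1 + closest distance)), simply to show how earlier and closer terms could end up with higher scores. The formula and the name `term_score` are assumptions and aren't taken from the patent.

```python
def term_score(term, positions):
    """positions maps each significant term to the sentence indices where it appears."""
    own = positions[term]
    others = [i for t, idxs in positions.items() if t != term for i in idxs]
    if not others:
        return 0.0  # no other term to relate to
    # Closest distance, in sentences, between this term and any other term.
    gap = min(abs(i - j) for i in own for j in others)
    # Assumed formula: earlier first position and smaller gap both raise the score.
    return 1.0 / ((1 + min(own)) * (1 + gap))


positions = {"java": [0, 4], "programming": [0], "island": [6]}
java_score = term_score("java", positions)
island_score = term_score("island", positions)
```

With these positions, "java" shares the first sentence with "programming" (distance zero), so it gets the top score under this formula, while "island" first appears late and sits two sentences from its nearest neighbor.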

An image from the patent showing a flow of local term relationship scores into document scores that could influence rankings of those documents.

After the terms in all of the documents have been extracted and scored and word relationship scores have been figured out, determine relationships among the documents based on the local term relationships and on the initial order of the documents. A score for each of those documents can be generated by looking at which documents have terms in common and, among those documents with common terms, using something like a combination of the original ranking score and a document score based upon all of the term relationship scores within each document.
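
One way to picture that combination is the sketch below. Everything specific in it is an assumption on my part: the document score is taken to be the sum of a document's term scores for terms it shares with at least one other document, and it is blended with the original ranking score using a hypothetical `weight`.

```python
def rerank(docs, weight=0.5):
    """docs: list of dicts with 'id', 'initial_score', and 'term_scores'
    (term -> score). Returns document ids, best first."""
    final = []
    for doc in docs:
        shared = set()
        for other in docs:
            if other is not doc:
                shared |= doc["term_scores"].keys() & other["term_scores"].keys()
        # Only terms that also appear in another document contribute.
        doc_score = sum(doc["term_scores"][t] for t in shared)
        final.append((doc["initial_score"] + weight * doc_score, doc["id"]))
    return [doc_id for _, doc_id in sorted(final, key=lambda p: (-p[0], p[1]))]


docs = [
    {"id": "a", "initial_score": 1.0, "term_scores": {"java": 0.2}},
    {"id": "b", "initial_score": 0.9, "term_scores": {"java": 1.0, "programming": 1.0}},
    {"id": "c", "initial_score": 0.8, "term_scores": {"programming": 0.5, "unique": 5.0}},
]
order = rerank(docs)
```

In this example document "b" starts in second place but moves to first, because its strongly scored terms are shared with the other documents. The high score for "unique" in document "c" doesn't help it, since no other document contains that term.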

The patent tells us that the advantages of using this method are:

  • More diverse search results are presented for an ambiguous query
  • The order in which search results are presented may be re-arranged such that the topmost search results introduce a more diverse range of information
  • Relationships between documents that have no hyperlinks connecting them to each other can be determined
  • Terms related to a given term in a corpus of documents can be identified and presented as related terms that can be used as navigational references to documents in the corpus

The patent is:

Document ranking using word relationships
Invented by Sharad Jain
Assigned to Google
US Patent 8,321,409
Granted November 27, 2012
Filed: June 30, 2011

Abstract

Methods, systems, and apparatus, including computer program products, for scoring documents. A plurality of documents with an initial ordering is received. Local term relationships between terms in the plurality of documents are identified, each local term relationship being a relationship between a pair of terms in a respective document.

Relationships among the documents in the plurality of documents are determined based on the local term relationships and on the initial order of the documents. A respective score is determined for each document in the plurality of documents based on the document relationships.

Take Aways

The process described in this patent attempts to identify the most important and significant terms on top-ranked pages returned in response to a query. It looks at the strength of the relationships of those terms with other important terms on the same pages. It creates scores for the documents based on the locations of those terms within the documents and the relative distances between the important terms to each other (with the closest distance being the one used, if the terms appear more than once).

The scores of the documents, and differences in significant terms within each, may influence rankings in two possibly different ways. The document scores might be combined with the original score for a page to boost it in the set of search results.

Differences in which words are significant words within the document might indicate diverse types of results and an ambiguous query term, and may lead to the search engine rearranging the results to cover a broader range of meanings earlier on in search results. For example, on a query for the term “java,” there might be some results about the programming language, some about the island, and some about the drink. The significant words or terms for each page might point to three different meanings for the query term that should be represented on the first page of search results.
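
The rearranging for diversity could work in a number of ways. The sketch below is not the patent's method. It is a simple greedy approach I'm assuming for illustration: for each of the top few slots, pick the remaining document whose significant terms overlap least with the documents already chosen.

```python
def diversify(ranked_docs, top_n=3):
    """ranked_docs: list of (doc_id, set_of_significant_terms), best first."""
    remaining = list(ranked_docs)
    chosen, seen_terms = [], set()
    while remaining and len(chosen) < top_n:
        # Pick the document sharing the fewest terms with those already chosen.
        # min() returns the first such item, so ties keep the original order.
        best = min(remaining, key=lambda d: len(d[1] & seen_terms))
        remaining.remove(best)
        chosen.append(best[0])
        seen_terms |= best[1]
    return chosen + [doc_id for doc_id, _ in remaining]


ranked = [
    ("lang1", {"programming", "language"}),
    ("lang2", {"programming", "code"}),
    ("island", {"indonesia", "island"}),
    ("coffee", {"coffee", "drink"}),
]
order = diversify(ranked)
```

For the “java” example, the second programming page gets pushed below the island and coffee pages, so that all three meanings show up in the first three results.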

Because of the boosting based upon document scores, and the rearranging based upon showing diverse results, it’s possible that some pages that aren’t as highly relevant for the query term may be moved up a lot in search results.

Look at the last webpage or blog post you might have published, and see if you can guess which words might be identified by Google as the most significant words on the page, and how strong the relationships between those words might be. Does their co-occurrence on your page influence the ranking of that page?

Note that this process doesn’t replace an information retrieval score for a page based upon the use of the query terms (or synonyms for those terms) on the page or in links to the page, and an importance score based upon something like PageRank. Instead it adds a filter or additional weighting, and also tries to add more diversity to search query results.

Is Google using this kind of co-occurrence of significant terms on pages to influence their rankings?


31 thoughts on “Ranking Webpages Based upon Relationships Between Words (Google’s Co-Occurrence Patent)”

  1. To be honest, unless G actually tell us what they are doing, the chances of us working it out is pretty darn slim.
    That said – we don’t have to be too exact to benefit from the thinking and analysing :D

    It would be fun trying to test out pages to see if adding/subtracting associated/related terms or altering the proximity between them would influence ranking position.

  2. This is an excellent analysis of the very nuanced approach Google is taking to content retrieval and page ranking (and re-ranking) in search. As Google increases its entities Index we may well begin to see more and more pages unrelated to the search query rank a lot higher because of co-occurrence of words and the discovery of entities within the page content. Brilliant post, thoughts and analysis!

  3. Interesting post. I’d like to do a case study with a new partial EMD website where I don’t build any links and simply mention the brand name of the domain several times on several high authority websites. I want to see if the brand mention alone will help with rankings. Would you like to help me with this case study? e-mail me or connect on Google +.

  4. I think the long and short of this blog post is that Google is trying to improve their understanding of language. While the specific details of this process are both interesting and educational, the tactical and strategic implications are quite minimal.

    For years, Google has been telling everyone to create content with real people in mind. This is merely another extension of that policy.

    Having said that, this does help to explain why I sometimes see very ambiguous keywords sending traffic in analytics…perhaps Google uses search history in conjunction with their co-occurrence filter to serve results.

  5. I was always fairly sure of the importance of including synonyms and variations of a keyword in close proximity to that keyword on the same page, so this is Google taking that same idea but applying it to content on third party websites; assuming that if the keywords are close to the brand term then they must also be closely related to what that brand offers in terms of their services / products. I can see the logic in this, but surely this opens the door to new spam, e.g. creating websites, adding content and mentioning the brand alongside relevant keywords but omitting the link? I guess the authority of said website will be very important, which pushes me into my SEO 2013 mindset; traditional marketing online, e.g. building relationships with key online publishers and pushing your PR pieces out to them. Thanks Bill, I always enjoy your posts, as thought provoking as ever.

  6. Bill as usual you have done a great job integrating your thoughts and analysis on Co- occurrence. I did see Rand’s post and had a hard time following it. Your explanation has cleared it up a bit for me.

  7. Thanks Bill. I think that the semantic relevance is essential. If in my blog I write a guide book on New York City, Google expects me to cite, for example, the Statue of Liberty. If I do not the text will be considered less relevant.

  8. Thanks for highlighting this Bill, lots to digest here.

    Assessing or assigning relationships between documents without links that is going to be an interesting field to watch develop. Citations and referencing will become the latest thing perhaps?

  9. I would welcome this technology. The fact that keyword saturation & exact keyword prominence still holds major weight is almost comical.

  10. Bill do you think PR could play a big part in SEO when all this rolls out? Im thinking positive mentions online from high auth and traffic sites and mentions of your domain (raw url with or without the http or www) but without a hyperlink

  11. Interesting analysis you put together on this patent. Is it just me or does SEO post Penguin/Panda seem like the Wild Wild West? So many SEOs have lost faith in SEO. Take social signals for example. Google doesn’t prioritize any social signals except for… Google Plus. What a surprise!

    Favoring their own sites and web properties over others is why they are in hot pursuit by the FTC.

  12. Another reason your #1 ranking might be unachievable or at least far more challenging than it seems. Seems to me this type of algorithmic work pushes SEOs to develop well-refined kw strategies and write well with appropriate concept groupings, division between thoughts, etc.

    Thanks as always Bill

  13. Great post Bill – I’m glad you’re out here wading through Google’s patents so we don’t have to.

    @Todd MCDonald – Just picking up on your point. If you look at Google’s recently re-leaked Google Quality Raters search manual it’s clear that Google is trying to understand User Intent as well as Page Quality. This has to play into our client’s keyword/URL strategies. Will people start grouping and targeting KWs into ambiguous intent, navigational intent, etc – and then coming up with differing strategies for when there is a diverse range of competitors for an ambiguous term versus a narrow range of competitors for a specific term with clear user intent(Action, Information,Navigation). This might require the development of some new SEO tools to really understand what’s going on ;-)

  14. That’s quite an analysis you’ve done there. Way over my non-technically inclined head, I must confess. I just wish it were a hell of a lot easier to figure out how to get Google to like my site more. I’ve read so many conflicting things about whether anchor text has any appreciable impact on SEO that I’m still not sure whether it does or not.

  15. If I understand correctly, I could also consider the words close to the keyword in the body text.
    Value position keyword:
    – Keyword used in the body first paragraph is better than in other part of the page
    – Word close the keyword is important and increase the importance of the keyword

    Could this be correct?

  16. At first, when I noticed the granted date I immediately thought that we could expect a major update soon but chances are this is probably already playing a part in the algorithm.

    Reading through your article and the application raises the question about global differences in vocabulary around the world. Even in English speaking countries there exist huge differences. For instance, a particular part of a car body in the USA is referred to as a fender, whereas in the UK the same part is called a wing, and Fender in the UK is only used when talking about the famous guitar brand. In the UK kids’ confectionery is referred to as sweets whereas in Australia they are called lollies, and in the UK a lolly is what I believe the USA calls a popsicle.

    So, with that in mind, couldn’t this restrict the reach of a TLD website depending on which local variation the author uses?

  17. I’m having a hard time with Google as of late. Seems that one day I’m on top of the word in the searches and then the next day I’m on page 4. I appreciate your website.

  18. I have been doing a lot of research into this since I heard about it on the Whiteboard Friday video you brought up. I was having a hard time wrapping my head around it, but your article has cleared it up a bit. Thanks for the insight.

  19. It seems like a lot of this information is highly speculative. It seems safe to assume that the basis of this post is indeed accurate, but I wonder to what extent it really is true. In the end of the day, much of the discussion around search engine rankings is nothing more than educated guesswork.

  20. Hi Michael,

    The patent that I wrote about in this post is very real, and there’s a link to it in this post. Most of what I wrote about it isn’t speculative at all. I like to try to include links and references to patents that are from the search engines because a lot of what I read in forums, in blog posts, and elsewhere on the Web is often very much speculative. Please feel free to follow the link to the patent and read it.

    Google may or may not be using the word relationship approach that I wrote about in this post, but it isn’t something that I made up. It is very real.

    Thanks.

    By the way, the link to the page you linked to when leaving your comment appears to be broken, so I am removing it.

  21. I think that the semantic relevance is essential. If in my blog I write a guide book on New York City, Google expects me to cite, example, the Statue of Liberty.

  22. This is a great article but I do have a question regarding the co-occurrence. How close in an article do 2 words have to be to each other to be considered as co-occurring keywords? Do they have to be in the same paragraph?

  23. Hi Joe,

    So these co-occurring combinations of words aren’t considered “keywords” but instead words that might appear within the same pages on documents returned in response to a particular query. For example, if you search for “java”, you might find a lot of pages that have words on them like “programming” and “language,” and those two words are likely to show up, or co-occur on a lot of documents within the result set for that search for “java.”

    Each page in the result set might be given a document score based upon the appearance of words like those on their pages, and the distance between each of those words (programming and language, in my example) may influence the score of each document. Documents with higher scores might be boosted in rankings on a search for “java.”

    So the word relationships aren’t necessarily between the query term or terms themselves (though those can be used), but rather other words that might co-occur on a lot of the top documents (first ten, first 100) that show up in response to a query.
