The Incomplete Google Ranking Signals, Part 1

I’ve seen a few long posts lately that list ranking signals from Google, and they inspired me to start writing a series about ranking signals over on Google+. The chances are good that I will continue to work on the series there, especially since I’ve been getting some great feedback on them.

This post includes the first seven, plus an eighth signal – the Co-Occurrence Matrix described in Google’s Phrase-Based Indexing patents.

I’m also including links to some of the papers and patents that support these signals – ones I think are among the most important for people interested in SEO.

Here are the first 8 signals:

1. Local Inter-Connectivity

This ongoing series will look at some of the different ranking signals that Google has likely used in the past to rank search results in response to a query.

In 2001, Krishna Bharat filed for a patent with the USPTO that was granted in 2003. What it did was take the top search results (top 100, top 1,000, etc.) and boost some of those results based upon how often they “cited” or linked to each other within that “local” setting.

According to the patent, search results are normally ordered based upon things such as relevance and importance (PageRank), and then they are examined again. A local relevance score is added into the mix to change the order of those results:

A Local Rank is considered with an old score for Pages

Further, the method ranks the generated set of documents to obtain a relevance score for each document. It calculates a local score value for the documents in the generated set. The local score value quantifies an amount that the documents are referenced by other documents in the generated set of documents.
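To make that re-ranking step a little more concrete, here is a rough sketch of the idea in Python. The function names, the size of the local set, and the simple additive combination of scores are my own assumptions for illustration; the patent describes the concept rather than any particular code.

```python
# A minimal sketch of local re-ranking, under assumed names and a simple
# additive combination of the old score and the local score.

def local_rerank(results, outlinks, top_n=100, local_weight=0.5):
    """results: list of (url, relevance_score) pairs, ordered by the old score.
    outlinks: dict mapping a url to the set of urls it links to."""
    local_set = {url for url, _ in results[:top_n]}
    reranked = []
    for url, old_score in results[:top_n]:
        # Local score: how many other documents in this "local" set link to this one.
        local_score = sum(
            1 for other in local_set
            if other != url and url in outlinks.get(other, set())
        )
        reranked.append((url, old_score + local_weight * local_score))
    reranked.sort(key=lambda pair: pair[1], reverse=True)
    return reranked
```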

You may have heard or read that Krishna Bharat rewrote how Google worked in the early 2000s by applying the Hilltop Algorithm. The “other references” section of this patent points to a paper by Bharat, written before he joined Google, that describes what Hilltop is and how it works:

Hilltop: A Search Engine based on Expert Documents

Google has since published several patents and papers describing how some of those local results may be boosted or demoted based upon other signals, and I’ll be including some of those in this series.

Ranking search results by reranking the results based on local inter-connectivity

2. Hubs and Authorities

Imagine that pages on the Web might be given 2 sets of scores. These sets of scores are for “broad topic” queries, such as “I wish to learn about motorcycles.”

The first score might be an authority score, based upon how well it answers that broad topic. The second score might be a hub score, where compilations of links have been collected that can be used to find authoritative pages for that broad topic.
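For anyone who wants to see how hub and authority scores feed each other, here is a small sketch of the Kleinberg-style iteration that the patent builds upon. The iteration count and the normalization step are choices I’ve made for the example, not details from the patent.

```python
# A rough sketch of the hub/authority iteration behind HITS, under assumed
# data structures: pages is a list of ids, outlinks maps a page to the set
# of pages it links to.

def hits(pages, outlinks, iterations=20):
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # A page's authority score grows with the hub scores of pages linking to it.
        auth = {p: sum(hub[q] for q in pages if p in outlinks.get(q, set())) for p in pages}
        # A page's hub score grows with the authority scores of the pages it links to.
        hub = {p: sum(auth[q] for q in outlinks.get(p, set()) if q in auth) for p in pages}
        # Normalize both sets of scores so they stay comparable between iterations.
        for scores in (hub, auth):
            norm = sum(v * v for v in scores.values()) ** 0.5 or 1.0
            for p in scores:
                scores[p] /= norm
    return hub, auth
```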

The patent I’ve linked to was written when the inventors were at AltaVista; it came into Yahoo’s ownership after Overture purchased AltaVista and Yahoo then purchased Overture. One of the inventors was Krishna Bharat, who also invented the patent behind the first signal in this series, and the other is Monika Henzinger. Reading anything you can from either of them is recommended.

A paper cited in the patent, written by Jon Kleinberg, should be read by anyone studying SEO and how pages may be ranked in search results:

Authoritative Sources in a Hyperlinked Environment (pdf)

Try to read through as much of the paper as you can before moving on to the patent – it becomes much easier to read and understand if you do so.

The point behind the patent is to improve upon the Hubs and Authorities Algorithm in the Kleinberg paper, to prevent topic drift when the focus is upon terms that may have more than one meaning (for example, Jaguar may refer to a car brand, a type of animal, or the NFL football team from Jacksonville).

The content is pruned, and then re-ranked based upon connectivity

The patent and the paper aren’t from Google, even though Bharat and Henzinger both went on to work at Google, and you will see elements of Hubs and Authorities scores in their work there.

Method for ranking hyperlinked pages using content and connectivity analysis

3. Reachability

I couldn’t help myself but publish this one – at a rate of one a day, it could take some time to get up to 100-200 ranking signals, and I don’t know if I’m patient enough for that. 🙂

With the first couple of signals that I’ve written about, the idea of Google wanting to identify authority and hubs plays a strong role, and the idea that some pages are great resources that should rank highly comes out of that.

This patent focuses upon user behavior signals for the pages that a page links to, in order to determine a reachability score for that page. To a degree, it’s similar to scoring pages that act as good Hubs – pages that tend to lead to authoritative pages.

I wrote a post about this patent that describes how it works titled:

Does Google Use Reachability Scores in Ranking Resources?

So pages that are great resources, based upon some measure of the quality of their links to other resources, could see their rankings boosted.

Initial Rankings Changed by Reachability Scores

In the book about Google by Steven Levy, we are told that Google values “Long Clicks” as a quality signal. The patent does describe how Google might determine what a long click is, but doesn’t use it as a direct ranking signal. Instead, it uses Long Clicks to determine the quality of pages that link to many other pages that result in long clicks. Those pages would likely be good Hubs pages. 🙂
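Here is a hedged sketch of how a reachability-style score might be computed from long clicks on the pages a candidate hub links to. The dwell-time threshold and the simple averaging are assumptions made for illustration, not values from the patent.

```python
# A sketch of the reachability idea: a page earns credit when the pages it
# links to tend to produce long clicks. Threshold and averaging are assumed.

LONG_CLICK_SECONDS = 60  # assumed cutoff for treating a visit as a "long click"

def reachability_score(linked_urls, dwell_times):
    """linked_urls: urls a candidate hub page links to.
    dwell_times: dict mapping a url to a list of observed visit durations in seconds."""
    rates = []
    for url in linked_urls:
        visits = dwell_times.get(url, [])
        if visits:
            long_clicks = sum(1 for seconds in visits if seconds >= LONG_CLICK_SECONDS)
            rates.append(long_clicks / len(visits))
    # Pages whose outbound links often lead to long clicks look like good hubs.
    return sum(rates) / len(rates) if rates else 0.0
```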

Determining reachability

4. Burstiness

Jon Kleinberg noticed that at certain times of the year his email filled up with specific topics; around the time of midterms and final exams, for instance, his messages would focus upon test-taking and extra office hours.

He noticed that this kind of behavior happened on the Web as well, where different events would trigger certain topics, in blogs, in the news, in search queries, and so on. He looked at archives of things like presidential messages, for terms that would recur, and the events that triggered them, and started thinking of information in streams.

Kleinberg’s study included looking at words in Presidential addresses.

When traffic goes through a network, it isn’t in a steady stream, but rather travels in bursts. Sometimes there are patterns to the bursts. Having a sense of hot topics, topics that have cooled off, others that may be seasonal or influenced by the day or day of the week, could be useful.

Many lists of ranking signals used by the search engines mention things like “freshness,” and algorithms for things like news or blog search do likely use “freshness” as an important signal. Still, when a search engine acts as a reference source, like a library, sometimes more mature results are what is being called for.

Monika Henzinger published patents for Google on Document Inception Dates, under which documents found on the web are dated based upon when they were first published or first crawled by a search engine. The rankings of those documents might sometimes be influenced by that date, and that influence could depend upon the relative age of a set of search results.

So, if a search for “declaration of independence” turned up more mature documents, there might be a preference to show older documents, and they might be boosted in search results. On a search for “Windows 8.1”, the set of search results might tend to be a lot younger, and so newer documents might be boosted in search results.
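As a toy illustration of that preference, the snippet below boosts a document when its age matches the overall maturity of the result set. The cutoffs and the boost value are invented for the example; the patents don’t spell out numbers like these.

```python
# A toy age-preference sketch: mature result sets favor older documents,
# young result sets favor newer ones. All thresholds are assumptions.

from datetime import date

def age_boost(doc_date, result_dates, today=None):
    """Boost a document whose age matches the overall maturity of the result set."""
    today = today or date.today()
    doc_age = (today - doc_date).days
    average_age = sum((today - d).days for d in result_dates) / max(len(result_dates), 1)
    if average_age > 5 * 365:    # a mature result set, e.g. "declaration of independence"
        return 1.2 if doc_age >= average_age else 1.0
    if average_age < 180:        # a young result set, e.g. "Windows 8.1"
        return 1.2 if doc_age <= average_age else 1.0
    return 1.0
```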

If there is a sudden increase in searches for “Justin Bieber Canada,” the bursty nature of the Web might cause fresher documents to rank higher, and we might see a “query deserves freshness” algorithm kick in where news articles and newer pages move up in search results.

Please don’t call it freshness, because sometimes mature pages are the ones that move up.

Temporal Dynamics of On-Line Information Streams (PDF)

5. Semantic Closeness

Some see the word “semantic” and ask where the semantic markup or schema.org markup is. This post isn’t for you.

Some see heading elements and wonder whether or not the fact that they are often bigger and bolder on a page than other text makes those pages rank higher for the words within the heading. This post isn’t for you.

Then there are those of you who see a list on a page (it doesn’t technically even have to use an HTML list element) and recognize that any of the items within the list could be ordered differently, such as alphabetically, by word length, or even randomly, and that each of those list items would be just as valuable as any of the others. Each would also be equally as close to the words in the list’s heading as any of the other list items.

And closeness is magical to search engines and SEO. Search for “ice cream,” and the page that includes the phrase “ice cream” should be more relevant and rank higher than the page that includes the phrase, “I went to the store to buy cream, and slipped on the ice.”

I’ve annotated this patent image to show semantic closeness in a list.

Not only are list items equal distances away from the heading of that list, but heading elements on a page are an equal distance from every word in the content that they head. I know this because it’s covered in Google’s definition of “semantic closeness.”

And each word on a page is an equal distance to the words in the title of that page. That’s what the semantic meaning of a page title is, and that’s included in Google’s definition of “semantic closeness” as well.
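Here’s a small sketch of those two rules, treating words under a heading (including list items) as equally distant from that heading, and every word on the page as equally close to the title. The page structure and the specific distance values are placeholders I’ve chosen for illustration, not figures from Google’s patent.

```python
# A sketch of "semantic closeness" under assumed distances: 1 for title words,
# 2 for terms sharing a heading, positional distance otherwise.

def semantic_distance(term_a, term_b, page):
    """page: dict with 'title' (list of words), 'sections' (heading text -> list of
    words in the content it heads, including list items), and 'body' (all words in order)."""
    # Every word on the page is treated as an equal, small distance from the title words.
    if term_a in page["title"] or term_b in page["title"]:
        return 1
    # A heading is an equal distance from every word in the content it heads, so two
    # terms under the same heading (or a term and its heading) get the same distance.
    for heading, words in page["sections"].items():
        covered = set(words) | set(heading.split())
        if term_a in covered and term_b in covered:
            return 2
    # Otherwise fall back to plain positional distance within the body text.
    body = page["body"]
    if term_a in body and term_b in body:
        return abs(body.index(term_a) - body.index(term_b))
    return float("inf")
```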

As I noted above, no schema.org markup was required to have semantic closeness. Meaning happens, and some HTML elements have meaning baked right into them, which goes beyond just how they present things on an HTML page.

So the next time that you see someone state that there is no correlation between using a heading element and ranking at Google, ask them if they accounted for semantic closeness and leave them scratching their head. If they don’t get it, they probably never will.

Document ranking based on semantic distance between terms in a document

6. Page Segmentation

You may find yourself asking why I’m pointing to a Microsoft paper here. Honestly, it’s probably because Microsoft has written much more about page segmentation and carried the ideas and concepts behind it much further than Google or Yahoo.

Microsoft’s VIPS paper image showing segmentation.

Google does have a few patents directly on the concept of page segmentation, and I included it in my “10 Most Important SEO Patents” series.

Here are some of the things that Microsoft described in white papers and patents, though:

  • A block level PageRank, where links from different blocks or sections on pages would carry and pass along PageRank as if they were pages in the older approach to PageRank.
  • A way to decide which was the most important block on a page, especially on pages that had multiple main content sections, like a magazine-style page with multiple stories, so that the text in the most important block would carry the most relevance value.
  • A way to analyze and understand the different blocks or segments of a page based upon linguistic features of those blocks, or sections.

Does the content of that block use mostly full sentences in sentence case, with only the first words capitalized, and full punctuation?

Does the block only contain lists of words/phrases, in title case, and mostly links elsewhere?

Does the block contain a copyright notice, suggesting that it’s most likely a footer for a page, so that the text within it should carry very little weight from a relevance standpoint?
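A hedged sketch of how a block might be classified along the lines of those questions appears below. The thresholds and labels are my own guesses, not anything published by Microsoft or Google.

```python
# A sketch of classifying a page block by simple linguistic features.
# Thresholds and labels are assumptions for illustration.

import re

def classify_block(text, link_count):
    words = text.split()
    if not words:
        return "empty"
    # A copyright notice strongly suggests a footer block with little relevance value.
    if re.search(r"copyright|©", text, re.IGNORECASE):
        return "footer"
    # Mostly title-case words with a high ratio of links looks like navigation or a link list.
    title_case_ratio = sum(1 for w in words if w[:1].isupper()) / len(words)
    if link_count / len(words) > 0.3 and title_case_ratio > 0.5:
        return "navigation"
    # Several full sentences with normal punctuation look like main content.
    sentences = [s for s in re.split(r"[.!?]", text) if len(s.split()) > 5]
    if len(sentences) >= 2:
        return "main_content"
    return "other"
```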

Here are a few posts I’ve written on web page segmentation, for those of you who want to do a little more investigation on the subject:

VIPS: a Vision-based Page Segmentation Algorithm

7. Reasonable Surfer PageRank

PageRank is the algorithm that seems to have set apart Google from the other search engines of its day, but the chances are that it started changing from the moment that it was set loose on the world. I can’t in good faith write about the PageRank of the late 90s, but I wanted to point to a different model.

Not every link on a page passes along the same weight, the same amount of PageRank, and likely not even the same amount of hypertextual relevance. We heard this from Google representatives for a few years, and even from search engines like Yahoo and Blekko, which have told us that some links are likely completely ignored, such as those that might show up in comments on blog posts.

Features of Links determine weights

As this patent tells us, Google might see the anchor text of “terms of service” on a page, and automatically not send much PageRank to that page.
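Here’s a minimal sketch of that idea: each link’s share of PageRank gets scaled by features of the link, instead of the equal share a purely random surfer model would hand out. The feature names and weights are illustrative guesses rather than values from the patent.

```python
# A sketch of a reasonable-surfer-style distribution of PageRank across links,
# with made-up feature weights.

FEATURE_WEIGHTS = {
    "in_main_content": 1.0,
    "in_footer": 0.1,
    "in_blog_comment": 0.05,
    "boilerplate_anchor": 0.1,   # e.g. "terms of service", "privacy policy"
}

def distribute_pagerank(page_rank, links):
    """links: list of dicts like {'target': url, 'features': ['in_main_content', ...]}."""
    weights = []
    for link in links:
        weight = 1.0
        for feature in link["features"]:
            weight *= FEATURE_WEIGHTS.get(feature, 1.0)
        weights.append(weight)
    total = sum(weights) or 1.0
    # Each link passes along a share of the page's rank proportional to its weight,
    # rather than the equal share a purely random surfer model would give it.
    return {link["target"]: page_rank * w / total for link, w in zip(links, weights)}
```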

You’ll see the name “Jeffrey Dean” listed as one of the inventors on this patent, and if you start digging through other Google patents, you’ll see it frequently. He often writes about technical issues involving the planet-wide data center that Google has been building, and how the whole of that machinery works together. If you have a few days to spare for looking at patents from Google, it won’t hurt to look for ones written by him. His “Research at Google” page might overwhelm you:

Jeffrey Dean – Research at Google

There have been a lot of things written about PageRank over the years. Still, if you haven’t read about the Reasonable Surfer and don’t understand the transformation it describes from a random surfer model, you really should.

Here’s a blog post I wrote about it that you could use as a kick start:

Google’s Reasonable Surfer: How the Value of a Link May Differ Based upon Link and Document Features and User Data

Ranking documents based on user behavior and/or feature data

8. Co-Occurrence Matrix

I’ve written several posts over the last decade about phrase-based indexing, and it’s probably one of the most important SEO topics of that period. And one of the most ignored and underrated. Several patents from Google describe how phrase-based indexing works, and how Google may have incorporated it into its inverted index.

The inventor behind Phrase-Based Indexing, Anna Patterson, is also the inventor of one of the largest search engines of the 21st century, the Recall search engine, which ran as a beta on the Internet Archive. Patterson left Google to launch the Cuil search engine with her husband, Tom Costello. That search engine supposedly launched with 120 billion pages. Cuil was a failure, but Patterson was very quickly back at Google as a Director of Research.

In Phrase-Based Indexing, meaningful “good phrases” are identified on web pages and are mapped to those pages in an inverted index. Within the search result sets for queries, the phrases that co-occur within the top 100, or top 1,000, or some other set may be identified. For words or phrases that might have more than one meaning, those results might be clustered so that pages about similar topics are grouped together to find their co-occurring phrases.

An overview of the phrase-based indexing ecosystem

These co-occurring phrases are called “related words.” When they appear on a page that might rank for the initial query, Google may boost that page in search results. If too many “related words” appear on a page, beyond a statistical likelihood, Google might consider that page to be spam.
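To show roughly how that might work, here’s a sketch that builds a set of “related words” from the phrases found across a top result set, then uses them both to boost a page and to flag a page that contains implausibly many of them. All of the thresholds are invented for the example.

```python
# A sketch of building co-occurring "related words" from top results and using
# them as a boost-or-spam signal. Thresholds are assumptions, not Google's numbers.

from collections import Counter

def related_phrases(top_result_phrases, min_documents=10):
    """top_result_phrases: list of sets, one per top result, each holding the good
    phrases found on that page. Returns phrases that co-occur across many top results."""
    counts = Counter(phrase for phrases in top_result_phrases for phrase in phrases)
    return {phrase for phrase, count in counts.items() if count >= min_documents}

def related_word_signal(page_phrases, related, expected_count=15, spam_multiplier=3):
    """Boost pages containing related phrases, but treat a page holding far more of
    them than would be statistically likely as possible spam."""
    matches = len(page_phrases & related)
    if matches > expected_count * spam_multiplier:
        return {"boost": 0.0, "possible_spam": True}
    return {"boost": matches / max(len(related), 1), "possible_spam": False}
```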

Google may look for these “related words” in anchor text and may weigh links associated with that anchor text differently based on the co-occurrence level. Here’s a passage from one of the patents that describes how that works, and it’s interesting, given our discussion of the HITS algorithm above, that it refers to documents pointed to with highly co-occurring related words as “expert documents.”

[0206] R_i.Q.RelatedPhraseBitVector * D.Q.RelatedPhraseBitVector

[0207] The product value here is a score of how topical anchor phrase Q is to document D. This score is here called the “inbound score component.” This product effectively weights the current document D’s related bit vector by the related bit vectors of anchor phrases in the referencing document R. If the referencing documents R themselves are related to the query phrase Q (and thus, have a higher valued related phrase bit vector), then this increases the significance of the current document D score. The body hit score, and the anchor hit score are then combined to create the document score, as described above.

[0208] Next, for each of the referencing documents R, the related phrase bit vector for each anchor phrase Q is obtained. This is a measure of how topical the anchor phrase Q is to the document R. This value is here called the outbound score component.

[0209] From the index 150 then, all of the (referencing document, referenced document) pairs are extracted for the anchor phrases Q. These pairs are then sorted by their associated (outbound score component, inbound score component) values. Depending on the implementation, either of these components can be the primary sort key, and the other can be the secondary sort key. The sorted results are then presented to the user. Sorting the documents on the outbound score component makes documents containing many related phrases to the query as anchor hits rank most highly, thus representing these documents as “expert” documents. Sorting on the inbound document score makes documents that are frequently referenced by the anchor terms the highest ranked.
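The bit-vector product in paragraph [0206] can be illustrated with a very small sketch: the related phrase bit vector for an anchor phrase in a referencing document R is multiplied against the related phrase bit vector of the referenced document D. Treating the vectors as plain 0/1 lists is a simplification of what the patent describes.

```python
# A simplified sketch of the related phrase bit vector product from the passage above.

def related_bit_vector_product(r_related_bits, d_related_bits):
    """Both arguments are lists of 0/1 flags over the same ordered set of related phrases.
    The product counts the related phrases the two vectors share, playing the role of
    the "inbound score component" in the patent's terms."""
    return sum(r * d for r, d in zip(r_related_bits, d_related_bits))

# Example: a pair of documents sharing three related phrases scores higher than a pair sharing one.
inbound_score_component = related_bit_vector_product([1, 0, 1, 1, 0, 1], [1, 1, 1, 0, 0, 1])  # -> 3
```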

The Phrase-Based Indexing patents are very rich, and there are many additional elements that need to be explored in more depth. I detailed a number of those in the following post, but going through the remainder of the patents reveals many additional ways that Google may use this Co-Occurrence Matrix.

Phrase Based Information Retrieval and Spam Detection

Many of those patents are linked to in my post 10 Most Important SEO Patents, Part 5 – Phrase Based Indexing

Multiple index based information retrieval system

Epilogue

This isn’t the end, by far. But there are many signals that I likely won’t include within this series because there just really isn’t a lot to support them. Some factors that I’ve seen in lists of ranking signals are likely more myth than anything else, and I may address some of those.

The chances are good that Google uses multiple algorithms at any one point, and we’ve been told that the search engine makes around 500 changes or so a year to how they rank pages in search results.

I’ll be exploring those, but I hope that the signals I do include provide enough information to be starting points for anyone who wants to do more research on their own. If you do, let me know!


39 thoughts on “The Incomplete Google Ranking Signals, Part 1”

  1. Great article as always. Thanks for the depth and details. I’ll be reading this article a few times.
    You are touching on things here which I have not read about before, such as “Semantic Closeness”, which is just what I have called On Page Optimization, and “Page Segmentation”. Thanks again!

  2. Hey Bill, this was so nice of you, to put everything into one post. Are you going to update this post, or will you put part 2 in a separate post?

  3. Hi Johnny

    Thanks – I wanted to include some things within this set of ranking signals that people might not have seen before or seen very much of, and it sounds like I succeeded.

  4. Hi Ivan,

    Thank you. I think I’ll probably be creating a follow-up post rather than adding directly to this one. That gives me the chance to do what I’ve done with this first one and add some additional text and images, and possibly some new signals to that post as well.

  5. Hi Bill,

    That’s a really interesting read! I was pretty impressed by learning about the Block Level PageRank and I thought that it would be a great way to handle Guest Blogging spam, especially in terms of author bio links. Those could be marked and weighted with a lower PageRank value, for instance (when they are not nofollow but active links instead). As the document you’ve linked to states, “block level PageRank can reflect the semantic structure of the web to some extent”; in this case links in a bio section following an article could be somewhat devalued if not relevant or semantically interrelated with the text on the page.

    In this regard it would be absolutely unnecessary to punish a website for welcoming contributing authors and dofollowing their links – I believe that devaluing the effects of a given overexploited practice could be a much more successful and insightful approach than condemning it and going on a witch-hunt. Do you think that this is possible and probably even already employed in this area? Thanks a lot for sharing your thoughts and for summarizing the info in a much more readable and involving way than the dry format of the patent docs.

  6. Thanks, Mike

    I wasn’t thinking of this as a competition, but rather as a “what would I want to see” when faced with a number of ranking signals. And what I decided I would want to see was some papers or patents or even blog posts about those signals, that might help lead to ideas on why a search engine might find these important, and how they might be used by the search engines. 🙂

  7. Hi Nevyana

    The Block Level Segmentation and PageRank paper was pretty interesting, and it’s quite possible that Google limits the weight or value that might get passed along in a link from a comment significantly. The Reasonable Surfer PageRank approach likely also does as well.

    Thanks for pointing out that quote from the paper about how a bio might be ignored if not relevant or semantically interrelated with text on the page. It’s statements like that one that are worth finding and paying attention to because they show that there is a relationship between many of these ranking signals.

    I did have one of the search engineers from Blekko leave a couple of comments in the past when I wrote about, I believe it was Page Segmentation, who pointed out that they pretty much ignored links from comments.

  8. Hi Trond,

    Thanks! These are definitely the kinds of things that I’m always looking for, regardless of whether I’m writing a blog post or not. What things are the search engines looking at when they rank pages? How does one type of ranking signal interact with others? How Google might find relevance or meaning or signs of quality on a page is pretty much at the heart of what we do, and essential for us to keep in mind when we work to create new pages or to help someone improve the quality of their existing pages. 🙂

  9. Hi Michael,

    Krishna wrote his Hilltop paper before he worked for Google, and the local inter-connectivity patent has nothing to do with Google News.

    See:

    ftp://ftp.db.toronto.edu/pub/reports/csrg/405/hilltop.html

    Bharat is widely known as the inventor of Google News, and his name is on a patent specifically about Google News, which has had its claims updated at least twice in the past few years in continuation patents.

    I don’t feel bad in anyway suggesting that people interested in learning about Google Ranking Signals read Jon Kleinberg’s work or Bharat’s paper on HillTop, or his patent on local inter-connectivity. 🙂

  10. I assume you have spent a long time writing this post. My conclusion, after reading it twice, is as follows: There’s only one Bill Slawski! 🙂

    Excellent read, Bill!

  11. Well said Trond, Bill is one of the most skilled technical SEOs I follow, thanks a lot for all your articles 😀

  12. Hi Michael

    Perhaps you can share some of those articles and that Matt Cutts webmaster video, if you can remember which one it is?

    Can you tell us more about how Hilltop applies to Google News, since you insist that it was only used in Google News and no where else at Google?

    Thanks.

  13. Krishna Bharat wrote Hilltop for Google News. You may have just misled a whole new generation of SEOs into thinking that Hilltop has been running amok in the Web search SERPs.

  14. Thank you, Michael

    I really dislike when people make arguments based on hearsay, or include topics that they don’t provide any relevance for, so I asked. I appreciate the time that you took to respond even though I don’t understand why you are making some of the arguments that you are making.

    Bharat worked on a number of algorithms and patents for Google, and many of those were on Web Search rather than news search:

    http://patft.uspto.gov/netacgi/nph-Parser?Sect1=PTO2&Sect2=HITOFF&u=%2Fnetahtml%2FPTO%2Fsearch-adv.htm&r=0&p=1&f=S&l=50&Query=in%2F%28krishna+and+Bharat%29&d=PTXT

    I’m wondering why you bring up the IP address/Domain issue? What does that have to do with anything? Can you explain in more detail why you think this is important? Thanks.

  15. That Google used Hilltop for Google News is well-documented, at least in the Indian media where they proudly hailed him for his accomplishment.

    Matt Cutts and other Googlers have also pointed out repeatedly that Google does not throw out results from similar hosts in Web search. They cannot afford to do that. Hilltop was never used in Web search and probably never will be.

    People can search on terms like “google” and “wordpress” to see that Hilltop is not being used to select and rank the search results.

    To include Hilltop in a list of ranking factors that Google has used for general Web search results is simply a wrong placement. It was never intended for that and at no time does it appear Google ever intended to use it that way.

    They have other algorithms to prevent cluttering up results with the same domain set.

    Unfortunately, many people in the SEO industry spent years wrongly telling each other and anyone else who would listen that Hilltop was probably responsible for “Florida” in 2003 (it was not — Google implemented Hilltop in 2002 in Google News). That nonsense had finally died down.

    I fear now we’re in for another several years of SEOs citing Hilltop as a factor because we all give you great credibility in these matters. This time, however, you are very mistaken. The denials from Googlers alone should make that clear. I believe there is even a Webmaster video where Matt debunked this idea.

  16. Thanks, Michael

    I hope you’re feeling better soon.

    There is a passage in the book, “In The Plex” that talks a little about Krishna and his arrival at Google, and that he worked on their first patent on “web connectivity analysis.” That section of the book is pretty interesting, and discusses his version of HillTop. It’s at:

    http://books.google.com/books?id=V1u1f8sv3k8C&pg=PA38&lpg=PA38&dq=in+the+plex+hilltop&source=bl&ots=BRvP8vdkfy&sig=ZhoOf8-frwSxOgCFseo4oVGwLyk&hl=en&sa=X&ei=iAEeU6DJA4Ht0wH6oICwCA&ved=0CDIQ6AEwAQ#v=onepage&q=in%20the%20plex%20hilltop&f=false

    It doesn’t refute what you’re saying, or support what I’m saying but it does present a different view of it.

    The Local Inter-Connectivity patent was granted in 2003, but it was filed a couple of years earlier in 2001, which would more closely fit your timeline.

    The pruning of pages under the local inter-connectivity patent seems to be during the calculation of a local score, and wouldn’t prevent Google from displaying pages in search results from the same domain or IP address. See Claims 2 and 3 of the local inter-connectivity patent.

    Just because those results are displayed doesn’t mean that they have to be used for that local score – and therefore whether or not local inter-connectivity would work or not isn’t necessarily negated by what Googlers have had to say about including search results from the same domain or IP address in the result sets displayed to searchers.

    The patents I’ve been associating with Bharat’s initial work on Google News (and I can’t tell you this for certain) are:

    Methods and apparatus for clustering news content

    and

    Systems and methods for improving the ranking of news articles

    In the Bharat/Mihaila Hilltop paper, it does look like it was started at Compaq and maybe finished while Bharat was at Google since it gives his “current” address as being at Google, but his Bio in the footnote tells us:

    Formerly he was at Compaq Computer Corporation’s Systems Research Center, which is where the research described here was done.

    The Hilltop paper (not necessarily used in the local inter-connectivity patent) goes on to define what he means by “expert documents” and how those are identified from a larger set of documents, through finding specific keyword phrases in sources such as page titles and headings and anchor text:

    In order to find the most relevant experts we use a custom keyword-based approach, focusing only on the text that best captures the domain of expertise (the document title, section headings and hyperlink anchor-text). Then, in following links, we boost the score of those targets whose qualifying text best matches the query. Thus, by combining content and connectivity analysis, we are both more comprehensive and more precise.

  17. I’m sorry you feel compelled to ask that I do the research. This is very, very old news.

    Still, here are some references. I’ll start with an interview where Krishna talks about developing the algorithm (starting in late 2001 — and he was already working for Google at the time).

    http://www.niemanlab.org/riptide/person/krishna-bharat/

    Here is a Google Books citation (a biography from 2008). I have shortened it.

    He joined Google in 1999, according to their bio page for him:

    http://research.google.com/pubs/krishna.html

    One of the ideas that many SEOs took away from Hilltop (and LocalRank) is that Google will somehow treat multiple sites hosted on the same IP address differently from multiple sites hosted on their own unique IP addresses. Matt has shot down that idea several times but I only had time to find this article, which deals with only one aspect of that thinking:
    http://www.mattcutts.com/blog/myth-busting-virtual-hosts-vs-dedicated-ip-addresses/

    There is/was another article (or possibly a video) where he used either Blogger or WordPress.com as an example of how Google cannot afford to ignore Websites that are hosted on the same domain when showing relevant results in its SERPs.

    This video from last year recaps some of the ways Google has tried to manage this behavior in response to complaints it has received about showing too many results from the same domain. They have worked to “make it progressively harder” for a single domain to appear often in search results, but if it’s really the only good source of information then that is what they will show.

    http://www.youtube.com/watch?v=sxv-AvNPoh8#t=16

    Many queries continue to display multiple subdomains from Web hosting services like Weebly, Tumblr, WordPress, Blogspot, Blog.com, etc. It just depends on where the niche communities settle, or where the topics receive the most attention.

  18. The RipTide interview refers to him bringing localrank to Google, as they call it “an adaption of Hilltop.”

    But they don’t say that is what is used for Google News.

    It does describe a process that sounds very similar to the one described in the “Systems and methods for improving the ranking of news articles” and “Methods and apparatus for clustering news content ” patents, and in the changes made in the claims to the “ranking of news articles” patent in the continuation versions of it that followed. I spent some time going over those in depth last year when the latest version came out.

  19. I’m sorry, Bill. I am not feeling well. I guess the wrong kid sneezed on me or something this weekend.

    Hilltop looks at IP clustering (among other factors). But it is a Google News algorithm and so far there is no indication from Google that it was ever exported to any other part of their search systems.

    It was always an SEO mythology that Hilltop came out in 2003 (it came out in 2002). It was always an SEO mythology that Bharat developed the algorithm outside of Google (he was AT Google when he developed it).

    But just look at the opening paragraph in the Hilltop paper itself: “…Our algorithm operates on a special index of ‘expert documents.'” That isn’t how Web search works.

    http://ftp.cs.toronto.edu/pub/reports/csrg/405/hilltop.html

    I’ll have to leave it at this.

  20. Hi Michael,

    It’s difficult to tell how much of Bharat’s career at Google was focused on Google News, and how much involved Web search. It’s a nice story to tell someone that he was “pre-occupied with news search after the events of 9/11” but that could mean that the news part of his job went from being 5% of his job to 10%. He certainly didn’t stop working on Web search after 9/11 to focus exclusively upon News.

    It’s possible that the interview plays things a little loose for one reason or another. I did take a look at the patents that he worked on at Google, and (counting continuation patents) he was involved in 14 News patents out of a total of 42.

    Web – Reputation scoring of an author
    News – Systems and methods for improving the ranking of news articles
    Web – Artificial anchor for a document
    Web – Artificial anchor for a document
    News – Systems and methods for browsing historical content
    News – Detecting novel document content
    Web – Query rewriting with entity detection
    Web – Serving advertisements using user request information and user information
    Web – Methods and apparatus for employing usage statistics in document retrieval
    Web – Search augmentation
    News – Systems and methods for improving the ranking of news articles
    Web – Identification of semantic units from within a search query
    Web – Rendering advertisements with documents having one or more topics using user topic interest information
    Web – Authentication of a contributor of online content
    Web – Embedded communication of link information
    News – Methods and apparatus for clustering news content
    Web – Determining quality of linked documents
    Web – Methods and apparatus for employing usage statistics in document retrieval
    Web – Systems and methods for direct navigation to specific portion of target document
    Web – Identifying related information given content and/or presenting related information in association with content-related advertisements
    News – Detecting novel document content
    News – Systems and methods for browsing historical content
    News – Systems and methods for improving the ranking of news articles
    News – Systems and methods for syndicating and hosting customized news content
    Web – Query rewriting with entity detection
    News – Methods and apparatus for ranking documents
    Web – Rendering advertisements with documents having one or more topics using user topic interest information
    Web – Methods and apparatus for employing usage statistics in document retrieval
    Web – Embedded communication of link information
    Web – Determining quality of linked documents
    News – Systems and methods for improving the ranking of news articles
    News – Methods and apparatus for clustering news content
    Web – Query rewriting with entity detection
    News – Detecting novel document content
    Web – System and method for supporting editorial opinion in the ranking of search results
    Web – Methods and systems for requesting and providing information in a social network
    Web – Rendering advertisements with documents having one or more topics using user topic interest
    Web – Method for estimating coverage of web search engines
    Web – Identification of semantic units from within a search query
    News – Graphic user interface for a display screen
    Web – System and method for supporting editorial opinion in the ranking of search results
    Web – Ranking search results by reranking the results based on local inter-connectivity

    Most of the white papers listed on his “research at Google” page focus upon web topics outside of News as well, with only a couple looking like they are News related –

    http://research.google.com/pubs/krishna.html

    The localrank patent removes pages from the local score calculation that are from the same domain, host, or IP address to keep links from those pages from influencing the local scores because they are affiliated.

    That has nothing to do with Google displaying pages in search results from the same domains or hosts or IP addresses. They can be listed and displayed without being part of that calculation.

  21. Hi Michael,

    The local score doesn’t have to power Google’s knowledge graph. It attempts to re-rank search results for specific topical queries based upon the links between them within the top results for those queries.

    As Claim 1 states:

    1. A method of identifying documents relevant to a search query, comprising:

    • Obtaining an initial set of relevant documents from a corpus;
    • Ranking the initial set of documents to obtain a relevance score for each document in the initial set of documents;
    • Calculating a local score value for at least two of the documents in the initial set, the local score value quantifying an amount that the at least two documents are referenced by other documents in the initial set of documents; and
    • Refining the relevance scores for the documents in the initial set based on the local score values.

    The topics are defined by the queries, the “expert” pages are the hub pages within those top results, and the authorities are the pages that tend to be linked to by the Hubs pages, which improve in ranking based upon the old scores and the local scores.

  22. Since Hilltop comes from his DEC days and LocalRank is an adaptation of Hilltop, we can agree I am playing fast and loose with the terminology and the dating (I was feeling nauseous earlier and was not trying to be precise — sorry).

    My point, however, is essentially made by the interview. Krishna Bharat’s work for Google was bound up in Google News and not Web search. And the entire interview centers on how Bharat was concerned with improving the News Search after 9/11.

    Hilltop’s requirement for topical authorities isn’t supported in any Google technology prior to the Knowledge Graph (that I can think of off the top of my head). Maybe they drew on LocalRank for the Knowledge Graph but that’s a far cry from Hilltop being a ranking factor.

    Hilltop uses Host Affiliation to achieve differentiation in its search results. It restricts documents from a group of affiliated hosts. You can see from point 23 in the updated LocalRank patent that it does the same thing. The original patent application was filed on January 30, 2001.

    His 2012 “10th anniversary” post about Google News recaps the basics:
    http://googlenewsblog.blogspot.com/2012/09/google-news-turns-10.html

  23. Greetings Bill,

    Thanks so much for your approach to sharing SEO. Your attention to detail is the best.

    I also enjoyed this post even more because of the enhanced dialog. I look forward to reading more.

    While I cannot claim to be all knowledgeable, posts like this always keep me learning and add to my understanding. I continue to add elements to my composite SEO model as it relates to how to wrap content with great on site SEO and your blog is a major contributor.

  24. Hi Scott

    You’re welcome. It’s funny, but if you work to learn a little more every day, and you’re consistent about it, you’ll end up surprising yourself with how much you know after a while. 🙂

  25. Hi Etela

    Thanks. Sometimes I do wonder whether or not I’ve included enough actionable insights and information in posts. I often know what changes or actions I might take in response to something I’ve written, and I don’t necessarily want to write blog posts that go on forever. Good to hear that you are pulling things out of the post that should be useful.

  26. Thank you Bill, this is an awesome article that I’ll be reading through a few more times.

    I especially appreciate your point in “Don’t call it freshness, because sometimes mature pages are the ones that move up.” At times it feels like we get carried away with the “freshness” focus and need to remember that there is a very important place for mature, or as I refer to it sometimes evergreen, content.

    I also thank you for explaining the semantic closeness. It is something that always made sense to me, and I always felt it had significance, but thanks to your explanation here, I’ll be able to better communicate it to others.

  27. Thanks Bill for posting this info… I am always able to glean something of importance (to me) from your posts.

    I am a big believer in utilizing (appropriately) phrases that I know (with limited research) to be co-occurring in a particular ranking niche, similar to what you described below. I utilize this type of knowledge/assessment every day when optimizing on-page. I try to get a handle on what “too many related words” looks like, and seek to stay within a range that I see as ‘average’ for the top ranking pages.

    “These co-occurring phrases are called “related words” and when they appear on a page that might rank for the initial query, Google may boost them in search results. If too many “related words” appear on a page, beyond a statistical likelihood, Google might consider that page to be spam.”

  28. I rarely find such an “advanced” post about search engine algorithms and signals. I didn’t understand it all fully, but it’s bookmarked for sure.

    I think if I read and understand every single link you posted, it will boost my SEO experience by 20% minimum. You just confirmed for me that links in different parts of a site have different value, by mentioning “Segmentation”.

    Thanks, definitely worth reading!

  29. Hi,

    Thanks Bill. Great article, I will try to translate it into German. Concentrating on the basics and really understanding them is the best thing we can do for our SEO.
    It is a good exercise to train our brains.

  30. Now this is one article that every SEO consultant needs to read. I’ll be the first to admit that it’s not an easy read (for me anyways) but it’s absolutely something that every person in the SEO business worth their salt has to read. Thanks Bill. Here’s to more.

  31. Great write-up, Bill. Unfortunately, I’ve met quite a few people who pretty much just look at the PageRank as the ranking factor and don’t branch out from there.

    Can’t wait for the rest of this series!

  32. Really awesome article! Super in-depth and I’ll have to read over this a few times again.

    I can’t stress enough to my clients about reachability. I’m so happy you mentioned that. It’s very hard for people to grasp this concept or perhaps I’m explaining it wrong. Thank you!

  33. Bill, as always you bring clarity (with patents) to those of us that have long taken notice of things like “Semantic Closeness” without before putting a name to it.

  34. I wish Searchmetrics took a clue from these. No fancy stuff but pure facts. I love the way you present. It is so informative. Thanks.

  35. Fascinating reading from a layman… thank you

    Can I ask regarding this statement
    “The localrank patent removes pages from the local score calculation that are from the same domain, host, or IP address to keep links from those pages from influencing the local scores because they are affiliated.

    That has nothing to do with Google displaying pages in search results from the same domains or hosts or IP addresses. They can be listed and displayed without being part of that calculation.”

    What effect does “removes pages from the local score calculation” – where more than one website within a niche and locality shares the same host and IP address – have on SERPs?

  36. Hi Neil,

    By “local rank”, they are referring to documents that rank in the top 10, or top 100 or so pages for a specific term – and that’s treated as a locality within a set of search results. They might remove some pages that are likely related to each other, so not to give too much credit or benefit to too many related pages.
