My last post, Not All Anchor Text is Equal and other Co-Citation Observations, was a response to a White Board Friday video posted a couple of weeks ago at the SEOmoz Blog, Prediction: Anchor Text is Dying…And Will Be Replaced by Co-citation. I didn’t expect my next post (this one) to revisit that post and its observation that the way certain words might co-occur on pages might be a possible ranking signal that Google may be using.
Rand noted that first page rankings for three different pages, which didn’t seem very much optimized for the queries they were returned for, might be ranked based upon a ranking signal that looks at how words tend to co-occur on pages related to those queries. My post in response explored some reranking approaches by Google that also might account for those rankings, including Phrase Based Indexing, Google’s Reasonable Surfer Model, Named Entity Associations, Category associations involving categories assigned to queries and categories assigned to webpages, and Google’s use of synonyms in place of terms within queries.
Google’s Phrase-Based Indexing approach pays a lot of attention to words (phrases, actually) that appear together, or co-occur, in the top (10/100/1,000) search results for a query, and it may boost pages in rankings based upon that co-occurrence. That seemed like a possible reason why those pages might be appearing on the first page of results. The other reranking approaches that I included also seemed like they might be responsible, in part or in full, for those rankings. Then I found a patent granted to Google this week that seems like an even better fit.
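To make the co-occurrence idea a little more concrete, here is a minimal sketch in Python. It is not Google's method: the function names, the list of candidate phrases, and the `min_share` cutoff are all assumptions made up for illustration, and a real system would use a more careful statistical measure than a raw share of documents.

```python
from collections import Counter


def related_phrases(top_results, candidate_phrases, min_share=0.5):
    """Return candidate phrases found in at least min_share of the top results.

    top_results: texts of the top-ranked pages for a query (hypothetical input).
    candidate_phrases: lowercase phrases to look for (hypothetical input).
    """
    counts = Counter()
    for text in top_results:
        lowered = text.lower()
        for phrase in candidate_phrases:
            if phrase in lowered:
                counts[phrase] += 1  # count each phrase once per document
    threshold = min_share * len(top_results)
    return {phrase for phrase, count in counts.items() if count >= threshold}


def cooccurrence_boost(page_text, phrases):
    """Toy boost: how many of the related phrases a page contains."""
    lowered = page_text.lower()
    return sum(1 for phrase in phrases if phrase in lowered)
```

A page that contains more of the phrases that tend to show up across the top results would get a larger toy boost here, which is the general intuition behind reranking on co-occurrence.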
Last Friday, in a well-received and thoughtful White Board Friday at SEOmoz originally titled Prediction: Anchor Text is Dying…And Will Be Replaced by Co-citation (the title has since been changed at SEOmoz to Prediction: Anchor Text is Weakening…And May Be Replaced by Co-Occurrence), Rand Fishkin described how some unusual search results caused him to question how Google was ranking some results.
I’m a big fan of looking at and trying to analyze and understand search results for specific queries, especially when they include results that appear somewhat puzzling, and I think those provide some great fodder for discussions about how Google might be ranking some search results. Thanks, Rand.
If I were to tell you that the major search engines have a bigger and richer database full of information than their index of the World Wide Web, would you believe me? Chances are that you’re one of the people who helped build it. The information that Google and Bing and Yahoo collect about the searches and query sessions and clicks that searchers perform on the Web covers an incredible number of searches a day. When Google introduced their Knowledge Graph this past May, they gave us a hint of the scope and usage of this database:
For example, the information we show for Tom Cruise answers 37 percent of next queries that people ask about him. In fact, some of the most serendipitous discoveries I’ve made using the Knowledge Graph are through the magical “People also search for” feature.
When someone performs a search for a query that doesn’t produce many results at Google or Bing, the search engines might remove some of the query terms to provide more results, or they might look for synonyms that might help fill the same or a similar informational need. But chances are that such approaches still might not produce the kinds of results that searchers want to see.
Can social networking rankings influence which users’ profiles and interactions get crawled and then indexed first by a search engine crawling program? A Microsoft patent application asks and answers that question. Is it something that Bing is using, or will use?
Importance Metrics for Prioritizing Crawls
Back in the early days of Google, PageRank wasn’t just a way of ranking pages based upon the quality and quantity of links pointed to your pages. Google also used PageRank as one of the importance metrics for deciding which URLs to crawl first. The paper, Efficient Crawling Through URL Ordering (pdf), co-authored by Google founder Lawrence Page, pointed to a few metrics that were used to decide which URLs to visit first on a crawl, including PageRank. Another of those looked at how close a page is to the root directory of a site. The idea behind that one is that it’s better to index a million different home pages than it is to index a million pages on one site.
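As a rough illustration of crawl ordering by importance metrics, the sketch below sorts a list of URLs so that pages closer to the root of a site come first, with an optional importance score (standing in for something like PageRank) breaking ties. The function names and the `importance` dictionary are assumptions for this example, not anything taken from the paper.

```python
import heapq
from urllib.parse import urlparse


def url_depth(url):
    """Number of non-empty path segments; 0 for a home page."""
    path = urlparse(url).path
    return len([segment for segment in path.split("/") if segment])


def order_crawl(urls, importance=None):
    """Order URLs: shallowest first, then highest importance, then input order.

    importance: optional dict of URL -> score (a hypothetical stand-in
    for a link-based metric such as PageRank).
    """
    importance = importance or {}
    heap = []
    for position, url in enumerate(urls):
        # heapq is a min-heap, so the importance score is negated
        # to make higher scores come out earlier.
        key = (url_depth(url), -importance.get(url, 0.0), position, url)
        heapq.heappush(heap, key)
    return [heapq.heappop(heap)[3] for _ in range(len(heap))]
```

With this ordering, a crawler working through a limited budget would reach many different home pages before it went deep into any single site.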
With the growth of social networks and an incredible amount of user generated content that comes with them, there’s a lot less reliance upon links, and yet search engines want to crawl and index as much content from those types of sites as well. The lack of links to those means that something like PageRank is out of the question – and probably would be if we were talking about Google, too. Search engines don’t just want to crawl and then index user profiles, but also the things users of those networks post and the conversations that they have. Why not focus upon crawling content from people who are more active on those social networks?
Social networking content should be relevant and recent when shown in search results. But the ranking of that social content is an area that is fairly new to social networks, and something that there are really no established methods for. A search engine can grab a crawl list from a social network, with the URLs of pages and posts and pictures to crawl, but where should it start? Such a crawl list can even be easy to retrieve, especially in cases where a social network like Twitter might turn over an XML feed to a search engine. But again, where to begin?
Can the quality of links that your pages or videos or other documents link to influence the ranking of your pages, based upon a reachability score? A newly granted patent from Google describes how the search engine might look at linked documents and other resources reachable from a page or video or image to determine such a reachability score.
Pages might be promoted (boosted) or demoted in search results for a query based upon that reachability score, which is calculated from a number of different factors.
Someone clicks on a search result, and while there they find links to other resources that they might click upon. Different user behaviors recorded by a search engine might be monitored to determine how people interact with the first, or primary resource visited, and similar user behavior signals may also be looked at for pages or videos or other resources linked to from that resource. Reachability scores might also be calculated for those secondary resources linked to from the first resource, looking at the third or tertiary pages and other resources linked to from the secondary resources.
Calculating reachability scores may follow a process like the following: