Timing is everything. On Monday, I was asked if I would give a presentation on an advanced SEO topic at the Internet Summit 2011 in Raleigh, NC, on November 15th and 16th. I thought about it and agreed, and on Tuesday night I decided to give a presentation on how social media has been transforming search. On Thursday morning, Google delivered a present in the form of a post at the Official Google Blog titled Giving you fresher, more recent search results.
The title for my presentation is “Next Level SEO: Social Media Integration” and the basic premise is that social media has changed the expectation of searchers. Searchers want fresher content, they want to see what their friends and contacts have to say, and they want access to experts and authority figures and their thoughts on timely events and news. Search engines have no recourse but to respond.
I didn’t see the Google blog post until yesterday afternoon, and quickly wrote up some of my thoughts at Google Plus regarding Fresher Results at Google? There are a number of other very thoughtful reactions to the change, and I thought I might point those out, and maybe expand upon my thoughts from yesterday.
Here are a few of the posts that captured my attention:
Gianluca Fiorelli explored the reasons for the freshness update and some of the history behind it in Completing the Google Puzzle, almost. This snippet from the post puts the change into perspective:
Google is an substantially an editor (even it will never recognize it) that sells ad spaces, and Search is still its main product to sell toward the advertiser. So Google needs to have the best product to continue selling ad space. That product was ironically endangered by Caffeine. With Panda Google tries to solve the content issue and with the Social Signals linked inextricably to the Authors and Publishers tries to solve the obsolescence of the Link Graph.
Justin Briggs digs deeply into some of the methodologies that Google may be using to deliver fresher results in Methods for Evaluating Freshness. His analysis of how Google looks at a document inception date sounds spot on, as does his description of the use of microblog data to uncover fresh and bursty topics and pages using social media.
I also like the SEOmoz presentation from Rand Fishkin and Mike King, Google’s “Freshness” Update – Whiteboard Friday, which covered a number of the implications of the update for search marketers.
In my Google Plus post, I mentioned three alternative approaches that Google might follow either independently or in combination to deliver fresher results.
Using Social Media to Find Bursty Topics and Associated Pages
One of them is to grab information from social streams such as Google Plus, Twitter, and others as quickly as possible to find recently popular topics that people might search for. I wrote about a Yahoo patent that explores this approach in the post Do Search Engines Use Social Media to Discover New Topics? It’s not hard to imagine that Google would do something similar with Google Plus posts at the very least, if not also from crawling sites like Twitter.
As I wrote in that post:
The patent filing explores “recency-sensitive” queries, where searchers are looking for resources that are both topically relevant as well as fresh, such as novel information about an earthquake. If you’ve been watching twitter streams, Facebook updates, and other social media, you’ve seen that sometimes these sources are the best and fastest places on the Web to find that kind of information.
It’s possible that a search engine that ignores sources like those isn’t going to be able to return any relevant results for those types of queries – what the patent’s inventors call a “zero recall” problem.
If I want to find information about very recent events, such as the things going on at Occupy Wall Street or Occupy Oakland, I’ve been finding it much easier to type the hashtags for those into Twitter than to search at Google or Bing or Yahoo. At this point, Twitter still seems to be very much ahead of Google’s new freshness approach in pointing out recent pages about those protest movements.
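To make the idea of finding bursty topics in a social stream a little more concrete, here is a minimal sketch in Python. It is not taken from the Yahoo patent or from anything Google has published. The function name, the thresholds, and the choice to look only at hashtags are all my own assumptions for illustration. It simply compares how often a hashtag shows up in a recent window of posts against how often it showed up in an older baseline.

```python
from collections import Counter


def bursty_terms(recent_posts, baseline_posts, min_count=3, ratio=3.0):
    """Return (hashtag, burst_ratio) pairs for tags that are far more common
    in recent posts than in baseline posts. Thresholds are illustrative."""

    def hashtag_counts(posts):
        counts = Counter()
        for text in posts:
            # Count each hashtag once per post so one noisy post cannot dominate.
            tags = {
                word.lower().rstrip(".,!?")
                for word in text.split()
                if word.startswith("#")
            }
            counts.update(tags)
        return counts

    recent = hashtag_counts(recent_posts)
    baseline = hashtag_counts(baseline_posts)
    n_recent = max(len(recent_posts), 1)
    n_base = max(len(baseline_posts), 1)

    bursts = []
    for tag, count in recent.items():
        if count < min_count:
            continue
        recent_rate = count / n_recent
        # Add-one smoothing so a tag never seen before does not divide by zero.
        base_rate = (baseline[tag] + 1) / (n_base + 1)
        burst_ratio = recent_rate / base_rate
        if burst_ratio >= ratio:
            bursts.append((tag, burst_ratio))
    return sorted(bursts, key=lambda item: item[1], reverse=True)
```

A search engine doing something like this would presumably also follow the links shared alongside those hashtags to find the associated pages, which this sketch leaves out.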
Using Distributed Crawling Through Browsers to Find Bursty Topics and Associated Pages
Another approach that Google could follow to get very recent information to show to searchers is one that circumvents social media itself and looks at browsing information directly. The search engine WOWD developed this approach over the past couple of years without the use of a web crawler. Instead, it used information gathered from a browser plugin to see what links people are clicking upon. If a lot of people started clicking upon certain new links, those might indicate hot topics and pages that should be shown to searchers.
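As a rough illustration of how click data from a browser plugin might surface hot new pages, here is a small Python sketch. It is not how WOWD actually worked, which I can only describe from its patents. The event format, the one hour window, and the minimum number of distinct users are assumptions I made up for the example.

```python
from collections import defaultdict


def hot_new_links(clicks, now, window=3600, min_users=3):
    """clicks: iterable of (user_id, url, timestamp_in_seconds).

    Returns (url, distinct_user_count) for links that were first seen inside
    the window and were clicked by at least min_users different people."""
    first_seen = {}
    recent_users = defaultdict(set)
    for user, url, ts in clicks:
        first_seen[url] = min(ts, first_seen.get(url, ts))
        if now - window <= ts <= now:
            recent_users[url].add(user)

    hot = [
        (url, len(users))
        for url, users in recent_users.items()
        # Keep only links that are new, not old pages that are still popular.
        if first_seen[url] >= now - window and len(users) >= min_users
    ]
    # Most-clicked first, with the URL as a tie breaker for stable output.
    return sorted(hot, key=lambda item: (-item[1], item[0]))
```

Counting distinct users rather than raw clicks is a design choice on my part, so that one person reloading a page repeatedly does not make it look popular.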
Google acquired the patents from WOWD in late June of this year, and I wrote about the acquisition last month in Wow! Google Acquires Wowd Search Patents. As I wrote in that post:
This is an interesting group of patents, from their approach to distributing crawling of web pages by people with the application installed, to ranking “fresh” or “popular” recent pages, to the recommendation system Wowd has developed. It’s possible that Google may have purchased these patents to protect some of the things that they may have already been doing or were planning on doing, or that they may implement some of the technologies described within the acquired patents.
If Google were using the WOWD approach, I would expect to see fresher results for the occupy protests than I am seeing right now when I search at Google.
Caffeine and an Updated Change Analysis to Find Fresher Content
The third alternative is a change to the way that Google handles fresh results under the patents that came from Google’s Historical Data patent, which are intended to address both the problem of stale results and some spam results in response to queries.
Justin’s post discusses a number of the issues that are described by these patents, such as how freshness might be determined by things such as when a page was first crawled. Google did recently file for a divisional patent on one of the patents from the historical data patent family, which described a change in the way that Google might determine whether important changes might have taken place on a page over time.
I wrote about that change in the post Revisiting Google’s Information Retrieval Based Upon Historical Data, which looked at changes to a patent titled Document Scoring Based on Document Content Update. How does the updated version differ from the original version?
The original claims for the patent told us that Google might ignore the “boilerplate” language it finds on pages, and the changes to those. In the newer version, instead of mentioning the word boilerplate, the patent tells us that it might calculate the frequency with which words appear on a page (excluding stop words), and look at changes to the section of a page that contains the most frequently used words.
So, for a page that has been around for a while and has updated information in its main content area – which is usually the section of a page that contains the most frequently used words – the changes to that page might make it appear fresher than its original document inception date, or when Google first crawled the page. This change, which doesn’t require a boilerplate analysis but rather may just look at changes to a main content area, is probably something that can be processed much faster under the incremental update approach from Google Caffeine.
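Here is a minimal Python sketch of what that kind of analysis could look like. It is my own interpretation, not the method claimed in the patent. The short stop word list, the assumption that a page arrives already split into sections that line up by position between versions, and the 30 percent change threshold are all hypothetical.

```python
import re
from collections import Counter

# A tiny illustrative stop word list; a real one would be much longer.
STOP_WORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "on", "is", "at", "for"}


def tokens(text):
    """Lowercase words from text, with stop words removed."""
    return [w for w in re.findall(r"[a-z0-9']+", text.lower()) if w not in STOP_WORDS]


def main_section_index(sections, top_n=3):
    """Index of the section holding the most of the page's top_n frequent words."""
    page_counts = Counter(w for section in sections for w in tokens(section))
    top_words = {word for word, _ in page_counts.most_common(top_n)}
    scores = [sum(1 for w in tokens(section) if w in top_words) for section in sections]
    return scores.index(max(scores))


def main_content_changed(old_sections, new_sections, threshold=0.3):
    """True if the main content section differs enough between two versions."""
    idx = main_section_index(new_sections)
    new_words = Counter(tokens(new_sections[idx]))
    old_words = Counter(tokens(old_sections[idx])) if idx < len(old_sections) else Counter()
    # Words added plus words removed, relative to the size of both versions.
    changed = sum(((new_words - old_words) + (old_words - new_words)).values())
    total = max(sum(new_words.values()) + sum(old_words.values()), 1)
    return changed / total >= threshold
```

Under this sketch, a change to a navigation bar or a footer would not make a page look fresher, while a rewritten main content area would, and there is no separate boilerplate detection step.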
The section on Google Caffeine in my post Son of SEO is Undead (Google Caffeine and New Product Refinements) describes some of the changes brought about by the Caffeine update:
In a nutshell, those changes involved:
- Reducing the default sizes of blocks within which files are stored in the Google File System, from 64MB to 1MB, which enables hard drives to hold considerably more information when they contain small files.
- Allowing metadata on a master server to be distributed to multiple master servers, so that searches can more easily be split into parts.
- Placing information about specific pages on multiple smaller files instead of one larger file, and only updating the parts that change for a page instead of all of the information about that page.
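The last of those points, only updating the parts that change, can be illustrated with a short Python sketch. This is a toy model and an assumption on my part, not a description of how Google's storage works. The in-memory dictionary, the part names, and the use of a SHA-256 digest to detect changes are all hypothetical.

```python
import hashlib


def update_changed_parts(store, url, parts):
    """store: dict mapping (url, part_name) -> (digest, content).
    parts: dict mapping part_name -> content for the newly crawled page.

    Rewrites only the parts whose content differs from what is stored and
    returns the names of the parts that were rewritten."""
    rewritten = []
    for name, content in parts.items():
        digest = hashlib.sha256(content.encode("utf-8")).hexdigest()
        existing = store.get((url, name))
        if existing is None or existing[0] != digest:
            store[(url, name)] = (digest, content)
            rewritten.append(name)
    return rewritten
```

The first crawl of a page writes every part. A later crawl where only the main content has changed touches just that one part, which is the kind of saving that would make frequent freshness checks cheaper.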
Instead of a more complicated analysis to identify changes in content, the new approach may just look at the main content area of a page to see if the content contained within it is fresher.
For queries where the average age of top results is fairly fresh, as influenced by substantial changes to the main content area of those pages, freshness of certain results may move those results up higher in rankings. Those may have to compete against very relevant new results as well.
Under that approach, if we perform a search for a bursty topic like [occupy oakland], we should see within search results some pages that may have been around for a while but which have some fairly fresh main content, as well as newer pages. They might not be as fresh as new pages pointed to by lots of tweets or Google Plus posts, but they should work towards solving the “zero recall” problem mentioned in my quote above for recency-sensitive queries.
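One way to picture that is the Python sketch below. It is purely a guess at the general shape of such a rule and not something described in Google's announcement. The seven day cutoff, the size of the boost, and the formula are assumptions, and "age" here stands for days since the last substantial main content change (or since the page was first crawled).

```python
def rerank_with_freshness(results, fresh_query_days=7.0, boost=0.5):
    """results: list of (url, relevance_score, age_in_days), sorted by relevance.

    If the average age of the results suggests the query is recency sensitive,
    add a boost that shrinks as a result gets older. Otherwise leave the
    original ordering alone."""
    if not results:
        return []
    average_age = sum(age for _, _, age in results) / len(results)
    if average_age > fresh_query_days:
        return list(results)

    def adjusted_score(item):
        _, score, age = item
        return score + boost / (1.0 + age)

    return sorted(results, key=adjusted_score, reverse=True)
```

In this sketch, a page updated today can overtake a slightly more relevant page that is ten days old, but only for queries where the top results as a whole are already fresh.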
Some topics demand very fresh content, and social media has raised the expectations of searchers by providing them with a way of finding fresh information on natural disasters, breaking news events, and other topics that are recency sensitive. If you search at one of the search engines and get no results for those topics, that might seem like a failure on their part. Google’s move towards providing fresher results was pretty much demanded by the expectations of searchers who want fresher content.
The three approaches that I’ve described for displaying fresher content directly in search results may be methods that Google is using now, or could be using in the future. Of course, they could be looking at other information and following other approaches as well.
If you get the chance to see my presentation in Raleigh in a couple of weeks, I’ll probably be discussing this topic more and other things involving how the search engines may be integrating social signals into search, and I’ll look forward to talking with you about them there. If you can’t make it, let me know what your thoughts are on the topic below.