Almost seven years ago, I started thinking about what documents I would recommend that people read if they wanted to learn as much about SEO as possible. SEO by the Sea was a little more than a couple of months old, and I started a series of posts that I called the “100 best SEO documents of all time.” I started the series knowing the first 30 papers, blog posts, and patents that I wanted to include in the series, and somehow never got past those first thirty.
Those posts were the three published immediately before the gathering that originally gave SEO by the Sea its name. I went from blogger to event organizer, and never quite returned to the series I had started. In the past couple of days, the first post got some attention on Twitter, and I promised to update the series.
The next ten documents are ones I’ve been thinking about quite a bit since reading them, and about what they might mean for the future of search.
At the Intersection of Search and Social
The Anatomy of a Large-Scale Social Search Engine (pdf) – April, 2010
This first paper was written by the Aardvark team in the days before the company was acquired by Google, and the title was part homage to the original Google paper, The Anatomy of a Large-Scale Hypertextual Web Search Engine. It described a search that was more inspired by how people find information in a village than in a library. The Aardvark project was discontinued by Google, but I wouldn’t be surprised if it returned as part of the intersection between search and social networking in Google’s Search Plus Your World.
Ranking User Generated Web Content – October, 2009
I wrote about this patent filing in the post, How Google Might Rank User Generated Web Content in Google + and Other Social Networks, and there’s a decent chance that the credential scoring described in the filing is similar in many ways to what Google uses to rank social search results. The patent gives us a look at some of the signals a search engine might use to assign reputation or credential scores to people who write blog posts or microblog posts at places like Google Plus, and at Q&A-type sites. It also shows how the meaningfulness of responses and interactions might be measured, and how authority and expertise in different topics might be determined.
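The patent doesn’t publish a formula, but one way to picture a credential score is as a weighted combination of per-author signals. A minimal sketch, with signal names and weights that are entirely invented for illustration (the patent describes these kinds of signals only at a high level):

```python
# Illustrative sketch only: the patent discusses reputation signals in
# general terms; the signal names, weights, and formula here are invented.

def credential_score(signals, weights=None):
    """Combine per-author reputation signals into a single score in [0, 1]."""
    if weights is None:
        # Hypothetical weights -- the patent does not disclose any.
        weights = {
            "reply_quality": 0.4,     # how meaningful responses are rated
            "topic_expertise": 0.35,  # demonstrated authority in a topic
            "interaction_rate": 0.25, # how often others engage with posts
        }
    total = sum(weights.values())
    return sum(signals.get(k, 0.0) * w for k, w in weights.items()) / total

author = {"reply_quality": 0.9, "topic_expertise": 0.7, "interaction_rate": 0.5}
score = credential_score(author)
```

A real system would presumably learn such weights per topic rather than fix them by hand, which is closer to the per-topic expertise the patent describes.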
Bigger Index Using Smaller Files and Incremental Updating
Some of the processes and technologies described in the patents and papers that come from the search engines aren’t quite ready for prime time. It’s not that they lack good ideas, but rather that the technology may not have caught up with them yet. And then we get advances in technology that make changes possible. One of those changes was an infrastructure update at Google named Caffeine. The paper and the patent that follow describe changes at Google that made the search engine much faster, as well as capable of holding a lot more information with the same number of servers.
The Percolator-based indexing system (known as Caffeine) crawls the same number of documents, but we feed each document through Percolator as it is crawled. The immediate advantage, and main design goal, of Caffeine is a reduction in latency: the median document moves through Caffeine over 100x faster than the previous system.
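The core idea in that quote can be illustrated with a toy inverted index that is updated one document at a time rather than rebuilt in a batch. This is only a conceptual sketch of incremental indexing, nothing like Google’s actual implementation:

```python
# Toy sketch of incremental (Percolator-style) index updates, as opposed
# to rebuilding the whole index in a batch. Conceptual only.
from collections import defaultdict

class IncrementalIndex:
    """Inverted index updated one document at a time, as each is crawled."""

    def __init__(self):
        self.postings = defaultdict(set)  # term -> set of doc ids

    def add_document(self, doc_id, text):
        # Each new document becomes searchable immediately -- the latency
        # win the Percolator paper describes, versus waiting for the next
        # full batch rebuild.
        for term in text.lower().split():
            self.postings[term].add(doc_id)

    def search(self, term):
        return self.postings.get(term.lower(), set())

index = IncrementalIndex()
index.add_document("doc1", "caffeine made indexing faster")
index.add_document("doc2", "incremental indexing with percolator")
```

The batch approach would collect all crawled documents first and build the postings in one pass; the incremental version trades some throughput for much lower per-document latency, which is exactly the trade the quote highlights.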
Predicting User Behavior with Web Page Features
While the past couple of documents looked at infrastructure updates at Google, the next two look at processes for handling very large amounts of data. The first paper describes how those kinds of data sets might be processed using Google’s MapReduce system. The second takes what was learned in the first, and applies it to features identified in landing pages and advertisements to predict the bounce rates of sponsored advertisements. The papers tell us that the methods developed could be used with other large data sets, such as pages in web search, to predict user behavior based on features found in a sample set of pages.
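To make the feature-based prediction idea concrete, here is a minimal logistic-model sketch of estimating a bounce probability from page features. The feature names, weights, and bias are invented for illustration; the papers describe the general approach of learning from features over a large sample of pages, not these particular values:

```python
# Hypothetical sketch of predicting bounce probability from landing-page
# features. All feature names and weights are invented for illustration.
import math

def bounce_probability(features, weights, bias=0.0):
    """Logistic model: both arguments map feature name -> numeric value."""
    z = bias + sum(weights.get(name, 0.0) * value
                   for name, value in features.items())
    return 1.0 / (1.0 + math.exp(-z))  # squash to a probability in (0, 1)

# Invented weights: slow loads and popups raise bounce odds,
# relevant keywords lower them.
weights = {"load_time_secs": 0.8, "popup_count": 1.2, "relevant_keywords": -0.9}
page = {"load_time_secs": 3.0, "popup_count": 2, "relevant_keywords": 4}
p = bounce_probability(page, weights, bias=-1.0)
```

In the papers’ setting, the weights would be learned from millions of ad impressions with MapReduce rather than set by hand, but the prediction step looks much like this.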
What’s a link worth?
The next patent confirmed something that many of us had suspected for a few years, something also acknowledged by Google’s Matt Cutts and by Priyank Garg, the director of product management for Yahoo! Search Technology. The July, 2008 interview with Priyank Garg by Eric Enge and the patent confirmed hunches about changes to the way the major search engines were treating links. Instead of each link on a page carrying the same value as any other link, we learned about a number of features that the search engines might consider when determining how much weight a link might pass along.
Ranking documents based on user behavior and/or feature data – Filed June, 2004
My write-up of the patent cuts through some of the legalese – see: Google’s Reasonable Surfer: How the Value of a Link May Differ Based upon Link and Document Features and User Data
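The contrast with the old “random surfer” model can be sketched in a few lines: rather than splitting a page’s link value equally, weight each link by features suggesting how likely a user is to click it. The features and multipliers below are invented for illustration; the patent lists many more (link position, font size, anchor text, and so on):

```python
# Illustrative reasonable-surfer sketch. The feature flags and multipliers
# are invented; the patent describes many more features, without weights.

def link_weights(links, page_value=1.0):
    """Distribute a page's link value across links by click likelihood.

    links: list of dicts with hypothetical feature flags.
    Returns one weight per link, summing to page_value.
    """
    raw = []
    for link in links:
        w = 1.0
        if link.get("in_footer"):        # footer links: less likely clicked
            w *= 0.2
        if link.get("in_main_content"):  # prominent links: more likely
            w *= 2.0
        raw.append(w)
    total = sum(raw)
    return [page_value * w / total for w in raw]

links = [{"in_main_content": True}, {"in_footer": True}, {}]
weights = link_weights(links)
```

Under the older model all three links would receive a third of the page’s value each; here the main-content link gets the largest share and the footer link the smallest, which is the behavior the patent describes.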
Big Big Data
The Unreasonable Effectiveness of Data – March/April, 2009
A great article, and an even better video, on how access to a very large amount of data can make even weak algorithms provide useful results. The video is around an hour long, but it’s highly recommended.
The new knowledge base results at Google and Bing don’t just pull information from sources like Wikipedia and Britannica, but they also look to query logs from the search engines to understand the kinds of things that searchers might be looking for when they perform searches. As the paper tells us, millions of searchers add to the data in search query logs daily, and those “facts” about the things being searched for can help the search engines learn what searchers may be interested in when they perform a query.
I’m going to try to finish this series over the next six weeks, with a new post every week. What we end up with may not be the 100 best SEO documents of all time, but hopefully we’ll get a lot of the really good ones in here.