What’s a good way to organize the index of a search engine?
A way that is fast and returns a lot of relevant results? Maybe one that doesn’t need to be search the whole index to find results?
A newly granted patent from Microsoft provides some interesting insights into indexing by document, and how static ranking factors may influence whether a document is in the main index partition, or if it might be found in a later partition acting like an extended index.
In a recent post from Dan Thies, Why Google Can’t Just “Dump” PageRank, he discusses the importance of PageRank as a mechanism for a search engine to use to decide which pages to put in its main index, and which ones to put in its extended index.
It’s an excellent point, and references from Google on how they may manage their index seem to support it well. Here are a couple of posts that I’ve written on patents issued to Google about main and extended databases:
- On Supplemental Results, Partitioned Indexing, and Extended Indexes
- Google Patent on Extended Search Indexes
We don’t know if those describe how Google’s indexes work, but they are interesting to compare to the methods described in the Microsoft patent.
Microsoft refers to its “query independent ranking factor” in this document as a “static” ranking.
The granted patent is:
Index partitioning based on document relevance for document indexes
Inventors: Darren Shakib, Gaurav Sareen, and Michael Burrows
Assigned to Microsoft
United States Patent 7,293,016
Granted November 6, 2007
Filed: January 22, 2004
The abstract of the document sets out the concept behind the document pretty well:
Indexed documents are arranged in the index according to a static ranking and partitioned according to static ranking. Index queries reference the first partition and move to a subsequent partition when a static rank for the subsequent partition is higher than a weighted portion of the target score added to a weighted portion of a dynamic rank corresponding to the relevance of the results set generated thus far. By changing the weight of the target score and dynamic ranks in the subsequent partition score, searches can be stopped when no more relevant results will be found in the next partition.
The patent goes into a lot of detail, and if you’re interested in how it works, it’s worth a look.
While I was doing some research on this patent, I came across an excellent article about one of the inventors listed, Mike Burrows, who worked at Microsoft and is now at Google. It’s from Stanford University’s Cardinal Inquirer – The Genius: Mike Burrows’ self-effacing journey through Silicon Valley. (Unfortunately, the article seems to be no longer available on the Web.)
Here’s a brief snippet:
One of the highest-ranking computer scientists at Google, Mike works on Googleâ€™s distributed system, a non-centralized network consisting of numerous computers that can communicate with one another and that appear to users as parts of a single storehouse of shared hardware, software, and data. In the circle of high-tech experts, Mike has a special place for his role in inventing Alta Vista, the first multilingual search engine.
This Cardinal Inquirer story is one of the better articles on the history of search engines that I’ve read in quite a while. Highly recommended.
9 thoughts on “Microsoft on Index Partitioning”
I had a boss try to explain indexing and all that to me once…
I actually fell so asleep that my head craned back and I did that sgkgkhgklkahhhhhh snore right in front of him.
It’s a miracle that I didn’t get fired. Heck, I didn’t even get threatened. I think he secretly thought I was as funny as I thought I was. Heh.
Indexing can be a tough subject, and it’s not easy trying to explain or understand it in one sitting. It’s often easier to pay attention to when you have real world examples that you can look at over time, and more compelling if you have Web pages that you care about that you want included in that index. 🙂
One of the better, and more interesting approaches to describing indexing that I’ve seen is one that Matt Cutts put together in a Google Newsletter for Librarians – How does Google collect and rank results?
I don’t know about infringement, but there are some interesting comparisons that can be made between the process described in Microsoft’s patent, and the two granted Google patents on expanded databases.
The list of cited patents referenced include more than a couple of Google patents, as well as a couple of earlier patents from Mike Burrows from his Altavista (Digital Equipment Corp.) days.
Matt’s page is a nice intro to indexing, isn’t it?
As usual, another gem. I think the supplemental index is infringing on this granted patent. What do you think?
PS: Thanks for the MC link
Bill, you continue to amaze me with your careful research.
If you have some time to talk through this one with me, please ping me by email.
Thanks. Definitely, it would be a pleasure. I’ll be in touch.
Yes. I am not sure how I missed that article. Very nice intro.
Thank you for that insights about indexing. I used a part of this information as an example in a presentation for a client. They finally (and only) understood that SEO is not a “magic button” business and has to do with serious work and agreed to pay for it.
SEO, like any endeavor worth doing, is worth taking the time and making the effort to do right. Sometimes you can find opportunities while doing SEO that might seem like “magic buttons,” but getting to the point where a small change can have a big impact is often preceded by taking a lot of the right steps first.
Comments are closed.