In The Anatomy of a Large-Scale Hypertextual Web Search Engine, Sergey Brin and Lawrence Page officially presented Google and its use of hypertext to index documents on the Web and produce better search results.
If you’re interested in discovering how search engines work, there are few better starting points than that document.
A new patent granted to Google this week, System and method for selectively searching partitions of a database, gives us a deeper glimpse into the inner workings of a search engine and its index.
It describes how partitions can be used to make it faster and easier to search through the index of a search engine, and how rarer and less common results for queries might be kept in an extended index. That extended index is also the topic of another patent granted to Google earlier this year, which shares the same list of inventors and was filed on the same day. I wrote about it in Google Patent on Extended Search Indexes.
Is this extended index what we have been calling Google’s supplemental index? It could be, or could have been. A recent post on the Official Google Webmaster Central Blog about supplemental results, Supplemental goes mainstream, tells us that Google is going to stop labeling supplemental index results as “supplemental” because the differences between the main index and the supplemental index are growing narrower, and supplemental index results are now fresher and more comprehensive than they have ever been.
Interestingly, it also tells us that supplemental index results were introduced in 2003, which is when these two patents were originally filed.
The patent describes such things as:
- The use of a cache to return results for popular searches,
- Filtering of search results,
- How terms may be mapped to different partitions and sub-partitions,
- The role of PageRank in partitioning results,
- The use of document index sub-partitions, which contain information associating documents with the terms in those documents,
- How snippets for results might be requested,
- How a search of an extended database is triggered and which signals might be used to trigger such a search,
- How extended results are aggregated with results from the standard index,
- When an alternative approach might be used to tell a searcher that there are additional results that can be shown if they click upon a link.
We are also told that while this document describes web search results, this method of using partitions and extended results might also be used for other collections of documents such as books, catalogs, news, etc.
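To make the list above a little more concrete, here is a minimal Python sketch of how a cache, a partitioned standard index, and an extended index might fit together. It is an illustration only, not the patent's implementation. The hash-based mapping of terms to partitions, the class names, and the `min_results` threshold that triggers a search of the extended index are all my own assumptions; the patent discusses several possible trigger signals and does not prescribe this one.

```python
import zlib
from collections import defaultdict


def partition_for(term, num_partitions):
    """Map a term to a partition with a stable hash (an assumed scheme)."""
    return zlib.crc32(term.encode("utf-8")) % num_partitions


class PartitionedIndex:
    """A toy inverted index whose terms are spread across partitions."""

    def __init__(self, num_partitions=4):
        self.num_partitions = num_partitions
        self.partitions = [defaultdict(set) for _ in range(num_partitions)]

    def add(self, doc_id, text):
        for term in text.lower().split():
            slot = partition_for(term, self.num_partitions)
            self.partitions[slot][term].add(doc_id)

    def search(self, query):
        """Return the ids of documents that contain every query term."""
        terms = query.lower().split()
        if not terms:
            return set()
        postings = []
        for term in terms:
            # Only the partition that owns this term is consulted.
            slot = partition_for(term, self.num_partitions)
            postings.append(self.partitions[slot].get(term, set()))
        return set.intersection(*postings)


class SearchEngine:
    """Standard index first; extended index only when results are scarce."""

    def __init__(self, min_results=2):
        self.standard = PartitionedIndex()
        self.extended = PartitionedIndex()
        self.cache = {}
        self.min_results = min_results  # hypothetical trigger threshold

    def search(self, query):
        if query in self.cache:  # popular queries are answered from cache
            return self.cache[query]
        results = self.standard.search(query)
        if len(results) < self.min_results:
            # Assumed trigger: too few standard results, so aggregate
            # them with whatever the extended index holds.
            results = results | self.extended.search(query)
        ordered = sorted(results)
        self.cache[query] = ordered
        return ordered


engine = SearchEngine(min_results=2)
engine.standard.add("d1", "google search index")
engine.standard.add("d2", "google patent")
engine.extended.add("d3", "rare patent partitions")
print(engine.search("google"))  # enough standard results, no extended search
print(engine.search("patent"))  # too few, so extended results are merged in
```

In this sketch the cache is never invalidated, so documents added after a query has been cached will not show up for that query. A real system would also rank the merged results rather than sort them by id.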
The system used for supplemental results may have evolved since 2003, when these documents were filed. The Google Webmaster Central blog post tells us that the system that crawls and indexes supplemental results was overhauled in 2006. Note also that nowhere in these patents is the phrase “supplemental results” actually used, yet they do seem to explain how supplemental results work.
Thanks, Richard.
I try to summarize, and to give a little flavor for these patents, but I’ll admit that I still get more out of them if I return to them a few days, weeks, or months later. Some of them mention so many “alternative embodiments,” and minor details that might really be major details, that it isn’t always easy to pick up on all of them.
I’ll be revisiting this one a few times over the next month or two, I think.
The plug-in sounds like a good idea. I’ll look into that.
It’s only when I go to read some of these documents myself that I can really, really appreciate what you do. Light reading they are not.
This would coincide with reported instances of pages appearing in both indices. The PageRank references also make for interesting reading.
Best rgds
Richard
BTW & OT – any reason why you don’t use the subscribe to comments plugin, Bill?
Bill – This is an enormously useful find. Thanks a lot.
I needed more recent material to continue my Google’s inner workings series.
I plan to carefully study these and share.
BTW: Thanks for adding my blog to your blog roll. It is an honor.
Thank you, Hamlet.
I’ve been enjoying your posts, and the Google inner workings series. Looking forward to seeing what you come up with.
It looks like the patent is discussing Google’s proprietary “shard” file system (I’m not sure that is an entirely technically correct term).
Hi Michael,
Nice topic. Sharding is a way of dealing with large data sets, and it could be a part of this.
Neither patent uses the word shard, but that doesn’t mean that it’s not an aspect of what is being discussed. I thought it was worth looking through Google’s patent filings to see if any of them did use the word. Here are the ones that I found:
Method and apparatus for characterizing documents based on clusters of related words
Method and apparatus for learning a probabilistic generative model for text
Selecting and/or scoring content-relevant advertisements
A few days ago, a podcast on the Google Code Blog had a really nice discussion of shards and sharding, in the context of the development of the open source Hibernate Shards project:
Google Developer Podcast Episode Six: The Hibernate Shards Open Source Project
You know, it’s funny how many people I have come across that were petrified of the supplemental index. When I spoke to them about the discontinuation of the labeling, they were relieved…not sure why…I think it’s all in a name.
Hi Lewis,
Thanks for sharing your experience of how people have reacted. I haven’t heard too many reacting to it in a positive way, but that may be based on how I explained the news.
I liked having the label around, because it let me know that there was a potential issue that could be resolved. Sorry to see it go.
I can understand why people might not want to have to face having some of their pages in a supplemental index, but I’d still rather know.