Can looking at how many times rare words appear in a search engine’s index give us an idea of the size of that search engine’s database?
About a week ago, I wrote about some of the most common English words in the indexes for Google, Yahoo, Bing, Ask, and Google Caffeine. I took a look at 50 of the most frequently appearing words in English, and at estimates from those search engines of the number of times that those words showed up.
Comparing the number of results between the different search engines for those common words really didn’t tell us anything about the relative sizes of the indexes for those search engines for a number of reasons.
One is that the numbers of results shown are rough estimates only. It’s also possible that the way estimates are calculated differs considerably from one search engine to another. Some of the pages listed among those results are likely duplicate pages at different URLs, or may have contained misspellings of the words. Some of the words may be abbreviations or acronyms, as well (such as “it” being an abbreviation for information technology).
Some pages also show up as relevant for a particular search query without actually including that term on the page itself. For example, the Adobe Reader download page has ranked at the top of search results for the term “click here” on Google for years, without that phrase appearing on that page. The sheer number of links pointing to the page with those words as anchor text has been enough for it to show up in search results for the term.
As I noted in that post, it might be possible to get a more realistic look at the relative sizes of search engine indexes by looking at the number of search results for terms that are rare, rather than looking at the most frequently appearing words. Cuil’s CEO and founder, Tom Costello, recently described using that technique in a post about Bing on his blog (no longer available), to tell us that “Bing is now around 20% the size of Google.”
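To make the idea concrete, here is a minimal sketch in Python of how result counts for rare terms might be turned into a relative size estimate. It is only an illustration: I don’t know how Costello actually calculated his figure, and the function name, the use of a median, and the placeholder counts in the example are all my own assumptions.

```python
from statistics import median


def relative_index_size(counts_a, counts_b):
    """Estimate the size of index A relative to index B.

    counts_a and counts_b map each rare word to the number of results
    an engine returned for it. Taking the median of the per-word ratios
    is an assumption made for this sketch; it keeps one odd word from
    skewing the estimate.
    """
    ratios = [
        counts_a[word] / counts_b[word]
        for word in counts_a
        if counts_b.get(word, 0) > 0  # skip words engine B does not list
    ]
    if not ratios:
        raise ValueError("no shared words with results in the second index")
    return median(ratios)


# Placeholder numbers only, not real search results.
engine_a = {"word_one": 10, "word_two": 20, "word_three": 5}
engine_b = {"word_one": 50, "word_two": 100, "word_three": 50}
print(relative_index_size(engine_a, engine_b))
```

With only a handful of words, an estimate like this is very noisy, which is why a large sample of rare terms matters.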
I don’t have access to an advanced web crawler like the CEO of Cuil might, to identify a large number of “rare” terms. I’m also using a very small sample size, but I wanted to take a look at a few “very rare” English words, to see how frequently they appeared at the search engines.
I identified a number of English words that appear in fewer than 1,000 search results at Google Caffeine, Google, Yahoo, Bing, Ask, and Cuil by looking at the Phrontistery’s Compendium of Lost Words, and doing searches for those terms. Since these search engines will only show the first 1,000 results for a query, staying under that limit makes it possible to see all the URLs for the terms, to use actual numbers rather than estimates, and to see if the words actually appear upon the pages listed. If I had a much larger sample size, I would feel comfortable in saying that the following table gives us a much better idea of the relative sizes of the indexes for the search engines that I’ve included.
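I did this checking by hand, but the counting step could be sketched in Python roughly as follows. This is an illustration under my own assumptions rather than a tool I used: the function names are hypothetical, the URL normalization is a crude stand-in for what a search engine treats as a duplicate, and the results are assumed to have been collected already as pairs of URL and page text.

```python
import re
from urllib.parse import urlsplit


def normalize_url(url):
    """Reduce a URL to a rough key so obvious duplicates collapse together."""
    parts = urlsplit(url.strip())
    host = parts.netloc.lower()
    if host.startswith("www."):
        host = host[4:]
    path = parts.path.rstrip("/") or "/"
    query = "?" + parts.query if parts.query else ""
    return host + path + query


def count_real_hits(word, results):
    """Count unique pages whose text really contains the word.

    results is an iterable of (url, page_text) pairs. Pages that rank
    for the word without containing it are not counted.
    """
    pattern = re.compile(r"\b" + re.escape(word) + r"\b", re.IGNORECASE)
    seen = set()
    for url, text in results:
        key = normalize_url(url)
        if key in seen:
            continue  # already counted this page under another URL
        if pattern.search(text):
            seen.add(key)
    return len(seen)
```

Matching on whole words keeps a term from being counted when it only appears inside a longer word, though it will not catch misspellings or pages that are near duplicates at unrelated URLs.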
Here are some very rare English words, and the number of times that they actually appear in search results at different search engines (not counting duplicate pages and “substantially similar” results).