Using Rare Words to Estimate Search Engine Index Sizes

Can looking at how many times rare words appear in a search engine’s index give us an idea of the size of the database for that search engine?

About a week ago, I wrote about some of the most common English words in the indexes for Google, Yahoo, Bing, Ask, and Google Caffeine. I took a look at 50 words that are amongst the most frequently appearing words in English, and estimates from those search engines about the number of times that those words showed up.

Comparing the number of results between the different search engines for those common words really didn’t tell us anything about the relative sizes of the indexes for those search engines for a number of reasons.

One is that the numbers of results shown are rough estimates only. It’s also possible that the way estimates are calculated differs considerably from one search engine to another. Some of the pages listed among those results are likely duplicate pages at different URLs, or may have contained misspellings of the words. Some of the words may be abbreviations or acronyms as well (such as “it” being an abbreviation for information technology).

Some pages also show up as relevant for a particular search query without actually including that term on the page itself. For example, the Adobe Reader download page has ranked at the top of search results for the term “click here” on Google for years, without that phrase appearing on that page. So many links using those words as anchor text pointing to the page have been enough for the page to show up in search results for the term.

As I noted in that post, it might be possible to get a more realistic look at the relative sizes of search engine indexes by looking at the number of search results for terms that are rare, rather than looking at the most frequently appearing words. Cuil’s CEO and founder, Tom Costello, recently described using that technique in a post about Bing on his blog (no longer available), telling us that “Bing is now around 20% the size of Google.”

I don’t have access to an advanced web crawler like the CEO of Cuil might, to identify a large number of “rare” terms. I’m also using a very small sample size, but I wanted to take a look at a few “very rare” English words, to see how frequently they appeared at the search engines.

I identified a number of English words that appear in fewer than 1,000 search results at Google Caffeine, Google, Yahoo, Bing, Ask, and Cuil by looking at the Phrontistery’s Compendium of Lost Words and doing searches for those terms. Since these search engines will only show the first 1,000 results for a query, staying under that limit makes it possible to see all the URLs for the terms, to use actual numbers rather than estimates, and to check whether the words actually appear on the pages listed. If I had a much larger sample size, I would feel comfortable in saying that the following table gives us a much better idea of the relative sizes of the indexes for the search engines that I’ve included.
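The counting step described above can be sketched roughly as follows. This is only an illustration, not the procedure actually used for this post: it assumes you have already collected each result as a (URL, page text) pair by hand or by some other means, and the helper names `normalize_url` and `count_verified_results` are hypothetical. It treats two URLs as duplicates only when they match after simple normalization, which is a much cruder test than a search engine’s own “substantially similar” filter.

```python
import re
from urllib.parse import urlsplit, urlunsplit


def normalize_url(url):
    """Reduce trivially different URLs to one form (assumed rules, not a standard)."""
    parts = urlsplit(url.strip())
    host = parts.netloc.lower()
    if host.startswith("www."):
        host = host[4:]
    # Treat "/words/" and "/words" as the same page; keep "/" for the root.
    path = parts.path.rstrip("/") or "/"
    # Drop the fragment, since it never identifies a different document.
    return urlunsplit((parts.scheme.lower(), host, path, parts.query, ""))


def count_verified_results(results, word):
    """Count distinct URLs whose page text really contains `word`.

    `results` is an iterable of (url, page_text) pairs gathered beforehand.
    """
    pattern = re.compile(r"\b" + re.escape(word) + r"\b", re.IGNORECASE)
    seen = set()
    for url, text in results:
        if pattern.search(text):
            seen.add(normalize_url(url))
    return len(seen)


if __name__ == "__main__":
    # Made-up sample data, for illustration only.
    sample = [
        ("http://www.Example.com/words/", "An archiloquy is the first part of a speech."),
        ("http://example.com/words", "archiloquy, listed again at a duplicate URL"),
        ("http://other.org/x", "this page matched on anchor text only"),
        ("http://other.org/y#top", "Archiloquy!"),
    ]
    print(count_verified_results(sample, "archiloquy"))
```

Checking the page text matters because, as noted earlier, a page can rank for a term that appears only in links pointing to it.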

Here are some very rare English words, and the number of times that they actually appear in search results at different search engines (not counting duplicate pages and “substantially similar” results).

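| Query | Google Caffeine | Google | Yahoo | Bing | Ask | Cuil |
|---|---|---|---|---|---|---|
| archiloquy | 67 | 69 | 25 | 14 | 11 | 24 |
| exipotic | 54 | 56 | 22 | 10 | 8 | 16 |
| historiaster | 82 | 82 | 27 | 28 | 15 | 22 |
| irredivivous | 42 | 43 | 14 | 7 | 8 | 9 |
| keleusmatically | 59 | 60 | 20 | 13 | 3 | 10 |
| melanochalcographer | 13 | 15 | 6 | 6 | 7 | 10 |
| phylactology | 58 | 58 | 25 | 17 | 11 | 10 |
| stibogram | 14 | 15 | 8 | 6 | 4 | 9 |
| tussicate | 36 | 37 | 15 | 13 | 12 | 11 |
| vicambulate | 144 | 128 | 41 | 21 | 12 | 31 |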
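One way to turn the table into a relative-size figure is to divide each engine’s counts by Google’s. The sketch below does that in two ways: a pooled ratio (sum of all counts for an engine over the sum for Google) and a per-word median ratio. The choice of Google as the baseline and the function names `pooled_ratios` and `median_ratios` are my own assumptions for illustration. With only ten words, neither figure should be read as a real measurement of index size.

```python
from statistics import median

ENGINES = ["Google Caffeine", "Google", "Yahoo", "Bing", "Ask", "Cuil"]

# Counts copied from the table above, in the same column order as ENGINES.
COUNTS = {
    "archiloquy": [67, 69, 25, 14, 11, 24],
    "exipotic": [54, 56, 22, 10, 8, 16],
    "historiaster": [82, 82, 27, 28, 15, 22],
    "irredivivous": [42, 43, 14, 7, 8, 9],
    "keleusmatically": [59, 60, 20, 13, 3, 10],
    "melanochalcographer": [13, 15, 6, 6, 7, 10],
    "phylactology": [58, 58, 25, 17, 11, 10],
    "stibogram": [14, 15, 8, 6, 4, 9],
    "tussicate": [36, 37, 15, 13, 12, 11],
    "vicambulate": [144, 128, 41, 21, 12, 31],
}


def pooled_ratios(counts, engines, baseline="Google"):
    """Each engine's total count divided by the baseline engine's total."""
    totals = {
        engine: sum(row[i] for row in counts.values())
        for i, engine in enumerate(engines)
    }
    return {engine: totals[engine] / totals[baseline] for engine in engines}


def median_ratios(counts, engines, baseline="Google"):
    """Median of the per-word ratios to the baseline, less swayed by one odd word."""
    b = engines.index(baseline)
    return {
        engine: median(row[i] / row[b] for row in counts.values())
        for i, engine in enumerate(engines)
    }


if __name__ == "__main__":
    pooled = pooled_ratios(COUNTS, ENGINES)
    med = median_ratios(COUNTS, ENGINES)
    for engine in ENGINES:
        print(f"{engine:16s} pooled={pooled[engine]:.2f} median={med[engine]:.2f}")
```

For these ten words, Bing’s counts sum to 135 against Google’s 563, a pooled ratio of about 0.24, which is at least in the neighborhood of the “around 20%” figure quoted above.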

20 thoughts on “Using Rare Words to Estimate Search Engine Index Sizes”

  1. Curiously – I thought that last query stood out due to the larger variance between vanilla Google and caffeinated Google [vicambulate] from my location is a bit different:

    “http://www2.sandbox.google.com/search?q=vicambulate&hl=en” Results 1 – 10 of about 458
    “http://www.google.com/search?q=vicambulate&hl=en” Results 1 – 10 of about 180

    Originally I thought that maybe Caffeine might have more recent results (perhaps due to massive crawling seen by Googlebot pre-Caffeine launch), but the results reported either mean Caffeine date data is wonky, or Caffeine perhaps doesn’t have an historic index to speak of:

    “http://www2.sandbox.google.com/search?q=vicambulate&hl=en&output=search&tbs=qdr:y&tbo=1” Results 1 – 1 of 1
    “http://www.google.com/search?q=vicambulate&hl=en&output=search&tbs=qdr:y&tbo=1” Results 1 – 10 of about 16

    Nice post Bill, but of course this is likely a very noisy proxy as it assumes that all the SEs crawl and index in the same way (I’m thinking that some SEs may crawl wide, while others deep).

    Now I’ll have to go and learn what all of those words mean :)
    Rgds
    Richard

  2. Hi Richard,

    Some good questions and points.

    “Curiously – I thought that last query stood out due to the larger variance between vanilla Google and caffeinated Google [vicambulate] from my location is a bit different:”

    If you click through to the second page, the number of results actually drops. Google still shows estimates on the front page, even if the number of results is fewer than 1,000. For this post, I didn’t click through to include the “substantially similar” search results, though I could have and maybe I should have. Regardless, the actual number of results listed is still much smaller than the estimates that are shown on the first page of the search results.

    “Originally I thought that maybe Caffeine might have more recent results (perhaps due to massive crawling seen by Googlebot pre-Caffeine launch), but the results reported either mean Caffeine date data is wonky, or Caffeine perhaps doesn’t have an historic index to speak of:”

    One of the reasons for this post, and my previous post on common words, was to see if there were some differences between the present Google and the Google Caffeine update. It’s interesting that when we look at the number of results for very common words, Google is showing us a good number more results in the estimate, but when we look at very rare words, the Caffeine estimates tend to be just a little less.

    “Nice post Bill, but of course this is likely a very noisy proxy as it assumes that all the SEs crawl and index in the same way (I’m thinking that some SEs may crawl wide, while others deep).”

    Thanks. I think regardless of how search engines crawl sites (deep or wide or focused), I was more interested in seeing if I could learn something about the size of the search engines’ indexes. Of course, we don’t know what they might be filtering out of results that might limit what we see, or other factors that might influence those numbers. But it’s still interesting to look at, and if Tom Costello is correct about the utility of this method, looking at rare words may give us some insight into the different sizes of search engine indexes.

    “Now I’ll have to go and learn what all of those words mean.”

    I was tempted to include the definitions here, but felt that it was better to link to the source where I found them, as repayment for making those rare words easier for me to find. :)

  3. Ciao Bill,

    Using rare words can produce bizarre results, due to the different ways in which engines produce search layers. You may want to consider a more balanced way of sampling. There are some papers about this topic.

    The Indexable Web is more than 11.5 billion pages, 2005

    K.Bharat and A.Broder, A technique for measuring the relative size and overlap of public web search engines [WWW1998]

    S.Lawrence and C.L. Giles, Accessibility of information on the web [Nature 400:107-109, 1999]

    And more recently:

    Estimating corpus size via queries

  4. That’s a really interesting post. I can’t think of too many sites I’ve seen use the word melanochalcographer =D

  5. Hi SeoNext,

    You’re welcome. If you find this post interesting, I suggest you dig into the papers that Antonio mentioned in his comment. I provided links to them in my comment, which is right above this one. He’s a co-author of the paper at the first link.

  6. Hi Daniel,

    Thanks. The Compendium of Lost Words, where I found these words and which I linked to in the post, has some pretty interesting terms. Interestingly, one of the criteria used to consider a word a “lost word” on that site was:

    The word may not appear in its proper English context on any readily accessible web page.

    Using the Google search engine, I have ensured that none of these words appear on any English-language web page in its proper context. Many of these words turn up no hits whatsoever. Others occur only as part of long alphabetic word lists that lack definitions…

    It looks as though some of these “lost words” are becoming found again.

  7. Hi Antonio,

    Thank you very much. I appreciate your commenting on this topic, especially considering your research in the area. The approach of using rare words is too simple a method for determining something such as the size of search indexes. I’ve seen the papers that you mention, and I believe that Tom Costello even referred to a couple of them in his post at Cuil.

    They are worth looking at for anyone who might want to learn more about how difficult it might be to estimate the size of a search engine’s index.

    The Indexable Web is More than 11.5 billion pages (pdf), by A. Gulli and A. Signorini

    A technique for measuring the relative size and overlap of public Web search engines

    Accessibility of information on the web (pdf)

    Andrew Tomkins had a copy of “Estimating corpus size via queries” on his site, but the link appears to be broken. He did discuss the topic in a presentation, which is available here: Estimating corpus size via queries.

    This one was interesting as well:

    Efficient Search Engine Measurements (pdf)

    Again, thanks.

  8. There seem to be more spam and mfa sites in Google’s search results than the other search engines. It would seem the emperor can be more forgiving than Darth Vader after all.

  9. Hi Michael,

    Very good point. If you visit many of the pages that are listed for these rare words, there are many empty search results pages otherwise filled with ads, as well as spam pages. There are also very few pages where people are actually writing about these words, or using them as actual parts of sentences.

    It is very much possible that there is some filtering of search results going on at some of the search engines I’ve included, which makes their numbers lower than at other search engines.

  12. Thanks for this post. Actually, there is a useful SEO trick I got from this. I experimented on my site by putting some rare words in the content, which gave me good PR just after the first update of the Google PR. No other search engine gives as much importance to rare words as Google.

  13. Hi humza,

    The PageRank of a page shouldn’t rely upon the content of that page, regardless of whether you are using rare words or not. Would Google give a page a higher query-independent ranking (not PageRank, but maybe something else) if that page had one or more words on it that appeared very infrequently on the Web? I don’t know, but it’s something to think about.

  14. That was an interesting post, but I was also intrigued by humza’s comment as well as your reply, Bill. Makes me want to do some research on how Google actually sees such uncommon words.

  15. Hi Sandro,

    Thank you. I enjoy spending time trying out lots of searches, and observing, and seeing if things I don’t expect start happening. It can be pretty rewarding.
