Most Common Words in Google, Yahoo, Bing, and Ask, with Google Caffeine

Just which words show up most frequently on the Web? I’m not sure that question can be answered, but it’s something I’ve wondered for a while.

With a beta version of Google’s upcoming update, code-named Caffeine, recently released for people to experiment with, I thought I would do a few comparisons.

I found a few lists of the most common words in the English language, and came up with a top 50 to see how frequently those were estimated to show up in Google, Yahoo, Bing, Ask, and Google Caffeine. Those are shown in a table and a chart below.

I’m not sure how informative this might be, even after looking at it. It’s not a very scientific test, either. There are a few reasons for that:

One of them is that when you search at one of the search engines, you’ll see a message that says something like:

Results 1 – 100 of about xxx,xxx,xxx for [query term]

From at least one previous Google patent filing, we can guess that the total number of results listed (xxx,xxx,xxx) is likely only an estimate, and not an actual count. That patent application told us that the number shown might be estimated based upon a look at anywhere from 2 percent to 10 percent of Google’s index. Since the Caffeine update is a complete infrastructure/database update, we can’t even assume that the estimates shown by present-day Google are created in the same way as the estimates shown by Caffeine.

We also can’t be sure that the numbers for Yahoo, Bing, and Ask are calculated in the same manner.
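To make the idea of estimating from a slice of the index concrete, here is a minimal sketch in Python. It is not how Google or any other search engine computes its numbers; the function name `estimate_total_matches`, the toy index, and the simple random-sampling approach are all assumptions made for illustration. It counts matches in a random sample of 2 percent or 10 percent of the documents and scales that count up to the full index.

```python
import random


def estimate_total_matches(index, term, sample_fraction, seed=0):
    """Estimate how many documents contain `term` by checking only a random sample.

    Hypothetical helper for illustration; not any search engine's real method.
    """
    if not 0 < sample_fraction <= 1:
        raise ValueError("sample_fraction must be in (0, 1]")
    if not index:
        return 0
    rng = random.Random(seed)
    sample_size = max(1, round(len(index) * sample_fraction))
    sample = rng.sample(index, sample_size)
    # Count sampled documents that contain the term as a whole word.
    hits = sum(1 for doc in sample if term in doc.lower().split())
    # Scale the sample hit rate up to the size of the full index.
    return round(hits / sample_size * len(index))


# Toy index (an assumption): 10,000 tiny "pages", 3,000 of which contain "the".
toy_index = ["the quick fox"] * 3000 + ["a lazy dog"] * 7000

exact = sum(1 for doc in toy_index if "the" in doc.split())
for fraction in (0.02, 0.10):
    estimate = estimate_total_matches(toy_index, "the", fraction)
    print(f"sample {fraction:.0%}: estimated {estimate}, exact {exact}")
```

The estimate will usually land near the exact count but rarely on it, which is one reason two estimates of the same term may not agree.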

Another is that while I may see one total count at Google for each term, if you looked up the same terms at Google, you might see different numbers because you may be searching at a different data center, and it’s quite possible that there are differences from one data center to another.

A third thing to keep in mind is that when we search at one of the search engines, we aren’t actually searching the Web. Instead, we’re searching the indexes of the Web that the search engines have created. That means that some pages may be indexed more than once under different URLs, that many pages on the Web may not be included since they haven’t been indexed yet, and that words that might appear on the Web as text in images or which are presented in Flash or hidden behind JavaScript or log-in screens aren’t going to be counted.

The table below shows the total number of results, in millions. I sorted the rows by how frequently the terms tested appeared in Google Caffeine.

Query   Google Caffeine  Google   Yahoo   Bing    Ask
a                19,320  17,570  31,200  7,800  1,280
in               15,850  13,980  30,200  7,850    900
to               15,220  13,500  27,500  8,920  1,740
the              14,850  13,900  28,800  8,170    747
of               14,760  12,990  28,000  7,310    794
and              13,980  12,950  28,000  7,490    789
for              12,110  10,720  26,800  7,740    769
by               12,080  10,420  27,000  6,120    956
on               11,260   9,940  25,100  5,610    598
is                9,580   8,870  22,600  4,250    699
I                 9,220   8,250  18,600  3,860    686
all               9,110   7,580  27,200  6,990  1,020
this              8,890   7,870  21,500  5,790    585
with              8,490   6,300  20,900  2,440    636
it                7,700   6,860  19,300  4,190    542
at                7,410   6,600  20,800  3,930    552
from              7,340   6,920  18,400  4,160    521
or                7,030   6,210  19,500  3,940    567
you               6,760   5,930  19,900  5,080    543
as                6,460   5,750  15,400  3,550    884
your              6,360   5,470  19,500  3,790    495
an                6,260   5,520  16,500  3,780    489
are               6,260   5,760  18,100    163    578
be                6,120   5,460  17,100  3,990    473
that              5,780   5,260  15,200  5,650    405
do                5,500   5,020  13,000  2,090    410
not               5,500   4,870  15,600  4,550    418
have              4,870   4,390  14,500  4,130    468
one               4,330   3,870  12,300  2,750    375
can               4,150   3,690  13,300  3,030    367
was               3,930   3,610  10,400  2,960    361
if                3,810   3,500  11,200  2,660    345
we                3,780   3,370  12,400  3,430    358
but               3,610   3,340  10,100  1,680    327
what              3,290   2,850  11,600  3,080    322
which             3,020   2,810   7,750  1,810    300
there             2,970   2,770   8,340  1,450    262
when              2,850   2,600   8,360  1,580    306
use               2,730   2,250  12,300  1,830    327
their             2,690   2,680   8,210  1,650    254
they              2,650   2,440   8,260  1,670    293
how               2,470   2,170   9,050  1,730    289
he                2,200   2,040   6,060  1,420    190
were              2,130   2,100   5,320  2,770    203
his               2,030   1,880   5,310    858    182
had               1,860   2,240   5,090    966    191
each              1,370   1,290   4,150  1,090    164
said              1,210   1,350   4,060    857    128
she                 953     882   3,030  1,200     95
word                780     685   2,280    469     80
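One way to read numbers like these is to express each engine’s estimate as a multiple of the current Google estimate for the same word. The short Python sketch below does that for four rows copied from the table above; the helper name `ratio_to_google` and the choice of Google as the baseline are my own assumptions, and the ratios say nothing about actual index sizes.

```python
# Estimated results in millions, copied from four rows of the table above.
# Column order: Google Caffeine, Google, Yahoo, Bing, Ask.
ENGINES = ("Google Caffeine", "Google", "Yahoo", "Bing", "Ask")
ROWS = {
    "a": (19320, 17570, 31200, 7800, 1280),
    "the": (14850, 13900, 28800, 8170, 747),
    "had": (1860, 2240, 5090, 966, 191),
    "word": (780, 685, 2280, 469, 80),
}


def ratio_to_google(counts):
    """Return each engine's estimate divided by the current Google estimate."""
    google = counts[ENGINES.index("Google")]
    return {engine: count / google for engine, count in zip(ENGINES, counts)}


for query, counts in ROWS.items():
    ratios = ratio_to_google(counts)
    summary = ", ".join(f"{engine} {ratio:.2f}x" for engine, ratio in ratios.items())
    print(f"{query}: {summary}")
```

In these rows the Caffeine estimate is higher than the Google estimate for “a,” “the,” and “word,” but lower for “had.”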

I thought it would be helpful to present this information in a visually different manner as well. The chart that follows is in reverse order of the table above.

[Chart comparing estimates of the number of results for common words in Google Caffeine, Google, Yahoo, Bing, and Ask.]

As I mentioned above, this is a completely unscientific view.

One thing that it definitely won’t do is provide an idea of how large the databases might be for each of the search engines. According to a post at the Cuil blog on Bing (no longer available), there is a way to try to make that comparison, but it relies upon looking at the number of search results for terms that are rare, rather than at the most frequently appearing words, as I have.
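As a rough illustration of the rare-term idea, the Python sketch below compares two engines by the ratio of their result counts for the same rare queries. Everything in it is hypothetical: the query names and counts are invented, and taking the median of the per-query ratios is my own assumption for summarizing them, not something taken from the Cuil post.

```python
from statistics import median


def relative_index_size(counts_a, counts_b):
    """Median ratio of engine A's counts to engine B's over shared rare queries.

    Hypothetical helper: only queries present in both dicts, with a nonzero
    count at engine B, are compared.
    """
    ratios = [
        counts_a[query] / counts_b[query]
        for query in counts_a
        if counts_b.get(query, 0) > 0
    ]
    if not ratios:
        raise ValueError("no shared rare queries with nonzero counts")
    return median(ratios)


# Invented counts for three made-up rare queries at two made-up engines.
engine_a = {"rareterm1": 120, "rareterm2": 45, "rareterm3": 300}
engine_b = {"rareterm1": 60, "rareterm2": 30, "rareterm3": 100}

print(relative_index_size(engine_a, engine_b))
```

A median is used here so that one odd query does not dominate the comparison; a real test would also need counts that had been checked by hand.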


27 thoughts on “Most Common Words in Google, Yahoo, Bing, and Ask, with Google Caffeine”

  1. [I’m not sure how informative this might be, even after looking at it. ] Exactly. And using 2% of the index, that’s worse than guessing because it can be so misleading. 2% of what slice? The one thing it does seem to be good at is verifying stop words. Cheers -

  2. I think it is very informative. How come Yahoo has so many more occurrences? You would not think that yahoo indexed more pages than google or bing, I wouldn’t anyways.

  3. Thanks, malmilligan and Tommy,

    I may have jinxed some of the discussion about this comparison of words and rankings and search engines by saying that this might not be that informative.

    I did something like this a couple of years back, looking at how many sites were listed at different country-level TLDs (Google’s Most Popular and Least Popular Top Level Domains). A lot of interesting ideas came out in the discussion around the post, including people seeing very different numbers at different data centers.

    One question I have after doing this is why search engines take the time to tell us that these are “Results 1 – 100 of about xxx,xxx,xxx for [query term].” Why do they also tell us how long the search might have taken? I guess it helps make the results appear more authentic.

    For terms that appear in the search engines’ databases and show up fewer than 1,000 times, if you actually go to the last result in the listing, you can get more accurate numbers. If you want to try to estimate the actual sizes of the different search engine databases, there is a method that you could try to use. As the Cuil blog post notes:

    You take rare queries, and see how many pages are returned. (You need to check that the page actually contains the query, as some engines return pages that don’t contain the term, but contain a misspelling, or there is a link to that page with the term.)

    Using anywhere from 2% to 10% isn’t necessarily bad. Many people have relied for years on Comscore numbers to measure search traffic. When you see a news site reporting that Google has 65 percent of search traffic compared to something like 16 percent for the next largest search engine, the stats that are being referred to are usually Comscore numbers. Comscore tracks around 2 million searchers in the US, drawn from a panel of searchers who have agreed to have their searches tracked in exchange for something like antivirus software or the chance to win a sweepstakes. That’s out of more than 200 million internet users in the United States, or about 1 percent at most. (They are just starting to move beyond this approach though.)

    It’s possible that search engines might use numbers like these to try to identify stop words, though it’s possible that the search engines have been treating stopwords very differently since at least early 2008. The word “the” may be a stop word for some queries, but not for a query such as “The Matrix.” The words “to,” and “be,” and “not,” could be considered stop words, but not when they appear in the phrase, “to be or not to be.” Google hasn’t been telling us at the top of search results that they may have removed some words from searches because those are “stop words,” since at least the early months of 2008. That seems to be an interesting upgrade of the search engine that many people don’t talk about much.

    How come Yahoo has so many more occurrences?

    An interesting question. Over the years, we’ve seen Google and Yahoo both make announcements about the size of their databases, one-upping each other every so often. It’s quite possible that Google has a larger database than Yahoo, but after the first few billion pages, it’s hard to tell how meaningful that might be. Yahoo announced at a recent conference that at least 1/3 of the pages that they see on the Web are duplicates of each other.

    I believe that I’ve been seeing much larger numbers from Yahoo than from Google for many different query terms since at least 2005. It’s possible that Yahoo is estimating differently than Google. I’ve never taken that to mean that Yahoo has a larger database than Google, though I’ve sometimes wondered if Yahoo did that on purpose to make it look as if they did.

  4. Hi Ravi,

    One of the ways that we learn about what search engines do, and how they work is to look at search results, to make comparisons, to see some of the differences between what different ones do, and how they work. While I tend to focus here on search related patent filings and white papers with a majority of my posts, I sometimes like to just experiment a little with search results, even if to satisfy only my curiosity or have some fun.

    This post probably won’t help you make money fast, it likely won’t give you some magic bullet to make pages rank higher than other pages, and it may not even provide you with some special insight into how search engines work. But it does point out some things about how search engines work that are beneficial to SEO. Those points are actually many of the things I mentioned above as limiting the value of this comparison, and they are the real point behind this post. Sometimes coming up with questions is more valuable than coming up with answers.

    1. The numbers of search results that you see when you perform a search are only estimates, and those numbers are often relied upon as one aspect of comparing how competitive those terms might be. Should they be? How would you feel if search engines stopped showing those numbers of results, or the amount of time it takes to perform a search?

    2. We don’t know if the numbers of results that we see from each search engine are calculated in the same way, and it’s interesting to ask why the numbers differ so much. Is it because databases are larger at different search engines, or is there some other reason?

    3. I wanted to learn a little about Google Caffeine without digging too deeply into differences that may be a result of a new way to handle master files in the Google File System, or a way of using much smaller chunk server file sizes, as reported a couple of days ago in the UK Guardian. What we do see with the numbers above seems to be a substantially larger number of search results for very frequently appearing terms. Is there a significance to that based upon these new indexing methods? Is the database actually larger and we are seeing more occurrences of those terms, or is the way of estimating the number of results different? Or is there some other reason?

    4. One of the founders of Cuil tells us that comparing the number of results that show up for rare queries is a good way of estimating the size of a search engine’s database. Is it really? Is knowing the size of a search engine’s database something that is helpful to us? Would it be worth doing the same kind of comparison with some terms that might be very rare? It might be interesting.

    5. When we search the Web using a search engine, we aren’t actually searching the Web, but rather a search engine’s index of the Web. That means that the numbers we see are limited by how much of the Web a search engine has crawled and by what it doesn’t crawl (such as some Flash, text in images, and text obfuscated by JavaScript or in some other manner).

    I was also hoping that the exercise of doing a comparison like this might lead to some other questions that I can think about, and I think it has.

  5. Bill, I actually missed the point of this comparison. How is this data beneficial for SEO purposes? Sorry if I missed something obvious.

  6. At first glance this surely looks like a very uninformative post, but if you take a deeper look and read between the lines, some conclusions can be made. The SEO purpose of this post is to learn even more about search engines. And that can surely be done by revealing and comparing data which at first glance is not so attractive.

  7. This goes to show that caffeine seems to be getting along nicely. It’s supposed to index more pages and return results faster than the current algo. Kudos to Google for improving their indexing engine. This is good news for everyone since it means more pages of your sites will be ranked and higher chance of someone finding your pages if your page is the most relevant match regardless of popularity of that page. I didn’t realise yahoo was so good at indexing. They seem to find backlinks faster than google as well. Interesting why bing is going to be yahoo’s new engine, your data seems to suggest yahoo is a lot better than bing at indexing. Granted bing is young and their index might grow. Or they may be going for a more selective approach which might work but I doubt it.

  8. Hi Web Design Beach,

    I’ve done some more research focusing upon a slightly different approach. Hopefully I’ll get to post it soon, but it works off this research, and provides a good counterpoint.

  9. Hi Staysure,

    At this point, I’m not sure if Caffeine actually provides access to a larger database. We’ll see. I think the numbers from Yahoo might be misleading as well.

    I suspect that Bing will grow, but they may be going for a more selective approach, as you suggest.

  10. Crazy, Yahoo has a lot more occurrences. I think this information is wrong; I think Google has a couple of times more than Yahoo.

  11. Hi SeoProfy,

    The information above is correct, but it isn’t helpful in telling us about the comparative sizes of the indexes for the different search engines for many of the reasons that I mentioned in the post. There is another approach that can be used that might give us a little more insight, which is looking at how many times rare results show up in search results. I’ve followed up with a post that describes that method – Using Rare Words to Estimate Search Engine Index Sizes.

  12. A lot of stopwords are on here and words that start prepositional phrases.

  13. Those are not common words but elements of written phrases. I think that “a” is not a word used to find anything, but an often-used indefinite article. The same goes for the other “top” words. I know that what I say has no influence on the results, but I think it should also be said that these are not the most-used words, just elements of phrases.

  14. Hi Denver Web,

    Yes, all of these words could possibly be considered stopwords since they appear so frequently on the Web, and many of them would start prepositional phrases. There are times when stopwords aren’t stopwords, like in the phrase “to be or not to be.”

  15. Hi Thomas,

    Good points. These are words on their own, and often parts of phrases as well. It is likely that people don’t often search for these words on their own, but rather as parts of phrases. There are a few different patents from Google on phrases, including phrase-based indexing and meaningful semantic units or concepts.

    You might find these past posts here interesting:

    Google Stopwords Patent
    New Google Approach to Indexing and Stopwords
    Google Patent Granted on Semantic Units (Meaningful Compounds)

    When someone searches for [The Matrix] they are likely searching for information about a film of that title, rather than for any pages that contain the words “the” and “matrix” somewhere on a page.

  16. It’s interesting to see the development of Caffeine for Google. The latest updates seem like they’re going to be great. The competition of the lesser search engines is definitely becoming tighter.

  17. Hi Christian,

    It’s great to see innovation and competition in search, and Google’s Caffeine update seems like it might bring an infrastructure to the search engine that can withstand considerable new growth on the Web. I don’t know what the joining together of researchers from Yahoo and Microsoft might mean in terms of competition, but I think there’s some potential there for a serious push towards competing with Google. Definitely interesting times. Thanks.

  18. One thing it shows me is that the all-new “Google Caffeine” looks to be bigger than the current index. Is this good or bad? More competition or just more chance of being in the index? I will leave that one to the readers.

  19. Hi Lee,

    I’m not sure that update actually includes a bigger search index. Rather, it appears to use a different method of storing and accessing that information that might be significantly more efficient. It provides them with room to grow. Should be interesting.

  20. Pingback: Tomislav_B » Blog Archive » Aktivnosti za 21.09.2009.
  21. Pingback: Daniel Lew
  22. Just wondering, a year down the line, how much of the crystal ball thinking on this page turned out to be true? Do you know the latest on Google Caffeine? I guess “our” use of the basic search words hasn’t changed all that much….

  23. Hi Ross,

    More information about Caffeine has come out, describing it as more of an infrastructure change to Google’s hardware and software rather than a ranking change. If it’s influenced Google’s index in any way, it probably has more to do with how quickly that index gets updated than anything else, but it also can mean that more information can be indexed as well.

    This is one of the most recent papers from Google on Google Caffeine:

    Large-scale Incremental Processing Using Distributed Transactions and Notifications
