Which words show up most frequently on the Web? I’m not sure that question can be answered, but it’s something I’ve wondered about for a while.
With a beta version of Google’s upcoming update, code-named Caffeine, recently released for people to experiment with, I thought I would do a few comparisons.
I found a few lists of the most common words in the English language and came up with a top 50 to see how frequently those were estimated to show up in Google, Yahoo, Bing, Ask, and Google Caffeine. Those are shown in a table and a chart below.
I’m not sure how informative this might be, even after looking at it. It’s not a very scientific test, either. There are a few reasons for that:
One of them is that when you search at one of the search engines, you’ll see a message that says something like:
Results 1 – 100 of about xxx,xxx,xxx for [query term]
From at least one previous Google patent filing, we can guess that the total number of results listed (xxx,xxx,xxx) is likely only an estimate and not an actual count. That patent application told us that the number shown might be estimated based upon a look at anywhere from 2 percent to 10 percent of Google’s index. Since the Caffeine update is a complete infrastructure/database update, we can’t even assume that the estimates shown by present-day Google are created in the same way as the ones shown by Caffeine.
We also can’t be sure that the numbers for Yahoo, Bing, and Ask are calculated in the same manner.
Another is that while I may see one total count at Google for each term, if you looked up the same terms at Google, you might see different numbers because you may be searching at a different data center. There may be differences from one data center to another.
A third thing to keep in mind is that we aren’t searching the Web when we search at one of the search engines. Instead, we’re searching the indexes of the Web that the search engines have created. That means that some pages may be indexed more than once under different URLs, that many pages on the Web may not be included since they haven’t been indexed yet, and that words that might appear on the Web as text in images or which are presented in Flash or hidden behind javascript or log-in screens aren’t going to be counted.
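To make the sampling point above a little more concrete, here is a minimal Python sketch of how a total might be extrapolated from a look at only part of an index. It is an illustration only: the toy index, the `estimate_matches` helper, and the use of simple random sampling are all assumptions made for this example, not something described in the patent filing.

```python
import random

def estimate_matches(index, term, sample_fraction, seed=0):
    """Estimate how many documents contain `term` from a random sample.

    Hypothetical helper: scans only `sample_fraction` of the index and
    scales the number of hits up to the full index size.
    """
    rng = random.Random(seed)
    sample_size = max(1, int(len(index) * sample_fraction))
    sample = rng.sample(index, sample_size)
    hits = sum(1 for doc in sample if term in doc.split())
    return round(hits * len(index) / sample_size)

# Toy index (made up): 10,000 tiny "pages", 30% of which mention "caffeine".
index = [
    "the caffeine update" if i % 10 < 3 else "the web index"
    for i in range(10_000)
]
exact = sum(1 for doc in index if "caffeine" in doc.split())

# Compare estimates from a 2% and a 10% sample against the exact count.
for fraction in (0.02, 0.10):
    estimate = estimate_matches(index, "caffeine", fraction)
    print(f"sample {fraction:.0%}: estimate {estimate}, exact {exact}")
```

Trying different `seed` values shows the estimate moving around the exact count, which is one reason to take the word “about” in a results message literally.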
The table below shows the total number of results for each term, in millions. I sorted the rows by how frequently the terms appeared in Google Caffeine.
Query | Google Caffeine | Google | Yahoo | Bing | Ask |
a | 19,320 | 17,570 | 31,200 | 7,800 | 1,280 |
in | 15,850 | 13,980 | 30,200 | 7,850 | 900 |
to | 15,220 | 13,500 | 27,500 | 8,920 | 1,740 |
the | 14,850 | 13,900 | 28,800 | 8,170 | 747 |
of | 14,760 | 12,990 | 28,000 | 7,310 | 794 |
and | 13,980 | 12,950 | 28,000 | 7,490 | 789 |
for | 12,110 | 10,720 | 26,800 | 7,740 | 769 |
by | 12,080 | 10,420 | 27,000 | 6,120 | 956 |
on | 11,260 | 9,940 | 25,100 | 5,610 | 598 |
is | 9,580 | 8,870 | 22,600 | 4,250 | 699 |
I | 9,220 | 8,250 | 18,600 | 3,860 | 686 |
all | 9,110 | 7,580 | 27,200 | 6,990 | 1,020 |
this | 8,890 | 7,870 | 21,500 | 5,790 | 585 |
with | 8,490 | 6,300 | 20,900 | 2,440 | 636 |
it | 7,700 | 6,860 | 19,300 | 4,190 | 542 |
at | 7,410 | 6,600 | 20,800 | 3,930 | 552 |
from | 7,340 | 6,920 | 18,400 | 4,160 | 521 |
or | 7,030 | 6,210 | 19,500 | 3,940 | 567 |
you | 6,760 | 5,930 | 19,900 | 5,080 | 543 |
as | 6,460 | 5,750 | 15,400 | 3,550 | 884 |
your | 6,360 | 5,470 | 19,500 | 3,790 | 495 |
an | 6,260 | 5,520 | 16,500 | 3,780 | 489 |
are | 6,260 | 5,760 | 18,100 | 163 | 578 |
be | 6,120 | 5,460 | 17,100 | 3,990 | 473 |
that | 5,780 | 5,260 | 15,200 | 5,650 | 405 |
do | 5,500 | 5,020 | 13,000 | 2,090 | 410 |
not | 5,500 | 4,870 | 15,600 | 4,550 | 418 |
have | 4,870 | 4,390 | 14,500 | 4,130 | 468 |
one | 4,330 | 3,870 | 12,300 | 2,750 | 375 |
can | 4,150 | 3,690 | 13,300 | 3,030 | 367 |
was | 3,930 | 3,610 | 10,400 | 2,960 | 361 |
if | 3,810 | 3,500 | 11,200 | 2,660 | 345 |
we | 3,780 | 3,370 | 12,400 | 3,430 | 358 |
but | 3,610 | 3,340 | 10,100 | 1,680 | 327 |
what | 3,290 | 2,850 | 11,600 | 3,080 | 322 |
which | 3,020 | 2,810 | 7,750 | 1,810 | 300 |
there | 2,970 | 2,770 | 8,340 | 1,450 | 262 |
when | 2,850 | 2,600 | 8,360 | 1,580 | 306 |
use | 2,730 | 2,250 | 12,300 | 1,830 | 327 |
their | 2,690 | 2,680 | 8,210 | 1,650 | 254 |
they | 2,650 | 2,440 | 8,260 | 1,670 | 293 |
how | 2,470 | 2,170 | 9,050 | 1,730 | 289 |
he | 2,200 | 2,040 | 6,060 | 1,420 | 190 |
were | 2,130 | 2,100 | 5,320 | 2,770 | 203 |
his | 2,030 | 1,880 | 5,310 | 858 | 182 |
had | 1,860 | 2,240 | 5,090 | 966 | 191 |
each | 1,370 | 1,290 | 4,150 | 1,090 | 164 |
said | 1,210 | 1,350 | 4,060 | 857 | 128 |
she | 953 | 882 | 3,030 | 1,200 | 95 |
word | 780 | 685 | 2,280 | 469 | 80 |
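One way to read the table is to divide each engine’s count by the count from the current Google index. The short Python sketch below does that for a handful of rows copied from the table. It assumes the numeric columns are, in order, Google Caffeine, Google, Yahoo, Bing, and Ask, and `ratios_to_baseline` is a hypothetical helper name used only for this illustration.

```python
# A few rows copied from the table above (result counts in millions).
# Assumed column order: Caffeine, Google, Yahoo, Bing, Ask.
ENGINES = ("Caffeine", "Google", "Yahoo", "Bing", "Ask")
ROWS = {
    "a":    (19320, 17570, 31200, 7800, 1280),
    "the":  (14850, 13900, 28800, 8170, 747),
    "and":  (13980, 12950, 28000, 7490, 789),
    "word": (780, 685, 2280, 469, 80),
}

def ratios_to_baseline(counts, baseline_index=1):
    """Express each count as a multiple of the baseline engine's count."""
    base = counts[baseline_index]
    return tuple(round(c / base, 2) for c in counts)

# Print each term's counts relative to the current Google index.
for term, counts in ROWS.items():
    ratios = ratios_to_baseline(counts)
    line = ", ".join(f"{name} {r:.2f}x" for name, r in zip(ENGINES, ratios))
    print(f"{term}: {line}")
```

Because the underlying counts are only estimates, these ratios say more about how each engine reports numbers than about how much each one has indexed.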
I thought it would be helpful to present this information in a visually different manner as well. Therefore, the chart that follows is in reverse order of the table above.
As I mentioned above, this is a completely unscientific view.
One thing this comparison definitely won’t do is provide an idea of how large the databases might be for each of the search engines. However, according to a post at the Cuil blog on Bing (no longer available), there is a way to try to make that comparison. It relies upon looking at the number of search results for rare terms rather than at the most frequently appearing words, as I have.
I think it is very informative. How come Yahoo has so many more occurrences? You would not think that Yahoo indexed more pages than Google or Bing; I wouldn’t, anyway.
[I’m not sure how informative this might be, even after looking at it.] Exactly. And using 2% of the index, that’s worse than guessing because it can be so misleading. 2% of what slice? The one thing it does seem to be good at is verifying stop words. Cheers –
Thanks, malmilligan and Tommy,
I may have jinxed some of the discussion about this comparison of words and rankings and search engines by saying that this might not be that informative.
I did something like this a couple of years back, looking at how many sites were listed at different country-level TLDs (Google’s Most Popular and Least Popular Top Level Domains). A lot of interesting ideas came out in the discussion around the post, including people seeing very different numbers at different data centers.
One question I have after doing this is why search engines take the time to tell us that these are “Results 1 – 100 of about xxx,xxx,xxx for [query term].” Why do they also tell us how long the search took? I guess it helps make the results appear more authentic.
For terms that show up fewer than 1,000 times in the search engines’ databases, you can get more accurate numbers if you actually go to the last result in the listing. If you want to try to estimate the actual sizes of the different search engine databases, there is a method that you could try to use. As the Cuil blog post notes:
Using anywhere from 2% to 10% isn’t necessarily bad. Many people have relied for years on Comscore numbers to measure search traffic. When you see a news site reporting that Google has 65 percent of search traffic compared to something like 16 percent for the next largest search engine, the stats being referred to are usually Comscore numbers. Comscore tracks around 2 million searchers in the US, a panel of people who have agreed to have their searches tracked in exchange for something like antivirus software or the chance to win a sweepstakes. That’s out of more than 200 million internet users in the United States, or at most about 1 percent. (They are just starting to move beyond this approach though.)
It’s possible that search engines might use numbers like these to try to identify stop words, though the search engines may have been treating stop words very differently since at least early 2008. The word “the” may be a stop word for some queries, but not for a query such as “The Matrix.” The words “to,” “be,” and “not” could be considered stop words, but not when they appear in the phrase “to be or not to be.” Since at least the early months of 2008, Google hasn’t been telling us at the top of search results that it may have removed some words from a search because they are “stop words.” That seems to be an interesting upgrade of the search engine that many people don’t talk about much.
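As a rough illustration of why a stop word can’t simply be dropped every time, here is a small Python sketch of a toy query filter. Everything in it is hypothetical: the `STOP_WORDS` and `KNOWN_PHRASES` sets and the `filter_query` function are made up for this example, and nothing here is meant to describe how Google or any other search engine actually handles stop words.

```python
# Hypothetical lists for illustration only; real search engines do not
# publish their stop word lists or the phrases they treat as units.
STOP_WORDS = {"a", "and", "be", "in", "not", "of", "or", "the", "to"}
KNOWN_PHRASES = {"the matrix", "to be or not to be"}

def filter_query(query):
    """Return the query terms a toy engine would actually search for."""
    terms = query.lower().split()
    if " ".join(terms) in KNOWN_PHRASES:
        return terms  # a recognized phrase is kept whole
    kept = [t for t in terms if t not in STOP_WORDS]
    # If every term is a stop word, dropping them all would leave nothing,
    # so fall back to the full list of terms.
    return kept or terms

for q in ("The Matrix", "to be or not to be", "history of the web"):
    print(q, "->", filter_query(q))
```

The phrase lookup is the simplest design that shows the idea; a real system would presumably need something much richer than a fixed list of phrases.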
An interesting question. Over the years, we’ve seen Google and Yahoo both make announcements about the size of their databases, one-upping each other every so often. It’s quite possible that Google has a larger database than Yahoo, but after the first few billion pages, it’s hard to tell how meaningful that might be. Yahoo announced at a recent conference that at least 1/3 of the pages that they see on the Web are duplicates of each other.
I believe that I’ve been seeing much larger numbers from Yahoo than from Google for many different query terms since at least 2005. It’s possible that Yahoo is estimating differently than Google. I’ve never taken that to mean that Yahoo has a larger database than Google, though I’ve sometimes wondered if Yahoo did that on purpose to make it look as if they did.
Bill, I actually missed the point of this comparison. How is this data beneficial for SEO purposes? Sorry if I missed something obvious.
Hi Ravi,
One of the ways that we learn about what search engines do, and how they work is to look at search results, to make comparisons, to see some of the differences between what different ones do, and how they work. While I tend to focus here on search related patent filings and white papers with a majority of my posts, I sometimes like to just experiment a little with search results, even if to satisfy only my curiosity or have some fun.
This post probably won’t help you make money fast, it likely won’t give you some magic bullet to make pages rank higher than other pages, and it may not even provide you with some special insight into how search engines work. But it does point out some things about how search engines work that are beneficial to SEO. Those points are many of the same things that I mentioned above as limiting the value of this comparison, and they are the real point behind this post. Sometimes coming up with questions is more valuable than coming up with answers.
1. The numbers of search results that you see when you perform a search are only estimates, yet those numbers are often relied upon as one way of comparing how competitive terms might be. Should they be? How would you feel if search engines stopped showing those numbers of results, or the amount of time it takes to perform a search?
2. We don’t know if the numbers of results that we see from each search engine are calculated in the same way, and it’s interesting to ask why the numbers differ so much. Is it because databases are larger at different search engines, or is there some other reason?
3. I wanted to learn a little about Google Caffeine without digging too deeply into differences that may be a result of a new way to handle master files in the Google File System, or a way of using much smaller chunk server file sizes, as reported a couple of days ago in the UK Guardian. What we do see with the numbers above seems to be a substantially larger number of search results for very frequently appearing terms. Is there a significance to that based upon these new indexing methods? Is the database actually larger and we are seeing more occurrences of those terms, or is the way of estimating the number of results different? Or is there some other reason?
4. One of the founders of Cuil tells us that comparing the number of results that show up for rare queries is a good way of estimating the size of a search engine’s database. Is it really? Is knowing the size of a search engine’s database something that is helpful to us? Would it be worth doing the same kind of comparison with some terms that might be very rare? It might be interesting.
5. When we search the Web using a search engine, we aren’t actually searching the Web, but rather a search engine’s index of the Web. That means that the numbers we see are limited by how much of the Web a search engine has crawled and by what it doesn’t crawl (such as some Flash, text in images, and text obfuscated by JavaScript or in some other manner).
I was also hoping that the exercise of doing a comparison like this might lead to some other questions that I can think about, and I think it has.
At first glance this surely looks like a very uninformative post, but if you take a deeper look and read between the lines, some conclusions can be drawn. The SEO purpose of this post is to learn even more about search engines. And that can surely be done by revealing and comparing data which at first glance is not so attractive.
This goes to show that Caffeine seems to be getting along nicely. It’s supposed to index more pages and return results faster than the current algo. Kudos to Google for improving their indexing engine. This is good news for everyone since it means more pages of your sites will be ranked, and there is a higher chance of someone finding your pages if your page is the most relevant match, regardless of the popularity of that page. I didn’t realise Yahoo was so good at indexing. They seem to find backlinks faster than Google as well. Interesting why Bing is going to be Yahoo’s new engine; your data seems to suggest Yahoo is a lot better than Bing at indexing. Granted, Bing is young and their index might grow. Or they may be going for a more selective approach, which might work, but I doubt it.
Hi Web Design Beach,
I’ve done some more research focusing upon a slightly different approach. Hopefully I’ll get to post it soon, but it works off this research, and provides a good counterpoint.
Hi Staysure,
At this point, I’m not sure if Caffeine actually provides access to a larger database. We’ll see. I think the numbers from Yahoo might be misleading as well.
I suspect that Bing will grow, but they may be going for a more selective approach, as you suggest.
Crazy, Yahoo has a lot more occurrences. I think this information is wrong; I think Google has a couple of times more than Yahoo.
Hi SeoProfy,
The information above is correct, but it isn’t helpful in telling us about the comparative sizes of the indexes for the different search engines for many of the reasons that I mentioned in the post. There is another approach that can be used that might give us a little more insight, which is looking at how many times rare results show up in search results. I’ve followed up with a post that describes that method – Using Rare Words to Estimate Search Engine Index Sizes.
A lot of stopwords are on here and words that start prepositional phrases.
Those are not common words but elements of written phrases. I think that “a” is not a word used to find anything but an often-used indefinite article. The same goes for the other “top” words. I know that what I say has no influence on the results, but I think it should also be said that these are not the most-used words, just phrase elements.
Hi Denver Web,
Yes, all of these words could possibly be considered stopwords since they appear so frequently on the Web, and many of them would start prepositional phrases. There are times when stopwords aren’t stopwords, like in the phrase “to be or not to be.”
Hi Thomas,
Good points. These are words on their own, and often parts of phrases as well. It is likely that people don’t often search for these words on their own, but rather as parts of phrases. There are a few different patents from Google on phrases, including phrase-based indexing and meaningful semantic units or concepts.
You might find these past posts here interesting:
Google Stopwords Patent
New Google Approach to Indexing and Stopwords
Google Patent Granted on Semantic Units (Meaningful Compounds)
When someone searches for [The Matrix] they are likely searching for information about a film of that title, rather than for any pages that contain the words “the” and “matrix” somewhere on a page.
It’s interesting to see the development of Caffeine for Google. The latest updates seem like they’re going to be great. The competition of the lesser search engines is definitely becoming tighter.
Hi Christian,
It’s great to see innovation and competition in search, and Google’s Caffeine update seems like it might bring an infrastructure to the search engine that can withstand considerable new growth on the Web. I don’t know what the joining together of researchers from Yahoo and Microsoft might mean in terms of competition, but I think there’s some potential there for a serious push towards competing with Google. Definitely interesting times. Thanks.
One thing it shows me is that the all-new “Google Caffeine” looks to be bigger than the current index. Is this good or bad? More competition or just more chance of being in the index? I will leave that one to the readers.
Hi Lee,
I’m not sure that the update actually includes a bigger search index. Rather, it appears to use a different method of storing and accessing that information that might be significantly more efficient. It provides them with room to grow. Should be interesting.
Just wondering, a year down the line, how much of the crystal ball thinking on this page turned out to be true. Do you know the latest on Google Caffeine? I guess “our” use of the basic search words hasn’t changed all that much….
Hi Ross,
More information about Caffeine has come out, describing it as more of an infrastructure change to Google’s hardware and software than a ranking change. If it has influenced Google’s index in any way, it probably has more to do with how quickly that index gets updated than anything else, but it can also mean that more information can be indexed.
This is one of the most recent papers from Google on Google Caffeine:
Large-scale Incremental Processing Using Distributed Transactions and Notifications