Just which words show up most frequently on the Web? I’m not sure that question can be answered, but it’s something I’ve wondered about for a while.
I found a few lists of the most common words in the English language, and came up with a top 50 to see how frequently those were estimated to show up in Google, Yahoo, Bing, Ask, and Google Caffeine. Those are shown in a table and a chart below.
I’m not sure how informative this might be, even after looking at it. It’s not a very scientific test, either. There are a few reasons for that:
One of them is that when you search at one of the search engines, you’ll see a message that says something like:
Results 1 – 100 of about xxx,xxx,xxx for [query term]
From at least one previous Google patent filing, we can guess that the total number of results listed (xxx,xxx,xxx) is likely only an estimate, and not an actual count. That patent application told us that the number shown might be estimated based upon a look at anywhere from 2 percent to 10 percent of Google’s index. Since the Caffeine update is a complete infrastructure/database update, we can’t even assume that the estimates shown by present-day Google are produced in the same way as the ones shown by Google Caffeine.
We also can’t be sure that the numbers for Yahoo, Bing, and Ask are calculated in the same manner either.
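To get a feel for what estimating a count from a slice of an index might look like, here is a minimal sketch in Python. It is not how Google or any other search engine actually does it; the function names, the toy index, and the simple random sampling are all my own assumptions for illustration. It scans only a fraction of the documents and scales the number of hits up to the full index.

```python
import random


def build_toy_index(size=10_000):
    """Build a hypothetical index where exactly 30% of documents contain 'the'."""
    return ["the cat sat" if i % 10 < 3 else "a dog ran" for i in range(size)]


def estimate_total_matches(index, term, sample_fraction, seed=0):
    """Estimate how many documents contain `term` by scanning only a sample.

    Assumption: a uniform random sample stands in for whatever slice of the
    index a real search engine might examine.
    """
    rng = random.Random(seed)
    sample_size = max(1, int(len(index) * sample_fraction))
    sample = rng.sample(index, sample_size)
    # Count documents in the sample that contain the term as a whole word.
    hits = sum(1 for doc in sample if term in doc.split())
    # Scale the sample count up to the size of the full index.
    return round(hits * len(index) / sample_size)


if __name__ == "__main__":
    index = build_toy_index()
    for fraction in (0.02, 0.10, 1.0):
        estimate = estimate_total_matches(index, "the", fraction)
        print(f"sample of {fraction:.0%}: about {estimate:,} results")
```

With a small sample, the figure is only an approximation and changes with which documents happen to be sampled, which is one reason to treat the totals reported by a search engine as rough numbers.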
Another is that while I may see one total count at Google for each term, if you looked up the same terms at Google, you might see different numbers because you may be searching at a different data center, and it’s quite possible that there are differences from one data center to another.
A third thing to keep in mind is that when we search at one of the search engines, we aren’t actually searching the Web. Instead, we’re searching the indexes of the Web that the search engines have created. That means that some pages may be indexed more than once under different URLs, that many pages on the Web may not be included since they haven’t been indexed yet, and that words that appear on the Web as text in images, or that are presented in Flash or hidden behind JavaScript or log-in screens, aren’t going to be counted.
The table below shows the total number of results, in millions. I sorted them by how frequently the terms tested appeared in Google Caffeine.
I thought it would be helpful to present this information in a visually different manner as well. The chart that follows is in reverse order of the table above.
As I mentioned above, this is a completely unscientific view.
One thing that it definitely won’t do is provide an idea of how large the databases might be for each of the search engines. According to a post at the Cuil blog on Bing (no longer available), there is a way to try to make that comparison, but it relies upon looking at the number of search results for terms that are rare, rather than looking at the most frequently appearing words, as I have.