A hat tip to Frank McKown and Michael L. Nelson who provide an answer to a question that I’ve heard a few times about the application programming interfaces (APIs) that search engines provide to researchers.
The question is, “How close are the results provided to researchers through those programming interfaces to the results that are shown to searchers?” An answer is provided in Search Engines and Their Public Interfaces: Which APIs are the Most Synchronized?
Here’s the abstract from their paper:
Researchers of commercial search engines often collect data using the application programming interface (API) or by “scraping” results from the web user interface (WUI), but anecdotal evidence suggests the interfaces produce different results.
We provide the first in-depth quantitative analysis of the results produced by the Google, MSN and Yahoo API and WUI interfaces. After submitting a variety of queries to the interfaces for 5 months, we found significant discrepancies in several categories.
Our findings suggest that the API indexes are not older, but they are probably smaller for Google and Yahoo. Researchers may use our findings to better understand the differences between the interfaces and choose the best API for their particular types of queries.
The study involved four different types of queries, performed through each search engine’s API and by scraping search results from its web user interface. These searches were run each day over the course of 5 months, from May through October 2006.
These are the searches conducted:
- The top 100 results and total number of results using 50 popular search terms from http://50.lycos.com/, and 50 computer science terms from http://en.wikipedia.org/wiki/List_of_cs_topics.
- The number of backlinks to 100 randomly selected URLs.
- How many pages were indexed for 100 randomly selected websites.
- Seeing if 100 randomly selected URLs were indexed and/or cached.
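To give a sense of how such a comparison might be quantified, here’s a minimal sketch in Python. It is not the paper’s actual methodology — just a simple Jaccard-style overlap measure between two ranked lists of result URLs, one hypothetically fetched from an API and one scraped from the web interface:

```python
def result_overlap(api_results, wui_results):
    """Return the fraction of unique URLs shared between two result lists
    (size of intersection divided by size of union)."""
    api_set, wui_set = set(api_results), set(wui_results)
    if not api_set and not wui_set:
        return 1.0  # two empty result sets agree trivially
    return len(api_set & wui_set) / len(api_set | wui_set)

# Hypothetical example: the API and the scraped WUI share 2 of 4 unique URLs.
api = ["a.example.com", "b.example.com", "c.example.com"]
wui = ["a.example.com", "c.example.com", "d.example.com"]
print(result_overlap(api, wui))  # → 0.5
```

A measure like this ignores ranking order; comparing positions as well (as a rank-correlation statistic would) could reveal discrepancies that a plain set overlap misses.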
While they found that the results from the APIs and from the scraped web interfaces were approximately the same age (neither appeared to be older or newer), there were a number of interesting differences.
Great research. I’m really happy to see this.
Here’s where to presently find the APIs for Google, Yahoo, and Microsoft:
I think most have always known there’s been a difference in results, but the findings seem to offer a reason why (the APIs draw data from a smaller database). Search engines don’t want us scraping, so you’d think they’d offer the same data through their APIs.
If they don’t, doesn’t that provide motivation to scrape?
It was really good to see some experimental data on this subject.
It might not be a bad idea for them to offer the same data through their APIs, though I wonder if that is possible. Are they providing some additional information, and reranking results in the web interface based on information that the search engines gather when someone searches? I think that’s a possibility that might make a difference.
What role would differing results across Google’s datacenters play? I’ve seen differences across the data centers with regard to our sites, where a term might rank on one but not on another as an algorithm was pushed out across their servers. It would be interesting to test those variations in Google to see how widely the datacenters differ over time.
Hi Michael,
At any point in time, it’s possible for one or more different algorithms to be in place at a different data center. Those may also be running on different classes or categories of queries, or could be merged together into results on the same queries.
It’s also possible that the search engine may be looking at the IP address, the type of device making a request, language preferences set in a browser, the browser type, and other information when returning results.
So, comparing one data center to another might be difficult if you don’t have any knowledge of some of the other potential reasons why result sets might differ from one query to another, even when using the same query terms.
Good points. I can see how it might be difficult to measure exactly what the quantifiable differences are across the different data centers.
This is a side note, but I would love it if Google would release real traffic data numbers, and not just estimates, in their AdWords traffic tool. Overture used to do this, but Google has always used “predicted” clicks, which doesn’t really help one drill down into the tail of categories.
Hi Michael,
Yep, the testing would be tough. But it’s still pretty interesting to see the differences between the API versions and the scraped versions.
I don’t know if Google would consider the idea, but real traffic data numbers would be interesting to see. I wonder what kind of impact that would have on people advertising with different phrases.