How Google May Select Their Data Sources Based Upon Keywords

When I was in law school, I was a teaching assistant for an environmental law professor. One of the tasks he had me working upon was a review and analysis of electronic databases that could be used to assess natural resource damages when some environmental harm took place, such as the Exxon Valdez oil spill.

How do you determine the cost of the spill to the environment, to wildlife, to people who live in the area, and to people who rely upon the area for their jobs and welfare? In short, you look at things such as decisions from other courts where similar harms may have been litigated.


At the time, the World Wide Web was still a year away, and many of these electronic databases we were looking at were very helpful sources of information. My task was to review those databases and see how much value and help they might hold.

Move forward a number of years, and I find some Google search engineers engaging in a similar task, using some interesting tools to perform their analysis of data sources. I would never have thought of looking at keywords from those sources to gauge their effectiveness; looking at their coverage of specific topics would have seemed much more likely.

It’s rare to run across a patent filing from one of the search engines that discusses the economic costs of data collection, and the value of making wise decisions when such information is collected.

A dashboard for economic costs and data sources

A patent application from Google explores how it might identify data sources, consider the costs of using those sources, and weigh the potential benefits in terms of the information that they yield.

Redundancy – Too Much Information On The Web?

When we think about data at Google, it’s easy to believe that the focus of the search engine is to crawl pages and provide as much of the information it finds at websites as possible.

But indexing just the web of pages means that the search engine is likely to miss out on a lot of facts, which is why we see knowledge base sources like Wikipedia showing up well in a lot of searches.

Google does more than just index commerce sites and product offerings.

The patent filing provides an example of a query that might be a little disappointing in some ways.

Imagine someone looking for “New Jersey real estate” on the Web. The patent used this example with 27 million Web pages showing up in the search results. At the time I wrote this post, there were 53 million results for that query.


There may be that many homes for sale in New Jersey. Maybe.

Too Much Missing Information?

In the home buyer example, the top 50 returned Web pages for the query “New Jersey real estate” included no information about:

  • School district
  • Crime rate
  • Transportation
  • Pollution situation
  • etc.

More information could be helpful, including something I heard about recently called walk scores.

The patent tells us that additional information comes with its own costs:

Paradoxically, returning such information in addition to the many home search websites as search results can add extra burden on the users and aggravate the problem of information overload.

Finding Data Sources

The patent is:

Method and apparatus for exploring and selecting data sources
Invented by Xin Luna Dong and Divesh Srivastava
US Patent Application 20130138480
Published May 30, 2013
Filed: November 30, 2011


A system and method for choosing data sources for use in a data repository first chooses an initial selection of data sources based on keywords. An exploration tool is provided to organize the sources according to content and other attributes. The tool is used to pre-select data sources. The sources to include in the data repository are then selected based on a marginalism economic theory that considers both costs and quality of data.

The patent relies upon techniques discussed in the paper Integrating Conflicting Data: The Role of Source Dependence, by X. L. Dong, L. Berti-Equille, and D. Srivastava.

The Process of Data Selection and Exploration

These are the steps described in the patent, aimed at making it more likely that the data sources a search engine relies upon are good ones.

(A) Relevant Data Sources are identified with the use of keyword queries.
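
The patent doesn’t spell out a scoring formula for this step, but a minimal sketch of the idea in Python might treat each source’s content as a bag of words and rank sources by how many of the query keywords they cover. Everything here (the source names, the threshold, the scoring) is my own illustration, not the patent’s:

```python
# A minimal sketch of step (A): identifying relevant data sources with
# keyword queries. The scoring (fraction of query keywords a source
# mentions) is an assumption for illustration, not the patent's method.

def keyword_relevance(source_text: str, query: str) -> float:
    """Score a source by how well its text covers the query keywords."""
    words = source_text.lower().split()
    keywords = query.lower().split()
    if not keywords:
        return 0.0
    hits = sum(1 for kw in keywords if kw in words)
    return hits / len(keywords)

def identify_relevant_sources(sources: dict[str, str], query: str,
                              threshold: float = 0.5) -> list[str]:
    """Return source names whose keyword coverage clears a threshold."""
    scored = {name: keyword_relevance(text, query)
              for name, text in sources.items()}
    return sorted((n for n, s in scored.items() if s >= threshold),
                  key=lambda n: -scored[n])

# Hypothetical sources for the "New Jersey real estate" example.
sources = {
    "nj-homes.example":    "new jersey real estate listings homes",
    "school-data.example": "new jersey school district ratings",
    "recipes.example":     "quick dinner recipes",
}
print(identify_relevant_sources(sources, "new jersey real estate"))
# ['nj-homes.example', 'school-data.example'] with these toy inputs
```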

(B) A source exploration dashboard tool is used to show the big picture of available sources and highlight identified relevant sources.

This enables data aggregators to:

  • (1) Understand the domain and contents of the identified sources and discover related sources that may be of interest, and
  • (2) Understand the quality (e.g., coverage, accuracy, timeliness) of the sources and the relationships (e.g., data overlap, copying relationship) between them; a rough sketch of two of these measures follows this list. Data aggregators can use this tool to refine their information needs (e.g., collecting precise data for computer science books) and pre-select the sources that are of particular interest to them.
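
To make that second bullet a little more concrete, here is a rough sketch of two of those measures, coverage and data overlap, with each source reduced to a simple set of facts. That set representation is my simplification; the patent works with richer records:

```python
# A rough sketch of two source-quality signals the patent names:
# coverage (how much of a known domain a source holds) and pairwise
# overlap (how much data two sources share). Treating a source as a
# set of facts is a simplifying assumption for illustration.

def coverage(source: set, domain: set) -> float:
    """Fraction of the known domain items this source contains."""
    return len(source & domain) / len(domain) if domain else 0.0

def overlap(a: set, b: set) -> float:
    """Jaccard overlap between two sources' facts."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

domain = {"listing1", "listing2", "listing3", "listing4"}
source_a = {"listing1", "listing2", "listing3"}
source_b = {"listing2", "listing3", "listing4"}

print(coverage(source_a, domain))   # 0.75
print(overlap(source_a, source_b))  # 2 shared / 4 total = 0.5
```

High overlap between two sources can hint at a copying relationship, which is exactly the kind of signal the dashboard is meant to surface before any money is spent acquiring a redundant source.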

(C) Given specified criteria, a budget, and the set of preselected data sources, the best sources are determined based upon “data purchase, integration, and cleaning cost.”
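
Step (C) is where the marginalism shows up: keep adding the source whose marginal quality gain justifies its marginal cost, and stop when the budget runs out or no remaining source is worth its price. Here is a hedged greedy sketch of that idea; the gain and cost figures are invented, and the patent’s actual optimization is more involved:

```python
# A greedy sketch of the marginalism idea in step (C): repeatedly add
# the preselected source with the best marginal gain per unit of cost
# ("data purchase, integration, and cleaning cost"), stopping when the
# budget is spent or no source's gain exceeds its cost. The numbers
# and the gain model are illustrative assumptions, not the patent's.

def select_sources(candidates: dict[str, tuple[float, float]],
                   budget: float) -> list[str]:
    """candidates maps name -> (quality_gain, cost)."""
    chosen = []
    remaining = dict(candidates)
    while remaining:
        # Keep only sources we can afford whose marginal gain beats cost.
        affordable = {n: (g, c) for n, (g, c) in remaining.items()
                      if c <= budget and g > c}
        if not affordable:
            break
        best = max(affordable,
                   key=lambda n: affordable[n][0] / affordable[n][1])
        _, cost = affordable[best]
        chosen.append(best)
        budget -= cost
        del remaining[best]
    return chosen

# Hypothetical (gain, cost) figures for the real-estate example.
candidates = {
    "mls-feed":      (9.0, 4.0),
    "school-scores": (5.0, 2.0),
    "crime-stats":   (3.0, 3.5),   # gain below cost: never picked
    "walk-scores":   (4.0, 3.0),
}
print(select_sources(candidates, budget=8.0))
# ['school-scores', 'mls-feed'] with these toy figures
```

A greedy gain-to-cost pass like this is only one way to act on that principle; the patent frames the decision in terms of costs and quality of data rather than this exact loop.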


Google isn’t just indexing everything it can find out on the web. It’s very serious about what it includes and doesn’t include from the Web. It’s not a utility, but rather a business, and part of its mission is to find, gather, and serve knowledge.

This kind of economic analysis of data sources and how they may impact searchers goes far beyond a broad indexing of businesses on the Web. It extends to providing important information that people rely upon, including information from knowledge bases, sources that show up in knowledge panels, and query refinements that anticipate future queries.

There’s a lot of information on the Web, and a lot of misinformation as well. Ideally, Google wants the useful information it presents to outweigh the misinformation.

When someone searches for New Jersey real estate, hopefully they will find a page with the information they actually want to see.

It will tell them how good the nearby schools are, whether or not the neighborhood is a safe one, and whether the people who live nearby are happy with nearby parks and stores and their walk scores.

10 thoughts on “How Google May Select Their Data Sources Based Upon Keywords”

  1. Awesome post! You have shared pretty deep insights on how Google analyzes a page before showing it as a result in its SERP (search engine results page). Your post shows that you did a lot of research before writing this great article. Thanks for sharing with us.

  2. Thanks Bill. Very thought provoking.

    We are working with content clients to encourage them to start building structured information on their entity hub pages so they become mini-knowledge bases as well as content sites. The thing I picked up on here was ‘anticipating future queries’ – ie how do we create these mini-DBs for each entity so they actually offer something different and useful for future…

  3. Great post as always Bill…it’s difficult to work around some of the more “qualitative” data for a real estate area search, because there are many rules in place about what a REALTOR (as in an official real estate professional per the National Assoc. of Realtors) can say/show on a website and/or in person. So it can be tricky to walk the fine line about expressing opinions or feelings within the confines of a property listing’s official MLS information. Thus, many official property resources are limited in that regard.

    For example, you can list which schools are in an area, but can’t necessarily list school ratings on your site (but can link out to it). But each MLS has different rules, too.

    However, you can find qualitative neighborhood info and school info mixed into property info on many sites like zillow, trulia, etc. They can get away with a little more because they’re not official MLS listing sites (they receive syndicated feeds for the most part, so they don’t have to abide by all the MLS rules). Urban Compass does a great job of providing robust info about a neighborhood, btw.

    Anyway, just wanted to provide a little color about how it’s a little different in the real estate world. There are plenty of 3rd party, non-MLS affiliated sites though (like Walk Score) that provide great qualitative info. Google could do a better job of serving this to consumers, as real estate search goes WAY beyond just looking at property.

  4. wow…thank you for this awesome post. It really shows your immense knowledge and research on this topic. Please keep sharing.

  5. It is really a wonder how you can collect and showcase data from the depths of Google. Way to go Bill!!

  6. Hi Bill.

    Interesting article. You said that: Google wants the useful information to outweigh the misinformation!

    I agree with that, but I am wondering how Google recognizes what information is useful and what information is not! I know Google can recognize spam sites and duplicated-content sites quite quickly, but how do they know what information is useful? Or maybe I just do not understand how Google works 😉
