The Expertise of Google Custom Search Engines vs. the Wisdom of Crowds
This is the third and final (for now) part in a series on Google Custom Search, and how information from custom search engines might be used in Google’s Web search.
In the first part of this series, SEO and Assumptions behind Web Searches, I described some assumptions search engineers often make that are challenged by a recently published Google patent application, Aggregating Context Data for Programmable Search Engines.
Quickly, those questioned assumptions are:
- Search engines should avoid using information from external sources in learning how people search
- User data collected about a searcher’s past searches and browsing behavior can help identify the intent of that searcher during new searches
- User data collected about specific searchers, queries, and web sites can also be aggregated to help understand the intent behind a search
In the second part of this series, Is Google Custom Search Influencing Google Web Search?, I looked at some of the past accomplishments of Ramanathan V. Guha, the inventor listed on the Aggregating Context Data patent filing. I provided a very brief overview of some of the features that could be used to create Google Custom Search Engines such as subscribed links, promotions, context files, keywords, site labels, and query refinements.
I started that post off with a mention of some other Google patent filings either invented or co-invented by Ramanathan V. Guha which describe how aspects of Google’s Custom Search may be influencing what we see in search results.
These include some specialized subscribed links results that appear in Google search results even if you’re not subscribed to those links. For example, the latest baseball scores or schedules from an official MLB website are sometimes shown at the top of search results for some queries. The following comes up when I search for “rangers scores.”
Google’s patent involving trust rank weighs the reputation of the people who provide annotations, or labels, for web sites. Labels from reputable annotators might help boost the rankings of pages in search results for particular queries.
A couple of other Google patent filings may also use site labels and refinement labels, like those described for use in Google Custom Search, to create categories for pages and queries, and query refinements for web search results.
In this part of the series, I’m going to look more deeply at why the Aggregating Context Data patent challenges the first assumption that I noted above: that a search engine should avoid relying on information from external resources, in this case people with expertise on specific vertical topics.
External Information from Trusted Experts
When someone creates a custom search engine for a web site or on a specific topic, and they go through the effort of creating a detailed context file for that search engine, there’s a good chance that they have some level of expertise about the sites and topics that they include in their custom search engine.
The patent provides some examples of vertical content sites that might be the focus of a custom search engine:
- Sites on particular technologies or products (e.g., digital cameras or computers)
- Political websites
- Community forums
- News organizations
- Personal websites
- Industry associations
Vertical content sites can offer “highly specialized content about particular topics” and possibly also include organized collections of links to other related informational sources. An example from the patent filing:
For example, a website devoted to digital cameras typically includes product reviews, guidance on how to purchase a digital camera, as well as links to camera manufacturer’s sites, price comparison engines, other sources of expert opinion and the like.
In addition, the domain experts often have considerable knowledge about which other resources available on the Internet are of value and which are not. Using his or her expertise, the content developer can at best structure the site content to address the variety of different information needs of users.
This patent application aims to let a general web search engine, such as Google, take advantage of some of the expertise of the creators of vertical web sites when it provides query refinement suggestions and page results to searchers.
There are two different ways that the patent looks to harness the subject matter expertise of vertical content sites:
1. Enabling content developers of vertical content sites to use their expertise to enhance the search process of a general search engine and paying attention to the customizations that those content developers might make to a custom search engine.
2. Aggregating context data that has been harvested from a number of custom, or programmable, search engines to process search queries based upon the expertise of those content developers.
Using Information from Custom Search Engines to Influence the Search Process
When context information from custom search engines is included as part of a Web search, it can influence how a search is performed both before and after the query terms might be sent to Google’s web search.
Before a query is executed:
- The query might be revised, modified, or expanded.
- Specific document collections might be chosen in which to conduct the search.
- Various search algorithm parameters might be set with which to evaluate the query.
- Other operations might take place that might refine, improve, or enhance a query.
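To make the pre-processing steps above a little more concrete, here is a minimal sketch in Python. It is my own illustration and not code from the patent filing: the `Context` and `PreparedQuery` classes, their field names, the collection names, and the `freshness_weight` parameter are all hypothetical stand-ins for whatever instructions a real context file might carry.

```python
from dataclasses import dataclass, field


@dataclass
class Context:
    """Hypothetical stand-in for the instructions in a context file."""
    extra_terms: list[str] = field(default_factory=list)        # terms added to the query
    collections: list[str] = field(default_factory=list)        # document collections to search
    parameters: dict[str, float] = field(default_factory=dict)  # ranking settings


@dataclass
class PreparedQuery:
    terms: list[str]
    collections: list[str]
    parameters: dict[str, float]


def preprocess(query: str, context: Context) -> PreparedQuery:
    """Apply a context's pre-processing instructions to a raw query."""
    terms = query.lower().split()
    # Expand the query with terms the context supplies, skipping duplicates.
    for term in context.extra_terms:
        if term not in terms:
            terms.append(term)
    # Fall back to a general collection when the context names none.
    collections = context.collections or ["web"]
    return PreparedQuery(terms, collections, dict(context.parameters))


camera_context = Context(
    extra_terms=["camera"],
    collections=["camera-reviews", "camera-vendors"],
    parameters={"freshness_weight": 0.3},
)
prepared = preprocess("Canon Digital Rebel", camera_context)
print(prepared.terms)        # ['canon', 'digital', 'rebel', 'camera']
print(prepared.collections)  # ['camera-reviews', 'camera-vendors']
```

The point of the sketch is only that the query, the collections searched, and the ranking settings can all be changed before the search engine ever sees the searcher's words.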
After a query is executed:
- Search results might be filtered, organized, and annotated.
- Links may be provided to related contexts that provide other types of information or address other informational needs.
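The post-processing side can be sketched in the same spirit. Again, this is an assumption-laden illustration rather than anything taken from the patent: the `Result` class, the `blocked_hosts` set, and the `notes_by_host` mapping are hypothetical names, and the example hosts are made up.

```python
from dataclasses import dataclass
from urllib.parse import urlparse


@dataclass
class Result:
    url: str
    score: float
    note: str = ""  # annotation shown next to the result, if any


def postprocess(results, blocked_hosts, notes_by_host):
    """Filter, annotate, and reorder results using context-supplied data."""
    kept = []
    for result in results:
        host = urlparse(result.url).netloc
        if host in blocked_hosts:
            continue  # filtered out by the context
        kept.append(Result(result.url, result.score, notes_by_host.get(host, "")))
    # Organize the remaining results from highest to lowest score.
    kept.sort(key=lambda r: r.score, reverse=True)
    return kept


raw_results = [
    Result("https://spam.example/rebel", 0.9),
    Result("https://reviews.example/rebel", 0.7),
    Result("https://maker.example/rebel", 0.8),
]
final = postprocess(
    raw_results,
    blocked_hosts={"spam.example"},
    notes_by_host={"reviews.example": "camera review"},
)
for result in final:
    print(result.url, result.note)
```

Here the highest-scoring page is dropped because the context blocks its host, and the review site picks up an annotation that could be shown alongside it.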
Whether or not any of these steps are taken can rely upon “context files” that can be set up while creating a Custom Search Engine.
Context files from more than one source may be used for a single query, and the selection of those context files can depend upon the query searched for, information about the searcher, and the kind of device used to conduct the search.
Context data may be taken from a number of custom search engines and aggregated when appropriate, such as when they may all be related to a similar topic.
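One way to picture this aggregation is below. The patent filing doesn't spell out a similarity measure in the passages I've covered here, so I'm assuming one: two custom search engines count as "related to a similar topic" when the sets of labels they use overlap enough (Jaccard similarity). The engine names, hosts, labels, and the 0.5 threshold are all hypothetical.

```python
def label_vocabulary(context):
    """All labels a custom search engine uses, across every site it annotates."""
    return set().union(*context.values())


def jaccard(a, b):
    """Overlap between two label sets, from 0.0 (disjoint) to 1.0 (identical)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def aggregate_contexts(contexts, seed, threshold=0.5):
    """Merge site labels from every engine whose labels resemble the seed engine's."""
    seed_vocab = label_vocabulary(contexts[seed])
    merged = {}
    for name, context in contexts.items():
        if jaccard(seed_vocab, label_vocabulary(context)) < threshold:
            continue  # different topic, so leave this engine out
        for site, labels in context.items():
            merged.setdefault(site, set()).update(labels)
    return merged


# Each engine maps a site to the labels its creator gave that site.
contexts = {
    "camera-fans": {
        "reviews.example": {"camera review"},
        "maker.example": {"camera manufacturer"},
    },
    "photo-club": {
        "reviews.example": {"camera review", "buying guide"},
        "maker2.example": {"camera manufacturer"},
    },
    "garden-tips": {
        "plants.example": {"gardening"},
    },
}
merged = aggregate_contexts(contexts, seed="camera-fans")
print(sorted(merged))
```

In this toy data the two camera-related engines are merged, so `reviews.example` ends up with both the "camera review" and "buying guide" labels, while the gardening engine is left out.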
The patent application is:
Aggregating Context Data for Programmable Search Engines
Invented by Ramanathan V. Guha
Assigned to Google
US Patent Application 20100250513
Published September 30, 2010
Filed: March 29, 2010
Search results are generated using aggregated context data from two or more contexts. When two or more programmable search engines relate to a similar topic, context data associated with the programmable search engines are aggregated.
The context is then applied to a query in order to present, in an integrated manner, relevant search results that make use of context intelligence from more than one programmable search engine.
The patent filing includes an example of how the context file from a custom search engine might be used to influence search results on Google.
A searcher types [Canon Digital Rebel] into a search box at Google.
Since this is the name of a particular camera brand and model, the search engine might recognize that the context of this search is for camera models, based upon context files from Custom Search Engines.
In the search results shown to a searcher, a number of links might be provided as “navigational aids” to address possible different informational needs of a searcher. Each of those links may be associated with a related context file.
The links may include things such as:
- “If you are trying to decide which camera to buy”: information on how to buy a camera, comparisons between different cameras, and pricing information, meeting an intent to purchase a camera.
- “Where to buy this camera from”: more specific information about the locations of vendors for that model of camera.
- “If you already own one”: technical support and service information for that particular model.
In addition to these navigational aid type links, there may be additional links to related contexts as well, such as:
- More Manufacturer Pages
- More Guides
- More Reviews
These refinements may have been created by someone putting together a custom search engine, and labeling certain sites with annotations such as “camera review,” “camera vendor,” “camera technical support,” “camera manufacturer,” and “buying guide.”
The kinds of labels that can be created may cover a fairly wide range:
The vertical content provider can label (or tag) a site with any number of category labels.
The labels can describe any characteristic that the vertical content provider deems of interest, including:
- Topical (e.g., cameras, medicine, sports),
- Type (e.g., manufacturer, academic, blog, government),
- Level of discourse (e.g., lay, expert, professional, pre-teen),
- Quality of content (poor, good, excellent),
- Numerical rating, and so forth.
The ontology (i.e., set of labels) used by the vertical content provider can be either proprietary (e.g., internally developed) or public, or a combination thereof.
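As a rough illustration of how labels like these could turn into the “More Reviews” style links mentioned earlier, here is a small Python sketch. The hosts, the `annotations` mapping, and the `link_text` table are hypothetical, and this is not the format Google Custom Search actually uses for its annotation files.

```python
from collections import defaultdict

# Labels a vertical content provider might attach to sites (made-up hosts).
annotations = {
    "reviews.example": ["camera review"],
    "lens-lab.example": ["camera review", "buying guide"],
    "maker.example": ["camera manufacturer"],
}

# Link text shown to searchers for each label.
link_text = {
    "camera review": "More Reviews",
    "buying guide": "More Guides",
    "camera manufacturer": "More Manufacturer Pages",
}


def refinement_links(annotations, link_text):
    """Group labeled sites under the refinement link for each label."""
    by_label = defaultdict(list)
    for site, labels in annotations.items():
        for label in labels:
            by_label[label].append(site)
    return {
        link_text[label]: sorted(sites)
        for label, sites in by_label.items()
        if label in link_text
    }


links = refinement_links(annotations, link_text)
print(links["More Reviews"])  # ['lens-lab.example', 'reviews.example']
```

Each refinement link is really just a view of the sites that share a label, which is why the choice of labels matters so much.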
I’ve debated with myself about how deeply I should go in describing the various aspects of this patent. I decided that anyone who wants to learn more about how custom search engines might influence web search results at Google would probably learn more by spending the time to create a custom search engine, and a context file for it.
If you can build one that might anticipate the informational and situational needs of the searcher on a site, or a number of sites, you may get a good sense of how Google might use your context file information for web searches.
Interestingly, you can create a custom search engine, with associated context file, for a site that you don’t own.
This idea of using the expertise of people who know a lot about a particular vertical to come up with query refinements, expand queries, and influence general searches in other ways goes against two assumptions: that user data collected from an individual’s past searching and browsing history might help divine their intent in a present search, and that search behavior from a number of searchers might help pinpoint appropriate query results or refinements.
In a way, a web search engine learning from custom search engines is like pitting the expertise of knowledgeable individuals against the wisdom of crowds.
What’s to keep someone from maliciously creating custom search engines that might attempt to manipulate this process, and influence search results in a harmful manner?
A well-made ranking algorithm should anticipate how it might be attacked. In this case, here’s a related patent from Ramanathan V. Guha, granted on June 22, 2010, that attempts to anticipate such problems:
Detecting spam related and biased contexts for programmable search engines
United States Patent 7,743,045
A programmable search engine system is programmable by a variety of different entities, such as client devices and vertical content sites to customize search results for users. Context files store instructions for controlling the operations of the programmable search engine.
The context files are processed by various context processors, which use the instructions therein to provide various pre-processing, post-processing, and search engine control operations.
Spam related and biased contexts and search results are identified using offline and query time processing stages, and the context files from vertical content providers associated with such spam and biased context and results are excluded from processing on direct user queries.
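The abstract doesn't say how spam or bias is detected, so I won't guess at that. The exclusion step at the end is easy to picture, though. In this sketch I'm assuming that some offline stage has already produced a set of flagged providers; the `provider` field, the provider names, and the `usable_contexts` function are hypothetical.

```python
def usable_contexts(contexts, flagged_providers):
    """Drop context files whose provider was flagged as spammy or biased."""
    return [c for c in contexts if c["provider"] not in flagged_providers]


# Context files submitted by different vertical content providers (made up).
submitted = [
    {"provider": "camera-fans", "labels": ["camera review"]},
    {"provider": "shady-deals", "labels": ["camera vendor"]},
    {"provider": "photo-club", "labels": ["buying guide"]},
]

# Assumed output of an offline detection stage.
flagged = {"shady-deals"}

trusted = usable_contexts(submitted, flagged)
print([c["provider"] for c in trusted])  # ['camera-fans', 'photo-club']
```

Only the contexts from providers that weren't flagged would go on to influence a searcher's query.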