The Expertise of Google Custom Search Engines vs. the Wisdom of Crowds

This is the third and final (for now) part in a series on Google Custom Search, and how information from custom search engines might be used in Google’s Web search.

In the first part of this series, SEO and Assumptions behind Web Searches, I described some assumptions search engineers often make that are challenged by a recently published Google patent application, Aggregating Context Data for Programmable Search Engines.

Quickly, those questioned assumptions are:

  1. Search Engines should avoid using information from external sources in learning how people search
  2. User data collected about a searcher’s past searches and browsing behavior can help identify the intent of that searcher during new searches
  3. User data collected about specific searchers, queries, and web sites can also be aggregated to help understand the intent behind a search

In the second part of this series, Is Google Custom Search Influencing Google Web Search?, I looked at some of the past accomplishments of Ramanathan V. Guha, the inventor listed on the Aggregating Context Data patent filing. I provided a very brief overview of some of the features that could be used to create Google Custom Search Engines such as subscribed links, promotions, context files, keywords, site labels, and query refinements.

I started that post off with a mention of some other Google patent filings either invented or co-invented by Ramanathan V. Guha which describe how aspects of Google’s Custom Search may be influencing what we see in search results.

These include some specialized subscribed links results that appear in Google search results even if you’re not subscribed to those links. For example, the latest baseball scores or schedules from an official MLB website are sometimes shown at the top of search results for some queries. The following comes up when I search for “rangers scores.”

World Series standings and schedule shown at the top of Google Search Results on a search for rangers scores

Google’s patent involving trust rank weighs the reputations of the people who provide annotations, or labels, for web sites, and those annotations might help boost the rankings of pages in search results for particular queries.
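As a rough illustration of that idea, here is a minimal Python sketch in which each label counts in proportion to the trust placed in the person who applied it. The function name, the trust values, and the simple additive scoring are all my own assumptions for illustration; the patent does not spell out a formula like this.

```python
from collections import defaultdict


def weighted_label_scores(annotations, trust):
    """Sum annotator trust for every (site, label) pair.

    `annotations` is a list of (annotator, site, label) tuples and
    `trust` maps an annotator to a hypothetical trust value between 0 and 1.
    Unknown annotators contribute nothing.
    """
    scores = defaultdict(float)
    for annotator, site, label in annotations:
        scores[(site, label)] += trust.get(annotator, 0.0)
    return dict(scores)


# Made-up annotators and sites, purely for illustration.
annotations = [
    ("alice", "reviews.example.com", "camera review"),
    ("bob", "reviews.example.com", "camera review"),
    ("mallory", "spam.example.net", "camera review"),
]
trust = {"alice": 0.9, "bob": 0.6, "mallory": 0.05}

scores = weighted_label_scores(annotations, trust)
for (site, label), score in sorted(scores.items()):
    print(f"{site} [{label}]: {score:.2f}")
```

A label applied by two well-trusted people ends up carrying far more weight than the same label applied by someone with little reputation, which is the general effect the trust patent seems to be after.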

A couple of other Google patent filings may also use site labels and refinement labels, like those described for use in Google Custom search, to create categories for pages and queries, and query refinements for web search results.

In this part of the series, I’m going to look more deeply at why the Aggregating Context Data patent challenges the first assumption that I noted above: that a search engine should avoid relying on information from external resources, in this case people with expertise on specific vertical topics.

External Information from Trusted Experts

When someone creates a custom search engine for a web site or on a specific topic, and they go through the effort of creating a detailed context file for that search engine, there’s a good chance that they have some level of expertise about the sites and topics that they include in their custom search engine.

The patent provides some examples of vertical content sites that might be the focus of a custom search engine:

  • Sites on particular technologies or products (e.g., digital cameras or computers)
  • Political websites
  • Blogs
  • Community forums
  • News organizations
  • Personal websites
  • Industry associations

Vertical content sites can offer “highly specialized content about particular topics” and possibly also include organized collections of links to other related informational sources. An example from the patent filing:

For example, a website devoted to digital cameras typically includes product reviews, guidance on how to purchase a digital camera, as well as links to camera manufacturer’s sites, price comparison engines, other sources of expert opinion and the like.

In addition, the domain experts often have considerable knowledge about which other resources available on the Internet are of value and which are not. Using his or her expertise, the content developer can best structure the site content to address the variety of different information needs of users.

What this patent application aims at is having a general web search engine, such as Google, be able to take advantage of some of the expertise of the creators of vertical web sites when it provides query refinement suggestions and page results to searchers.

There are two different ways that the patent looks to harness the subject matter expertise of vertical content sites:

1. Enabling content developers of vertical content sites to use their expertise to enhance the search process of a general search engine and paying attention to the customizations that those content developers might make to a custom search engine.

2. Aggregating context data that has been harvested from a number of custom, or programmable search engines to process search queries based upon the expertise of those content developers.

Using Information from Custom Search Engines to Influence the Search Process

When context information from custom search engines is included as part of a Web search, it can influence how a search is performed both before and after the query terms might be sent to Google’s web search.

Before a query is executed:

  • The query might be revised, modified, or expanded.
  • Specific document collections might be chosen, from which to conduct the search.
  • Various search algorithm parameters might be set, with which to evaluate the query.
  • Other operations might take place that might refine, improve, or enhance a query.

After a query is executed:

  • Search results might be filtered, organized, and annotated.
  • Links may be provided to related contexts that provide other types of information or address other informational needs.

Whether or not any of these steps are taken can rely upon “context files” that can be set up while creating a Custom Search Engine.
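To make the before-and-after steps a little more concrete, here is a minimal Python sketch. The `ContextFile` class and its fields are hypothetical simplifications of my own, not the actual Custom Search context file format, and the example sites are made up.

```python
from dataclasses import dataclass, field


@dataclass
class ContextFile:
    """Hypothetical, simplified stand-in for a custom search engine context file."""
    expansion_terms: list[str] = field(default_factory=list)   # terms added to the query
    restrict_to_sites: set[str] = field(default_factory=set)   # empty set means no restriction
    site_labels: dict[str, list[str]] = field(default_factory=dict)  # site -> labels


def preprocess(query: str, ctx: ContextFile) -> str:
    """Before the query runs: expand it with terms it does not already contain."""
    extra = [t for t in ctx.expansion_terms if t.lower() not in query.lower()]
    return " ".join([query, *extra])


def postprocess(results: list[str], ctx: ContextFile) -> list[tuple[str, list[str]]]:
    """After the query runs: filter results by site, then annotate them with labels."""
    kept = [
        site for site in results
        if not ctx.restrict_to_sites or site in ctx.restrict_to_sites
    ]
    return [(site, ctx.site_labels.get(site, [])) for site in kept]


ctx = ContextFile(
    expansion_terms=["camera"],
    restrict_to_sites={"reviews.example.com", "shop.example.com"},
    site_labels={"reviews.example.com": ["camera review"]},
)

print(preprocess("Canon Digital Rebel", ctx))  # Canon Digital Rebel camera
print(postprocess(["reviews.example.com", "spam.example.net"], ctx))
```

The point of the sketch is only that the same context can act twice: once to reshape the query, and once to filter and annotate whatever comes back.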

Context files from more than one source may be used for a single query, and the selection of those context files can depend upon the query searched for, information about the searcher, and the kind of device used to conduct the search.
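One way to picture that selection step is as a set of rules, each saying when a context applies. The sketch below is my own assumption about how such matching could work; `ContextRule`, its fields, and the context names are hypothetical and do not come from the patent.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ContextRule:
    """Hypothetical rule describing when a context file applies."""
    name: str
    trigger_terms: frozenset[str]            # query terms that suggest this context
    devices: frozenset[str] = frozenset()    # empty means any device
    interests: frozenset[str] = frozenset()  # empty means any searcher


def select_contexts(query, device, searcher_interests, rules):
    """Return the names of every context whose conditions all match."""
    terms = set(query.lower().split())
    selected = []
    for rule in rules:
        if not terms & rule.trigger_terms:
            continue  # the query does not mention this context's topic
        if rule.devices and device not in rule.devices:
            continue  # wrong kind of device
        if rule.interests and not (rule.interests & set(searcher_interests)):
            continue  # nothing known about the searcher points to this context
        selected.append(rule.name)
    return selected


rules = [
    ContextRule("camera-buying-guide", frozenset({"canon", "nikon", "camera"})),
    ContextRule("local-camera-shops", frozenset({"canon", "camera"}),
                devices=frozenset({"mobile"})),
    ContextRule("pro-photography", frozenset({"canon"}),
                interests=frozenset({"photography"})),
]

print(select_contexts("Canon Digital Rebel", "desktop", {"cooking"}, rules))
print(select_contexts("Canon Digital Rebel", "mobile", {"photography"}, rules))
```

In this toy version, the same query picks up one context on a desktop for a searcher with unrelated interests, and all three on a phone for someone known to care about photography.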

Context data may be taken from a number of custom search engines and aggregated when appropriate, such as when they may all be related to a similar topic.
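Here is a small Python sketch of what that aggregation might look like. It makes two simplifying assumptions of my own: that each context is a plain dictionary with hypothetical "topic" and "site_labels" keys, and that "a similar topic" means an identical topic string, which is much cruder than anything a real search engine would do.

```python
from collections import Counter, defaultdict


def aggregate_contexts(contexts):
    """Merge site labels from contexts that share a topic.

    Returns topic -> site -> {label: number of contexts that applied it}.
    """
    merged = defaultdict(lambda: defaultdict(Counter))
    for ctx in contexts:
        for site, labels in ctx["site_labels"].items():
            # Count each label once per context, even if it was listed twice.
            merged[ctx["topic"]][site].update(set(labels))
    # Convert the nested defaultdicts and Counters into plain dicts.
    return {
        topic: {site: dict(counts) for site, counts in sites.items()}
        for topic, sites in merged.items()
    }


# Made-up contexts from three imaginary custom search engines.
contexts = [
    {"topic": "cameras",
     "site_labels": {"reviews.example.com": ["camera review"],
                     "shop.example.com": ["camera vendor"]}},
    {"topic": "cameras",
     "site_labels": {"reviews.example.com": ["camera review", "buying guide"]}},
    {"topic": "medicine",
     "site_labels": {"health.example.org": ["academic"]}},
]

aggregated = aggregate_contexts(contexts)
print(aggregated["cameras"]["reviews.example.com"])
```

When two independent custom search engines both call the same site a "camera review" site, the aggregated count for that label goes up, which is one plausible way agreement among experts could be turned into a signal.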

The patent application is:

Aggregating Context Data for Programmable Search Engines
Invented by Ramanathan V. Guha
Assigned to Google
US Patent Application 20100250513
Published September 30, 2010
Filed: March 29, 2010

Abstract

Search results are generated using aggregated context data from two or more contexts. When two or more programmable search engines relate to a similar topic, context data associated with the programmable search engines are aggregated.

The context is then applied to a query in order to present, in an integrated manner, relevant search results that make use of context intelligence from more than one programmable search engine.

The patent filing includes an example of how the context file from a custom search engine might be used to influence search results on Google.

A searcher types [Canon Digital Rebel] into a search box at Google.

Since this is the name of a particular camera brand and model, the search engine might recognize that the context of this search is for camera models, based upon context files from Custom Search Engines.

In the search results shown to a searcher, a number of links might be provided as “navigational aids” to address possible different informational needs of a searcher. Each of those links may be associated with a related context file.

The links may include things such as:

  • If you are trying to decide which camera to buy – which can provide information on how to buy a camera, or comparisons between different cameras, and pricing information, meeting an intent to purchase a camera.
  • Where to buy this camera from – provides more specific information about the locations of vendors for that model of camera.
  • If you already own one – could lead to technical support and service information for that particular model.

In addition to these navigational aid type links, there may be additional links to related contexts as well, such as:

  • More Manufacturer Pages
  • More Guides
  • More Reviews

These refinements may have been created by someone putting together a custom search engine, and labeling certain sites with annotations such as “camera review,” “camera vendor,” “camera technical support,” “camera manufacturer,” and “buying guide.”
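A minimal sketch of how labels like those might be turned into refinement links follows. The function name, the link wording, and the `min_sites` threshold are hypothetical choices of mine, not details from the patent filing.

```python
def refinement_links(labeled_sites, min_sites=2):
    """Group sites by label and offer a refinement for each well-used label.

    `labeled_sites` maps a site to the labels a custom search engine
    creator gave it. A label becomes a refinement only when at least
    `min_sites` sites carry it (a hypothetical threshold).
    """
    sites_by_label = {}
    for site, labels in labeled_sites.items():
        for label in labels:
            sites_by_label.setdefault(label, set()).add(site)
    return {
        f"More results labeled '{label}'": sorted(sites)
        for label, sites in sorted(sites_by_label.items())
        if len(sites) >= min_sites
    }


# Made-up sites and labels for illustration.
labeled = {
    "reviews.example.com": ["camera review", "buying guide"],
    "photo.example.org": ["camera review"],
    "shop.example.com": ["camera vendor"],
}

for link_text, sites in refinement_links(labeled).items():
    print(link_text, "->", sites)
```

Here only "camera review" is shared by enough sites to earn its own refinement link; lowering `min_sites` to 1 would produce a link for every label.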

The kinds of labels that can be created may cover a fairly wide range:

The vertical content provider can label (or tag) a site with any number of category labels.

The labels can describe any characteristic that the vertical content provider deems of interest, including:

  • Topical (e.g., cameras, medicine, sports),
  • Type (e.g., manufacturer, academic, blog, government),
  • Level of discourse (e.g., lay, expert, professional, pre-teen),
  • Quality of content (poor, good, excellent),
  • Numerical rating, and so forth.

The ontology (i.e., set of labels) used by the vertical content provider can be either proprietary (e.g., internally developed) or public, or a combination thereof.

Conclusion

I’ve debated with myself about how deeply I should go in describing the various aspects of this patent, and decided that anyone who wants to learn more about how custom search engines might influence web search results at Google would probably learn more by spending the time working to create a custom search engine, and a context file for their search engine.

If you can build one that might anticipate the informational and situational needs of the searcher on a site, or a number of sites, you may get a good sense of how Google might use your context file information for web searches.

Interestingly, you can create a custom search engine, with associated context file, for a site that you don’t own.

This idea of using the expertise of people who know a lot about a particular vertical to come up with query refinements, expand queries, and influence general searches in other ways goes against two assumptions: that user data collected from an individual’s past searching and browsing history might help divine their intent in a present search, and that search behavior from a number of searchers might help pinpoint appropriate query results or refinements.

In a way, a web search engine learning from custom search engines is like pitting the expertise of knowledgeable individuals against the wisdom of crowds.

What’s to keep someone from maliciously creating custom search engines that might attempt to manipulate this process, and influence search results in a harmful manner?

A well-made ranking algorithm should anticipate how it might be attacked. In this case, here’s a related patent from Ramanathan V. Guha, granted on June 22, 2010, that attempts to anticipate such problems:

Detecting spam related and biased contexts for programmable search engines
United States Patent 7,743,045

Abstract

A programmable search engine system is programmable by a variety of different entities, such as client devices and vertical content sites to customize search results for users. Context files store instructions for controlling the operations of the programmable search engine.

The context files are processed by various context processors, which use the instructions therein to provide various pre-processing, post-processing, and search engine control operations.

Spam related and biased contexts and search results are identified using offline and query time processing stages, and the context files from vertical content providers associated with such spam and biased context and results are excluded from processing on direct user queries.
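The abstract does not say how biased contexts are recognized, so the following Python sketch is only a guess at the general shape of a two-stage approach. The heuristic used in the offline stage (flagging a context when one domain dominates what it promotes) and the thresholds are entirely my own assumptions, chosen to keep the example short.

```python
from collections import Counter


def flag_biased_contexts(promoted_domains, max_share=0.8, min_entries=3):
    """Offline stage: flag contexts dominated by a single promoted domain.

    `promoted_domains` maps a context id to the list of domains that
    context promotes. Both thresholds are hypothetical.
    """
    flagged = set()
    for ctx_id, domains in promoted_domains.items():
        if len(domains) < min_entries:
            continue  # too little data to judge
        top_count = Counter(domains).most_common(1)[0][1]
        if top_count / len(domains) > max_share:
            flagged.add(ctx_id)
    return flagged


def usable_contexts(candidate_ids, flagged):
    """Query-time stage: drop flagged contexts before they touch a user query."""
    return [ctx_id for ctx_id in candidate_ids if ctx_id not in flagged]


# Made-up contexts: "cse-b" promotes one domain five times out of six.
promoted = {
    "cse-a": ["a.example.com", "b.example.org", "c.example.net"],
    "cse-b": ["own.example.com"] * 5 + ["other.example.org"],
}

flagged = flag_biased_contexts(promoted)
print(flagged)
print(usable_contexts(["cse-a", "cse-b"], flagged))
```

Whatever signals are really used, the structure matches what the abstract describes: the expensive analysis can happen offline, and at query time the flagged context files are simply left out.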

14 thoughts on “The Expertise of Google Custom Search Engines vs. the Wisdom of Crowds”

  1. Hi Bill, another fantastic post from your sage-like SEO knowledge. Many thanks.

    My experience of CSE, or programmable search engines gaining rankings for a particular site/KW set has been positive. This was not a calculated thing, so I am making said assumptions.

    In my custom search engine there are no Context edits (promotions etc), but merely a targeted list of sites to include in the search. As my CSE gets a good number of queries what is the chance of Google matching not only:

    a) The sites included in the custom SE
    b) The user queries
    c) The site and page it is hosted on
    d) PPC ads clicked when AdWords is installed in the CSE.

    Big fan of good internet tools. I think CSEs are a terrific resource.

  2. I’ve been of the mind for some time that custom search engines/curated link collections can provide signals to wider websearch on particular topic categories. So for example, I’ve been exploring the creation of CSEs around the blogs of people using conference event hashtags on Twitter, by seeing who uses the hashtag, and grabbing their blog URL from their Twitter profile, and dumping this into an event-tag CSE. By looking at the network connections between hashtaggers on Twitter, and finding the people highly followed within those communities, we can possibly identify a reputation signal (highly followed people have higher reputation) and use this to boost particular blog domains within the CSE [ http://blog.ouseful.info/2010/09/27/initial-thoughts-on-profiling-dirdigengs-friends-network-on-twitter/ ]

    As to the extent to which trusted individuals might influence the ranking of search results on queries run by their friends/social associates, I wonder if this is a new role for librarians….? [ http://blog.ouseful.info/2010/10/27/could-librarians-be-influential-friends-and-who-owns-your-search-persona/ ]

  3. Hi JC,

    Thank you very much for your kind words, and for sharing your experiences with your custom search engine.

    I’m pretty much convinced at this point that it’s a great idea for anyone interested in learning more about SEO and search to create a CSE, and spend some time creating labels for different pages. I think that’s true regardless of whether or not Google is using information from context files for custom search engines.

    It sounds like you’ve created a custom search that people find value in, which is a good thing in itself.

    The focus of this patent filing, on how information from CSEs may influence web search results, appears to be in the creation of context files, and choosing useful labels for sites included in your search. If you decide to take those steps, it would be great to hear back from you on what kind of impact it might have.

  4. Hi Tony,

    I’ve been hearing the use of the word “curate” more and more in recent months, to describe sites and pages that may filter out some of the noise on the Web, including a fairly recent video from Google’s Matt Cutts where he described some of his favorite web sites as ones that curate content from a number of sources. I’m not sure if the concept itself is that new, but it seems like there’s a change in the way people are thinking about it.

    For instance, a Yahoo patent published not long ago described the kinds of things that they might look for in a seed site, from which they would start web crawls. The things they described as being features of a good seed site included some level of authority and expertise in a particular niche or market, and a good likelihood that they could find new URLs on the pages of those sites. Sites that curate information in a useful manner have the potential to be an important commodity to search engines.

    Custom Search Engines which use context files, with annotations and labels, take that a step forward by helping a search engine classify new URLs that they discover, and identify some of the contexts in which those pages might be useful.

    The research you’ve been doing with CSEs and Twitter is pretty interesting – creating the possibility of having a reputation score from the social network associated with pages that can be found in a CSE.

    I ran across your article on librarians as “influential friends” a couple of days ago, but I hadn’t seen a few of your comments that followed. Nice discussion on the topic. I do think that search engines could use the context surrounding your online persona to influence search results for others. That seems to be where Ramanathan Guha is going with his Search result ranking based on trust patent (more on that here), and with this patent filing on aggregating context files to inform web search results.

    I don’t think someone has to be a librarian to develop the kind of authoritative online persona that you describe, but having a librarian’s education and experience in curating information doesn’t hurt.

  5. @Tony Hirst
    1.) Excuse me if I am off the mark with what you are talking about, but in regard to reputation, what do you make of the way Rapleaf.com deciphers the nodes between Social Network accounts (aka people)?

    2.) How exactly would you boost a particular blog domain?

    3.) What is the “event-tag CSE”? Is that the “promotion” tag?

    @Bill – the CSE is a job search engine tailored for 16-19 year olds, who otherwise spend all day on ebay looking at sneakers and cell phones! It gets good use. Young people are (in this case) awful searchers…

  6. @JC –

    1) Not seen rapleaf – will look at it;

    2) Boosting a blog domain – in a CSE, using score attribute in CSE annotation:

    3) “event-tag CSE” – I’ve been experimenting with discovering groups of people who may be interested in a topic by virtue of using particular hashtags on Twitter, and in particular hashtags minted for particular events. See for example: http://blog.ouseful.info/2010/09/08/deriving-a-persistent-edtech-context-from-the-altc2010-twitter-backchannel/ where I explore some of the ways we might generate ongoing benefit or value from the community so detected, and in particular a CSE that searches over the websites (typically blogs) linked to from the Twitter profile pages of the people who used the hashtag. By looking at social network statistics, we can generate CSE score values that reflect the social Twitter relationships between the people using the hashtag; this may or may not be useful/appropriate/meaningful/valuable/sensible…;-)

    (I guess the more general “event tag” community might refer to people identified through their use of a tag irrespective of where they use it; so whereas I’ve been looking at hashtags on twitter in particular, the same tag, particularly around an event, may also turn up on slideshare, delicious, flickr, youtube, in blog categories etc etc)

  7. @bill re: “I don’t think someone has to be a librarian to develop the kind of authoritative online persona that you describe, but having a librarian’s education and experience in curating information doesn’t hurt.” One of the ideas I’ve been exploring for a couple of years is the future of the academic library. I agree that you don’t have to be a librarian to develop an authoritative persona, I’m just trying to explore the ways in which they can contribute to more effective discovery on the web, in an age where discovery doesn’t happen in the library, but elsewhere…

  8. Hi Tony,

    The future of the library and librarians does seem to be tied to the Web.

    What do you think libraries will look like 10 or 20 years in the future?

  9. Hey Bill, interesting few posts in this series. I may be way off topic here, but do you think this is in any way an attempt by Google to try to nullify the impact of search engines such as Blekko (although this hasn’t been wildly successful so far) and potentially something in the future by Facebook, whereby results would be determined by user reviews and preference? By integrating to an extent ‘expert search results’ it gives what may seem a complex mathematical algorithm a human element – or is this something that Google aren’t even concerning themselves with at the moment?

  10. Hi Jonathan,

    I’ve been wondering all along if Ramanathan V. Guha envisioned someday being able to collect aggregated information from multiple context files developed from custom search engines to use to create query refinements and boost some search results. I don’t think it’s an effort to imitate or emulate or build the kind of interactive search results that you see with Blekko.

    Instead, I sort of see it as a counterbalance to the “wisdom of the crowds” approach that Google seems to sometimes take when looking at massive amounts of user-behavior data.

    If 20 million people are interested in film making, but none of them knows much about cameras, cinematography, editing, transitions, etc., then looking at things for those 20 million searchers such as search result selections, query refinements, dwell times on pages, bookmarking, browsing, etc., to influence search results may not be as helpful as looking at the labels and annotations that 100 experts on the topic might include in custom search engines. Or, looking at data from those “crowds” and from experts together may yield some interesting data points that are worth considering, especially if they seem to agree in some ways.

  11. Interesting, although I guess seeing how quickly they rushed out a patch in response to the recent article in the New York Times shows the “wisdom of the crowds” approach still has quite a long way to go to be perfected!

  12. Hi Jonathan,

    One of the problems with a citation-based ranking system like PageRank is that it can be hard to tell if the citations are positive or negative. PageRank is a wisdom of the crowds type approach in that Google is relying upon using links as a signal for quality, when it’s actually an approach that assumes that links equal popularity. It’s just as likely that some of those links may be casting “negative” votes for a site or business.

    That’s long been a flaw of PageRank, and I’m surprised that Google says that a couple of researchers were able to come up with a solution in a couple of days to resolve the problem – I think this is a problem that won’t go away until Google stops using PageRank as a ranking signal.
