Google’s Query Language

There’s a “Greenway park” near me, built where a train had previously roamed for over 100 years. The park is narrow, not much wider than a railroad car. It cuts a nice path for local residents to walk across town, and it’s a relaxing trail to parks and schools and a good place to walk a dog. That’s good, since rail as a mode of transportation has mostly been replaced by the automobile.

Sometimes the signs we see have meaning, and sometimes there’s no train to be found nearby at all.

Former Google search engineer Andrew Hogue, now the head of search at Foursquare, was in charge of a project at Google called the Annotation Framework project.

He put together a team in the mid-2000s who pursued patents on a range of related topics involving something called a browseable fact repository, which would later grow into Google’s Knowledge Graph.

Specialized Queries for Google’s Semantic Web?

One of those Annotation Framework patent filings describes different ways that searchers might search at Google to uncover facts that might be contained in that browseable fact repository. It is about a query language that, like the train in my town, may have seen its time slip by.

As far as I know, there really hasn’t been much discussion on the web about this “query language” patent, or the idea of specialized queries solely for Google’s Knowledge Graph.

Google seems to be moving towards a united search interface that is filled more with knowledge panels and carousels and delivering more direct answers to searchers, and is less about a query language for their fact repository or knowledge graph.

It’s possible that this query language may have been too complicated for most searchers to use. Google introduced Structured Snippets in September, which give us a look at facts in our search results but don’t have most of us trying to figure out a query language.

Many of us could learn a query language like SPARQL to search through something such as Google’s knowledge graph, but it would probably be lost on many people who just don’t have the time to learn something new. A query language built to make it easier to find answers through a knowledge graph makes sense. But Google’s fact repository is like the local train – it’s likely been replaced with something else.

A screen from a video on the Google Knowledge Graph page.
Google’s knowledge graph shares the world’s data with us.

I decided to share some of the query language, since it shows off a Google that most of us have never seen, and likely never will. You might take it as a piece of history that never happened, or a business lesson about choosing products carefully.

Ease of finding information at Google

This query language patent tells us about how easy it is to find web-based information:

Many search engines exist to search the World Wide Web. The Google search engine, for example, employs a user-friendly syntax that lets users simply type in a search query for items of interest (e.g., typing “Britney Spears” to find out information about the singer Britney Spears).

The Google search engine also allows users to construct more complex search queries. For example, advanced Google search allows users to search for web pages by specifying that the web page:

  1. Must contain an exact phrase (by placing the query terms in quotes);
  2. Must contain one or more of the query terms, or
  3. Must not contain one or more of the query terms.

This advanced search capability allows a user to tailor his search for web pages that contain specific information. Google search permits search of web pages, which are an example of unstructured data.

A couple of years ago, the paper Enhanced Results for Web Search (pdf) was published by Kevin Haas of Microsoft, Peter Mika and Roi Blanco of Yahoo!, and Paul Tarjan of Facebook. It describes the tremendous impact that enriched search results have had for the search engines that display them, and that may also be a compelling reason to merge Knowledge Web results into web page search results.

Not only are the results richer and more interesting than 10 blue links pointing to web pages, but they provide a much better user experience. If you haven’t, take the time to read the “Enhanced Results” paper. I think it’s a clear roadmap for the direction that today’s search engines are headed towards.

The patent, which I think was a push in the opposite direction, is:

Query language
Invented by Andrew W. Hogue and Douglas L. T. Rohde
US Patent Application 20070198480
Published August 23, 2007
Filed: February 17, 2006

Abstract

A fact repository supports searches of facts relevant to search queries comprising keywords and phrases. A service engine retrieves the objects that are associated with facts relevant to a query. The query language described is designed for use with such a repository of facts and searches both the attributes of facts and the values of the attributes.

Searching a Fact Repository

The fact repository was intended to be searchable. It provides access to documents and to facts, including a mix of text, graphics, and multimedia, and content presented in HTML markup and in languages such as JavaScript.

This kind of document can be located by a Uniform Resource Locator (URL), or a Web address.

The system built around the fact repository has a number of moving parts: one or more importers, one or more janitors (to analyze, massage, and shape data), a build engine, a service engine, and the fact repository itself.

The patent goes into more detail on how a fact repository might work and the roles that importers and janitors would play in its operation.

The Index in a Fact Repository

The fact repository contains objects and collects facts about them. Each object has a unique ID, and each fact about an object contains a fact ID, an attribute, and a value.

Notice that both Objects and facts have unique IDs.

The index maintains a term index, which maps terms to {object, fact, field, token} tuples, where “field” is, e.g., attribute or value.
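To make that data model a little more concrete, here is a rough sketch of my own in Python. It is not code from the patent. The class names, the sample objects, and the choice to treat “token” as a word position within the field are all assumptions made for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Fact:
    fact_id: int      # every fact has its own unique ID
    attribute: str
    value: str


@dataclass
class FactObject:
    object_id: int    # every object has its own unique ID
    facts: list


def build_term_index(objects):
    """Map each lowercased term to (object_id, fact_id, field, token_position) tuples.

    Treating "token" as a word position inside the field is an assumption.
    """
    index = defaultdict(list)
    for obj in objects:
        for fact in obj.facts:
            for field in ("attribute", "value"):
                words = getattr(fact, field).lower().split()
                for position, word in enumerate(words):
                    index[word].append((obj.object_id, fact.fact_id, field, position))
    return index


# Hypothetical sample data, loosely modeled on the examples later in this post.
objects = [
    FactObject(1, [Fact(1, "name", "John Smith"), Fact(2, "is-a", "person")]),
    FactObject(2, [Fact(3, "name", "Rex"), Fact(4, "is-a", "dog")]),
]
index = build_term_index(objects)
print(index["john"])
print(index["is-a"])
```

Looking up a term returns every place it occurs, along with whether the hit was in an attribute or a value, which is the information the scope operators described later would need.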

The service engine is adapted to receive keyword queries from clients such as object requestors, and communicates with the index to retrieve the facts that are relevant to a user’s search query.

For a generic query containing one or more terms, the service engine assumes the scope is at the object level. Thus, any object with one or more of the query terms somewhere (not necessarily on the same fact) will match the query for purposes of being ranked in the search results.

The ranking (score) of an object can be a linear combination of relevance scores for each of the facts. We don’t know for certain if this patent’s description fits how facts and objects are actually ranked, but it’s interesting to see one way of ranking them. There are definitely other approaches out there, which may be in use even today.

The relevance score for each fact is based on whether the fact includes one or more query terms (a hit) in one of the attribute, value, or source portion of the fact.

Each hit is scored based on the frequency of the term that is hit, with more common terms getting lower scores, and rarer terms getting higher scores (e.g., using a TF-IDF based term weighting model).

The fact score is then adjusted based on additional factors, such as:

  1. Consecutive query terms in a fact
  2. Consecutive query terms in a fact in the order in which they appear in the query
  3. An exact match for the entire query
  4. The query terms in the name fact (or other designated fact, e.g., property or category), and
  5. The percentage of facts of the object containing at least one query term

Each fact’s score is also adjusted by its associated confidence measure and by its importance measure. Since each fact is independently scored, the facts most relevant and important to any individual query can be determined, and selected. In one embodiment, a selected number (e.g., 5) of the top scoring facts is selected for display in response to a query.
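Here is a minimal sketch of how that kind of scoring could fit together. It is my own illustration under stated assumptions, not the patent’s formula: the smoothed IDF weighting, the 1.5 boost for a hit in the name fact, the equal weights in the linear combination, and the sample confidence and importance numbers are all made up.

```python
import math


def fact_tokens(fact):
    """Lowercased words from a fact's attribute and value."""
    return set((fact["attribute"] + " " + fact["value"]).lower().split())


def idf(term, all_facts):
    """Rarer terms get higher weights (a smoothed inverse document frequency)."""
    containing = sum(1 for f in all_facts if term in fact_tokens(f))
    return math.log((1 + len(all_facts)) / (1 + containing)) + 1.0


def score_fact(fact, query_terms, all_facts):
    """Sum the weights of the query terms that hit this fact, then adjust."""
    hits = [t for t in query_terms if t in fact_tokens(fact)]
    score = sum(idf(t, all_facts) for t in hits)
    if hits and fact["attribute"] == "name":
        score *= 1.5  # assumed boost for a hit in the name fact
    return score * fact["confidence"] * fact["importance"]


def score_object(facts, query_terms, all_facts, top_n=5):
    """Object score as an equal-weight sum of fact scores, plus the top facts."""
    scored = [(score_fact(f, query_terms, all_facts), f) for f in facts]
    scored = [pair for pair in scored if pair[0] > 0]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    object_score = sum(s for s, _ in scored)
    return object_score, [f for _, f in scored[:top_n]]


# Hypothetical facts for a single object; query terms are assumed lowercase.
facts = [
    {"attribute": "name", "value": "John Smith", "confidence": 0.9, "importance": 1.0},
    {"attribute": "is-a", "value": "person", "confidence": 0.8, "importance": 0.5},
    {"attribute": "date of birth", "value": "1 May 1970", "confidence": 0.7, "importance": 0.6},
]
object_score, top_facts = score_object(facts, ["john"], facts)
print(object_score, [f["attribute"] for f in top_facts])
```

Because each fact is scored on its own, the same pass that ranks the object also picks which handful of facts to show for it.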

The purpose behind the query language is to help people surface information about the many entities, and the facts about them, within the fact repository.

Details of Query Language

Queries to the fact repository generally return objects. The search engine decides which objects to return based upon which facts match a query.

I’ve never seen “Google Reference Pages” as an actual Google product, but here’s a screenshot from the patent that shows it:

(Patent screenshot: Google Reference Pages)

The service engine is also adapted to handle structured queries, using query operators that restrict the scope of a term match.

It looks like Google, but the snippets look more like Wikipedia.

Examples

You can see from the patent screenshot above that this isn’t the Google you use every day. The queries you would use on it would be different too.

Google does have special search operators that return certain types of results today. You can see those on their Punctuation, symbols & operators in search page. The Query Language patent points out some of those aimed at a fact repository. For example:

“ ”: Double quotes that surround a sequence of query terms require that the terms match in that order in a single field. This is called a phrase match.

^: If a caret immediately precedes a word, it may only match the first word of a field. If the caret immediately follows a word, it may match only the last word of a field.

Quotes and carets can be combined to produce an exact field match, for example “^George W. Bush^”. In one embodiment, carets may only occur within quotes. In other embodiments, carets can be applied to any term.

[ ]: Square brackets restrict the enclosed expression to appear in the same fact.

{ }: Curly brackets restrict the enclosed expression to match a single field.

This can be further restricted to a field of a specific type, such as attribute{ . . . } or value{ . . . }.

[X:Y]: Shortcut for [attribute{X} value{Y}].

Matches an attribute/value pair of a fact with the specified values.
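The operators above are easier to follow with something to run. What follows is a small sketch of my own, with one function per operator rather than a real query parser. The function names and the sample facts are hypothetical, and splitting fields on whitespace is an assumption about how tokens would be formed.

```python
def tokens(text):
    return text.lower().split()


def phrase_in_field(pattern, field_text):
    """Phrase match, with optional ^ anchors inside the quotes."""
    body = pattern.strip('"')
    at_start = body.startswith("^")   # first term must be the field's first word
    at_end = body.endswith("^")       # last term must be the field's last word
    wanted = tokens(body.strip("^"))
    words = tokens(field_text)
    n = len(wanted)
    if n == 0:
        return False
    for i in range(len(words) - n + 1):
        if words[i:i + n] != wanted:
            continue
        if at_start and i != 0:
            continue
        if at_end and i + n != len(words):
            continue
        return True
    return False


def same_fact(facts, terms):
    """[a b]: every term occurs somewhere in one fact."""
    return any(all(t in tokens(attr) + tokens(val) for t in terms)
               for attr, val in facts)


def same_field(facts, terms, field=None):
    """{a b}: every term occurs in one field; field may be 'attribute' or 'value'."""
    for attr, val in facts:
        for name, text in (("attribute", attr), ("value", val)):
            if field in (None, name) and all(t in tokens(text) for t in terms):
                return True
    return False


def attribute_value_pair(facts, x, y):
    """[X:Y]: X in the attribute and Y in the value of the same fact."""
    return any(x in tokens(attr) and y in tokens(val) for attr, val in facts)


# Hypothetical facts for one object, as (attribute, value) pairs.
facts = [("name", "John Smith"), ("is-a", "person"), ("date of birth", "1 May 1970")]

print(phrase_in_field('"^George W. Bush^"', "George W. Bush"))
print(phrase_in_field('"^George W. Bush^"', "President George W. Bush"))
print(same_fact(facts, ["birth", "may"]), same_field(facts, ["birth", "may"]))
print(attribute_value_pair(facts, "is-a", "person"))
```

The pair of calls on “birth” and “may” shows the difference between the two bracket types: both words sit in the same fact, so square brackets match, but one is in the attribute and the other in the value, so curly brackets do not.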

Query: Birth

This query will return all objects whose facts contain the specified query term “Birth”.

It is important to note that search queries performed by a service engine in accordance with the present invention look at both a fact’s attribute (also called an attribute name) and the fact’s value (also called an attribute value) to determine if the fact is relevant to the query.

Example

Other embodiments may default to also searching for query terms within a fact’s links, metrics, sources, or agents and so on.

Still other embodiments may implement a query syntax that allows a user to explicitly search within various fields of a fact (such as links, metrics, sources, or agents and so on).

Example

The search query “birth” matches a fact contained in a link field of “www.birth.com” but would not match “www.birthday.com” in the same field.

Set up a slightly different way, so that partial-word matches count, a query of “birth” would match “www.birthday.com” since “birth” is contained in “birthday.”
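The difference between those two behaviors comes down to whole-token matching versus substring matching. Here is a tiny sketch; the function names are mine, and breaking the link into tokens on anything that is not a letter or digit is an assumption.

```python
import re


def token_match(term, field_text):
    """True only if the term is a whole token of the field."""
    return term.lower() in re.split(r"[^a-z0-9]+", field_text.lower())


def substring_match(term, field_text):
    """True if the term appears anywhere inside the field."""
    return term.lower() in field_text.lower()


for link in ("www.birth.com", "www.birthday.com"):
    print(link, token_match("birth", link), substring_match("birth", link))
```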

Query: Birth August

A logical “AND” operator is implicitly assumed if no logical operator is specified for query terms.

That is, an object must have associated facts matching both terms in order to be returned as a result of the query.

This query will return all objects that have the term “Birth” and the term “August” in one or more of their facts.

It is important to note that search queries performed in accordance with the present invention look at both a fact’s attribute (also called attribute name) and the fact’s value (also called attribute value) to determine if the fact is relevant to the query.

Example

Query: John & is-a

The ampersand (&) is an explicit logical operator that indicates that all search query terms must be present (although not necessarily in the same fact or in any particular field of the facts) for an object to match.

This search query will return all objects with facts that contain both the term “John” and the term “is-a”. Here, the term “John” is in fact #1 and the term “is-a” is an attribute of fact #2, so object #1 would be returned since it is associated with facts containing both search query terms.

Thus, even though the original source documents on document hosts that were used to create the facts of object #1 may not have contained the word “is-a,” object #1 will be returned by the search query since at some point a fact with an attribute of is-a was added to the object.
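A quick sketch of that object-level AND, again my own and not the patent’s code. The sample object is a hypothetical stand-in for object #1, and the helper names are made up.

```python
def object_terms(facts):
    """All lowercased words from every attribute and value of an object."""
    terms = set()
    for attribute, value in facts:
        terms.update(attribute.lower().split())
        terms.update(value.lower().split())
    return terms


def matches_all(facts, query_terms):
    """John & is-a: every term must occur somewhere in the object's facts."""
    available = object_terms(facts)
    return all(term.lower() in available for term in query_terms)


# Fact #1 carries the name; fact #2 is an is-a fact of the kind a janitor might add.
object_1 = [("name", "John Smith"), ("is-a", "person")]

print(matches_all(object_1, ["John", "is-a"]))
print(matches_all(object_1, ["John", "August"]))
```

The terms are pooled across all of the object’s facts before the check, which is why “John” in one fact and “is-a” in another still count as a match.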

Example

A janitor whose function is categorizing objects might have created multiple new facts having an attribute of “is-a”.

Thus, for example, a janitor may exist that searches the fact repository and categorizes objects, creating new facts with an “is-a” attribute having a value of “person,” “cat,” “dog,” and so on for each categorized object.

It will be possible for a user to enter a search query to locate all objects that have been categorized by the janitor (by searching for the attribute “is-a”). It would also be possible for a user to enter a search query to locate all objects that have been categorized as persons (by searching for the attribute “is-a” and the value “person” as an attribute/value pair within a single fact, as discussed below).
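To picture what a categorizing janitor might do, here is a hedged sketch. The patent doesn’t say how a janitor decides on a category, so the rule below (anything with a date of birth is a person, anything with a breed is a dog) is purely hypothetical, as are the function names and sample objects.

```python
def classify_by_attributes(obj):
    """Hypothetical rule of thumb for picking a category."""
    attributes = {fact["attribute"] for fact in obj["facts"]}
    if "date of birth" in attributes:
        return "person"
    if "breed" in attributes:
        return "dog"
    return None


def categorizing_janitor(objects, classify):
    """Add an is-a fact to every object the classifier can label."""
    for obj in objects:
        category = classify(obj)
        if category is not None:
            obj["facts"].append({"attribute": "is-a", "value": category})


def find_by_attribute(objects, attribute):
    return [o["id"] for o in objects
            if any(f["attribute"] == attribute for f in o["facts"])]


def find_by_pair(objects, attribute, value):
    return [o["id"] for o in objects
            if any(f["attribute"] == attribute and f["value"] == value
                   for f in o["facts"])]


objects = [
    {"id": 1, "facts": [{"attribute": "name", "value": "John Smith"},
                        {"attribute": "date of birth", "value": "1 May 1970"}]},
    {"id": 2, "facts": [{"attribute": "name", "value": "Rex"},
                        {"attribute": "breed", "value": "beagle"}]},
    {"id": 3, "facts": [{"attribute": "name", "value": "Greenway"}]},
]
categorizing_janitor(objects, classify_by_attributes)

print(find_by_attribute(objects, "is-a"))
print(find_by_pair(objects, "is-a", "person"))
```

The first lookup finds everything the janitor categorized; the second narrows it to persons by requiring the attribute and value to sit in the same fact.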

Query: John | “human being”

The vertical bar (|) is an explicit logical operator that indicates that only one query term must be present to match, although both may be present and still match.

This search query will return all objects containing either the term “John” or the phrase “human being.”

Here, the term “John” is in fact #1. Even though the phrase “human being” is not found, object #1 would be returned since it is associated with fact #1, and therefore satisfies the Boolean disjunction. FIG. 5(e) shows the following search query that is entered into the search query field:

Query: John | person

This search query will return all objects containing either the term “John” or the term “person”.

Here, the term “John” is in fact #1 and the term “person” is an attribute of fact #2, so object #1 would be returned. Other embodiments may allow a user to perform an exclusive OR’d search (i.e., only one fact, not more or fewer, must match).
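Here is a short sketch of the OR behavior, covering both a single word and a quoted phrase as alternatives. It is my own illustration: the helper names and the sample object are hypothetical, and it only handles the inclusive OR.

```python
def contains_phrase(text, phrase):
    """True if the words of phrase appear consecutively in text."""
    words = text.lower().split()
    wanted = phrase.lower().split()
    n = len(wanted)
    return any(words[i:i + n] == wanted for i in range(len(words) - n + 1))


def matches_any(facts, alternatives):
    """John | "human being": one matching alternative in any field is enough."""
    return any(contains_phrase(field, alternative)
               for attribute, value in facts
               for field in (attribute, value)
               for alternative in alternatives)


# Hypothetical stand-in for object #1.
object_1 = [("name", "John Smith"), ("is-a", "person")]

print(matches_any(object_1, ["John", "human being"]))
print(matches_any(object_1, ["John", "person"]))
print(matches_any(object_1, ["cat", "human being"]))
```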

Query: [Birth August]

A search query using square brackets ([ ]) will return all objects where both query terms are in the same fact.

Here, this search query will return nothing since no object in the example has a fact containing both “Birth” and “August.”
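The contrast between a plain two-word query and the same words in square brackets can be shown in a few lines. This is a sketch under my own assumptions; the sample object, where “birth” and “August” sit in different facts, is hypothetical.

```python
def field_tokens(fact):
    attribute, value = fact
    return set(attribute.lower().split()) | set(value.lower().split())


def object_level_and(facts, terms):
    """Birth August: each term may come from a different fact."""
    found = set().union(*(field_tokens(fact) for fact in facts))
    return all(term.lower() in found for term in terms)


def fact_level_and(facts, terms):
    """[Birth August]: every term must appear in one and the same fact."""
    return any(all(term.lower() in field_tokens(fact) for term in terms)
               for fact in facts)


# "birth" is in one fact and "August" in another.
facts = [("date of birth", "1 May 1970"), ("favorite month", "August")]

print(object_level_and(facts, ["Birth", "August"]))
print(fact_level_and(facts, ["Birth", "August"]))
print(fact_level_and(facts, ["Birth", "May"]))
```

The object matches the plain query but not the bracketed one, which is the same kind of empty result the patent’s example describes.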

Take Aways

The patent contains a number of additional punctuation terms and specialized formats you can use in a query. It’s possible that they would make an “advanced” search page, like Google did for their Advanced patent search, which inserts specialized query terms. For example, if I want to search for patents assigned to a specific company such as Google, I use the advanced patent search page and it types in a specialized query language for me. Google would include within that query the phrase, “inassignee:google”.

But instead of a query language for their Knowledge Web, Google seems happier providing results in knowledge panels, carousels, and other specialized formats. These are triggered by ordinary queries: type in something such as [“popular new TV show name” episodes], and you might see a scrolling list of episodes at the top of the search results.

People may refer to these as “direct answers,” but they are also simple answers: it doesn’t take much magic for a searcher to figure out how to ask for them, or for a search engine to deliver them.

Instead of a query language for searchers to try to figure out, Google may see if it can figure out when people want some kind of enhanced results as a result of a query. That seems to be the message behind the “Enhanced Results” paper I mentioned above.

9 thoughts on “Google’s Query Language”

  1. I have a long ways to go until I’m capable of responding regarding what you wrote here… So I’ll just remark that I love that you are promoting your Kickstarter campaign! Yay, Bill!

  2. Thanks, Kristin

    We all have to start somewhere, and if you keep on working to learn, you will. Thanks, regarding the kickstarter campaign. I probably need to do a blog post to help with it more. 🙂

  3. Hi Miraj

    Thank you. I was pretty excited to read about Google’s efforts to try to put together a query language for semantic web type information. I think in the end it might have been a little too complex for many search engine users, and what we are seeing now, with knowledge panels and carousels and other specialized search results is Google’s solution to that problem. It will be interesting seeing how they focus their efforts to make both web-indexing results and data-indexing results available to searchers.

  4. Hi Bill,
    I don’t know what to write in the comment section after reading an article like this! Very very informative and in-depth content. I have to read again 🙂 For now, I just want to thank for sharing the cool search operators.

    Best Regards
    Miraj Gazi

  5. Yes Bill! It will be Really interesting seeing how Google focus their efforts to make both web-indexing results and data-indexing results available to searchers. I hope you will keep us updated ! 🙂

    Best Regards

    Miraj

  6. Hi Bill,

    I found this blog today and it is very unsual. No blog explains technical details and explores the indepth concepts behind web-search.

    I am reading this blog again and again to understand google’s query language, some thing I never heard about before. I am not only bookmarking it but also recommending it to lots of my friends, who will be shocked to see, a gem of a blog in the blogosphere.

  7. RDFs wasn’t good enough for Google, so not surprised they’re ignoring SPARQL. Why not at least just build an abstraction language over SPARQL to query triples?

  8. Hi Matt,

    Thanks. I don’t mean to mislead anyone – this is an older patent, but it came in a flurry of semantic related patents from Google that happened when Andrew Hogue hired a number of people to provide an “annotation framework” to Google, and before he led to the acquisition of Meta-web. SPARQL isn’t a query language that most searchers are going to pick up and start using quickly, but this patented query language from Google still might have been a little too complex for the average searcher. I think we are seeing more semantic answers in Google’s search results without a query language like this one in place. I’d love to see more features added to an advanced search from Google that focuses more on data results.
