Over at Threadwatch, Graywolf started a thread titled Are you Optimizing for Google Definitions? There are some insightful comments in the thread, and I recalled a Google patent application that covered the topic.
I looked around the web to see if there had been any discussion about the patent application, but couldn’t find any. The document is System and method for providing definitions (US Patent Application 20040236739), invented by Craig Nevill-Manning, filed on June 27, 2003, and published on November 25, 2004.
The abstract for the application is pretty general, but the document is fairly detailed. Here’s the abstract:
A system and method for providing definitions is described. A phrase to be defined is received. One or more documents, which each contain at least one definition, are determined. The phrase is matched to at least one of the definitions. One or more definitions for the phrase are presented.
An example of Google Definitions in action
Before looking at the words of the patent application, let’s take a look at the results of a definition search on Google.
Below is a screenshot of search results for a Google search for define:indeterminacy. The search is showing five results, and also includes clickable suggested related search queries for “indeterminacy principle” and “quantum indeterminacy.”
Looking at sources of definitions
A quick look at the pages linked to from the definitions of “indeterminacy” tells us a little about pages where definitions come from.
The first result listed is from Critical Vocabulary, which has more than 40 definitions, and the words being defined stand out from the definitions by being encased in font, bold, and span elements like this:
<p><font face="Century Gothic"><b><span style="color: #495987">Indeterminacy:
The second result listed was from Factor Analysis Glossary (no longer available), and it included over 50 definitions. Defined words were set apart from their definitions by being bolded and followed by a semicolon.
Result number three, Classical Guitar Dictionary I, includes over 100 definitions. Dictionary terms and definitions are different colors and the defined terms are bolded as follows:
<b><span style="font-size: 8.0pt; font-family: Verdana; color: #FFFF00">Indeterminacy</span></b>
The next result doesn’t come from a page listing a number of definitions, but could be said to be from an authority page: WordNet Search – 3.0. The WordNet Search pages are popular sources of definitions amongst computer scientists and people who work in information retrieval.
The last result listed is from another authority site. It’s the indeterminacy entry in Wikipedia.
Where definitions come from
There are several potential sources for definitions, according to the patent application.
- They can be found during web-crawling or spidering by the search engines. If it is determined that a page contains definitions, either the document or information about it may be indexed by the search engine, and stored.
- “Authoritative” sources for definitions could also be used, such as our WordNet Search or Wikipedia pages above.
- Alternatively, pages with definitions could be searched by querying the search engine in real time, instead of ahead of time.
- Or, a mix of both previously collected documents, and new ones identified during real time processing could be used to locate definitions, remove duplicates and clean up definitions in response to a query.
Determining that a document has definitions
Exactly what might be looked for to decide that a page is a good candidate for definitions? These are some possibilities:
- Terms on the page such as “glossary,” “definition,” “dictionary,” and other similar words including variants and canonicalizations of those.
- The search could look at the text of the whole document, or just in certain areas such as title field, meta data, or other places within the document.
- Use of HTML within a document may also be important and meaningful.
- One version of this method would look for terms like “glossary,” “definitions,” or “dictionary” in the titles of Web pages.
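As a rough illustration of the title-based check described above, here is a minimal Python sketch. The cue-word pattern, the class name TitleExtractor, and the function looks_like_definition_page are my own assumptions for the example, not anything named in the patent application.

```python
import re
from html.parser import HTMLParser

# Assumed cue words: the application mentions "glossary", "definition",
# and "dictionary" plus variants; this exact pattern is hypothetical.
CUE_PATTERN = re.compile(
    r"\b(glossar(y|ies)|definitions?|dictionar(y|ies))\b", re.IGNORECASE
)


class TitleExtractor(HTMLParser):
    """Collects the text found inside <title> elements."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data


def looks_like_definition_page(html, title_only=True):
    """Return True if a cue word appears in the title (or anywhere in the raw source)."""
    parser = TitleExtractor()
    parser.feed(html)
    parser.close()
    # With title_only=False the raw HTML source, markup included, is searched.
    text = parser.title if title_only else html
    return bool(CUE_PATTERN.search(text))
```

Checking only the title is the stricter option. Passing title_only=False widens the search to the whole page source, which would likely pull in more false positives.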
Possible methods used to parse apart documents, identify headwords, and/or return definitions:
The application tells us that “definition containing documents” may be organized with “headwords.” A headword is a word or phrase that can be identified in some manner as separate from the definition for that word. In our examples above, HTML and punctuation are being used to distinguish words from their definitions.
Here are some examples from the patent application:
1. Use of html definition tags:
<dl><dt>Headword 1 <dd>Definition of Headword 1
<dt>Headword 2 <dd>Definition of Headword 2
<dt>Headword 3 <dd>Definition of Headword 3</dl>
2. HTML separators between adjacent definitions:
There needs to be a way for the search engine to distinguish between different definitions, and it will look for HTML such as <p>, <tr>, <li>, and <br> to figure out that there is more than one definition.
3. Punctuation needs to be removed from the definitions, like our semicolon above from the “Factor Analysis Glossary” page.
4. Headwords need to be identified.
HTML such as <b>, <strong>, <em>, <code>, or <span> may be helpful in identifying those headwords. Our first three examples above use a number of those elements to separate headwords from definitions.
5. The number of definitions on a page may be looked at
The patent application notes that if there are fewer than a certain number of definitions on a page, all of the definitions from that page may be removed from consideration as results for a definition query.
6. Precision versus Recall
Since there are potentially a lot of documents on the web that definitions can be taken from, the focus of gathering definitions will be on getting a few good definitions rather than a larger number that might include duplicates or entries that cumulate other definitions.
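The parsing steps above can be sketched in Python along these lines. This is only a guess at one way to do it, using the standard library: the threshold value MIN_DEFINITIONS, the regular expressions, and every function and class name here are assumptions of mine, since the application describes the signals but not an implementation.

```python
import re
from html.parser import HTMLParser

MIN_DEFINITIONS = 3  # hypothetical threshold; the application names no number

# Tags that may separate adjacent definitions, and tags that may mark headwords.
ENTRY_SEPARATOR = re.compile(r"<(?:p|tr|li|br)\b[^>]*>", re.IGNORECASE)
HEADWORD_MARKUP = re.compile(
    r"<(b|strong|em|code|span)\b[^>]*>(.*?)</\1>(.*)", re.IGNORECASE | re.DOTALL
)
ANY_TAG = re.compile(r"<[^>]+>")


class DefinitionListParser(HTMLParser):
    """Collects (headword, definition) pairs from <dt>/<dd> markup."""

    def __init__(self):
        super().__init__()
        self.pairs = []
        self.current = None  # "dt", "dd", or None
        self.headword = ""
        self.definition = ""

    def _flush(self):
        # Store the pair gathered so far, if both halves are present.
        if self.headword.strip() and self.definition.strip():
            self.pairs.append((self.headword.strip(), self.definition.strip()))
        self.headword = ""
        self.definition = ""

    def handle_starttag(self, tag, attrs):
        if tag == "dt":
            self._flush()  # a new headword ends the previous entry
            self.current = "dt"
        elif tag == "dd":
            self.current = "dd"

    def handle_endtag(self, tag):
        if tag == "dl":
            self._flush()
            self.current = None

    def handle_data(self, data):
        if self.current == "dt":
            self.headword += data
        elif self.current == "dd":
            self.definition += data

    def close(self):
        super().close()
        self._flush()


def from_definition_list(html):
    parser = DefinitionListParser()
    parser.feed(html)
    parser.close()
    return parser.pairs


def from_emphasized_headwords(html):
    """Split on separator tags, then treat the first emphasized run as the headword."""
    pairs = []
    for chunk in ENTRY_SEPARATOR.split(html):
        match = HEADWORD_MARKUP.search(chunk)
        if not match:
            continue
        headword = ANY_TAG.sub("", match.group(2)).strip()
        definition = ANY_TAG.sub("", match.group(3)).strip()
        if headword and definition:
            pairs.append((headword, definition))
    return pairs


def extract_definitions(html, min_definitions=MIN_DEFINITIONS):
    """Return the pairs found, or nothing if the page has too few definitions."""
    pairs = from_definition_list(html) or from_emphasized_headwords(html)
    return pairs if len(pairs) >= min_definitions else []
```

The sketch leaves punctuation such as a trailing colon or leading semicolon in place, because the application treats that cleanup as a separate step.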
1. Order of results
One version of the process described in this application would return results in an order based upon the PageRank of the documents where the definitions came from.
2. Processing results
One or more of the following might be removed when presenting definitions:
- HTML markup
- leading and trailing white space in the headword and definition
- punctuation: (. : ; ! ? -) in the headword
- leading non-alpha and non-parenthesis in the headword and definition
- trailing non-alphanumeric and non-parenthesis in the headword.
A definition could be thrown away if it:
- starts with “see”
- is a duplicate of one already retrieved.
The first letter of the definition will also be capitalized.
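Here is a minimal Python sketch of how the ordering and cleanup steps above might fit together. The function names are hypothetical, and the numeric score stands in for a PageRank-like value that a real system would get from elsewhere. The regular expressions are my reading of the listed rules, not text from the application.

```python
import re

TAG = re.compile(r"<[^>]+>")


def clean_headword(raw):
    text = TAG.sub("", raw)                        # HTML markup
    text = re.sub(r"[.:;!?-]", "", text)           # punctuation in the headword
    text = re.sub(r"^[^A-Za-z()]+", "", text)      # leading non-alpha, non-parenthesis
    text = re.sub(r"[^A-Za-z0-9()]+$", "", text)   # trailing non-alphanumeric, non-parenthesis
    return text


def clean_definition(raw):
    text = TAG.sub("", raw)                        # HTML markup
    text = re.sub(r"^[^A-Za-z()]+", "", text)      # leading non-alpha, non-parenthesis
    text = text.rstrip()                           # trailing white space
    return text[:1].upper() + text[1:]             # capitalize the first letter


def prepare_results(candidates):
    """candidates: iterable of (headword, definition, score) tuples.

    The score is an assumed stand-in for the PageRank of the source page.
    """
    seen = set()
    results = []
    # Highest-scoring sources first, so the best copy of a duplicate is kept.
    for headword, definition, score in sorted(
        candidates, key=lambda c: c[2], reverse=True
    ):
        headword = clean_headword(headword)
        definition = clean_definition(definition)
        if not definition or re.match(r"see\b", definition, re.IGNORECASE):
            continue  # empty, or a cross-reference such as "See ..."
        key = definition.lower()
        if key in seen:
            continue  # duplicate of a definition already retrieved
        seen.add(key)
        results.append((headword, definition))
    return results
```

One caveat: following the list literally strips hyphens from headwords, so a hyphenated term would be altered. A production system would probably need a gentler rule there.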
Presenting additional information
While it is possible that only exact matches for the definition query would be returned, it is also possible that more will be retrieved by the search engine.
Where the word or words are part of a larger phrase, like the “indeterminacy principle” and “quantum indeterminacy” suggestions shown in the screenshot above, those phrases may be returned with definitions, or possibly as links.
2. No definitions found
If there are no definitions found, other words may be presented to the searcher. These can include related terms, other ones that might be of interest, or even random results.
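To make the matching behavior a little more concrete, here is a small Python sketch that returns exact matches as definitions and larger phrases containing the query as related suggestions. The index structure (a dictionary from headwords to lists of definitions) and the function name lookup are assumptions for illustration only.

```python
import re


def lookup(query, index):
    """index: dict mapping headwords to lists of definitions (hypothetical structure)."""
    wanted = query.strip().lower()
    result = {"definitions": [], "related": []}
    if not wanted:
        return result
    # Match the query only as whole words inside a longer headword.
    contains_query = re.compile(r"\b" + re.escape(wanted) + r"\b")
    for headword, definitions in index.items():
        normalized = headword.lower()
        if normalized == wanted:
            result["definitions"].extend(definitions)
        elif contains_query.search(normalized):
            result["related"].append(headword)
    result["related"].sort()
    return result
```

When the definitions list comes back empty, the related list is what a searcher might be shown instead.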
Google will often show some information from Google News, Froogle, or Google Images, instead of advertisements, above organic results for some searches.
One of the many good points raised at Threadwatch was that a query entered into Google structured like “what is indeterminacy” often returns a definition above the organic results too.
It’s important to keep in mind that the process described in this patent application may not be what Google is actually using, but it may provide some insight into how the search engine might be responding to queries where people ask to have something defined, and how pages might need to be written to be considered as sources of definitions.