How Does a Search Engine Know the Language of a Query? Google Explores Character Mapping

Choosing the right character set for your web page might mean that it is easier for a search engine to understand what language your page is in, though there are also other ways that it might be able to determine that.

But, what about when someone types in a query?

– How does a search engine know what language a search query might be in?

– How does it handle queries in different languages made on devices that might not be capable of producing some special characters outside of the Latin alphabet?

Also, do web pages that declare a certain character set (something that webmasters can choose in the HTML for a page) stand a better chance of having their language identified by a search engine?

Google Patent Applications on Languages and Queries

Google published four patent applications recently that delve into these areas, covering the “handling of language uncertainty in processing search queries and searches over the web, where the queries and documents can be expressed in any one of a number of different languages.”

A search engine is called upon to index and search documents written in a wide variety of languages, and a number of documents that are expressed in multiple languages.

Keyboards without Non-Latin Characters

Another challenge is that some devices that are used to create content and display web pages can have difficulties in producing some of the characters used in different languages.

People searching on a handheld, or on a keyboard, may use characters that are close substitutes for the ones that they would actually want to use, such as an unaccented character.

A search engine could process the content that it has indexed to remove accents and convert special characters into a standard set of characters. But doing that would throw away information from the search index, and would make it impossible to match content exactly when a searcher does type a query in their natural language, using accented or non-Latin characters.
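To make that loss of information concrete, here is a minimal sketch of accent stripping using Python's standard `unicodedata` module. The `strip_accents` helper is my own illustration and not something taken from the patent filings; it only handles accents that Unicode can decompose into a base letter plus a combining mark.

```python
import unicodedata

def strip_accents(text: str) -> str:
    """Decompose each character (NFD) and drop the combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

# Two different words become indistinguishable once accents are removed,
# so an index that stored only the stripped form could not tell them apart.
print(strip_accents("résumé"))  # resume
print(strip_accents("resume"))  # resume
```

Once both forms collapse to the same string, a query typed with the accents can no longer be matched against only the accented pages.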

The Query Language Patent Filings

The patent applications were published on December 13, 2007, and were originally filed on April 19, 2006.

Search Engine Learning the Language of a Document

Under the approach in these patents, a training model is created and then used to identify the language of the documents to be searched. The model is trained on a specific body of documents, which can be a mix of different types, such as:

  • HTML
  • PDF
  • Text documents
  • Word processing documents
  • Usenet articles
  • Any other kinds of documents having text content, including metadata content

These documents should ideally represent what might be found on the Web, and might be the Web itself, or a snapshot or extract from the Web.

That body of documents should include all languages represented on the Web, with enough documents from each language, so that they might contain a significant enough portion of the words found within all documents of the language on the Web.

The Role of Character Encoding

A system like this might work best if each of the training documents and each document to be searched were encoded in a known and consistent character encoding, such as 8-bit Unicode Transformation Format (UTF-8). Of course, that isn’t what you’ll find on the Web, where many pages don’t declare a character set at all, or declare a different one. Here’s what the declaration would look like in the HTML for a page using UTF-8:

<meta http-equiv="Content-Type" content="text/html; charset=utf-8">

If a page doesn’t use UTF-8, and this language determination process does, then documents using some other encoding might be converted into UTF-8. That conversion might result in some funny looking characters ending up in results.
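Here is a small sketch of how that kind of conversion can go right or wrong. The `to_utf8` function is a hypothetical helper of mine, and it assumes the converter simply trusts whatever encoding it has been told the page uses; the patent filings don't describe the conversion step at this level of detail.

```python
def to_utf8(raw: bytes, assumed_encoding: str) -> bytes:
    """Re-encode raw page bytes as UTF-8, trusting the assumed source encoding."""
    return raw.decode(assumed_encoding, errors="replace").encode("utf-8")

# A page really saved as Latin-1 and converted as Latin-1: the text survives.
good = to_utf8("café".encode("latin-1"), "latin-1")
print(good.decode("utf-8"))  # café

# A page really saved as UTF-8 but wrongly treated as Latin-1:
# the two bytes of "é" are each read as a separate character.
bad = to_utf8("café".encode("utf-8"), "latin-1")
print(bad.decode("utf-8"))   # cafÃ©
```

The second case is where the odd characters come from: the bytes are fine, but they were interpreted under the wrong character set before being converted.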

Language Detection on Pages, Using Probabilities

The document language detection process uses statistical learning theories, and classification models.

The most likely class or classes for a page of text may be based on the text from the page, and possibly by looking at the URL of the page.

This could be done by breaking the text down into words, and computing the probabilities of those words appearing upon the page together in different languages, to predict the most likely language for that text.

So, if the word “Hello” occurs frequently on a page, and the training model has seen that word most often on English pages and next most often on German pages, then the page is most likely to be in English, with German as the next most likely language.
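The word-probability idea can be sketched as a tiny naive Bayes classifier. Everything here is an assumption made for illustration: the word counts in `TRAINING` are made up, every language is treated as equally likely beforehand, and add-one smoothing is my choice for handling unseen words. The patent filings describe statistical classification in general terms and don't confirm these details.

```python
import math
from collections import Counter

# Toy word counts per language (invented numbers, not real training data)
TRAINING = {
    "en": Counter({"hello": 50, "the": 200, "world": 40}),
    "de": Counter({"hello": 5, "die": 180, "welt": 40}),
    "fr": Counter({"bonjour": 50, "le": 190, "monde": 40}),
}
VOCAB = set().union(*TRAINING.values())

def language_scores(text: str, smoothing: float = 1.0) -> dict:
    """Return a log-probability score of the text under each language."""
    words = text.lower().split()
    scores = {}
    for lang, counts in TRAINING.items():
        total = sum(counts.values())
        denom = total + smoothing * len(VOCAB)
        # Sum of log P(word | language), smoothed so unseen words are not zero
        scores[lang] = sum(math.log((counts[w] + smoothing) / denom) for w in words)
    return scores

def rank_languages(text: str) -> list:
    """Languages ordered from most to least likely for the text."""
    scores = language_scores(text)
    return sorted(scores, key=scores.get, reverse=True)

print(rank_languages("hello hello world"))  # ['en', 'de', 'fr']
```

With these toy counts, “hello” pulls the text toward English first and German second, which mirrors the example above.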

Looking at certain characters can be helpful, too. If certain characters don’t appear very frequently, if at all, in some languages, then pages which have words in them with those characters might be less likely to be in those languages.
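A rough sketch of the character signal might look like this. The `ALPHABET_EXTRAS` table is a small, hand-picked, incomplete set of accented letters that I've assumed for four languages, and the hard yes/no filter is a simplification: a real system would more likely lower a language's probability than rule it out entirely.

```python
# Hypothetical, incomplete sets of non-ASCII letters used by each language
ALPHABET_EXTRAS = {
    "en": set(),
    "de": set("äöüß"),
    "es": set("áéíóúüñ"),
    "fr": set("àâæçéèêëîïôœùûüÿ"),
}

def plausible_languages(text: str) -> list:
    """Keep only languages whose alphabet covers every special letter in text."""
    special = {ch for ch in text.lower() if ch.isalpha() and not ch.isascii()}
    return sorted(
        lang for lang, extras in ALPHABET_EXTRAS.items() if special <= extras
    )

print(plausible_languages("straße"))  # ['de']
print(plausible_languages("mañana"))  # ['es']
print(plausible_languages("über"))    # ['de', 'es', 'fr']
```

A character like “ü” narrows the field only a little, while “ß” or “ñ” points strongly at one language, so this kind of evidence works best combined with word probabilities.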

The Use of Character Mapping

One of the keys to this process is the creation of character maps that may be more distinctive of one language than of others. A common form of a word in a specific language may contain accented characters, for instance.

The patent applications go into a great deal of detail on how these character mappings can be used in a few different ways.

One is to help identify languages for some queries.

Another is to identify when a query might be a simplified version of a word, typed that way because the searcher’s device, such as a smart phone, is incapable of producing certain characters. The patent applications give a number of examples of how this might work.
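As a rough illustration of that second use, the sketch below applies a per-language conversion map to indexed words and checks whether any of them reduce to the simplified query. The `CONVERSION_MAPS` entries, the `simplify` and `expand_query` helpers, and the sample index are all hypothetical; the maps in the patent filings are far more complete and are used in ways this sketch doesn't capture.

```python
# Hypothetical conversion maps: special character -> plain-keyboard substitute
CONVERSION_MAPS = {
    "de": {"ä": "ae", "ö": "oe", "ü": "ue", "ß": "ss"},
    "fr": {"é": "e", "è": "e", "ê": "e", "ç": "c", "à": "a"},
}

def simplify(word: str, lang: str) -> str:
    """Rewrite a word the way someone without special characters might type it."""
    mapping = CONVERSION_MAPS[lang]
    return "".join(mapping.get(ch, ch) for ch in word.lower())

def expand_query(query: str, indexed_words: dict) -> list:
    """Find (language, word) pairs whose simplified form equals the query."""
    target = query.lower()
    matches = []
    for lang, words in indexed_words.items():
        for word in words:
            if word != target and simplify(word, lang) == target:
                matches.append((lang, word))
    return matches

index = {"de": ["müller", "strasse"], "fr": ["école", "ecole"]}
print(expand_query("mueller", index))  # [('de', 'müller')]
print(expand_query("ecole", index))    # [('fr', 'école')]
```

Because each map belongs to a language, a match does double duty: it suggests the accented word the searcher probably meant, and it hints at which language the query is in.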


If you work with websites written in non-Latin characters, you may find these patent applications worth digging into in much more depth.

Another patent application, Query Language Identification, is mentioned in these patent filings but is unpublished at this point. It looks like it might go into even more depth on the topic.

Some of the languages discussed in the patent filings, along with the conversion maps created for them, include:

Catalan, Croatian, Czech, Danish, Dutch, English, Esperanto, Estonian, French, German, Greek, Hungarian, Icelandic, Italian, Latvian, Lithuanian, Macedonian, Polish, Portuguese, Romanian, Russian, Serbian, Slovakian, Slovenian, Spanish, Swedish or Finnish, Turkish, and Ukrainian.

Other Resources

I looked for a number of documents that explore queries in other languages, and came up with the following:


Author: Bill Slawski
