The following patent application was published by the US Patent and Trademark Office earlier today:
Systems and methods to process and correct spelling errors for non-Roman based words such as in Chinese, Japanese, and Korean languages using a rule-based classifier and a hidden Markov model are disclosed.
The method generally includes converting an input entry in a first language such as Chinese to at least one intermediate entry in an intermediate representation, such as pinyin, different from the first language, converting the intermediate entry to at least one possible alternative spelling or form of the input in the first language, and determining that the input entry is either a correct or questionable input entry when a match between the input entry and all possible alternative spellings to the input entry is or is not located, respectively.
The questionable input entry may be classified using, for example, a transformation rule based classifier based on transformation rules generated by a transformation rules generator.
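To make the round-trip idea in the abstract concrete, here is a minimal Python sketch. It assumes tiny hand-made lookup tables in place of real dictionaries, and every name in it (`CHAR_TO_PINYIN`, `PINYIN_TO_WORDS`, `check_entry`) is hypothetical rather than taken from the patent filing. It covers only the correct/questionable decision, not the rule-based classifier or the hidden Markov model.

```python
# Toy lookup tables (illustrative assumptions, not data from the patent).
# Maps an input entry to its pinyin, the intermediate representation.
CHAR_TO_PINYIN = {"北京": "beijing", "背景": "beijing", "被京": "beijing"}
# Maps a pinyin string back to the known valid words that share it.
PINYIN_TO_WORDS = {"beijing": {"北京", "背景"}}


def to_pinyin(entry):
    """Convert an input entry to its intermediate (pinyin) form, if known."""
    return CHAR_TO_PINYIN.get(entry)


def alternatives(pinyin):
    """Convert a pinyin string back to all known alternative spellings."""
    return PINYIN_TO_WORDS.get(pinyin, set())


def check_entry(entry):
    """Label an entry 'correct' if it matches one of its own alternatives,
    otherwise 'questionable'. Also returns the alternatives found."""
    pinyin = to_pinyin(entry)
    candidates = alternatives(pinyin) if pinyin else set()
    status = "correct" if entry in candidates else "questionable"
    return status, candidates


if __name__ == "__main__":
    print(check_entry("北京"))
    print(check_entry("被京"))
```

In this sketch a questionable entry comes back with its candidate spellings, which is the point where a classifier such as the one the abstract mentions could take over.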
This is at least the second patent application from Google this year that deals with handling searches in Chinese. Another was published on September 22nd:
Systems and methods to process and translate pinyin to Chinese characters and words are disclosed.
A Chinese language model is trained by extracting unknown character strings from Chinese inputs, e.g., documents and/or user inputs/queries, determining valid words from the unknown character strings, and generating a transition matrix based on the Chinese inputs for predicting a word string given the context.
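The abstract does not say how the transition matrix is built. One common reading, assumed here, is a table of first-order word-to-word probabilities estimated from already segmented text. The function name `build_transition_matrix` and the sample corpus are hypothetical.

```python
from collections import Counter, defaultdict


def build_transition_matrix(sentences):
    """Estimate P(next word | previous word) from segmented sentences.

    `sentences` is a list of word lists. Segmentation is assumed to have
    been done already; this sketch does not discover new words itself.
    """
    counts = defaultdict(Counter)
    for words in sentences:
        # Count each adjacent (previous, next) word pair.
        for prev, nxt in zip(words, words[1:]):
            counts[prev][nxt] += 1

    matrix = {}
    for prev, next_counts in counts.items():
        total = sum(next_counts.values())
        # Normalize counts so each row sums to 1.
        matrix[prev] = {word: n / total for word, n in next_counts.items()}
    return matrix


if __name__ == "__main__":
    corpus = [["我", "爱", "北京"], ["我", "爱", "上海"], ["我", "在", "北京"]]
    for prev, row in build_transition_matrix(corpus).items():
        print(prev, row)
```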
A method for translating a pinyin input generally includes generating a set of Chinese character strings from the pinyin input using a Chinese dictionary including words derived from the Chinese inputs and a language model trained based on the Chinese inputs, each character string having a weight indicating the likelihood that the character string corresponds to the pinyin input.
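A rough sketch of that translation step follows. It assumes the pinyin input is already split into space-separated words, and it uses a toy dictionary and hand-picked probabilities. `DICTIONARY`, `START`, `TRANSITION`, `FLOOR`, and `translate` are illustrative names, not terms from the filing. It also tries every combination, which only works for very short inputs.

```python
from itertools import product

# Toy data (assumptions for illustration only).
DICTIONARY = {"wo": ["我", "窝"], "ai": ["爱", "矮"]}   # pinyin -> candidate words
START = {"我": 0.8, "窝": 0.2}                          # P(first word)
TRANSITION = {                                          # P(next | previous)
    ("我", "爱"): 0.6,
    ("我", "矮"): 0.1,
    ("窝", "爱"): 0.05,
    ("窝", "矮"): 0.05,
}
FLOOR = 1e-6  # small weight for pairs the model has never seen


def translate(pinyin_input):
    """Return (character string, weight) pairs, most likely first."""
    tokens = pinyin_input.split()
    if not tokens:
        return []
    # Candidate words for each pinyin token; an unknown token has no candidates.
    options = [DICTIONARY.get(token, []) for token in tokens]
    results = []
    for words in product(*options):
        # Weight = P(first word) times each word-to-word transition.
        weight = START.get(words[0], FLOOR)
        for prev, nxt in zip(words, words[1:]):
            weight *= TRANSITION.get((prev, nxt), FLOOR)
        results.append(("".join(words), weight))
    return sorted(results, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    for text, weight in translate("wo ai"):
        print(text, weight)
```

A system handling longer inputs would more likely use a dynamic-programming search such as Viterbi instead of listing every combination.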
An ambiguous user input may be classified as non-pinyin or pinyin by identifying an ambiguous pinyin/non-pinyin ASCII word in the user input and analyzing the context to classify the user input.
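The abstract gives no detail on how the context is analyzed. As one possible illustration, the sketch below lets the surrounding words vote: neighbors that can only be pinyin count one way, and neighbors that can only be English count the other. The word lists and the function `classify_input` are made up for this example.

```python
# Toy word lists (assumptions for illustration only).
AMBIGUOUS = {"women", "long", "can", "song"}        # valid as pinyin and as English
PINYIN_ONLY = {"zhongguo", "beijing", "xihuan"}     # context that signals pinyin
ENGLISH_ONLY = {"the", "shoes", "for", "dress"}     # context that signals English


def classify_input(query):
    """Classify a query that contains an ambiguous ASCII word by its context."""
    tokens = query.lower().split()
    if not any(token in AMBIGUOUS for token in tokens):
        return "no ambiguous word"
    # Each unambiguous neighbor casts one vote.
    pinyin_votes = sum(token in PINYIN_ONLY for token in tokens)
    english_votes = sum(token in ENGLISH_ONLY for token in tokens)
    if pinyin_votes > english_votes:
        return "pinyin"
    if english_votes > pinyin_votes:
        return "non-pinyin"
    return "undecided"


if __name__ == "__main__":
    print(classify_input("women xihuan beijing"))
    print(classify_input("shoes for women"))
```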