IBM tackles multilingual web searching
I’ve been enjoying visiting a number of sites written in languages other than English, such as Google.Dirson.com and Référencement, Design et Cie. I often rely on online translation services to read those sites, but when I search the web, I have trouble finding information that isn’t written in English.
It would be nice to have a way to search non-English sites without having to try to translate queries into other languages first.
IBM has a patent filing, published as a patent application last week, which tries to help people find sites in other languages that are relevant to their searches, and might be authority sites on those subjects.
One of the fastest growing groups of users on the web doesn’t speak English, and while they may be searching for information in their native languages, they may also want to see results that include documents in other languages. The method described in this patent application is meant to help us overcome that language barrier without having to resort to a translation service to form queries.
Inventors: Ling Zhang
Assignee Name and Address: International Business Machines Corporation
US Patent Application 20060059132
Published March 16, 2006
Filed: July 29, 2005
The present invention provides methods, apparatus and systems for searching hypertext based multilingual Web information when searching on a network for keywords to be queried. A method includes: a receiving step for receiving keywords input by a user; a native language hypertext searching step for searching on the network, according to the keywords to be queried, for all hypertexts whose representing language is the same as a language representing the keywords and which matches the keywords to be queried; extracting hyperlinks related to an arbitrary language from all the searched hypertexts; a hyperlink ranking step for ranking the extracted hyperlinks according to the correlativity of the hyperlinks with the keywords to be queried; and returning to the user ranked search result. Thereby, an accurate cross language searching can be provided without extra machine translation effort, being more accurate and objective than machine translation, even than human translation.
This patent application uses anchor text and hyperlinks to sidestep the problems of language translation, and to help people find authority pages in more than one language based upon a query in their own language.
Here’s one example offered by the patent application on how this could work:
supposing a Chinese Internet user tries to locate the homepage of “Reader’s Digest” magazine, he/she will input “(Reader’s Digest)” (keyword) expressed in Chinese, since many Chinese Web pages include hyperlinks to the Web site of the magazine of “Reader’s Digest” and most of the hypertexts corresponding to the hyperlinks include “Reader’s Digest” expressed in Chinese ( (Reader’s Digest)), by matching the hypertexts with the keyword and analyzing the hyperlink distribution, the URL www (followed by) rd.com of the magazine of “Reader’s Digest” can be retrieved.
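To make that example a little more concrete, here is a minimal sketch in Python of the general idea: match the query against anchor text, then count how many distinct pages point to each target. This is my own simplification and not the method claimed in the patent application. The `Link` class, the `rank_targets` function, the sample pages on `.example` domains, and the Chinese keyword (an illustrative rendering of “Reader’s Digest”, not taken from the filing) are all assumptions made for illustration.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Link:
    source_url: str   # page that contains the hyperlink
    anchor_text: str  # visible text of the hyperlink
    target_url: str   # page the hyperlink points to


def rank_targets(query: str, links: list[Link]) -> list[tuple[str, int]]:
    """Rank link targets by the number of distinct source pages whose
    anchor text contains the query (a simplified stand-in for the
    'hyperlink ranking step' described in the abstract)."""
    needle = query.casefold()
    seen = set()       # (source, target) pairs already counted
    votes = Counter()  # target URL -> number of matching source pages
    for link in links:
        if needle not in link.anchor_text.casefold():
            continue  # anchor text says nothing about this query
        key = (link.source_url, link.target_url)
        if key in seen:
            continue  # count each linking page once per target
        seen.add(key)
        votes[link.target_url] += 1
    return votes.most_common()


# Hypothetical crawl data: Chinese-language pages linking to various sites.
LINKS = [
    Link("http://a.example/", "读者文摘", "http://www.rd.com/"),
    Link("http://b.example/", "《读者文摘》杂志", "http://www.rd.com/"),
    Link("http://c.example/", "读者文摘 订阅", "http://other.example/subscribe"),
    Link("http://d.example/", "click here", "http://www.rd.com/"),
]

results = rank_targets("读者文摘", LINKS)
for url, count in results:
    print(url, count)
```

With this sample data, the English-language homepage comes out on top because two Chinese pages describe it with the query term in their anchor text. The link from the fourth page adds nothing to the ranking, since its anchor text does not mention the keyword at all.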
This seems fairly simple, and the document spells out in much more detail how it could be implemented. Among other implications, it offers a good reason not to use “click here” as anchor text on your site, and to choose your anchor text carefully.