Penalizing Pages in Search Results Based upon Language (Except English)

Knowing something about the language used in a query might help a search engine decide which pages to show a searcher. A search engine wants to lead its users to pages they can read. A recent Microsoft patent application explores how language types can be used in ranking pages in search results.

Language types can be seen as a measure of relevance because they can help find pages relevant for a search. They are considered a “query-dependent” measure of relevance, because while the language type for a page can be identified before anyone performs a search that might include the page, the language used in the query influences which results are shown.

Query-independent measures, or attributes, are different. I wrote previously about a couple of other Microsoft patent applications which this one notes are related, in a post titled Ranking Search Results by File Type and by Click Distance.

Those two measures are considered “query independent,” because the words used in a query that returns those pages are irrelevant to the ranking method.

If an HTML document is preferred to a PDF file, it doesn’t matter what the search term was. If a page is one click away from a home page, as opposed to five clicks away, the search phrase used to find it could be anything, and that wouldn’t affect the ranking factor given to the page based upon click distance.

Query-dependent and query-independent attributes can be combined to determine the order of search results. Something like PageRank is query independent. Add to it an analysis of the anchor text used to point to pages, a query-dependent attribute, and you can see how the two might work together.

Another example of a query-dependent ranking function would be to count the number of times a search term appears in a document.
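To make the distinction concrete, here is a minimal Python sketch of how a query-dependent term count and a query-independent value might be combined. The function names, the simple weighted sum, and the `prior` values are my own assumptions for illustration; the patent application does not spell out this formula.

```python
def term_count(query, document_text):
    """Query-dependent: count how often the query's words occur in the document."""
    words = document_text.lower().split()
    return sum(words.count(term) for term in query.lower().split())


def combined_score(query, document_text, prior, prior_weight=1.0):
    """Add a query-independent prior (a stand-in for something like
    PageRank or click distance) to the query-dependent term count.
    The plain weighted sum is an assumption, not the patent's formula."""
    return term_count(query, document_text) + prior_weight * prior


# Hypothetical documents: (text, query-independent prior)
docs = {
    "home": ("eiffel tower tickets and eiffel tower hours", 0.5),
    "deep": ("eiffel tower history", 2.0),
}
for name, (text, prior) in docs.items():
    print(name, combined_score("eiffel tower", text, prior))
```

The prior for each page is the same no matter what is searched for, while the term count changes with every query.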

In addition to being about ranking pages by language type, this patent application is one of a number that look at different ranking factors, both query dependent and query independent.

The Language Type Ranking Patent Application

Ranking search results using language types

Invented by Dmitriy Meyerzon and Hugo Zaragoza

Assigned to Microsoft
US Patent Application 20060294100
Published December 28, 2006
Filed: April 26, 2006

Abstract

Search results of a search query on a network are ranked according to an additional ranking function for the prior probability of relevance of a document based on document property. The ranking function can be adjusted based on a comparison of the language that a document is written in and the language that is associated with a search query. Both query-independent values and query-dependent values can be used to rank the document.

A Summary of the Process

This is a way of ranking search results according to language. It aims to penalize documents that don’t match the language of the query, in a way which is independent of other ranking features.

1. A document’s language is identified by a statistical analysis of the page’s character distribution, which is compared to character distributions learned from training text in each language.

2. Language is detected rather than taken from metadata (such as language tags in HTML) because “language detection is a relatively straightforward procedure with high precision”, and because “metadata is often ambiguous or wrong, or missing.”

3. Language detection is normally performed during indexing (rather than crawling or serving search results).

4. A query’s language is taken either from browser request headers or from the client application (language preferences set in a browser, for example).

5. The language from the query is compared with the detected language of the document.

6. Matches of language happen when the document and query share a primary language (note that a German-Swiss query will typically be considered to match a German-German document), or if the document’s primary language is English.

7. So, pages written in languages a user can’t read are penalized, except for English documents (because of the assumption that most people that use the Internet can read English or understand different flavors of English).

8. The total ranking function is modified by this language type feature, which adjusts the ranking of documents based upon language matches between files and queries. That should improve the overall precision of the search engine.

9. User Feedback from previous queries can be used as relevance judgments to derive a weight of relevancy associated with each language type comparison.

10. That weight could be treated as a ranking function parameter, and the behavior of the performance measure on different values of the weight may be observed.

11. Once a language type comparison is performed on a page, the result of that comparison is incorporated into the score for the page.

12. The page’s score, incorporating the language type comparison, determines the page’s rank among the other pages within the search results.

13. Other document properties may affect the relevance of a document independent of the query (such as file type and size of the file).

14. Instead of individual language types, classes of language can be used, so that a document isn’t penalized when its language is in the same class as the query language. So, if the language on a page is determined to be Dutch, the language type stored in the index for the page can be Dutch or possibly German, because it can be assumed that German readers can read Dutch.
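The matching and penalty steps above can be sketched in a few lines of Python. This is only an illustration under my own assumptions: the names `LANGUAGE_CLASSES`, `MISMATCH_PENALTY`, `languages_match`, and `adjusted_score` are hypothetical, the penalty is a fixed subtraction, and the weight is made up (the patent application suggests deriving it from user feedback).

```python
# Hypothetical language classes: languages whose readers are assumed
# to be able to read each other's pages (illustrative only).
LANGUAGE_CLASSES = [{"de", "nl"}]

# Assumed penalty weight; in the patent application this would be tuned.
MISMATCH_PENALTY = 2.0


def primary(tag):
    """Reduce a tag such as 'de-CH' or 'de_CH' to its primary language, 'de'."""
    return tag.replace("_", "-").split("-")[0].lower()


def languages_match(query_lang, doc_lang):
    """A match means: same primary language, an English document,
    or both languages in the same class."""
    q, d = primary(query_lang), primary(doc_lang)
    if q == d or d == "en":
        return True
    return any(q in cls and d in cls for cls in LANGUAGE_CLASSES)


def adjusted_score(base_score, query_lang, doc_lang):
    """Leave matching documents alone and lower the score of the rest."""
    if languages_match(query_lang, doc_lang):
        return base_score
    return base_score - MISMATCH_PENALTY


# The query language might come from a browser's language preference.
print(adjusted_score(10.0, "de-CH", "de-DE"))  # Swiss German query, German page
print(adjusted_score(10.0, "fr-FR", "en-US"))  # English pages are never penalized
print(adjusted_score(10.0, "fr-FR", "de-DE"))  # mismatch, so the score drops
```

Because the penalty is applied separately from the base score, a very relevant page in the wrong language can still rank, just lower than it otherwise would.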

Conclusion

I’m beginning to wonder if we will see a separate patent application in the future from Microsoft on every ranking factor that they decide to use to rank web pages. I don’t think that’s a bad idea, but it does potentially provide a lot of insight into how they are ranking web pages.

I don’t know how safe the assumption is that most people who use the internet understand one of the different flavors of English.

It’s interesting that the process relies upon its own analysis of the language of a page rather than upon a language attribute or meta tag.
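For a rough idea of what detection by character distribution might look like, here is a toy Python sketch. It is not the method from the patent application: the cosine similarity measure, the function names, and the one-sentence training samples are all assumptions of mine, and a real detector would be trained on far more text.

```python
from collections import Counter
import math


def char_distribution(text):
    """Relative frequency of each letter in the text."""
    counts = Counter(c for c in text.lower() if c.isalpha())
    total = sum(counts.values())
    return {c: n / total for c, n in counts.items()}


def cosine(p, q):
    """Cosine similarity between two letter distributions."""
    dot = sum(p[c] * q.get(c, 0.0) for c in p)
    norm = (math.sqrt(sum(v * v for v in p.values()))
            * math.sqrt(sum(v * v for v in q.values())))
    return dot / norm if norm else 0.0


def detect_language(text, profiles):
    """Pick the language whose trained profile is closest to the text."""
    dist = char_distribution(text)
    return max(profiles, key=lambda lang: cosine(dist, profiles[lang]))


# Tiny, made-up training samples; far too small for real use.
SAMPLES = {
    "en": "the quick brown fox jumps over the lazy dog and then the dog sleeps",
    "de": "der schnelle braune fuchs springt über den faulen hund und dann schläft der hund",
}
profiles = {lang: char_distribution(text) for lang, text in SAMPLES.items()}

print(detect_language("wir lesen über schöne bücher", profiles))
```

Notice that nothing here looks at a page’s language attribute or meta tag; the guess comes entirely from the text itself.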


2 thoughts on “Penalizing Pages in Search Results Based upon Language (Except English)”

  1. I am not sure. Does it mean that English language queries won’t bring up other-language queries? What if I specifically want to find pages in my language, yet containing English definitions and terms? MSN, err, Live can’t do that now? I guess I am overly picky here. I’ll go use G for this ;)

  2. Reading back through my post, I probably should have started it with a couple of examples, Yuri.

    Keep in mind that Microsoft may not be doing this, and may never, but they thought it was important enough to file a patent for.

    Some examples:

    Searches for English phrases (ignore for now that you can set language preferences for the search engine or your browser).

    1. You enter an English phrase in the search box and hit enter.
    2. Any pages in the results that aren’t determined to be in English are penalized, or lowered in the search results.

    Searches for German phrases (ignore for now that you can set language preferences for the search engine or your browser).

    1. You enter a German phrase in the search box and hit enter.
    2. Any pages in the results that aren’t determined to be in German or English are penalized, or lowered in the search results.

    Notice that they aren’t penalizing, or pushing down, English results.

    Now there may be some results in languages different from the language used for the query, but they might still be very relevant for the phrase, so they possibly may still appear high in results. But they wouldn’t appear as high as they might otherwise if the process in this patent application was being used.

    Let’s ignore for a moment that English language results aren’t being pushed back like other language results.

    If you have your browser language preference set for “Français (France)” and you do a search for an English Language phrase, then some of the French results may be pushed down. For instance, a search for “Eiffel Tower” shows a number of English language results (and more French results). It’s possible that even though your browser setting indicates that you want French results, you still get some English results because the query phrase was in English.

    If there were some German pages that were very relevant for “Eiffel Tower” which should appear in the top ten, they might also be pushed down because the query was in English.
