How a Search Engine Might Handle Nicknames and Given Names

One challenge that search engines face is returning results for people who go by both a given name and a nickname, and expanding queries so that pages using either version can be found.

With a given name of William, I usually go by the name Bill, and people rarely refer to me as William (especially people who know me – with the occasional exception of my mom). I will use William on official government documents, resumes, and in other places that seem to call for a formal use of my name. Searches on my name at one of the major search engines will return some results referring to me as Bill, and a smaller number that refer to me as William. It would be nice if they included both, regardless of whether my search query used “Bill” or “William.”

What can make searches for nicknames more challenging is that a nickname of Bill might refer to a given name of William, Wilhelm, Wilfred, Guillaume, or Guillermo. Someone with the given name of William might also commonly use a nickname of Bill, Will, Willie, Billy, or others.

Can a search engine help a searcher find results for a person when the searcher knows only that person’s given name, or only their nickname?

A Yahoo patent application published this past week explains how the search engine might expand queries to include nicknames for a person when a given name is used in a query, or expand a search to include a given name when a nickname is used in a query.

Expanding queries for names like this is only useful when the results returned are actually relevant for the search. For example, someone looking for “Prince William” of England probably doesn’t want to see results for “Prince Wilhelm” of Germany. Also, when someone searches for “Bill Clinton” and the majority of relevant search results use “William Clinton” instead, the expanded query should treat results for “Bill Clinton” and “William Clinton” as equally relevant, so that the results returned aren’t a block of “Bill Clinton” pages followed by a block of “William Clinton” pages.

The patent application is:

Predictive Person Name Variants for Web Search
Inventors: Yumao LU, Fuchun Peng, Benoit Dumoulin
US Patent Application 20100312778
Published December 9, 2010
Filed: June 8, 2009

Abstract

Techniques for determining when and which name variant candidates to use to re-write a search query that includes a person’s name in order to provide the most relevant search results are provided. A determination is made whether a person name is present in a search query request entered by a user. Name variant candidates are generated for each person name. Then, the name variant candidates are ranked for each person name based upon one or more models that calculate a probability value for each name variant candidate. Based upon these rankings, the query may be re-written to include the original person name and a specified number of top ranked name variant candidates to present the user with the most relevant search results.

When a search is performed, and name variants are considered in a query expansion, the search engine shouldn’t just look in a dictionary to find all possible variants of a particular name. If the nickname “Bill” is part of the query, the search engine should not just expand the query to include all given names that might correspond to the nickname Bill, such as “William,” “Wilfred,” “Guillaume,” “Guillermo,” or “Wilhelm.”

Instead, probabilities should be used to try to find the most likely name variant to use.
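The rewriting step described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the patent's implementation: the function name `rewrite_query`, the `OR` grouping syntax, and the probability values are all assumptions made up for the example.

```python
def rewrite_query(query, name, ranked_variants, top_n=1):
    """Rewrite a query to include the original name plus the top-ranked variants.

    ranked_variants is a list of (variant, probability) pairs, already sorted
    from most to least likely. The "(a OR b)" syntax is a hypothetical query
    format, not one taken from the patent.
    """
    variants = [variant for variant, _ in ranked_variants[:top_n]]
    group = "(" + " OR ".join([name] + variants) + ")"
    # Replace only whole terms that match the name, never substrings.
    return " ".join(group if term == name else term for term in query.split())


# Hypothetical ranking for the nickname "bill"; the numbers are invented.
ranked = [("william", 0.9), ("guillermo", 0.06), ("wilfred", 0.03)]
print(rewrite_query("bill clinton president", "bill", ranked))
# (bill OR william) clinton president
```

Keeping the original name in the group means pages that use either form can match, while `top_n` limits how many lower-probability variants are allowed in.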

Identifying Names in Queries

The patent describes some models that could be used to determine that a person’s name is included in a query, including a Conditional Random Field model, a Hidden Markov Model, and a Support Vector Machine (SVM) model. It then describes some approaches to use to determine which variants to use in an expansion of the query.

Conditional Random Field (“CRF”) model

A Conditional Random Field (“CRF”) model may be used to label sequential data. A CRF engine may look at 250,000 previously submitted search queries, and tag each term of each query with a label that states whether the term is part of a person’s name. For example, with a search query of “bill clinton president,” the first term might be labeled as the beginning of a person’s name. The second might be labeled as the end of a person’s name, and the last might be labeled as not containing a name. Through machine learning, patterns identified in these previous queries might be used to identify beginning and ending names in new queries that aren’t labeled.
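To make the labeling scheme concrete, here is a small sketch of how name spans could be read off a labeled query. It does not train or run a CRF; it only shows the kind of per-term labels such a model would produce. The label names (`B-NAME`, `I-NAME`, `E-NAME`, `O`) and the function name are assumptions for illustration.

```python
def extract_names(terms, labels):
    """Collect person-name spans from per-term labels.

    Hypothetical labels: "B-NAME" begins a name, "I-NAME" continues it,
    "E-NAME" ends it, and "O" marks a term outside any name. Single-term
    names would need their own label and are not handled in this sketch.
    """
    names, current = [], []
    for term, label in zip(terms, labels):
        if label == "B-NAME":
            current = [term]
        elif label == "I-NAME" and current:
            current.append(term)
        elif label == "E-NAME" and current:
            current.append(term)
            names.append(" ".join(current))
            current = []
        else:
            # An "O" label (or an out-of-order label) abandons any open span.
            current = []
    return names


terms = ["bill", "clinton", "president"]
labels = ["B-NAME", "E-NAME", "O"]
print(extract_names(terms, labels))  # ['bill clinton']
```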

Hidden Markov Model (HMM)

A Hidden Markov Model (HMM) could be used to determine the presence of person names within a search query. This is a statistical model that can be used to find the part-of-speech of a given word. A word starting a query such as “the” might indicate that the next word in the sequence may be a noun 40% of the time, an adjective 40% of the time, and a number 20% of the time. Statistics like this could help determine what part of speech a word in a sequence might be. A model like this might be used to try to find the presence of persons’ names.
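As a rough sketch of how an HMM could tag the terms of a query, the example below runs Viterbi decoding (a standard way to find the most likely state sequence in an HMM) over two states, `NAME` and `OTHER`. Every probability here is invented for the example, and the patent application as summarized above does not say which decoding method would be used.

```python
def viterbi(observations, states, start_p, trans_p, emit_p, unseen=1e-6):
    """Return the most likely state sequence for the observed terms."""
    # best[i][s] = (probability of the best path ending in s at position i,
    #               state that path was in at position i - 1)
    first = observations[0]
    best = [{s: (start_p[s] * emit_p[s].get(first, unseen), None) for s in states}]
    for obs in observations[1:]:
        column = {}
        for s in states:
            column[s] = max(
                (best[-1][p][0] * trans_p[p][s] * emit_p[s].get(obs, unseen), p)
                for p in states
            )
        best.append(column)
    # Start from the most likely final state and follow the back-pointers.
    state = max(states, key=lambda s: best[-1][s][0])
    path = [state]
    for column in reversed(best[1:]):
        state = column[state][1]
        path.append(state)
    return list(reversed(path))


# Toy model: all numbers below are made up for illustration.
states = ["NAME", "OTHER"]
start_p = {"NAME": 0.5, "OTHER": 0.5}
trans_p = {
    "NAME": {"NAME": 0.6, "OTHER": 0.4},
    "OTHER": {"NAME": 0.2, "OTHER": 0.8},
}
emit_p = {
    "NAME": {"bill": 0.3, "clinton": 0.3},
    "OTHER": {"bill": 0.05, "president": 0.4},
}
print(viterbi(["bill", "clinton", "president"], states, start_p, trans_p, emit_p))
# ['NAME', 'NAME', 'OTHER']
```

Note that “bill” can be emitted by either state in this toy model; it is the following term, “clinton,” that tips the sequence toward a name.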

Support Vector Machine (SVM) model

A Support Vector Machine is another machine learning system used for classification. A set of data might be analyzed in which each word in the body of data has been assigned to one of two classes – a name or not a name. This training set might then be used to identify names in new data that hasn’t been classified.

Locating Variants of Names used in Queries

Once a person’s name has been identified in a search query, all possible name variants might be identified through one of two dictionaries:

1) a nickname to formal name dictionary, and
2) a formal name to nickname dictionary.

Information about names in these dictionaries might be culled from existing dictionaries such as a social security registry of names, or previous search queries used, or names found on the Web. Uncommon names might also be added such as unusual spellings of names, place names used as people’s names, and names used in places like popular culture periodicals.
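The two dictionaries can be pictured as simple lookup tables. The sketch below uses the Bill and William examples from earlier in this post; the variable and function names are hypothetical, and a real system would hold far more entries.

```python
# Tiny illustrative dictionaries built from the examples in this post.
NICKNAME_TO_FORMAL = {
    "bill": ["william", "wilhelm", "wilfred", "guillaume", "guillermo"],
}
FORMAL_TO_NICKNAME = {
    "william": ["bill", "will", "willie", "billy"],
}


def variant_candidates(first_name):
    """Return every candidate variant for a name, from either dictionary."""
    name = first_name.lower()
    candidates = NICKNAME_TO_FORMAL.get(name, []) + FORMAL_TO_NICKNAME.get(name, [])
    # dict.fromkeys removes duplicates while keeping the original order.
    return list(dict.fromkeys(candidates))


print(variant_candidates("Bill"))
# ['william', 'wilhelm', 'wilfred', 'guillaume', 'guillermo']
print(variant_candidates("William"))
# ['bill', 'will', 'willie', 'billy']
```

A lookup like this only produces candidates. Deciding which of them to actually use is the job of the ranking approaches described next.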

Determining the Highest Ranked Name Variants

There are a number of different algorithms that could be used to decide which variants of a name to use in a query expansion. The patent application describes a few options, including White Page Frequency, Statistical Translation Models, and Session Based Query Analysis.

White Page Frequency

Take a known list of names, such as the one from the Social Security Administration, that shows how popular the names of people in the United States were in a given year. Look up how often each name variant candidate appears in that list, and use those counts to compare the candidates.
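A frequency-based ranking along these lines might look like the sketch below. The counts are invented for the example and are not real Social Security Administration figures; the function name and the `top_n` cutoff are also assumptions.

```python
def rank_by_frequency(candidates, name_counts, top_n=2):
    """Rank variant candidates by their share of the counts in a name list."""
    total = sum(name_counts.get(c, 0) for c in candidates)
    if total == 0:
        return []
    scored = [(c, name_counts.get(c, 0) / total) for c in candidates]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]


# Hypothetical counts; a real list would come from a registry of names.
counts = {"william": 9000, "guillermo": 600, "wilfred": 300, "wilhelm": 100}
candidates = ["william", "wilhelm", "wilfred", "guillaume", "guillermo"]
for name, share in rank_by_frequency(candidates, counts):
    print(name, share)
```

With these made-up numbers, “william” takes most of the probability mass, so it would be the variant most likely to be added to a query containing “bill.”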

Statistical Translation Model

Take a body of information, such as all the words that appear on the Web, or a set of previous web searches, or words used in a collection of books, and break them down into sequences of phrases. For instance, in the index of web pages, break each page into four word phrases, like you might for this sentence:

- Take a body of
- a body of information
- body of information, such
- of information, such as
- information, such as all
- such as all the
- as all the words
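The four-word breakdown in the list above can be produced with a few lines of code. This is only a sketch: it splits on whitespace and leaves punctuation attached to words, as the list does, and the function name is made up.

```python
def word_ngrams(text, n=4):
    """Split text into overlapping sequences of n consecutive words."""
    words = text.split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


for phrase in word_ngrams("Take a body of information, such as all the words"):
    print("-", phrase)
```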

Understanding probabilities about the number of times that a given sequence of words might appear in the body of data being explored can tell us something about the use of words in that body of data. Probability values might be created involving name variants from this information:

By determining the number of times a name variant candidate appears within the corpus and within the context of the other terms in the search query, a probability value may be determined for each name variant candidate and rankings determined from those probability values.
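One simple way to picture that calculation is to count how often each candidate appears directly before another term from the query, then turn the counts into probabilities. The sketch below does this with two-word sequences over a toy corpus; the corpus sentences, the function name, and the choice of two-word sequences are all assumptions, and a real statistical translation model would be considerably more involved.

```python
from collections import Counter


def context_probabilities(candidates, context_term, corpus_sentences):
    """Estimate how likely each candidate is, given the term that follows it."""
    bigrams = Counter()
    for sentence in corpus_sentences:
        words = sentence.lower().split()
        bigrams.update(zip(words, words[1:]))
    counts = {c: bigrams[(c, context_term)] for c in candidates}
    total = sum(counts.values())
    return {c: (counts[c] / total if total else 0.0) for c in candidates}


# Toy corpus, invented for illustration.
corpus = [
    "william clinton was elected",
    "president william clinton spoke",
    "wilhelm clinton is rare",
    "william shakespeare wrote plays",
]
probs = context_probabilities(["william", "wilhelm", "guillermo"], "clinton", corpus)
for name in sorted(probs, key=probs.get, reverse=True):
    print(name, round(probs[name], 2))
```

The point of using context is visible in the last corpus sentence: “william shakespeare” adds nothing to the score for a “clinton” query, so the ranking depends on the whole query and not just on how common a name is.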

Session Based Query Analysis

Sets of previous queries that appear to have been used in the same search sessions can tell us something about the relationship between name variants. For example, someone searching for “Babe Ruth” might also search for “George Herman Ruth,” “Bambino,” and other related terms including nicknames and given names in the same search session. If those related queries are seen in a large number of other search sessions, that might boost the confidence that the names used are variants of each other.
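Counting how often two queries share a session is straightforward to sketch. The sessions below are invented, the function name is hypothetical, and a real system would also need a way to decide where one session ends and another begins.

```python
from collections import Counter
from itertools import combinations


def co_session_counts(sessions):
    """Count the number of sessions in which each pair of queries co-occurs."""
    pair_counts = Counter()
    for queries in sessions:
        # Normalize case and count each query once per session.
        unique = sorted({q.lower() for q in queries})
        pair_counts.update(combinations(unique, 2))
    return pair_counts


# Toy search sessions, made up for illustration.
sessions = [
    ["Babe Ruth", "George Herman Ruth", "Bambino"],
    ["babe ruth", "george herman ruth"],
    ["babe ruth", "yankees tickets"],
]
counts = co_session_counts(sessions)
print(counts[("babe ruth", "george herman ruth")])  # 2
print(counts[("babe ruth", "yankees tickets")])     # 1
```

Pairs that keep showing up together across many sessions are stronger candidates for being variants of the same name than pairs seen only once.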

Conclusion

A search for “Bill Clinton” on a search engine will return much more rewarding search results if they include results for “William Clinton” and even “Bubba Clinton.” This patent describes a few different algorithmic approaches that could be used to identify whether a query includes a person’s name, a way of using dictionaries to expand those names, and a number of algorithms to limit which name variants should be used to expand a query that includes a name.

The process described in this patent may help when you’re writing content for a page on a website that involves an individual, and that person often goes by a nickname. Rather than just relying upon a system like this, it’s probably not a bad idea to refer to the person by both their given name and the nickname, especially if there’s the possibility that someone searching for the page might use either variation.

18 thoughts on “How a Search Engine Might Handle Nicknames and Given Names”

  1. It’s amazing all the details that go into creating a search engine. My husband has created some small ones for car databases and he said he had a new found respect for all the little details that google and other companies like it take to turn out good results.

  2. Hi Bill/William…

    Interesting article, which has got me thinking about the power of a brand name. In the UK we have a major chain store called Marks and Spencer – however, many people now call it M&S for short. As a result, the company have evolved the brand massively over the past few years, focussing on its shortened brand name. SEO has obviously played a major part in this, as there are other companies using the same initials… whilst they still have to consider those people who continue to refer to them by the former name.

    A search engine’s ability to handle name alternatives is vital to those working in the SEO industry.

    Thanks for another great post.

  3. Hi Crystal,

    I think many people underestimate how much it takes to create a useful and effective search engine. We’ve been told by representatives of Google that they update their ranking algorithms in some way almost every day as well.

  4. Hi SEO Solihull,

    Thank you. I focused upon people’s names with this post because the patent primarily does as well, but it’s possible that there may be a similar process involving business names, where nicknames or tradenames for businesses might also be identified in a similar manner. I would imagine that there might be some differences in terms of sources where business names and brand names might be found, but some of the algorithms used to identify names in queries, and to identify variants might follow very similar approaches.

  5. I really enjoyed reading this post, I was just wondering do you trade featured blog posts. Thanks for sharing your Blog with others.

  6. Hi Bill,
    Very Interesting article, educating the newbie seo’s about the regular trends in seo is appreciable. Nice work.

  7. Hi, in my opinion such a change could generate some confusion, if a person wants to rank for his nickname should use the standard seo techniques, optimizing a page for a keyword (nickname) means that the owner wants to rank for the keyword.
    Best wishes to you Bill.

  8. This is an interesting post. I honestly wouldn’t have thought there needs to be a lot of work done on something like nicknames. If you are searching for someone you kind of know, it seems likely that the closely related searches coming up would come with those names closely related. And if you’re searching for someone famous, their nickname will probably come up easily too.

  9. Hi Jim,

    Thank you. I don’t usually “trade” featured blogs. I have written a few articles in the past on other sites, such as Search Engine Watch, and Search Engine Land, but I don’t usually feature articles written by someone other than me here.

  10. Hi John,

    Thanks. I’d guess that when I start mentioning things like Hidden Markov Models, and Support Vector Machines, that a lot of people might not consider topics like those to be a “newbie” topic. I did like that the patent described possible approaches like those in determining how names might be identified in a query though – it gives us some insight into how the algorithms behind the searches work.

  11. Hi Martin,

    The search engines are interested in finding relevant content in response to queries, and if an article about “Bill Clinton” uses “William” instead of “Bill,” it still may be a very relevant result for the person searching for the information, regardless of whether or not the person who created and published the article had the foresight to realize that they might want to use both President Clinton’s given name and nickname.

  12. Hi Sarah,

    Sometimes people who are very well known, such as Yogi Berra, aren’t very well known by their given names, and aren’t always referred to very often by those names. That’s where something like this approach might be useful.

  13. It certainly is amazing the lengths that major search engines will go to. It’s obvious that they see many areas that can still be improved when it comes to search. I think it’s very interesting how intelligent search engines could be in a few years!

  14. Hi Matthew,

    One of the things I think that the patent I’ve written about shows is that there are a lot of very small details that the search engines have to take into account when it comes to making search results more useful.