Yahoo Phrase Based Indexing in a Nutshell

Search engines are getting smarter about the phrases that they see and understand online, and Yahoo recently published a patent application that describes a number of the ways that they learn about and understand the use of phrases in documents on the Web.

Exploring how Yahoo might use phrases to rerank search results may show how it tries to understand data from published documents on the Web, and from log files that collect the queries people use when they search for information about different concepts.

From Keyword Matching to Phrase-Based Indexing

A page’s placement in search results for a query often depends on ranking criteria and algorithms that look at the keywords in that query, considering things like:

  • The number of occurrences of the query terms on a page,
  • How close those terms are to each other (proximity), and
  • The placement of the terms on a page (the location and types of elements those words may be within).

Those kinds of signals don’t take into account the context of the search terms, related to other words on the same page. They also don’t try to understand when queries are used as meaningful phrases.

Concepts and Contexts

The Yahoo patent application tries to determine the context of one or more terms as a concept or phrase, based on the other related phrases it is associated with on a page, to identify the most relevant pages in response to a given search query. I’ve written about a similar Google approach in the past in a post titled Phrase Based Information Retrieval and Spam Detection.

The Yahoo process is different, but there are a number of similar ideas.

System and method for determining concepts in a content item using context
Invented by Jignashu Parikh and John Thrall
Assigned to Yahoo
US Patent Application 20080033982
Published February 7, 2008
Filed: December 15, 2006

Abstract

The present invention is directed towards systems and methods for indexing one or more items of content. The method of the present invention comprises extracting one or more items of text from a given item of content.

The one or more items of extracted text are tokenized into one or more concepts. One or more related concepts associated with the one or more concepts are identified.

A support score is generated for the one or more concepts, and the item of content is indexed with the one or more concepts and the one or more associated support scores.

Identifying Concepts Associated with Pages

The patent filing gives us a number of technical details on how concepts, or phrases, might be identified. One of them is trying to capture meaningful phrases, or concepts.

A search engine may pull content from web pages using a text extractor program, which may also collect metadata and other information related to specific pages.

That text extractor may also look for content left behind by the readers of pages – user generated tags that identify and describe what a page might be about.

To break the words on a page into phrases, the search engine may look at runs of one, two, three, or more consecutive words, which it would try to match against phrases appearing in a concept dictionary.

So, a sentence like, “The quick brown fox jumps over the lazy dog,” might be broken down into a number of phrases like:

  • The quick brown
  • Quick brown fox
  • Brown fox jumps
  • Fox jumps over
  • Jumps over the
  • Over the lazy
  • The lazy dog
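
A minimal sketch of that splitting step, assuming simple whitespace tokenization and lowercasing (the patent filing doesn't spell out a tokenizer, and the function name word_ngrams is my own):

```python
import string

def word_ngrams(text, min_n=1, max_n=3):
    """Return every run of min_n..max_n consecutive words, lowercased."""
    # Assumption: split on whitespace and strip ASCII punctuation from each word.
    words = [w.strip(string.punctuation) for w in text.lower().split()]
    words = [w for w in words if w]
    phrases = []
    for n in range(min_n, max_n + 1):
        for start in range(len(words) - n + 1):
            phrases.append(" ".join(words[start:start + n]))
    return phrases

sentence = "The quick brown fox jumps over the lazy dog."
# Three-word phrases only, matching the list above.
trigrams = word_ngrams(sentence, min_n=3, max_n=3)
# One-, two-, and three-word phrases together.
all_phrases = word_ngrams(sentence)
print(trigrams)
```

A nine-word sentence yields seven three-word phrases, and widening the window to include single words and pairs gives the search engine more candidates to check against its dictionary.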

The text and the tags describe what those pages are about, and are sent to a program that takes that data and uses an aboutness extractor to break the text down and match it with keyword phrases maintained in a concept dictionary, to see if they are listed there as concepts.

The keyword phrases in the concept dictionary are ones that appear frequently in places like the Web or a database list of user queries.

The aboutness extractor breaks text taken from a page into tokens (like “the quick brown fox”) and identifies how frequently keyword phrases found within a “concept dictionary” appear in that page. Keyword phrases that are located both in the concept dictionary and on a specific page comprise “concepts.”

The concepts appearing on a page and their frequency of use upon that page are identified and maintained by that aboutness extractor.

If none of the phrases extracted from pages also appear in the concept dictionary, they may not be used as concepts for this concept based indexing.
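
Here is a rough sketch of how matching page phrases against a concept dictionary and counting them might look. The dictionary contents, the sample page text, and the function name concept_frequencies are all made up for illustration; the patent filing doesn't give an implementation.

```python
import string
from collections import Counter

def concept_frequencies(page_text, concept_dictionary, max_n=3):
    """Count how often each dictionary phrase appears on a page."""
    words = [w.strip(string.punctuation) for w in page_text.lower().split()]
    words = [w for w in words if w]
    counts = Counter()
    for n in range(1, max_n + 1):
        for start in range(len(words) - n + 1):
            phrase = " ".join(words[start:start + n])
            # Only phrases found in the dictionary count as concepts.
            if phrase in concept_dictionary:
                counts[phrase] += 1
    return counts

# Hypothetical dictionary; real entries would come from the Web or query logs.
concept_dictionary = {"wireless laptop", "notebook", "desktop computer"}
page_text = "Wireless laptop deals: compare every wireless laptop and notebook here."
page_concepts = concept_frequencies(page_text, concept_dictionary)
print(page_concepts)
```

Phrases on the page that aren't in the dictionary (such as "laptop deals") are simply ignored, which mirrors the idea that only dictionary phrases are treated as concepts.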

In addition to concepts, this system looks at the context that those concepts are used within.

The aboutness extractor identifies the frequency with which related concepts appear on a given page. A context dictionary maintains information identifying related concepts – one or more keywords and/or phrases associated with a given concept.

Phrases that appear upon the same page together regularly may be more likely to be related to a specific topic, and using the context dictionary to identify them might help in relating those pages to that topic.

Contexts Identified as Query Refinements

The context dictionary contains information about keyword phrases that people have used to refine their search queries. The context dictionary might also be used to store information identifying search queries during a given time period or query session.

For example, a search engine may receive a number of queries during a given time period. Those may be monitored to identify related keyword phrases, and to store them in the context dictionary.

When someone searches and their results aren’t relevant, or they receive too many results, they might change the search terms they used to redefine or narrow the scope of their query.

The refined terms might be sent to a context dictionary, where they might be considered as indications of related concepts (keyword phrases) and possibly associated with a concept in the concept dictionary.

Example

A searcher may look for “Toyota” and receive a number of results. They may then search for “Honda,” followed by a query for “Mitsubishi.” Similarly, another user may search for “Mitsubishi” and receive results, and then look for the term “Toyota.”

The search engine may monitor searchers’ queries and determine that the terms “Honda,” “Mitsubishi,” and “Toyota” frequently appear in queries submitted by users during a given time period or query session. The search terms and queries and related keyword phrases may be maintained in the context dictionary.

These related keyword phrases may be shown to searchers as suggested query refinements.
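
To make the query session idea concrete, here is a small sketch that treats queries appearing together in enough sessions as related. The session data, the min_sessions cutoff, and the function name are my assumptions, not details from the patent filing.

```python
from collections import Counter
from itertools import combinations

def related_from_sessions(sessions, min_sessions=2):
    """Build a context dictionary from queries that share sessions."""
    pair_counts = Counter()
    for queries in sessions:
        # Count each pair of distinct queries once per session.
        unique = sorted({q.lower() for q in queries})
        pair_counts.update(combinations(unique, 2))
    context = {}
    for (a, b), count in pair_counts.items():
        if count >= min_sessions:  # hypothetical cutoff
            context.setdefault(a, set()).add(b)
            context.setdefault(b, set()).add(a)
    return context

# Made-up query sessions for illustration.
sessions = [
    ["Toyota", "Honda", "Mitsubishi"],
    ["Mitsubishi", "Toyota"],
    ["Honda", "Toyota", "lawn mower"],
]
context_dictionary = related_from_sessions(sessions)
print(context_dictionary["toyota"])
```

With this toy data, "toyota" ends up related to "honda" and "mitsubishi," while "lawn mower" only shows up alongside the others once and is left out.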

Co-occurrence

If these related phrases appear upon the same page, they may also help to provide a “context” for that page, helping the search engine know what the page is about.

The context dictionary might maintain information identifying frequently co-occurring keyword phrases on the Web, and upon a page. The search engine may monitor web pages to identify keyword phrases that frequently co-exist on the same pages.

Example

The search engine may look at advertisements and web pages on sites and determine that keyword phrases “interest rates” and “mortgage” frequently co-appear in advertisements and web pages.

Similarly, the search engine may decide that the keyword phrases “patent” and “intellectual property” frequently co-appear on pages. Those frequently co-appearing keyword phrases may be recorded by the context dictionary.

Co-Occurrence and Human Editors

The index of co-appearing keyword phrases in the context dictionary may be supplemented with information from a human editor, who identifies keyword phrases related to a given concept.

For example, a human editor may submit an index entry pair comprising the phrase and keyword “notebook computer, laptop,” indicating that the phrase “notebook computer” is associated with the keyword “laptop.”
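
As a sketch of how page co-occurrence and editor-supplied pairs might be combined, consider the following. The sample pages, the phrase list, the min_pages cutoff, and the crude substring check are all my assumptions for illustration.

```python
from collections import Counter
from itertools import combinations

def cooccurring_pairs(pages, phrases, min_pages=2):
    """Return phrase pairs that appear together on at least min_pages pages."""
    counts = Counter()
    for text in pages:
        lowered = text.lower()
        # Assumption: a plain substring check stands in for real phrase matching.
        present = sorted(p for p in phrases if p in lowered)
        counts.update(combinations(present, 2))
    return {pair for pair, n in counts.items() if n >= min_pages}

pages = [
    "Compare mortgage offers and today's interest rates.",
    "Interest rates fell, so mortgage applications rose.",
    "A patent is one form of intellectual property.",
]
phrases = {"mortgage", "interest rates", "patent", "intellectual property"}

# Hypothetical editor submissions are added without a frequency check.
editor_pairs = {("notebook computer", "laptop")}
related_pairs = cooccurring_pairs(pages, phrases) | editor_pairs
print(sorted(related_pairs))
```

In this toy example "patent" and "intellectual property" share only one page, so they fall below the cutoff, while the editor's pair is included regardless of how often it appears.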

Related Concepts

The “aboutness extractor” retrieves keyword phrases from the context dictionary associated with concepts from sites. The keyword phrases retrieved from the context dictionary are “related concepts” with respect to the concepts on a site.

The aboutness extractor will then look at sites to determine if related concepts in the context dictionary are present or absent, and store that information.

Using Frequency and Relatedness to Determine a Support Score for Phrases

The aboutness extractor uses the frequency of concepts appearing on a page, as well as information about the presence or absence of related concepts associated with a given concept on a page, to calculate a support score for concepts on a page.

The aboutness extractor may also use information from another information data store to calculate a support score for a given concept. That other information data store may contain human editorial information about the concepts that could be used to appropriately increase or discount (not all human editorial information is reliable) the support score for a given concept.

It could also maintain tags or metadata generated by users describing a given content item. If user generated tags or metadata for a page are similar or match concepts identified for a page, the support score for that page for that concept may be increased.
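
The patent filing doesn't give a formula for the support score, so the following is only a sketch of one way the ingredients above could be combined. The weights and the function name are invented for illustration.

```python
def support_score(frequency, related_present, related_total,
                  tag_match=False, related_weight=2.0, tag_boost=1.0):
    """Combine concept frequency, related-concept coverage, and tag agreement.

    The weights are hypothetical; the patent filing does not specify them.
    """
    # Share of the concept's related concepts that also appear on the page.
    coverage = related_present / related_total if related_total else 0.0
    score = frequency + related_weight * coverage
    if tag_match:
        # User tags that match the concept raise the score.
        score += tag_boost
    return score

# "wireless laptop" appears 3 times, 1 of its 2 related concepts is on the
# page, and a user tag matches the concept.
with_context = support_score(3, 1, 2, tag_match=True)
# Same frequency, but no related concepts and no matching tag.
without_context = support_score(3, 0, 2)
print(with_context, without_context)
```

The point of the example is only that two pages using a phrase equally often can end up with different scores once related concepts and tags are taken into account.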

Concept Index

The search engine then creates a concept index with entries identifying the one or more concepts associated with pages and their support scores for concepts.

Dominant Concepts

A search engine might then create an index for “dominant” concepts associated with the pages received by a search engine. A dominant concept or concepts may be the concepts associated with one or more items of content (pages) with the greatest associated support scores, or support scores over a certain support score threshold.

The patent application tells us that there are a number of well known models that could be used to determine “dominant” concepts for pages.
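
The two simple options described above (highest score, or scores over a threshold) might be sketched like this. The scores and the function name are made up, and this is not one of the "well known models" the filing alludes to.

```python
def dominant_concepts(concept_scores, threshold=None):
    """Pick a page's dominant concepts from its support scores."""
    if not concept_scores:
        return []
    if threshold is None:
        # No threshold given: keep the concept(s) with the greatest score.
        best = max(concept_scores.values())
        return sorted(c for c, s in concept_scores.items() if s == best)
    # Otherwise keep every concept scoring above the threshold.
    return sorted(c for c, s in concept_scores.items() if s > threshold)

# Hypothetical support scores for one page.
page_scores = {"wireless laptop": 5.0, "notebook": 3.5, "battery": 1.0}
top_only = dominant_concepts(page_scores)
over_three = dominant_concepts(page_scores, threshold=3.0)
print(top_only, over_three)
```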

Matching Concepts with Queries During Searches

The search engine would sort through the concept index to find concepts matching or similar to the terms in a query received from a searcher.

Pages associated with the concepts matching or similar to a searcher’s query terms may be selected and added to a search result set. Those added pages may be sorted in descending order according to the support scores for the concepts with which the pages are associated. The sorted result sets may then be sent back to the searcher.

Dictionary Manager

A dictionary manager could provide periodic updates to the concept dictionary, context dictionary and the “other information” data store at the search engine.

When a searcher refines their query by adding words or removing words or modifying them, the query refinement might be sent to the concept dictionary. If pages are updated with additional content, the updates may be sent to the context dictionary to ensure the context dictionary maintains the most relevant information on keyword phrases associated with concepts.

Further, if a user submits a query to a search engine, the dictionary manager may update the concept dictionary query logs to reflect the user’s query.

How Concepts Associated with Pages are Identified

You’ll find lots of different types of documents in a search index, from static pages to blog posts to advertisements, as well as audio and video and ecommerce platform product pages.

In this concept index approach, a document is first selected from pages linked to in the search index.

Concepts (keyword phrases) associated with a selected page are identified by seeing if they are also listed in a concept dictionary which contains concepts taken from query logs or a body of documents like the Web.

Concepts and related concepts on the page are identified, as are keyword phrases associated with those concepts. Related concepts are kept in a context dictionary, and are identified using a number of different information retrieval techniques.

Keyword phrases related to concepts might be created using information taken from user query sessions involving one or more terms over a certain period of time or during a given search session.

A context dictionary might also be created using information from human editors, who may identify keyword phrases associated with concepts on pages.

A support score is calculated for concepts identified on pages. These scores are numbers indicating the relevancy of a page to a specific concept, and to related concepts.

An index may be created where an index entry is made up of a page, concepts associated with the page, and respective support scores for the concepts. The index generated may be used to locate pages responsive to search requests, such as user search queries.

Calculating Support Scores for Concepts Associated with Pages

I need to go back over some of the processes described above a little, to explain how support scores are calculated.

A page is selected and its text is extracted, possibly along with tags (user created metadata) associated with the page.

The extracted text is broken into concepts (keyword phrases), identified using a concept dictionary. Concept keyword phrases in the concept dictionary are keyword phrases frequently appearing on the Web or in query logs.

A concept is selected from the one or more concepts associated with a page by looking at the frequency with which the concept appears on that page (how often the phrase “wireless laptop” shows up on the page, for example).

Related concepts are identified using a context dictionary index. For example, concepts related to “wireless laptop” may be the keyword “computer,” as well as the keyword “notebook.”

The page is searched to see if it contains those related concepts associated with the concept selected.

The frequency of the concept on the selected page and the presence or absence of related concepts are then used to calculate a support score for the selected concept.

A support score is calculated using a combination of concept frequency, a term frequency/inverse document frequency (“TF/IDF”) measure using the one or more terms comprising a concept, and query log history.

The TF/IDF measure looks at how frequently a phrase is mentioned on the page compared to how frequently it might be mentioned in a body of documents like Yahoo’s Web index, or query log history.
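
For readers who haven't seen it written out, here is one common textbook form of TF/IDF. The add-one smoothing in the denominator and the sample numbers are my assumptions; the patent filing doesn't say which variant Yahoo might use.

```python
import math

def tf_idf(term_count, total_terms, docs_with_term, total_docs):
    """Term frequency on the page times inverse document frequency."""
    # How much of the page is made up of this phrase.
    tf = term_count / total_terms
    # Rare phrases across the collection get a larger weight.
    # Assumption: add-one smoothing avoids dividing by zero.
    idf = math.log(total_docs / (1 + docs_with_term))
    return tf * idf

# A phrase used 5 times on a 100-word page, in a 1,000-document collection.
rare_phrase = tf_idf(5, 100, docs_with_term=9, total_docs=1000)
common_phrase = tf_idf(5, 100, docs_with_term=499, total_docs=1000)
print(rare_phrase > common_phrase)
```

A phrase that shows up on nearly half of all documents tells the search engine less about a page than one that appears on only a handful, even when both are used equally often on the page itself.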

When all concepts associated with the page selected have been analyzed, a concept index entry is made, taking the page selected, the concepts associated with the page, and the support scores corresponding to the concepts on the page.

The concept index may identify dominant concepts associated with each page. Dominant concepts are ones that may have a support score exceeding a given amount or threshold.

Using Concepts to Find Relevant Search Results

Again, repeating some of the processes described above, here’s a summary of how concepts can be used to provide a searcher with relevant pages using concepts associated with pages listed in the concept index.

The search engine receives a searcher’s query.

The query is parsed to identify the terms within the query.

A concept index is accessed, which contains pages, concepts associated with pages, and support scores for the concepts on those pages.

A page is selected from the pages listed in the concept index.

A check is performed to see if the concepts associated with the page match or are similar to the query.

For example, a query may be performed for the term “desktop computers.” A concept from a page in the concept index might be the phrase “wireless desktop computer,” matching the query “desktop computers.”

If the concepts associated with the page in the concept index match or are similar to the terms comprising the query, the content item and concepts matching or similar to the query, as well as support scores associated with the matching or similar concepts, are added to a result set.

After pages in the concept index have been analyzed, the pages in the search result set are sorted in descending order according to the support scores corresponding to the concepts with which a given page is associated, and are returned to the searcher.
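
The steps above can be sketched as a small search over a toy concept index. The index contents, the crude plural stripping, and the rule that a concept is "similar" when it contains every query word are my assumptions; the patent filing leaves the similarity test open.

```python
# Hypothetical concept index: page -> {concept: support score}.
concept_index = {
    "page-a": {"wireless desktop computer": 4.5, "keyboard": 1.0},
    "page-b": {"desktop computers": 6.0},
    "page-c": {"ice cream": 3.0},
}

def normalize(word):
    # Crude plural stripping, an assumption made only for this example.
    return word[:-1] if word.endswith("s") and len(word) > 3 else word

def is_similar(concept, query):
    """Treat a concept as similar if it contains every query word."""
    concept_words = {normalize(w) for w in concept.lower().split()}
    query_words = {normalize(w) for w in query.lower().split()}
    return query_words <= concept_words

def search(query, index):
    results = []
    for page, concepts in index.items():
        scores = [s for c, s in concepts.items() if is_similar(c, query)]
        if scores:
            # Keep the best matching concept's support score for the page.
            results.append((page, max(scores)))
    # Sort in descending order of support score.
    return sorted(results, key=lambda r: r[1], reverse=True)

results = search("desktop computers", concept_index)
print(results)
```

Here the page about a "wireless desktop computer" is returned for "desktop computers," but it ranks below the page whose matching concept has the higher support score.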

Using Concepts to Find Related Pages

In addition to helping with search results, this system may be used to provide integration of related pages.

Someone searches, and a concept index is created, with entries identifying concepts associated with pages. Phrases within the search query may be used to identify pages with concepts (keyword phrases), which match or are similar to the terms in the user search query.

Results are returned to the searcher, using the concept index approach.

If the searcher chooses a page from the results, concepts related to that page are identified.

A search can then be performed to identify pages in the concept index associated with the concepts that are associated with the selected page. So, a searcher may select a link to a page associated with concepts A, B, and C.

After that selection by the searcher, a search may be performed to identify one or more pages (content items) in the concept index that are associated with concepts A, B, or C.

The page is sent to the searcher, along with links to pages identified as being associated with the concepts associated with the original page.

The searcher is presented with the page selected, as well as links to pages, or portions of the pages, which are related to the selected page.
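
A short sketch of that related pages lookup, using the concepts A, B, and C from the example above. The index contents and the choice to order related pages by the number of shared concepts are my own assumptions.

```python
def related_pages(selected_page, index):
    """Find other pages sharing at least one concept with the selected page."""
    selected = set(index[selected_page])
    related = []
    for page, concepts in index.items():
        if page == selected_page:
            continue
        shared = selected & set(concepts)
        if shared:
            related.append((page, sorted(shared)))
    # Assumption: pages sharing more concepts are listed first.
    related.sort(key=lambda item: (-len(item[1]), item[0]))
    return related

# Hypothetical concept index: page -> {concept: support score}.
concept_index = {
    "page-1": {"A": 5.0, "B": 3.0, "C": 2.0},
    "page-2": {"A": 4.0, "D": 1.0},
    "page-3": {"B": 2.5, "C": 2.0},
    "page-4": {"E": 6.0},
}
links = related_pages("page-1", concept_index)
print(links)
```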

Conclusion

Understanding concepts, in the form of phrases that appear on pages, may help a search engine understand the pages that a searcher is looking for when they submit a query.

For example, someone searching for “ice cream” isn’t just looking for pages where the words “ice” and “cream” appear together.

So, a page that contains the actual phrase “ice cream” plus some related phrases such as “desserts,” or “ice cream parlor,” or “homemade waffle cones,” is more likely a match for a search for “ice cream” than a page describing someone slipping on ice, when going to a store for cream for their coffee, that doesn’t contain related phrases like that.

A page that includes phrases such as “double scoop,” “gelato,” “waffle cones,” “chocolate,” but not the phrase “ice cream” might also be seen as a “related” page for pages about “ice cream.”

What does this mean for people writing web pages? For one thing, it might help to use the words a page is about together as a phrase a few times on the page, rather than mostly using those words separately from each other. For another, it can help to include related phrases, so the search engines can understand the context of those phrases.

24 thoughts on “Yahoo Phrase Based Indexing in a Nutshell”

  1. Well an ‘aboutness extractor’ and ‘concept and context dictionaries’ huh? I simply must get a list of whacky search patent terms going on, I like those ones. Kind of interesting they are looking at search query refinements for context/meaning… not so sure how effective that would be long term. It does though start to bring user performance metrics and temporal factors into the phrase based world which is certainly interesting. There is much that I am finding interesting in this approach, certainly worth digging into.

    It actually sounds like this one is a little more comprehensive than the ones from G (Anna Patterson)… I am thinking I might revisit the Google methods and do a little comparative analysis.

    Oh… and happy Valentines to U and the little lady :0)

  2. Hi Dave,

    I do see a lot of similarities between this approach, and the Anna Patterson one from Google. This one does give us some more details about the mechanics behind what they might be trying to do, when it comes to indexing, including looking at query refinements (like that touch, but I think that it would need to be implemented carefully). The Google one gave us some more details involving related phrases in anchor text…

    But I’d really like to see a little comparative analysis if you create one.

    Thanks for the Valentine’s well wishes from Kimberly and me. I hope that Valentine’s Day in your house is a special one, too.

  3. Super post. But it must have taken you quite some time to write this up.

    I think the interesting thing is that data is more and more being pulled from multiple sources to determine what the searcher is likely looking for. I think there are some great ideas for any SEO in the above, most especially when considering keywording and copy writing.

    Hope all is well Stateside :)
    Rgds
    Richard

  4. Hi Richard,

    Thanks. It took a while to sort through the patent filing – I started on Thursday, and trimmed out as much as I could.

    I like that they are using information from different resources in a number of ways – the content and meta data that they mine from pages, the tags and other annotations from visitors to those pages, the human input from editors, as well as query session information to identify related concepts.

    I suspect that there are other ways that they might use this information, and other information that they are looking at that isn’t included in the patent filing.

    There are a lot of nice hints in the document for keyword research and copywriting about the value of related concepts.

    Cheers.

  5. The concept approach probably makes search results far more accurate. Cheers for the explanation, I didn’t know search results were so refined.

  6. Superb detail about the under-the-covers processing – and a great effort to bring it to the readers, Bill. The extraction of concepts from pages would go a long way to understand with little ambiguity what the pages are about. But I also want to remind about the need to understand the query. The patent app refers to the “query terms” – but unless those are understood unambiguously, the user will still not be happy with search results. While the pages have context to disambiguate their “aboutness”, the query usually does not. I just thought up a search for “iraq war and peace plan” – if the same phrase-extraction approach is applied to it, some pages about Tolstoy’s novel may come up along with the Iraq war-related ones, which wouldn’t be relevant. (Or would it?)

  7. Hi Bill, does this mean that it would be possible for a search engine to realize that a page that contains a phrase like “Quick Brown Fox” is about “learning to type” ?

    Thanks,
    Tim

  9. Great post and extremely indepth. I too, would love to see some comparative analysis done between this and G.

    I’d do it myself, but I’m too busy reading huge posts like this and planning my wife’s VD present.

    Because I’m juvenile like that, her card will read “Happy VD Honey!”

    Happy VD Bill.

  10. Excellent post, must have taken a while to type this one up! The tags that social sites like stumbleupon and others allow users to attach to sites will also start featuring more prominently….instead of trying to emulate human response with algorithms you simply have them do the work…needs loads of refining but like where this is going!

  11. Bill, thanks so much for going in depth on this. I love reading about PaIR and its fundamentals. I see so many similarities between PaIR and LSI/A that it seems like sometimes they blend together as far as their core principles go. Maybe a good future post would be about differentiating the two?

    Nevertheless, I see it essential to use natural semantic language in all your copy, tags, and anchor text to meet relevancy standards.

  12. @ Dara, Thanks. A search engine understanding phrases and concepts better may be a huge move forward.

    @ Dmitri, Good point about efforts to understand query terms. There’s been talk, in a few of Google and Yahoo patent applications about exploring what is meant by the use of specific queries when they are entered into a text box, and when they are refined by searchers.

    Those steps in understanding a query may look at the actions of other searchers, and they may look at what that particular searcher is interested in as reflected in search histories and other places. I’ve written about a number of these kinds of patent applications, and you can see many of them in my “search queries” category.

    @ Tim (Local Hound), That’s a good example. If the search engine sees the phrase “Quick Brown Fox” on enough pages that are related to learning to type, then it might make that association. The Google patent application that I point at above discusses the idea of what “good” phrases are, and what “bad” phrases might be, and the determination of which phrases to include as meaningful in a phrase-based index can hinge on that distinction. Another place that “Quick Brown Fox” appears frequently is on pages that show samples of different kinds of fonts, so it may be a fairly frequently appearing phrase.

    @ Judd, Thank you. I hope that your Valentine’s Day celebration was a nice one. I’d love to see a more indepth comparison. It would be great if Dave did one, as he mentions considering, in the first comment. Otherwise, I might try to put something together if I have a chance.

    @ Jacques, It did take a while for this post – I spread it out over a few days. Tagging is an interesting problem, because like meta keywords, it isn’t necessarily always reliable. People often tag based upon their relationship with a page (like “toread”), or are too lazy to come up with good tags, or reuse the tags of the people who tagged before them, even though those tags may not be too good, or may even maliciously tag something with inappropriate words based upon dislike or anger or misunderstanding or some agenda.

    @ Jordan (Utah SEO Pro), Thank you. If you have some spare time, you might want to explore a couple of other topics. Take a look at probabilistic latent semantic indexing (pdf) sometime, instead of lsi/lsa. Think you might find that interesting. Another to look at is from William Woods – Conceptual Indexing: A Better Way to Organize Knowledge (pdf)

  13. Is this Yahoo’s attempt at defeating search engine spammers?
    Could this system have been created to stop keyword stuffers and page generators?

  14. Tagging is definitely open to misuse / disuse. The trick would be to educate the users of social bookmarking sites / services on adopting effective tagging practise, maybe an article on basic parameters to adhere to might be on the cards? Keep up the good work!

  15. @ SEO Ninja

    Not necessarily an attempt aimed at search engine spammers as much as an effort to find pages that are relevant for specific concepts. It might have a side effect of stopping keyword stuffing and page generators.

    @ SEO Jacques

    One of the things about tagging that researchers have come to find is that tags are often as much about the relationship between the person tagging and the content as they are about the subject matter of the content itself.

    For example, if I have a project that I am researching, I might use a specific tag to identify all of the documents that I bookmarked that I think might be helpful for my research. The tag I use might not describe the content of the pages that I bookmarked at all, yet my use of tags isn’t abusive or misleading at all.

  16. Been researching how the search engines work and after reading this document it scares me as it is so much more difficult than I thought. It makes sense why we outsource to other companies as opposed to doing this on our own. Interesting reading, your blog.

  17. Hi Kweezel,

    There have been a few people over the past few years who have proclaimed that “SEO isn’t rocket science,” and they are correct – it’s computer science, information technology, linguistics, marketing, and a number of other disciplines rolled together. There are a number of basics to SEO, and I would urge anyone with a site to try to understand those as much as possible. It can help :)
