Searching with Pronouns: What are they? Coreferences in Followup Queries

At Google’s 15th anniversary celebration last summer, shortly after Hummingbird was introduced, Tamar Yehoshua, Google’s VP of Search, demonstrated conversational search at Google by first asking for “pictures of the Eiffel Tower” and then following up with the query “How tall is it?”

Looking through the base of the Eiffel Tower.

In that second query, Google had not only to remember that the Eiffel Tower was being asked about, but also to recognize the Eiffel Tower when it was referred to as “it.” That is part of the new “conversational search” Google is now engaging in, using something known to linguists as a “coreference.” I wanted to write about coreferences to clear up any confusion people might have about them.

I was inspired to do that after reading an article from Eric Enge earlier today, in which he wrote about Knowledge Graph Advances From Google.

In his article’s section on “query sequences,” he tells us:

One last sequence to show for today, which is an extended query sequence. This shows how Google can maintain the context of a conversation, and also how it can understand the context of a response. First, we start by asking the height of the empire state building

He then refers to a query similar to the one Tamar Yehoshua used in the presentation above, where she asked, “How tall is it.” He instead asks for “Pictures?” and gets pictures of the Empire State Building.
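One way to picture what the search engine has to do with these follow-up queries is to carry the entity from the previous query forward and substitute it where the new query uses a pronoun, or attach it when the new query (like “Pictures?”) has no subject at all. The sketch below is purely my own illustration of the idea, not Google’s actual implementation:

```python
# Illustrative sketch only -- not how Google actually does it.
# Carries the entity from a previous query into a follow-up query
# that uses a pronoun (or omits a subject entirely).

PRONOUNS = {"it", "he", "she", "they", "this", "that"}

def rewrite_followup(previous_entity, query):
    """Replace a pronoun in a follow-up query with the prior entity,
    or append the entity when the query has no subject at all."""
    words = query.lower().rstrip("?").split()
    if any(w in PRONOUNS for w in words):
        rewritten = [previous_entity if w in PRONOUNS else w for w in words]
        return " ".join(rewritten)
    # A bare query like "Pictures?" gets the prior entity attached.
    return " ".join(words + ["of", previous_entity])

print(rewrite_followup("the empire state building", "How tall is it?"))
# -> how tall is the empire state building
print(rewrite_followup("the empire state building", "Pictures?"))
# -> pictures of the empire state building
```

A real system would need far more than word substitution (it has to decide whether a pronoun refers back at all, and to which of several candidates), but the rewriting step itself is this simple in spirit.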

What is going on here is something linguists call coreference; Wikipedia offers a longer definition:

In linguistics, coreference (sometimes written co-reference) occurs when two or more expressions in a text refer to the same person or thing; they have the same referent, e.g. Bill said he would come; the proper noun “Bill” and the pronoun “he” refer to the same person, namely to Bill.[1]

Coreference is the main concept underlying binding phenomena in the field of syntax. The theory of binding explores the syntactic relationship that exists between coreferential expressions in sentences and texts.

When two expressions are coreferential, the one is usually a full form (the antecedent) and the other is an abbreviated form (a proform or anaphor). Linguists use indices to show coreference, as with the i index in the example Bill_i said he_i would come. The two expressions with the same reference are coindexed. Hence, in this example, “Bill” and “he” are coindexed, indicating that they should be interpreted as coreferential.
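The coindexing idea can be sketched in a few lines of code. This toy resolver (my own illustration, far simpler than anything a production system would use) links each pronoun to the most recent preceding mention, which is enough to coindex “Bill” and “he” in the example sentence:

```python
# Toy anaphora resolver -- an illustration of coindexing, nothing more.
# It links each pronoun to the most recent preceding mention,
# the simplest possible antecedent-selection rule.

PRONOUNS = {"he", "she", "it", "they"}

def coindex(tokens, mentions):
    """Return (token, index) pairs, giving each pronoun the index of
    the most recent mention -- i.e., coindexing them."""
    indexed = []
    last_index = None
    for tok in tokens:
        if tok in mentions:
            last_index = mentions[tok]
            indexed.append((tok, last_index))
        elif tok.lower() in PRONOUNS and last_index is not None:
            indexed.append((tok, last_index))  # coreferential with antecedent
        else:
            indexed.append((tok, None))
    return indexed

tokens = ["Bill", "said", "he", "would", "come"]
print(coindex(tokens, {"Bill": "i"}))
# -> [('Bill', 'i'), ('said', None), ('he', 'i'), ('would', None), ('come', None)]
```

Picking the most recent mention fails as soon as two candidate antecedents compete (“Bill told John he would come”), which is exactly why coreference resolution is an active research area rather than a solved lookup problem.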

Google has at least one project involving coreference, and a few Googlers have been involved in papers connected with it.

The Reference Coreference Scorers project mentions the following papers, and states that they should be cited when the scorers are used:

Scoring Coreference Partitions of Predicted Mentions: A Reference Implementation. By Sameer Pradhan, Xiaoqiang Luo, Marta Recasens, Eduard Hovy, Vincent Ng and Michael Strube. Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics. Baltimore, MD, June 2014. [pdf]

An Extension of BLANC to System Mentions. By Xiaoqiang Luo, Sameer Pradhan, Marta Recasens and Eduard Hovy. Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics. Baltimore, MD, June 2014. [pdf]

Here are a couple of others I saw while looking around, in which people from Google were involved:

Large-Scale Cross-Document Coreference Using Distributed Inference and Hierarchical Models (pdf)

Wikilinks: A Large-scale Cross-Document Coreference Corpus Labeled via Links to Wikipedia

Most interesting to me among these papers was the abstract of the last one, which tells us:

Cross-document coreference resolution is the task of grouping the entity mentions in a collection of documents into sets that each represent a distinct entity.

It is central to knowledge base construction and also useful for joint inference with other NLP components.

Obtaining large, organic labeled datasets for training and testing cross-document coreference has previously been difficult. This paper presents a method for automatically gathering massive amounts of naturally-occurring cross-document reference data.

We also present the Wikilinks dataset comprising of 40 million mentions over 3 million entities, gathered using this method. Our method is based on finding hyperlinks to Wikipedia from a web crawl and using anchor text as mentions. In addition to providing large-scale labeled data without human effort, we are able to include many styles of text beyond newswire and many entity types beyond people.
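The gathering method the abstract describes — find hyperlinks to Wikipedia in crawled pages and treat the anchor text as a mention of the entity the link points to — can be sketched with the standard library’s HTML parser. The sample HTML below is invented for illustration; the real pipeline obviously operates on a full web crawl:

```python
# Sketch of the Wikilinks-style gathering step: treat the anchor text
# of links pointing at Wikipedia as labeled entity mentions.
# Standard library only; the sample HTML is made up.

from html.parser import HTMLParser

class WikiMentionParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self._target = None   # Wikipedia URL of the currently open <a>, if any
        self.mentions = []    # collected (anchor_text, wikipedia_url) pairs

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href", "")
            if "wikipedia.org/wiki/" in href:
                self._target = href

    def handle_data(self, data):
        if self._target:  # text inside a Wikipedia-pointing link
            self.mentions.append((data.strip(), self._target))

    def handle_endtag(self, tag):
        if tag == "a":
            self._target = None

html = ('<p>See <a href="https://en.wikipedia.org/wiki/Empire_State_Building">'
        'the Empire State Building</a> for details.</p>')
parser = WikiMentionParser()
parser.feed(html)
print(parser.mentions)
# -> [('the Empire State Building',
#      'https://en.wikipedia.org/wiki/Empire_State_Building')]
```

The appeal of the approach is that the label comes for free: whoever wrote the page already decided that this span of text refers to that Wikipedia entity, so no human annotation is needed.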

8 thoughts on “Searching with Pronouns: What are they? Coreferences in Followup Queries”

  1. This is really interesting. I’m especially intrigued by the possibilities with intent in search. Meaning articles/pages would be extremely casual and conversational; intent-driven. This could slowly remove the “SEO-optimized content” and make way for some better content-focused articles, not synonym-based and keyword-based content. That brings me to another question, though…

    The method highlighted: was this exclusive to the testing for this paper? Or do you think this might be Google saying ‘we use an internal ranking metric, and high “internal-PR” sites will be used for anchor text for this ability’ – meaning, Wikipedia?

    Once again, great article, good read and definitely something that expands my noggin about possibilities in the near future.


  2. Hi James,

    It appears that Wikipedia was used as a source for “coreference resolution” because Wikipedia is structured in a way that makes it easy to extract both entities and information about them as things that might be referred to with a pronoun, or even in an implied manner, such as asking for information about the “empire state building” and then sending a query to Google for “pictures”. Other knowledge bases on the web, such as Yahoo Finance, are structured in ways that offer similar advantages and could help provide similar information. It’s not so much the PageRank of those sites, though Wikipedia does have a “notability” requirement for the things it posts about, which makes it more likely that more important entities are included. There are notability policies for Freebase entries as well, which may serve a similar purpose.

  3. I can’t think of an example, but I feel like lately I’ve seen stuff like this: (not a real e.g.)

    Will query “new blue denim”

    Begin new query: “le…” (and ‘levis jeans’ will autosuggest).

    Without going into a long explanation, I just feel like autosuggest is playing a part a lot lately in conversational search.

  4. Thanks, James.

    There is a lot of useful and helpful information there, but the format does help Google extract it and learn more about it, so yes – the format is an important element regarding what is going on.

  5. Hi Patrick,

    I haven’t looked at an auto suggest or query refinement patent here in a while, and there was one from a month or two back that looked interesting. Maybe I should test that one, and see if it has some of those elements to it.

  6. Hi Bill,

    Thanks for that great response. I understand now that it was more the structure of the data that led to its being chosen for the testing.

    As always, more knowledge = good things:)

    Be well!

  7. Great stuff Bill. I am intrigued by Patrick’s idea of the modified auto suggest. Totally seems like something that G would do.

  8. I feel like this is the one area where Google needs to catch up. Siri (as terrible as she is) and soon Cortana both fare pretty well with conversational queries.
