Query Logs and the Slang of the Web

One way to help in that process of organizing the Web is to use what people do in the Web.

– Ricardo Baeza-Yates, from a presentation on Extracting Semantic Relations from Query Logs

How related might different search queries be when they share a number of pages in search results, and searchers tend to click upon those shared results more than other results?

If you go to Yahoo’s search, and perform a search for the term [wcca], the first result that you see in the search results is a page titled “Wisconsin Circuit Court Access.” If you search for [wisconsin circuit court], you’ll see the same page at the top of the search results. If many people searching for each of those terms tend to mostly click on the link for that page, and no other pages, it’s possible that Yahoo might start considering those query terms to be very closely related.

Because of that relationship, the search engine might start offering searchers a query suggestion for a related term at the top of the search results for an original query.

A recent Yahoo patent application explores these types of relationships, and tells us that the search engine might learn a great deal from comparing which search results searchers click upon. It describes three relationships between query terms, based upon click data found in its query logs, where it keeps track of which results searchers choose for specific queries.

Synonyms (close relationship) – Query terms that share a substantially equivalent set of clicked search results.

Lesser but included (inclusive relationship) – Where the set of clicked results for one query term is smaller than the set for another, and the clicked URLs for the first query are substantially included in the clicked URLs for the second query.

Related (lesser relationship) – Where the clicked search results between two queries overlap, but not quite to the same level as the two relationships above – synonyms and lesser but included.

In my example above, if people searching for [wcca] and [wisconsin circuit court] mostly click upon that first search result for “Wisconsin Circuit Court Access,” the search engine might consider those query terms to be synonyms.
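To make the three relationships a little more concrete, here is a minimal Python sketch of how they might be told apart from click data. It is not the method from the patent application, which only speaks of sets being "substantially" equivalent or included. The threshold values, the function names, the example.com style URLs, and the extra queries beyond [wcca] and [wisconsin circuit court] are all my own assumptions for illustration. It also treats clicks as plain sets of URLs and ignores how often each result was clicked, which a real system would likely take into account.

```python
from collections import defaultdict

# Hypothetical thresholds. The patent application only says "substantially",
# so these numbers are assumptions chosen for illustration.
EQUIVALENT = 0.8  # minimum overlap (Jaccard) to call two queries synonyms
INCLUDED = 0.8    # minimum share of the smaller set found in the larger set
RELATED = 0.2     # minimum overlap (Jaccard) to call two queries related


def clicked_urls(log):
    """Group (query, clicked_url) pairs into a set of clicked URLs per query."""
    clicks = defaultdict(set)
    for query, url in log:
        clicks[query.lower()].add(url)
    return clicks


def relation(a, b):
    """Classify the relationship between two sets of clicked URLs."""
    shared = len(a & b)
    if not shared:
        return "unrelated"
    jaccard = shared / len(a | b)
    if jaccard >= EQUIVALENT:
        return "synonyms"
    smaller = min(len(a), len(b))
    if len(a) != len(b) and shared / smaller >= INCLUDED:
        return "lesser but included"
    if jaccard >= RELATED:
        return "related"
    return "unrelated"


# A made-up query log of (query, clicked URL) pairs.
log = [
    ("wcca", "wcca.example.com/access"),
    ("wisconsin circuit court", "wcca.example.com/access"),
    ("wisconsin courts", "wcca.example.com/access"),
    ("wisconsin courts", "courts.example.com/supreme"),
    ("wisconsin courts", "courts.example.com/appeals"),
    ("wisconsin supreme court", "courts.example.com/supreme"),
    ("wisconsin supreme court", "courts.example.com/opinions"),
]
clicks = clicked_urls(log)

print(relation(clicks["wcca"], clicks["wisconsin circuit court"]))
print(relation(clicks["wcca"], clicks["wisconsin courts"]))
print(relation(clicks["wisconsin courts"], clicks["wisconsin supreme court"]))
```

With this made-up log, the first pair shares its only clicked page and comes out as synonyms, the second pair is lesser but included, and the third pair overlaps on one page out of four and is merely related.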

Which pages searchers choose to click upon is viewed as implicit user feedback. Searchers aren’t explicitly stating that these queries are related in some way, but when they choose shared pages in search results for those queries, it’s assumed that the terms are related.

What would a search engine do with this information?

It might offer query suggestions at the top of search results for a related query, or reformulate or expand search results to include results that are also relevant for the other query term. The search engine might also use these relationships to match queries with advertisements, and in other possible ways. About this process, the patent application tells us:

Embodiments can detect the slang of the Web (e.g., a taxonomy used by users to perform searches on the Web).
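As a rough illustration of the query suggestion idea mentioned above, the Python sketch below ranks other queries by how much their clicked results overlap with those of the original query. The function name `suggest`, the `min_overlap` and `limit` parameters, the overlap measure, and the sample data are all hypothetical. The patent application doesn't spell out how suggestions would be chosen, so this is only one way it might be done.

```python
def suggest(query, clicks, min_overlap=0.2, limit=3):
    """Return other queries ranked by overlap of clicked URLs with `query`.

    `clicks` maps each query to the set of URLs searchers clicked for it.
    `min_overlap` and `limit` are assumed values, not from the patent filing.
    """
    base = clicks.get(query, set())
    scored = []
    for other, urls in clicks.items():
        if other == query or not base or not urls:
            continue
        # Share of all clicked URLs that the two queries have in common.
        overlap = len(base & urls) / len(base | urls)
        if overlap >= min_overlap:
            scored.append((overlap, other))
    # Highest overlap first; ties broken alphabetically for stable output.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [other for _, other in scored[:limit]]


# Made-up click data for illustration.
clicks = {
    "wcca": {"wcca.example.com/access"},
    "wisconsin circuit court": {"wcca.example.com/access"},
    "wisconsin courts": {
        "wcca.example.com/access",
        "courts.example.com/supreme",
        "courts.example.com/appeals",
    },
    "cheese curds": {"food.example.com/curds"},
}

print(suggest("wcca", clicks))
```

Here a search for [wcca] would bring up [wisconsin circuit court] first, since the two share all of their clicked results, followed by [wisconsin courts], while [cheese curds] is left out because it shares none.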

The patent application is:

Extracting Semantic Relations from Query Logs
Invented by Ricardo Baeza-Yates and Alessandro Tiberi
Assigned to Yahoo
US Patent Application 20090164895
Published June 25, 2009
Filed December 19, 2007

There is a white paper on this topic from the inventors listed on the patent filing, Extracting semantic relations from query logs, which is available to subscribers of the ACM portal. If you’re not a subscriber, there is a video presentation on it from Ricardo Baeza-Yates, which I also linked to at the start of this post.

There are three Yahoo Research papers co-authored by Ricardo Baeza-Yates that cite that paper, and they are worth looking at if you’re interested in how query logs might be used by search engines:

  • Search, Web 2.0, and the Semantic Web The importance of search (pdf)
  • Clique analysis of query log graphs (pdf)
  • The anatomy of a large query graph (pdf)

