Exploring Connections Between Books in Google Book Search

In September of 2007, Google research scientists Bill Schilit and Okan Kolak announced a new feature for Google Book Search, which they called Popular Passages. The announcement came in an Inside Google Book Search blog post titled “Dive into the meme pool with Google Book Search.”

Popular Passages lets us find connections between books by taking interesting quotations or passages from one book, magazine, or other publication, and showing where those appear in other literary works. For example, the following passage shows up in the book Moneyball: The Art of Winning an Unfair Game:

As thus: lately in a wreck of a Californian ship, one of the passengers fastened a belt about him with two hundred pounds of gold in it, with which he was found afterwards at the bottom. Now, as he was sinking — had he the gold? or had the gold him?

This John Ruskin quote starts off the book, and appears in at least 35 other publications.

Passages might be taken from material in a book that appears within quotation marks, like the one above, or from unquoted text in the book. For example, another passage from Moneyball appears on the 37th page of the book:

From Paul’s point of view, that was the great thing about college players: they had meaningful stats. They played a lot more games, against stiffer competition, than high school players. The sample size of their relevant statistics was larger, and therefore a more accurate reflection of some underlying reality. You could project college players with greater certainty than you could project high school players. The…

The Popular Passages feature tells us that this passage shows up in two books from 2003-2008, and the other book it appears in is The Baseball Economist: The Real Game Exposed.

What’s interesting about the Popular Passages Book Search feature is the ability to create links between documents based upon passages that are shared between them, in a very large collection of documents that don’t contain links to each other.

An addition to this feature looks at the text of those passages, and a certain number of words after them, to identify key terms that co-occur within the context of those passages, so that the passages and the books that contain them can be searched by those “Key Ideas.”

The technical challenges behind the development of Popular Passages and the searchable key ideas are described in a couple of white papers from the researchers behind the processes:

There are also some Google patent filings associated with the identification of quotations and passages and key ideas, and the rankings of those passages when they appear as results in Google Book Search:

Identifying and Linking Similar Passages in a Digital Text Corpus
Invented by William N. Schilit, Okan Kolak, and Adam Mathes
Assigned to Google
US Patent Application 20090024606
Published January 22, 2009
Filed: July 20, 2007

Abstract

A corpus contains digital text from multiple documents. A passage mining engine identifies similar passages in the documents and stores data describing the similarities. The passage mining engine groups similar passages into groups based on degree of similarity or other criteria.

The passage mining engine ranks the similar passages found in the text corpus based on quality or other criteria. A user interface is presented that includes hypertext links associated with the similar passages that allow a user to navigate the documents.
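To make the idea of mining shared passages a little more concrete, here is a minimal sketch of one common way to find text that two documents have in common: break each document into overlapping runs of words (often called shingles) and index which documents contain each run. This is my own illustration and not the method claimed in the patent filing; the function names, the five-word window, and the sample texts are all assumptions.

```python
import re
from collections import defaultdict

def shingles(text, n=5):
    """Yield every run of n consecutive lowercased words in text."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    for i in range(len(words) - n + 1):
        yield " ".join(words[i:i + n])

def shared_shingles(docs, n=5):
    """Map each n-word shingle to the ids of the documents containing it,
    keeping only shingles that occur in two or more documents."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for s in shingles(text, n):
            index[s].add(doc_id)
    return {s: ids for s, ids in index.items() if len(ids) > 1}

# Hypothetical sample texts, made up for this sketch.
docs = {
    "book_a": "He asked: had he the gold? or had the gold him? Nobody knew.",
    "book_b": "Ruskin wondered, had he the gold or had the gold him, in the end.",
    "book_c": "College players had meaningful stats.",
}
links = shared_shingles(docs)
for shingle, ids in sorted(links.items()):
    print(shingle, "->", sorted(ids))
```

Each shared shingle acts like an automatically created link between the two books. A real system working with scanned text would also need to merge overlapping shingles into full passages and tolerate OCR errors, which this sketch does not attempt.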

Ranking similar passages
Invented by William N. Schilit, Okan Kolak, and Justin John Paul Vincent-Foglesong
US Patent Application 20090055389
Published February 26, 2009
Filed: June 5, 2008

Abstract

Passages in a digital corpus are scored and ranked based at least in part on characteristics of instances of the passages occurring in the corpus.

Such characteristics include the popularity of the author, the characteristics of the words introducing and following the similar passage, frequency of appearance of the passage in the digital corpus, the length of the similar passage, the words of the similar passage, the usage of punctuation with the similar passage, and the diffusion of the similar passage within the digital corpus.

The characteristics are scored and weighted to produce ranking scores for the associated passages. The ranking scores are used for purposes including selecting passages to display in association with a document and ranking passages displayed in response to a search.
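The abstract describes characteristics being scored and weighted to produce a ranking score. A rough sketch of what a weighted score like that could look like follows. The four features are loosely based on the list in the abstract, but the weights, the normalization, and every name in the code are hypothetical values I chose for illustration; the filing does not give a formula like this one.

```python
from dataclasses import dataclass

@dataclass
class PassageStats:
    text: str
    occurrences: int      # how many times the passage appears in the corpus
    distinct_books: int   # how many different books contain it (diffusion)
    in_quotes: bool       # whether instances tend to sit inside quotation marks

# Hypothetical weights, chosen only for illustration.
WEIGHTS = {"frequency": 0.4, "diffusion": 0.4, "length": 0.1, "punctuation": 0.1}

def score(p, max_occurrences, max_books, ideal_words=40):
    """Combine normalized feature values (each between 0 and 1) into one score."""
    words = len(p.text.split())
    features = {
        "frequency": p.occurrences / max_occurrences,
        "diffusion": p.distinct_books / max_books,
        "length": min(words, ideal_words) / ideal_words,
        "punctuation": 1.0 if p.in_quotes else 0.0,
    }
    return sum(WEIGHTS[name] * value for name, value in features.items())

def rank(passages):
    """Return passages sorted from highest to lowest score."""
    if not passages:
        return []
    max_occ = max(p.occurrences for p in passages)
    max_books = max(p.distinct_books for p in passages)
    return sorted(passages, key=lambda p: score(p, max_occ, max_books), reverse=True)

# Made-up counts, loosely inspired by the Moneyball examples above.
ruskin = PassageStats("had he the gold? or had the gold him?", 40, 36, True)
stats = PassageStats("college players had meaningful stats", 2, 2, False)
ranked = rank([stats, ruskin])
print([p.text for p in ranked])
```

With these made-up numbers, the widely quoted Ruskin passage ranks ahead of the passage that only appears in two books, which matches the intuition behind the feature.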

Identifying Key Terms Related to Similar Passages
Invented by William N. Schilit and Okan Kolak
US Patent Application 20090055394
Published February 26, 2009
Filed: January 30, 2008

Abstract

Key terms for similar passages from a large corpus are identified and used to enhance searching and browsing the corpus. The corpus contains multiple documents such as the text of books.

Browsing by concept is supported by identifying a set of similar passages or quotations in documents stored in the corpus and assigning key terms to passages which links conceptually related passages together.

The context of each passage instance is identified and can include, for example, the text surrounding the passage. The contexts of all similar passage instances are analyzed in order to identify key terms for the similar passage.

The related key terms are analyzed to identify relationships among the key terms from different similar passage sets. The key terms can be used as a basis for navigating the documents in the corpus. The key terms enable browsing the documents in the corpus by concepts referenced in the documents.
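One simple way to picture key terms being drawn from the contexts of a shared passage is to count which words keep showing up around the passage in different books. The sketch below does that with a small stopword list and a minimum-context threshold. Both of those, along with the function names and sample contexts, are my own assumptions rather than details from the patent filing.

```python
import re
from collections import Counter

# A tiny, hypothetical stopword list; a real one would be much longer.
STOPWORDS = {"a", "about", "and", "as", "can", "how", "in", "it", "its",
             "of", "or", "that", "the", "to"}

def tokenize(text):
    """Split text into lowercased words."""
    return re.findall(r"[a-z0-9']+", text.lower())

def key_terms(passage, contexts, min_contexts=2):
    """Return terms that appear in at least min_contexts of the contexts,
    ignoring stopwords and words from the passage itself."""
    passage_words = set(tokenize(passage))
    counts = Counter()
    for ctx in contexts:
        # Use a set so each context counts a term at most once.
        counts.update(set(tokenize(ctx)) - STOPWORDS - passage_words)
    kept = [(term, c) for term, c in counts.items() if c >= min_contexts]
    # Sort by count (descending), then alphabetically for a stable order.
    return [term for term, c in sorted(kept, key=lambda tc: (-tc[1], tc[0]))]

passage = "had he the gold or had the gold him"
# Hypothetical text surrounding the passage in three different books.
contexts = [
    "Ruskin warned that wealth can own its owner",
    "A lesson about wealth and greed, as Ruskin put it",
    "The wreck shows how greed drowns a man",
]
terms = key_terms(passage, contexts)
print(terms)
```

Terms that survive the threshold could then be attached to the passage and used as the kind of “Key Ideas” that searchers browse by.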

Google Book Search provides a number of other interesting features, such as:

  • Reviews of books listed,
  • References from web pages and other books and scholarly works,
  • Links to other editions of the same book and to related books,
  • A list of the “key terms” that appear in the book with links to where they show up, and
  • A Google Map to places mentioned in the books.

The white papers above tell us that Popular Passages has proved to be one of the most popular navigational features of Google Book Search since its release.

I’m not surprised by that admission. Being able to find interesting quotes that appear within a book and are shared in other books is an engaging way to discover ideas that authors have in common, and to see how those ideas spread.

Seeing how the inventors of Popular Passages came up with their methods for finding and ranking interesting shared passages in scanned documents, in the white papers and patent filings above, gives us insights into how challenges of search and discovery of ideas might be addressed.

What does this mean for search on the Web?

More and more books, magazines, and other documents without hyperlinks are becoming available on the Web. Methods of finding information like the automated links between Popular Passages in those printed materials, and the matching of query terms to Key Ideas taken from text associated with those passages, may become fairly common on the Web in the future.


11 thoughts on “Exploring Connections Between Books in Google Book Search”

  1. “What does this mean for search on the Web?”

    It means that Google has possibly managed to make popularity and frequency of appearance of a passage the newest comparison to paid links.

  2. Pingback: » Search Engine News Wrap-up March 1
  3. I think this would be a great cross reference tool for anyone doing research on an author, book or topic, whether it be for book reports, dissertations etc.

  4. Google books is brilliant for IT related books. I recently did a vmware exam and found lots of material on google books.

  5. I’ve been using Google books quite a lot, as I found it very useful for previewing books before actually purchasing it from Amazon. Think Popular Passages is great!

  6. Hi Kimberly,

    Thanks. That’s a good point. Basing the value of a site upon link popularity has some serious limitations. For example, a site that focuses upon a very narrow topic might be the best source of information about that topic, but it might not have very many links. Sites that have much more limited information about that topic, but have a good number of links pointing to them, may rank much higher. The same can be said for books.

    There’s a paper that goes into a lot of depth on search engine bias that discusses that point in depth, and some ways in which search engines may be misleading based upon assumptions that they make:

    Shaping the Web: Why the politics of search engines matters (pdf)

    The paper is a few years old, but the ideas are still pretty timely.

  7. Hi peterK,

    I found myself getting sucked into the Popular Passages, going from one book to another.

    Hi PeopleFinder,

    I am concerned that it might keep ideas from less popular books and authors from being found, but it is useful.

    Hi Mr. Gadget,

    I imagine that the coverage of IT books in Google Book Search is pretty extensive, especially since most of the people working on it likely find that to be a topic they might be interested in.

    Hi ePyramid,

    I feel like I’m starting to see more books in my search results, especially at Google Scholar. I’ve always liked the Amazon book previews, and Google does provide a chance to get a deeper look. It is nice to get that glimpse before buying.

  8. Thanks Bill. I’ll check it out.

    By the way, I’m lovin’ your new site design. Tre’ schweet. *-)

  9. That is an interesting thought Kimberley – I hadn’t looked at it like that until I saw your comment. The books search looks like a bit of fun – I’m going searching for my favourite phrases :)
