How Google Sets Works

A tool from Google that is often overlooked is Google Sets (no longer available), which allows you to “automatically create sets of items from a few examples.”

Google Sets was one of the first applications in the Google Labs (no longer available) pages.

Those pages are “Google’s Technology Playground,” and contain a number of programs that may or may not be tomorrow’s useful applications from the search engine. As Google tells us,

Google labs showcases a few of our favorite ideas that aren’t quite ready for prime time. Your feedback can help us improve them. Please play with these prototypes and send your comments directly to the Googlers who developed them.

Google was granted a patent this week on the process behind Google Sets, and the patent document provides some details on how the program finds additional words based on “items from a set of things” that you enter.

screenshot of Google Sets with 4 Delaware cities entered as a starter set

I haven’t used Google Sets much in the past, but now that I have a sense of how it works, I might use it more often.

Since the program allows you to enter a number of items that might be members of a set, I decided to type in the names of 4 cities in Delaware:

Newark, Dover, Wilmington and Georgetown

You can then choose to get either a small set or a large set in response to the items that you entered. Here are the results that I received after picking a large set:

wilmington, georgetown, newark, dover, new castle, rehoboth beach, bear, milford, hockessin, lewes, seaford, smyrna, millsboro, middletown, claymont, milton, selbyville, townsend, laurel, harrington, felton, greenwood, clayton, magnolia, camden, wyoming, dover afb, odessa, elsmere, delaware city, bethany beach, dewey beach, ocean view, fenwick island, bridgeville, newport, montchanin, ellendale, brookside, dagsboro, millville, winterthur, saint georges, philadelphia, delmar, yorklyn, glasgow, frankford, lincoln, port penn

Most, but not all, of these results are cities in Delaware.

Not every starter set returns results this good, but if you have an idea of how Google Sets works, you may end up with better results when using the tool.

The simple explanation of how the program works is that Google attempts to identify lists on the web as it crawls pages. It may look for these lists by considering:

  • HTML tags (e.g., <UL>, <OL>, <DL>, <H1>-<H6> tags),
  • Items placed in a table,
  • Items separated by commas or semicolons,
  • Items separated by tabs,
  • Other ways.
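
To make the list-spotting step a little more concrete, here is a minimal Python sketch of two of the signals above: items inside <UL>/<OL> tags, and items separated by commas, semicolons, or tabs. The names ListExtractor and split_delimited, and the min_items threshold, are my own assumptions for illustration; the patent does not publish code, and Google's crawler certainly does much more than this.

```python
from html.parser import HTMLParser
import re


class ListExtractor(HTMLParser):
    """Collects <li> text, grouped by the enclosing <ul>/<ol> (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.lists = []        # finished lists, each a list of item strings
        self._stack = []       # lists that are currently open (handles nesting)
        self._in_item = False  # True while we are inside an <li>
        self._buffer = []      # text fragments of the current item

    def handle_starttag(self, tag, attrs):
        if tag in ("ul", "ol"):
            self._flush_item()      # close any item text before a nested list
            self._stack.append([])
        elif tag == "li" and self._stack:
            self._flush_item()      # an <li> may be left unclosed in real HTML
            self._in_item = True

    def handle_endtag(self, tag):
        if tag == "li":
            self._flush_item()
        elif tag in ("ul", "ol") and self._stack:
            self._flush_item()
            items = self._stack.pop()
            if len(items) > 1:      # a single item is not much of a list
                self.lists.append(items)

    def handle_data(self, data):
        if self._in_item:
            self._buffer.append(data)

    def _flush_item(self):
        # Normalize whitespace and case so "Dover" and " dover " match later.
        text = " ".join("".join(self._buffer).split()).lower()
        if text and self._stack:
            self._stack[-1].append(text)
        self._buffer = []
        self._in_item = False


def split_delimited(text, min_items=3):
    """Treat comma-, semicolon-, or tab-separated text as one candidate list."""
    items = [part.strip().lower() for part in re.split(r"[,;\t]", text)]
    items = [item for item in items if item]
    # min_items is an assumed cutoff so ordinary sentences are not mistaken for lists.
    return items if len(items) >= min_items else []
```

Feeding a page to ListExtractor().feed(html) fills its lists attribute with one entry per list found. Tables, definition lists, and headings would need similar handling, which is left out here to keep the sketch short.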

Items typed into the Google Sets interface by users are matched up against these lists, and probabilities are calculated to determine which items might be a good match for the items submitted by someone using Google Sets.
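
The matching step can be sketched in a few lines of Python as well. This is only an illustration under my own assumptions: each list found on the web gets a weight equal to the share of the user's examples it contains, and each candidate item is scored by adding up the weights of the lists it appears in. The patent describes assigning weights and calculating probabilities, but the exact formula below and the function name expand_set are hypothetical.

```python
from collections import defaultdict


def expand_set(examples, existing_lists, size=5):
    """Suggest up to `size` new items that co-occur in lists with the examples."""
    examples = {e.strip().lower() for e in examples}
    scores = defaultdict(float)

    for items in existing_lists:
        members = {i.strip().lower() for i in items}
        overlap = len(examples & members)
        if overlap == 0:
            continue  # this list shares nothing with the starter set
        # Assumed weighting: fraction of the examples found in this list.
        weight = overlap / len(examples)
        for item in members - examples:
            scores[item] += weight

    # Highest score first; ties are broken alphabetically for stable output.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [item for item, _ in ranked[:size]]


# Usage with made-up lists standing in for lists found on web pages.
web_lists = [
    ["Newark", "Dover", "Wilmington", "Smyrna"],
    ["Dover", "Wilmington", "Smyrna", "Milford"],
    ["Newark", "Philadelphia", "Boston"],
    ["About us", "Sitemap"],
]
suggestions = expand_set(["Newark", "Dover", "Wilmington"], web_lists, size=3)
```

A list that contains only one of the examples (such as a list of cities named Newark's neighbors in other states) still contributes, just with a lower weight, which is one way a result like Philadelphia could slip into a set of Delaware cities.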

If you keep in mind that Google Sets is suggesting additional terms for your set by considering words that might appear together in lists on Web pages, you may find the results you receive more useful.

The patent is:

System and methods for automatically creating lists
Invented by Simon Tong and Jeff Dean
Assigned to Google
US Patent 7,350,187
Granted March 25, 2008
Filed April 30, 2003

Abstract

A system automatically creates a list from items in existing lists. The system receives one or more example items corresponding to the list and assigns weights to the items in the existing lists based on the one or more example items. The system then forms the list based on the items and the weights assigned to the items.

While the Google Sets application isn’t named in the patent filing itself, one of the images that accompany the patent is a screen shot of the front page of Google Sets.

screenshot of patent drawing showing front page of Google Sets


31 thoughts on “How Google Sets Works”

  1. Interesting that they’ve actually patented it! I’ve been using this handy tool for too many years to remember, and it’s usually good. Do you think that the results of applying the tilde operator in Google search queries highlight word relations created using a similar technology to Google Sets?

  2. I tested the similarity with the Google AdWords keyword tool,
    especially in English, against the group of “other keywords” in AdWords.
    Not all of the keywords given by Google Sets match, but a large part do.
    Another percentage of the keywords appear in a different form (singular, plural).

  3. Thanks for the heads-up about the patent.

    Now, I often use the Google Keyword Tool, not just to get some sense of the relative demand for specific keywords, but also to prospect for other possible keywords I might have missed otherwise. It’s clear that I could also be using the Google Sets tool for this purpose, but what I’m wondering is how much overlap you think there would be between both tools. (I suppose I could just go put it to the test, plug in the same five KW in both, and examine the suggestions…)

  4. It is playful and instructive tools like this that will keep Google on the cutting edge of search technology for a long time to come.

    Google’s playful and innovative approach to search engine use and advances reminds me of the creativity at Apple.

    Just like Apple, I would never count the Google people out for long.

  5. Hi SEO Ranter,

    I’ve used Google Sets before, but not really for too much. It can be helpful sometimes when you are trying to broaden a set. I’m not sure that the tilde search (similarity or synonym search) is really using the same technology.

    Hi Francesco,

    The adwords suggestion tool can be useful. I’m not sure if it’s the best way to search for similarity either, but it can be a useful tool.

    Hi Winooski,

    You’re welcome. As I was reading through the patent, I wasn’t sure that it covered an existing tool from Google. It took a couple of reads before I even remembered Google Sets.

    I’m not sure that there’s a lot of overlap between the two – I’m going to have to do a writeup of another recent patent filing that may give us some more details….

    Hi People Finder,

    I like Sets, and I like the spirit in which tools like it, and others in the Google Experimental Labs are released to the public to try and use. I’ll second your hat tip to creativity, and companies that aren’t afraid to do something different.

  6. Google Sets is a great tool to find semantically related keywords to use in your text for purposes of phrase-based IR. Do you concur?

  7. Pingback: » Pandia Weekend Wrap-Up March 30 2008
  8. Hi Jordan,

    I’d hesitate to agree completely with that use of Google Sets. It does provide information about words that occur together frequently within lists based upon the items of a set that you may provide. I see at least two issues with that, when it comes to trying to understand how semantically related those words might be.

    The first is that the application is limited to lists that it finds upon pages rather than the whole content of those pages. So using Sets to understand how phrases within documents relate to each other when you are just looking at parts of documents (lists within them) may be a problem.

    The second is that you don’t know how semantically related the items you initially enter into Google Sets might be in the first place, before it attempts to add to that set for you.

    The best use of Google Sets that I can think of at this point is as a brainstorming tool, when you might want to find words that people may have included together within lists on the Web. Unfortunately, it doesn’t provide any insight into why those words were listed together.

  9. I think you have highlighted the sticking point about the Google sets results at the moment:

    Suggestions are based on current on page factors.

    The thing about synonyms, such as the ones produced by AW keyword tool, is that they have some basis in the “mind” of searchers. Although website authors have the same minds, “on page” lists produce some pretty arbitrary results.

    I tested some related two-word keyphrases that I know well. It didn’t take too many iterations to find my results littered with “about us,” “our projects,” and even “sitemap.”

    Maybe the fact that navigation is essentially a list means that the lists themselves should not be the measure of a real “set”.

    Just found your site and am looking forward to reading a slightly more interesting SEO conversation than the usual dross peddled everywhere these days. Best wishes.

    Rob

  10. Hi Rob,

    Thanks for stopping by, and for your comment.

    The range of things that the patent says might be viewed as lists is broader than I might have expected, and includes the words found within heading elements on pages. Unfortunately, many designers misuse headings, and use them for navigational content as well as for text within the main content areas of a page. There are also some people considered “standards thought leaders” who have been pushing the idea that a company logo image should be encased in an <h1> element, since there is no HTML element for a site name or company name. Unfortunately, that means that the main heading on a page isn’t about the content that it heads, but rather is the same from page to page. That practice makes search engine indexing harder, and may make the lists that Google Sets creates from headings found on pages less useful.

    I don’t think that using Google Sets is a good way to try to find words that might be semantically related. Its best use does seem to be as a way of finding words that appear together in lists that include the items entered by a user of Google Sets.

    I’m not sure how useful it is otherwise. It’s been Google’s longest beta project for a long time, and may remain a beta program in Google’s labs long after other programs added later graduate from the labs section of their site.

  11. I use Google Sets to uncover obvious related terms that you might not get right away unless you happen to be using several different keyword tools for your research. Usually you have to do a lot of “drilling down” before you come across an obvious connection which, of course, you cannot ignore. Too bad it only returns 20 to 30 results, but I love it anyway.

  12. Hi Marcus,

    Google Sets can be a very useful tool for more than just finding keywords. I like it as a brainstorming tool while writing, and looking for related ideas. If you enter a couple of related concepts that might appear together in a number of lists on the web, it can help you expand those to a much wider range of concepts. If you’re doing some mindmapping, it can be very helpful.

  13. First of all, this is the first time I’ve heard of Google Sets. I can’t say I was totally blown away by it, but it could be used by webmasters for research purposes. I’ve played with it a little, and it seems that Google Sets either
    – returns results that are very close to Google Keyword Tool results, or
    – returns results that complete a group of items (not synonyms) that can be grouped under a single definition, or
    – returns words that sound similar.
    I tried adding descriptive words, hoping that the tool would return the objects described by those words, but that didn’t happen.
    All in all, there is no doubt that this is a useful tool, and I’ll keep it in mind for the future because nothing else yet offers this sort of grouped results.

  14. Hi George,

    Google Sets has been around for a long time, but it isn’t really that well known.

    Considering that its results are taken from lists that Google finds on the Web, one of the ways that I try to use it is to imagine things that might show up in a list together, put those in as my starting terms, and see what else might show up. Sometimes the results are helpful. It’s worth exploring, even when they aren’t.

  15. Hi Andrew,

    That’s definitely a paper worth taking a look at. However, I’m not sure that we should look at it as describing how Google Sets works, since its focus is upon finding and extracting key/value pairs of data on the web, such as author/title. The patent is specifically about Google Sets, so it may be a better source for someone who wants to learn about Google Sets. I do appreciate your pointing it out. It is a very interesting document.

  16. Hi John,

    From what I’ve seen, the processes behind Google Sets don’t involve Latent Semantic Indexing (LSI), and I don’t think that the relationships between words found through Google Sets have anything to do with LSI, either.

  17. Pingback: bpchesney.org
  18. Pingback: Anonymous
  19. I tested some related two-word keyphrases that I know well. It didn’t take too many iterations to find my results littered with “about us,” “our projects,” and even “sitemap.”

    Maybe the fact that navigation is essentially a list means that the lists themselves should not be the measure of a real “set”.

  20. Hi Robin,

    Interesting that you would see so many navigational items in Google Sets, but it’s possible that people are using more styled lists to present their navigation than they probably were when this patent was filed and Google Sets was originally developed. I remember creating nav bars using tables back in the early 2000s.

    Ideally, Google should probably be filtering lists that are used for navigation out of Google Sets results.

  21. Unlike the now-extinct Wonder Wheel, Google Sets seems more like it was Google’s first step toward LSI. If you put in a brand, you get substitutes (Sony, Panasonic, etc.) and a few random terms. A lot like the AdWords keyword tool feature that scans a webpage for key terms and gives you a list, Google Sets sees what products or terms are grouped together. It raises the question: why have some valuable Google tools (like Wonder Wheel) ended while experiments like Google Sets still remain?

  22. Hi Ben,

    Google Sets definitely led to Google’s WebTables project, and to other efforts they are making to extract structured data from pages on the Web. Check out the following papers for more:

    WebTables: Exploring the Power of Tables on the Web (pdf)
    Uncovering the Relational Web (pdf)

    Google does definitely do some semantic analysis, but I don’t think that LSI as it was first described and implemented in the 90s is scalable enough for an index of documents on the Web, which both changes too quickly and contains too much data for LSI to be helpful in most instances.
