Google on Generating Statistics from Search Engine Query Logs (Hot Trends and More)

How might statistics created from user query logs be useful to search engines and to searchers?

A Google patent application published at the World Intellectual Property Organization, Systems and Methods for Generating Statistics from Search Engine Query Logs, explores how such statistics might be created.

The filing lists Olcan Sercinoglu, Artem Boytsov, and Jeffrey A. Dean as inventors, and was filed with WIPO on May 9, 2007. It was published on November 22, 2007, and appears to show the process behind Google Trends. But it provides much more information than that.

A real life example which expands upon how such statistics might be useful is a study that was conducted with the help of two of the inventors listed in the patent filing, Language Preferences on Websites and in Google Searches for Human Health and Food Information.

Here’s a description from that paper of their method used to create recommendations from health related web sites, which incorporated query information from Google:

To estimate the percentage of Web publishers that translate their health and food websites, we measured the frequency at which domain names retrieved by Google overlap for language translations of the same health-related search term.

To quantify language choice of searchers from different countries, Google provided estimates of the rate at which its search engine was queried in six languages relative to English for the terms “avian flu,” “tuberculosis,” “schizophrenia,” and “maize” (corn) from January 2004 to April 2006.

The estimate was based on a 20% sample of all Google queries from 227 nations.

Query information may be used in different ways for different purposes, and people such as social scientists or marketers or politicians might have different motivations and need to look at different data. The patent application describes what a search engine might collect, and how it could be used:

A search engine receiving millions of queries each day from users around the world would generate a query record in its query log, which could include attributes such as:

  1. Query terms submitted,
  2. A timestamp indicating when the query is received by the search engine,
  3. An IP address identifying a unique device (e.g., a PC or a cell phone) from which the query terms are submitted,
  4. An identifier associated with a user who submits the query terms (e.g., a user identifier in a web browser cookie),
  5. Whether the user identifier is associated with a toolbar or other application or service to which the user has subscribed.
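To make that list concrete, here is a minimal sketch in Python of what one such query record might look like. The class and field names are my own illustrative assumptions; the patent filing describes the attributes, not this structure.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass(frozen=True)
class QueryRecord:
    """One entry in a hypothetical query log (field names are illustrative)."""
    terms: str                     # query terms submitted
    timestamp: datetime            # when the search engine received the query
    ip_address: str                # device the query terms were submitted from
    user_id: Optional[str] = None  # e.g., an identifier from a browser cookie
    via_toolbar: bool = False      # whether the identifier is tied to a toolbar or similar service


record = QueryRecord(
    terms="french restaurant palo alto ca",
    timestamp=datetime(2007, 5, 9, 12, 30),
    ip_address="192.0.2.10",  # address from a documentation-only range
    user_id="cookie-abc123",  # made-up identifier
)
```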

An example given of another use for these stats might be a publisher gauging “the popularity of a newly released book in a specific city from the frequencies of relevant queries submitted by users from that city within a given time period.”

Zeitgeist and Hot Trends

We’ve seen some of this statistical query analysis from Google described in the Official Google Blog in a post from Artem Boytsov titled How we came up with year-end Zeitgeist data.

As he tells us there, the Zeitgeist information doesn’t tell us the most searched for terms, but rather ones that have increased from one year to the next in queries performed at Google.
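The difference between "most searched" and "most increased" is easy to see in a small Python sketch. This is only an illustration under my own assumptions: the counts are invented, and the growth measure (this year's count divided by last year's) and the `min_count` cutoff are hypothetical choices, not something the blog post or the patent filing spells out.

```python
def rising_terms(last_year, this_year, min_count=100):
    """Rank terms by year-over-year growth rather than by raw volume."""
    growth = {}
    for term, count in this_year.items():
        if count < min_count:
            continue  # ignore terms too rare to be meaningful
        previous = last_year.get(term, 0)
        growth[term] = count / (previous + 1)  # +1 avoids dividing by zero for brand-new terms
    return sorted(growth, key=growth.get, reverse=True)


# Invented yearly query counts, for illustration only.
last_year = {"weather": 50000, "iphone": 200, "maps": 40000}
this_year = {"weather": 51000, "iphone": 30000, "maps": 44000, "rare": 5}

rising = rising_terms(last_year, this_year)
most_searched = max(this_year, key=this_year.get)
```

In this toy data "weather" is the most searched term, but it comes last in the rising list because its volume barely changed.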

This sharing of information about searches at Google has been expanded upon with Google’s Hot Trends and Trends, which are interesting tools to look at if you want to see what kinds of terms and news are timely, according to Google searches.

Query Analysis in Other Google Patent Filings

How else might Google be using query related information? There are a number of Google patent filings that describe ways the search engine might be using that information to influence the rankings of search results. Here are three of them that are worth a look:

Some aspects of the methods described in the patent filing

Privacy protection — This process attempts to include safeguards to prevent disclosing information that may be traced to individuals or small groups of users.

Query record information — Different parts of what may be contained in a query record are described, such as the query terms, timestamps associated with searches, IP addresses mapped to the devices searches came from, cookie information or other user identifying information, and possibly a way to understand the language associated with a query.

Sampling Schemes — It’s probably not necessary to look at all of the query records associated with a query, and this patent application describes a number of different possible ways of sampling that query data that would involve only looking at a percentage of the data – perhaps 10 percent to 20 percent. Every fifth query could be viewed, or queries could be broken into geographical regions using IP address information, and then a percentage of those could be taken, so that there’s a diversification of results from different places.
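Both of the sampling ideas mentioned above can be sketched in a few lines of Python. This is a rough illustration, not the patent's implementation: the function names are mine, and the toy log uses a ready-made region label where a real system would derive the region from the IP address.

```python
import random
from collections import defaultdict


def every_nth(records, n=5):
    """Systematic sample: keep every n-th record (n=5 gives roughly 20 percent)."""
    return records[n - 1::n]


def sample_by_region(records, region_of, fraction=0.2, seed=0):
    """Stratified sample: take the same fraction of records from each region."""
    rng = random.Random(seed)
    by_region = defaultdict(list)
    for rec in records:
        by_region[region_of(rec)].append(rec)
    sample = []
    for region_records in by_region.values():
        k = max(1, round(len(region_records) * fraction))
        sample.extend(rng.sample(region_records, k))
    return sample


# Toy log of (region, query) pairs.
log = [("US", f"q{i}") for i in range(10)] + [("DE", f"q{i}") for i in range(5)]
fifth = every_nth(log)
stratified = sample_by_region(log, region_of=lambda rec: rec[0])
```

Sampling within each region guarantees that a small region still contributes records, which a plain every-fifth-query pass over a log ordered some other way would not.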

Query Sessions — The number of query records from the same IP address over a given period of time might be limited to avoid the sample from “being corrupted by bogus query data associated with malicious operations such as query spam.” But, looking at query records from the same IP address might impart important information. For example:

Very often, a user may submit multiple related queries to the search engine within a short timeframe in order to find information of interest. For example, the user may first submit a query “French restaurant, Palo Alto, CA”, looking for information about French restaurants in Palo Alto, California. Subsequently, the same user may submit a new query “Italian restaurant, Palo Alto, CA”, looking for information about Italian restaurants in Palo Alto, California.

These two queries are logically related since they both concern a search for restaurants in Palo Alto, California. This relationship may be demonstrated by the fact that the two queries are submitted closely in time or the two queries share some query terms (e.g., “restaurant” and “Palo Alto”).

Understanding Query Sessions — Query sessions might be viewed in short bursts from individual users, like ten minutes, with all queries assumed to be related, or longer sessions such as two hours, when the terms being used appear to be related. Multi-tasking, when searches definitely appear to be unrelated, such as a search for an “apple ipod” where the other searches were for restaurants in Palo Alto, could be split off into separate query sessions, even though they would continue to be associated with the same user and/or cookie information.

Query Extraction Heuristics — guideline rules that might be followed by this process, such as one that determines that consecutive queries would belong to the same session if they share some query terms or if they are submitted within a predefined time period (e.g., ten minutes) even though there is no common query term among them.
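Here is a minimal Python sketch of that kind of heuristic. It assumes one reading of the rules above: consecutive queries join the same session if they arrive within ten minutes of each other, or if they share at least one term and arrive within two hours. The constant names, the function name, and the simple whitespace tokenizing are my assumptions, not details from the filing.

```python
from datetime import datetime, timedelta

SHORT_GAP = timedelta(minutes=10)  # queries this close together are assumed related
LONG_GAP = timedelta(hours=2)      # upper bound when consecutive queries share terms


def split_sessions(queries):
    """Group one user's time-ordered (timestamp, terms) pairs into sessions."""
    sessions = []
    for ts, terms in queries:
        if sessions:
            prev_ts, prev_terms = sessions[-1][-1]
            gap = ts - prev_ts
            shared = set(terms.lower().split()) & set(prev_terms.lower().split())
            if gap <= SHORT_GAP or (shared and gap <= LONG_GAP):
                sessions[-1].append((ts, terms))
                continue
        sessions.append([(ts, terms)])
    return sessions


start = datetime(2007, 11, 22, 18, 0)
queries = [
    (start, "french restaurant palo alto"),
    (start + timedelta(minutes=4), "italian restaurant palo alto"),
    (start + timedelta(minutes=50), "palo alto restaurant reviews"),
    (start + timedelta(hours=5), "apple ipod"),
]
sessions = split_sessions(queries)
```

The three restaurant queries end up in one session and the later “apple ipod” query starts a new one. A real system would presumably ignore stop words such as “the” when checking for shared terms, and would need extra logic to split off unrelated queries that arrive inside the ten-minute window.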

Using Timestamps to Organize Queries — timestamps when query sessions start (or end) and geographic values (from IP addresses) could be another way to organize query records, and could be useful for aggregating information about the use of those queries. The Trends or Zeitgeist applications mentioned above would benefit from such an organization, though the information could be useful in other ways, too.

Partitions for Query Session Records — these records are too large for a single computer server to process them efficiently. A good percentage of the patent filing goes into detail on different strategies for partitioning these query log records, and searching through that information. If you’re interested in how some of those different approaches may work, the document includes a number of examples which can walk you through some of the processes described.
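I won't try to reproduce the patent's partitioning strategies here, but the general idea of spreading records across servers can be illustrated with plain hash partitioning in Python. This is a generic technique swapped in for illustration, not one of the filing's own schemes, and the function names and the choice of a user identifier as the key are assumptions.

```python
import hashlib


def partition_for(key, num_partitions):
    """Map a key (e.g., a user identifier) to a stable partition number."""
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_partitions


def partition_records(records, key_of, num_partitions=4):
    """Distribute records into partitions so equal keys always land together."""
    partitions = [[] for _ in range(num_partitions)]
    for rec in records:
        partitions[partition_for(key_of(rec), num_partitions)].append(rec)
    return partitions


# Toy (user_id, query) records.
records = [("u1", "ipod"), ("u2", "weather"), ("u1", "ipod case"), ("u3", "maps")]
partitions = partition_records(records, key_of=lambda rec: rec[0])
```

Keying on the user identifier keeps all of one user's query sessions on the same partition, so session-level statistics can be computed without moving data between servers.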

Interfaces for Statistics Related to Specific Query Terms

The patent also tells us about how they might show some of the statistics about users related to particular query terms. The screenshot included doesn’t look too different from this Google Trends search for “ipod”.

Popularity in Queries over time — We might be shown a graph which displays the popularity of a term over a specific time period. Each of the data points shown on a curve of that information might correspond to a ratio between the number of users that have submitted at least one query related to that term during a particular week and the number of users that have submitted any query during that week. Peaks and troughs in the graph would suggest that the term’s popularity varies with time.
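The weekly ratio described above is simple enough to sketch in Python. The record layout (user, day, query), the function name, and the exact-word test for "related to that term" are simplifying assumptions on my part; the filing presumably uses something richer to decide which queries relate to a term.

```python
from collections import defaultdict
from datetime import date


def weekly_popularity(records, term):
    """Return {(iso_year, iso_week): share of that week's users who searched for term}."""
    all_users = defaultdict(set)
    term_users = defaultdict(set)
    for user_id, day, query in records:
        iso = day.isocalendar()
        week = (iso[0], iso[1])
        all_users[week].add(user_id)
        if term in query.lower().split():
            term_users[week].add(user_id)
    return {week: len(term_users[week]) / len(users) for week, users in all_users.items()}


# Invented records, for illustration only.
records = [
    ("u1", date(2007, 11, 19), "ipod nano"),
    ("u1", date(2007, 11, 20), "ipod case"),
    ("u2", date(2007, 11, 21), "weather"),
    ("u3", date(2007, 11, 26), "ipod"),
    ("u4", date(2007, 11, 27), "ipod touch"),
]
popularity = weekly_popularity(records, "ipod")
```

Counting users rather than queries means that one person searching for the same term many times in a week only counts once.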

News coverage over time — Another curve might represent the volume of news coverage of the term during the same time period, with each data point on the curve telling us of the number of occurrences of the term in that week’s news coverage.

Cities, Countries, and Languages — Tabs might be shown which provide statistical information about the usage of specific queries based upon cities, countries, and languages. Under the Cities tab may be the top 10 cities that have the largest number of users that have submitted at least one query related to the term. These numbers may not be the actual number of users searching from those cities, but rather a normalized value. Using this kind of normalized value, we learn about where increases in searches for specific terms from specific cities or countries or in different languages are taking place.
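The post above doesn't say how that normalized value is calculated, so the following Python sketch is only one plausible guess: divide the number of users searching for the term in each city by all searching users in that city, then scale so that the top city scores 1.0. The function name and the city figures are invented.

```python
def normalized_city_scores(term_users, total_users, top_n=10):
    """Rank cities by the share of their users who searched for a term, scaled to a peak of 1.0."""
    shares = {
        city: term_users.get(city, 0) / total
        for city, total in total_users.items()
        if total > 0
    }
    peak = max(shares.values(), default=0)
    if peak == 0:
        return []
    ranked = sorted(shares.items(), key=lambda item: item[1], reverse=True)
    return [(city, share / peak) for city, share in ranked[:top_n]]


# Invented user counts, for illustration only.
term_users = {"Austin": 300, "New York": 900, "Boise": 50}
total_users = {"Austin": 1000, "New York": 9000, "Boise": 500}
scores = normalized_city_scores(term_users, total_users)
```

With these numbers New York has the most users searching for the term, but Austin ranks first because a larger share of its users searched for it. That is the kind of difference a normalized value is meant to surface.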

Sorting by time or place — Dropdown lists might also be included which reflect statistics based upon query session records for certain months, or particular countries.

Conclusion

As important as a search engine’s index of the Web may be, its index of query session information could be just as important, in that it can provide many insights into what people are looking for, and how they attempt to find that information.

Information about those searches, such as when and where they take place, and how they are used in the context of query sessions can be useful to the search engine, and to others – such as in the example of the research surrounding health related web sites that I linked to above.

Being able to analyze user query information can be useful in personalization of search results, in relating different query terms together, in determining which topics and search terms are timely and which are seasonal, in understanding how people might search for different topics differently, and in many other ways.

17 thoughts on “Google on Generating Statistics from Search Engine Query Logs (Hot Trends and More)”

  1. Hi Bill,

    I actually thought that Google already does that (or tries) with their personalized search and your search history enabled. I learned in a very practical example that this query term session analysis or recorded previous searches did not do much to make Google realize after 5+ searches and visits of 2-3 sites for a very specific niche topic.

    The visited sites were visits of a few minutes, which should have signaled relevance of those sites for a specific topic that I was obviously interested in and a clear pattern should have emerged. Google Toolbar was also on to realize that I browsed a number of pages on those sites and looked at stuff, scrolled down on the page etc. After looking for and visiting a bunch of sites about the same subject, I did a search where the keyword was ambiguous if taken out of context. I hoped that Google would realize that I was interested in sites and information about a specific topic, filter out some of the ambiguous results that are off topic, and at the same time boost pages that match my keywords AND the subject or theme.

    Google didn’t do anything like that. The top pages (beyond top 10) were ALL unrelated. I got the impression that it didn’t do anything or little with the collected data. Let’s hope that they will improve on that.

  2. Hi Carsten,

    Hope that you had a good Thanksgiving.

    This patent application does cover some areas that should look familiar to people paying attention to papers and patent filings from Google on personalization, and how Google might be using query information.

    I hadn’t seen the paper I linked to above about conducting research on different queries to see what kind of information is available in different languages before writing this post, and it’s nice to see a concrete example of how query log files can provide some helpful information to researchers.

    The patent filing also provides more information about how Google Trends works than I had seen in any one place before, and gives us a few new vocabulary terms, and ways to think about how query analysis works at the search engine.

    Given all that, using query analysis for personalization is still in its infancy, and all signs point towards it being used in a statistical training model that, like most machine learning processes, takes time and lots of information before it has much of an impact. You’ve probably seen it more effectively used in areas such as spelling correction, where the aggregation of query refinements leads to those “did you mean xxxx” messages at the tops of some searches.

    I think that it is fair to say that Google’s use of query analysis in personalization efforts has a lot of room to grow.

  3. Yup, my example might give them some practical ideas. I am not sure how good Google is when it comes to determining “topical neighborhoods”, which is a strength of Ask.com as we know, but that would play a role to be able to resolve query disambiguation based on previous queries, click streams and length and details of site visits after the click.

    I can tell you my example. I was searching for demo-parties, which have ambiguous names like “The Gathering” or “Assembly”. You have to use qualifiers to find them, like “The Gathering” + Norway and “Assembly” + demoparty (or “demo party”).
    That was also not a problem. I looked for those and browsed them and some other sites that were related to the subject. Those parties also still exist and are being held once or twice per year.

    After I spent some time having fun with that kind of stuff, I was attempting to get more information about the demo-party with the name “The Party”, which was held in Denmark every year from 1991 to 2002 between Christmas and New Year.

    http://en.wikipedia.org/wiki/The_Party_%28demo_party%29

    I did not want to use too many qualifiers for my queries, because I wanted to find as much information and data related to it as possible. Demo-scene and Art-scene stuff is poorly indexed, which is not Google’s fault, but is due to classic SEO mistakes that are related to site architecture.

    Anyhow, queries with “The Party” after all my other queries and site visits should result in pages about things related to the demo-scene with the keyword match being pushed up in the SERPS because of relevance. Instead I got back all kinds of “party” related stuff, from a fraternity party to the 750th anniversary party of Berlin.

    You can’t get much more generic in regards to disambiguation as with a term like “the party”, right? Detection of the topic that I seem to be interested in based on previous search and site visit activities (they could even use my bookmarks in Google or Gmail data :) ) and re-ranking results according to that would be a blessing.

    Cheers!

  4. Hi Carsten,

    That particular term (The Party), within the context that you provide, might be a difficult one to tackle for a search engine attempting to do some personalization.

    By itself, “party” can generically refer to a political organization, to an individual or group of people (the party in question), to a celebration, to participants in a wedding, and probably a few other uses. The other part of that term (the) may often be seen as a stop word.

    It may also be difficult to recognize “The Party” as a concept, or semantically meaningful unit, since it’s likely that “The Party” appears in many other contexts on the Web.

    Could a search engine recognize the term as referring to a specific event? There are some patents and papers from Google and others that explore data extraction within the context of named entities – specific people, places, objects, and events. Those might struggle with an event that has such a generic name. Here’s one approach involving that type of extraction:

    Unsupervised Named-Entity Extraction from the Web: An Experimental Study (pdf)

    How big of a challenge might it be to associate past searches for some generic terms such as “The Gathering” or “Assembly” and visits to pages that describe specific events, with a search for another unusual usage of the generic term “Party?” I suspect that it probably is a significant challenge. :)

  5. Who says that it would be easy? :) At least I tried to throw them some bones to make it easier for them. I was aware of the personalization features enabled and tried to send as many and as clear signals as possible in the hope that Google would recognize the theme.

    The theme is by the way a very unique one. Have a look at the pictures in this Picasa album with images from several of those “types” of parties and tell me if you recognize a unique theme there that separates them from any other type of party :)

    http://picasaweb.google.com/Carsten.Cumbrowski/SceneParties

    Well, this is a good and fun example for testing things related to “themes” and/or detection of “named entities” based on past queries, clicks and pages visited and of course things like bookmarks and Gmail data, Picasa albums ( :) ) etc.

  6. Nice stuff as always Mr. B…. I am seeing more user data. A lot more directed at the ‘historical’ elements as well which is of particular interest. I also like the part looking at ‘query spam’ as I have spent some time this weekend looking at how one might find vulnerabilities in a User Performance Metric environment.

    I am also keen on collection methods. Originally I wasn’t as convinced on the ‘passive’ collection methods delivering a strong enough signal, much of that has changed. Outside of the many mentions of cookies/IP through many of these, there is the looming pervasiveness of Google Toolbars via FireFox which furthers the cause (and Google Pc et al).

    I also see a hint of the ‘terminology’ starting to take place… much to do…

    On with the day … talk soon.

    Dave

  7. Thanks, Dave.

    I think one of the biggest challenges of user data may be that there’s so much of it. Coming up with a sampling method or methods and helpful heuristics is one step towards making working with that volume of information feasible.

    The statements in the patent about query spam were interesting, and I suspect that more has been done in that area that we aren’t being informed about here, which is probably a good thing.

    I’m seeing a growth in the ability of Google to collect browsing and searching information.

    Another source of information that Google is collecting that they are now sharing with people is site search information in Google Analytics.

    Imagine a search engine collecting information like this: you search with a particular query at a search engine, click upon a result, browse pages on that site, and search the site in its site search. Can the search engine query logs, toolbar browsing activity logs, and Google analytics and site search records be tied together in some meaningful manner?

    It’s a lot of information, but is it helpful information? It might be.

  8. Carsten,

    That does look like an interesting party — lots of people and lots of computers.

    I’m wondering whether a search engine would pick up on some of those themes if the pictures had tags, titles, descriptions, and comments, as well as something like geotagging. You do have a group title (Scene Parties) but not much else in the way of textual content.

  9. Hi Bill,

    One thing I don’t like about Google Picasa is that it does not show the file name of an image. You can get to it, but it is buried. It would be nice to be able to see it right in the detail and even the gallery, if no tags were provided. The name and some other info should also be shown (at least as tool tip) when you hover over a thumbnail in the “organize gallery” screen.

    The file names are for the most part very descriptive and often have the name and year of the event in the name itself. There is no way to tag a larger number of images quickly, especially within a single set, where it makes most sense to provide a mass edit feature. I would go and tag more images if the tags were stored in the image itself so that they could be reused by other services like flickr too. There is some space to store some information. I noticed for some images that it showed things like the camera used for the picture etc.

    Well, that is enough “off topic” stuff hehe.

    This type of party can’t be experienced anywhere in the US. Pictures are nice, but you have to see it to believe it in real life. It’s crazy and fun :) (okay, was fun, the last demoparty I attended was “The Party 1998” in Herning, Denmark). It’s for geeks, but I don’t deny being one :)

  10. Hey Bill,

    Each time I visit your blog my mind is left spinning. The user data that Google has access to (by strategically creating products that obtain it) and the investment that they make in understanding and utilizing it is astonishing.

    Tools like Google Trends provide limited access to the masses. We get to play with a barely-surface-scratching interface of invaluable information about user intent. If any entity on the planet understands the interests, desires, needs of the world’s citizens it is Google.

    Data like that, backed by the intelligence to interpret it is a license to print money.

  11. Hey Lindsay,

    It’s good to see you. Scary as it is to say it, I wonder if Yahoo has even more user data than Google. There are a lot of folks who use Yahoo portal services in addition to search and their social applications. Yahoo Buzz (used to) show off some of what they are watching on a regular basis. And Ask.com also has their eyes on Trends and user data.

    Know what people are searching for, and where; I think that you’re right in it being like a license to print money. Selling ads pales in comparison with what could be done with the intelligence being collected by the search engines.

    Carsten,

    It would be interesting to see what kind of improvements could be made to Picasa, especially with some of the technology that Google acquired from companies like Neven Vision. Guess we need to wait for some of that.

    I’ve been to a few barcamps, which seem to share a spirit with a demo party, but not the proliferation of computers (though I’m guessing that most of the people who I saw at those had at least one computing device on them, though most may have been smart phones).

    While in San Jose at the 2006 Search Engine Strategies Conference, there was another conference in town for people using wearable computing devices. They were doing a lot of fun things, like going on treasure hunts with mobile computing devices, and receiving new clues when they found the locations of older ones – with images and maps on their handhelds.

    It looked like they were having a lot of fun.

  12. Pingback: Hot or not? Does Google really use a new ranking algorithm? | Fenetre Marketing Blog
  13. Pingback: Search Engine Query, Parsing Improved « The Blog at Secret Search Engine Lab
  14. Wow, it’s funny that just three years ago we were talking about how Google uses search queries to determine what’s viral. Ahaha.

  15. Hi Brian,

    I think the wider the range of data that is available to the search engines, the better for determining what kinds of things might be viral.

    While it’s possible to look at information from microblogging networks like twitter to find similar information, one of the good things, at least from Google’s perspective, about using query logs is that Google has much more access to information about searches and may be able to more easily apply more filters to that data to get rid of noise.
