A couple of months back, when I was traveling, I wrote a quick post about a new PageRank patent issued to Stanford University and asked if anyone would be interested in trying to break it down to see if it had anything interesting in it. David Harry took a look in a post titled Tale of the two PageRank Patents.
David and I have been exchanging emails since then about some of the patents that we see, and one area that fascinates us both is patents that delve into a kind of behind-the-scenes personalization. He has recently written a couple of very thoughtful and interesting posts involving personalization at Google, which are worth checking out:
- I have seen the future and it is VERY personal
- Why Google Personalized Search is Important to You
Personalized Search Goes Beyond Google’s Personal Search
This isn’t the kind of personal search that you log into at Google, because chances are that it will happen whether you are logged in or not.
This isn’t the kind of personalization that focuses solely upon your Web History, because it looks at more than just your browsing and searching activities.
This isn’t the kind of personalization that relies upon which pages you have bookmarked in a Google bookmark program, though that kind of activity could be one of many signals that could be used.
Personalized Search Behind the Scenes
The kind of personalization that I’m talking about, and that Dave is talking about in his latest post, is the kind that could have a growing influence on what you see when you search, regardless of whether or not you are logged into Google.
I described something like it in a post about Microsoft using Personalization Through Tracking Triplets of Users, Queries, and Web Pages.
I wrote about it concerning Google in How Previous Searchers’ Queries Could Be Used to Re-Rank Your Search Results, which discusses how a search engine might expand your queries based upon other things that you may have searched for in the same session, using a language model centered around the queries themselves.
The kind of model created for queries could also be created for searchers, who could be clustered together with other searchers who seem to choose similar results for certain queries, as described in Google Patent Application Clustering Users for Personalization.
We are also seeing web site traffic profiles, and profiles created to understand sites at the internal site search level, as well as group profiles created for individuals searching at Google and browsing the Web, both of which I described in Google’s Profiling Both Users and Sites?
Understanding User Intent Through Statistical Models
We often talk about the ranking of Web pages with terms like PageRank or relevancy, meaning how relevant terms on a page might be to a query used by a searcher.
Many patent filings coming from Google refer to statistical models, like a probabilistic model that can learn about how words are related to each other, and how pages might be similar. Those models might tell us something about searchers.
A Google patent application that I’ve written about here before, but haven’t described in very much detail, provides some nice details on a model like that:
Method and apparatus for learning a probabilistic generative model for text
Invented by Georges Harik and Noam M. Shazeer
US Patent Application 20070208772
Published September 6, 2007
Filed: April 27, 2007
One embodiment of the present invention provides a system that learns a generative model for textual documents. During operation, the system receives a current model, which contains terminal nodes representing random variables for words and cluster nodes representing clusters of conceptually related words.
Within the current model, nodes are coupled together by weighted links, so that if a cluster node in the probabilistic model fires, a weighted link from the cluster node to another node causes the other node to fire with a probability proportionate to the link weight.
The system also receives a set of training documents, wherein each training document contains a set of words. Next, the system applies the set of training documents to the current model to produce a new model.
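The firing behavior described in that abstract can be sketched in a few lines. This is only a toy illustration, not the method in the patent filing: the cluster names, words, and weights are made up, and it assumes that each link weight can be read directly as the probability that a word fires when its cluster fires.

```python
import random

# Hypothetical toy model: cluster nodes link to word (terminal) nodes.
# Assumption: each weight is the probability that the word fires when
# its cluster fires. All names and numbers here are invented.
MODEL = {
    "programming": {"java": 0.9, "code": 0.7, "compiler": 0.4},
    "coffee": {"java": 0.6, "espresso": 0.8, "roast": 0.5},
}


def fire_cluster(cluster, model, rng):
    """Return the set of words that fire when `cluster` fires."""
    return {
        word
        for word, weight in model[cluster].items()
        if rng.random() < weight  # fires with probability `weight`
    }


def generate_document(active_clusters, model, rng):
    """Union of the words fired by every active cluster."""
    words = set()
    for cluster in active_clusters:
        words |= fire_cluster(cluster, model, rng)
    return words


if __name__ == "__main__":
    rng = random.Random(42)  # seeded so repeated runs match
    print(generate_document(["programming", "coffee"], MODEL, rng))
```

Learning would run in the other direction, adjusting the weights so that the model is more likely to produce the words seen in the training documents. That step is not shown here.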
The patent tells us of some possible uses of such a model:
1) It can be used to guess the concepts behind a piece of text, and those concepts can be shown to a user to allow them to better understand the meaning behind the text.
2) It can be used to compare words and concepts between a document and a query. As an information retrieval scoring function, that can be helpful when running a search engine.
3) It can be used to identify clusters of different concepts or meanings for a specific query. A search for “java” could result in clusters related to a programming language, or coffee, or an island of that name. If a search engine understands that such clusters exist, it could divide the results shown to a searcher among them, in proportions related to how many clusters there are and how large or small each one is. That could mean more diverse results for a search.
4) It can be used to compare the words and concepts between a document and an advertisement. This might provide an estimate of how well an advertisement might perform, and it might help decide which ads to show on which pages.
5) It can also be used to compare the words and concepts between a query and an advertisement, for ads that are served when certain terms are searched for at the search engine. So, a search for “java” might show many programming-related results, fewer coffee results, and perhaps some travel ads.
6) It’s also possible to compare the words and concepts between two documents, to see how similar or different they are, and to know how tightly they should be clustered, if at all.
A few others mentioned include a way to possibly filter some results, expand queries, and understand whether a particular word is a misspelling of another word.
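To make the third use above a little more concrete, here is a minimal sketch of splitting a page of results among query clusters in proportion to their size. The cluster sizes are invented, and the largest-remainder rounding is my own choice for illustration; the patent filing doesn't say how such a split would be computed.

```python
def allocate_slots(cluster_sizes, total_slots):
    """Split result slots across clusters in proportion to cluster size.

    Uses largest-remainder rounding so the counts add up to total_slots.
    """
    total = sum(cluster_sizes.values())
    if total == 0:
        return {name: 0 for name in cluster_sizes}

    # Exact (fractional) share of the page for each cluster.
    exact = {
        name: total_slots * size / total
        for name, size in cluster_sizes.items()
    }
    # Start from the whole-number part of each share.
    slots = {name: int(share) for name, share in exact.items()}

    # Hand any leftover slots to the largest fractional remainders.
    leftover = total_slots - sum(slots.values())
    by_remainder = sorted(
        exact, key=lambda name: exact[name] - slots[name], reverse=True
    )
    for name in by_remainder[:leftover]:
        slots[name] += 1
    return slots


if __name__ == "__main__":
    # Hypothetical cluster sizes for the query "java".
    sizes = {"programming": 70, "coffee": 20, "island": 10}
    print(allocate_slots(sizes, 10))
    # {'programming': 7, 'coffee': 2, 'island': 1}
```

With those made-up sizes, a ten-result page would carry seven programming results, two coffee results, and one about the island.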
How does this statistical model learn about such things?
The model is one part of the “behind the scenes” personalization that I wrote about above. The patent application describes how user query sessions and searching activities can be one way for such models to learn.
I’m going to return the ball to David for him to write some more on this kind of personalization, but I’ll have more on some of the other documents from Google that discuss how user data may be used by a search engine through models that look at probabilities.
David wrote at the end of his last post that he is going to be writing about things that you can do in response to this move towards personalized search. I’m looking forward to that post.
13 thoughts on “Google and Personalization in Rankings”
Not sure whether I’m more scared or fascinated by this. The SEO trade would certainly need to become far more advanced to deal with this sort of results segmentation.
Reading your post I checked my gmail and surprise…
I found Google Alerts responding to my old query: rotation number of a product of circle maps (it is some math).
Unfortunately, none of them is related to what I was looking for:
1) Is this the X-signal system which I have hypothesized, or just a product of our
2) Feature: Support rotation of rasters in GMaps* Comments: Wanted one
features in GE and GMaps
3) Lesbians Taking Showers
… agencies bang,molly circle of willis aneurysm in the head spears hilton no underwear photo denise richardsons playboy spread xl_lingerie erotic stories … gay ruff raw video [Sorry Sid, I edited this part slightly as it started to get explicit, including a link to an x-rated site that transformed into HTML upon posting] …
Anyway I’m a mathematician and I will study how personalization works.
It makes things interesting, doesn’t it, Richard.
There is more to support this kind of movement towards incorporating user behavior and statistical models into the way that queries are understood and served to searchers.
Nice to meet you. If you’re into math, you may find Google’s patent application on Predicting ad quality to be interesting. It describes a statistical model based upon a number of points of user data that could be used to determine the quality of ads.
Data mining and PageRank meet for a marketer’s dream and, quite possibly, an end user’s nightmare with personalized rankings. I suppose there are positives and negatives to this all the same.
It is scary how fast things are moving with Google and the internet in general. It wasn’t long ago that there were all kinds of irrelevant listings in Google’s results. Now there is personalization, which I suspect others are working on as well: I have noticed that the ads I get at Hotmail seem to be tailored to my interests at times, and it happens often enough for me to believe that it is more than a coincidence.
I think in a few ways, personalization presents some challenges to marketers, and potentially some benefits to searchers rather than the other way around. There are interesting times ahead, though, with those challenges.
I think that the movement towards recommendation and personalization is inevitable. Knowing your audience is a key to the success of many businesses, and I think that search isn’t an exception. Interesting news on Hotmail.
Correct me if I’m wrong, but Google already has a lot of these synonymy and polysemy retrieval technologies. It seems like they’re seeking to apply such technologies to their personalized search. FORCED personalized search, that is.
Really good question.
Truth is that we don’t know for certain what they are using.
There’s been some discussion in patent filings on Google using things like a probabilistic statistical model for a couple of years, but not a lot of discussion outside of those patent documents.
A sample chapter from “Google’s PageRank and Beyond: The Science of Search Engine Rankings” talks a little about polysemy and about probabilistic models.
A probabilistic model really does need the kind of information that you would get from query logs to work well.
Regarding your response to my previous comment —
You could be right. Personalization in rankings could potentially make searching the internet more efficient and less time consuming for the end user. I can see this as a definite plus, especially for people who use search engines as part of their day to day job. Thanks.