On Personalized PageRank and Personalized Anchor Text Scores

Last week, I made a post introducing a newly granted patent from Google, Personalizing anchor text scores in a search engine (US Patent 7,260,573) which was filed in May of 2004.

In the midst of the Search Engine Strategies Conference, I didn’t have a chance to delve too deeply into the patent. I am returning to it, and to the context in which it was filed and granted. The Mad Hat has a nice overview of the processes involved in Personalized Anchor Text Score.

Let’s look at a little of the history, and some of the papers and ideas around at the time that it was filed.

The Role of Kaltix in Personalizing PageRank and Page Rankings

The inventors listed in the patent are Taher Haveliwala, Glen Jeh, and Sepandar Kamvar, and all three came to Google when the stealth startup that they founded, Kaltix, was acquired by the search engine in late 2003. Kaltix was started as a spinoff of the Stanford Personalized PageRank Project.

The trio were working on a way to speed up the calculation of PageRank so that a personalized PageRank could be calculated for each searcher. But a personalized PageRank wasn’t their only goal. They also wanted to use other methods to reinforce that personalization, and this patent aims at taking advantage of the meaning of text in the anchor portion of links pointing to pages.

Gord Hotchkiss posted parts of an interview with Marissa Mayer at Search Engine Land in February in which they discussed aspects of personalized search at Google. One of the topics of the discussion involved the use of technology developed by Kaltix in Google’s approach to personalization, and the reasons why Google acquired Kaltix:

The reason we were really interested in them was: one, because they really grasped and cogged all of Google’s technology really easily; and, two, because we really felt they were on the cutting edge of how personalization would be done on the web, and they were capable of looking at things like a searcher’s history and their past clicks, their past searches, the websites that matter to them, and ultimately building a vector of PageRank that can be used to enhance the search results.

Marissa also noted that their methods of speeding up a PageRank calculation were of interest to Google. Some of the ideas behind that process may be found in papers from people involved in the Stanford project and Kaltix:

There are a good number of other papers involving PageRank that are worth a look mentioned on the pages of the Stanford Personalized PageRank Project, including Topic Sensitive PageRank.

Patents Related to Personalizing Anchor Text Scores

We also see the names of the Kaltix crew on a couple of other patents granted to Google, which I wrote a little about.

My first post on one of those is Stanford’s New PageRank Patent (Methods for ranking nodes in large directed graphs), which appears to incorporate ideas from the “Block Structure” paper that I link to above. It is mentioned in the Personalized Anchor Score patent.

In April of 2006, I wrote a post on Google’s adaptive pagerank patent, which looks at Adaptive computation of ranking. It appears to focus upon the processes described in the “Adaptive Methods for the Computation of PageRank” that I also link to above.

While those patents and papers focus upon increasing the speed, and decreasing the computational expense of calculating PageRank, this patent on personalized Anchor Text adds the element of hypertext analysis to personalization. It mentions personalized PageRank, and also the possibility of a “Block Structure” calculation of PageRank.

There is another patent filing cited (and incorporated by reference) in this present patent: “Anchor Text Indexing in a Web Crawler System.” It is presently unpublished, and unavailable from the USPTO, but it was filed on July 3, 2003 and was given U.S. patent application Serial Number 10/614,113. It sounds like it might provide some ideas on how anchor text is incorporated into a relevancy determination for document ranking purposes. Might be worth keeping an eye out for.

The problem addressed in the Personalized Anchor Text Score patent

PageRank by itself attempts to use the link structure of documents in a search engine database to compute global “importance” scores for those documents, which help influence the order in which documents are presented to searchers in search results.

But PageRank looks at the existence of the links themselves, while ignoring that those links often, but not always (as in the case of image links), contain text that describes the destination webpage of the link.

That text is commonly referred to as anchor text, and the authors of the patent tell us that it “often provides a more concise and accurate description than the destination web page itself and therefore can be used in determining the relevance of the destination web page to a particular query.”

They also provide the following snapshot of how Google worked at one point in time:

It is noted that the Google search engine, as of late 2003, determines the position of a document in a set of search results as a function of the PageRanks of the documents in the search results, the query terms, the documents in the search results, and the anchor text of links to those documents.

The question that the patent raises, and attempts to answer, is whether a way can be devised to pay attention to the specific personal preferences of a searcher when ranking pages, in a computationally feasible manner. The patent inventors tell us that the use of PageRank and the relevance factors cited may not provide optimal results, “attuned to a user’s personal preferences.”

The creation and use of Personalized Anchor Text Scores

Page importance scores determined
Anchor text indexed
Multiple databases used in processing a query
User information database
Page importance ranking
Limitations on source pages
Personalized anchor text score

Here is a walk-through of a number of the processes described in the patent:

Page Importance Scores determined

Web pages are crawled and then processed by a content indexer to produce a set of inverted content index entries for the content that appears upon pages.

A page importance ranker computes the document’s page importance score (possibly the document’s PageRank, which is based upon the PageRanks of the documents linking to that document). The page importance score is stored in a database.

Other scoring systems could be used, such as scores from another link analysis or page importance determination methodology.
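To make the idea of a page importance score concrete, here is a minimal sketch of the textbook PageRank power iteration. It is not taken from the patent. The function name `pagerank`, the damping value of 0.85, the iteration count, and the tiny example graph are all my own assumptions for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Textbook PageRank by power iteration (illustrative sketch only).

    links: dict mapping each page to the list of pages it links to.
    """
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    scores = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page receives a uniform share from the "random jump".
        new_scores = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page passes its score on, split evenly across its links.
                share = damping * scores[page] / len(targets)
                for t in targets:
                    new_scores[t] += share
            else:
                # Dangling page (no outlinks): spread its score over all pages.
                share = damping * scores[page] / n
                for p in pages:
                    new_scores[p] += share
        scores = new_scores
    return scores


# Hypothetical four-page graph: "c" is linked to by three other pages.
links = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
scores = pagerank(links)
```

In this toy graph, page "c" ends up with the highest score because the most pages link to it, and page "d", which nothing links to, ends up with the lowest.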

Anchor text indexed

An anchor text indexer generates an inverted anchor text index from links in each page received by the server, including text surrounding those links.

The links and text are extracted from pages, and recorded to identify:

  • the source document,
  • the target document associated with a link, and
  • the anchor text associated with the link.

An inverted anchor text index is generated where anchor text terms are mapped to the documents that are the target of the corresponding links. That index could be merged with or incorporated in some manner with the inverted content index.
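An inverted anchor text index of this kind can be sketched in a few lines. This is my own illustration, not the patent's implementation: the function name `build_anchor_index`, the whitespace tokenizing, and the example URLs are assumptions. I also keep the source pages alongside each target here, since the patent records them for each link.

```python
from collections import defaultdict


def build_anchor_index(link_records):
    """Map each anchor text term to the target pages of links using that term.

    link_records: iterable of (source_url, target_url, anchor_text) tuples.
    Returns {term: {target_url: {source_url, ...}}}.
    """
    index = defaultdict(lambda: defaultdict(set))
    for source, target, anchor_text in link_records:
        # Naive tokenizing: lowercase and split on whitespace (an assumption).
        for term in anchor_text.lower().split():
            index[term][target].add(source)
    return index


# Hypothetical link records extracted from crawled pages.
records = [
    ("http://a.example/", "http://sat.example/prep", "SAT prep guide"),
    ("http://b.example/", "http://sat.example/prep", "SAT practice"),
    ("http://c.example/", "http://img.example/", ""),  # image link, no anchor text
]
index = build_anchor_index(records)
```

Note that the image link contributes nothing to the index, because it has no anchor text terms to map to its target.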

Multiple databases used in processing a query

Upon submission of a query, the request is sent to a server-side query processor to respond with search results. That query processor checks with multiple databases to identify pages satisfying the search query, and it determines how to order those search results.

Those databases may include:

  • the inverted content index,
  • the page importance scores database, and
  • the inverted anchor text index.

The inverted content index may first return a set of pages containing the query terms used, and the query processor sends the same query to the inverted anchor text index to find another set of pages satisfying the query. It is possible for a document to appear in both sets of documents.

The two sets are sent to the page importance scores database and ordered according to their page importance scores and possibly their query-dependent relevance scores.
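Here is a rough sketch of that query flow, under some assumptions of my own: both indexes are plain dictionaries from a term to a set of page IDs, a page must match every query term, and ordering uses the page importance score alone. The name `candidate_pages` and the sample data are hypothetical.

```python
def candidate_pages(query, content_index, anchor_index, importance):
    """Look the query up in both indexes, merge the hits, order by importance."""
    terms = query.lower().split()

    def pages_matching_all(index):
        # Assumption: a page qualifies only if it matches every query term.
        sets = [set(index.get(t, ())) for t in terms]
        return set.intersection(*sets) if sets else set()

    from_content = pages_matching_all(content_index)
    from_anchors = pages_matching_all(anchor_index)
    # A page may appear in both sets; the union keeps it once.
    merged = from_content | from_anchors
    return sorted(merged, key=lambda p: importance.get(p, 0.0), reverse=True)


# Hypothetical data: p3 never mentions "prep" on the page itself,
# but links pointing to it do.
content_index = {"sat": {"p1", "p2"}, "prep": {"p1"}}
anchor_index = {"sat": {"p1", "p3"}, "prep": {"p3"}}
importance = {"p1": 0.2, "p2": 0.9, "p3": 0.5}
results = candidate_pages("SAT prep", content_index, anchor_index, importance)
```

The point of the second lookup is visible in the sample data: page p3 is found only through the anchor text of links pointing to it.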

User information database

The search engine may also have a user information database, which includes personalization information for searchers, often referred to as a user profile.

Under this system, a user can also submit user information, which may take the shape of an actual user profile or a set of parameters specific to each user which includes their background and preferences.

Instead of a user database containing this information, it is also possible that the user information could be submitted to the server together with the search query, or submitted separately from the query.

Some or all of the user information could be derived from previous search queries and by the pages in the search results that the user chooses to view or use. That personal information may be stored in a user information database and associated with a unique user ID. Or it could be received with each search, and not retained by the search engine for subsequent searches.

Page importance ranking

User information can be used to compute personalized page importance scores (personalized PageRank) for at least some of the documents returned by the crawler. A Page Importance Ranker generates a page importance score for each page, and user-specific (personalized) page importance scores for some of those pages.

The concept behind computing personalized page importance scores is that the Page Importance Ranker boosts the page importance scores of pages that are deemed to match the user-specific parameters, which in turn boosts the downstream pages linked to those documents.

In other words, the Page Importance Ranker boosts the page importance scores of documents of each host whose URL matches one or more of the user-specific parameters. It’s possible that a page may be deemed to match (or not match) user-specific parameters solely based on the URL of the document.

More than the URL may be looked at in this process. The user-specific parameters may also be matched against the content of the page, and/or the anchor text of links to that page.

If the document is deemed to match the user-specific parameters in a user profile (e.g., if the URL of the document includes any of URL keywords in the user profile), the document is assigned a personalized page importance score specified by a parameter in the user profile. For example, the user profile may specify for each URL keyword a particular page importance score adjustment that is to be applied to matching documents.

If a page matches more than one URL keyword, a larger page importance score adjustment may be applied to the page. A user’s profile may specify the adjustment or assignment of personalized page importance scores in other ways, too.
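A small sketch of the URL keyword matching described above. The patent says only that an adjustment specified in the profile is applied, so adding the adjustments together is my assumption, as are the function name `personalized_importance`, the profile layout, and the example URLs.

```python
def personalized_importance(url, base_score, url_keyword_boosts):
    """Adjust a page importance score using URL keywords from a user profile.

    url_keyword_boosts: dict of lowercase URL keyword -> score adjustment.
    Assumption: adjustments for all matching keywords are simply added,
    so a page matching more keywords receives a larger adjustment.
    """
    lowered = url.lower()
    adjustment = sum(
        boost for keyword, boost in url_keyword_boosts.items() if keyword in lowered
    )
    return base_score + adjustment


# Hypothetical profile for a user interested in college admissions.
profile = {"sat": 0.25, "college": 0.5}

both = personalized_importance("http://collegeboard.example/sat-dates", 1.0, profile)
one = personalized_importance("http://sat.example/", 1.0, profile)
none = personalized_importance("http://news.example/", 1.0, profile)
```

Plain substring matching is crude (the keyword "sat" would also match a URL containing "satellite"), which may be one reason the patent leaves room for matching against page content and anchor text as well.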

A personalized page importance score for a document only involves the document, a user’s specifically defined parameters, and the link structure through which the page is related to other pages. It is one signal amongst many other ranking factors, and is not related to the specific query being searched for by the searcher.
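One well-known way to get this "boost matching pages, and let the boost flow downstream" behavior is to bias the random jump in PageRank toward the pages a user prefers, as in Topic Sensitive PageRank. The sketch below shows that general technique. I can't confirm that it is the exact computation the patent has in mind, and the function name, parameters, and example graph are my own assumptions.

```python
def personalized_pagerank(links, preferred, damping=0.85, iterations=50):
    """PageRank with the random jump biased toward a user's preferred pages.

    links: dict mapping each page to the list of pages it links to.
    preferred: non-empty set of pages matching the user's profile.
    """
    if not preferred:
        raise ValueError("preferred must contain at least one page")
    pages = set(links) | {t for ts in links.values() for t in ts} | set(preferred)
    # The jump lands only on preferred pages, instead of uniformly on all pages.
    teleport = {p: (1.0 / len(preferred) if p in preferred else 0.0) for p in pages}
    scores = dict(teleport)
    for _ in range(iterations):
        new_scores = {p: (1.0 - damping) * teleport[p] for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                share = damping * scores[page] / len(targets)
                for t in targets:
                    new_scores[t] += share
            else:
                # Dangling page: send its score back through the biased jump.
                for p in pages:
                    new_scores[p] += damping * scores[page] * teleport[p]
        scores = new_scores
    return scores


# Hypothetical graph: the user's preferred page "home" links down to x, then y.
links = {"home": ["x"], "x": ["y"], "y": [], "other": ["z"], "z": []}
personal = personalized_pagerank(links, preferred={"home"})
```

In this example the pages downstream of "home" pick up personalized importance, while "other" and "z", which are not reachable from any preferred page, get none.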

It may be possible, though, that past searches and page choices in results may indirectly affect a document’s personalized page importance score, because they can become part of a user’s profile (or user defined parameters).

Example:

If a user has submitted many queries related to the standard aptitude test (SAT), the server may update his user information and incorporate this information into the set of user-specific parameters.

It’s possible that the number of documents for which personalized page importance scores are stored may be limited because of storage issues or computational expense, and scoring might be done at the time of a search. The patent describes some of the implications of that approach that I’m not going to discuss.

Limitations on source pages

It’s possible that source pages listed for a page may be limited to pages satisfying a predefined requirement with respect to the search query. A version of this process might require that the anchor text of the link to the page contain at least one query term from the search. Or, that the anchor text contains all of the terms of the search query.

Or, all source documents may be included, regardless of whether the anchor text of the links to the respective page contains any of the query terms. The authors tell us that limiting the source documents to those whose links have anchor text that includes at least one query term is preferred – this ensures that only source documents with anchor text relevant to the search query are used to personalize the ordering of the documents within the search results.
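The three variations (at least one query term, all query terms, or no restriction) can be sketched as a simple filter. The function name `filter_sources`, the `require` parameter, and the sample links are hypothetical, and the whitespace tokenizing is an assumption.

```python
def filter_sources(source_links, query, require="any"):
    """Keep source pages whose anchor text satisfies the chosen requirement.

    source_links: list of (source_page, anchor_text) pairs for one result page.
    require: "any" (at least one query term), "all" (every query term),
             or "none" (no restriction on anchor text).
    """
    if require not in ("any", "all", "none"):
        raise ValueError("require must be 'any', 'all', or 'none'")
    terms = set(query.lower().split())
    kept = []
    for source, anchor_text in source_links:
        anchor_terms = set(anchor_text.lower().split())
        if require == "any":
            ok = bool(terms & anchor_terms)
        elif require == "all":
            ok = terms <= anchor_terms
        else:
            ok = True
        if ok:
            kept.append(source)
    return kept


# Hypothetical links pointing at a single page in the results.
sources = [("s1", "SAT prep guide"), ("s2", "SAT scores"), ("s3", "click here")]
any_match = filter_sources(sources, "sat prep", require="any")
all_match = filter_sources(sources, "sat prep", require="all")
unfiltered = filter_sources(sources, "sat prep", require="none")
```

Under the preferred "any" setting, the "click here" link drops out and the other two sources remain.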

Personalized anchor text score

Here’s how this score is calculated:

1) Source page information is extracted from the initial search results.

2) That source page information and the user profile of the user who submitted the search query are used by a Personalized Anchor Text (AT) Score Generator to generate personalized link analysis (LA) scores for the source pages that correspond to each respective document (D1, D2, etc.) in the initial search results.

3) The personalized LA score for a source page may be its personalized page importance score (e.g., personalized PageRank), or it may be a function of that personalized page importance score.

4) The personalized LA scores for the source pages are summed or otherwise combined by an accumulator, thereby producing an anchor text (AT) score for each respective document in the initial search results.

5) Results are then reranked by a search result ranking function, which combines the AT score and the information retrieval (IR) score of each document, producing a set of final personalized ranking scores or values.

6) The pages are then ordered in accordance with the personalized document rankings to come up with a final, ordered set of search results.
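The steps above can be sketched roughly as follows. The patent says only that the scores are "summed or otherwise combined" and that the AT and IR scores are "combined," so the plain sum and the weighted addition below are my assumptions, as are the function name `rerank`, the `at_weight` parameter, and the sample numbers.

```python
def rerank(results, sources_for, personalized_la, at_weight=1.0):
    """Rerank results using personalized anchor text (AT) scores.

    results: list of (document, ir_score) pairs from the initial search.
    sources_for: dict of document -> source pages linking to it.
    personalized_la: dict of source page -> personalized link analysis score.
    """
    ranked = []
    for doc, ir_score in results:
        # Accumulator: sum the personalized LA scores of the source pages.
        at_score = sum(personalized_la.get(s, 0.0) for s in sources_for.get(doc, ()))
        # Assumed combination: IR score plus a weighted AT score.
        ranked.append((doc, ir_score + at_weight * at_score))
    ranked.sort(key=lambda pair: pair[1], reverse=True)
    return ranked


# Hypothetical initial results: D1 starts ahead of D2 on IR score alone.
results = [("D1", 1.0), ("D2", 0.75)]
sources_for = {"D1": ["s1"], "D2": ["s2", "s3"]}
personalized_la = {"s1": 0.125, "s2": 0.25, "s3": 0.5}
final = rerank(results, sources_for, personalized_la)
```

Here D2 overtakes D1 because the pages linking to it carry more personalized importance for this particular user.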

Conclusion

A quick takeaway point – image links, and anchor text using generic terms like “click here” may not benefit from the boost that well chosen anchor text may.

Is Google using this personalized anchor text scoring to rerank results? It’s hard to tell. But they do seem pretty interested in making personalized search work well.

A timeline of some Kaltix and Google Personalization posts and news:


20 thoughts on “On Personalized PageRank and Personalized Anchor Text Scores”

  1. This is some research you compiled here. I don’t see how there is enough computing power to do a deep custom search for each search. Let alone that one algorithm can do it all.

    BeachBum

  2. Hi Beachbum,

    It does seem like it would be computationally expensive to do something like this. The original patent on PageRank filed in 1998 also discussed creating personalized PageRank for searchers, though there’s been some question as to whether or not that would be possible:

    Various implementations of the invention have the advantage that the convergence is very fast (a few hours using current processors) and it is much less expensive than building a full-text index. This speed allows the ranking to be customized or personalized for specific users. For example, a user’s home page and/or bookmarks can be given a large initial importance, and/or a high probability of a random jump returning to it. This high rating essentially indicates to the system that the person’s homepage and/or bookmarks does indeed contain subjects of importance that should be highly ranked. This procedure essentially trains the system to recognize pages related to the person’s interests. The present method of determining the rank of a document can also be used to enhance the display of documents. In particular, each link in a document can be annotated with an icon, text, or other indicator of the rank of the document that each link points to. Anyone viewing the document can then easily see the relative importance of various links in the document.

    It also discusses the use of anchor text in indexing pages:

    In addition, the search can include the anchor text associated with backlinks to the page. This approach has several advantages in this context. First, anchors often provide more accurate descriptions of web pages than the pages themselves. Second, anchors may exist for images, programs, and other objects that cannot be indexed by a text-based search engine. This also makes it possible to return web pages which have not actually been crawled. In addition, the engine can compare the search terms with a list of its backlink document titles. Thus, even though the text of the document itself may not match the search terms, if the document is cited by documents whose titles or backlink anchor text match the search terms, the document will be considered a match. In addition to or instead of the anchor text, the text in the immediate vicinity of the backlink anchor text can also be compared to the search terms in order to improve the search.

    It doesn’t bring up the idea of trying to personalize that anchor text analysis, which is the improvement that this patent brings.

    I think that the methods developed by the Stanford group and Kaltix to speed up PageRank calculations were important to work upon first, to even think about getting a personalized PageRank to work. I don’t know if we are there.

  3. Great analysis and I would have to agree about your statement of not being there yet. With each level of personalized detail it becomes exponentially more difficult to algorithmically rank each query. Currently I would think that this would be limited to a broader view of “categorical personalization” rather than a true level of personalization they discuss in the patent.

  4. Thanks, TheMadHat.

    Like most things, you have to start somewhere, and this patent was filed a few years ago. This process, if being used, is just one step towards personalization, but I suspect not the only one that they are trying.

    Google has moved from recording search history to web browsing history, for one thing. I think that they’ve gotten understanding the context of queries down a little better than back then, too. And, according to Marissa’s keynote at SES last week, people are using Google’s personalization features – so they have a chance to learn from user interaction. Interesting to see where they go from here.

  5. The technical lead of personalization at Google, Sepandar Kamvar, is on record as saying that Google wants to compute PageRank for every single person. How this is (or will be) achieved remains a secret as does how this is (or will be) factored into personalization.

    What is not a secret is that personalization in a primitive form exists now and is enabled by default in every Google account. For example, as we know, if a user has specified a default location in Google Maps then for relevant searches Google will personalize the results based on that location. It is relatively straightforward to determine if user specified preferences for the Google personalized homepage, Google News, Google Groups, etc., are used in a similar way.

    It is also possible to see how the SERPs are influenced by search history. Creating hundreds of separate ‘personas’ and constructing predetermined search histories for each gives an insight to the level at which Google is disambiguating search terms for example. You can even perform some elementary monitoring of an individual persona using the ‘Picks For You’ Toolbar custom button in the course of the experiment.

    However there is very little SEOs can do directly to influence a user’s Google account preferences or their search history.

    Indirect influence is another matter and this is where the technical SEOs will be thinking out of the box over the coming months. Of course before thinking out of the box it is necessary to understand what the box actually is and what it is made of :)

  6. Hi Michael,

    There are a lot of reasons to want to understand how Google might be using data gathered about us in presenting search results and advertisements and maps, and other results.

    One is that our search and browsing history is our data.

    In her keynote at SES last week, Marissa Mayer mentioned the possibility that someday the personalized version of Google might be the default version. It definitely bears paying attention to in the future.

  7. A question and a few points.

    1 – If Google make personalised ranking a more important part of their system how do you feel this will affect searches conducted by people with no ‘user data’? Will the user data of other Google users influence the SERPs as a whole?

    2 – Nobody I know has any association with Google’s services bar Google Analytics. I think that personalisation will only really be effective in technological industries where people are more likely to be using Google’s services.

    3 – I am with you on personalisation not being a major factor at the moment, but it is possibly the best avenue Google could take in an attempt to increase the quality of search results.

  8. Hi David,

    Nice question on the impact of user data upon those who don’t have much of a user history. It’s likely that Google already incorporates some aggregated user data into the search results that we see even if we don’t have the personalized Google turned on. The decisions on some of the Universal search results that we see, suggestions for spelling corrections and query refinements, what links are shown in Site Links for some top results – these are all things that are impacted by user data.

    Google does provide some services that people may use which aren’t necessarily for people in technological industries, such as Gmail, and the chance to have a personalized homepage. Those may be gateways for people to try out, and use personalized search. I can see that happening.

    Personalization is one avenue to attempt to increase the quality of search results, but there are others, and I think that Google is exploring many different options beyond personalization.

    Many of the specialized vertical searches make sense for people to use when they know that they want to focus upon a more narrow area than just a broad Web search, such as Local Search, Google Scholar, or Video Search.

    There are other features that they are working upon such as question answering, translation, news archives, and book search that offer potential possibilities to broaden our searches beyond our own languages, and beyond what we might find upon single web pages, or even on the Web.

  9. Thanks for your insights Bill. One thing Google will have to look at first is keeping me logged in. I am from the UK and the only Google service I use is Analytics. After I log in, I am logged in on google.com, but not on .co.uk.

    Another question popped into my head today – if you check your older posts – is:

    If Google can tell from my user data that I’m into SEO and all sorts of programming stuff, how might this affect a search I do for ‘Hopi Ear Candles’? Or would it not have any effect?

  10. Thank you for the interesting questions, David.

    I haven’t covered the Google patent applications for sign ons, but they are out there. :)

    That is odd that using analytics logs you into google.com and not google.co.uk. It’s possible that the analytics program is only located in one geographic area rather than multiple regions, and doesn’t recognize regional preferences in usage of the main search. That might be something to mention in perhaps the webmaster central part of Google Groups.

    A paper that I wrote about recently, co-authored by someone from Yahoo, focused upon understanding the context of queries, saying that there were some issues with personalization that couldn’t be easily solved.

    An example that they used was a computer scientist who often searched for computer related subjects through personalized search, and who decides to take a vacation on an island in the south pacific, and while there decides to find out about public transportation by searching using the term “java bus.”

    In response, all he gets for answers are pages involving programming, and some O’Reilly books – no bus schedules at all.

    If all you look at are internet marketing and computer related stuff, then that might influence some of your searches. But it’s likely that a search which isn’t ambiguous, like “hopi ear candles” won’t be affected.

    Without any user data that could be construed to be on that subject, you would probably initially see results unrelated to your query. Of course, followup searches might be influenced by those searches for therapeutic health treatments like hopi ear candles.

  11. Great example of how it could go wrong. I noticed you have Pandora Radio on your side-links, this service provides an interesting example of how simplicity can prevail in personalisation.

  12. Thanks.

    Yes – Music recommendation services are some of the earliest on the Web to try to come up with personalized services. With the limitation of being in a narrow niche, they have fewer issues than trying to do something like provide personalized results for Web searches. It is interesting to see what they provide when it comes to personalization.

  13. Hi Matt,

    It’s possible that Google is using some method similar to this in personalized search. It’s one of those things that can be difficult to tell on the surface, but definitely worth exploring.

  14. Interesting post, really.
    The potential rewards of personalized search are great, so are the hurdles to building a successful personalized search engine. The problem is, there isn’t a ‘one size fits all’ formula…

  15. Hi giorgia,

    I’m not sure that I would say that the concept that there isn’t a “one size fits all” formula is a problem as much as it is a challenge. We’ve seen the search engines try a lot of different approaches to personalization, and chances are that they will be experimenting on personalized search for years to come. It’s pretty interesting to track the things they are trying.

Comments are closed.