Social Trustrank and User Annotations as Anchor Text

Imagine taking information from social networking sites like Digg or Del.icio.us, about users’ likes and dislikes concerning web pages. Add to that their personal interests mentioned in profiles from places like Myspace or Facebook, and their relationships with other people who may also have created social profiles. Throw in descriptions used in tags and annotations that they may have left about pages, and the timestamps attached to those tags and annotations. Now, use that information to influence the rankings of a search engine.

Yahoo offers a wide range of personalized services, many of which allow their users to tag or annotate things they find on the web, create profiles, and build and define relationships with others. Can that information be used to rerank web pages in search results, and also identify whether some pages are more trustworthy than others?

User-Added Content and Interactions on the Web

Can user-added content, in forms such as tagging, user annotations, and recorded user behavior, improve the relevance of search results?

Might adding someone as a friend in Flickr, or participating with someone in a Yahoo Fantasy Football league help shape the search results you see? Could the rate at which annotations and tags are added to pages influence which pages are seen as more popular, or less, and cause changes in rankings?

A series of patent applications from Yahoo explores that topic. Several of these were published at the end of 2006, along with some related older filings. I’ll be making a series of posts that explore these patent applications from Yahoo.

The first one I’m looking at includes a detailed explanation of why Yahoo is exploring this approach. But first, an explanation of what this method may add to ranking pages.

The Nature of the Problem

The document starts with a synopsis of how a search engine typically indexes web pages. Here’s a summary.

Search engines don’t usually search the web directly, but rather search through an inverted index. Several steps are taken to create an inverted index; one of them is tokenization.

A content item, such as a web page, is broken into a list of words. This process can be more or less complex based upon the language in which the page is written. For example, tokenizing Chinese text is more difficult than tokenizing English text, since word boundaries in Chinese are not marked with spaces.

Another step is known as normalization, which may involve stemming or morphological analysis, in which plural endings and other suffixes may be removed. Again, this can be more complex for highly inflected languages. Stop words may also be omitted.
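
The tokenization and normalization steps can be sketched in a few lines. This is a deliberately crude illustration: real systems use full stemmers (such as the Porter stemmer) and language-specific tokenizers, while this sketch just lowercases, splits on non-alphanumeric characters, strips a plural “s”, and drops a tiny made-up stop-word list.

```python
import re

STOP_WORDS = {"the", "a", "an", "of", "and", "or"}

def tokenize(text):
    """Break text into lowercase word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def normalize(tokens):
    """Crude stemming: strip a trailing 's'; drop stop words."""
    out = []
    for tok in tokens:
        if tok in STOP_WORDS:
            continue
        if len(tok) > 3 and tok.endswith("s"):
            tok = tok[:-1]
        out.append(tok)
    return out

print(normalize(tokenize("The bridges of the bay")))  # ['bridge', 'bay']
```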

Every occurrence of each word may then be recorded in the inverted index. This process of transforming a content item from original form into a set of entries in an inverted index is known as indexing. An inverted index is a data structure which consists of a table of lists.

Each entry in the table can be accessed by a unique word, and each item in the list for a given word indicates a content item, or page, in which that word occurred. These items are called postings. Postings contain identifiers for the content item containing the word, and possibly also additional information about how often or where the word appeared in the page. Posting lists are the lists of content items (such as web pages) containing a word.

When someone enters a query into a search engine which employs an inverted index, the query is broken into words in much the same way that the system processes web pages. It will look in the table to find the posting list for each word. Each posting list represents the set of content items containing the word.

If the user’s query is interpreted as a Boolean AND, then the intersection of the sets for each word is computed.

If it is interpreted as a Boolean OR, then the union of the sets is computed.
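
The inverted index and the two Boolean interpretations above can be shown in a minimal sketch. The index maps each word to its posting list (here, a plain set of document ids); an AND query computes the intersection of the posting lists and an OR query computes their union. The documents are invented for illustration.

```python
from collections import defaultdict

def build_index(docs):
    """Map each word to the set of doc ids containing it (its posting list)."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index

def search(index, words, mode="AND"):
    """Intersect posting lists for AND, union them for OR."""
    postings = [index.get(w, set()) for w in words]
    if not postings:
        return set()
    op = set.intersection if mode == "AND" else set.union
    return op(*postings)

docs = {1: "golden gate bridge", 2: "bridge toll", 3: "golden retriever"}
index = build_index(docs)
print(search(index, ["golden", "bridge"], "AND"))  # {1}
print(search(index, ["golden", "bridge"], "OR"))   # {1, 2, 3}
```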

In most search engines, a relevance score is computed for each candidate content item in the result set, and only the top-scoring candidates are retrieved. Many factors may go into a relevance score, including:

  • The frequency of occurrence of the query words,
  • Their statistical distinctiveness, and
  • Properties of the web page such as its modification date.
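
The first two factors in that list are the heart of classic tf-idf scoring: how often a query word occurs in the document, weighted by how rare the word is across the collection. Here is a textbook-style sketch of that computation; the formula and numbers are illustrative, not the weighting any particular engine uses.

```python
import math

def score(query_words, doc_words, doc_freq, num_docs):
    """Sum tf * idf over the query words present in the document."""
    total = 0.0
    for w in query_words:
        tf = doc_words.count(w)              # frequency of occurrence
        if tf == 0 or doc_freq.get(w, 0) == 0:
            continue
        idf = math.log(num_docs / doc_freq[w])  # statistical distinctiveness
        total += tf * idf
    return total

doc = "golden gate bridge golden".split()
doc_freq = {"golden": 2, "bridge": 1}   # how many docs contain each word
print(score(["golden", "bridge"], doc, doc_freq, num_docs=4))  # about 2.77
```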

In addition to the ranking features used in traditional search engines, web search engines also rely on information based on the connectivity of the page, such as the number of pages linking to it, in determining the relevance score of a search result.

Existing search engine indexes might not capture the precise wording of a user’s query, raising questions about the relevance of pages in search results.

Disinformation, or irrelevant results, may cause a lack of trust in the web pages shown in search results.

The patent application tells us that we need new sources of information on which to base and rank searches. These could be used alone or together with existing searching and ranking techniques, leading to more reliable search results for users.

In addition to new sources, new techniques are needed to index this information…

The Patent Application

Using community annotations as anchortext
Invented by Daniel E. Rose, Jianchang Mao, Zhichen Xu, David Ku, Qi Lu, Eckart Walther, and Chung-Man Tam
US Patent Application 20060294085
Published December 28, 2006
Filed: August 2, 2006

Abstract

This invention is directed towards systems and methods for using community annotations to content items as anchortext for search and index purposes. In one version, the method comprises generating one or more items of personalized information by a user for storage in a user profile, the one or more items of personalized information associated with one or more content items, the one or more content items and the one or more items of personalized information comprising one or more words. One or more items of personalized information is selected from a given user profile. The method further comprises indexing the words in the one or more content items and the words in the selected personalized information into an index, identifying one or more content items responsive to one or more query words in a query of the index, and returning the identified content items as a result set to the user.

User Profiles, and Personalized Information as Anchor Text

This patent application attempts to improve the reliability of search results by incorporating a user’s actions, and possibly the actions of a social network of users.

One or more user profiles are created, collecting personalized information that describes a searcher’s interactions with pages. Those interactions may include things such as:

  • saving pages,
  • annotating pages,
  • tagging pages, and
  • other user interaction with content items.

Personalized information may be treated similarly to other information about a page for indexing, searching, and ranking purposes. So, as the title of this patent application notes, annotations and tags may be treated similarly to anchor text from a web page.

That personalized information, like anchor text, includes descriptive text. It differs in that it is created by individuals other than the author of the page (of course, links from other sites are usually created by others, too).

Some other differences are that personalized information can provide descriptions, opinions and alternate forms of references (including spelling and word form variations) which may not be found on the original page.

Here are examples of personalized information from user profiles being used to improve the indexing, searching, and ranking of content items:

  1. When a searcher saves a page for the first time, the text of the page (including any metadata) is added to a search engine’s inverted index;
  2. Any relevant personalized information is also indexed, and treated as separate fields of content from the page; and,
  3. When other searchers later save the page, it is not re-indexed, but relevant personalized information from the additional users is added to the inverted index.
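
Those three steps can be sketched as follows. The field names (“body”, “annotation”) and the single shared index layout are my own illustration, not structures from the filing: page text is indexed once, while each subsequent user’s annotations keep flowing into a separate annotation field without re-indexing the page.

```python
from collections import defaultdict

index = defaultdict(lambda: defaultdict(set))  # word -> field -> doc ids
indexed_pages = set()

def save_page(doc_id, page_text, annotation):
    if doc_id not in indexed_pages:             # step 1: first save indexes the page text
        for word in page_text.lower().split():
            index[word]["body"].add(doc_id)
        indexed_pages.add(doc_id)
    for word in annotation.lower().split():     # steps 2-3: annotations are always added
        index[word]["annotation"].add(doc_id)

save_page(1, "official bridge toll schedule", "golden gate fares")
save_page(1, "official bridge toll schedule", "bay crossing prices")  # page not re-indexed
print(sorted(index["golden"]["annotation"]))   # [1]
print(sorted(index["crossing"]["annotation"]))  # [1]
```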

Benefits to Searching an Index Which Includes Personalized Information

  1. Relevant pages may be located even though the pages don’t contain the exact wording or spelling in a user’s query.
  2. Relevance scoring and ranking of pages is improved, providing more relevant results to users.
  3. Personalized information may be aggregated and indexed according to communities or social networks of users, which can enable community-aware searches.
  4. Aggregate personalized information (from one or more user profiles), may also be used to rank search results according to community-based features exposed by the personalized information of individual users.

An example of that last one:

Ranking may be influenced by usage information drawn from the personalized information in user profiles; it may be based on reputation or trust values for the information contained in individual user profiles or groups of user profiles, or on reputation or trust values propagated through social networks of related users.

Reputation or Trust Values

Reputation or trust values may also be propagated through implicit and explicit social networks.

1) An explicit social network is an explicit association between interconnected individuals, e.g., where a first user identifies an explicit relationship with one or more other users.

2) Implicit relationships in social networks, by contrast, may be defined between two users based upon personalized information in the two users’ profiles.

Pages and personalized information (the pages a user tags, annotates, or saves, along with those tags and annotations) may be made available for searching in real time.

Streamed Search Queue

One phrase I’ve seen for this, not mentioned in the patent application, is a “stop the press” index, which is used in addition to an inverted index.

An inverted index, which may be a word-location index, is generated for a collection of pages. As users provide personalized information, the information is added to a stream search queue, which provides for direct access to the information.

Information from the stream search queue would be indexed and written to the inverted index after a threshold is exceeded, which may be a time threshold, quantity threshold, etc.

When a user conducts a search, the system may search over the information in both the inverted index and the stream search queue to identify content items that fall within the scope of the user’s query.
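
The “stop the press” arrangement can be sketched like this. Fresh personalized information sits in a stream queue that is searched directly alongside the inverted index, and the queue is flushed into the index once a threshold is exceeded; the quantity threshold and data structures here are illustrative.

```python
FLUSH_THRESHOLD = 3

inverted_index = {}   # word -> set of doc ids
stream_queue = []     # (doc_id, words) pairs awaiting indexing

def add_personalized(doc_id, text):
    """Queue new personalized information; flush once the threshold is hit."""
    stream_queue.append((doc_id, text.lower().split()))
    if len(stream_queue) >= FLUSH_THRESHOLD:
        flush()

def flush():
    """Move queued entries into the inverted index."""
    while stream_queue:
        doc_id, words = stream_queue.pop()
        for w in words:
            inverted_index.setdefault(w, set()).add(doc_id)

def search(word):
    """Search both structures, so fresh items are visible immediately."""
    hits = set(inverted_index.get(word, set()))
    hits.update(d for d, ws in stream_queue if word in ws)
    return hits

add_personalized(1, "great bridge photos")
print(search("bridge"))  # {1}  -- found in the queue before any flush
```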

Social Network

One version of a social network is created by looking at relationships between individuals – friends, acquaintances, or family members. The search engine provides a place for people to create a profile, and identify others with whom they have a relationship.

Community based features contained within the user profiles may be used in a number of specific ways to influence the ranking of content items contained in search results.

For example:

Raw numbers – the raw number of users who save, annotate, or tag a given page; the more users who save a given content item, the more likely it is an important page.

Proportion of interaction – rather than looking at raw numbers of users who have saved, annotated, or tagged a given page, the system might look at what percentage of visitors to that page interacted with it in some way.

Recent usage – Calculating how recently a user has saved, tagged or annotated a given page such that the ranking component assigns a higher rank to recently annotated, tagged or saved pages.

Reputation or trust – the reputation of the users providing annotations and tags, or saving pages. Pages that are saved, annotated, or tagged by high-reputation users are assigned a higher rank than those saved by users with lower reputations. A reputation-weighted average could be used, in which ranking starts not with a raw count of the number of users who have saved, annotated, or tagged a content item, but with the sum of the reputation scores of each of those users.
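
A reputation-weighted count is easy to sketch: instead of ranking by how many users saved a page, sum the reputation scores of those users. The user names and scores below are made up for illustration; note that the low-reputation page wins on raw count but loses on weighted score.

```python
reputation = {"alice": 0.9, "bob": 0.2, "carol": 0.8}

saves = {
    "page-a": ["alice", "carol"],      # two high-reputation savers
    "page-b": ["bob", "bob", "bob"],   # more saves, but low reputation
}

def weighted_score(page):
    """Sum the reputation of every user who saved the page."""
    return sum(reputation.get(user, 0.0) for user in saves[page])

ranked = sorted(saves, key=weighted_score, reverse=True)
print(ranked)  # ['page-a', 'page-b'] despite page-b's higher raw count
```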

Dual TrustRank

The Dual TrustRank technique takes advantage of two types of social structures that the search provider maintains:

  1. the link structure between pages, and
  2. the social network that interconnects users as identified by relationship information contained in the user profiles in the profile data store.

The links between the two structures are the pages that the users view (e.g., navigation and search history), save (e.g., bookmark or save to the search provider), rate, share, etc.

A Dual TrustRank value consists of:

  • a TrustRank value for users, and
  • a TrustRank value for content items, or the domains that host the content items.

A user’s TrustRank value may come from:

  1. Trust ratings for the user provided by other users, or
  2. Ratings from human experts, on the basis of the content items that the user is saving, or
  3. A trust rating calculated for the user based on how the pages that user saves are being used by other members with whom the given user maintains relationships.

The patent application goes into a lot of detail on how a Trustrank score may be calculated based upon how people use and interact with pages. It also can look at the freshness of those interactions, based upon timestamps related to tagging, annotating, and saving pages. The relationships between people in the social network, and the ways that they interact with pages may also help to inform the search engine as to which pages should be more trusted.
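
Propagating trust through an explicit social network can be sketched with a simple iteration: each user’s trust is a blend of a seed value and the average trust of their declared contacts, repeated until the values stabilize. The damping factor, seed values, and update rule below are a generic sketch in the spirit of TrustRank, not the patent’s exact formula.

```python
friends = {"a": ["b", "c"], "b": ["a"], "c": ["a", "b"]}
seed = {"a": 1.0, "b": 0.0, "c": 0.0}   # "a" is a trusted seed user
DAMPING = 0.5

trust = dict(seed)
for _ in range(20):  # iterate until (approximately) stable
    trust = {
        u: (1 - DAMPING) * seed[u]
           + DAMPING * sum(trust[f] for f in fs) / len(fs)
        for u, fs in friends.items()
    }
print({u: round(v, 3) for u, v in sorted(trust.items())})
```

Trust flows outward from the seed user: users directly connected to trusted people end up with higher values than those further away.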

Conclusion

You may have heard of Trustrank from a paper titled Combating Web Spam with TrustRank (pdf). That paper is worth looking at again, within the context of the addition of using individual and aggregated user annotations and tagging and other information as described in this patent application.

A followup paper that criticized the TrustRank method in the “Combating Web Spam” paper is also pretty interesting, and you may also want to explore it in the context of the process from this patent application: Topical TrustRank: using Topicality to Combat Web Spam

In May of 2005, Greg Linden made an interesting point about Trustrank in a post about the first paper I mention in this conclusion:

The paper describes a manual process for determining the seed set of trusted sites. I’m curious what we’d find by instead analyzing user behavior. For example, we could consider websites used over the past month by trusted people to be trusted. That is, trusted sites would be the sites the community uses and trusts.

That looks more than a little like where this Yahoo patent application is coming from. I’ll be exploring this more fully with posts about some of the related Yahoo patent applications.

15 thoughts on “Social Trustrank and User Annotations as Anchor Text”

  1. Bill,

    What a very unique and well laid out concept. I do have two questions that gravitated to my mind while reading:

    1. What about spam?

    I know of many people who are not what they appear to be online, and there are many people who like a niche topic (like SEO) but have little experience or actual understanding of the topic. So, what do they have to offer any other user other than a likelihood that they have no true understanding? The user who is searching, rather than going to relevant resources, is at a loss, because the wrong site is higher in the index. In my opinion, this would reinforce ignorance on many topics.

    2. Would this be better suited for a social search engine rather than a search engine based upon link popularity, content and relevancy?

    User intent is a hot topic. I know it really gets me thinking about the “big brother” concept; however, this could be applied to influence user intent. My thoughts are that if you are a member of an online community or social community, your intent can be more aligned with other members of your community, so what is good for Larry and Curly will be more in line with what Moe is looking for. This is a great concept; however, implementing it on a grand scale like the world wide web would mean that the amount of information an engine stores on users would be immense. Once again, this leads to the fact that if someone is storing relevant data about a user or their activity, it would be extremely valuable. What happens if that information is hacked, leaked, or subpoenaed by a court?

    So, in closing, is the idea of a socially based algorithm a good idea?

  2. Exceptional post Bill!

    Not just for the information regarding the patent but also for the elegant and easy-to-digest summary on the current state of document ranking.

    Concise, clear and comprehensible! Keep up the great blogging.

  3. Thanks, shor.

    Good to see you here.

    Very good questions, Steve.

    I’m hoping that my future posts on the related patent applications will add more to the discussion, and help with answers to those questions. They will look more at annotations, aggregated annotations, annotations from trust networks, explicit and implicit social networks, syndication, real-time indexing, bloom filters, inverse searches, and search results that only show up for authorized searchers.

    Part of the idea behind trustrank is to add user intent to find good pages, and push bad ones back. Hopefully the framework for that will push less relevant pages back further in rankings, even where the niche categories may be smaller ones.

    The amount of information would be immense. It may not be in a form that makes it easy to subpoena, or at risk if hacked or leaked. Definitely worth thinking about, though.

  4. I have similar concerns as Steve.

    First of all, the SE has to make sure the data it collects is matched to the right person. On the web, that may be hard to do (unless you are only using the Yahoo network) if several people use the same nicknames or names, or have a similar circle of friends. How easy would it be to mix them up, and how easily could that lead to misinformation and decreased result quality for both persons?

    Not to mention that personal information can easily be forged and/or manipulated by spammers. Unless social networks can protect themselves from spam, it’d be easy to pretend to the SE that a certain topic is on the rise, a website is popular, or a person is an authority in the industry.

    I certainly see how manual trust rank can be assigned to established websites to propagate the trustrank among other websites, but I can only see it as an addition to basic text search. Of course, Yahoo isn’t that good at this, so this may be a whole other problem here.

  5. I think I share some of your concerns, Yuri.

    I would venture to guess that the people working on a system like this would also share those concerns, and be aware of the potential problems behind such a system.

    I wrote today about Yahoo’s My Web 2.0 beta, and how it has only incorporated bookmarks and tagging, and not annotations. I don’t know if they are working on using annotations, or if concerns like the ones you have cited are causing problems.

    Regardless, the ideas involved are worth considering. I find it fascinating to see how search engines might be considering incorporating user input into ranking features.

  6. Yeah, I figured this too, but I haven’t noticed any concern about this in your quotes.

    After all, user-created things are the future of the Web, and the SEs will (hopefully) sooner or later grasp this field and learn to provide better results. It may lead to a new SE interface too, I guess (it has already started with the first spot box of Google).

  7. The extreme influence of backlinked anchor text in the SERPs of GYM opens the door for other means to democratically poll society for their interpretations of any given web page.

    Social sites, Tags, bookmarking etc. offers in essence, a larger statistical sample to ‘poll’ data from.

    Society is, in a sense, acting as directory editors.

    This also aids in personalization: if any users are part of the Yahoo network, their propensities will ultimately influence what ads they are shown and how biased the SERPs will be when they are offered.

    Yahoo’s MINDSET is a rudimentary example of this, as well as AlltheWeb’s LIVESEARCH.

  8. Social sites, Tags, bookmarking etc. offers in essence, a larger statistical sample to ‘poll’ data from.

    Exactly. I think they are moving in the right direction.
