How Does a Search Engine Rank User-Generated Content on Web Pages?
The term “User-Generated Content,” often abbreviated as “UGC,” covers a fairly broad range of the words, images, videos, and sounds that you see and hear on the Web.
One thing that tends to distinguish User-Generated Content from other content on the Web is that visitors to a site, sometimes including a site’s owners, are the ones who help build the site and add to it.
User-Generated Content can include message boards and forums, wikis and product reviews, public mailing lists and Q & A sites, blogs and blog comments, podcasts, and other kinds of content.
Would you consider Twitter to be UGC? I would. When you visit Amazon.com and read reviews of books, music, and other products, you’re reading User-Generated Content. And Wikipedia, the human-created encyclopedia, is built upon User-Generated Content.
A patent application from Yahoo explores an approach for indexing UGC and including it in Web search results.
The inventors of the patent filing tell us that there’s beneficial information in places like reviews on product pages and other areas, but that search engines haven’t done a good job of surfacing it for searchers.
Why might search engines have problems ranking information found in User Generated Content? Here are three reasons we are told that “typical ranking mechanisms for ranking of a document in a web search, however, are unsuitable for ranking UGC:”
- User Generated Content tends to be fairly short,
- There usually aren’t links to or from UGC; and,
- Spelling mistakes tend to be common in UGC.
The patent application introduces us to three concepts that might be helpful in ranking User-Generated Content (UGC) so that it will show up in Web search results when useful. These concepts are:
- Document goodness,
- Author rank, and
- Location rank.
I’ll describe those in more detail below. The user-generated content patent application is:
Method and Apparatus for Rating User Generated Content in Search Results
Invented by Jaya Kawale, and Aditya Pal
Assigned to Yahoo
US Patent Application 20090271391
Published October 29, 2009
Filed April 29, 2008
Generally, a method and apparatus provide for rating user-generated content (UGC) concerning search engine results. The method and apparatus include recognizing a UGC data field collected from a web document located at a web location.
The method and apparatus calculate: a document goodness factor for the web document; an author rank for an author of the UGC data field; and a location rank for the web location. The method and apparatus thereby generate a rating factor for the UGC field based on the document goodness factor, the author rank, and the location rank.
The method and apparatus also output a search result that includes the UGC data field positioned in the search results based on the rating factor.
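To make that abstract a little more concrete, here is a minimal sketch of the final step in Python. Everything about the combining function, including the weighted geometric mean and the weights themselves, is my own assumption; the filing names the three inputs but does not disclose how they are mixed.

```python
# Hypothetical sketch of the patent's final step: combining the three
# component scores into one rating factor for a piece of UGC.
# The weighted geometric mean and the weights are assumptions, not
# values from the filing.

def rating_factor(document_goodness: float,
                  author_rank: float,
                  location_rank: float,
                  weights=(0.5, 0.3, 0.2)) -> float:
    """Combine three component scores (each in [0, 1]) into one rating."""
    wd, wa, wl = weights
    return (document_goodness ** wd) * (author_rank ** wa) * (location_rank ** wl)

# A strong post by a reputable author in a well-regarded location
# should outrank a weak post by an unknown author in a poor one.
strong = rating_factor(0.9, 0.8, 0.7)
weak = rating_factor(0.3, 0.2, 0.4)
```

A geometric mean has the property that a near-zero score on any one component drags the whole rating down, which fits the intuition that a spammy author shouldn’t rank well even in a good forum; a search engine might just as plausibly use a weighted sum or a learned model.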
The first step towards ranking User Generated Content is creating a score for document goodness of a review, or a blog post, or a forum post, or another piece of UGC.
Some of the attributes that a search engine might look at in determining document goodness can include:
- User rating (if available);
- Frequency of posts before and after a document is posted;
- Document’s contextual affinity with a parent document;
- Root of thread or subject;
- A number of page clicks/views for the document (if available);
- Assets in the documents such as images, links, videos and embedded objects;
- Length of the document;
- Length of thread in which document lies; and,
- Goodness of child documents (if any).
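Just to illustrate how a handful of these attributes might fold into a single score, here is a small sketch. The normalizing constants and weights are invented placeholders; the patent filing lists the signals without saying how they are combined.

```python
# Illustrative "document goodness" score built from a few of the
# attributes listed above. All weights and normalizers are invented
# for illustration; the patent names the signals, not the formula.
import math

def document_goodness(user_rating,        # e.g. stars 0-5, or None if absent
                      page_views: int,
                      num_assets: int,    # images, links, videos, embeds
                      doc_length: int,    # words
                      thread_length: int) -> float:
    score = 0.0
    if user_rating is not None:
        score += 0.4 * (user_rating / 5.0)
    # Log scaling gives diminishing returns on raw view counts.
    score += 0.2 * min(1.0, math.log1p(page_views) / math.log1p(10_000))
    score += 0.1 * min(1.0, num_assets / 5.0)
    # Very short posts score low; cap the length contribution.
    score += 0.2 * min(1.0, doc_length / 300.0)
    score += 0.1 * min(1.0, thread_length / 20.0)
    return score  # stays in [0, 1]
```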
The next step towards ranking UGC is to create an author rank for the creator of UGC. An author rank is a “measure of the expertise of the author in a given area.”
Attributes a search engine might consider in generating an author rank may include:
- A number of relevant/irrelevant messages posted;
- Document goodness of all documents initiated by the author;
- Total number of documents initiated or posted by the author within a defined time period;
- Total number of replies or comments made by the author; and,
- A number of groups to which the author is a member.
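The attributes above might be blended into an author score along these lines. The relevance ratio and the weights are my own assumptions for illustration; the patent names the inputs but not the arithmetic.

```python
# Illustrative "author rank" built from the signals listed above.
# The ratio, normalizers, and weights are assumptions, not taken
# from the patent filing.

def author_rank(relevant_posts: int,
                irrelevant_posts: int,
                avg_doc_goodness: float,   # mean goodness of author's docs
                replies_made: int,
                group_memberships: int) -> float:
    total = relevant_posts + irrelevant_posts
    relevance_ratio = relevant_posts / total if total else 0.0
    activity = min(1.0, replies_made / 100.0)
    breadth = min(1.0, group_memberships / 10.0)
    return (0.4 * relevance_ratio + 0.4 * avg_doc_goodness
            + 0.1 * activity + 0.1 * breadth)
```

Note the circularity: author rank depends on the goodness of the author’s documents, while a document’s rating depends in part on its author, so a real system would likely compute these iteratively or from prior values.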
The first two steps consider the User Generated Content itself, and the creator of that content. The third step looks at where that content is located, such as a message board or forum, or group, and provides a rank for the location.
Attributes that a search engine might take into account in ranking UGC involving a location rank for that content include:
- An activity rate in the web location, for example a number of documents posted per hour;
- A number of unique users in the web location;
- An average document goodness factor for the documents in the web location;
- An average author rank of the users in the web location; and,
- An external rank of the web location.
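Those location attributes might combine into a score something like the sketch below. Again, the normalization constants and weights are illustrative assumptions rather than values from the filing.

```python
# Illustrative "location rank" for a forum, group, or review section,
# using the attributes listed above. Constants are invented for
# illustration only.

def location_rank(posts_per_hour: float,
                  unique_users: int,
                  avg_doc_goodness: float,
                  avg_author_rank: float,
                  external_rank: float) -> float:  # e.g. link-based, 0-1
    activity = min(1.0, posts_per_hour / 10.0)
    community = min(1.0, unique_users / 1_000.0)
    return (0.2 * activity + 0.2 * community
            + 0.25 * avg_doc_goodness + 0.15 * avg_author_rank
            + 0.2 * external_rank)
```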
The user-generated content patent filing provides a few ways these three measures might be combined to help UGC rank for queries in Web search results.
As I was reading this, I wondered if signals like those listed here might account for whether or not we see UGC like certain Twitter posts for some search results, or reviews for some products, or other UGC in rankings.
It’s likely that if Google and Bing have been exploring how to rank UGC (and they probably have), they may be looking at some of the same signals.
59 thoughts on “How Search Engines May Rank User-Generated Content”
Looks like the search engineers have been playing with document meta data for a long time. They’ve been trying to understand the information about Web documents that can be inferred or extracted from sources that the SEO community normally ignores.
I always enjoy reading your posts, Bill.
Michael, I think that it would be interesting to somehow see/know how much of UGC is actually generated by folks in the SEO community and/or people who are paid to market something on some/many parts of the net. I’ve lost trust in more than a few sites over the years after reading UGC.
That being said, can fake/false UGC be detected through the use of an algorithm? There seems to be a buddy system that works quite well in many fields.
P.S. Kudos on the non-profit links, Bill. (just saw them) I’ve been supporting someone on kiva.org for a while now.
It depends on how badly people want to detect the faked user-generated content. I’m pretty sure all sorts of filters have been developed to help manage search engine crawling and link valuation. We’ll just never see that picture all in one piece.
So I am assuming this patented method is targeted specifically at UGC-heavy sites such as forums? I was assuming the SEs were already doing this, but that patent seems fairly new. And with forums as an example, would it be safe to assume that SEs are (or were) not putting much value on these UGC sites or the content itself? Does Google have an equivalent method too?
Good point. These are a slightly different set of signals than we are used to seeing when it comes to ranking content. I’m not sure that all of the attributes are appropriate to some of the different types of UGC that we see online, or that they listed all of the things they might look at in the patent filing, but it is interesting to see a different way of ranking some of this content, and thinking about how effective or ineffective it might be.
Thanks. There’s an incredible amount of UGC on the Web, and I would guess that only a very small percentage of it could be attributed to marketers. There are so many forums that have so little to do with marketing, and so much to do with giving people with shared interests a chance to have conversations.
I would guess that the development of any ranking system needs to consider the possibility that some people might attempt to manipulate results. While this one doesn’t explicitly discuss methods for identifying fake/false UGC, things like the “author rank” described above make it less likely that drive-by spamming in forums, on review sites, and in other places will carry much weight. A location rank may also mean that higher-quality sites where UGC is found carry more weight, which helps keep this kind of ranking from being manipulated as much.
Regarding Kiva, I just received my second repayment on a loan that I made to a machine shop in Peru, and it’s exciting to think that my small loan, along with money from others has helped someone’s business there. I’m looking forward to making some more loans.
Great stuff here. I too have been thinking about how they’ll rank user generated content; actually, I was thinking more about how one would “game” the ranking of user generated content from Facebook and Twitter. In the end, part of me feels like those who are pros and have a networked authority of supporters in social networks will be rated higher than those who don’t, and that is less about the quality of the message and more about how well you played the game of SEO. Over time, if exploited, abused, and mistrusted, user generated content will become invisible to end users, much like banner ads are today.
I agree with you that it may be possible for people to generate user-content that might impact rankings, but that there are likely filters set in place to try to stop that kind of activity. We aren’t given too much detail in this patent filing on those, and that’s probably a good thing.
The attributes that the patent filing tells us about look like they would probably work best with message boards, forums, and news groups, but usually a patent application provides enough detail to give a sense of what it does, without necessarily giving others a roadmap for developing something similar on their own.
All of the search engines have been crawling sites like forums for years, but this patent filing tells us about some ways that might make ranking content in places like forums better. For instance, if you have a forum thread that has a single post in it that is very relevant for a certain query, it might be hard for a search engine to distinguish the post from the rest of the thread, and the post might not rank very high for a search on that query. The approach in this patent might help make that single post rank better than it does under a ranking system that doesn’t consider single posts the way that this one does.
It’s possible that Google and Bing have been working on something similar. Both have published patents and whitepapers that discuss segmenting different parts of pages, so that things like individual forum posts, or restaurant reviews, or blog comments might be distinguished from other content on the same pages. I’m not sure that I’ve seen a set of ranking signals like this that would apply to those different segments from Google or Microsoft, but that doesn’t mean it isn’t something they’ve considered.
The Twitter-based service TipTop (http://www.feeltiptop.com) is based on the view that “goodness” (or usefulness) can be discovered from the UGC/data itself.
I think the use of microformats and their push for it speaks volume to their intent of understanding user interactions with the page.
For now, their signals may be limited to user ratings, reviews etc. Later on, I think they may be able to extract more data out of the page as RDF standards expand to allow for that.
Most of the time user generated content tends to be garbage. Many web sites have it, but it is not very reliable.
The fact that Yahoo filed the patent tells you immediately that it is worthless – they filed the patent right before they gave up on search engines.
I am wondering about author ranking. Considering factors like the retweet, which is overly used by some; autogenerated/scheduled tweets, where a producer of UGC uses a service to send out the same set of tweets during the course of the day; and the proper use of protocols/microformats, such as the hashtag or semantic protocols developed by a few, like the “has#” tag, can we find ways to assign value where many feel there is simply too much noise?

Retweets are great, but which author is assigned what value, and if your percentage of retweets is greater than original content, does this affect the score? For myself, I begin to tune out authors who lean too heavily on the use of other content. Scheduled tweets have a purpose, but overuse causes other users to ignore them, because they have read it before. When a new tweet is added to the cycle, they do not read it.

The hashtag is fairly common and understood by regular users, but semantic protocols for Twitter are not too commonly used; although, there are people interested in semantic, real-time search attempting to develop useful systems whereby UGC could teach a search engine definitions. I do think that search engines will find a way to determine this rank, but it may involve studying user behavior first. Clouding this calculation is the number of followers/friends a user acquires. Are the followers really interested in what is being said by the author, or do they just want a follower themselves to spread their message?
Thank you. I thought it was interesting seeing this approach from Yahoo, which does involve a different way of thinking about and ranking user generated content. I do believe that an element of quality will play a role in rankings, rather than how much one might attempt to manipulate the rankings of such content. If the search engines show low quality results for UGC, as you note, people will start ignoring it, and possibly start searching somewhere else.
We’ve been given some information from this patent, but there are aspects of that information that really haven’t been spelled out in much detail, and it’s hard to tell how many of the parts of this might be implemented, as well as policed for abuse and manipulation. But, we should consider that those things most likely won’t be ignored.
Thank you. I’ll have to spend some time with TipTop. It does look interesting.
Microformats are interesting when it comes to User Generated Content. We have heard from some of the search engines that they are interested in exploring how those can be used in places like reviews – Google mentioned that during their Searchology presentation earlier this year when discussing smart snippets. I’m not sure that we will see more with microformats in all kinds of UGC, but it’s worth keeping an eye open for their use.
I’ve seen some very useful User Generated Content in some places – for instance, reviews of books at Amazon have often been helpful to me. I think one of the things this patent filing mentions that is useful to keep in mind is that the process behind it will try to distinguish between sources, and rank some higher than others based upon things such as the quality of content that comes from different sources.
I’m not sure that it’s a good idea to discount the possible value of this post based upon business decisions made by Yahoo. They still do have some very smart people working on search, regardless of how they might proceed forward with Microsoft.
It’s interesting that this patent filing broke ranking down into multiple components: document goodness, author rank, and location rank. You raise some very interesting questions. How much value should be given to documents from an author who seems to have automated retweets? Or one that seems to schedule a set of the same tweets at certain times? Should people using tags be treated differently than people who don’t? And as you ask, “if your percentage of retweets is greater than original content, does this affect the score?”
Hopefully there are people at the search engines asking questions like this as they are trying to figure out what kinds of things they should look for in ranking this particular kind of UGC. I suspect that if we spent some time asking this question and looking at just Twitter, we could probably fill a few pages with things that might be worth asking and exploring further.
I agree with you that studying user behavior might be an important aspect of building some kind of ranking system that goes beyond just what words appear in UGC.
Great analysis. I think that almost 80% of the content on the web is either user generated content or recycled content. Packaging one piece of content in different formats and syndicating it to different sources is a matter of both ease and exposure. As a marketer, if I am not too smart and do not do that, someone else will. This may include niche and general article sites, PR sites, blog comments, social interaction platforms, and guest posts. Will search engines be extending this to every domain possible? Will my guest blog post pass PageRank because I am a trusted author, while Mr. X’s won’t because he is known for spamming and has been associated with low-level content platforms?
One place for sure is the local onebox results, which might get a lot of ranking and visibility benefit from favorable UGC.
I can see where HP is coming from. A lot of UGC is worthless. But there are exceptions…the comments on this page for example. A useful article supported by thoughtful comments. I’ve seen a lot of quality content in certain forums as well. But as HP says, you do have to wonder if Yahoo filed the patent…
Agreed, Twitter is definitely UGC. While much is useless some isn’t. Google and Bing, and others, will have to weed out the useless and put a value on the truly genuine. Seems impossible, but you never know…it might just happen some day. 🙂
I have found a lot of UGC to contain lots of outbound links (just look at your typical discussion forum, Digg news, and your typical tweets). UGC also tends to be more current and timely than other content. Again, the problem for search engines is how to filter out the junk content from the gems in UGC.
Most of the time, what you really find on Twitter is a bunch of junk, which is UGC, and I am really not happy with Google’s latest update to their algorithm that includes tweets. Frankly, I can see they are misleading a lot of the time.
UGC can be ranked in terms of interaction frequency or popularity or the trust given to the source, but it’s difficult to qualify the validity or to estimate the quality. So in a case where all other factors affecting ranking are considered equal, should some UGC help a page outrank another page with no UGC? Even if the source is unknown and the quality unknown? I hope they set a minimum threshold for “goodness” so it does not lead to manipulation.

Yahoo used to allow very negative comments in their Yahoo Local reviews, even from “anonymous” post-and-run commentators, but their legal department recommended removing the trash UGC after it started having a negative impact on the internet reputation of professionals. I think all the SEs started using the “was this review helpful” flag to eliminate these kinds of reviews, but it’s harder to identify bogus UGC from tweets and other areas.
Great content. I agree with Mal that UGC can be ranked in terms of interaction frequency or popularity, but I think SEO can help. Your blog is great; thanks for your informative post.
Thanks. I definitely agree – a lot of the information we find online is User Generated Content. It’s possible that a guest blog post associated with one author might carry a different amount of weight than that from the primary author of a blog. Google described something like this in their patent filing on AgentRank.
True – which is why a search engine would look at more than just the UGC itself, such as where it is published and what else is published there, and what else an author may have published in other places as well.
Attempting to rank UGC individually, by the forum post or blog comment or product review, instead of just by the page it appears upon with other UGC, is quite a challenge. Guess that’s what makes this interesting.
Hi People Finder,
There are some benefits and some perils to indexing UGC – Definitely filtering is a major issue.
I’m in agreement that there’s work that needs to be done on which tweets are being indexed and displayed. I guess this shows that search still is in its infancy.
I like the inclusion of “was this review helpful” in places like Amazon.com, but I question sometimes how useful it really is, and how much it should be trusted as well.
Thanks, Utah SEO
Interesting topic for sure, I’d like to respond to this part of your response to Dan:
“We’ve been given some information from this patent, but there are aspects of that information that really haven’t been spelled out in much detail, and it’s hard to tell how many of the parts of this might be implemented, as well as policed for abuse and manipulation. But, we should consider that those things most likely won’t be ignored.”
In the last line there, I’d like to share something I’ve noticed in regards to ranking in the 10-box (maps.google). It seems that listings on other local search directories that allow user reviews appear quite often in the top placements. While I’m sure Google won’t come out and say “yes, this is a ranking factor we think is groovy,” I find it interesting that this UGC from other local directories finds its way into the top maps.google listings.
Something that would make sense from a Google perspective is that any UGC source with ‘some’ sort of editorial functionality (ex: moderators being able to delete reviews or links) would lend an additional bump to the credibility factor, and thus be given a little more Google love.
Nice observation. I believe that Google does place more value on information published online with a level of moderation and editorial control over its publication. I think that might be a factor that could reasonably be included in the list of “location rank” factors above, which are an indication of the quality of the source where UGC might be published.
I think the ranking should depend on the search term in question. For quite a few searches, user generated content should have more importance; for example, when searching for product reviews of a particular item, I’d prefer a website to come up showing 100 badly spelled quick reviews of a product, rather than one magazine-style, professionally written review. You tend to get a better idea of the strengths/weaknesses of a product based upon 100 people’s experiences rather than just one person’s.
If you are right, I have been missing a huge part of my SEO knowledge. You are saying that the content itself is important. For example, an article about petting your dogs. You’re telling me that the quality of the article matters? Author, PR, sub-pages, etc.?

Up to now I have always assumed that Google does not care at all. No matter what is on the page, to Google they are words and only that. How will Google know which article has more quality? It doesn’t. How does it know what to rank? By regular SEO rules: keyword density, meta keywords, etc. I mean, if you’re concentrating a whole site on an article topic, you’d better have a good reason.

Or have I completely missed the point here? Cause it is quite late and I really shouldn’t think about internet marketing while I’m this sleepy. :p
That may be something that search engines have considered. It’s possible that search engines could classify query terms differently, so that one type of query might be treated differently than another when it comes to the kinds of results that show up. For instance, query terms related to restaurants might show more results that include review type user generated content than query terms related to U.S. History, for instance. Someone searching for “Rays NY Pizza” might like to see links to reviews of the restaurant. Someone searching for “Declaration of Independence” is probably less likely to want to see reviews or other UGC.
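A toy sketch of that query-classification idea might look something like this. The category terms are entirely invented for illustration; a real classifier would be trained on query logs rather than a hand-picked word list.

```python
# Toy query classifier: decide whether review-style UGC should be
# boosted for a query. The term list is invented for illustration.

REVIEW_INTENT_TERMS = {"pizza", "restaurant", "hotel", "review", "camera"}

def boost_ugc(query: str) -> bool:
    """Return True if results for this query should favor review UGC."""
    tokens = query.lower().split()
    return any(term in tokens for term in REVIEW_INTENT_TERMS)
```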
This patent isn’t the first one I’ve written about that says that a search engine may attempt to measure the quality of content that it finds on web pages. PageRank itself, which has been around more than a decade now, is supposed to be a measure of quality rather than relevance, though it doesn’t look to the quality of content on a page, but rather of the links pointing to that page.
We’ve been told, in a number of whitepapers from Microsoft about their RankNet algorithm, that they may be looking at more than 500 different aspects of the content on a page to determine the quality of content found there.
Google’s Agent Rank is another example of a patent from the search engine, where something like the “author rank” in Yahoo’s patent looks to information about the author of content found on the Web.
Signals that measure quality can cover a very wide range, from the length of a page, to how readable it is, to whether or not it includes phrases on the same page that appear to be related, to how punctuation is used, and many other signals. These kinds of things appear in a large number of patent filings and whitepapers directly from the search engines, which is why I find myself spending so much time looking at those sources.
In the post above, I’m writing about what the Yahoo patent itself says – I’m not making any of it up, so it isn’t a question of whether I’m right or wrong, but rather whether or not Yahoo and the other search engines have adopted the processes described in their own patent filings, or something very similar to them. If anything, these patent filings are great opportunities for us to learn about possible approaches from the search engines, so that we can test, observe, come up with new questions, and learn. 🙂
As far as I know, Google has never used “keyword density,” or “meta keywords” to rank the content they find on web pages.
At the moment, UGC is less trusted by the engines in the case of pages of UGC (example: forum posts). I am not sure if at the moment the engines treat UGC on the same page as other content differently, or even if they can identify it. I would imagine so.
The danger of giving authors credit and even an area of expertise is that it could possibly give scope to the idea of mimicking people who have authority. This would mean that an established name would need to be protected via a login, or by posting from a Twitter account or similar.
This would hold back newer authors and could possibly create an “old boys” style club within areas of content creation. Take a look at Digg: the items that hit the homepage are usually from the same bunch of people (power users). This makes the whole portal an uneven playing field and was the reason I stopped playing there.
Let’s hope that if this technology is brought in, it is done in a way that does not create cliques.
A lot of UGC seems to get indexed with the rest of the content found on the same pages, or is often treated as if it comes from the same pages as user generated content from other authors.
I agree with you that there are a lot of issues that would need to be worked out to make something like what is described in the patent work well. Interesting times are ahead of us.
The system behind search engines really can be quite complex…
I would definitely agree that Twitter is UGC, but what about something like Facebook? It seems at first that it is (to me, at least), but I don’t know; the site’s creators have provided a very structured framework for what gets created.
Definitely, search engine systems can be pretty complex. I guess the way that they might treat UGC could be considered fairly complex as well. The patent that I wrote about could be applied to a wide range of User Generated Content, but the features that they provided in the description of the patent filing might fit best with one type of UGC, such as Yahoo Message boards.
I would guess that a good way to start out with ideas around indexing User Generated Content might be to try to come up with a number of classifications, based upon features and attributes of that content, possibly with some examples of each.
We see some UGC as reviews on pages such as Amazon, where there is site-owner created content at the start of a page, descriptions from manufacturers/publishers/distributors in the middle of pages, and reviews and comments from visitors to a site.
Blogs tend to have more actual content from the owners of those blogs, but often also have comments and trackbacks, which can be considered UGC. Some blogs can have more than one person posting, and some group blogs like Metafilter actually have thousands of authors.
Forums and message boards are often a different flavor of user generated content.
Twitter and other micro-blogging type sites tend to be simple, but with things like replies to tweets and retweets, there’s sometimes a connection between those short posts that a search engine might want to consider.
Sites like Facebook and Myspace use a more structured format, and can contain a lot more features, but they still contain content mostly generated by visitors to those sites. I would consider them to be mostly UGC.
I’m just brushing the surface – there are a lot of different types of UGC.
Interestingly, Bill, I have run across UGC sites that have very little content per post and are amazingly successful. Most of the time it is little more than a funny or ironic picture or video, referred to as a “fail” by one site in particular, and it is very popular. The homepage, for what it is worth, is a PR7, and the visitors vote on the “fails,” in a manner of speaking. I would imagine that a search engine would use the title of the post and the integrated voting system or “user rating” you mentioned as a means to rank it, in addition to relevant responses, activity rate, UVs, etc.
There are a lot of UGC sites that offer only a limited amount of information, from microblogging and status update tools, to clipping and bookmarking services and more, and it is surprising how well many of those sites do.
This patent does describe some of the factors that a search engine might use to rank that kind of content, and reputation, voting, and “quality” signals of the source, such as PageRank, could play a role in how that content might rank. The patent gives us an idea of a framework that could be used, but it would probably need some flexibility to cover different uses of UGC, from reviews, blog comments, microblog posts, and question answering sites, to others.
I agree with you. If Google and all the search engines can find a way to separate the important information from the useless, then that would be beneficial to all.
I thought that UGC can be ranked in terms of interaction frequency or popularity or the trust given to the source, but it’s difficult to qualify the validity or to estimate the quality. So in a case where all other factors affecting ranking are considered equal, should some UGC help a page outrank another page with no UGC? Even if the source is unknown and quality unknown? I hope they set a minimum threshold for “goodness” so it does not lead to manipulation. Yahoo used to allow very negative comments in their Yahoo Local reviews, even from “anonymous” post-and-run commentators, but their legal department recommended removing the trash UGC after it started having a negative impact on the internet reputation of professionals. I think all the SEs started using the “was this review helpful” flag to eliminate these kinds of reviews, but it’s harder to identify bogus UGC from tweets and other areas.
Just a thought.
I think one of the reasons why the patent filing mentions so many different types of signals, related to the content, the author, and the page or place where it appears is to try to show that they wouldn’t give too much weight to any one factor, but would consider a range to try to determine “goodness.”
The “was this review helpful” feature is one that can allow other visitors and authors (with their own reputation scores) to help determine the weight that UGC might carry as well – though once again, it would just be one signal amongst many.
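As a rough illustration of treating helpfulness votes as one signal among many, here is how those votes might be smoothed and blended with an author’s reputation score. The Bayesian-average smoothing and the weights are my own choices, not anything described in the patent filing.

```python
# Illustrative weight for a single review, blending "was this review
# helpful" votes with an author reputation score. The smoothing and
# weights are assumptions for illustration.

def review_weight(helpful_votes: int, total_votes: int,
                  author_reputation: float, prior: float = 0.5) -> float:
    # Smooth the helpfulness ratio so a 1-of-1 vote record doesn't
    # dominate a 90-of-100 record (simple Bayesian-average smoothing).
    smoothed = (helpful_votes + 5 * prior) / (total_votes + 5)
    return 0.7 * smoothed + 0.3 * author_reputation
```

The smoothing pulls thinly voted reviews toward a neutral prior, so a review needs a sustained record of helpful votes before the signal carries much weight.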
Very interesting post. It’s been quite a while since this patent filing, do you happen to know the result of it by any chance?
Definitely, and Google seems to be doing a reasonable job at this, especially compared to the likes of Bing and Yahoo, who are way behind at the moment.
This is Yahoo’s patent, and Yahoo’s database is now controlled by Microsoft to a large degree. But I think it does a good job of illustrating some of the different signals and processes that any of the search engines may be following in looking at and ranking user generated content.
Comments are closed.