Do Search Engines Hate Blogs? Microsoft Explores an Algorithm to Identify Blog Pages

A new Microsoft patent application has some interesting statements within it about blogs. First it tells us of the value of blogs and blogging:

Blogging has grown rapidly on the internet over the last few years. Weblogs, referred to as blogs, span a wide range, from personal journals read by a few people, to niche sites for small communities, to widely popular blogs frequented by millions of visitors, for example.

Collectively, these blogs form a distinct subset of the internet known as blogspace, which is increasingly valuable as a source of information for everyday users.

Then it goes on to tell us that search engines work to limit results from blogs in searches, and describes the difficulties that search engines sometimes have in identifying blogs:

Search engines are increasingly implementing features that restrict the results for queries to be from blog pages.*

The website www.blogcensus.net gives information on an effort to index blogs, though this was apparently discontinued in late 2003. At that time, the site stated that it had indexed 2.8 million blogs.

Currently, Technorati claims to be tracking 43.2 million blog sites. It is currently difficult for search engines to identify blog pages, regardless of the source of the content in a blog page.

* my emphasis

The patent filing is:

Identifying a web page as belonging to a blog
Inventors: Dennis Craig Fetterly and Steve Shaw-Tang Chien
United States Patent Application 20070294252
Published December 20, 2007
Filed: June 19, 2006

Abstract

A machine learning classifier is used to determine whether a web page belongs to a blog, based on a number of characteristics of web pages (e.g., presence of words such as “permalink”, or being hosted on a known blogging site). The classifier may be initially trained using human-judged examples. After classifying web pages as being blog pages, the blog pages may be further identified or categorized as top level blogs based on their URLs, for example.

In simplest terms, this patent application involves the use of a program that learns as it classifies pages as either blog pages or non-blog pages.
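
To make that idea concrete, here is a minimal sketch of a program that learns from human-judged examples, assuming a simple online perceptron. The patent application does not say which learning method is used, so the perceptron, the feature names, and the example data below are all my own assumptions for illustration.

```python
from typing import Dict, List, Tuple

Features = Dict[str, float]


class OnlinePerceptron:
    """Tiny online linear classifier: +1 means blog, -1 means non-blog."""

    def __init__(self) -> None:
        self.weights: Dict[str, float] = {}
        self.bias = 0.0

    def score(self, features: Features) -> float:
        return self.bias + sum(
            self.weights.get(name, 0.0) * value for name, value in features.items()
        )

    def predict(self, features: Features) -> int:
        return 1 if self.score(features) > 0 else -1

    def update(self, features: Features, label: int) -> None:
        # Adjust the weights only when the current prediction is wrong.
        if self.predict(features) != label:
            for name, value in features.items():
                self.weights[name] = self.weights.get(name, 0.0) + label * value
            self.bias += label


def train(examples: List[Tuple[Features, int]], epochs: int = 10) -> OnlinePerceptron:
    model = OnlinePerceptron()
    for _ in range(epochs):
        for features, label in examples:
            model.update(features, label)
    return model


# Hypothetical human-judged examples; the feature names are invented.
EXAMPLES = [
    ({"has_permalink": 1.0, "has_feed": 1.0}, 1),
    ({"known_blog_host": 1.0, "has_feed": 1.0}, 1),
    ({"has_feed": 1.0}, -1),
    ({}, -1),
]

model = train(EXAMPLES)
```

With this toy data, a feed on its own is not enough to label a page as a blog, because one of the hand-labeled non-blog examples also has a feed. A real system would need far more examples and features than this.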

Some of the things that it might look at while doing that can include:

(1) Where the page is hosted, such as MSN Spaces, Blogspot, Yahoo 360, LiveJournal, Typepad, Xanga, MySpace, Multiply, Wunderblogs, or some other known blog hosting domain;

(2) Words and phrases from the page that are commonly found on the pages of blogs, such as “permalink”, “blogroll”, “powered by”, “trackback”, “comment”, “comments”, “blogad”, and “posted at”, or similar terms, including non-English ones;

(3) The targets of outgoing links in the web page, such as links to WordPress.org, movabletype.com, or blogger.com;

(4) What shows up in the URL for the page that might indicate that it is a blog, such as “http://www.example.com/blog/”;

(5) Whether the web page contains an ATOM feed or an RSS feed.

Other characteristics of pages may also be looked at.
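
The signals listed above could be pulled from a page along these lines. This is only a sketch: the domain names, the shortened term list, and the feed check (looking for a feed MIME type in the HTML) are my assumptions, not details taken from the patent filing.

```python
import re
from typing import Dict, List, Sequence
from urllib.parse import urlparse

# Assumed domains for a few of the hosting services named above.
BLOG_HOSTS = ("blogspot.com", "livejournal.com", "typepad.com", "xanga.com")
# A subset of the terms named above ("comment" is left out as too common).
BLOG_TERMS = ("permalink", "blogroll", "powered by", "trackback", "posted at")
BLOG_TOOL_HOSTS = ("wordpress.org", "movabletype.com", "blogger.com")
# Assumption: a feed is advertised through its MIME type in the page HTML.
FEED_PATTERN = re.compile(r"application/(rss|atom)\+xml", re.IGNORECASE)


def host_matches(host: str, domains: Sequence[str]) -> bool:
    """True if host equals one of the domains or is a subdomain of one."""
    host = host.lower()
    return any(host == d or host.endswith("." + d) for d in domains)


def extract_features(url: str, html: str, outgoing_links: List[str]) -> Dict[str, bool]:
    parsed = urlparse(url)
    text = html.lower()
    return {
        "known_blog_host": host_matches(parsed.hostname or "", BLOG_HOSTS),
        "has_blog_terms": any(term in text for term in BLOG_TERMS),
        "links_to_blog_tool": any(
            host_matches(urlparse(link).hostname or "", BLOG_TOOL_HOSTS)
            for link in outgoing_links
        ),
        "blog_in_url": "blog" in parsed.path.lower().split("/"),
        "has_feed": bool(FEED_PATTERN.search(html)),
    }


sample = extract_features(
    "http://www.example.com/blog/",
    '<link rel="alternate" type="application/rss+xml" href="/feed/">'
    '<a href="#">Permalink</a>',
    ["http://wordpress.org/"],
)
```

Each feature is a simple yes or no answer here. A classifier trained on human-judged pages would then decide how much weight each answer deserves.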

The patent filing also attempts to identify or categorize whether a page that it believes to be from a blog is the “top level blog” or the main page for the blog, based upon the URL used. There’s no rationale given in the document for that determination.
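
Since the filing doesn’t explain how a URL marks a page as a “top level blog,” the following is purely my guess at what such a check might look like. It assumes the page has already been classified as a blog page, and the function name and the list of path segments are hypothetical.

```python
from urllib.parse import urlparse

# Hypothetical path segments that often name a blog's root directory.
BLOG_ROOT_SEGMENTS = {"blog", "blogs", "weblog"}


def looks_like_top_level_blog(url: str) -> bool:
    """Guess whether a blog page's URL points at the blog's main page."""
    parsed = urlparse(url)
    if parsed.query:
        # URLs such as /?p=42 usually point at a single post.
        return False
    segments = [s for s in parsed.path.split("/") if s]
    if not segments:
        # A bare host name, e.g. http://someone.example.com/
        return True
    return len(segments) == 1 and segments[0].lower() in BLOG_ROOT_SEGMENTS
```

Under these assumptions, “http://www.example.com/blog/” would count as a top level blog, while “http://www.example.com/blog/2007/12/post/” would not.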

Conclusion

Based upon the awkwardness of their language in writing about blogs, I’m wondering if the two authors of this document have ever blogged before. I searched a little for blogs from them, but didn’t find any. A blog home page as the “top level blog”? It might do them some good to follow Googler Matt Cutts’s lead, and actually blog for a while.

I hadn’t seen any statements from a search engine before, whether Google or Yahoo or Microsoft or Ask, that explicitly stated that they were working to “restrict the results for queries to be from blog pages.” Perhaps it wouldn’t look good for a search engine if the top results for many queries were all from blog pages, instead of pages that are less search engine friendly or less relevant or both.

If I were to take this site, and strip out common blog terms, remove the link to wordpress.com, rewrite my URLs so that they didn’t follow a typical wordpress pattern, and remove my feeds, would it be more or less relevant for the terms that it ranks for in search results? Should the relevancy of a page be determined by whether or not it is a blog post?

Do search engines hate blogs?

It’s really difficult to tell. I don’t think that any blogger should start removing indications that a site is a blog, though.

Most issues I’ve seen involving how a search engine might rank a blog usually appear to stem from other problems: for instance, duplicated content across multiple URLs on the blog (on pages for archives, categories, posts, feeds, and the main page); page titles and meta descriptions that aren’t unique enough; poor use of heading elements; and other impediments.
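
For the duplicated content problem, a rough self-check is to group a blog’s URLs by a fingerprint of their text. The sketch below is an assumption-laden illustration: it only catches exact matches after whitespace and case are normalized, the sample URLs and text are made up, and it says nothing about how any search engine detects duplicates.

```python
import hashlib
import re
from collections import defaultdict
from typing import Dict, List


def content_fingerprint(text: str) -> str:
    """Hash the text after collapsing whitespace and lowercasing it."""
    normalized = re.sub(r"\s+", " ", text).strip().lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def find_duplicates(pages: Dict[str, str]) -> List[List[str]]:
    """Return groups of URLs whose normalized text is identical."""
    groups = defaultdict(list)
    for url, text in pages.items():
        groups[content_fingerprint(text)].append(url)
    return [sorted(urls) for urls in groups.values() if len(urls) > 1]


# Made-up pages: a post that also appears in full on a category page.
pages = {
    "/2007/12/post/": "Hello  World",
    "/category/seo/": "hello world\n",
    "/about/": "About me",
}
duplicates = find_duplicates(pages)
```

Any group this turns up is a candidate for showing excerpts instead of full posts on archive and category pages.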


30 thoughts on “Do Search Engines Hate Blogs? Microsoft Explores an Algorithm to Identify Blog Pages”

  1. From the number of #1 ranked search results that belonged to bloggers, I don’t think Google, Yahoo or Microsoft hate blogs. It’s always something to do with SEO, content and traffic. Just my 2 cents worth!

    All the best!

    Regards,
    Derrick Tan

  2. Search engines are increasingly implementing features that restrict the results for queries to be from blog pages.

    I read that to mean displaying search results to be only from blogs. I.e. blog search.

    Maybe I’m misreading this, but if you restrict something to be from something else that something is exclusively made up of something else. If you get what I mean :grin:

    Hope all is well Bill, and many happy returns for xmas. Really great shooting the breeze with you in Vegas.

    Best rgds
    Richard

  3. I also interpreted it the way that Richard did. MSN Live has a feed: operator, for example. I know this because I read many top level blogs from the MSN folks. :)

  4. Hi Richard,

    It was great getting a chance to meet you at Pubcon. Thanks for the Christmas good wishes. I hope that your holidays are wonderful ones.

    I’ve read through the patent application a few times, and spent a few hours on it, and there just really isn’t anything about a blog search in it. One of my suspicions was that they were partially doing this for a blog search, but it really just focuses upon making a distinction between whether a page is from a blog, or not from a blog.

    It may have implications for a blog search down the road – relying upon feeds only for a blog search means that a blog search engine would miss information from blogs that only published partial feeds or excerpts.

    The statement though, seems to stand on its own, that there’s a concern about blog pages showing up in search results – and those would likely be web search results.

    I’ll probably be writing about diversification of search results in the very near future.

    Hi Derrick,

    It would just really be hard to tell whether or not a search engine might be ranking blogs a little less heavily, only because they are blogs. There are so many other potential signals that may be part of a relevance and quality determination.

  5. Hi Matt,

    This has more of the feel of an effort to be able to categorize a page as a blog so that, for instance, web results can be diversified. I’ll have more on that this weekend, hopefully.

    I read a few top level blogs from the MSN folks, too. :)

  6. This is completely ridiculous.
    They’re attempting to patent footprinting of scripts?
    Let me show you my effin LIBRARY of prior works.
    Besides, it’s not hard to eliminate these.

  7. Bill, really interesting article. I’m not sure what their intentions are here, but what about all of the full-blown websites out there that use WordPress and the like as Content Management Systems? Would the spiders consider those entire websites as ‘blogs’ or would they be able to parse out the blog components of the site? The former strikes me as a very slippery slope…?

  8. One can always hope that Microsoft is aiming towards developing a better blog search.

    7. The system of claim 1, wherein the at least one feature comprises whether the web page contains an ATOM feed or an RSS feed.

    If they can take identifying the presence of something RSS-like further, and filter out scraped splogs, they’ve got my vote.

    1. Is it on a blog?

    2. Is it duplicate content?
    a. On the same site
    b. On multiple sites
    b-1. If on multiple sites, which page originated first?
    b-2. Do some copies contain a large amount of outgoing links, especially to unrelated topics?

  9. Hi Bill,
    My question was the same as David’s, above, after reading this.

    I also think it’s interesting that what’s come up over at Sphinn about this is the concept that blogs are getting an unfair advantage in the SERPs. But I wonder – it’s an old SEO mantra that updating content frequently on any site, any static site, is a good idea. Instead of updating our own site as frequently as we used to, all new articles are published via our blog, as it did indeed seem that they were getting indexed a few days quicker than regular static pages.

    But, on the other hand, before we had a blog, we weren’t publishing a couple of new articles a week. More like once a month or so.

    I wonder if anyone has done a test to see a comparison between indexing and ranking of a static article every day vs. a blog post every day.

    Happy Holidays, Bill! Thanks for the oodles of wonderful information you’ve given us all this year at SEObytheSEA!
    Miriam

  10. Hi SlightlyShadySEO,

    Good to see you here.

    You raise some good points. This is still a pending patent application, and it may or may not make it to become a granted patent. I guess the interesting part of it technically might be that it uses machine learning to perform the task.

    But, frankly, that’s not what’s interesting about this to me. What makes it become important is when we ask why Microsoft is trying to patent something like this. Why is it important to come up with an automated method of identifying blogs, and distinguishing them from non-blogs? What other efforts might it work with?

    I have one part of the answer to that coming this weekend I hope.

    I agree that it might become easy to remove some of the indicia of a page being a blog. I’m not sure that it is necessary, and it may be beneficial in some instances to have your page identified as a blog.

  11. Hi David,

    I’m not sure what their intentions are here, but what about all of the full-blown websites out there that use WordPress and the like as Content Management Systems?

    That’s one of the things I wondered while I read through the patent filing. I’m presently building a site that will use WordPress as a CMS, and I’m not sure that I’m going to use any blog posts at all.

    Is it a good idea to focus on whether a page is a blog or not? Does it make a difference when someone writes article-sized blog posts on a blog, or writes blog-post sized articles on a static site? I’m questioning the assumption that blog versus non-blog is a good delineator.

    Could the Microsoft people be more concerned with form over substance here? When Princeton professor Edward Felten writes blog posts at Freedom to Tinker, it’s possible that they contain some of the best information available online about issues related to computers and privacy.

    Using blog software as a CMS is going to make it more difficult to distinguish blog from non-blog, but is it a good idea in the first place?

  12. Jordan,

    Content is content. Who the hell cares if it’s in a blog posting or static page? That’s lame.

    That’s part of why I questioned whether the folks who wrote this one had ever blogged before. When I perform a search, I want a good result. If it comes from blog posts instead of article pages, I have no problem with that.

    In fact, I think I might prefer it these days – folks who write blogs are often more likely to allow you to comment, and to respond to those comments.

  13. Liz,

    Thanks. I like asking questions like that when reading through something like this. The patent may just be aimed in another direction, but there could be collateral benefits to developing a process like this, and I think you’ve defined a possible side-effect or offshoot that it would be really good to see them develop.

    I’ve wondered why we don’t see a blog search from Microsoft.

  14. Thank you very much, Miriam.

    It’s been a fun and extremely interesting year, and I really appreciate the comments and questions that you, and other folks have left here. So, thanks in return. :)

    The Sphinn thread has developed in an interesting direction.

    I wonder if anyone has done a test to see a comparison between indexing and ranking of a static article every day vs. a blog post every day.

    An interesting experiment. I think one of the benefits of a blog is that it may be easier to focus upon creating the content than the container for it. But there are HTML editing tools that make it easy, too.

    I guess one of the advantages that a blog might have is that the homepage might be more likely to change more substantially than a site that adds an article daily. A search engine that monitors changes to determine how frequently it should spider may note those changes to the main page more easily, and set a higher spidering frequency.

    My post from a couple of weeks ago, Google Patent on Anchor Text and Different Crawling Rates, discusses some ways that different crawling rates might be calculated based upon frequencies of change.

    Since the homepage of a blog could have a new excerpt or full post placed upon it daily, with an older excerpt or full post being pushed off, that may be seen as a more substantial change than perhaps a static page with a daily new link to a new article, and the increased amount of new content appearing on the blog page front page may result in a blog being recrawled more frequently than a non-blog. Pure speculation on my part, but not unreasonable – worth testing. :)

  15. Great post Bill. I’ve actually turned away people who call and are interested in ranking their blogs. I’ve always found blogs don’t rank that well as their content is always changing so quickly. They also have inherently bad keyword densities. I hope it will be separated from the main SERPs, or give a way to increase the importance of blogs in their sliders.

  16. Bill, I completely agree with you that this is NOT a good distinction to make; I was just trying to point out that WordPress/blogging platforms work perfectly well as “static” websites.

    I actually think for certain kinds of topics that there is ALREADY plenty of value to coding everything “by hand” rather than using a CMS because it allows so much more flexibility wrt internal anchor text and linking patterns. So I think that whatever bias is inherent against blogs for those topics should be more than enough to keep them from ranking, if that is the end desire on the part of the engines. But you are absolutely right that blogs are often the BEST source of information for newer, more rapidly changing topics.

    I hope that Richard and Matt Cutts are right; that the point of this is only to separate rather than denigrate blogs in the SERPS. At the very least blogs will have a fighting chance if the search engines are consistent with how they present what is blog content and what is “static” content.

  17. This is crap from Microsoft (so what’s new?).
    One of my news websites runs WordPress, and is a Google News source with content provided from credible organizations and individuals. Who is Microsoft to treat me like a blog when I am actually a content website?

    The bottom line is I don’t care about what Microsoft does. They may very well lower the rank of ALL WordPress-based content sites, but it just doesn’t matter. Ask any webmaster which engine gives the most traffic.
    Googlebot, I believe, reads the content of the page (difficult) rather than determining what technology it is built on (even I can write a spider to do this; it tells me nothing about the page and can easily be tricked).

  18. Hi Sajal,

    Please, think about this patent more as an idea that Microsoft could follow, rather than one that they absolutely will follow.

    The idea doesn’t appear to be to lower the rank of all blogs, or wordpress-based websites, but rather to be able to try to understand if a page belongs to a blog, so that if Microsoft wants to try to show a diverse set of search results to people (see the trackback above your post for a link to my post on diversification of results), they stand a better chance of presenting diverse results.

    A highly ranking blog post would still rank highly if Microsoft implemented this and the patent filing on diversification of results, but it might be more likely because of these patent filings that the top ten search results would be a mix of blog posts, news articles, and web pages.

    It’s possible that while some blog posts might move down in rankings, under diversification, it’s also just as possible that some might move up.

    The idea isn’t to penalize, but rather to diversify.

  19. Daryl,

    I like working on blogs, but I agree with you that the philosophy behind optimizing them can sometimes be different.

    I hope it will be separated from the main SERPs, or give a way to increase the importance of blogs in their sliders.

    I don’t know if Microsoft will ever return to trying to use sliders, like the ones that they had hidden in the old MSN. But, the patent application on diversification of results does hint at the possibility of a searcher having some control over what kinds of results show on a search results page.

    I’d be interested in Microsoft attempting to create a blog search.

  20. David,

    Excellent points. Static content, whether hand coded, or through the use of an HTML editor may sometimes allow a more finely grained control of content and linking on a site.

    It can take some work to use wordpress as a “static” site, with the use of a number of different page templates if necessary. But, I like the ability to quickly generate new content and make edits, by people who are focused more upon content creation than coding.

    I hope that Richard and Matt Cutts are right; that the point of this is only to separate rather than denigrate blogs in the SERPS.

    I think that’s the point behind this – being able to make a distinction between blog page, and non-blog page, so that if there is a need to make a search results page more diverse, some blog pages listed might be pushed down while others might be pushed up in results.

  21. Doing something like creating a better blog search would at least be beneficial to Microsoft’s attempts at search. I can see where identifying certain types of blogs, like MFA blogs, might be beneficial for more relevant search results but I do not see where there is an abundance of these sites getting any traffic from search engines.

  23. I think it is logical to separate the two. Also, if you look at blog search alone for keywords like ‘homes in california’, you will find that there are many blog posts about it but it is not necessarily the objective of the query.

    Therefore, SE must limit number of results from blogs compared to other sites based on keyword relevance.

    Rajat

  24. Are the MSN guys referring to all the search engines or just MSN?
    I am not sure about this from what they are saying.

  25. Google is always telling us they love content-based sites. As blogs are based around content, how can they be disliked? Blogs give to the online community by populating it with useful, interesting, unique content.

  26. Hi Pete,

    Good question. One thought I have might be that the ease of publishing to a blog has created a dramatic increase in the number of people publishing to the Web. Being able to distinguish between a blog, and a site that isn’t a blog, may allow a search engine to filter the kinds of sites that are shown in search results so that searchers can decide whether they want to see content from sites that are blogs, sites that are news, sites that engage in ecommerce, and others.

    Another use for this patent filing, that could be helpful, is if the search engine decided that it wanted to display a diverse mix of results to searchers, from different kinds of sites. So, instead of just showing blog results, or news results, or ecommerce results, the search engine could display a diverse mix of results.

    So, I’m not sure that it’s a dislike for blogs that fueled this patent application from Microsoft as much as it was an ability to distinguish between the types of sites they are indexing.
