Indexing Recent Content in Search Engines

For many search queries, very recent search results (such as those from the last 6-12 hours) are preferred over older, staler results that rank well based upon popularity signals, such as significant past user traffic. Ranking on popularity may work fine if you think of search engines as a repository of pages that might be relevant as references, like a library.

But the Web is becoming a place where people frequently tweet and post social networking updates, news sources strive to be the first to publish about breaking topics, bloggers write about new topics, merchants offer new products and discount old ones, and other content appears with an emphasis on freshness. As a result, search engines are increasingly becoming a near real-time monitor of the world around us.

An old Linotype type setting machine that had possibly more moving parts when it was built than anything else.

A sign on the Linotype type setting machine above notes that it had more moving parts when it was constructed than anything else ever built by man. It didn’t produce fresh content that quickly either, but it was state of the art at the time.

Towards the end of last year, I wrote a post on the topic of Google’s New Freshness Update: Social Media Has Changed the Expectations of Searchers.

About a month before that, I wrote about how Yahoo might look to social media to discover new URLs on bursty and fresh topics in Do Search Engines Use Social Media to Discover New Topics?

With both Google and Yahoo exploring new ways of discovering fresher content for search results, that leaves us wondering what Bing might be doing in that area.

One of the things that I really like about Google’s search results is the ability to refine my search results to content from the past hour, past 24 hours, past week, past month, and past year, or a custom date range. Yahoo also offers the chance to filter searches by past day, past week, and past month. And even though Yahoo uses Bing’s crawling data, Bing doesn’t provide that kind of filtering by recent time periods.

A Microsoft patent granted this week discusses a strategy they may use to try to include more fresh content in their search results.

The process involves using an “in-memory” index in addition to Bing’s inverted index to return results from the search engine. The in-memory index would be updated during the course of a day, and would include fresher content than Bing’s inverted index of the Web. Content added to the in-memory index might be folded into Bing’s inverted index on a daily basis, or after some other set amount of time.

Searches would be responded to by the inverted index, and then the in-memory index would be checked for additional relevant results, which includes fresher content added during the course of a day. After that, the results returned would be ranked and would include very recent results if there are any.
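To make that query flow a little more concrete, here is a minimal sketch in Python. It is only an illustration of the idea as I read it, not Microsoft's implementation: the TermIndex class, the search function, and the scores are all made up for this example, and a real index and ranking function would be far more involved.

```python
from collections import defaultdict


class TermIndex:
    """Toy inverted index: maps each term to the set of document ids containing it."""

    def __init__(self):
        self.postings = defaultdict(set)

    def add(self, doc_id, text):
        for term in text.lower().split():
            self.postings[term].add(doc_id)

    def lookup(self, query):
        # Return a new set of documents containing every query term (AND semantics).
        terms = query.lower().split()
        if not terms:
            return set()
        result = None
        for term in terms:
            docs = self.postings.get(term, set())
            result = set(docs) if result is None else result & docs
        return result


def search(query, main_index, fresh_index, score):
    # Step 1: answer from the large, periodically rebuilt index.
    candidates = main_index.lookup(query)
    # Step 2: check the in-memory index for fresher matching documents.
    candidates |= fresh_index.lookup(query)
    # Step 3: rank the combined candidates, highest score first.
    return sorted(candidates, key=score, reverse=True)


# Hypothetical documents and scores, for illustration only.
main = TermIndex()
main.add("old-report", "earthquake history in california")
fresh = TermIndex()
fresh.add("breaking-story", "earthquake strikes today")
scores = {"old-report": 0.4, "breaking-story": 0.9}

results = search("earthquake", main, fresh, scores.get)
print(results)
```

The point of the two-step lookup is that the fresh index can stay small and change constantly while the large index is left alone between batch updates.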

We don’t know if the process described in this patent is one that Microsoft has already implemented, one that they explored before deciding upon a different approach, or one that may already be obsolete. We do know that Google’s Caffeine update, which introduced the Percolator system to move Google from batch updates of its index to incremental ones, took place a couple of years ago.

The process described in this patent appears to provide updated content to searchers while still retaining a batch process that folds that new content into the older database on a periodic basis.
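The periodic folding step might look something like the sketch below. Again, this is an assumption on my part about the general shape of such a merge. The fold_into_main function and the plain dictionaries standing in for the two indexes are hypothetical, and the patent does not describe the merge at this level of detail.

```python
def fold_into_main(main_postings, fresh_postings):
    """Merge the fresh postings into the main postings, then empty the fresh index."""
    for term, doc_ids in fresh_postings.items():
        main_postings.setdefault(term, set()).update(doc_ids)
    # The in-memory index starts over once its content lives in the main index.
    fresh_postings.clear()


# Hypothetical postings: term -> set of document ids.
main_postings = {"earthquake": {"old-report"}}
fresh_postings = {
    "earthquake": {"breaking-story"},
    "aftershock": {"breaking-story"},
}

# Run once per batch period (daily, for example).
fold_into_main(main_postings, fresh_postings)
print(main_postings)
```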

So, what kind of content gets added to the in-memory index?

Significant user behavior centered around a document may trigger the addition of content to that in-memory index. That behavior might come from a pre-determined recent time frame, such as within the last 12 hours or within the last seven days. Significant means activity from enough different users during that time frame.
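One way to picture that test is as a count of distinct users inside a time window. The sketch below assumes exactly that. The is_significant function, the event format, and the min_users threshold of 3 are my own placeholders, since the patent doesn't give a specific number of users.

```python
from datetime import datetime, timedelta


def is_significant(events, now, window=timedelta(hours=12), min_users=3):
    """events: iterable of (user_id, timestamp) pairs recorded for one document."""
    cutoff = now - window
    # Count each user once, no matter how many times they acted in the window.
    recent_users = {user for user, ts in events if cutoff <= ts <= now}
    return len(recent_users) >= min_users


# Hypothetical activity log for a single document.
now = datetime(2012, 8, 15, 12, 0)
events = [
    ("u1", now - timedelta(hours=1)),
    ("u2", now - timedelta(hours=2)),
    ("u1", now - timedelta(hours=3)),   # repeat visit by the same user
    ("u3", now - timedelta(hours=20)),  # outside a 12-hour window
]

print(is_significant(events, now))                            # 12-hour window
print(is_significant(events, now, window=timedelta(days=7)))  # seven-day window
```

Counting distinct users rather than raw events means one very active visitor can't make a document look significant alone.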

Another signal that might be looked at is whether the behavior is tied to a modification made to a page, such as a change to content that alters at least one term on a page, such as a new price at a retail-based site.

The search engine might learn about those modifications of content and signs of significant user behavior through update files from sites, like product submission feeds and XML sitemaps and potentially even something like Twitter’s data feed of new tweets, and by crawling the pages of a site and comparing them to earlier versions.
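Comparing a crawled page to an earlier version to see whether at least one term changed could be as simple as a set difference. This is a rough sketch under that assumption. The changed_terms function and the sample page text are invented for illustration, and a real crawler would tokenize much more carefully.

```python
def changed_terms(old_text, new_text):
    """Return (added, removed) term sets between two versions of a page."""
    old_terms = set(old_text.lower().split())
    new_terms = set(new_text.lower().split())
    return new_terms - old_terms, old_terms - new_terms


# Hypothetical earlier and current versions of a product page.
earlier = "Blue widget price $19.99"
current = "Blue widget price $14.99"

added, removed = changed_terms(earlier, current)
page_was_modified = bool(added or removed)
print(added, removed, page_was_modified)
```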

A search of Bing for some very recent topical subjects, like a search for [earthquake], doesn’t show the kind of recent results that I would expect, so it’s possible that they haven’t incorporated this change into their results.

The Microsoft patent is:

Using behavior data to quickly improve search ranking
Invented by Walter Sun, Jay Kumar Goyal, Pratibha Permandla, Yinzhe Yu, and Jingfeng Li
Assigned to Microsoft
US Patent 8,244,701
Granted August 14, 2012
Filed: June 27, 2011

Abstract

Systems and methods for applying user behavior data to improve search query result ranking are provided. Upon receiving an update file indicating that recent, significant user behavior data is available for a document associated with an inverted index, the update file is published periodically and frequently to an index server. After filtering out the relevant update information from the update file, the index server extracts identifiers of the documents having the associated user behavior data. The update file and the identifier of the documents are utilized to update an in-memory index containing representations of metadata indicative of the user behavior.

The in-memory index is continuously updated and utilized to serve search query results in response to user search queries. Search query results from the in-memory index are ranked using the user behavior data prior to serving. Thus, results associated with recent, significant user-behavior metadata receive prominent placement on the search results page.

Take Aways

One of the areas where Bing seems to fall behind Google and Yahoo is in showing search results filtered by the past day, week, and month. I’m not sure why Yahoo offers this feature and Bing doesn’t. A look at search results for a term like [earthquake] at Google and Bing shows slightly more timely results from Google than from Bing, but the real-time results that Google used to show, which included data from Twitter’s data stream, are missed in this area.

The process described in this Microsoft patent shows a step towards the incremental indexing of search results that Google achieved with their Caffeine update, but it doesn’t seem like Bing has implemented this process yet in a way that would surface more recent content. It’s possible that such a change might bring lower quality search results to Bing, and that might be keeping this process from being used.

The faster pages and content go from being published online to being included in a search index and search results, the less time there is to classify, categorize, and determine the quality of those results.

At a Q&A session yesterday at SES San Francisco, Google’s Matt Cutts answered some interesting questions about what Google is doing in search these days. One of the points he made was that “You shouldn’t put a lot of weight on +1s just yet”. It’s clear that Google is still experimenting with how much weight they should give to social signals relating to positions in search results. It’s possible that social signals might be helpful in ranking very recent content on the web, especially since most very recently published pages haven’t had a chance to accumulate links as a quality signal that might help a search engine to determine rankings for pages.

News content does tend to show up quickly and to be highly ranked in Web search results, but Google’s news results are limited to sites that have been accepted as news sources, and are likely continually monitored in terms of quality of content, and judged on a different set of algorithms than other genres of web pages to determine rankings.

Chances are that Bing is struggling to find some of the same answers to how to rank very recent content. This patent shows an attempt to move in that direction.

13 thoughts on “Indexing Recent Content in Search Engines”

  1. Pingback: Indexing Recent Content in Search Engines - Inbound.org
  2. Bill,

    That was an interesting comment by Matt Cutts stating “You shouldn’t put a lot of weight on +1s just yet”.

    But at the same time, I can see how he would say that. Any one metric that could potentially influence search results more than any other would certainly be the target of manipulation by SEOs.

    Additionally, I can see how high-quality web pages wouldn’t necessarily get tons of +1s right out of the gate.

    Personally, I never +1 anything simply because I am not big into Social Media and I just never think of doing it. I am sure others are the same.

    I guess therein lies the weakness of using a voting system like the G+1 to influence results…but the idea behind it seems reasonable enough.

    Mark

  3. Mark, concerning the +1 thing, I don’t think many people are big fans of it either. I tend to view g+ as a ghost town, whenever I see a page that has a ton of +1s, I immediately think that it’s spam. Social likes and sharing aren’t good metrics in my opinion as they can be manipulated very easily.

  4. Great post. As Mark (previous comment) noted, the comment by Matt Cutts is interesting, but it makes me wonder if Google’s initial plan for +1 isn’t turning out like they thought? Unfortunately, using it for any sort of search ranking can be easily manipulated or spammed, and in its current form it only seems good for building a personal search engine. The trouble with that is that when I search for something, it’s unlikely that I am looking for a page I have been to previously, or I would have gone there directly.

    Regarding Google’s fresh search results: it’s great when searching for the latest news stories that Google is indexing them almost immediately. I am glad that Google understands the difference between news and other sites; otherwise this would be another loophole. Constantly posting content is still a good idea, but it would just encourage junk content.

  5. “So, what kind of content gets added to the in-memory index?”

    Do you really think they are looking at such small updates as a change in price other than “last updated” date? That’s crazy.

    The Bing vs. Google fight will be won by the party that utilizes open graphs the quickest to measure impact of social impact, especially for news and blog articles. To me, it’s a true form of measuring popularity. Doing it for news/recent updates would be far easier than at the domain level & there is no denying the fact that the technology in the algo isn’t REALLY developed to work properly yet.

  6. While many forums continually spruik the importance of having a blog and updating regularly, and how social websites will take over the seo world, I am still not convinced of the relevance or longevity of their influence on seo or site rankings. Mainly because many forums and blogs are full of automated comments, while a well thought out original article on an authority site should make more sense regarding how google originally wanted to rank information.

    Is this just me and wishful thinking?

  7. I agree with Jonathon here, +1 doesn’t seem to be turning out like Google thought. It’s something that appears to be used by the majority of the tech/online marketing and SEO communities, but I’m doubtful this extends to any great use beyond this and into the “joe public” domain. Certainly like any type of indicator, whether it is review based, star-ratings etc. social indicators can be manipulated. It wouldn’t be long before people manipulated on-page content to influence the in-memory index, in relation to “Another signal that might be looked at is whether the behavior is tied to a modification made to a page, such as a change to content that alters at least one term on a page, such as a new price at a retail-based site.”

    In terms of fresh content, my personal preference is Twitter and I think it’s a real shame that Google lost the realtime feature last year.

  8. I agree that +1 doesn’t really seem to be turning out as well as Google originally planned. They made it seem like it was going to blow Facebook out of the water and help with SEO. But it’s so easy to manipulate +1s as others here have said, that it’s practically worthless in that area. Maybe MS will find a better way at ranking new content.

  9. I’m still amazed how important Twitter has become. The content is so short and non-descriptive, but its immediacy and speed are redefining the web and of course the search engines.

  10. Interesting. There are other Matt Cutts quotes indicating they do put store in social signals. It’s very difficult to know what to believe on this front, but I certainly agree with you that +1s aren’t a significant factor in the algorithm at the present time. Your point about using social signals or likes as a means of finding fresh content is well made. :)

  11. I’m seeing more and more Google + author posts in the top 10 of lots of search phrases. I think Google is giving less relevant results higher priority just for the simple fact they use their Google + page as an author page within the html. Although unfair to a certain degree, it does give hard workers that develop good content a leg up on their competition.

  12. It was good to see earlier that Google used to show live tweets in their search results. Showing live tweets was a good social media integration to start off with. But it’s unfortunate that they pulled it to promote their +1.

    Just wondering whether the earlier method of ranking, wherein social bookmarking sites like Digg, StumbleUpon, Delicious, etc. were given good preference in rankings, was a better idea?