Microsoft Tracking Search and Browsing Behavior to Find Authoritative Pages

Between December 2005 and April 2006, researchers from Microsoft collected information about the searching and browsing activities of hundreds of thousands of Windows Live Toolbar users, with permission, to learn about the sometimes unranked and unindexed final destination pages that searchers ended up at in response to queries entered at Google, Yahoo, and Microsoft’s

So much of what search engines try to do when presenting relevant results to searchers is based upon assumptions found in algorithms like PageRank.

Can tracking actual user search and browsing behaviors better help a search engine understand which pages may best answer queries posed by searchers at search engines?

Microsoft on Final Destination Pages

Yahoo on Segmenting Web Sites into Topical Hierarchies

On one level, a search engine indexes a web site by crawling that site one URL at a time, collecting information about what it finds at that address, and indexing the information found so that it can be served to visitors later.

But the process can be more complicated than that.

For instance, a search engine may try to understand more about specific sites by collecting information on a site-wide basis.

Site-Wide Information about Web Sites

Site-wide information that a search engine might look at for a web site could include:

New Google Process for Detecting Near Duplicate Content

A new patent application on near duplicate content from Google explores using a combination of document similarity techniques to keep searchers from finding redundant content in search results.

The Web makes it easy for words to be copied and spread from one page to another, and the same content may be found at more than one web site, whether or not its author intended that.

How do search engines cope with finding the same words, the same content, at different places?

How might they recognize that the pages that they would want to show searchers in response to a search query contain the same information and only show one version so that the searchers don’t get buried with redundant information?
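One common way to illustrate the idea is word shingling combined with Jaccard similarity. The sketch below is not the process described in the Google patent application; it is a minimal example of a general document similarity technique, and the function names, the shingle size `k`, and the `threshold` value are all assumptions chosen for illustration.

```python
def shingles(text, k=3):
    """Return the set of k-word sequences (shingles) found in text."""
    words = text.lower().split()
    if len(words) < k:
        # Very short texts become a single shingle (or none if empty).
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets: size of overlap over size of union."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def dedupe(results, threshold=0.8):
    """Keep each result only if it is not too similar to one already kept.

    The threshold is a hypothetical cutoff, not a value from any patent.
    """
    kept = []
    kept_shingles = []
    for text in results:
        current = shingles(text)
        if all(jaccard(current, seen) < threshold for seen in kept_shingles):
            kept.append(text)
            kept_shingles.append(current)
    return kept


if __name__ == "__main__":
    pages = [
        "the quick brown fox jumps over the lazy dog",
        "the quick brown fox jumps over the lazy dog",
        "a patent application about search engines and duplicate content",
    ]
    for page in dedupe(pages):
        print(page)
```

A production system would likely use hashed fingerprints rather than comparing every pair of documents, since pairwise comparison does not scale to a Web-sized index.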

The Oracle at Yahoo: Using Yahoo News to Search the Future

Imagine exploring millions and millions of news pages and other documents to find information about events that are scheduled to happen in the future, to help predict the future.

[Image: The oracle Sibyl at Delphi]

This kind of future search, or future retrieval, might be able to support the making of decisions in many different fields.

News information could be used to obtain information about possible future events, and that information could be made searchable, so that it can help people plan for the future.

The Yahoo patent application is:

Google Omits Needless Words (On Your Pages?)

A lot of web pages and documents reuse the same text from page to page, such as navigation sidebars and copyright notices in footers.

Computer programmers will sometimes use the term “boilerplate” code to refer to standard stock code that they often insert into programs. Lawyers use legal boilerplate in contracts – often the small print on the back of a contract that doesn’t change regardless of what a contract is about.

It might be a good step for a search engine to ignore boilerplate text when it indexes pages, or uses the content of pages to create query suggestions for someone using a desktop personalized search. Ignoring boilerplate in the same documents could be helpful when using those documents to rerank search results in personalized search.
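As a rough illustration of ignoring boilerplate, the sketch below treats any line that appears on most pages of a site as boilerplate and drops it before indexing. This is a simple frequency heuristic, not the method from the Google patent filing, and the function names and the `min_share` cutoff are assumptions made for the example.

```python
from collections import Counter


def find_boilerplate(pages, min_share=0.8):
    """Return lines that appear on at least min_share of the pages.

    min_share is a hypothetical cutoff chosen for this example.
    """
    counts = Counter()
    for page in pages:
        # Count each distinct non-empty line once per page.
        counts.update({line.strip() for line in page.splitlines() if line.strip()})
    cutoff = min_share * len(pages)
    return {line for line, seen in counts.items() if seen >= cutoff}


def strip_boilerplate(page, boilerplate):
    """Return the page text with empty lines and boilerplate lines removed."""
    return "\n".join(
        line for line in page.splitlines()
        if line.strip() and line.strip() not in boilerplate
    )


if __name__ == "__main__":
    site = [
        "Home | News | Sports\nStory one text\nCopyright notice",
        "Home | News | Sports\nStory two text\nCopyright notice",
        "Home | News | Sports\nStory three text\nCopyright notice",
    ]
    repeated = find_boilerplate(site)
    print(strip_boilerplate(site[0], repeated))
```

Comparing whole lines keeps the example short; real pages would need the repeated blocks to be identified in the page markup rather than in plain text lines.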

[Image: New York Times boilerplate, the navigation information at the top of the New York Times homepage]

Microsoft on Personalized Phone Portals

Do you use MyYahoo as a portal page every time you access the Web? If you were away from your computer, and had a few minutes to spare, would you consider calling that page, and listening to it over the phone, to get your stock picks, or horoscope, or sports scores?

A newly granted patent from Microsoft describes a way of creating a personalized profile and phone system that can provide unique content tailored to you and your tastes, or allow you to access portal sites by phone such as MyYahoo.

The mix of Web and phone has the potential to provide some interesting and unique experiences beyond attempting to browse the Web on a small screen, or check email.

Microsoft’s patent describes a number of personalized ways to use a telephone to interact with information on the Web.

Yahoo Phrase Based Indexing in a Nutshell

Search engines are getting smarter about the phrases that they see and understand online, and Yahoo recently published a patent application that describes a number of the ways that they learn about and understand the use of phrases in documents on the Web.

Exploring how Yahoo might use phrases to rerank search results may show how it tries to understand data from documents published on the Web, and from log files that record the queries people use when they search for information about different concepts.

From Keyword Matching to Phrase-Based Indexing

A page’s placement in search results for certain queries can involve looking at ranking criteria and algorithms applied to documents involving keywords in search queries for things like:

Microsoft on Reranking Search Results Based Upon Your Calendar

Many patent filings and papers from the search engines discuss ways that they might shuffle around search results to try to provide more relevant responses to people’s searches.

Imagine a search engine changing around the results that you see, not based upon the time that a page is published, but rather on some estimate of the importance of a page to you, and how that importance might vary with time and your personal calendar. Sounds like a tricky proposition, doesn’t it?

Microsoft adds an element of time to search results by introducing a search system that pays more attention to what is happening on your computer and within your company intranet.

This more complex search system could be used to index both search results and information found on a person’s desktop and local network. The search system would pay more attention to the context of searches and add personalization to those searches by building a user profile to distinguish how important different information might be to each individual searcher.
