Microsoft Tracking Search and Browsing Behavior to Find Authoritative Pages

Between December 2005 and April 2006, researchers from Microsoft collected information about the searching and browsing activities of hundreds of thousands of Windows Live Toolbar users, with permission, to learn about the sometimes unranked and unindexed final destination pages that searchers ended up at in response to queries entered at Google, Yahoo, and Microsoft’s

So much of what search engines try to do when presenting relevant results to searchers is based upon assumptions found in algorithms like PageRank.

Can tracking actual user search and browsing behaviors better help a search engine understand which pages may best answer queries posed by searchers at search engines?
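One way to picture the idea is to treat each logged session as a query followed by the pages visited afterward, and to count which page each trail ends on. The sketch below is only an illustration under that assumption; the trail format, the function names, and the example URLs are hypothetical and are not taken from the Microsoft research.

```python
from collections import Counter, defaultdict


def destination_counts(trails):
    """Count, for each query, how often each URL was the last page of a trail.

    `trails` is an iterable of (query, [url, url, ...]) pairs, a hypothetical
    log format in which the URLs appear in the order they were visited.
    """
    counts = defaultdict(Counter)
    for query, urls in trails:
        if urls:  # skip trails where nothing was visited
            counts[query][urls[-1]] += 1
    return counts


def top_destinations(trails, query, n=3):
    """Return up to n (url, count) pairs for the most common final pages."""
    return destination_counts(trails)[query].most_common(n)


trails = [
    ("python docs", ["search.example.com/r", "a.example.com", "docs.example.com/tutorial"]),
    ("python docs", ["search.example.com/r", "docs.example.com/tutorial"]),
    ("python docs", ["search.example.com/r", "b.example.com"]),
]
print(top_destinations(trails, "python docs", n=1))
```

Pages that many searchers end on for the same query could then be treated as candidate answers for that query, even if they were never in the original result list.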

Microsoft on Final Destination Pages

Continue reading “Microsoft Tracking Search and Browsing Behavior to Find Authoritative Pages”

Yahoo on Segmenting Web Sites into Topical Hierarchies

On one level, a search engine indexes a web site by crawling that site one URL at a time, collecting information about what it finds at that address, and indexing the information found so that it can be served to visitors later.

But the process can be more complicated than that.

For instance, a search engine may try to understand more about specific sites by collecting information on a site-wide basis.

Site-Wide Information about Web Sites

Information that a search engine might look at for a web site on a site-wide level includes:

Continue reading “Yahoo on Segmenting Web Sites into Topical Hierarchies”

New Google Process for Detecting Near Duplicate Content

A new patent application on near duplicate content from Google explores using a combination of document similarity techniques to keep searchers from finding redundant content in search results.

The Web makes it easy for words to be copied and spread from one page to another, and the same content may be found at more than one web site, whether or not its author intended that.

How do search engines cope with finding duplicate websites – the same words, the same content, at different places?

How might they recognize that the pages they would want to show searchers in response to a query contain the same information, and show only one version so that searchers don’t get buried in redundant information?
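The excerpt does not say which similarity techniques the patent application combines, so the following is only a minimal sketch of one well-known approach: comparing overlapping word sequences (shingles) with Jaccard similarity and keeping the first page from each group of near duplicates. The function names, the shingle size, and the 0.8 threshold are assumptions chosen for illustration.

```python
def shingles(text, k=3):
    """Return the set of k-word shingles (overlapping word windows) in text."""
    words = text.lower().split()
    if len(words) < k:
        # Short texts become a single shingle; empty texts have none.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets: size of intersection over size of union."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def filter_near_duplicates(results, threshold=0.8, k=3):
    """Keep the first of each group of near-duplicate texts, preserving order."""
    kept = []
    kept_shingles = []
    for text in results:
        s = shingles(text, k)
        # Keep a text only if it is dissimilar to everything kept so far.
        if all(jaccard(s, other) < threshold for other in kept_shingles):
            kept.append(text)
            kept_shingles.append(s)
    return kept


results = [
    "the quick brown fox jumps over the lazy dog",
    "the quick brown fox jumps over the lazy dog",
    "an entirely different page about search engines",
]
print(filter_near_duplicates(results))
```

Comparing every result against every kept result is fine for a single results page; at web scale, fingerprinting methods would be needed instead of pairwise set comparison.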

Continue reading “New Google Process for Detecting Near Duplicate Content”

The Oracle at Yahoo: Using Yahoo News to Search the Future

Imagine exploring millions and millions of news pages and other documents to find information about events that are scheduled to happen in the future, to help predict the future.

The oracle Sibyl at Delphi

This kind of future search, or future retrieval, might be able to support the making of decisions in many different fields.

News information could be used to obtain information about possible future events, and that information could be made searchable, so that it can help people plan for the future.
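As a rough illustration of making future events searchable, the sketch below indexes sentences by any year they mention that is later than the current year. It is a deliberately simple assumption about how such an index might work, not the method in the Yahoo patent application; the function name and the year-only matching are hypothetical, and a real system would need to parse full dates and relative expressions such as "next spring".

```python
import re
from collections import defaultdict

# Matches four-digit years from 1900 to 2099.
YEAR = re.compile(r"\b(?:19|20)\d{2}\b")


def index_future_mentions(sentences, current_year):
    """Map each future year to the sentences that mention it."""
    index = defaultdict(list)
    for sentence in sentences:
        years = {int(match.group(0)) for match in YEAR.finditer(sentence)}
        for year in sorted(years):
            if year > current_year:
                index[year].append(sentence)
    return dict(index)


sentences = [
    "The stadium is scheduled to open in 2010.",
    "The company was founded in 1998.",
    "Talks will resume in 2010 and conclude by 2012.",
]
future = index_future_mentions(sentences, current_year=2007)
print(future.get(2010, []))
```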

The Yahoo patent application is:

Continue reading “The Oracle at Yahoo: Using Yahoo News to Search the Future”

Google Omits Needless Words (Boilerplate)

Computer programmers will sometimes use the term “boilerplate” code to refer to standard stock code that they often insert into programs. Lawyers use legal boilerplate in contracts – often the small print on the back of a contract that doesn’t change regardless of what a contract is about.

A lot of web pages and documents reuse the same text in sidebars and in footers at the bottoms of pages, like copyright notices and navigation sidebars.

It might be a good step for a search engine to ignore boilerplate text when indexing pages, or when using the content of pages to create query suggestions for someone using a desktop personalized search. Ignoring text that repeats across documents could also be helpful when using those documents to rerank search results in personalized search.
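A simple way to approximate this, assuming a set of pages from the same site is available, is to treat any line that appears on most of those pages as boilerplate and drop it before indexing. This is a sketch of that heuristic only, not Google's process; the function names and the 0.8 cutoff are assumptions.

```python
from collections import Counter


def find_boilerplate(pages, min_fraction=0.8):
    """Return the set of lines that occur in at least min_fraction of pages."""
    counts = Counter()
    for page in pages:
        # Count each distinct non-empty line once per page.
        counts.update({line.strip() for line in page.splitlines() if line.strip()})
    needed = min_fraction * len(pages)
    return {line for line, n in counts.items() if n >= needed}


def strip_boilerplate(page, boilerplate):
    """Return the page text with boilerplate lines and blank lines removed."""
    kept = [
        line for line in page.splitlines()
        if line.strip() and line.strip() not in boilerplate
    ]
    return "\n".join(kept)


pages = [
    "Home | World | Sports\nStory one text\n(c) Example News",
    "Home | World | Sports\nStory two text\n(c) Example News",
    "Home | World | Sports\nStory three text\n(c) Example News",
]
boilerplate = find_boilerplate(pages)
print(strip_boilerplate(pages[0], boilerplate))
```

Working on whole lines keeps the example short; real pages would more likely be compared at the level of HTML blocks such as navigation bars and footers.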

New York Times boilerplate: navigation information at the top of the New York Times homepage

Continue reading “Google Omits Needless Words (Boilerplate)”