Fighting web spam with algorithms

A new patent application from Microsoft describes ways to identify some of the spam pages that show up in search engine results. The research that led to the application started out looking at something else entirely, but a chance discovery turned up some interesting results.

The initial research began with something Microsoft calls Pageturner. Pageturner is a project that looks at how often web pages update, and how frequently they might need to be crawled. It also looks at identifying duplicate and near duplicate content on web pages.
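To give a rough sense of what "near duplicate" detection can mean in practice, here is a minimal sketch using word shingles and Jaccard similarity. This is a common textbook approach and is not taken from the Pageturner work; the function names, the shingle size, and the 0.8 threshold are all assumptions made for illustration.

```python
def shingles(text, k=3):
    """Return the set of k-word shingles (overlapping word windows) in text."""
    words = text.lower().split()
    if len(words) < k:
        # Very short texts yield a single shingle, or none if the text is empty.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets: size of intersection over size of union."""
    if not a and not b:
        return 1.0  # two empty documents are treated as identical
    return len(a & b) / len(a | b)


def is_near_duplicate(text_a, text_b, k=3, threshold=0.8):
    """Hypothetical rule: pages count as near duplicates above the threshold."""
    return jaccard(shingles(text_a, k), shingles(text_b, k)) >= threshold
```

Two pages that share most of their three-word sequences score close to 1.0, while unrelated pages score close to 0. A crawler working at web scale would likely need something more compact than full shingle sets, but the idea of comparing overlapping word windows is the same.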

The Microsoft researchers on that project found themselves drawn to some very different research after looking at some of their results, especially those from some pages located in Germany that changed too quickly. Here are a couple of papers that describe some of the results of the original research:

On the Evolution of Clusters of Near Duplicate Web Pages (pdf)



Search roundup

Here are some blog posts and articles I came across in the last week that I thought were interesting.

Jared Spool, over at UIE Brainsparks, writes about collecting penultimate referrers in Identifying Missing Trigger Words from Search Logs. Collecting information about what people search for on your site through an on-site search function can be a good way of finding out what people might want to see there. But isn’t it also interesting to see what search might have brought them to the page where they conducted that search? Those next-to-last, or penultimate, searches might contain some useful information about what people expect to see on your site but can’t find. Nice idea.
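As a rough illustration of the idea, here is a minimal sketch that pulls the search terms out of a referrer URL, which you might record for the page where a visitor ran an on-site search. The function name and the list of query parameter names are my own assumptions; different search engines use different parameters, and this is not taken from Jared Spool's article.

```python
from urllib.parse import urlparse, parse_qs

# Assumed parameter names that search engines commonly use for the query text.
QUERY_PARAMS = ("q", "p", "query")


def penultimate_search_terms(referrer_url):
    """Return the search phrase found in a referrer URL, or None if there is none."""
    if not referrer_url:
        return None
    params = parse_qs(urlparse(referrer_url).query)
    for name in QUERY_PARAMS:
        values = params.get(name)
        if values:
            # parse_qs already decodes "+" and percent-escapes for us.
            return values[0].strip().lower()
    return None
```

Storing the result next to each on-site search would let you compare the two phrases, and see which words people arrived with that your pages didn't seem to answer.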

This one has been pointed to by a number of people, but it’s a good one to see if you missed it. Matt Cutts posted a Question and Answer post a couple of days ago where he discussed the recent “Big Daddy” infrastructure update to Google, as well as answering questions on a number of other topics.

Greg Linden has been writing some great posts about his days at Amazon lately on Geeking with Greg. Those Amazon posts only add to the many other excellent posts there, including a recent one on mandatory registration in forums, Removing registration and traffic.



New Google patent applications

Some new patent applications assigned to Google were published yesterday at the US Patent and Trademark Office.

These are not granted patents, and they only describe possible ways that a search engine can fulfill some objective, but they can provide some insight into possible processes that the search engine could follow, and some of the issues surrounding the problems they are intended to address.

Adjusting ad campaigns based upon business objectives

Interested in having your online advertising campaign adjust itself in some manner when a pre-defined business goal has been met? The first patent application describes a process that will estimate or track (or estimate and track) a business metric, such as ROI, profit, or gross profit, for an ad campaign or part of a campaign.



IBM and Shadow Pages

International Business Machines appears to be embracing search engine optimization in a big way. That’s a positive sign from a company that is seen as a leader in many areas of information technology. A new patent application and a couple of new articles from IBM point towards a growing commitment to helping people build web sites that are easier to find through search engines.

New articles

I wrote a previous post about IBM publishing two of a series of four articles on search engine optimization. The final articles in the series are now out, and they are worth a look.

The first two articles in the series were written by L. Jennette Banks, who is an organic search optimization expert for IBM. The last two are by Mike Moran, who is IBM’s Manager of Site Architecture, and Bill Hunt, who is the President and CEO of Global Strategies International, LLC. You may have seen those two names together before if you’ve done some research on books about SEO. They are co-authors of a book on the subject, “Search Engine Marketing, Inc.: Driving Search Traffic to Your Company’s Web Site,” and they’ve received a lot of positive reviews for it.



Customizing Travel Directions with Google

I remember when one of my co-workers asked me if I would help her plot out a map and driving directions, so that she could go on a road trip to two of the largest shopping malls in the area.

I’m not always very happy with the driving directions that I get from one of the mapping services on the web, and this was something of a challenge, because the trip would pretty much be a big triangle – Point A to Point B (Plymouth Meeting Mall) to Point C (King of Prussia Mall) to Point A.

I pretty much had to plot three separate routes, and try them in at least three mapping programs, until I got some directions that seemed like they would work best. At some point it went from challenging to painful.

I guess I’m not the only one who wished that driving directions could be a little more customizable.



Google Book Search Patent Application

There’s a newly published patent application from Google, and on its face, it looks like a good match for the way books could be displayed in Google’s Book Search.

Parts of it do appear to be included in what Google has developed, but I don’t see them using the “image distortion” described in the document.

If you haven’t spent any time with Google Book Search, you may not have seen how they handle some sources differently than others. For a few books, it appears that you can look at a number of pages that include your query terms. For other books, where the search terms may appear on a lot of pages, you need to log into Google to look at some of the pages, so that they can track how much of the book you’ve seen.

For shorter works, instead of providing full pages, it seems that Google’s Book Search only delivers snippets of relevant text. This is where the patent application seems to point to the use of a full page with the parts that aren’t relevant appearing distorted and even unreadable.

It may be worth skimming over the patent application if you are interested in seeing a detailed description of how to handle the issues that the process described within it was intended to address.



Microsoft and localized media delivery based upon site location

One approach to providing advertisements on a website is to try to gauge the subject matter of a page, and provide advertising related to that content. But many advertisers are interested in delivering advertising that will go to an audience in a certain area or region.

Ads based upon content can be targeted to specific geographical locations by looking at the IP address of a visitor, though that approach often results in serving the visitor advertisements based upon the location of the owner of the IP address. That becomes a problem when a visitor uses a large ISP located some distance away from where he or she is actually viewing the site.

An alternative would be to attempt to collect some geographical location information from the visitor, relying upon a user of a personalized web service to supply something like a phone number or zip code that can tell where they are. But people don’t always provide that type of information, or they may supply it to get something like local weather information, and then move without updating it.

A different way of delivering ads based upon location would be to attempt to understand the location of the site, if it has one associated with it, and serve yellow pages styled ads on that page. A patent application assigned to Microsoft and released this week explores ways to display ads related to what it believes is the location of a site.



IBM tackles multilingual web searching

I’ve been enjoying visiting a number of sites that are written in languages other than English, such as Référencement, Design et Cie, and others. I often rely on some of the translation services available online to read those sites, but when searching the web I have trouble finding information that isn’t written in English.

It would be nice to have a way to search non-English sites without having to try to translate queries into other languages first.

IBM has a patent filing, published as a patent application last week, which tries to help people find sites in other languages that are relevant to their searches, and might be authority sites on those subjects.



Getting Information about Search, SEO, and the Semantic Web Directly from the Search Engines