Search Engines, Classifications, and Assignment of Categories

If your web site sorts products, services, or pages into categories, and those offerings might appear in a shopping search engine or another service that draws information from multiple web sites, the way you classify them may influence how that service classifies them, or what new classifications it creates, when it displays them.

A Yahoo patent application describes an automated process in which items entered into different sets of categories can be placed into other, broader categorization schemes.

These broader category schemes could be for product search, for advertisements, for user-tagged items such as photos, for services such as job listings, and for other areas where many web sites have their own unique categorization systems.

The Value of Categories

Continue reading Search Engines, Classifications, and Assignment of Categories

A Yahoo Approach to Avoid Crawling Advertisement and Session Tracking Links

A newly published Yahoo patent application describes a couple of ways to filter out some of the URLs that it might crawl, to keep those pages from being indexed and presented to searchers.

Those URLs are referred to in the patent filing as “transient” links because they change from visit to visit, often because they are advertisements that have URLs with tracking codes included within them, or contain session IDs to track visitors.

An approach is provided for identifying transient links on a Web page. The approach ensures that transient links are not crawled and archived, thereby saving resources for crawling valid links leading to useful information.

Outgoing links on a web page are identified, and after a period of time, a new copy of the web page is obtained and the outgoing links identified. The respective sets of links are compared and links which do not appear in both sets of links are identified as transient.
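The comparison described in that passage can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the method claimed in the patent filing: the names LinkCollector, outgoing_links, and split_transient are hypothetical, the two snapshots are passed in as HTML strings rather than fetched, and only anchor tags are treated as outgoing links.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collects the absolute URLs of all <a href="..."> links on a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = set()

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    # Resolve relative links against the page's own URL.
                    self.links.add(urljoin(self.base_url, value))


def outgoing_links(html, base_url):
    """Return the set of outgoing links found in one copy of a page."""
    parser = LinkCollector(base_url)
    parser.feed(html)
    return parser.links


def split_transient(first_html, second_html, base_url):
    """Compare two copies of a page taken at different times.

    Links present in both copies are treated as stable; links that
    appear in only one copy are treated as transient (assumption: a
    single pair of visits is enough to decide).
    """
    first = outgoing_links(first_html, base_url)
    second = outgoing_links(second_html, base_url)
    stable = first & second
    transient = first ^ second
    return stable, transient


# Hypothetical snapshots: the ad link carries a session ID that changes.
first_visit = '<a href="/about">About</a> <a href="/ad?sid=111">Ad</a>'
second_visit = '<a href="/about">About</a> <a href="/ad?sid=222">Ad</a>'
stable, transient = split_transient(first_visit, second_visit, "http://example.com/")
print(sorted(stable))
print(sorted(transient))
```

A crawler built along these lines would queue only the stable set. A link that was added or removed for ordinary editorial reasons between the two visits would also be flagged here, so a real system would presumably need more than one comparison.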

Continue reading A Yahoo Approach to Avoid Crawling Advertisement and Session Tracking Links

An Expansion of Importance Scores for Web Page Rankings?

When a judge looks at evidence entered into court, he weighs a number of factors. One of them is whether the evidence offered is relevant to the case at hand.

Another is how important that evidence might be.

Now, a piece of evidence by itself doesn’t have to be groundbreaking to be important. But testimony about the character of a 40-year-old defendant in a criminal proceeding, given by his third-grade teacher, may be somewhat relevant yet probably not all that important.

Importance Scores in Search Engines

When a search engine ranks pages for a set of search results, it also usually performs two distinct types of calculations, whose scores it combines to serve pages to searchers. Those scores likewise focus upon how relevant a result might be to the query entered, and how important that page or picture or video might be.

Continue reading An Expansion of Importance Scores for Web Page Rankings?

How Does Congress Use Google? Mentions on the Congressional Record

I had read that a hearing regarding the proposed Google-Doubleclick merger was going in front of the Senate Judiciary Committee’s Subcommittee on Antitrust, Competition Policy and Consumer Rights on Thursday.

In looking for information about the hearing, I also decided to take a look at when and how Google was mentioned by Congress in the Congressional Record.

A search on Thomas for the 110th Congress yielded 43 results, and I’ve listed those below, along with the quotes where the word “Google” was used. It’s difficult to link directly to documents found through Thomas, so if you want to see more context for those, the best way might be to go to the Thomas page that I’ve linked to in this paragraph, and search for “Google.”

Congress has used Google as an example (good and bad), and as a verb. The number of results found for a search in Google has been used as proof of a position, and as a clever retort to that position.

There are mentions of Google the company, and Google as a new employer of a former agency member. One Congressman pleaded for others to look at Google Maps for proof of the size of a particular land mass.

Continue reading How Does Congress Use Google? Mentions on the Congressional Record

Microsoft on Javascript Redirection Spam

A paper prepared by Microsoft researchers for the AIRWeb’07 conference this past May explores some methods that a few people use to try to trick search engines. The paper, A Taxonomy of JavaScript Redirection Spam (pdf), provides a nice overview of those methods.

In this paper, we study common JavaScript redirection spam techniques on the web.

Our findings indicate that obfuscation techniques are very prevalent among JavaScript redirection spam pages.

These obfuscation techniques limit the effectiveness of static analysis and static feature based systems.

Continue reading Microsoft on Javascript Redirection Spam

Google on Multi-Tiered Indexing and Multi-Staged Query Processing

I discussed one of the more interesting patent applications from Google last year in Google looks at multi-stage query processing. What made it so intriguing was that it described different stages and aspects of ranking results by the search engine.

A related patent application published this week, Document compression system and method for use with tokenspace repository, goes back to that multi-staged query processing system, and makes claims for some of the more technical aspects of how information is contained within the indexes used during that process.

The abstract for the patent filing provides a high-level look at some of the techniques used:

The disclosed embodiments enable multi-stage query scoring, including “snippet” generation, through incremental document reconstruction facilitated by a multi-tiered mapping scheme.

Continue reading Google on Multi-Tiered Indexing and Multi-Staged Query Processing

20 More Ways that Search Engines May Rerank Search Results

Last October, I made a list of 20 Ways Search Engines May Rerank Search Results, which was well received (thank you!), and it was suggested recently that I come up with an updated list. There’s also a follow-up post to this one now at Another 10 Ways Search Engines May Rerank Search Results.

When someone searches at a search engine, the conventional approach is to find pages that contain the keywords searched for, then rank and serve them in an order that combines a relevancy score for each page with some kind of importance metric, such as a PageRank score.
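As a rough illustration of that conventional approach, here is a minimal Python sketch. It assumes each page already has a relevance score and an importance score between 0 and 1, and it blends them with a simple weighted sum. The weighted sum, the function names, and the sample numbers are all assumptions made for this example; search engines do not publish how they actually combine these signals.

```python
def combined_score(relevance, importance, weight=0.5):
    """Blend a query-dependent relevance score with a query-independent
    importance score. `weight` is the share given to relevance
    (hypothetical parameter, not taken from any search engine)."""
    return weight * relevance + (1 - weight) * importance


def rank(pages, weight=0.5):
    """Return pages ordered from highest to lowest combined score."""
    return sorted(
        pages,
        key=lambda p: combined_score(p["relevance"], p["importance"], weight),
        reverse=True,
    )


# Made-up pages and scores, purely for illustration.
pages = [
    {"url": "a.example", "relevance": 0.9, "importance": 0.1},
    {"url": "b.example", "relevance": 0.6, "importance": 0.8},
]

for page in rank(pages):
    print(page["url"])
```

With an even weighting, the more important page outranks the more relevant one; pushing the weight toward relevance reverses the order. Reranking methods like the ones listed in this post can be thought of as adjustments applied on top of an initial ordering of this kind.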

This list contains links to a number of patent applications and a few papers involving ways to rerank search results. Most of these were published after the creation of my previous reranking list.

Continue reading 20 More Ways that Search Engines May Rerank Search Results

Intel’s Mash Maker Now in Technology Preview Release

Back on July 4th, I wrote about Intel’s Mash Maker, which was featured in a paper presented at SIGMOD’07 in June, and was coming out in a private beta release in July.

Robert Ennal, one of the paper’s authors, was kind enough to send me an email letting me know that people can now sign up to try out Mash Maker in a Technology Preview Release (Thanks, Robert). I’ve signed up, and if you want to try it out, you can submit your email address on the Intel Mash Maker site.

How does Mash Maker work?

Intel® Mash Maker is an extension to your existing web browser that allows you to easily augment the page that you are currently browsing with information from other websites. As you browse the web, the Mash Maker toolbar suggests Mashups that it can apply to the current page in order to make it more useful for you. For example: plot all items on a map, or display the leg room for all flights.

Continue reading Intel’s Mash Maker Now in Technology Preview Release