Search Engines, Classifications, and Assignment of Categories

Search Engine Categorization

If your website sorts products, services, or pages into categories, and those offerings might appear in a shopping search engine or another service that draws information from multiple websites, the way you classify them may influence how that service assigns them to its own categories or creates new ones.

A Yahoo patent application describes an automated process in which items classified under different sets of categories can be mapped into other, broader categorization schemes.

These broader category schemes could be for product search, for advertisements, for user-tagged items such as photos, for services such as job listings, as well as other areas where there are many websites that have their own unique categorization systems.


A Yahoo Approach to Avoid Crawling Advertisement and Session Tracking Links

A newly published Yahoo patent application describes a couple of ways to filter out some of the URLs that it might crawl, to keep those pages from being indexed and presented to searchers.

Those URLs are referred to in the patent filing as “transient” links because they change from visit to visit, often because they are advertisements that have URLs with tracking codes included within them, or contain session IDs to track visitors.

An approach is provided for identifying transient links on a Web page. The approach ensures that transient links are not crawled and archived, thereby saving resources for crawling valid links leading to useful information.

Outgoing links on a web page are identified, and after a period of time, a new copy of the web page is obtained and the outgoing links identified. The respective sets of links are compared and links which do not appear in both sets of links are identified as transient.
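The comparison described above can be sketched in a few lines of Python. This is only an illustration of the idea and not the method claimed in the patent filing: the class and function names, the example.com URLs, and the sample HTML are all made up for the example, and a real crawler would fetch the two copies of the page some time apart rather than use hard-coded strings.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collects absolute URLs from <a href="..."> tags (hypothetical helper)."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = set()

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.add(urljoin(self.base_url, value))


def outgoing_links(html, base_url):
    """Return the set of outgoing links found in one copy of a page."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    return collector.links


def split_links(first_html, second_html, base_url):
    """Compare two copies of a page; links missing from either copy are transient."""
    first = outgoing_links(first_html, base_url)
    second = outgoing_links(second_html, base_url)
    stable = first & second      # present on both visits
    transient = first ^ second   # present on only one visit
    return stable, transient


# Two made-up copies of the same page, as if fetched at different times.
first_visit = """
<a href="/about">About</a>
<a href="/ad?click=111">Ad</a>
<a href="/cart?sid=aaa">Cart</a>
"""
second_visit = """
<a href="/about">About</a>
<a href="/ad?click=222">Ad</a>
<a href="/cart?sid=bbb">Cart</a>
"""

stable, transient = split_links(first_visit, second_visit, "http://example.com/")
print("crawl:", sorted(stable))
print("skip: ", sorted(transient))
```

In this sample only the /about link shows up on both visits, so it is the only one a crawler following this approach would keep. The advertisement and session ID links differ between visits and land in the transient set.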


An Expansion of Importance Scores for Web Page Rankings?

When a judge looks at evidence entered into court, he weighs a number of factors. One of them is whether the evidence offered is relevant to the case at hand.

Another is how important that evidence might be.

Now, a piece of evidence doesn’t have to be groundbreaking to be important. But testimony about the character of a 40-year-old defendant in a criminal proceeding from his third-grade teacher, for example, may be somewhat relevant yet probably not all that important.

Importance Scores in Search Engines

When a search engine ranks pages for a set of search results, it also usually looks at two distinct types of calculations, which it combines to serve pages to searchers. Those scores likewise focus upon how relevant a result might be to the query entered, and how important that page, picture, or video might be.
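As a rough illustration of combining the two kinds of scores, here is a minimal Python sketch. It assumes a simple weighted sum, which is my own stand-in and not something any search engine has confirmed using. The function name, the weight, the page names, and the score values are all hypothetical.

```python
def combined_score(relevance, importance, relevance_weight=0.7):
    """Blend a query-dependent relevance score with a query-independent
    importance score. The weighted sum is an assumption for illustration."""
    return relevance_weight * relevance + (1 - relevance_weight) * importance


# Made-up (relevance, importance) pairs for three pages, each on a 0..1 scale.
pages = {
    "page-a": (0.9, 0.2),
    "page-b": (0.6, 0.8),
    "page-c": (0.3, 0.9),
}


def rank(pages, relevance_weight):
    """Order page names from highest to lowest combined score."""
    return sorted(
        pages,
        key=lambda name: combined_score(*pages[name], relevance_weight),
        reverse=True,
    )


ranked = rank(pages, relevance_weight=0.7)
ranked_importance_heavy = rank(pages, relevance_weight=0.3)
print(ranked)
print(ranked_importance_heavy)
```

With the weight leaning toward relevance, the highly relevant but unimportant page-a comes first. Shift the weight toward importance and it drops to last, which shows how much the final order can depend on the way the two scores are balanced.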


How Does Congress Use Google? Mentions on the Congressional Record

I had read that a hearing regarding the proposed Google-Doubleclick merger was going in front of the Senate Judiciary Committee’s Subcommittee on Antitrust, Competition Policy and Consumer Rights on Thursday.

In looking for information about the hearing, I also decided to take a look at when and how Google was mentioned by Congress on the Congressional Record.

A search on Thomas for the 110th Congress yielded 43 results, and I’ve listed those below, along with the quotes where the word “Google” was used. It’s difficult to link directly to documents found through Thomas, so if you want to see more context for those, the best way might be to go to the Thomas page that I’ve linked to in this paragraph, and search for “Google.”

Congress has used Google as an example (good and bad), and as a verb. The number of results found for a search in Google has been used as proof of a position, and as a clever retort of that position.

There are mentions of Google the company, and Google as a new employer of a former agency member. One Congressman pleaded for others to look at Google Maps for proof of the size of a particular land mass.


Microsoft on Javascript Redirection Spam

A paper prepared by Microsoft researchers for the AIRWeb’07 conference this past May explores some methods that a few people use to try to trick search engines. The paper, A Taxonomy of JavaScript Redirection Spam (pdf), provides a nice overview of those methods.

In this paper, we study common JavaScript redirection spam techniques on the web.

Our findings indicate that obfuscation techniques are very prevalent among JavaScript redirection spam pages.

These obfuscation techniques limit the effectiveness of static analysis and static feature based systems.

Based on our findings, we recommend a robust counter measure using a light weight JavaScript parser and engine.
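To see why obfuscation gets in the way of static analysis, consider the small Python sketch below. It is not taken from the paper: the regular expression, the helper names, and the spam.example URL are assumptions made for illustration, and the decode step merely stands in for what a real JavaScript engine would do when it runs the script.

```python
import re
from urllib.parse import unquote

# Hypothetical static signature for a plain JavaScript redirect.
REDIRECT_PATTERN = re.compile(
    r"(?:window\.)?location(?:\.href)?\s*=|location\.replace\s*\("
)

# Matches the string argument of an unescape("...") call.
UNESCAPE_CALL = re.compile(r'unescape\("([^"]*)"\)')


def static_scan(script):
    """Return True if the script text contains an obvious redirect."""
    return bool(REDIRECT_PATTERN.search(script))


def encode_all(script):
    """Percent-encode every character, as an obfuscator might."""
    return "".join("%%%02X" % ord(ch) for ch in script)


def decode_one_layer(script):
    """Stand-in for execution: undo a single eval(unescape("...")) wrapper."""
    match = UNESCAPE_CALL.search(script)
    return unquote(match.group(1)) if match else script


plain = 'window.location.href = "http://spam.example/";'
obfuscated = 'eval(unescape("' + encode_all(plain) + '"))'

print(static_scan(plain))                         # redirect is visible
print(static_scan(obfuscated))                    # redirect is hidden
print(static_scan(decode_one_layer(obfuscated)))  # visible again once decoded
```

The pattern match finds the redirect in the plain script but not in the encoded one, even though a browser would run both the same way. Only after the wrapper is decoded does the redirect show up again, which is the kind of gap that parsing and executing the script is meant to close.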


Google on Multi-Tiered Indexing and Multi-Staged Query Processing

I discussed one of the more interesting patent applications from Google last year in Google looks at multi-stage query processing. What made it so intriguing was that it described different stages and aspects of ranking results by the search engine.

A related patent application published this week, Document compression system and method for use with tokenspace repository, goes back to that multi-staged query processing system, and makes claims for some of the more technical aspects of how information is contained within the indexes used during that process.

The abstract for the patent filing provides a high level look at some of the techniques used:

The disclosed embodiments enable multi-stage query scoring, including “snippet” generation, through incremental document reconstruction facilitated by a multi-tiered mapping scheme.
