There may be more than one URL for a single page on a website, which can cause problems when a search engine attempts to crawl and index pages on that site.
If the search engine can work out the rules that produce these different URLs for a page, and pick a single version to index, then it can save time and processing power by crawling and indexing only that one version.
The “canonical” version of a URL is the single, standard version chosen when there is more than one way to represent the URL (or address) of a page.
Web crawlers can download only a finite number of documents or web pages in a given amount of time. Therefore, it would be advantageous if a web crawler could identify URL equivalence patterns in multiple different URLs that reference substantially identical pages and download only one document, as opposed to downloading all the substantially identical documents addressed by the multiple different URLs.
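To make the idea of URL equivalence concrete, here is a minimal sketch in Python. It is not the method from the Microsoft filing, which is about discovering such rules. Here the rules are simply hard-coded, and the parameter names in `IGNORED_PARAMS` and the file names in `DEFAULT_DOCS` are assumptions picked for illustration.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical rules: query parameters assumed not to change the page,
# and file names assumed to be the default document for a directory.
IGNORED_PARAMS = {"sid", "sessionid", "utm_source", "utm_medium"}
DEFAULT_DOCS = ("index.html", "index.htm", "default.aspx")


def canonicalize(url):
    """Map a URL to one standard form under the assumed rules above."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    netloc = parts.netloc.lower()
    path = parts.path or "/"
    # Treat /dir/index.html as equivalent to /dir/
    for name in DEFAULT_DOCS:
        if path.endswith("/" + name):
            path = path[: -len(name)]
            break
    # Drop ignorable parameters and sort the rest so order does not matter.
    params = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in IGNORED_PARAMS
    ]
    params.sort()
    # The fragment never reaches the server, so it is dropped.
    return urlunsplit((scheme, netloc, path, urlencode(params), ""))


def group_by_canonical(urls):
    """Group URLs that share a canonical form; a crawler would fetch one per group."""
    groups = {}
    for url in urls:
        groups.setdefault(canonicalize(url), []).append(url)
    return groups
```

With these assumed rules, `HTTP://Example.com/index.html?sid=123&b=2&a=1#top` and `http://example.com/?a=1&b=2` fall into the same group, so a crawler would only need to download one of them.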
Continue reading Microsoft Creating Rules for Canonical URLs
I posted on Tuesday about the use of Topic Familiarity in Reranking Search Results. Stephen Pitts, of Build a Better Website, asked if I thought the patent application being discussed in that post had anything to do with personalized search.
My response was that the inventors listed in the document stated within it that they weren’t looking at specific search terms, user queries, or user behavior when ranking pages to determine whether those were “introductory” or “advanced” pages. The question stuck with me though, and while looking at some other papers on the web, I noticed a Yahoo paper from the beginning of 2005 that shared an author with the patent, Omid Madani. The paper was an update of Yahoo’s personalization efforts for a January, 2005, conference titled Beyond Personalization 2005.
I decided that the paper might be worth writing about here, but since it was from early 2005, I thought I should look for some other documents from the company about personalization. A search turned up some job listings at Yahoo that I thought were interesting.
Yahoo Job Listings
Continue reading A Peek at Personalization at Yahoo
Unfamiliar with a topic, and want to find a simple page on a subject, one that doesn’t require background reading or knowledge to understand?
More familiar with that subject, and you want to find an advanced page on the web?
Could a search engine help you find pages and rerank them based upon how familiar you may indicate that you are with the topic related to your query? It’s possible.
A search engine might pay attention to the following when indexing pages:
- Reading levels for the page,
- Lengths of sentences, in words, and other features of text on the page,
- How simple or complex the stopwords used on the page may be.
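As a rough illustration of the kinds of signals listed above, here is a small Python sketch that pulls a few simple text features from a page. The stopword list, the feature names, and the choice of a stopword ratio are my own assumptions for the example; the patent application may measure these things quite differently.

```python
import re

# A tiny, hypothetical stopword list; a real system would use a much longer one.
STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "it", "that"}


def familiarity_features(text):
    """Return simple text features that might hint at how introductory a page is."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text.lower())
    if not sentences or not words:
        return {"avg_sentence_len": 0.0, "avg_word_len": 0.0, "stopword_ratio": 0.0}
    return {
        # Average number of words per sentence
        "avg_sentence_len": len(words) / len(sentences),
        # Average number of characters per word
        "avg_word_len": sum(len(w) for w in words) / len(words),
        # Share of words that are in the assumed stopword list
        "stopword_ratio": sum(w in STOPWORDS for w in words) / len(words),
    }
```

Features like these could be fed to a classifier that labels pages as introductory or advanced, though which features matter, and how they are weighted, is not something this sketch tries to answer.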
Continue reading How to Use Topic Familiarity to Rerank Search Results
A good percentage of all searches upon the major search engines involve people looking for something in a specific geographic location.
Understanding how search engines find and extract geographic information on web sites, handle that location information, and present it to searchers is one of the most important and possibly fastest growing areas in search, especially with the growth of web access on phones and PDAs.
Mike Blumenthal has started a new blog, Understanding Google Local & Yahoo Local, to focus on issues around how search engines handle local search. I’m looking forward to his posts.
His latest asks Does Google Maps have a sandbox?
Continue reading New Blog on Local Search
Rand, over at 14th Colony, asked about the ruling against Google by the Court of First Instance in Brussels (Belgium), and its translation into English. I found a copy of the ruling at ChillingEffects.org in an image PDF file. I’ve transcribed the part of it that details the ruling of the Court in English.
Some interesting points, before the transcription:
Continue reading Belgian Copyright Ruling Against Google News
If you are a site owner or builder or designer who isn’t taking a mobile audience into consideration with your site, you may find the search engines routing traffic for mobile users around your pages. A new patent application from Microsoft shows one way that could be done.
We saw Google release an accessible search not long ago that reordered sites based upon how accessible those were. Reranking pages based upon whether or not a searcher is using a handheld device doesn’t seem out of the question, and it may not be something we have to wait a long time to see.
Some sites are better viewed on mobile devices than others, and some sites are completely unusable on a web-capable mobile phone or PDA. Imagine an addition to a search engine index that looks at how friendly pages are to handheld devices, and reorders pages in search results based upon mobile-friendliness indications. That’s the focus of this patent filing from Microsoft.
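A reranking step like that might look something like the following Python sketch. The `Result` fields, the 0 to 1 `mobile_score`, the blending weight, and the cutoff are all hypothetical values chosen for illustration; the patent filing does not necessarily combine scores this way.

```python
from dataclasses import dataclass


@dataclass
class Result:
    url: str
    relevance: float     # ordinary ranking score (assumed to be 0..1)
    mobile_score: float  # hypothetical mobile-friendliness indication (0..1)


def rerank_for_mobile(results, is_mobile, weight=0.5, min_score=0.2):
    """Reorder results for handheld users, dropping pages judged unusable."""
    if not is_mobile:
        # Desktop searchers get the ordinary relevance ordering.
        return sorted(results, key=lambda r: r.relevance, reverse=True)
    # Remove pages below the assumed mobile-friendliness cutoff.
    usable = [r for r in results if r.mobile_score >= min_score]
    # Blend relevance with mobile friendliness and sort by the blended score.
    return sorted(
        usable,
        key=lambda r: (1 - weight) * r.relevance + weight * r.mobile_score,
        reverse=True,
    )
```

The same result list would then come back in a different order, and possibly shorter, depending on whether the searcher is on a handheld device.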
Mobile friendly internet searches
Invented by Frank S. Serdy, Jr., Rainer J. Romatka, Tyler E. Hennessy, Neil A. Brench
Assigned to Microsoft
US Patent Application 20060212451
Published September 21, 2006
Filed: March 15, 2005
Continue reading Reranking Search Results Based Upon Mobile Friendliness and Removing Unfriendly Sites
I came across a short paper, presented at SIGMOD/PODS ’06 in Chicago in June, and some other resources on some of the major data projects happening at Google that I wanted to share. Data Management Projects at Google (pdf) covers Google’s Bigtable, Google Base, and Sawzall. It doesn’t go into much depth on any of them, but it’s a nice short overview of three of the data projects underway at Google these days.
A GoogleTech Presentation from May 31, 2006, also provides a nice look at Building Large Systems at Google:
(Link to this at Google Video: Building Large Systems at Google )
Continue reading Data Projects At Google