Learning from the Spanish Web

In Characteristics of the Web of Spain (pdf), by Ricardo Baeza-Yates, Carlos Castillo, and Vicente López, the authors take a close look at the web sites of Spain, and find a number of interesting results.

The paper was published last year, but I don’t see many citations to it from English-language sites listed in Google, and it probably deserves a much wider readership.

One of the hurdles that the authors faced was identifying which sites were from Spain. A .es domain name is considerably more expensive than a .com name, and to use a .es domain name, a site owner needs to “prove that the applicant owns a trade mark, or represents a company, with the same name as the domain being registered.”

By taking sites that had IP addresses from networks physically located in Spain, along with sites with a .es top-level domain (TLD), these researchers were able to look at over 16 million web sites.
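The selection rule described above (a .es domain, or an IP address in a network located in Spain) can be sketched roughly as follows. This is only an illustration and not the authors' code: the function names are made up for this example, and the network ranges are reserved documentation placeholders rather than real Spanish allocations.

```python
import ipaddress
from urllib.parse import urlsplit

# ASSUMPTION: placeholder ranges taken from the reserved documentation
# networks. They are NOT real Spanish allocations; a real study would load
# ranges from a geolocation or registry data source.
SPANISH_NETWORKS = [
    ipaddress.ip_network("192.0.2.0/24"),
    ipaddress.ip_network("198.51.100.0/24"),
]


def has_es_tld(url):
    """Return True if the URL's host name ends in the .es top-level domain.

    The URL must include a scheme (such as http://) so the host can be parsed.
    """
    host = urlsplit(url).hostname or ""
    return host.lower().rstrip(".").endswith(".es")


def ip_in_spain(ip, networks=SPANISH_NETWORKS):
    """Return True if the IP address falls inside any of the given networks."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in networks)


def is_spanish_site(url, ip):
    """A site counts as Spanish if either test passes."""
    return has_es_tld(url) or ip_in_spain(ip)
```

Using either test, rather than requiring both, matches the description above: a .com site hosted in Spain and a .es site hosted elsewhere would both be included.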



Duplicate Content Issues and Search Engines

There are a number of reasons why pages don’t show up in search engine results.

One such reason is duplicate content: the content at more than one web address, or URL, appears substantially similar to the search engines at each location where they see it.

Some duplicate content may cause pages to be filtered when search engines serve results, and there is no guarantee as to which version of a page will show in results and which versions won’t. Duplicate content may also lead to some sites and some pages not being indexed by search engines at all, or may cause a search engine crawling program to stop indexing the pages of a site because it finds too many copies of the same pages under different URLs.

There are a few different reasons why search engines dislike duplicate content. One is that they don’t want to show the same pages more than once in their search results. Another is that they don’t want to spend resources indexing pages that are substantially similar.
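To make "substantially similar" a little more concrete, here is a minimal sketch of one well-known way to compare pages: break each page into overlapping word sequences (shingles) and measure how much two pages' sets overlap (Jaccard similarity). This is not a description of what any particular search engine does. The function names, the three-word shingle size, and the 0.8 threshold are all assumptions chosen for illustration.

```python
import re


def shingles(text, k=3):
    """Return the set of k-word sequences (shingles) found in the text."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < k:
        # Very short text: treat the whole thing as a single shingle.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets: size of intersection over size of union."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def near_duplicates(pages, threshold=0.8, k=3):
    """Return pairs of URLs whose text similarity meets the threshold.

    `pages` maps each URL to the text found at that URL.
    """
    sets = {url: shingles(text, k) for url, text in pages.items()}
    urls = sorted(sets)
    return [
        (u, v)
        for i, u in enumerate(urls)
        for v in urls[i + 1:]
        if jaccard(sets[u], sets[v]) >= threshold
    ]
```

Comparing every pair of pages like this only works for small collections. It is meant to show the idea of similarity under different URLs, not how to do it at web scale.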

I’ve listed some areas where duplicate content exists on the web, or seems to exist from the stance of search engine crawling and indexing programs. I’ve also included a list of some patents and some papers that discuss duplicate content issues on the web.



Google predicting queries

Speeding Up the Web

The web has changed since its earlier days, when a webmaster carefully considered every bit of information before adding an image or some text to a web page.

I remember spending hours and hours optimizing images so that they were small and still decent looking, and squeezing white space out of html to make pages faster for phone modem transmissions. Learning as much as possible about Cascading Style Sheets was important because they could help shave a lot of html out of a page. Not everyone considers this stuff, and it feels kind of odd going to sites that get millions of views a month, and seeing them using tables and font tags.

Bandwidth has improved, and more images, video, and other content come across the screen than ever before. That kind of use will probably only grow, which means that we will need more bandwidth.

Google’s Need for Speed



Pagerank Patent Updated

A new version of one of the pagerank patents was published today.

There are some changes to the document. Many of them appear to be bringing parts of the first two patent applications involving pagerank together.

The New Patent

Method for node ranking in a linked database
Inventor: Lawrence Page
Assignee: The Board of Trustees of the Leland Stanford Junior University
US Patent 7,058,628
Granted June 6, 2006
Filed July 2, 2001


A method assigns importance ranks to nodes in a linked database, such as any database of documents containing citations, the world wide web or any other hypermedia database. The rank assigned to a document is calculated from the ranks of documents citing it. In addition, the rank of a document is calculated from a constant representing the probability that a browser through the database will randomly jump to the document. The method is particularly useful in enhancing the performance of search engine results for hypermedia databases, such as the world wide web, whose documents have a large variation in quality.
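The abstract's two ingredients, rank passed along from citing documents and a constant for the chance of a random jump, can be illustrated with a short iterative calculation. This is a simplified sketch and not the method as claimed in the patent. The damping value of 0.85, the fixed number of iterations, and the treatment of pages with no outgoing links are assumptions made for this example.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Estimate ranks for a small link graph by repeated averaging.

    `links` maps each page to the list of pages it links to.
    ASSUMPTIONS: damping=0.85 and a fixed iteration count are illustrative
    choices; pages with no outgoing links spread their rank evenly over
    all pages.
    """
    nodes = set(links)
    for targets in links.values():
        nodes.update(targets)
    if not nodes:
        return {}
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}

    for _ in range(iterations):
        # Every page starts with the "random jump" share.
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for node in nodes:
            targets = links.get(node, [])
            if targets:
                # A page divides its rank among the pages it cites.
                share = damping * rank[node] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # No outgoing links: spread the rank across all pages.
                share = damping * rank[node] / n
                for target in nodes:
                    new_rank[target] += share
        rank = new_rank
    return rank
```

For a tiny graph such as `{"A": ["B", "C"], "B": ["C"], "C": ["A"]}`, page C ends up ranked above page B, because C is cited by both A and B while B is cited only by A.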



Microsoft Reranking and Filtering Redundant Information

How much variety should you see in search results? Should search engines mix things up a little, so when you look for something with a search, you don’t see results that are all exactly about the same thing? Should pages in those results that seem too far off topic be pushed back in the results?

Imagine performing a search at MSN, with the search engine responding to your query by gathering links to pages and descriptions of those pages, then taking a closer look at the content of those documents and sorting them in a different order.

For example, searching for “Abraham Lincoln,” you might see pages about the following within returned documents:

  • Someone’s cat, named Abraham Lincoln,
  • The Abraham Lincoln Theme Park,
  • A website selling Abraham Lincoln memorabilia, and
  • Other topics only somewhat related.



Contextual Ads on Parked Domains

Nice article from a few days ago over at CircleID titled Questioning Parked Domains and Google AdNonSense.

The author starts off by asking if contextual advertising is helping or hurting the web. He notes that at first blush, it appears to be a good idea. But he digs a little deeper to see how it is being used in some instances, and decides that maybe it isn’t such a good idea:

To make money with contextual advertising you want your content to be bad. Yes, you want it to be bad. You do not want the user to like what you have on the webpage or find what they are looking for in hopes that after not finding it, they will either do another search in your embedded Google search box or they will click one of the contextual ads on the page in hopes of finding what they came there to find

I wonder how the advertisers feel about appearing on pages like these. Google recently published a patent application that described a method for advertisers to find good advertising partners, looking at such things as the quality of the content on the advertisers’ sites:



Researching Corporate Acquisitions

Since writing about Google acquisitions a few months ago, and Yahoo acquisitions, I’ve received more than a couple of requests from people asking about strategies and methods for researching corporate acquisitions online.

There are a lot of potential sources of information that you can look at, but here are a few that you might want to start with.

1. Web-based searches for reference sites, news articles, blog posts.

It’s possible to find lists of acquisitions on reference sites, on blog posts by people whose companies have been acquired, and through news stories about purchases. Search for things like “Google Acquired” (with the quotation marks) or “Google acquisitions” (again, with the quotation marks). Make a list of all of the companies that you can find that way, and then conduct searches for those companies to see if you can find out more about the acquisitions.



Web Decay and Dead Links Can be Bad for Your Site

How harmful are dead links to search engine rankings? Or pages filled with outdated information? Can internal redirects on a site also hurt rankings? What about the redirects used on parked domains?

A new patent application published last week at the US Patent and Trademark Office (USPTO), and assigned to IBM, Methods and apparatus for assessing web page decay, explores the topics of dead pages, web decay, soft 404 error messages, redirects on parked pages, and automated ways for search engines to look at these factors while ranking pages. I’ll explore a little of the patent application here, and provide some ideas on ways to avoid having decay harm the rankings of web sites.

The authors of the patent filing include:


Getting Information about Search, SEO, and the Semantic Web Directly from the Search Engines