Thanks to David Laws

You may have noticed some tweaks to the design of SEO by the Sea over the last day or so.

I have David Laws, of 1 Cog, to thank for doing some fine tuning of the CSS file here. David graciously volunteered his services to make those changes primarily to make links stand out a little more than they had in the past.

He also made some changes to colors for parts of the blog. I’m not sure that I initially liked the highlighting on visited links, but it has been growing on me this morning.

Thank you very much, David.

On a related topic, I updated to a new version of “Did You Pass Math”, which includes at least two important changes. One is that it now allows trackbacks, which were broken under the old version. The other is that it moves the question and the field for an answer to the top of the comment form.


How Google Manages Large Amounts of Data?

If you get excited over thoughts of how large amounts of data may flow from one part of a network to another, with multiple master and slave machines, you might find a glimpse of how Google might handle information interesting. A patent application published yesterday may provide some ideas on how Google shares terabytes of information across a very widely distributed network.

The inventor listed in this document is Arvind Jain, who is the Centre Head of Research and Development for Google in Bangalore. According to his profile at the 2005 International Conference on High Performance Computing, held in December:

At Google, he has worked on various infrastructure projects including the crawl and indexing system, distributed file replication system, and compression techniques for large scale storage systems.

Here’s the patent application:

Continue reading How Google Manages Large Amounts of Data?


Innovating Product Reviews at Google

Some sites on the web do reviews of products and services pretty well, such as or

Imagine Google wanting to provide reviews. One of the mantras that we often hear coming from their Mountain View offices is that they wouldn’t get into a field unless they can do something innovative.

So, how might Google handle reviews? A new patent application from the company gives us some insight into what they might do. It includes reviews of things such as:

  • consumer products,
  • business products,
  • movies,
  • books,
  • restaurants,
  • hotels, and
  • travel packages.

Why bother with this patent application?

Continue reading Innovating Product Reviews at Google


Sébastien Billard’s Interview with Danica Brinton of Ask

Sébastien Billard sent me a heads-up about an interview that he has conducted with Danica Brinton, who is the head of International Product Management and Localization for Ask: Interview avec Danica Brinton.

While the original is in French, Sébastien has also translated it into English. There’s a link to the English version in pdf format in the first paragraph of the French version. There’s some great information here on how Ask’s blog search works. Here’s a snippet:

Because Bloglines is the largest and longest established major blog reading community online, Ask Blog & Feed Search also has the most robust index of content on the Web: articles are indexed from 2001 through five minutes ago (or less). New posts are added at a rate of four to six million per day, with a total index in excess of 1.5 billion articles, with 4 to 6 million added every day.

The interview also describes different ways that search results can be sorted, and provides some insight into the ExpertRank algorithm.

Continue reading Sébastien Billard’s Interview with Danica Brinton of Ask


Spam Email Filtering Based Upon Links

Can links in emails help reduce email spam? Possibly.

A patent application from Google published last week, which I missed until I checked carefully through the patent assignment database, describes an interesting approach to checking for the presence of spam in emails.

If an email has a link within it, the page that it links to can be examined using a concept categorization of that linked content.

When an electronic message is received, hyperlinks within the message are identified, and information about each link is categorized based upon “semantic relationships” drawn from that information. That categorization, along with other information, can then be used to determine whether or not the message is undesired and should be filtered.
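
The flow described above (find the links, categorize what they point to, then decide whether to filter) can be sketched roughly as follows. This is only an illustration and not the method claimed in the patent application: the keyword lists, the names CONCEPT_KEYWORDS and UNDESIRED_CONCEPTS, and the simple word-overlap scoring are all assumptions made for the example, and the fetch_page argument stands in for whatever would actually retrieve the linked page.

```python
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collect href values from <a> tags in an HTML message body."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(message_html):
    """Return every hyperlink target found in the message."""
    parser = LinkExtractor()
    parser.feed(message_html)
    return parser.links


# Hypothetical concept vocabulary; a real system would use a far richer
# model of "semantic relationships" than keyword overlap.
CONCEPT_KEYWORDS = {
    "pharmacy": {"pills", "prescription", "pharmacy"},
    "gambling": {"casino", "poker", "jackpot"},
    "search": {"search", "patent", "index"},
}
UNDESIRED_CONCEPTS = {"pharmacy", "gambling"}  # assumed filtering policy


def categorize(page_text):
    """Pick the concept whose keywords overlap most with the page text."""
    words = set(page_text.lower().split())
    scores = {c: len(words & kws) for c, kws in CONCEPT_KEYWORDS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


def is_undesired(message_html, fetch_page):
    """Flag a message if any linked page falls into an undesired concept."""
    for url in extract_links(message_html):
        if categorize(fetch_page(url)) in UNDESIRED_CONCEPTS:
            return True
    return False


# Stand-in for fetching pages over the network (made-up content).
FAKE_PAGES = {
    "http://example.com/offer": "cheap pills no prescription online pharmacy",
    "http://example.org/blog": "a blog about search engine patent filings",
}


def fake_fetch(url):
    return FAKE_PAGES.get(url, "")
```

Calling is_undesired('<a href="http://example.com/offer">Click</a>', fake_fetch) flags the message, while a message linking only to the example.org page passes. In practice the categorization would be one signal among several, as the application itself suggests.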

Method and system to detect e-mail spam using concept categorization of linked content
Invented by Johnny Chen
US Patent Application 20060122957
Published June 8, 2006
Filed: December 3, 2004

Continue reading Spam Email Filtering Based Upon Links


Interviewed, and Some Other Random Musings

Aaron Pratt, of SEO Buzz Box, asked me recently if I would answer some questions for him in an interview. He asked me some easy questions, like whether I do live near the sea, and some tougher questions, such as how Google might be able to tell if a link is natural or unnatural. It was an enjoyable experience, and I would like to thank Aaron for giving me the chance to share some answers to questions on his site.

Aaron Wall of SEO Book has an excellent new article on Search Relevancy Algorithms: Google vs Yahoo! vs MSN, in which he takes a closer look at the business models and search algorithms of Google, Yahoo, MSN and Ask. As a stand-alone article it’s very good. It could be an excellent start to a book on search, if Aaron would consider expanding it.

Jaron Lanier, over at Edge, wrote an essay published at the end of May on some of the problems with the Wisdom of Crowds and the harnessing of collective intelligence, titled DIGITAL MAOISM: The Hazards of the New Online Collectivism. It offers an interesting comparison of some of the similarities and differences between the Wikipedia, My Space, and Google, and of how those sources rely upon the interactions of massive numbers of people.

Continue reading Interviewed, and Some Other Random Musings


Learning from the Spanish Web

In Characteristics of the Web of Spain (pdf), by Ricardo Baeza-Yates, Carlos Castillo, and Vicente López, the authors take a close look at the web sites of Spain, and find a number of interesting results.

The paper was published last year, but I don’t see a lot of citations to it from English language sites listed in Google, and it probably deserves a much wider readership.

One of the hurdles that the authors faced was identifying which sites were from Spain. A .es domain name is considerably more expensive than a .com name, and to use a .es domain name, a site owner needs to “prove that the applicant owns a trade mark, or represents a company, with the same name as the domain being registered.”

By taking sites that had IP addresses from networks physically located in Spain, together with sites under the .es top level domain (tld), these researchers were able to look at over 16 million web sites.
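
As a rough illustration of that selection rule, a site counts if it either sits under .es or resolves to an address in a network located in Spain. The sketch below is an assumption-laden example and not the authors' code: the function names are made up, and the SPANISH_NETWORKS list uses reserved documentation ranges as placeholders, since the paper's actual network list isn't reproduced here.

```python
import ipaddress

# Placeholder networks (reserved documentation ranges), standing in for
# the real list of networks physically located in Spain.
SPANISH_NETWORKS = [
    ipaddress.ip_network("192.0.2.0/24"),
    ipaddress.ip_network("198.51.100.0/24"),
]


def has_es_tld(hostname):
    """True if the hostname falls under the .es top level domain."""
    return hostname.lower().rstrip(".").endswith(".es")


def hosted_in_spain(ip):
    """True if the address belongs to one of the listed networks."""
    address = ipaddress.ip_address(ip)
    return any(address in network for network in SPANISH_NETWORKS)


def in_spanish_web(hostname, ip):
    """A site qualifies on either criterion: .es name or Spanish network."""
    return has_es_tld(hostname) or hosted_in_spain(ip)
```

Taking the union of the two tests matters here, because many Spanish sites use .com names to avoid the cost and paperwork of a .es registration and would be missed by a tld check alone.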

Continue reading Learning from the Spanish Web


Duplicate Content Issues and Search Engines

There are a number of reasons why pages don’t show up in search engine results.

One of the more common is duplicate content, where the content at more than one web address, or URL, appears to the search engines to be substantially similar at each location.

Some duplicate content may cause pages to be filtered at the time of serving of results by search engines, and there is no guarantee as to which version of a page will show in results and which versions won’t. Duplicate content may also lead to some sites and some pages not being indexed by search engines at all, or may result in a search engine crawling program stopping the indexing of all of the pages of a site because it finds too many copies of the same pages under different URLs.
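
To give a feel for what “substantially similar” could mean in practice, here is a minimal sketch using word shingles and Jaccard similarity, a well-known general technique for comparing documents. It is not a description of how any particular search engine detects duplicates; the shingle size, the 0.8 threshold, and the function names are all assumptions chosen for the example.

```python
def shingles(text, size=3):
    """Return the set of overlapping word sequences of the given size."""
    words = text.lower().split()
    if len(words) < size:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}


def jaccard(a, b):
    """Share of shingles the two sets have in common (0.0 to 1.0)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def substantially_similar(text_a, text_b, threshold=0.8):
    """Treat two pages as duplicates when their shingle overlap is high."""
    return jaccard(shingles(text_a), shingles(text_b)) >= threshold
```

Two URLs that return the same text would score 1.0 and be treated as duplicates, while pages sharing only a few common phrases would score close to 0.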

There are a few different reasons why search engines dislike duplicate content. One is that they don’t want to show the same content more than once in their search results. Another is that they don’t want to spend resources indexing pages that are substantially similar.

I’ve listed some areas where duplicate content exists on the web, or seems to exist from the stance of search engine crawling and indexing programs. I’ve also included a list of some patents and some papers that discuss duplicate content issues on the web.

Continue reading Duplicate Content Issues and Search Engines


Getting Information about Search, SEO, and the Semantic Web Directly from the Search Engines