Would search engines be better if they started web crawls from sites like Twitter or Facebook? Wikipedia or Mahalo? DMOZ or the Yahoo Directory?
The Web refreshes at an incredible rate, with new pages added, old pages removed, and words pouring out from blogs, news sites, and other genres of pages. Ecommerce sites showcase new products and eliminate old ones. New sites launch and old domains expire.
Search engines attempt to keep their indexes of the Web as fresh as possible, sending out crawling programs to discover new pages, record changes, and note what has disappeared. Failing to do so means delivering searchers to deleted pages and overwritten content from stale indexes that miss out on new sites.
When a search engine starts crawling the Web, it often begins by following URLs from chosen seed sites to explore other pages and other domains. But how does a search engine choose those seed sites?
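Whatever seeds are chosen, the crawl itself can be pictured as a breadth-first walk outward from them. Here is a minimal sketch in Python, assuming a hypothetical `fetch_links` function that downloads a page and returns the URLs it links to:

```python
from collections import deque

def crawl(seed_urls, fetch_links, max_pages=1000):
    """Breadth-first crawl starting from a list of seed URLs.

    fetch_links(url) is a stand-in for a real fetcher/parser: it is
    assumed to download a page and return the URLs found on it.
    """
    frontier = deque(seed_urls)   # URLs waiting to be crawled
    seen = set(seed_urls)         # URLs already discovered
    crawled = []
    while frontier and len(crawled) < max_pages:
        url = frontier.popleft()
        crawled.append(url)
        for link in fetch_links(url):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return crawled
```

Because every later page is reached by following links from the seeds, the choice of seed sites shapes which parts of the Web the crawler sees first.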
Would it surprise you to learn that searches make up around 10 percent of all pageviews on the Web, and indirectly lead to more than 21 percent of the pages viewed online? It surprised a couple of researchers from Yahoo.
That’s the result of a study conducted by Ravi Kumar and Andrew Tomkins from a sample of over 50 million user pageviews collected over eight days in March 2009. The information was captured through the Yahoo toolbar from people who agreed to the collection of data for this kind of analysis, and supplemented by Yahoo’s search logs.
While the data is limited to Yahoo toolbar users who opted in, and doesn’t include mobile searches or searches that used AJAX to display results, it does capture how people browse the Web and search at a number of search engines, as well as searches at sites like eBay and Amazon.
The study is described in a paper titled A Characterization of Online Search Behavior (pdf), and is being presented tomorrow at the WWW2010 Conference in a session dedicated to User Models on the Web.
Google and Yahoo on Faster Web Pages
Earlier this month, Google announced that they would start considering the speed of a site as one of the ranking signals that they use to rank pages in search results.
Yahoo published a patent filing last year that also described how they might use page load and page rendering times as ranking signals. I wrote a post soon after it was published, Does Page Load Time influence SEO?, exploring how Yahoo and other search engines might look at different factors regarding the speed of pages, including the experience of users on web pages.
Google’s Matt Cutts wrote about the recent Google announcement and provided some more details, telling us that fewer than 1 percent of queries would likely be affected by this change.
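For anyone curious what a raw speed measurement might look like, here is a rough Python sketch that times how long a page takes to download. It only measures the HTTP fetch, not rendering, which is one of the other factors the patent filings describe:

```python
import time
from urllib.request import urlopen

def page_load_time(url, timeout=10):
    """Roughly measure how long it takes to download a page, in seconds.

    This is a crude illustration: real speed signals would likely
    aggregate many measurements and also account for rendering time.
    """
    start = time.monotonic()
    with urlopen(url, timeout=timeout) as response:
        response.read()  # force the full body to download
    return time.monotonic() - start
```

A search engine would presumably collect many such timings per page (from crawlers, toolbars, or browsers) rather than trusting a single fetch.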
A newly granted Google patent on phrase-based indexing calls for a new look at that approach to indexing phrases on the Web, including a process referred to as phrasification.
Say you want to find out who the chief of police is in New York City. You might type the following words into a search box at Google:

- new york police chief
When Google attempts to find an answer for you, it may break your query into individual words to find all of the documents that might be a best match for your search:
- New AND York AND police AND chief
Google may then take all the documents that are returned, and see which ones contain all of the terms you used, and then rank those based upon some of the ranking algorithms the search engine uses to try to show you the best matches for your query.
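That word-by-word matching can be sketched with a toy inverted index in Python. This illustrates the general AND-query idea, not Google’s actual implementation:

```python
def build_index(docs):
    """Build a simple inverted index mapping each term to the set of
    document ids that contain it."""
    index = {}
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index.setdefault(term, set()).add(doc_id)
    return index

def and_query(index, query):
    """Return the ids of documents containing every term in the query,
    i.e. new AND york AND police AND chief."""
    terms = query.lower().split()
    if not terms:
        return set()
    results = set(index.get(terms[0], set()))
    for term in terms[1:]:
        results &= index.get(term, set())
    return results
```

The documents that survive the intersection would then be ranked by whatever scoring signals the engine uses; phrase-based indexing instead treats runs of words like “police chief” as units, rather than intersecting single terms.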
There are a number of ways a search engine may decide upon how important a web page might be. That measure of importance might be used by search engines, along with a determination of relevance, as one of the ranking signals used to decide which pages to show first in lists of results shown to searchers. That importance might also be used to decide which pages a search engine crawling program should crawl and index, and revisit to see if content on those pages has changed.
A search engine might view the links between web pages, and decide that pages linked to frequently are more important than pages that aren’t. It might also determine that web pages that are linked to by important pages are more important than pages linked to by less important pages. Google’s PageRank is one approach for determining how important pages might be based upon looking at links between pages.
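The intuition behind link-based importance can be illustrated with a simplified, iterative version of the PageRank calculation. This is a toy Python sketch, not Google’s production algorithm:

```python
def pagerank(links, damping=0.85, iterations=50):
    """Iteratively estimate PageRank scores from a link graph.

    links maps each page to the list of pages it links to. Pages
    linked to by important pages accumulate higher scores.
    """
    pages = set(links) | {p for targets in links.values() for p in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}  # start with equal importance
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, targets in links.items():
            if targets:
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:  # dangling page: spread its rank across all pages
                for p in pages:
                    new_rank[p] += damping * rank[page] / n
        rank = new_rank
    return rank
```

Running this on a small graph shows the effect described above: a page linked to by several well-linked pages ends up with a higher score than one reached only through a single weak link.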
There are other ways that a search engine might decide how important a web page is, including attempting to measure how many people actually use that page.
A recently granted Google patent from the founders of Applied Semantics discusses a search interface that could help searchers find web pages based upon the meanings of their queries rather than just pages that include those keywords.
In the late 90s, Adam Weissman and Gilad Elbaz decided to start a search engine that would search on meanings or concepts instead of keywords. Along with a few friends and family, they formed a company named Oingo, and along the way filed for a patent on a search based upon meanings rather than keywords.
The technology they developed could be used in a number of ways in addition to search, and provided an interesting alternative to keyword based search that would lead to some significant developments in the world of search engines.
Oingo Changes Directions
There are a lot of pages on the Web that conventional search engines can’t find, crawl, index, and show to searchers. The University of California (UC), funded partially by the US Government, has been working to change that.
When you search the Web at Google or Yahoo or Bing, you really aren’t searching the Web, but rather the indices that those search engines have created of the Web. To some degree, it’s like searching on a map of a place instead of the place itself. The map is only as good as the people mapping it.
Map makers have consistently worked to develop new ways to get more information about the areas that they survey. For example, a New Deal program in the 1930s under the Agricultural Adjustment Administration led to the creation (pdf) of a $3,000,000 map.
Paul Boag wrote a post at his site Boagworld asking a number of questions about SEO. I started writing a comment at his blog, but it quickly grew to become longer than his post and the questions and comments that he had about SEO, so I decided to post my response here.
In Paul’s post, Why I don’t get SEO, he came up with five reasons why he had doubts about SEO. My response doesn’t address his concerns in the order that he asked them, and it touches upon some of the comments written by others as well. If you have questions or concerns about SEO that aren’t addressed in this response, please feel free to ask them in the comments below.
What is Good SEO?
Good SEO is not “cheating the system,” or “manipulating search results.” Good SEO is part of a marketing plan that makes it more likely that the good content you create will be found by people who might be interested in what your web site has to offer.