Optical Character Recognition, or OCR, is a technology that can enable a computer to look at pictures that include text, and translate those visual representations of text into actual text. If you have words within images on your web pages, there’s a good chance that search engines are ignoring those words, when it comes to indexing your pages.
But that might change sometime in the future.
While OCR has been around for a while, search engines haven’t been using the technology when crawling and indexing the content of Web pages. Google’s webmaster guidelines tell us:
Try to use text instead of images to display important names, content, or links. The Google crawler doesn’t recognize text contained in images. If you must use images for textual content, consider using the “ALT” attribute to include a few words of descriptive text.
Yahoo’s page, How to Improve the Position of Your Website in Yahoo! Search Results provides the following tip:
Earlier this month I wrote about a granted Google patent, and a continuation of that patent filed earlier this year, that describe How Google Might Suggest Topics for You to Write About, by providing information to web publishers on queries and topics that are either under-represented in search results or where there’s more demand for information about those topics or queries than there are search results to meet that demand.
The topic struck home with a number of people, especially journalists, and I had a chance to have a conversion with Financial Times (FT.com) reporter Kenneth Li about Google’s patents. The Financial Times ran with two different stories on the topic (Google shadow over new media groups, and Google eyes Demand Media’s way with words), focusing primarily on how the technology involved in the patents could bring Google into competition with companies such as Demand Media, Associated Content, and AOL.
While searching through patent filings this morning, I came across an interesting newly published patent application from Demand Media. In the FT.com article on Demand Media, we’re told that:
When a search engine shows you results for a search, the pages shown are likely in order based upon a mix of relevance and importance.
But a search engine doesn’t usually stop there. It may look at other things to filter and reorder search results.
In 2006, I wrote 20 Ways Search Engines May Rerank Search Results, which described a number of ways that search engines may rerank pages. I followed that up in 2007 with 20 More Ways that Search Engines May Rerank Search Results.
I decided that it was time for a sequel or two in this series. I came up with another 25 reranking methods, but decided to stop at 10 in this post.
Many of the following are described in patents, and some of those patents were originally filed years ago – prehistoric times in Web years. The search engines may have incorporated ideas from those patents into what they are doing now, adopted those methods and since moved on to something new, or put them in a filing cabinet somewhere and forgot about them (I’d like the key to that filing cabinet).
This post may get you thinking about the benefits of using heading elements and lists on web pages for SEO purposes from a slightly different perspective than you may be used to.
Google uses a large number of signals to decide upon the order of pages shown in search results. Some of those signals measure the quality or importance of a web page, while others may indicate how relevant a page is for a particular search query entered into a search engine’s search box.
One fairly obvious relevancy signal is whether or not the words in a query actually appear upon a page that might be a search result for that query. If those words appear on the page more than once, the page might be considered even more relevant for that particular query than other web pages where the terms only appear once, or not at all.
Another factor that might indicate how relevant a page is for a particular set of terms is how close those terms might be on a page. While you could easily count the number of words between individual query terms to determine how close they are to each other, the formatting of web pages presents some challenges to the approach of simply counting words between terms, such as in a list like the following:
Not every link from a page in a link-based ranking system is equal, and a search engine might look at a wide range of factors to determine how might weight each link on a page may pass along.
One of the signals used by Google to rank web pages looks at the links to and from those pages, to see which pages are linked to by others. Links from “important” pages carry more weight than links from less important pages. An important page under this system is one that is linked to by other important pages, or by a large number of less important pages, or a combination of the two. This signal is known as PageRank, and it is only one of a large number of Google ranking signals used to rank web pages and determine how highly those pages show up in search results in response to a query from a searcher.
An early paper by the founders of Google, The Anatomy of a Large-Scale Hypertextual Web Search Engine, tells us:
PageRank can be thought of as a model of user behavior. We assume there is a “random surfer” who is given a web page at random and keeps clicking on links, never hitting “back” but eventually gets bored and starts on another random page. The probability that the random surfer visits a page is its PageRank.
Would search engines be better if they started web crawls from sites like Twitter or Facebook? Wikipedia or Mahalo? DMOZ or the Yahoo Directory?
The Web refreshes at an incredible rate, with new pages added, old pages removed, and words pouring out from blogs, news sites, and other genres of pages. Ecommerce sites showcase new products and eliminate old ones. New sites launch and old domains expire.
Search engines attempt to keep their indexes of the Web as fresh as possible, and send out crawling programs to find the new, update changes, and explore disappearances. Failure to do so means outdated search engines that deliver people to deleted pages, overwritten content, and stale indexes that miss out on new sites.
When a search engine starts crawling the Web, it often begins by following URLs from chosen seed sites to explore other pages and other domains. But how does a search engine choose those seed sites?
Would it surprise you if searches on the Web make up around 10 percent of all pageviews on the Web, and indirectly led to more than 21 percent of the pages viewed online? It surprised a couple of researchers from Yahoo.
That’s the result of a study conducted by Ravi Kumar and Andrew Tomkins from a sample of over 50 million user pageviews that they collected during 8 days in March, 2009. The information was captured through the Yahoo toolbar from people who agreed to the collection of data for this kind of analysis. Additional information was added by looking at the search logs from Yahoo.
While the data is limited to users of the Yahoo toolbar who agreed to the use of the data, and doesn’t include mobile searches or searches that used AJAX to display results, it does capture how people browse the Web and search at a number of search engines as well as searches at sites like eBay and Amazon.
The study is described in a paper titled A Characterization of Online Search Behavior (pdf), and is being presented tomorrow at the WWW2010 Conference in a session dedicated to User Models on the Web.
Google and Yahoo on Faster Web Pages
Earlier this month, Google announced that they would start considering the speed of a site as one of the ranking signals that they use to rank pages in search results.
Yahoo published a patent filing last year that also described how they might use page load and page rendering times as ranking signals as well. I wrote a post soon after it was published, Does Page Load Time influence SEO? exploring how Yahoo and other search engines might look at different factors regarding the speed of pages, including the experience of users on web pages.
Google’s Matt Cutts wrote about the recent Google announcement, and provided some more details, telling us that it’s likely that less than 1 percent of queries would be affected by this change.