Teaching Computers to Read Newspapers: How a Search Engine Might Use OCR to Index Complex Printed Pages

Optical Character Recognition, or OCR, is a technology that lets a computer look at pictures that include text and translate those visual representations into actual text. If you have words within images on your web pages, there’s a good chance that search engines ignore those words when indexing your pages.

But that might change sometime in the future.

While OCR has been around for a while, search engines haven’t been using the technology when crawling and indexing the content of Web pages. Google’s webmaster guidelines tell us:

Try to use text instead of images to display important names, content, or links. The Google crawler doesn’t recognize text contained in images. If you must use images for textual content, consider using the “ALT” attribute to include a few words of descriptive text.
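As a rough illustration of that guideline, here is a minimal sketch of how a site owner might check a page for images that carry no ALT text. It uses only Python's standard library; the class name, function name, and sample markup are hypothetical, and this is not how any search engine crawler is known to work.

```python
from html.parser import HTMLParser


class ImageAltAudit(HTMLParser):
    """Collect the src of every <img> whose alt text is missing or empty."""

    def __init__(self):
        super().__init__()
        self.missing_alt = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        attr_map = dict(attrs)
        # An attribute written without a value is reported as None.
        alt = (attr_map.get("alt") or "").strip()
        if not alt:
            self.missing_alt.append(attr_map.get("src") or "(no src)")


def images_missing_alt(html):
    """Return the src values of images that have no usable alt text."""
    audit = ImageAltAudit()
    audit.feed(html)
    audit.close()
    return audit.missing_alt


# Hypothetical page fragment used only for illustration.
sample = """
<h1><img src="logo.png"></h1>
<img src="sale-banner.png" alt="">
<img src="map.png" alt="Map to our Main Street store">
"""

print(images_missing_alt(sample))  # ['logo.png', 'sale-banner.png']
```

Any image this flags is one whose words a text-only crawler would likely miss.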

Yahoo’s page, How to Improve the Position of Your Website in Yahoo! Search Results, provides the following tip:

Continue reading Teaching Computers to Read Newspapers: How a Search Engine Might Use OCR to Index Complex Printed Pages

How Demand Media May Target Keywords for Profitability

Earlier this month I wrote about a granted Google patent, and a continuation of that patent filed earlier this year, that describe How Google Might Suggest Topics for You to Write About, by providing information to web publishers on queries and topics that are either under-represented in search results or where there’s more demand for information about those topics or queries than there are search results to meet that demand.

The topic struck home with a number of people, especially journalists, and I had a chance to have a conversation with Financial Times (FT.com) reporter Kenneth Li about Google’s patents. The Financial Times ran two different stories on the topic (Google shadow over new media groups, and Google eyes Demand Media’s way with words), focusing primarily on how the technology involved in the patents could bring Google into competition with companies such as Demand Media, Associated Content, and AOL.

While searching through patent filings this morning, I came across an interesting newly published patent application from Demand Media. In the FT.com article on Demand Media, we’re told that:

Continue reading How Demand Media May Target Keywords for Profitability

Google as an Internet Archive?

Interested in what people were saying the day after Barack Obama was elected president in 2008? Or how people reacted on the Web to the Chicago White Sox winning the World Series in 2005? Or the early news on the Gulf oil spill on April 20, 2010?

When you search at Google, you can click on “more search tools” in the left column, and enter a “from” and “to” date in the custom range section. If you want to see what pages were showing up on Google on a search for Barack Obama on the day after the election, you can enter 11/4/2008 in the from and to fields. To see what pages were ranking on Google on the day after the White Sox series ended, enter 10/28/2005 into the date range text boxes.

A custom date range search at Google for Barack Obama on November 4, 2008.

If you click on any of the results that appear, you see versions of pages listed in the results as they appear today. If you click on the Google cache links for those entries, you see the most recent cached versions of those pages. But, what if you saw a copy of the page as it appeared within the date range selected? What if Google decided that it would create an archive of the Web, where it showed older copies of web pages, and used the custom date range to help you find those pages?

Continue reading Google as an Internet Archive?

Another 10 Ways Search Engines May Rerank Search Results

When a search engine shows you results for a search, the pages shown are likely in order based upon a mix of relevance and importance.

But a search engine doesn’t usually stop there. It may look at other things to filter and reorder search results.

In 2006, I wrote 20 Ways Search Engines May Rerank Search Results, which described a number of ways that search engines may rerank pages. I followed that up in 2007 with 20 More Ways that Search Engines May Rerank Search Results.

I decided that it was time for a sequel or two in this series. I came up with another 25 reranking methods, but decided to stop at 10 in this post.

Many of the following are described in patents, and some of those patents were originally filed years ago – prehistoric times in Web years. The search engines may have incorporated ideas from those patents into what they are doing now, adopted those methods and since moved on to something new, or put them in a filing cabinet somewhere and forgotten about them (I’d like the key to that filing cabinet).

Continue reading Another 10 Ways Search Engines May Rerank Search Results

How Google Might Suggest Topics for You to Write About

If a search engine suggested topics for you to write about because those topics weren’t represented well in its search results, would you write about them? Would a search engine take that step?

A Google patent application published this week explores that topic as well as describing some approaches that they might use to gauge the quality of their search results.

The patent is a continuation of a patent granted to Google in February of this year, Identifying inadequate search content, and is interesting for a number of reasons such as Google’s Chief Economist (Hal Varian) and the head of Google’s Webspam Team (Matt Cutts) being amongst the listed inventors.

The newer version was filed shortly before the original patent was granted. The claims sections of the two present the invention in somewhat different ways, but the descriptions in both patent filings are very similar.

Both explore how the search engine might use statistics associated with queries, and a review of how relevant and important the search results are for those queries to determine the quality and quantity of search results that appear for them. I did notice an addition dealing with results for queries in more than one language added to the new version. The patent application is:

Continue reading How Google Might Suggest Topics for You to Write About
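To make the general idea concrete, here is a minimal sketch of flagging queries where demand outstrips good results. It is not the method claimed in the patent filings: the `QueryStats` fields, the 0-to-1 result scores, and all thresholds are assumptions chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class QueryStats:
    """Hypothetical per-query statistics; none of these fields come from the patent."""
    query: str
    monthly_searches: int
    # Assumed combined relevance/importance score (0 to 1) for each top result.
    result_scores: list = field(default_factory=list)


def underserved_queries(stats, good_score=0.6, min_good_results=3, min_searches=1000):
    """Return (query, searches, good_result_count) for in-demand queries with few good results."""
    flagged = []
    for s in stats:
        good = sum(1 for score in s.result_scores if score >= good_score)
        if s.monthly_searches >= min_searches and good < min_good_results:
            flagged.append((s.query, s.monthly_searches, good))
    # Show the queries with the most unmet demand first.
    flagged.sort(key=lambda item: item[1], reverse=True)
    return flagged


# Made-up numbers used only for illustration.
example_stats = [
    QueryStats("repair vintage typewriter ribbon", 4000, [0.7, 0.3, 0.2]),
    QueryStats("world cup", 900000, [0.9, 0.9, 0.8, 0.8]),
    QueryStats("obscure query", 20, [0.1]),
]

print(underserved_queries(example_stats))
# [('repair vintage typewriter ribbon', 4000, 1)]
```

A publisher-facing tool built on something like this would surface the flagged queries as suggested topics. How a real search engine would weigh quality against quantity is not something this sketch tries to answer.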

How a Search Engine Might Rerank Search Results Based upon Time-Based Data in Query Logs

If you search at Yahoo for the phrase “world cup” (without the quotation marks), chances are good that the search engine will show you mostly pages about the 2010 World Cup, even though the tournament is held every 4 years and there may be many pages relevant for the phrase that don’t focus specifically upon a particular year.

How likely is it that when someone searches for “world cup,” they are looking for information about the upcoming tournament, taking place in South Africa between June 11th and July 11th, 2010? On the other hand, how likely might it be that they want to find information about the World Cup held in 2006? Or just general pages about the sporting event?

If I told you that the search engine was likely reordering those search results based upon time-based data, would it surprise you? Would you expect Yahoo or Google or Bing to rerank search results in a manner like this when a query has some temporal aspect to it, such as a search for the Olympics, or the World Series, or the World Cup?

It’s quite possible that a search engine would look through its query logs, and see if a particular query is often included in more specific searches that include some kind of temporal data such as a year, or month, or day or time of day, and rewrite a searcher’s query to include that time-based information. A recent Yahoo patent application explains one fairly simple approach towards showing such information. The patent application is:

Continue reading How a Search Engine Might Rerank Search Results Based upon Time-Based Data in Query Logs
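The query-log idea in the excerpt above can be sketched in a few lines. This is a simplified illustration, not the approach in the Yahoo patent application: it looks only at four-digit years, and the function name, the `min_share` threshold, and the sample log are all assumptions.

```python
import re
from collections import Counter

# Four-digit years from 1900 to 2099; other temporal data (months, days) is ignored here.
YEAR = re.compile(r"\b(?:19|20)\d{2}\b")


def suggest_temporal_rewrite(query, query_log, min_share=0.5):
    """Append the year most often searched alongside `query`, if it is common enough."""
    base = query.lower().strip()
    if YEAR.search(base):
        return query  # The searcher already supplied a year.

    year_counts = Counter()
    total = 0
    for logged in query_log:
        text = logged.lower().strip()
        if base not in text:
            continue
        total += 1
        # Look for a year in whatever surrounds the base query.
        match = YEAR.search(text.replace(base, " "))
        if match:
            year_counts[match.group()] += 1

    if not year_counts:
        return query
    year, count = year_counts.most_common(1)[0]
    # Rewrite only when enough of the matching logged queries carry that year.
    if count / total >= min_share:
        return f"{query} {year}"
    return query


# Made-up query log used only for illustration.
sample_log = [
    "world cup 2010 schedule",
    "world cup 2010",
    "2010 world cup tickets",
    "world cup",
    "world cup 2006 final",
    "world series",
]

print(suggest_temporal_rewrite("world cup", sample_log))     # world cup 2010
print(suggest_temporal_rewrite("world series", sample_log))  # world series
```

In the sample log, three of the five logged queries containing "world cup" include 2010, so the bare query is rewritten. A real system would presumably also weigh how recent each logged query is, which this sketch ignores.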

How a Search Engine Might Identify Possible Query Suggestions

Like the information architects who organize the content on websites, search engine designers should aspire to provide users with scent at every step of their information-seeking process. Techniques like query suggestions, faceted search and results clustering all offer users the opportunity to make progress on their next step, rather than always having to restart the information-seeking process from scratch. Indeed, faceted search is a popular technique for offering users such guidance.

While users are ultimately responsible for expressing their information needs, it is the search engine’s job to act like a reference librarian and help the users in this process.

Reconsidering Relevance and Embracing Interaction
by Daniel Tunkelang

Continue reading How a Search Engine Might Identify Possible Query Suggestions

Google Studies How Search Behavior Changes When Searchers Are Faced with Difficult Questions

A paper by Google researchers Anne Aula, Rehan M. Khan and Zhiwei Guan published last month asks the question How does Search Behavior Change as Search Becomes More Difficult? (pdf)

The paper describes two studies in which participants were given informational tasks to perform – a mix of hard and easy questions – to see if searchers adopted different strategies when they were faced with questions that had definite answers, but where those answers might be difficult to find. An example of one of the difficult tasks (can you find the answer?):

You once heard that the Dave Matthews Band owns a studio in Virginia but you don’t know the name of it. The studio is located outside of Charlottesville and it’s in the mountains. What is the name of the studio?

The first study had 23 people perform searches to find answers to questions like the one above, and examined the searches they performed and the pages they visited to see how they went about finding answers. The second study expanded to 179 searchers, and based some of its processes on things learned from the first experiment. A general conclusion from the second study:

Continue reading Google Studies How Search Behavior Changes When Searchers Are Faced with Difficult Questions

Learn SEO Directly from the Search Engines