Teaching Computers to Read Newspapers: How a Search Engine Might Use OCR to Index Complex Printed Pages

Optical Character Recognition, or OCR, is a technology that enables a computer to look at pictures that include text and translate those visual representations into actual text. If you have words within images on your web pages, there’s a good chance that search engines are ignoring those words when indexing your pages.

But that might change sometime in the future.

While OCR has been around for a while, search engines haven’t been using the technology when crawling and indexing the content of Web pages. Google’s webmaster guidelines tell us:

Try to use text instead of images to display important names, content, or links. The Google crawler doesn’t recognize text contained in images. If you must use images for textual content, consider using the “ALT” attribute to include a few words of descriptive text.

Yahoo’s page, How to Improve the Position of Your Website in Yahoo! Search Results, provides the following tip:

Keep relevant text and links in HTML. Encoding your text in graphics or image maps can prevent search engines from finding the text or following links to your website’s other pages.

The Bing Webmaster Central pages give this warning:

Don’t put the text that you want indexed within images. For example, if you want your company name or address to be indexed, make sure it isn’t displayed only inside an image of your company logo.

While search engines may not use OCR to index the content of web pages now, that doesn’t mean they won’t in the future, and there are some indications that the search engines are becoming much more proficient with optical character recognition.

For example, the Google Books Library Project involves scanning a very large number of printed books and periodicals, and developing the scanning technology needed for a project of that magnitude. A Google patent filing from a few years ago hints that Google might use OCR to look at the text in some images on web pages to reject some advertisements. Another Google patent filing describes how the search engine might use OCR with StreetViews videos to improve business address location information.

One limitation of using OCR in indexing information is that it works best with fairly simple documents and printed materials, and less well with documents that have complex formatting, such as newspapers. Newspapers often include multiple columns of text, headlines, images with captions, varieties of font sizes and types, and other challenges that can make indexing that kind of content difficult.

An image from the Google patent filing showing different headlines and body text blocks on a newspaper

Articles in newspapers are also often continued on other pages, and those storylines may be continued in orphan text blocks which may need to be associated with the earlier pages.

Another image from the Google patent filing showing orphan text blocks continuing stories from other pages

A Google patent filing from February describes how they might meet some of the challenges involved in understanding the layouts of complex documents like newspapers when using OCR to “read” those documents.

The patent application is:

Segmenting Printed Media Pages Into Articles
Invented by Ankur Jain, Vivek Sahasranaman, Shobhit Saxena, and Krishnendu Chaudhury
Assigned to Google Inc.
US Patent Application 20100040287
Published February 18, 2010
Filed: August 13, 2008

Abstract

Methods and systems for segmenting printed media pages into individual articles quickly and efficiently. A printed media based image that may include a variety of columns, headlines, images, and text is input into the system which comprises a block segmenter and a article segmenter system.

The block segmenter identifies and produces blocks of textual content from a printed media image while the article segmenter system determines which blocks of textual content belong to one or more articles in the printed media image based on a classifier algorithm.

A method for segmenting printed media pages into individual articles is also presented.
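To make the idea of a "block segmenter" a little more concrete, here is a minimal sketch of one classic way to find candidate column blocks: look for vertical strips of the page that contain no ink at all (a projection profile). This is a stand-in technique chosen for illustration, not the method claimed in the patent filing. The toy page format ('#' for ink, a space for background) and the names find_gutters, split_columns, and min_width are assumptions made up for this example.

```python
from typing import List, Tuple


def find_gutters(page: List[str], min_width: int = 2) -> List[Tuple[int, int]]:
    """Return (start, end) column ranges that are blank in every row.

    `page` is a toy stand-in for a binarized scan: each string is one row,
    '#' is ink and ' ' is background. `min_width` is a hypothetical
    threshold for how wide a blank strip must be to count as a gutter.
    """
    width = max(len(row) for row in page)
    # Vertical projection profile: how many rows have ink in each column.
    profile = [
        sum(1 for row in page if x < len(row) and row[x] == "#")
        for x in range(width)
    ]
    gutters = []
    start = None
    for x, ink in enumerate(profile):
        if ink == 0 and start is None:
            start = x  # a blank run begins
        elif ink != 0 and start is not None:
            # A blank run ends; keep it only if it is wide enough and
            # is not simply the left page margin.
            if x - start >= min_width and start > 0:
                gutters.append((start, x))
            start = None
    # A blank run that reaches the right edge is treated as margin and ignored.
    return gutters


def split_columns(page: List[str], min_width: int = 2) -> List[Tuple[int, int]]:
    """Split the page into (left, right) column ranges separated by gutters."""
    width = max(len(row) for row in page)
    columns = []
    left = 0
    for g_start, g_end in find_gutters(page, min_width):
        columns.append((left, g_start))
        left = g_end
    columns.append((left, width))
    return columns


page = [
    "####  ####",
    "###   ## #",
    "####  ####",
]
print(split_columns(page))  # [(0, 4), (6, 10)]
```

A real scan would be noisy, skewed, and interrupted by photos and rules, so a production system would need far more than this. The sketch only shows why the white space between columns is such a useful signal.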

While this patent application focuses upon using OCR for printed documents such as newspapers that have been scanned, that is only one example of how the processes described might be used. The patent does go into a fair amount of detail on how the different features and aspects of a newspaper page might be interpreted, looking at things such as:

  • Headlines,
  • Gutters alongside columns and above and below rows,
  • Separating lines,
  • Headline paragraphs and bodytext paragraphs,
  • Associating headlines with blocks of paragraphs,
  • Determining all blocks/paragraphs that fit into an article
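The last two items in that list, associating headlines with paragraphs and collecting all the blocks that belong to one article, can be sketched roughly as follows. The patent filing describes a classifier algorithm for this step; the hand-written rules below (a font size ratio to spot headlines, and "nearest headline above that overlaps horizontally" to assign body text) are only a simplified stand-in. The Block and Article classes, the function names, and the ratio value are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Block:
    """A rectangle of text on the page; y grows downward, as in image coordinates."""
    text: str
    x0: int
    y0: int
    x1: int
    y1: int
    font_size: float


@dataclass
class Article:
    headline: Block
    body: List[Block] = field(default_factory=list)


def is_headline(block: Block, body_font_size: float, ratio: float = 1.5) -> bool:
    # Assumed rule: headlines are set noticeably larger than body text.
    return block.font_size >= ratio * body_font_size


def horizontal_overlap(a: Block, b: Block) -> int:
    # Positive when the two blocks share some horizontal extent.
    return min(a.x1, b.x1) - max(a.x0, b.x0)


def segment_articles(
    blocks: List[Block], body_font_size: float
) -> Tuple[List[Article], List[Block]]:
    """Group body blocks under headlines; return (articles, orphan blocks)."""
    articles = [Article(b) for b in blocks if is_headline(b, body_font_size)]
    orphans = []
    for block in blocks:
        if is_headline(block, body_font_size):
            continue
        # Candidate headlines sit above the block and overlap it horizontally.
        candidates = [
            a for a in articles
            if a.headline.y1 <= block.y0
            and horizontal_overlap(a.headline, block) > 0
        ]
        if candidates:
            # Choose the closest headline above the block.
            nearest = max(candidates, key=lambda a: a.headline.y1)
            nearest.body.append(block)
        else:
            # No headline on this page: possibly a story continued from elsewhere.
            orphans.append(block)
    return articles, orphans


blocks = [
    Block("Council Approves Budget", 0, 0, 200, 30, 24),
    Block("First column of the story", 0, 40, 95, 300, 9),
    Block("Second column of the story", 105, 40, 200, 300, 9),
    Block("Local Team Wins", 210, 0, 400, 30, 24),
    Block("Game report", 210, 40, 400, 300, 9),
    Block("Text continued from another page", 410, 40, 500, 300, 9),
]
articles, orphans = segment_articles(blocks, body_font_size=9)
for article in articles:
    print(article.headline.text, "->", [b.text for b in article.body])
print("orphans:", [b.text for b in orphans])
```

The orphan list is where a fuller system would try to match blocks to stories that began on an earlier page, which is one of the harder problems the filing describes.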

Conclusion

What’s interesting about this patent filing isn’t so much the ability of the search engine to translate text within images into actual text, but rather how the search engine might handle complex pages which may contain many different articles and images, with some articles even continued on other pages, associate headings with body text, and segment those different articles so that they can be indexed separately.

Search engines might start reading and indexing text within images on Web pages sometime in the future, and they will likely run into complex images when they do so. The steps that search engines may take to be able to do so go beyond recognizing characters within images accurately, and a process like the one described in this patent filing brings them another step closer.

28 thoughts on “Teaching Computers to Read Newspapers: How a Search Engine Might Use OCR to Index Complex Printed Pages”

  1. OCR..amazing potential, ey? Loved your concluding lines mostly..Relevance factors and similarities..”rather how the search engine might handle complex pages which may contain many different articles and images, with some articles even continued on other pages, associate headings with body text, and segment those different articles so that they can be indexed separately.”

    I’m thinking..”Tony Taco Boy” name badge

    Taco Boy snatches elderly wallet on 4th and Main.

    Taco Boy disappears. Taco Boy changes clothes until next shift.

    G sends Bad Boys posse & cam to handcuff Taco Boy on spot.

    Thanks to OCR, U2 will have to start singing Where The Streets HAD No Name and revise the Joshua Tree Album. What. Is. Happening. Here!

  2. Pingback: Optical Character Recognition (OCR) : Learning SEO Basics
  3. Hmmm…I wonder if this OCR technology is based on a format similar to that which identifies characters on PDF files??? I would imagine that this technology is not new being that it is hard for even humans to decipher some of today’s Captcha images. Computers and scripts written by spammers are pretty good at reading them.

  4. Hi Mark,

    This patent filing isn’t so much about identifying different characters using OCR as it is about understanding the formatting of complex documents when using the technology, including ones that might have multiple articles, advertising, images, and more on the same pages, that are unrelated to each other. Imagine if Google were to be able to index each of those articles and ads and images separately from one another.

  5. Hi Kimberly,

    Thanks. There is a lot of potential in the future of OCR, and if the search engines start reading the text in images, it could change around the way many sites are ranked by the search engines.

    Your Tony Taco Boy example has me wondering if designers like Tommy Hilfiger might want to reconsider putting their names on their shirts anymore. “Tommy’s been up to no good again.” :)

  6. If the search engine will be able to read text within images then I guess using images with text would be popularized because people or readers love images or pictures. So, if you can put a main keyword inside an image..and be read by the search engine…I think that would really be great!

  7. Never heard of the OCR technology before, but I think there is already stuff like this on application. I remember having a conference from Google, and they were doing a demo of the android. It was unbelievable what they could do!

    1) During the demo, he took a picture of a business card of a random person with his phone. Android automatically recognized the telephone, address, fax, e-mail… and put everything in the contacts list of the phone. It only took one picture to fill in everything.

    2) He had a picture on the wall, a reproduction. He took a picture of it, and automatically Android asked their Cloud and recognized it as a painting from Kandinsky…. NUTS! He did the same thing with an interior picture of a church, which was recognized easily.

    Sure, the OCR is not really about the recognition itself, as we are pretty much there already. It’s more about how a technology can understand the layout of a newspaper, which can sometimes be tricky even for us.

  8. I’ve heard, and I think Matt Cutts has even said, that Google has no plans to incorporate OCR into the index in the future. Pretty interesting stuff, though.

  9. OCR is such an amazing example of the progression of technology. I believe one day it will be used for Search indexing to improve search results and I love how you showed the possibility.

  10. I’ve considered this to be one of the next major milestones in search engine technology. When Flash sites – not Flash + HTML – were being put up by the thousands and businesses were starting to wonder why their sites were not ranking or even being indexed, I think the SE’s realized they had to invest in OCR at some levels. So flash code and Adobe .pdf’s turned out to be not that hard to crack… it’s the stuff that will require almost artificial intelligence – like reading newspapers as you suggested Bill, that will require a big leap in technology. I think all this will land up being a legal battle more than anything else. Google and everyone else has a legal right to use public domain text for any purpose they like, but newspapers have a legal right to the content they manufacture and SE’s can only use a snippet generally. So although the concept is fascinating, the reality is that the newspapers will protect their content fiercely and sell it electronically to whomever is willing to pay. Particularly the archived content. When you find Google working on a patent for face scanning technology and they want to install scanners on every street corner in the world, please let us know.

  11. Interesting stuff, Bill!

    In some special cases I create text as graphic. It can be cases where I would like to put in text slightly off topic, and I don’t want this text to interfere with the overall keyword density on that particular page.

    From what you are describing in the future (if the principle is implemented) I will have to think twice before doing that :-)

  12. Hi Andrew,

    I think the main reason why search engines would want to be able to use OCR to read text in images wouldn’t be to encourage people to use it in that way, but rather to be able to capture information on pages where site owners don’t recognize that image-based text isn’t indexable presently.

    I tried to find the web site of a local building contractor last night, and when I finally found it, I found a site with a home page that couldn’t be indexed based upon its content, because it was just an image, with no actual text. It didn’t have a snippet in the search results, and instead of Google showing a page title, it showed the URL for the page. I’ve been wondering if they realize what they’ve done to themselves.

  13. Hi Philippe,

    There’s some pretty amazing stuff that can be done with mobile devices and the ability to recognize and understand text and objects and landmarks as well. There’s still a lot of work that needs to be done on some of the things that you describe, but there’s a reason that Google acquired Neven Vision a few years ago.

    Have you seen Google’s pages on Google Goggles?

  14. Hi John,

    I believe I’ve heard Matt say something similar, but when he makes statements like that, it’s often with some kind of qualifier. I’m sure that in most cases, Google would prefer that people use actual text to show important content on their pages rather than images of text. Working with text rather than pictures of text requires less processing power, less effort to crawl web pages, and less effort to make sure that the interpretation of the layout of text within an image is correct.

    As I mentioned above, it seems like Google is showing that they have some interest in developing solutions to specific problems with the use of OCR, such as better understanding the locations of buildings when using street views, scanning individual documents like books, and complex documents like newspapers. Being able to solve those problems effectively means that they are developing the technology to enable them to better scan images found on the Web.

    At what point do you say, “Hey, we’re good enough at using this technology in other ways, we should start using it on web pages?” I would expect that you would do a lot of testing to make sure that you can do it well, and do it in a manner that still encourages people to be more likely to choose actual text over text in images.

    Using that kind of technology on web pages can mean that you need more storage space, larger and faster indexing approaches, and processes for doing OCR and understanding formatting that also work very fast. Is Caffeine the kind of infrastructure update that can enable Google to start doing something like that, or will we be waiting to see if the next version of the Google File System (GFS3) might bring them there?

    And if Google does start, would it make sense for them to do so incrementally, like using OCR only for PDF files first, for instance? It might. Matt mentions that possibility in this interview from March:

    Matt Cutts Interviewed by Eric Enge

    From the interview:

    Matt Cutts: I don’t believe we can index password protected PDF files. Also, some PDF files are image based. There are, however, some situations in which we can actually run OCR on a PDF.

  15. Hi Matt,

    Thanks. I think that in many ways it will help how the web is indexed. There still are just too many sites that include important text in images only. It’s something that I’ve seen sites for well known brands do just as often as new businesses just starting out on the Web.

  16. Hi Per,

    Interesting question, and it raises some important issues.

    I’m not sure that you need to do something like that for purposes of overcoming keyword density – I don’t believe that search engines use keyword density as a ranking signal.

    But, if you have text on a page that might be considered off topic from the main points on the page, it might cause the search engine to classify the page a little differently than it otherwise would for a number of purposes, such as the choice of ads they might show in Adsense. It could also impact how a page might be viewed from a reranking approach like phrase-based indexing.

  17. Hi Mal,

    There are a lot of complex documents published on the Web that the people uploading them do want indexed, even though it might provide some serious technical challenges. In the case of newspapers publishing in that kind of format, there may be some copyright issues, but I would expect that if a search engine and a newspaper came up with either a licensing agreement, or some kind of profit sharing arrangement such as a payment of some kind for full access to a document that could be found through the search engine, we could start seeing something like the process described in the patent being used.

    Not sure if Google is working on a patent for facial recognition, but they did buy a company back in 2006 that already had at least one. See my post from back then:

    Google Acquires Neven Vision: Adding Object and Facial Recognition Mobile Technology

    As for scanners on every street corner, see my post:

    Google on Reading Text in Images from Street Views, Store Shelves, and Museum Interiors

  18. I think there is one more thing to take into consideration and that is processing the OCR. Google does not have unlimited resources and it would take a lot to process all the OCR docs out there.

  19. Hi Isaiah,

    Right – we do have to think about the impact of Google processing OCR.

    Of course, this is something that Google can do at any time rather than in response to a specific query from a searcher, which could help. It could also possibly be done at the time when the search engine captures a copy of a page to put in its cache, rather than independently, which could make the process a little more efficient.

    We do see Google becoming more efficient at crawling and indexing pages with Google Caffeine, which supposedly uses the second version of the Google File System (or GFS 2). I’ve seen a number of references to the development of GFS 3, so maybe there are more efficiencies in the future that might make processing of OCR on web-based images more likely.

  20. OCR?
    Why waste all the CPU cycles converting images to text when they have the original text sitting right there.
    Well for websites anyway. For their own things like street view, it would really help them get the addresses right if they could OCR the home numbering.

    Before the big G goes to the trouble of running OCR routines I suspect they will continue to tell people that images are information mediums.

    As for flash, I have seen flash presentations with the text contents shown in the code.

    best,
    Reg

  21. Oh, and I do not think Google uses keyword density either.
    Keyword use depends entirely on the content. You cannot set an “ideal” amount.
    Reg

  22. Hi Reg,

    We both agree on Keyword Density. This article describes a number of reasons why search engines wouldn’t be using such an approach – The Keyword Density of Non-Sense

    Google has invested a considerable amount of time and energy in OCR in their book scanning project, which includes magazines, newspapers, and other periodicals, and the use of OCR in streetviews could potentially be very useful.

    The question, as you present it, is whether or not Google would take the step of trying to understand text as it appears in images on the Web. I think I agree with you that it’s something they might hold off on for a good while.

  23. They have been discussing OCR in search engines for a while but I think they still have not incorporated it. Or have they? I haven’t seen anything up on

  24. Hi Scott,

    It’s hard to tell. Google has published a number of patents that describe how they may be using OCR for their book projects. They’ve published another that describes how they may be using it in Streetviews videos to improve the quality of Google Maps. The post above details how Google might use OCR in pdf and other digital documents. But I haven’t seen anything specifically that says that Google is going around the Web and reading text within images it finds on web pages.

  25. Thanks for your response, Bill. It seems that text in images still doesn’t really count for search engines.

  26. Hi Scott,

    I don’t think we should expect search engines to start OCRing text found on web pages for a while, but Google’s definitely working on developing an expertise in the technology with their Book Scanning project. There have been a number of patents and whitepapers that hint at the use of OCR in reading street signs for Google Street Views and Google Maps, as well as using it for Google’s visual search (aka Google Goggles).

    Given enough data from those other projects, the ability to quickly and accurately perform OCR for text in images on web pages may not be all too far away.

    That’s a good thing considering how frequently I see sites where important text, such as the addresses of small businesses, is displayed as an image.

  27. I fully agree, Bill. Having had to find some decent OCR software for a scanned document conversion a few months ago, I remain unconvinced that OCR is ready for primetime. I was confronted with numerous errors and had to proofread everything that wasn’t standard Times New Roman for spelling and interpretation mistakes. I don’t imagine Google would be game to risk getting sensitive information such as company phone numbers and addresses wrong; their reputation as an accurate search method would be compromised. Also, I think if they could do it now, they would, so their actions speak volumes about the current state of OCR.

  28. Hi Andy,

    When you have lots and lots of data to work with, and the incentive to get things right, processes like OCR do get better. Google’s book scanning project involves many millions of books in many different font types, and I suspect that at this point the level of sophistication that it possesses is well beyond what you might see in commercially available OCR software.

    The ability for people to take photographs of documents, and either have searches performed on the text or images in those documents, or if the documents are forms, to enable people to automate the process of filling out those forms are some of the kinds of features that an improved OCR might bring. Google acquired some patents from Exbiblio not long ago that point to those types of capabilities.

    Google also has published at least one patent that describes using text from StreetViews videos to better map the locations of actual businesses in their Maps system based upon using OCR.

    How much better might Google’s OCR possibly be than what you might be able to buy? Probably a lot better. Not sure that it’s ready for primetime yet, but that book project does have them processing a lot of text, in many different forms and shapes. Google’s image similarity algorithms have recently graduated from beta and been incorporated into Google’s image search, and a large part of OCR involves understanding written characters.
