One of the standard rules of search engine optimization that’s been around for a long time is that “search engines are not reading text in images.” What if that changed?
How easy or difficult is it for a search engine to recognize text within digital images and video, and index that text?
Three new Google patent applications explore how Google might start reading text in images and describe some types of images that Google might try to do that with.
Capturing Text from Street View Images
These patent filings don’t address text found within headings and logos, but rather much more complex pictures, including street scenes of the kind that might be taken, for instance, when filming streets for something like Google’s Street Views (video).
They also discuss the use of picture taking robots inside stores and museums.
The documents lay out some of the obstacles faced in reading text in images like those:
The text within images can be difficult to automatically identify and recognize due both to problems with image quality and environmental factors associated with the image. Low image quality is produced, for example, by low resolution, image distortions, and compression artifacts.
Environmental factors include, for example, text distance and size, shadowing and other contrast effects, foreground obstructions, and effects caused by inclement weather.
Recognizing text in images
Invented by Luc Vincent and Adrian Ulges
US Patent Application 20080002893
Published January 3, 2008
Filed June 29, 2006
Methods, systems, and apparatus including computer program products for recognizing text in images are provided. In one implementation, a computer-implemented method for recognizing text in an image is provided. The method includes receiving a plurality of images.
The method also includes processing the images to detect a corresponding set of regions of the images, each image having a region corresponding to each other image region, as potentially containing text. The method further includes combining the regions to generate an enhanced region image and performing optical character recognition on the enhanced region image.
These other two patent filings also cover aspects involving reading text in images:
The patent applications describe in detail how images might be processed to make it easier to identify and extract text within many different types of images.
An example of how this process might be used is that in an urban scene, text recognition could be used to “identify such things as building addresses, street signs, business names, restaurant menus, and hours of operation.”
Images from Digital Cameras and Video Recordings
Images used might include those captured from conventional digital cameras or video recording devices. Those pictures might include panoramic images, still images, or frames of digital video. A system for capturing some of the images might also incorporate the use of three-dimensional ranging data as well as location information.
That sounds like some information that might be captured when pictures are taken for a project like Google’s Street Views program.
Associating Locations with Images
A panoramic image of a street scene might capture more than one street address, like a city block, or a string of locations on a street. This might be done using a moving camera. Locations could be associated with those images using GPS coordinates.
While there are other options presented to collect GPS information, here’s a description of how it might be determined for something like the Street Views project:
Additionally, exact GPS coordinates of every image or vertical line in an image can be determined.
For example, a differential GPS antenna on a moving vehicle can be employed, along with wheel speed sensors, inertial measurement unit, and other sensors, which together allow a very accurate GPS coordinate to be computed for each image or portions of the image.
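To make the idea of a coordinate for every image concrete, here is a minimal sketch in Python. It is not the method from the patent filings, which describe combining differential GPS with wheel speed sensors and an inertial measurement unit. It only assumes that each image has a capture timestamp and that GPS fixes arrive at known times, then interpolates linearly between the two nearest fixes. The function name and data layout are hypothetical.

```python
from bisect import bisect_left


def interpolate_position(fixes, t):
    """Estimate (lat, lon) at time t from timestamped GPS fixes.

    fixes: list of (timestamp, lat, lon) tuples sorted by timestamp.
    This is plain linear interpolation, an assumption for illustration,
    not the sensor fusion described in the patent filings.
    """
    if not fixes:
        raise ValueError("at least one GPS fix is required")
    times = [f[0] for f in fixes]
    # Clamp to the first or last fix when t falls outside the recorded range.
    if t <= times[0]:
        return fixes[0][1:]
    if t >= times[-1]:
        return fixes[-1][1:]
    # Find the pair of fixes that bracket t.
    i = bisect_left(times, t)
    t0, lat0, lon0 = fixes[i - 1]
    t1, lat1, lon1 = fixes[i]
    w = (t - t0) / (t1 - t0)
    return (lat0 + w * (lat1 - lat0), lon0 + w * (lon1 - lon0))


if __name__ == "__main__":
    gps_fixes = [(0.0, 10.0, 20.0), (10.0, 11.0, 22.0)]
    # An image captured halfway between the two fixes.
    print(interpolate_position(gps_fixes, 5.0))
```

A real system would work in a projected coordinate system and weigh in the other sensors, but the sketch shows why a steady stream of fixes is enough to tag individual frames.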
The text detection and classification (text versus non-text) process is also presented in some detail within the patent filings. Part of that process involves comparing patterns across images that resemble each other. So, known city street scenes are looked at when classifying other city street scenes to try to determine if text appears within those images.
The three-dimensional range information might also help detect false positives, when this system believes that it has identified an area that may contain text, and it hasn’t.
Solving Problems Reading Text
Some of the problems involved with characters that are difficult to read, because of small size or blurriness or distortions or other problems, might be solved by looking at more than one image, from pictures that are slightly offset from each other:
For example, a high-speed camera taking images as a machine (e.g., a motorized vehicle) can traverse a street perpendicular to the target structures. The high-speed camera can, therefore, capture a sequence of images slightly offset from each previous image according to the motion of the camera.
Thus, by having multiple versions of a candidate text region, the resolution of the candidate text region can be improved using the superresolution process.
Additionally, a candidate text region that is partially obstructed from one camera position may reveal the obstructed text from a different camera position (e.g., text partially obscured by a tree branch from one camera position may be clear from another).
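The benefit of having several views of the same sign can be illustrated with a small Python sketch. To be clear, this is not superresolution, and it is not taken from the patent filings. It is a simpler stand-in, a per-pixel median across region crops that are assumed to be already aligned and equally sized. The function name and the list-of-lists grayscale format are assumptions made for the example.

```python
from statistics import median


def fuse_regions(regions):
    """Combine aligned grayscale crops of one candidate text region.

    regions: list of 2D lists (rows of pixel intensities), all the same
    size and assumed to be already aligned with each other.
    Returns a single 2D list holding the per-pixel median, so a value
    that is wrong in only one view (a branch, a glare spot) is outvoted.
    """
    if not regions:
        raise ValueError("at least one region is required")
    height = len(regions[0])
    width = len(regions[0][0])
    for region in regions:
        if len(region) != height or any(len(row) != width for row in region):
            raise ValueError("all regions must have the same dimensions")
    return [
        [median(region[y][x] for region in regions) for x in range(width)]
        for y in range(height)
    ]


if __name__ == "__main__":
    # Three views of a one-row region; the middle pixel is blocked (0)
    # in the second view only.
    views = [
        [[10, 200, 30]],
        [[12, 0, 30]],
        [[11, 205, 31]],
    ]
    print(fuse_regions(views))
```

The fused result could then be passed to character recognition in place of any single frame.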
While the text recognition part of this process will try to use variations of character recognition, they may also try to find certain specific business names that are kept in a database, such as McDonald’s, Fry’s Electronics, H&R Block, and Pizza Hut.
They could also try to find the text from images at certain locations by looking at information from places like Yellow Pages listings.
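Checking noisy recognition output against a list of known names is easy to sketch. The Python below uses the standard library's difflib for fuzzy matching. The name list, the cutoff value, and the function name are assumptions chosen for illustration, and the patent filings do not say which matching technique would actually be used.

```python
from difflib import get_close_matches

# A tiny stand-in for a business name database (hypothetical contents).
KNOWN_NAMES = ["McDonald's", "Fry's Electronics", "H&R Block", "Pizza Hut"]


def match_business_name(ocr_text, known_names=KNOWN_NAMES, cutoff=0.8):
    """Return the known name closest to the OCR output, or None.

    Comparison is case-insensitive. cutoff is the minimum similarity
    ratio (0 to 1) that difflib must report for a match to count.
    """
    lookup = {name.lower(): name for name in known_names}
    hits = get_close_matches(ocr_text.lower(), list(lookup), n=1, cutoff=cutoff)
    return lookup[hits[0]] if hits else None


if __name__ == "__main__":
    # A lowercase "l" misread as the digit "1", and a "u" misread as "a".
    print(match_business_name("McDona1d's"))
    print(match_business_name("Pizza Hat"))
    print(match_business_name("zzzzqqq"))
```

The same approach would work with a product name database or with Yellow Pages listings for a known location, which narrows the candidate list and makes a correct match more likely.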
Some Specific Applications Under this Process
Where this gets interesting is in the descriptions of some of the ways text recognition and extraction from images might be used by the search engine, including the use of robots within stores and museums.
Image search – The text taken from images can be indexed and associated with the image. That can then be used in different search results applications like image search, and mapping, or other applications.
Images are Associated with a Mapping Program – Extracted text from street scene images can be indexed and associated with a mapping application. People can then search for a location by business name, address, store hours, or other keywords.
The mapping application can also retrieve images matching the user’s search – like looking for a McDonald’s in a particular city or near a particular address – the mapping program would create a map showing the location of the McDonald’s as well as a picture of the restaurant.
Images Near Specific Locations are Associated with Each Other – Since the images are associated with location data, the mapping program can provide images of other businesses near a searched location, and show their locations on a map.
Images of Similar Businesses Presented as Alternatives – Images of businesses that offer similar goods or services may be presented to the searcher as alternatives. So, a search for McDonald’s might show other nearby fast food joints.
Advertisements Shown with Images – Advertisements can be presented along with images. When a business is shown in an image, an ad for the business may also be shown. Or, ads for alternative businesses could be displayed. And ads can be shown for products associated with the business.
Google Interior Images – While this patent filing describes many images from street scenes, this indexing can be applied to other image sets. One of the more interesting sections of this patent application:
In one implementation, a store (e.g., a grocery store or hardware store) is indexed. Images of items within the store are captured, for example, using a small motorized vehicle or robot. The aisles of the store are traversed and images of products are captured similarly as discussed above.
Additionally, as discussed above, location information is associated with each image. Text is extracted from the product images. In particular, extracted text can be filtered using a product name database to focus character recognition results on product names.
Kudos to Sandra Niehaus, and her post, Google Interiors – the day my house became searchable. I know your post was satire, Sandra, but a good call. 🙂
Searching Museums for Images – Images associated with museums could also be indexed. Many museums include text displays associated with exhibits, artifacts, and objects. Images of museum items and their associated text displays can be captured using a process like that involved in indexing a store.
Location information can be associated with each captured image. Museums can be searched, or browsed, to learn about the various objects.
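The applications above share one building block: text pulled from an image is indexed along with where the image was taken. A minimal Python sketch of that idea follows. The class name, method names, and sample data are all hypothetical, and a production index would handle tokenization, ranking, and geographic queries far more carefully.

```python
from collections import defaultdict


class SignIndex:
    """Toy inverted index from extracted words to geotagged images."""

    def __init__(self):
        self._postings = defaultdict(set)  # word -> set of image ids
        self._images = {}                  # image id -> (lat, lon)

    def add(self, image_id, text, lat, lon):
        """Index the text extracted from one image at a known location."""
        self._images[image_id] = (lat, lon)
        for token in text.lower().split():
            self._postings[token].add(image_id)

    def search(self, query):
        """Return (image_id, (lat, lon)) for images containing every query word."""
        tokens = query.lower().split()
        if not tokens:
            return []
        ids = set.intersection(*(self._postings.get(t, set()) for t in tokens))
        return sorted((image_id, self._images[image_id]) for image_id in ids)


if __name__ == "__main__":
    index = SignIndex()
    index.add("img1", "Pizza Hut open 11am", 37.0, -122.0)
    index.add("img2", "H&R Block", 37.1, -122.1)
    print(index.search("pizza hut"))
```

A mapping program could plot the returned coordinates and show the matching picture beside them, and the same structure would serve store aisles or museum rooms if the location were an aisle or gallery instead of a GPS coordinate.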
31 thoughts on “Google on Reading Text in Images from Street Views, Stores and Museums”
Thanks. Those are some interesting suggestions with the cell phone camera search for local listings, kiosks in museums, and information from the museums in a customized Google magazine.
There’s a lot of potentially useful crossover that could develop between these technologies.
Looking forward to some good stuff from you this year.
Like I was saying this morning, it seemed interesting when I scanned it. It is that and more. I can see a variety of usages/implications.
So I take my cell phone, snap the intersection and voila! Local listings… he he…. How about we go back to the whole kiosk thang and put that in the museum with integrated internal museum search (and map), add on a print out of your fav exhibits (and accompanying info/links) via PDF in your inbox (a la Google Magazine)… man, so many angles to play with….
Anyway, late again (as always).. bagged. Thanks for the post, I sure get lazy with you around… make my life easy… I have been too busy ranting to start the year, time to settle in and make some more useful posts :0)
You’re welcome. I don’t think that Google will ever stop making a distinction between text and images. There is a curious statement in the patent applications about ranking text in images:
It seems to be saying that text within an image that may be somewhat relevant to text within a caption for the image, or text around the image, may result in an image ranking more highly for similar terms.
But, text from an image that may indicate that the image doesn’t relate well to a caption or text surrounding the image could lead to a lower ranking.
Google has shown us that they can at least determine whether an image contains a face or not. 🙂
Some of the technology that they acquired from Neven Engineering was facial recognition software. Some of the other technology that they acquired in that deal could possibly be fueling some technology that might be similar to what we find described in these patent applications.
We do live in interesting times. 🙂
These comments regarding ‘Searching Museums for Images’ is insane. That would be the same as taking photographs of the phonebook. I’m sure google has a better strategy. because cell phone images of art gallery installations aren’t something that id like to see or most galleries would like to be represented by.
anyone have more insight into google strategies regarding museums? i’d be very interested to hear more.
I found the references to indexing art objects and the text displays that might be associated with them to be unusual, too. But I could envision some museums being interested in efforts that might draw more visitors through their gallery doors.
If I could sit at my desktop computer, and spend a few hours looking at the artifacts and display information from the Louvre, or MOMA, I likely would. I’d imagine that schools might find that enticing, too – especially ones where a field trip is unlikely, but being able to see the art would be attractive.
These images described in the patent filings wouldn’t be limited to being displayed upon cell phones, and could be used on desktop computers in local search, product search, Web search, and image search, and possibly in other offerings from Google.
I’ve been really curious about image recognition for a while now; thanks for putting some of these possibilities on my radar. It’s amazing how much potential there is. How long do you think it will be until google stops making a distinction between jpegs and text altogether?
On an extreme tangent – I remember several years back reading an interesting prediction by Steve Forte about how image recognition would make card counting (or ace prediction) impossible in Vegas casinos. In theory, detecting card counters could be an automated process; as could taking their picture and sending it to every other casino on the strip (all of which are equipped with facial recognition). I wonder if it came true or not.
Fascinating and clearly if face recognition works, as it seems to do, text recognition is not an order of magnitude more difficult. Scary implications but that is inevitable unfortunately.
Hi Search Engines Web
I’m not sure that we will need to wait for the next decade to see Google start reading text within images on the Web. But we’ll see.
A little shy of 5 million page views from a little more than 3 million visits over 2007.
Of social site referrals, stumbleupon holds a vast lead over any others. A good percentage of visits were by type-in or bookmark. Top search term in volume was Google Acquisitions.
The Google of the next decade will probably take this one step further.
It will probably read the text logos or photoshopped text headers or even the flash ‘text images’ – thus, allowing those elements to influence the SERPs by bringing them up during keyword search queries if the ALT tag is not used.
On another note:
Please share your stats for 2007. What keywords and social sites are bringing in the most traffic
thanks for the shout out to my Google Interiors post. I’ve appreciated your blog for quite a while, so it’s an honor.
There are so many reasons for search engines (and other related technologies) to continue improving recognition of all types of data. One reason is that people are notoriously bad at accurately tagging and categorizing things – one of the hurdles facing development of the semantic web – so the thinking goes that ‘objective’ software needs to perform this in our place. Software is already able to pull words from audio – and understand it – locations and meaning from photos and video, etc.
The potential is exciting and chilling all at once. Take the judgment that people are ‘bad’ at tagging things. Is being inconsistent and inaccurate really a failing? or is it simply human? And how much humanity are we willing to cede in search of information?
On the exciting side, for a mind-blowing look at software that’s able to join unrelated photos by position, correlate all the data they contain, and make the result interactive, see this recent TED talk:
I think this talk shows the potential for using ‘socially-generated’ images to gather massive amounts of data – text and otherwise – making the museum scenario very possible. And you thought Flickr was benign 😉
Makes you wonder why Google hasn’t implemented these sorts of technologies LONG ago on Web images. They seem a whole lot easier to read than text in photographs/video. Meanwhile, the whole Web has to revolve around Google’s capabilities.
Thank you, Sandra.
I really enjoyed your Google Interiors post, and it immediately sprang to mind when I was reading through those patent filings.
One of the issues with tagging, as I see it, is that people don’t always use tags to describe what they watch or hear or read, but rather to describe their relationship with the object of their attention. That’s partially why we see so many “toread” tags. 🙂
I agree with you on the limitations of objectivity in automated tagging, and the loss of humanity.
The TED presentation was intriguing. Thanks for the link. The potential that exists in how we can capture information, and what we might do with it, is amazing.
Appreciate your stopping by, and leaving your thoughts.
I’m wondering how much testing Google has done of reading image text on the Web, and how much more or less relevant it might make search results if they could. It would be interesting to experiment with.
we already have database/file management/note taking software that can read text from images.. should be cake for google within the next couple years. bad for people who try to get around having the words cut n pasted on the fly by random surfers, that’s about it.
I know that I’ve struggled with using some of the Optical Character Recognition (OCR) software programs that have been released over the years, because of less than ideal copies of documents, or unusual fonts. I know that the technology behind interpreting text in images is becoming better, but the idea of moving from documents to reading signs and labels and other words found on the streets seems like a pretty impressive and challenging endeavor. I suspect that as Google gets better at doing that, the challenges of understanding text in images on the Web will become much easier for them.