Google on Automatic Video and Image Annotation


Added 6/20/2020 – This image annotation patent application was granted as a patent to Google on November 22, 2011 – Method and apparatus for automatically annotating images

How effectively can a search engine automatically create annotations for images and videos, so that they can be good responses to searchers’ queries? How much of that image annotation can be done without human intervention and review?

A newly published Google patent application explores the topic and comes up with a method of image annotation by comparison to similar images found on the Web, and the text surrounding those similar images.

Method and apparatus for automatically annotating images
Invented by Jay N. Yagnik
US Patent Application 20080021928
Published January 24, 2008
Filed July 24, 2006


One embodiment of the present invention provides a system that automatically performs image annotation. During operation, the system receives an image. Next, the system extracts image features from the image.

The system then identifies other images that have similar image features. The system next obtains text associated with the other images and identifies intersecting keywords in the obtained text. Finally, the system annotates the image with the intersecting keywords.

Problems with Image Annotation

Connections to the Web have become faster and faster, with many higher bandwidth options available to people. This has led to a large increase in the use of pictures and videos on web pages.

Many of those images don’t have accompanying text-based information, such as labels, captions, or titles, that can help describe the content of the images.

Search is predominantly text-based, with keyword searches being the common way for someone to look for something – even pictures. It can be difficult to search for images through a search engine. The creation of annotations for images such as a set of keywords or a caption can make those searches easier for people.

Traditional methods of annotating images tend to be manual, expensive, and labor-intensive.

There have been some other approaches to the automatic annotation of images, like the one described in Formulating Semantic Image Annotation as a Supervised Learning Problem (pdf). While that kind of approach can greatly reduce the manual effort involved in annotation, it still requires some human interaction and review.

An Approach to Automate Annotation of Images

In a nutshell, the annotation system would work as follows:

  1. An image is received
  2. Image features are extracted
  3. Other images that have similar features are identified
  4. Text associated with the other images is obtained
  5. Intersecting keywords in that text are identified
  6. The image annotation is made with those keywords
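
The six steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the method from the patent filing: image features are assumed to already be plain lists of numbers, similarity is measured with cosine similarity (the filing does not commit to one measure), and the names `annotate`, `keywords`, `STOPWORDS`, and `top_k` are hypothetical.

```python
import math
import re

# Hypothetical stopword list so that words like "the" never become annotations.
STOPWORDS = {"a", "an", "and", "at", "in", "of", "on", "the", "with"}


def cosine_similarity(a, b):
    """Cosine of the angle between two feature vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def keywords(text):
    """Lowercased words from the text, minus stopwords."""
    return {w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS}


def annotate(query_features, indexed_images, top_k=3):
    """indexed_images is a list of (feature_vector, associated_text) pairs.

    Steps 3-6: rank known images by similarity, take the text of the
    top_k most similar ones, and keep the keywords they all share.
    """
    ranked = sorted(
        indexed_images,
        key=lambda item: cosine_similarity(query_features, item[0]),
        reverse=True,
    )
    keyword_sets = [keywords(text) for _, text in ranked[:top_k]]
    if not keyword_sets:
        return []
    return sorted(set.intersection(*keyword_sets))


# Toy index: three bridge photos and one unrelated photo.
index = [
    ([0.90, 0.10, 0.00], "Golden Gate Bridge at sunset"),
    ([0.80, 0.20, 0.10], "The Golden Gate Bridge in fog"),
    ([0.85, 0.15, 0.05], "Walking on the Golden Gate bridge"),
    ([0.00, 0.10, 0.90], "A bowl of green apples"),
]
print(annotate([0.88, 0.12, 0.02], index))  # ['bridge', 'gate', 'golden']
```

Requiring a keyword to appear in the text of every similar image is the strictest reading of "intersecting"; a real system would probably need a softer threshold, since text around images on the Web tends to be noisy.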

A Google Image Annotation Process

At a more technical level, extracting image features might call for:

  1. Generating color histograms
  2. Generating orientation histograms
  3. Using a discrete cosine transform (DCT) technique
  4. Using a principal component analysis (PCA) technique; or,
  5. Using a Gabor wavelet technique.
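
To make the first two techniques in that list more concrete, here is a minimal sketch of a color histogram and an orientation histogram in plain Python. The patent filing names these techniques without spelling out bin counts or gradient operators, so everything specific here is an assumption: 4 bins per color channel, 8 orientation bins, central-difference gradients, and the function names themselves.

```python
import math


def color_histogram(pixels, bins_per_channel=4):
    """Normalized joint RGB histogram.

    pixels: iterable of (r, g, b) tuples with values in 0..255.
    Returns a list of bins_per_channel ** 3 fractions that sum to 1.
    """
    hist = [0.0] * (bins_per_channel ** 3)
    step = 256 / bins_per_channel
    count = 0
    for r, g, b in pixels:
        ri, gi, bi = (int(c // step) for c in (r, g, b))
        hist[(ri * bins_per_channel + gi) * bins_per_channel + bi] += 1
        count += 1
    return [h / count for h in hist] if count else hist


def orientation_histogram(gray, bins=8):
    """Histogram of gradient directions, weighted by gradient magnitude.

    gray: 2D list of brightness values (rows of equal length).
    Directions are folded into [0, pi), so an edge counts the same
    whichever side of it is brighter.
    """
    hist = [0.0] * bins
    for y in range(1, len(gray) - 1):
        for x in range(1, len(gray[y]) - 1):
            dx = gray[y][x + 1] - gray[y][x - 1]
            dy = gray[y + 1][x] - gray[y - 1][x]
            magnitude = math.hypot(dx, dy)
            if magnitude == 0:
                continue
            angle = math.atan2(dy, dx) % math.pi
            hist[min(int(angle / math.pi * bins), bins - 1)] += magnitude
    total = sum(hist)
    return [h / total for h in hist] if total else hist


# Two red pixels, one blue, one green.
colors = color_histogram([(255, 0, 0), (255, 0, 0), (0, 0, 255), (0, 255, 0)])

# A 4x4 image with a single vertical edge down the middle.
edge = [[0, 0, 255, 255] for _ in range(4)]
orientations = orientation_histogram(edge)

# One possible feature vector: the two histograms joined end to end.
features = colors + orientations
```

Because both histograms are normalized, images of different sizes produce feature vectors that can be compared directly.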

Some other variations include:

  • Identifying image features in terms of shapes, colors, and textures
  • Identifying the other images by searching through images on the Internet
  • Finding other images with similar image features by using probability models
  • Expansion of keywords in the obtained text by adding synonyms for the keywords
  • Using images from videos
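
The synonym variation is easy to picture with a small example. The sketch below is an assumption-laden illustration: the synonym groups are a hypothetical hand-made table (the filing does not say where synonyms come from), and `expand` and `intersecting_keywords` are made-up names.

```python
# Hypothetical synonym table; a real system would draw on a much larger resource.
SYNONYM_GROUPS = [
    {"car", "automobile", "auto"},
    {"photo", "picture", "image"},
]


def expand(keywords):
    """Add every synonym of every keyword that belongs to a known group."""
    expanded = set(keywords)
    for group in SYNONYM_GROUPS:
        if expanded & group:
            expanded |= group
    return expanded


def intersecting_keywords(keyword_sets, use_synonyms=True):
    """Keywords shared by all sets, optionally after synonym expansion."""
    sets = [expand(k) if use_synonyms else set(k) for k in keyword_sets]
    return set.intersection(*sets) if sets else set()


# Keywords taken from the text around three similar images.
found = [
    {"red", "car"},
    {"red", "automobile"},
    {"vintage", "red", "auto"},
]
print(sorted(intersecting_keywords(found, use_synonyms=False)))
# ['red']
print(sorted(intersecting_keywords(found)))
# ['auto', 'automobile', 'car', 'red']
```

Without expansion, the three pages only agree on "red"; with it, they also agree that the picture shows a car, even though each page used a different word for it.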

Using this Process with Videos

Videos without titles or descriptions can benefit from the use of the same approach.

They may be partitioned into a “set of representative frames,” and each of those can be processed as images, using the process described above. After those images are annotated with keywords, they can be analyzed to create a set of common annotations for the whole video.
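
As a rough sketch of that last step, the keywords from the individual frames could be tallied and only the widely shared ones kept. The sampling interval, the `min_fraction` threshold, and the function names below are all assumptions made for illustration; the filing only says that the frame annotations are analyzed to produce common annotations.

```python
from collections import Counter


def representative_frames(frames, every_n=30):
    """Naive stand-in for picking representative frames: keep every Nth one."""
    return frames[::every_n]


def video_annotations(frame_keyword_sets, min_fraction=0.5):
    """Keywords that annotate at least min_fraction of the sampled frames."""
    if not frame_keyword_sets:
        return []
    counts = Counter()
    for frame_keywords in frame_keyword_sets:
        counts.update(set(frame_keywords))
    threshold = min_fraction * len(frame_keyword_sets)
    return sorted(word for word, n in counts.items() if n >= threshold)


# Keywords produced by annotating four sampled frames as images.
per_frame = [
    {"surfing", "beach", "wave"},
    {"surfing", "wave"},
    {"beach", "sunset"},
    {"surfing", "wave", "board"},
]
print(video_annotations(per_frame))                     # ['beach', 'surfing', 'wave']
print(video_annotations(per_frame, min_fraction=0.75))  # ['surfing', 'wave']
```

A threshold below 100% lets a keyword survive even if a few frames (a title card, a cutaway shot) have nothing to do with the main subject of the video.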


8 thoughts on “Google on Automatic Video and Image Annotation”

  1. Good idea, especially given the huge number of accurate labels they have from the Google Image Labeler game.

    What I’m interested to find out is how they break down their image analysis – they must have huge variation between different images in such a massive collection.

  2. That’s probably why I get so many hits for The Jonas Brothers. Instead of just posting the picture of the band with my daughter, I post the URL of the images (after uploading them on Photobucket). The keyword on the hyperlink I use is The Jonas Brothers. That makes it easier for the post to be indexed and for it to be found by major search engines.
    The result?
    The particular site mentioned above (1 of 22), which is mainly about my 7 kids, autism and everyday life, gets many hits from Jonas Brothers fans. Oh and yes, I found a way to monetize that traffic.

  3. I think another interesting question about all of this is:

    How can Google tie their image search to geocoding in photographs and then compare the digital image and the place it was taken (longitude, latitude, time, direction and perspective) using the tons of information already found on Google Local, Google Earth and Google Maps, to assess the place, context and “topic” of a particular photograph?

    From Wikipedia on Geocoded Images:

    A geocoded photograph is a photograph which is associated with a geographical location. A geocoded image can be associated to geographical coordinates such as latitude and longitude or a physical address.

    In theory, every part of a picture can be tied to a geographic location, but in the most typical application, only the position of the photographer is associated with the entire digital image. This has implications for search and retrieval. For example, photos of a mountain summit can be taken from different positions miles apart.

    To find all images of a particular summit in an image database, all photos taken within a reasonable distance must be considered. The point position of the photographer can in some cases include the bearing, the direction the camera was pointing.

    Some digital cameras support GPS and record the time and location of the photographer each time a photo is taken. It is important to understand this distinction because, for example, the geo-coordinates of a house which stands as the subject of a photo taken by a photographer standing just in front of it can be of such relative closeness to the geo-coordinates of the photographer that the discrepancy is inconsequential; however, a photo taken of a mountain in a horizon can be a great distance from the photographer’s recorded GPS coordinates.

    In other words, the subject of a geocoded photo is how it appears from the recorded GPS location. The most accurate definition of a photograph’s geocode information, the location and time stamp respectively, is the identification of a photographer’s position on the planet and the photographer’s perspective from that position at the exact time the photo was taken. A photo’s relevant GPS data is stored in the photo’s EXIF file.

    If you can start comparing the growing use of geocoded information found in digital images (video and pictures) to the available and growing information already found in all of Google’s local, mapping and satellite databases, I think that is a major step forward in having one hell of a powerful image search.

    Have you read any mention in any of Google’s patent filings on image searching regarding any of this? I think it has major potential. Thanks.

  4. Geocoded pictures do seem like a perfect application for the technology. It could allow Google to pack Google Earth with millions of local pictures.
    The problem is that many sites tweak their labels to help with the search result. I know someone who labels their picture of the Atlanta skyline with “Atlanta Houses For Rent”. This seems so common that there could be some really bad results.
    It would really be to see totally inappropriate search results for the king of search but if anyone can do it, Google can.

  5. Hi Supermom,

    Congratulations on doing so well on that search. Making it easier for the search engines to index information like that makes it easier for you to do well. The annotations described here are for when people don’t make it as easy as you do.

    Hi Tim,

    Thanks. I didn’t go into a lot of detail on the different image analysis processes that Google uses, but they do mention a number of possibilities, which are worth looking at. The labeling game, based upon the ESP game, is something I thought about mentioning here. I think it probably does help them a lot.

  6. I’m sure Google could come up with a user contribution system in which users who tag an image with the appropriate description get “Google Points” to be used for AdWords or other Google product purchases.

    Some of those user contributions could be validations of other users’ descriptions.

