Chances are that when you search for a video on Google or YouTube, the results you receive are based on text about the video rather than the content of the video itself. The search algorithm might look at the title of the video, along with a description and tags entered by the person who uploaded it. Annotations on the video may also play a role in determining which terms and phrases the video is considered relevant for.
For example, the video below announces Google’s new food recipe search option, and provides a detailed description of the new feature. But none of the text accompanying the video mentions that the person providing details about Google’s added functionality is one of Google’s executive chefs, Scott Giambastiani. If you search for [Google executive chef], this video won’t appear in YouTube’s search results, and it probably should.
Other factors may play a role in how highly ranked a video could be in search results as well, including things like how many views and comments and likes the video has received, how often it was added to a playlist, and more.
There are some problems with relying upon just the textual content that is associated with a video. One is that a description probably doesn’t do a very good job of describing a long video that might contain a large number of scenes and a variety of content. Another is that on a site containing a lot of videos, the number of results returned in response to a query may be very large.
While the search engine might show a screenshot taken from the first frame, center frame or last frame of the video to help people decide which video might best meet their query, that thumbnail might not be very representative of the actual content of the video itself either.
All of those ignore the actual content of the video itself. What if a search engine could use the actual audio and visual content of the video to decide what search terms it might be relevant for?
It might be easier to get an idea of what a video is about if the search engine built an index that stores keyword association scores between individual frames of videos and the keywords associated with those frames.
Those frames might be associated with keywords based upon what’s contained within images or audio in each video. Google may also choose to use images from those frames as the thumbnails shown in search results, instead of just choosing the first, middle, or last frame of the videos.
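To make the idea of a frame-level index a little more concrete, here is a minimal sketch in Python. It is not the patent's implementation; the class name `FrameKeywordIndex`, the video IDs, the frame numbers, and the scores are all made up for illustration. It stores a score for each (keyword, video, frame) combination, ranks videos by their best-scoring frame for a keyword, and returns that frame as a candidate thumbnail.

```python
from collections import defaultdict


class FrameKeywordIndex:
    """Hypothetical index: keyword -> list of (video_id, frame, score)."""

    def __init__(self):
        self._entries = defaultdict(list)

    def add(self, keyword, video_id, frame, score):
        # Record how strongly one frame of one video is associated with a keyword.
        self._entries[keyword.lower()].append((video_id, frame, score))

    def search(self, keyword):
        """Return (video_id, best_frame, best_score) tuples, highest score first."""
        best = {}
        for video_id, frame, score in self._entries.get(keyword.lower(), []):
            # Keep only the highest-scoring frame for each video.
            if video_id not in best or score > best[video_id][1]:
                best[video_id] = (frame, score)
        ranked = sorted(best.items(), key=lambda item: item[1][1], reverse=True)
        return [(video_id, frame, score) for video_id, (frame, score) in ranked]


# Illustrative data only; these scores are invented.
index = FrameKeywordIndex()
index.add("chef", "video_a", 120, 0.91)
index.add("chef", "video_a", 450, 0.40)
index.add("chef", "video_b", 30, 0.65)
index.add("recipe", "video_b", 30, 0.80)

results = index.search("chef")
for video_id, thumbnail_frame, score in results:
    print(video_id, thumbnail_frame, score)
```

In this sketch the frame that best matches the query doubles as the thumbnail, which is one simple way to pick an image that represents the part of the video a searcher is probably looking for.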
A patent application published by Google this past week describes how the search engine might improve the indexing of videos by identifying and indexing both images and audio clips associated with specific keywords in videos. The patent filing is:
Relevance-Based Image Selection
Invented by Gal Chechik and Samy Bengio
Assigned to Google
US Patent Application 20110047163
Published February 24, 2011
Filed: August 24, 2009
A system, computer readable storage medium, and computer-implemented method presents video search results responsive to a user keyword query. The video hosting system uses a machine learning process to learn a feature-keyword model associating features of media content from a labeled training dataset with keywords descriptive of their content.
The system uses the learned model to provide video search results relevant to a keyword query based on features found in the videos. Furthermore, the system determines and presents one or more thumbnail images representative of the video using the learned model.
A number of whitepapers from Google authors also provide some hints at the possible future of video indexing:
- Large Scale Online Learning of Image Similarity through Ranking
- Sound Ranking Using Auditory Sparse-Code Representations
- Large Scale Content-Based Audio Retrieval from Text Queries
- Large Scale Image Annotation: Learning to Rank with Joint Word-Image Embeddings (pdf)
- Discriminative Tag Learning on YouTube Videos with Latent Sub-tags
The system described in the patent relies upon a video annotation index to help searchers find videos, or parts of videos, that may be relevant to their queries.
For example, a video that contains a clip or image of a dolphin swimming in the ocean might have that part of the video labeled with keywords such as “dolphin,” “swimming,” “ocean,” and so on.
There are a number of methods mentioned in the patent that might be used to help rank a part of a video for a particular query.
Click-through data may help to determine whether a keyword is appropriate for a particular video. If the same thumbnail image from a video gets chosen on a search for a particular query by a number of searchers, that may indicate a positive association between the query terms and the video.
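As a rough illustration of how click-through data could be turned into an association score, the sketch below counts how often a thumbnail was shown for a query and how often it was clicked. The patent doesn't spell out a formula like this one; the `ClickLog` class and the smoothing constants are assumptions added here so that a thumbnail shown only once or twice doesn't get an extreme score.

```python
from collections import Counter


class ClickLog:
    """Hypothetical log of thumbnail impressions and clicks per query."""

    def __init__(self):
        self.impressions = Counter()
        self.clicks = Counter()

    def record(self, query, video_id, frame, clicked):
        key = (query.lower(), video_id, frame)
        self.impressions[key] += 1
        if clicked:
            self.clicks[key] += 1

    def association(self, query, video_id, frame,
                    prior_clicks=1, prior_impressions=10):
        # Smoothed click-through rate. The prior values are an assumption,
        # not something taken from the patent filing.
        key = (query.lower(), video_id, frame)
        return (self.clicks[key] + prior_clicks) / (
            self.impressions[key] + prior_impressions
        )


log = ClickLog()
# Ten searchers saw this thumbnail for the query; eight of them clicked it.
for i in range(10):
    log.record("google executive chef", "video_a", 120, clicked=(i < 8))

popular = log.association("google executive chef", "video_a", 120)
unseen = log.association("google executive chef", "video_b", 30)
print(popular, unseen)
```

A thumbnail that is chosen often for a query ends up with a higher score than one that has never been shown, which matches the intuition that repeated selections suggest a positive association between the query and the video.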
A similarity search can compare images and audio from a video against a labeled training dataset of stock images and audio clips that have metadata associated with them, which can help to identify the unlabeled images and audio from the video. An example of Google using similarity searches can be found in the Google Similar Images search.
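Here is a simplified sketch of that kind of similarity search. It uses a plain nearest-neighbor lookup with cosine similarity, which is a stand-in chosen for brevity rather than the learned feature-keyword model the patent describes. The three-number feature vectors and the keyword labels are invented; a real system would extract much richer features from images and audio.

```python
import math


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def label_frame(frame_features, labeled_examples, k=2):
    """Borrow keywords from the k most similar labeled examples.

    Returns a dict of keyword -> best similarity among those k neighbors.
    """
    scored = [
        (cosine_similarity(frame_features, features), keywords)
        for features, keywords in labeled_examples
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    keyword_scores = {}
    for similarity, keywords in scored[:k]:
        for keyword in keywords:
            keyword_scores[keyword] = max(
                keyword_scores.get(keyword, 0.0), similarity
            )
    return keyword_scores


# Hypothetical labeled training data: (feature vector, keywords).
TRAINING = [
    ([0.9, 0.1, 0.0], ["dolphin", "ocean"]),
    ([0.8, 0.2, 0.1], ["ocean"]),
    ([0.0, 0.1, 0.9], ["kitchen", "chef"]),
]

# Features from an unlabeled video frame (also made up).
labels = label_frame([0.85, 0.15, 0.05], TRAINING, k=2)
print(labels)
```

The unlabeled frame picks up "dolphin" and "ocean" because its features sit closest to the labeled examples that carry those keywords, while the kitchen example is too dissimilar to contribute any labels.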
The patent filing and the whitepapers provide a much deeper look at the technology behind the similarity searches that could be used to associate images and audio from videos with labels that could be used to match up with keywords.
It’s possible that the metadata associated with a video, such as its title and description, may continue to be used by the search engine, but additional data from the content of the video itself could improve the results of a video search considerably.
And it might make it easier to find Google’s executive chef on YouTube when he’s featured in a Google Video.