Images on a web page can express ideas visually and convey a considerable amount of information, and they may also add to the attractiveness and perceived quality of a site.
When search engines rank pages in search results, images may have some impact on those rankings.
A search engine might look at the captions associated with pictures, or at alt text provided as an alternative for people who browse the Web with images turned off or who use screen reading software.
Search engines might also look at text surrounding an image, especially text within the same HTML container, block, or segment.
Those indexing services could also associate other content on a page with an image, including the page’s title.
There are a number of uses that people might put images to when creating a Web page, such as:
- Page and layout decorations
- Bullets for lists
- Site logos
- Spacer images to help keep a layout in place
- Topical images – pictures that add to and illustrate concepts discussed on pages
- Galleries of pictures of people and places and things
- Advertising images
Some images are more important to the content of a page than others, so how might a search engine use the text associated with some images and not others? What does a search engine look at when deciding how important an image might be to a page?
A recently published Microsoft patent application describes some approaches to creating and using a score for images that may affect Web search results in some very interesting ways.
This scoring system identifies text associated with an image in a document, referred to as “image text,” and determines an “image score” based upon the image text for an image in that document.
The image score can be used as an indication of the relevance of the document to the image text. The image score may be used in a number of ways, including as a signal for ranking the web pages in a search result.
The patent application is:
Scoring Relevance of a Document Based on Image Text
Invented by Qing Yu, Shuming Shi, Zhiwei Li, Ji-Rong Wen, Wei-Ying Ma
Assigned to Microsoft
US Patent Application 20080215561
Published September 4, 2008
Filed March 1, 2007
A method and system for determining relevance of a document having text and images to a text string is provided. A scoring system identifies image text associated with an image of the document.
The scoring system calculates an image score indicating relevance of the image text to the text string. The image score may be used in many applications, such as searching, summary generation, and document classification, image search, and image classification.
Applications that Can Use an Image Score
The image score may be used in many applications. Some examples mentioned in the patent application include:
1) A search engine may rank web pages of a search result based on their image scores. The image score may be combined with a text relevance score and a static ranking score (like PageRank) to provide an overall ranking.
2) A document summary system may look at an image score to determine whether an image of a document should be included in a summary, or snippet, of the document.
3) A document classification system may use an image score calculated from a comparison of image text to a textual description of a class as an indication of similarity between the document and the class.
4) A vertical search system may factor in an image score to search for items, such as products or news stories.
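The first application above, combining an image score with a text relevance score and a static ranking score, can be sketched as a simple weighted sum. The patent application doesn't say how the scores would be combined, so the linear formula, the weights, and the names `ScoredPage`, `overall_score`, and `rank_pages` are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class ScoredPage:
    url: str
    text_relevance: float  # relevance of the page text to the query
    static_rank: float     # query-independent score, such as a PageRank-like value
    image_score: float     # relevance of the page's image text to the query


def overall_score(page, w_text=0.6, w_static=0.2, w_image=0.2):
    """Weighted sum of the three signals. The weights are illustrative assumptions."""
    return (w_text * page.text_relevance
            + w_static * page.static_rank
            + w_image * page.image_score)


def rank_pages(pages):
    """Return pages ordered from highest to lowest overall score."""
    return sorted(pages, key=overall_score, reverse=True)


pages = [
    ScoredPage("example.com/a", text_relevance=0.70, static_rank=0.50, image_score=0.10),
    ScoredPage("example.com/b", text_relevance=0.65, static_rank=0.50, image_score=0.90),
]
ranked = rank_pages(pages)
```

With these made-up numbers, page b has slightly weaker text relevance than page a, but its much higher image score moves it to the top of the ranking.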
Calculating an Image Score
A web page may have a large image positioned at the top center of the web page and small images at the bottom of the page acting as links to other pages.
The large image may be more important to the web page, and the image text of that image may better represent the overall topic of the page than the image text of the other images.
The scoring system may increase the image score of that larger image, and decrease the image scores of the smaller pictures.
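One way to picture this adjustment is to scale the relevance of each image's text by how important the image is to the page. The sketch below is a guess at the general idea, not the patent's formula: the term-overlap measure in `text_relevance`, the multiplication in `image_score`, and the importance values are all hypothetical.

```python
def text_relevance(image_text, query):
    """Fraction of query terms found in the image text (a stand-in relevance measure)."""
    query_terms = set(query.lower().split())
    if not query_terms:
        return 0.0
    text_terms = set(image_text.lower().split())
    return len(query_terms & text_terms) / len(query_terms)


def image_score(image_text, importance, query):
    """Scale text relevance by image importance (assumed to lie between 0 and 1)."""
    return importance * text_relevance(image_text, query)


query = "golden gate bridge"
# A large image at the top of the page, assumed to have high importance.
hero = image_score("Golden Gate Bridge at sunset", importance=0.9, query=query)
# A small navigation image at the bottom of the page, assumed to have low importance.
icon = image_score("Bridge photos", importance=0.1, query=query)
```

Here the large image's text matches every query term and keeps most of its score, while the small image contributes very little even though its text partly matches.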
The importance of an image may be based on:
- Image level features – taken from the image itself
- Page level features – based upon the relationship of an image to a page, and,
- Site level features – based upon the relationship of the image to the web site of its web page
A table from the patent gives us an idea of what might be looked at in creating an image score based upon image level features, page level features, and site level features:
|Feature Level|Feature|Description|
|---|---|---|
|Image Level|Size|Area of the image|
|Image Level|Width/height ratio|Ratio of the width of the image to the height of the image|
|Image Level|Blurriness|Degree of blur of the image|
|Image Level|Contrast|Contrast within the image|
|Image Level|Colorfulness|Measure of color within the image|
|Image Level|Face|Flag indicating whether the image contains a face|
|Image Level|Photo vs. graphic|Flag indicating whether the image is a photograph or computer generated graphic|
|Page Level|Relative position X|Relative horizontal position of the image within the web page|
|Page Level|Relative position Y|Relative vertical position of the image within the web page|
|Page Level|Relative size|Percentage of the web page occupied by the image|
|Page Level|Relative width/height|Ratio of the width-to-height ratio of the image to the width-to-height ratio of the page|
|Site Level|Inner site|Flag indicating whether the image is contained within the web site of the web page|
|Site Level|Frequency in site|Number of times the image appears on different web pages of the web site of the web page|
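The page level features in the table are simple enough to compute from a rendered layout. The sketch below is one possible reading of them and covers only the page level rows. The `Box` type, the function name, and the choice to measure position from the center of the image are assumptions, since the table doesn't say which point of the image is used.

```python
from dataclasses import dataclass


@dataclass
class Box:
    x: float       # left edge of the image, in pixels
    y: float       # top edge of the image, in pixels
    width: float
    height: float


def page_level_features(image, page_width, page_height):
    """Compute the four page level features for one image on a rendered page."""
    # Assumption: position is measured at the center of the image.
    center_x = image.x + image.width / 2
    center_y = image.y + image.height / 2
    image_ratio = image.width / image.height
    page_ratio = page_width / page_height
    return {
        "relative_position_x": center_x / page_width,
        "relative_position_y": center_y / page_height,
        "relative_size": (image.width * image.height) / (page_width * page_height),
        "relative_width_height": image_ratio / page_ratio,
    }


# A 600 x 400 image at the top center of a 1000 x 2000 page.
features = page_level_features(Box(x=200, y=0, width=600, height=400), 1000, 2000)
```

For this example the image sits at the horizontal center, near the top of the page, and covers 12 percent of the page area.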
Image Text – What the Picture is About
Image text may be the anchor text (if the image is also a link), URL text, alt text, and surrounding text associated with an image.
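As a rough illustration, the pieces of image text listed above could be gathered into a single string. The helper names `url_terms` and `build_image_text`, the way the URL is split into words, and the order of the parts are all assumptions. The patent application doesn't describe this step at that level of detail.

```python
import posixpath
import re
from urllib.parse import urlparse


def url_terms(image_url):
    """Split the path of an image URL into lowercase word-like terms."""
    path = urlparse(image_url).path
    stem = posixpath.splitext(path)[0]  # drop the file extension
    return [term for term in re.split(r"[^a-z0-9]+", stem.lower()) if term]


def build_image_text(image_url, alt_text="", anchor_text="", surrounding=""):
    """Join anchor text, URL text, alt text, and surrounding text, skipping empty parts."""
    parts = [anchor_text, " ".join(url_terms(image_url)), alt_text, surrounding]
    return " ".join(part for part in parts if part)


image_text = build_image_text(
    "https://example.com/images/red-fox_snow.jpg",
    alt_text="A red fox",
)
```

A real system would likely keep the sources separate so that each could be weighted differently. Joining them here just keeps the example short.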
The scoring system may use various techniques for identifying surrounding text of an image:
- Rendering a web page in memory and analyzing its layout to identify the surrounding text based on distance from the image.
- Using rules to identify surrounding text from the HTML document representing a web page (e.g., passages consisting of 20 terms before or after the image).
- Using a Document Object Model (“DOM”) based technique for identifying surrounding text.
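The rule-based approach, taking a fixed number of terms before or after an image, is the easiest of the three to sketch. The example below uses Python's standard `html.parser` module. The class and function names are made up for illustration, and the code is a simplification: it treats every run of non-whitespace text as a term and doesn't filter out script or style content.

```python
from html.parser import HTMLParser


class ImageContextParser(HTMLParser):
    """Flatten a page into terms, recording where each <img> falls in the term stream."""

    def __init__(self):
        super().__init__()
        self.terms = []
        self.images = []  # tuples of (src, alt, index into self.terms)

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            attributes = dict(attrs)
            src = attributes.get("src") or ""
            alt = attributes.get("alt") or ""
            self.images.append((src, alt, len(self.terms)))

    def handle_data(self, data):
        self.terms.extend(data.split())


def surrounding_text(html, window=20):
    """Return up to `window` terms before and after each image in the HTML."""
    parser = ImageContextParser()
    parser.feed(html)
    parser.close()
    results = []
    for src, alt, position in parser.images:
        before = parser.terms[max(0, position - window):position]
        after = parser.terms[position:position + window]
        results.append({"src": src, "alt": alt, "surrounding": " ".join(before + after)})
    return results


html = '<p>A red fox runs</p><img src="fox.jpg" alt="red fox"><p>through the snow</p>'
context = surrounding_text(html, window=2)
```

With a window of two terms, the image in this snippet is associated with the words "fox runs through the". The rendering and DOM-based techniques would instead respect the visual or structural boundaries of the page.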
Discarding Some Images
Pictures that aren’t important, such as very small images or ones that fall below a certain importance threshold, may be discarded to improve the accuracy of the scoring.
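A filter like that could look something like the sketch below. The threshold values, the dictionary keys, and the function name are hypothetical, and the importance values are assumed to have been computed earlier.

```python
def keep_important_images(images, min_area=2500, min_importance=0.2):
    """Drop images that are too small or fall below the importance threshold."""
    kept = []
    for image in images:
        area = image["width"] * image["height"]
        if area >= min_area and image["importance"] >= min_importance:
            kept.append(image)
    return kept


images = [
    {"src": "hero.jpg", "width": 600, "height": 400, "importance": 0.9},
    {"src": "spacer.gif", "width": 1, "height": 1, "importance": 0.0},
    {"src": "banner-ad.png", "width": 468, "height": 60, "importance": 0.1},
]
kept = keep_important_images(images)
```

In this example only the large image survives. The spacer is too small, and the banner is large enough but falls below the assumed importance threshold.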
It has been a fairly common belief that a search engine will look at text associated with images on web pages when ranking those pages, but information from the search engines about how they might actually use that text has been somewhat scarce.
This patent application gives us an opportunity to think about how a search engine may place more importance on the text related to some images than on the text related to others, based on how important each image seems to be to a page.
It also describes how a score associated with images might be used in other ways, such as including an image with a snippet of text that shows up in search results for News and Web searches.
How important are the different images that appear on your site, and what do they tell search engines about your pages?