Many words found on a web page are much easier to understand given the context of the page itself, as described in a Google patent granted last week. For example, take the word “bank,” which can mean a financial institution, one side of a river, or the turning of an airplane. Without the context in which the word appears on a page, it’s nearly impossible to determine its meaning with any certainty.
I usually include a section within site audits that deals with the structure and organization of a site. This looks at how things are connected by links from one page to another, and at the use of anchor text to describe those sections and the sub-sections within them.
It explores the use of a hierarchy of categories nested into subcategories, and sometimes into even smaller groupings of categories, and how those might be linked together.
This can mean that bigger (and more general) categories get the benefit of more links to them, and yet smaller categories also have some interlinking between them. This has always been an essential part of building a site to me, and in helping content to be found and understood.
The patent provides examples of sites that are organized into hierarchical structures, and how those can help give a search engine clues about the taxonomies that make up the structure of a site.
One set of sites described in the patent is manufacturers’ sites, such as the Apple site, which is structured to show off the different products that Apple makes. Another site explored in the patent is an electronics products review site, organized by product type and then by specific products. Another is a car site, where cars are broken down into specific makes and then into models.
These examples are used to show how a site might be used to update a taxonomy model when a new site contains more information than previously visited sites, or when new information is added to a taxonomy, such as when a new smartphone model is released.
So why look at the structure of a site to come up with a model taxonomy that it might fit into?
The patent tells us that doing this makes it more likely that the search engine will pick relevant advertisements for pages of the site if the site shows advertisements. Advertisers can have their ads shown on more meaningful pages in this manner. For example, a financial institution might want to place an ad on a document that uses the word “bank” in the context of finance, but not on one about river banks or banking planes.
In addition to advertisements, this kind of taxonomic classification of pages enables Google to label words and/or pages with meaning based upon hierarchical categories and/or other features of a taxonomy.
Labeling Taxonomies in an Automated Manner
It’s impossible to say how long Google has been assigning taxonomies to pages, but it makes sense for them to better understand words on a page by getting a sense of how those words are used on pages of a site, and their context within a site itself.
While some of this labeling might be done through the crawling of pages, some may also be done with the use of human readers who may also apply labels to pages. The use of human classifiers in a manner like this is a bottleneck that could significantly slow down or delay the process, so the aim of this patent is to automate this process as much as possible.
Once a large enough set of documents has been appropriately labeled, the resulting set might be referred to as a “golden set,” or a “training set.” This type of training set enables Google to create a classifier model, representing rules or criteria that can be used to label other words. That includes looking at the proximity of a labeled word to other words within a document to assign a probability that the word has a certain meaning in that context.
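To make the idea of a classifier trained on a golden set more concrete, here is a minimal sketch. It is not the patent's method: it swaps in a simple naive Bayes model with add-one smoothing, and the tiny training set, the sense labels, and the function names are all made up for illustration.

```python
from collections import Counter, defaultdict
import math

# Hypothetical "golden set": words seen near "bank", plus a hand-applied label.
TRAINING_SET = [
    ("deposit money account loan".split(), "finance"),
    ("loan interest account teller".split(), "finance"),
    ("river water fishing mud".split(), "river"),
    ("river erosion water flood".split(), "river"),
]

def train(examples):
    """Count how often each context word appears alongside each sense."""
    word_counts = defaultdict(Counter)
    sense_counts = Counter()
    vocab = set()
    for words, sense in examples:
        sense_counts[sense] += 1
        word_counts[sense].update(words)
        vocab.update(words)
    return word_counts, sense_counts, vocab

def sense_probabilities(context, model):
    """Return one probability per sense (naive Bayes, add-one smoothing)."""
    word_counts, sense_counts, vocab = model
    total = sum(sense_counts.values())
    log_scores = {}
    for sense, count in sense_counts.items():
        score = math.log(count / total)
        denom = sum(word_counts[sense].values()) + len(vocab)
        for word in context:
            score += math.log((word_counts[sense][word] + 1) / denom)
        log_scores[sense] = score
    # Normalize the log scores into probabilities that sum to 1.
    peak = max(log_scores.values())
    exp = {s: math.exp(v - peak) for s, v in log_scores.items()}
    norm = sum(exp.values())
    return {s: v / norm for s, v in exp.items()}

model = train(TRAINING_SET)
probs = sense_probabilities(["loan", "account"], model)
```

With the context words "loan" and "account," this toy model gives the "finance" sense a higher probability than the "river" sense. A real classifier would presumably use far more labeled documents and richer features than neighboring words alone.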
So the focus of this patent is a way for a search engine to automate this process, so that it doesn’t have to rely upon human readers and classifiers. The patent is:
Updating taxonomy based on webpage
Invented by Philo Juang, Christopher Testa, and Nicolaus Mote
Assigned to Google
US Patent 8,645,384
Granted February 4, 2014
Filed: May 5, 2010
According to an example implementation, a computer-implemented method may include extracting, by a computing device, structured content from a website, determining a recent taxonomy by applying category rules to the structured content, the recent taxonomy including multiple categories and a new category, and updating a stored taxonomy based on the determined recent taxonomy by adding the new category to the stored taxonomy.
The updating of taxonomies may be done in part by looking for structured content on websites or webpages, such as HTML within structured formats (like tables) or in other formats. Category rules might be applied to that structured content to determine a recent taxonomy, including multiple categories.
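The update step in the abstract, adding a new category from a recent taxonomy to a stored one, might look something like the sketch below. The nested-dictionary representation, the `update_taxonomy` name, and the sample categories are my own assumptions for illustration. The patent does not specify a data structure.

```python
def update_taxonomy(stored, recent):
    """Add categories found in `recent` that are missing from `stored`.

    Both taxonomies are nested dicts: {category: {subcategory: {...}}}.
    Returns the paths of the categories that were newly added.
    """
    added = []

    def merge(stored_node, recent_node, path):
        for name, children in recent_node.items():
            here = path + (name,)
            if name not in stored_node:
                stored_node[name] = {}
                added.append(here)
            # Recurse so new subcategories under known categories are found.
            merge(stored_node[name], children, here)

    merge(stored, recent, ())
    return added

# Hypothetical stored taxonomy and a freshly extracted one.
stored = {"Phones": {"Model A": {}}, "Tablets": {}}
recent = {"Phones": {"Model A": {}, "Model B": {}}, "Watches": {}}
new_paths = update_taxonomy(stored, recent)
```

Here `new_paths` holds the path to the new "Model B" subcategory and the new "Watches" category. Nothing is removed from the stored taxonomy, so "Tablets" survives even though the recent crawl did not see it. That is a design choice in this sketch, not something the patent dictates.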
Crawling might be done in a focused manner that periodically crawls and revisits specific pages to update taxonomies.
There’s some discussion of an “Administrator” directing crawling actions to update a taxonomy, and that this approach could be used on the public internet and on private and/or corporate networks as well.
A taxonomy extractor might create category rules based upon visiting pages and looking for content from sources such as:
- Items associated with hyperlinks within an area of a webpage
- Objects within a div
- Items within a menu or drop-down menu
- Items within the same row or column of a table
- Items included in a section of a list or outline
- Options within a webpage, which may be categories within the same supercategory
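As a rough illustration of how sources like these could be turned into category rules, the sketch below collects items that share a container, such as a drop-down menu or a table row, and treats them as sibling categories. It uses Python's standard `html.parser`. The `GROUP_RULES` mapping and the `SiblingGroupParser` class are hypothetical, and the sketch assumes flat, well-formed markup with explicit closing tags.

```python
from html.parser import HTMLParser

# Assumed container -> item tag pairs; a real rule set would be broader.
GROUP_RULES = {"select": "option", "tr": "td", "ul": "li"}

class SiblingGroupParser(HTMLParser):
    """Collect the text of items that share one container element."""

    def __init__(self):
        super().__init__()
        self.groups = []        # finished groups of sibling labels
        self._container = None  # tag of the container currently open
        self._items = []        # labels collected for that container
        self._in_item = False
        self._text = []

    def handle_starttag(self, tag, attrs):
        if self._container is None and tag in GROUP_RULES:
            self._container, self._items = tag, []
        elif self._container and tag == GROUP_RULES[self._container]:
            self._in_item, self._text = True, []

    def handle_data(self, data):
        if self._in_item:
            self._text.append(data)

    def handle_endtag(self, tag):
        if self._container is None:
            return
        if self._in_item and tag == GROUP_RULES[self._container]:
            label = "".join(self._text).strip()
            if label:
                self._items.append(label)
            self._in_item = False
        elif tag == self._container:
            if self._items:
                self.groups.append(self._items)
            self._container = None

# Made-up markup: one drop-down menu and one table row.
html = """
<select><option>Sedan</option><option>Coupe</option></select>
<table><tr><td>Phones</td><td>Tablets</td><td>Laptops</td></tr></table>
"""
parser = SiblingGroupParser()
parser.feed(html)
groups = parser.groups
```

Each entry in `groups` is a list of labels that appeared together in one container, which is the kind of evidence that could suggest they belong to the same supercategory. Nested containers are deliberately out of scope here.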
The patent also tells us that “options or categories which become available after a selection has been made, such as after an item has been selected from a menu or drop-down menu, or after a hyperlink has been clicked, are subcategories of the selected option.”
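One way to picture that parent and child rule is with a nested menu, where the items revealed under a selected option sit inside that option's markup. The sketch below assumes the menu is present in static HTML as nested `<ul>`/`<li>` elements, which is my assumption and not something the patent requires. The `NestedMenuParser` class and the sample menu are hypothetical.

```python
from html.parser import HTMLParser

class NestedMenuParser(HTMLParser):
    """Build {category: {subcategory: {...}}} from nested <ul>/<li> markup."""

    def __init__(self):
        super().__init__()
        self.tree = {}
        self._stack = [self.tree]  # dict that receives the next label
        self._current = None       # children dict of the <li> being read
        self._awaiting_label = False

    def handle_starttag(self, tag, attrs):
        if tag == "ul":
            # A list opened inside a labeled item holds its subcategories.
            target = self._current if self._current is not None else self._stack[-1]
            self._stack.append(target)
            self._current = None
        elif tag == "li":
            self._awaiting_label = True

    def handle_data(self, data):
        label = data.strip()
        if self._awaiting_label and label:
            self._current = self._stack[-1].setdefault(label, {})
            self._awaiting_label = False

    def handle_endtag(self, tag):
        if tag == "ul" and len(self._stack) > 1:
            self._stack.pop()
        elif tag == "li":
            self._current = None

# Made-up car site menu: makes under "Cars", a model under one make.
menu = """
<ul>
  <li>Cars
    <ul>
      <li>Honda<ul><li>Civic</li></ul></li>
      <li>Toyota</li>
    </ul>
  </li>
  <li>Trucks</li>
</ul>
"""
parser = NestedMenuParser()
parser.feed(menu)
tree = parser.tree
```

In the resulting `tree`, "Honda" and "Toyota" end up as subcategories of "Cars", and "Civic" as a subcategory of "Honda". Menus that load their options with scripts after a click would need a different approach, such as following the link and parsing the page it leads to.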
The patent also describes options that a programmer or administrator might have for developing or choosing classification rules for pages on a private or corporate network.
I recently wrote a post titled, Will Keywords be Replaced by Topics for Some Searches?, and Google’s ability to do something like that requires that the search engine be better able to understand the concepts and topics covered by a page and found within a query. This patent provides a hint at one way that Google can get an idea of what a page is about, and I’ll have some posts coming up soon that explore that in more depth.