Building Google’s Knowledge Base and Identifying Locations in Web Pages

When we talk about crawling and indexing content on the Web, it’s usually in the context of ranking pages in response to queries, based on a number of signals found on those pages. Google has told us that the future of search involves Knowledge Bases and the indexing of Things, Not Strings. Gianluca Fiorelli explored Google’s ideas of Search in the Knowledge Graph Era earlier this week.

A few years back, I wrote some posts about Google patents that explored how Google might extract and visualize facts, and use Data Janitors to process, clean up, and sort that information. Google was granted another patent this week that’s very much related, looking at how Google might understand locations for places collected from Web pages. One of the inventors, Andrew Hogue, gave this Google Tech Talk presentation last year:

The presentation and the patent both talk about how difficult it can be to extract locations (and other types of information) from documents.

As the patent notes, place names can be difficult because they might be presented in many different formats, and they might contain typos, errors, omissions, or ambiguous language. Does a word such as Turkey refer to a place or a meal? Is Newark mentioned on a page located in New Jersey or Delaware?

Extracting Facts About Places

The patent attempts to understand when a web document is referencing a place and where that place may be located, and to tag the page with that location.

The patent is:

Determining geographic locations for place names in a fact repository
Invented by David Vespe and Andrew Hogue
Assigned to Google
US Patent 8,347,202
Granted January 1, 2013
Filed: March 14, 2007

Abstract

A system and method for tagging place names with geographic location coordinates, the place names associated with a collection of objects in a memory of a computer system. The system and method process a text string within an object stored in memory to identify a first potential place name. The system and method determine whether geographic location coordinates are known for the first potential place name. The system and method identify the first potential place name associated with an object in the memory as a place name.

The system and method tag the first identified place name associated with an object in the memory with its geographic location coordinates, when the geographic location coordinates for the first identified place name are known. The system and method disambiguate place names when multiple place names are found.
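The abstract doesn't say how place names are disambiguated, so the following is only a sketch of one possible heuristic, not the patent's method: when a name such as Newark has several candidate locations, prefer the one closest to other places already located on the same page. The `CANDIDATES` table, the function names, and the approximate coordinates are all assumptions made for illustration.

```python
from math import asin, cos, radians, sin, sqrt

# Hypothetical candidates for one ambiguous name; coordinates are approximate.
CANDIDATES = {
    "Newark": {
        "Newark, New Jersey": (40.7357, -74.1724),
        "Newark, Delaware": (39.6837, -75.7497),
    }
}


def haversine_km(a, b):
    """Great-circle distance in kilometers between two (lat, lon) points."""
    lat1, lon1, lat2, lon2 = map(radians, (a[0], a[1], b[0], b[1]))
    h = (sin((lat2 - lat1) / 2) ** 2
         + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * asin(sqrt(h))


def disambiguate(name, other_places_on_page):
    """Pick the candidate with the smallest total distance to other located places."""
    options = CANDIDATES.get(name, {})
    if not options or not other_places_on_page:
        return None  # nothing to compare against, so make no guess
    return min(
        options,
        key=lambda label: sum(
            haversine_km(options[label], place) for place in other_places_on_page
        ),
    )


# Assumed coordinates for two places that might appear on the same page.
wilmington_delaware = (39.7391, -75.5398)
jersey_city = (40.7178, -74.0431)

print(disambiguate("Newark", [wilmington_delaware]))  # Newark, Delaware
print(disambiguate("Newark", [jersey_city]))          # Newark, New Jersey
```

A page that mentions no other located places gives this heuristic nothing to work with, which is why it returns `None` rather than guessing.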

My post linked to above about Data Janitors describes how they might work. This patent shares an inventor with that one, and covers some of the same ground in describing what a janitor is, and what they do, including “data cleansing, object merging, and fact induction.”

Facts are About Named Entities

When you perform a Q&A type query on Google for something like “When is Britney Spears’ Birthday,” possible answers found on Web pages are 12/2/1981 and Dec. 2, 1981. They are the same date, but not written in the same format. Janitors are responsible for recognizing that these are the same date in different formats. This is part of how a web of things operates: cleaning and organizing facts, and the attributes and values associated with them.
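As a rough illustration of that kind of data cleansing, here is a minimal Python sketch that maps both date strings onto one canonical value. The list of formats and the function name are my assumptions, and the sketch reads 12/2/1981 month-first, as a US page would. A real janitor would need many more formats and some way to handle day-first dates.

```python
from datetime import date, datetime

# Hypothetical set of date formats a cleansing step might recognize.
KNOWN_FORMATS = ("%m/%d/%Y", "%b. %d, %Y", "%B %d, %Y")


def normalize_date(text):
    """Return a date for the first format that matches, or None if none do."""
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date()
        except ValueError:
            continue
    return None


candidates = ["12/2/1981", "Dec. 2, 1981"]
normalized = {normalize_date(candidate) for candidate in candidates}
print(normalized)  # a single date: both strings name December 2, 1981
```

Because both strings collapse to the same `date` object, two facts extracted from different pages can be recognized as agreeing with each other.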

Facts are information about named entities (specific people, places, and things) on the Web. These named entities can be specific people (many of whom can be found in Google+). They can be places, points of interest, or businesses at specific locations, many of which can be found in Google Maps. Knowledge Bases that Google may use can include Wikipedia, Freebase, IMDb, Netflix, and other large databases about people, places, and things.

Even Google’s query and click logs can provide information about named entities, and attributes associated with them that people are interested in, especially when those logs are associated with query sessions.

Bill Clinton playing a saxophone at a state sponsored event.

A fact might be a name fact, which gives the name of an object, such as “Bill Clinton.” A fact may also be a property fact, such as the string of text, “Bill Clinton was the 42nd President of the United States from 1993 to 2001.”
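One way to picture the difference is an object that carries a list of facts, each labeled by kind. This is only a sketch: the `Fact` and `EntityObject` classes and their fields are hypothetical names, not the schema of Google's fact repository.

```python
from dataclasses import dataclass, field


@dataclass
class Fact:
    kind: str   # "name" or "property" in this sketch
    value: str  # the text of the fact


@dataclass
class EntityObject:
    object_id: str
    facts: list = field(default_factory=list)

    def names(self):
        """Return the values of all name facts attached to this object."""
        return [fact.value for fact in self.facts if fact.kind == "name"]


clinton = EntityObject(object_id="example-1")
clinton.facts.append(Fact("name", "Bill Clinton"))
clinton.facts.append(Fact(
    "property",
    "Bill Clinton was the 42nd President of the United States from 1993 to 2001.",
))
print(clinton.names())  # ['Bill Clinton']
```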

A Geopoint Janitor

This patent introduces a geopoint janitor, which tries to work out whether a string of text refers to a place name by following rules about places. It may also attempt to attach known geographic locations to potential place names, which it might find in a list of “existing annotated place names and/or through a coordinate lookup service.” The geopoint janitor decides, for instance, if Turkey is a place or a bird. The geopoint janitor is also what tags a document with a place name and a location.
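The lookup-and-tag step might look something like the sketch below. The small `GAZETTEER` dictionary stands in for the list of annotated place names or the coordinate lookup service the patent mentions. Its contents, the approximate coordinates, and the function name are assumptions for illustration.

```python
# Hypothetical stand-in for a list of annotated place names or a
# coordinate lookup service; coordinates are approximate (lat, lon).
GAZETTEER = {
    "new york city": (40.7128, -74.0060),
    "empire state building": (40.7484, -73.9857),
}


def tag_place_names(candidates):
    """Return {name: (lat, lon)} for candidates with known coordinates."""
    tags = {}
    for name in candidates:
        coordinates = GAZETTEER.get(name.lower())
        if coordinates is not None:
            tags[name] = coordinates
    return tags


tags = tag_place_names(["New York City", "Thanksgiving Dinner"])
print(tags)  # {'New York City': (40.7128, -74.006)}
```

A lookup like this can't settle the Turkey question by itself, since the country would be found whether or not the page is about a bird. That is where the rules described next come in.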

How does the geopoint janitor do this?

It follows rules, like looking at whether certain words are capitalized in sentences like, “I visited the Empire State Building in New York City.” It might ignore capitalization of the first word of a sentence when doing so. It might also ignore capitalized words like “I,” which the patent calls a “noise” word.
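A simplified version of the capitalization rule can be sketched in a few lines of Python. The noise-word list and the function name are my assumptions, and the sketch skips the first word of the sentence entirely, so it would miss a place name that opens a sentence.

```python
import re

NOISE_WORDS = {"I"}  # assumed list; the post gives only "I" as an example


def candidate_place_names(sentence):
    """Group runs of capitalized words into candidate place names."""
    words = re.findall(r"[A-Za-z']+", sentence)
    candidates, run = [], []
    for index, word in enumerate(words):
        is_candidate = (
            index > 0                    # ignore the sentence-initial capital
            and word[0].isupper()
            and word not in NOISE_WORDS
        )
        if is_candidate:
            run.append(word)
        elif run:
            candidates.append(" ".join(run))
            run = []
    if run:
        candidates.append(" ".join(run))
    return candidates


sentence = "I visited the Empire State Building in New York City."
print(candidate_place_names(sentence))
# ['Empire State Building', 'New York City']
```

These are only candidates. Capitalized runs also catch people and brand names, so each one would still need to be checked against known locations.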

A workman helping to build the Empire State Building.

It might also look for words that commonly precede or follow locations, such as, “in.” It might learn about these types of words based upon locations it’s already aware of, such as “New York City.”
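Learning those context words could be as simple as counting which words appear immediately before place names that are already known. The sample sentences and the function name below are assumptions for illustration.

```python
import re
from collections import Counter


def learn_preceding_words(sentences, known_places):
    """Count the words that appear immediately before known place names."""
    counts = Counter()
    for sentence in sentences:
        for place in known_places:
            pattern = r"\b(\w+)\s+" + re.escape(place) + r"\b"
            for match in re.finditer(pattern, sentence):
                counts[match.group(1).lower()] += 1
    return counts


sentences = [
    "I visited the Empire State Building in New York City.",
    "She lives in New York City.",
    "We flew to New York City last week.",
]
counts = learn_preceding_words(sentences, ["New York City"])
print(counts.most_common())  # [('in', 2), ('to', 1)]
```

Words that turn up often in this position, such as “in,” could then count as evidence that an unfamiliar capitalized phrase following them is also a place.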

The geopoint janitor may also learn when there are different variations of a potential place name. It might learn about such alternative names by looking at sentences such as, “I love visiting the Empire State; New York is a fabulous place to vacation.”

The patent provides some other examples of rules that might be followed to understand when a place name is referred to on a Web page.

Takeaways

The future of search may be in having a search engine understand what a web page is about, instead of matching words on a page to words within a query. By following rules to understand what a page may actually be about, including the places referenced within it, we’re moving steps closer to that kind of understanding.

Google’s knowledge base results give us information about named entities associated with queries. By looking in Google’s query and click logs, the search engine may also include information about, and links to, related queries and search results.

As this patent shows, Google is looking at as many web pages as possible to find information about entities on those pages. The kinds of rules that a geopoint janitor follows to understand places and pages on the Web, and to tag those pages, are a start towards a web of things instead of strings.


16 thoughts on “Building Google’s Knowledge Base and Identifying Locations in Web Pages”

  1. Isn’t that what a lot of SEO is about though? Conveying what our different pages are actually about.

    Though it’s definitely cool research they are doing with natural language processing. We have Schema.org and similar of course but further unlocking the human language format will be a big boon going forward.

  2. Local SEO for me is more about being a small fish in a smaller pond, rather than competing globally. It’s not about having people find my place of business. Since I do web design from my living room I really wouldn’t want people to come knocking.

  3. Bill:

    Isn’t this similar to what Facebook did when it purchased Instagram? That is, Facebook didn’t want to purchase Instagram just for its photo repository, but it really wanted its ability to geolocate those photos (and then create their algorithms around that). Google is doing the same thing with their patents – just not directly. Instead, they’re using related data to create a whole “picture” instead of just a piece-part.

    Thanks.

  4. Hi Gary,

    Interesting question.

    Google has acquired companies for a number of different reasons. For some, they wanted to hire the employees and use their technical expertise. For others, they wanted the features and technology, like YouTube. Some acquisitions, including pure patent acquisitions, were to acquire intellectual property they can use to protect themselves from others doing something similar.

    For some, they liked the technology a lot, but didn’t know quite what to do with it at the time, like Dodgeball.

    Google does seem to be filling in some gaps, but you have to wonder in many cases if that’s the direction they were traveling in before they tried to acquire those companies or their intellectual property.

  5. Great post as always.

    I think this raises some interesting questions. How does it tell things like “bath” as in a hot tub and “Bath” the city in Somerset, UK, apart? Or worse still, someone in “Bath” who is a plumber or sells “baths”!

    I have noticed a much wider variation in the UK Google search results based on your location in the last two weeks. I wonder if this is part of some location-based language indexing working through the data. Some of the results are terrible, with sites in the US being promoted over UK sites on .co.uk TLDs. All interesting stuff.

  6. “I have noticed a much wider variation in the UK Google search results based on your location in the last two weeks. I wonder if this is part of some location-based language indexing working through the data.”

    Hi Tony. I’ve seen exactly the same thing. The results have been spotty at best, with some really weird stuff coming up. It’ll be interesting how it all pans out.

  7. Bill, as always, you seem to be on top of all the Google patents and changes. I think Google is concentrating on locations to take over the local market. Social media is the turn they are hoping to control. Facebook and Twitter are seeing that Google+ is growing. Keep the great information coming.

  8. Isn’t this similar to what Facebook did when it purchased Instagram? That is, Facebook didn’t want to purchase Instagram just for its photo repository, but it really wanted its ability to geolocate those photos (and then create their algorithms around that). Google is doing the same thing with their patents – just not directly. Instead, they’re using related data to create a whole “picture” instead of just a piece-part.

  9. Well I guess I’m just going to have to battle more content spam that outranks me. I get frustrated following all the rules and guidelines when terrible sites continue to outrank mine. My client was hacked and now after a complete server change, brand new website, and 301 redirects, he still cannot rank for the least competitive terms.
