In March, one of the more interesting patent filings from Google was granted: Information retrieval based on historical data.
I discussed it on forums when the original patent application came out in March of 2005, but didn’t provide a write-up of the document here. I realized a few weeks ago that I probably should.
The historical data patent is important because it discusses a large number of techniques that a search engine might use to fight “spamming techniques” that might artificially “inflate” the rankings of web sites. It also works to identify “stale” sites that may be ranked higher than fresher sites containing more recently updated information.
I’ll be writing a few posts over the next few weeks about the patent, and will try to include some updates that have happened since it was first published. This first post looks at how the “freshness” of a page or document might influence its rankings in search results.
The Web is filled with page after page after page of data. That data is usually organized differently from one site and one page to another, and contained in text, in pictures, in videos, in audio, in columns, in rows, in frames, and many other formats.
When a search engine spider comes to a page on the Web, it will try to go through all of the text it finds, make note of links to other pages, consider alt text for images, and read meta tags.
Search engine spiders will decide whether or not the content of pages should be indexed by the search engine, and determine which links to follow next.
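As a rough illustration of the kind of information a spider might pull from a page, here is a minimal sketch using Python's standard `html.parser` module. The `PageScanner` class, the `scan` helper, and the sample HTML are hypothetical names and data made up for this example; real crawlers are far more involved, and this sketch does not filter out script or style content.

```python
from html.parser import HTMLParser


class PageScanner(HTMLParser):
    """Collects text, links, image alt text, and meta tags from one page."""

    def __init__(self):
        super().__init__()
        self.text = []      # non-empty text fragments found on the page
        self.links = []     # href values of <a> tags (candidates to crawl next)
        self.alt_text = []  # alt attributes of <img> tags
        self.meta = {}      # meta tag name -> content

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("href"):
            self.links.append(attrs["href"])
        elif tag == "img" and attrs.get("alt"):
            self.alt_text.append(attrs["alt"])
        elif tag == "meta" and attrs.get("name"):
            self.meta[attrs["name"].lower()] = attrs.get("content") or ""

    def handle_data(self, data):
        # Skip whitespace-only fragments between tags.
        if data.strip():
            self.text.append(data.strip())


def scan(html):
    """Parse an HTML string and return the populated scanner."""
    scanner = PageScanner()
    scanner.feed(html)
    scanner.close()
    return scanner


# Made-up sample page, for illustration only.
SAMPLE = """<html><head><meta name="description" content="A page about shoes">
<title>Shoes</title></head>
<body><p>Running shoes for sale.</p>
<img src="shoe.png" alt="red running shoe">
<a href="/trail.html">Trail shoes</a></body></html>"""

page = scan(SAMPLE)
print(page.links)
print(page.alt_text)
print(page.meta)
```

The list of links is what a spider would use to decide where to go next, while the text, alt text, and meta tags are the raw material for deciding whether and how to index the page.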
The ability to annotate pictures, songs, video, and web pages is growing as site owners attempt to find new ways to build communities and attract visitors to their sites.
Enabling people to add their own thoughts, opinions, and words to a page means that a site can grow and evolve with the community that it was built for, and that it becomes more interesting and valuable to other visitors.
Some very popular sites that allow for annotations include Flickr, Delicious, StumbleUpon, and YouTube, and the original intent behind Google may have been to build an annotation system. Knowing what other people are saying about a site or a photo or a video can be pretty interesting.
A number of questions can be raised about annotations:
- How do you encourage people to leave an annotation of their own?
- Will offering annotation suggestions help?
- If so, how do you rank the suggestions that you provide?
- Might it make a difference what kind of device people are using, such as a desktop computer or a handheld camera phone?
- Can personal history of past annotations help with future annotation suggestions?
- Can you build community with a site offering annotations, and can input from the community and from all members of a site be helpful in offering annotation suggestions?
- What role might annotations play in which results search engines return to searchers?
One of the challenges that a search engine like Google faces is that in order for it to be useful globally, it needs to provide search for an audience that speaks many different languages.
It’s not surprising, then, that the search engine has delved into learning as much as it can about many languages, and even provides a translation service (Google Translate) and a service that allows you to search for translated words and phrases in other languages (Google Language Tools).
Google also offers a Google Translate gadget that you can put on your pages to allow visitors to translate your page into their language.
Google has also worked to make its Translate service mobile by making it available on iPhones.
When you’re searching for something on the Web, does it matter whether you use the singular or plural version of a word in your search?
For example, let’s say that you are looking for a new pair of sneakers to go jogging in, and you want to find the right combination of comfort and support, so you decide to look into the best sneakers for running. Does it make a difference in search results when you type in running shoes or running shoe in a search box?
If a search engine just returned results to you based upon your choice of a singular or plural search term, would you get the best results? Should a search engine explore both versions, and try to provide you with a mix of results based upon what it believes are the best results, after looking at results from the singular and plural versions?
A quick look at the top ten results at Yahoo and at Google for both “running shoe” and “running shoes” (both searches without the quotation marks) showed some overlap in the pages returned for the singular and plural versions at each search engine, but the vast majority of results seemed to favor the plural version of the word over the singular.
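If you wanted to repeat that kind of comparison yourself, one simple way to measure overlap between two result lists is to count the shared URLs and divide by the total number of distinct URLs (the Jaccard ratio). The sketch below assumes you have already collected the two lists by hand; the `overlap` function and the placeholder URLs are hypothetical and are not real search results.

```python
def overlap(results_a, results_b):
    """Return the URLs present in both lists and the Jaccard overlap ratio."""
    set_a, set_b = set(results_a), set(results_b)
    shared = set_a & set_b
    union = set_a | set_b
    ratio = len(shared) / len(union) if union else 0.0
    return shared, ratio


# Placeholder URLs standing in for results collected by hand.
singular = ["example.com/a", "example.com/b", "example.com/c"]
plural = ["example.com/b", "example.com/c", "example.com/d", "example.com/e"]

shared, ratio = overlap(singular, plural)
print(sorted(shared), ratio)
```

A ratio near 1.0 would suggest the engine treats the singular and plural queries almost identically, while a ratio near 0.0 would suggest it treats them as separate searches.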
When someone searches at one of the major search engines, they often type in keyword phrases, composed as if they were written for human readers. Those phrases may contain words or phrases that show up very frequently in pages on the web, and have little to do with the information being sought by the searcher.
Search engines that focus upon retrieving search results based upon keywords found in queries have often ignored those frequently appearing and irrelevant words contained within search query terms.
Those words have been referred to in the past by Google as “stopwords,” and could be words like: a, and, is, on, of, or, the, was, with. Similar groups of words that appear very commonly on web pages, and are also unconnected to an actual search, could be referred to as “stop-phrases.”
The word “a” in the query “a London hotel” is a stopword.
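To make the idea concrete, here is a minimal sketch of stopword removal using the example words listed above. The `strip_stopwords` function is a hypothetical name for this illustration, and the approach is a simplification; it is not a description of how Google or any other search engine actually handles these words.

```python
# The example stopwords listed above; a real list would be much longer.
STOPWORDS = {"a", "and", "is", "on", "of", "or", "the", "was", "with"}


def strip_stopwords(query):
    """Lowercase the query and drop stopwords, keeping the remaining terms in order."""
    terms = query.lower().split()
    kept = [term for term in terms if term not in STOPWORDS]
    # If every term was a stopword, fall back to the original terms
    # rather than searching for nothing at all.
    return kept or terms


print(strip_stopwords("a London hotel"))  # ['london', 'hotel']
```

The fallback on the last line of the function is a design choice for this sketch: a query made up entirely of common words would otherwise become empty.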
Does it matter that Google knows about a trillion Web Addresses?
Is it important that new search engine Cuil has 120 billion pages indexed, which according to Cuil is “three times more than any other search engine”?
The more pages that are known about by a search engine, the more difficult it might be to provide the “best” pages in response to a search, or personalized results according to a searcher’s interests.
But what if a couple of search engineers told you that a study of a major commercial search engine’s log files showed that while there are “a lot of pages out there, there are not that many pages that people actually go to”?
Their paper discusses the number of useful pages on the web, or pages that people actually search through as opposed to all of the pages that are indexed, and explores the concept of information entropy related to search indexes.
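For readers unfamiliar with information entropy, the sketch below computes the standard Shannon entropy, in bits, of a distribution of page visits. It is only the textbook definition applied to made-up visit counts; the `entropy_bits` function is a hypothetical name, and this is not a reproduction of the method or the data in the paper.

```python
import math


def entropy_bits(visit_counts):
    """Shannon entropy (in bits) of the visit distribution over pages."""
    total = sum(visit_counts)
    if total == 0:
        return 0.0
    entropy = 0.0
    for count in visit_counts:
        if count > 0:
            p = count / total
            entropy -= p * math.log2(p)
    return entropy


# Made-up visit counts for four pages, for illustration only.
even_visits = [25, 25, 25, 25]   # every page is visited equally often
skewed_visits = [97, 1, 1, 1]    # almost all visits go to one page

print(entropy_bits(even_visits))
print(entropy_bits(skewed_visits))
```

When visits are spread evenly, entropy is at its maximum (two bits for four pages). When most visits go to a handful of pages, entropy drops, which is one way of expressing the idea that the pages people actually go to are far fewer than the pages that are indexed.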
You search for “Foo Fighters,” and the search engine takes your query and starts searching its databases to identify results. It might look through a video database, to see if there are any good videos to show you. It may dig through a News database to see if there was any recent news tied to the phrase, or an Image database to see if there were any popular pictures of the band. The search engine may see if any advertisers were running campaigns using the band’s name.
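The fan-out described above, where one query is checked against several separate databases, can be sketched very roughly as follows. The `VERTICALS` dictionary, its contents, and the `search_all` function are all hypothetical stand-ins; a real search engine would query separate indexes with far more sophisticated matching than a substring test.

```python
# Tiny in-memory stand-ins for separate databases; all entries are made up.
VERTICALS = {
    "video": ["Foo Fighters live concert clip", "Pasta cooking tutorial"],
    "news": ["Local election results", "Foo Fighters announce tour"],
    "images": ["Sunset over the bay"],
    "ads": ["Foo Fighters tickets on sale"],
}


def search_all(query, verticals=VERTICALS):
    """Check each database for the exact phrase and return only those with hits."""
    needle = query.lower()
    results = {}
    for name, documents in verticals.items():
        hits = [doc for doc in documents if needle in doc.lower()]
        if hits:
            results[name] = hits
    return results


print(search_all("Foo Fighters"))
```

Databases with no matches, such as the image collection in this toy data, are simply left out of the combined results.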
Some of that searching is done by taking the exact phrase that you used in your search, “Foo Fighters,” to find a set of results that you might be satisfied with. But there are steps that a search engine could take that might give you even better results.
Associating Search Terms with Categories