A recent comment here noted that the core algorithm behind how Google works hasn’t changed very much since its earliest days. I’m not sure that I agree. Many of the posts I’ve made over the past five years that involve Google patents and whitepapers describe ways that Google may be changing how it determines which results to show searchers.
Many of the changes Google makes to its algorithms aren't visible to its users, while others, which change the interface we see when we search, tend to stand out more. Interestingly, many changes that Google makes are based upon live tests that we may catch a glimpse of if we are lucky and pay attention.
Google’s Testing Infrastructure
At Google, experimentation is practically a mantra; we evaluate almost every change that potentially affects what our users experience. Such changes include not only obvious user-visible changes such as modifications to a user interface, but also more subtle changes such as different machine learning algorithms that might affect ranking or content selection…
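One common way to run a live test like the ones described above is to divert a small, stable slice of traffic to the experimental treatment. Here's a minimal Python sketch of that idea; the bucket count, cutoff, and function names are my own invention for illustration, and Google's real infrastructure layers many overlapping experiments at once.

```python
import hashlib

# A stable user identifier is hashed into one of 1,000 buckets, and a
# fixed range of buckets receives the experimental treatment. The
# numbers here are invented; real systems run many experiments at once.

NUM_BUCKETS = 1000
EXPERIMENT_BUCKETS = range(10)  # roughly 1% of traffic

def bucket(user_id: str) -> int:
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_BUCKETS

def in_experiment(user_id: str) -> bool:
    # Hashing is deterministic, so the same user always lands in the
    # same bucket and sees a consistent experience during the test.
    return bucket(user_id) in EXPERIMENT_BUCKETS
```

Because the assignment is a pure function of the user identifier, no per-user state needs to be stored to keep the experience consistent, which is part of why this style of bucketing scales to evaluating "almost every change."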
When you arrive at a web page, the owner of that page might start collecting information about your visit for a number of reasons. One of the most commonly collected pieces of information is an internet protocol (or IP) address. An IP address is a number that can be associated with the way and the place that you access the Web.
The Difficulties of Using an IP Address as a Data Point
Your IP address might be assigned to a server or a router that you use to connect to the Web, or a proxy server or firewall that stands between the computer that you are using and the rest of the internet. You might go online on a computer that you share with other people at home or at a public place like a library, or at an office filled with other computers. You might share an IP address with roommates or family on the same computer, or use more than one computer through the same IP address.
A different IP address might be assigned to your connection every time you dial into the internet, or your router might lease an address from your broadband provider for a set period, and that address may change if the lease isn't renewed within a certain amount of time after it expires. If you access the Web through an office, the IP address that the pages you visit can see might be that of your company's firewall.
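To make that ambiguity concrete, here's a short Python sketch of what a page owner's server actually has to work with. The function name, addresses, and header handling are illustrative assumptions; the point is that the socket address may belong to a firewall or proxy, and a forwarded client address is an unverified hint, not an identity.

```python
import ipaddress

def visitor_ip(remote_addr, forwarded_for=None):
    """Pick the most useful address from a request.

    `remote_addr` is whatever machine connected to us; behind an office
    firewall or a proxy, that is the firewall's address, not the
    visitor's. A proxy may pass the original client's address in an
    X-Forwarded-For header, but that header is easily spoofed.
    """
    if forwarded_for:
        # X-Forwarded-For lists the client first, then each proxy hop.
        candidate = forwarded_for.split(",")[0].strip()
        try:
            return str(ipaddress.ip_address(candidate))
        except ValueError:
            pass  # malformed header; fall back to the socket address
    return str(ipaddress.ip_address(remote_addr))

# The same firewall address can stand in for many different visitors:
print(visitor_ip("203.0.113.7"))                  # the firewall itself
print(visitor_ip("203.0.113.7", "198.51.100.23")) # a forwarded client
```

Even in the best case, the resolved address identifies a connection point, not a person, which is exactly the difficulty of treating an IP address as a data point.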
Two Microsoft papers being presented at this week's SIGIR'10 conference in Geneva, Switzerland, explore the topic of search trails – the pages that a searcher travels through after performing a search for a query, before reaching a final destination page.
The idea of delivering searchers to a final destination page, a page where previous searchers for a specific query often ended up before they either stopped searching or changed the focus of their search, is something that Microsoft has explored in the past.
I wrote about a patent filing from Microsoft a couple of years ago which explored how user behavior signals, such as how searchers browse through pages to find information, might be used to rerank search results. The post, Search Trails: Destinations, Interactive Hubs, and Way Stations, took a look at how search trails – the pages browsed between an initial query and a final page visited – might also offer useful query suggestions to searchers.
That patent filing, and the 2007 SIGIR best paper, Studying the Use of Popular Destinations to Enhance Web Search Interaction (pdf) by Ryen W. White, Mikhail Bilenko, and Silviu Cucerzan, focused more upon the final destination pages found than the pages visited along the way. Ryen White is listed as a co-author in the earlier papers and patent filing on search trails, and he is one of the authors listed on the papers presented this week in Switzerland as well.
A search engine might use two sets of indexes – one for query terms that tend to show up in more searches and on more web pages, and another larger index that includes queries that aren’t searched for as much by searchers and don’t appear on many web pages.
By answering queries for some terms only from the smaller index, a search engine can retrieve information about pages that include those terms more quickly.
How would a search engine know which queries to answer from the smaller index, and which to search for within the larger one?
I’ve written in the past about a patent on an extended index from Google, as well as a patent on a supplemental index from Microsoft, and both patents focused mostly upon how those indexes might be set up.
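One plausible answer is simply to keep a list of the frequent terms and route each query accordingly. The toy Python sketch below shows that routing; the term list, index contents, and URLs are all invented for illustration, and neither patent spells out this exact mechanism.

```python
# Common terms live in a small, fast index; everything else falls
# through to a larger index. The term set and contents are invented.

COMMON_TERMS = {"weather", "news", "maps"}

small_index = {
    "weather": ["weather.example.com", "forecast.example.com"],
}
large_index = {
    "weather": ["weather.example.com", "forecast.example.com"],
    "hapax": ["rare-words.example.com"],
}

def lookup(term):
    """Route a query term to the appropriate index."""
    if term in COMMON_TERMS:
        return small_index.get(term, [])
    return large_index.get(term, [])
```

The routing decision here costs a single set-membership check, which is why splitting indexes by term frequency can pay off even though some pages end up stored twice.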
You may know him by a number of names or titles – Governor of California, Terminator, Governator, Conan the Barbarian, Kindergarten Cop, Mr. Universe, Mr. Olympia, Arnold Strong, Arnie, The Austrian Oak.
To Metaweb, Arnold Schwarzenegger is referred to as 9202a8c04000641f8000000000006567.
Who is Metaweb?
Metaweb is a company recently acquired by Google, and they've created a system of indexing named entities that allows you to search for information in a new way. Actually, the idea sounds a little like a library's Dewey Decimal System, but for named entities. Why is this important, and what is a named entity?
A named entity is a specific person, place, or thing. For example, named entities can include Barack Obama, or the Commonwealth of Virginia, or the Great American Ballpark in Cincinnati. Associating unique identification numbers with named entities can make it easier to index them, and to find information about those named entities when they might be referred to by different names, as in my example above about Arnold Schwarzenegger. They can also help with local search, by allowing specific places or businesses or landmarks to have unique identification numbers.
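In code, the core of the idea is just an alias table that maps every surface name to one canonical identifier. Here's a minimal Python sketch; the identifier is the Metaweb ID quoted above for Arnold Schwarzenegger, but the alias table, fact record, and function names are my own illustration, not Metaweb's actual API.

```python
# Sketch of alias resolution against a Metaweb-style entity ID.
# The alias table and fact record are invented for illustration.

ENTITY_ID = "9202a8c04000641f8000000000006567"

aliases = {
    "arnold schwarzenegger": ENTITY_ID,
    "the governator": ENTITY_ID,
    "the austrian oak": ENTITY_ID,
}

facts = {ENTITY_ID: {"type": "person", "office": "Governor of California"}}

def resolve(name):
    """Map any known name for an entity to its single identifier."""
    return aliases.get(name.strip().lower())

# Different surface names reach the same indexed record:
assert resolve("The Governator") == resolve("Arnold Schwarzenegger")
```

Because every alias funnels into one key, facts only need to be stored and updated once per entity, no matter how many names searchers use for it.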
How often do named entities appear in Web searches? A recent paper from Microsoft, Building Taxonomy of Web Search Intents for Name Entity Queries (pdf), tells us that they are pretty common:
Information about where searchers hover their mouse pointers over different parts of search results, as well as over advertisements and Google Onebox results, may be collected by the search engine and used as ranking signals to help determine how relevant those items are to Google users in response to a search query.
When I view the contents of a web page, I often find myself moving my mouse pointer along the areas that I am viewing. There are a couple of reasons for this. One is that it makes it easier to focus upon the part of the page that I’m looking at. Another is that it’s easier to click upon a link that I find interesting if my pointer is near what I’m viewing.
According to Google, I may not be alone in this kind of behavior. Google may track mouse movements on its search results pages to help rank pages that show up in search results, to determine the quality of sponsored ads within those search results, and to decide whether or not showing onebox results such as maps or definitions or news or stock quotes is appropriate for some search queries.
When Google ranks web pages, it considers a wide range of ranking signals, such as how relevant a page might be to keywords used by a searcher, the quality and quantity of links pointing to that page, and user-behavior data collected about that page.
A number of patent filings and whitepapers from Google have told us that Google might collect a fair amount of user-behavior data about how we browse web pages, such as how long we might spend on pages, how far we might scroll down those pages, which pages we might click upon in search results, which pages we might not click upon, which links we might follow when we visit pages, whether we print or bookmark or save pages, and more.
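As a rough illustration of how hover data could feed a signal like this, here is a toy Python aggregation. The event format, field names, and the sum-of-dwell-time scoring rule are assumptions for the sketch, not Google's actual pipeline.

```python
from collections import defaultdict

# Toy aggregation of mouse-hover dwell times over search results.
# Event shape and the scoring rule are invented for illustration.

def hover_scores(events):
    """Sum hover milliseconds per result URL as a crude interest signal."""
    totals = defaultdict(int)
    for event in events:
        totals[event["url"]] += event["hover_ms"]
    return dict(totals)

events = [
    {"url": "a.example.com", "hover_ms": 1200},
    {"url": "b.example.com", "hover_ms": 300},
    {"url": "a.example.com", "hover_ms": 800},
]
print(hover_scores(events))  # a.example.com accumulates 2,000 ms in total
```

Aggregated across many searchers, a result that draws long hovers but few clicks tells a different story than one that draws neither, which is the kind of distinction hover tracking could add on top of click data.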
When you search on Bing, sometimes instead of seeing an ordered list of search results, you might see search results broken up into categories. For example, if you search for “Virginia,” your search results start off with an image and link to the state web site, as well as a map. You then see a couple of search results that look pretty relevant for the term.
What comes next is a little interesting. Instead of showing you just more links to web pages like you might see at Google or Yahoo, Bing starts showing you groupings of additional web pages organized by category. There’s a Virginia map category, then Virginia Tourism followed by Virginia Facts, then Virginia Jobs, and finally, Virginia History.
This diversification and grouping of search results is a departure from a paradigm commonly followed by many search engines. When a query term might have more than one meaning, or different categories of results might be equally useful to searchers, Bing may decide to present those search results in different categories, like it does on a search for Virginia. Here’s the first category shown in the Bing results on a search for Virginia:
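Setting the screenshot aside, the grouping itself is easy to sketch: assign each result a category, then emit the buckets in a fixed order instead of one flat ranked list. The categories below echo the Virginia example, but the URLs, category assignments, and ordering rule are invented for illustration and are not Bing's actual method.

```python
from collections import OrderedDict

# Toy category-grouped results for the "Virginia" example above.
# URLs and category assignments are invented.

CATEGORY_ORDER = ["Maps", "Tourism", "Facts", "Jobs", "History"]

results = [
    ("virginia.gov/facts", "Facts"),
    ("maps.example.com/virginia", "Maps"),
    ("virginia.org", "Tourism"),
    ("jobs.example.com/va", "Jobs"),
    ("history.example.com/va", "History"),
]

def group_results(results):
    """Bucket (url, category) pairs into a fixed category order."""
    grouped = OrderedDict((c, []) for c in CATEGORY_ORDER)
    for url, category in results:
        grouped.setdefault(category, []).append(url)
    return grouped
```

The interesting design question is less the bucketing than when to trigger it: diversifying like this only helps when a query is ambiguous or its categories are roughly equally useful, which is exactly the condition described above.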
People still read books. I started on Nudge: Improving Decisions About Health, Wealth, and Happiness not long ago. I’m about a fifth of the way through, and I’ve already added “Choice architecture” to my list of concepts to study more, and I’m looking more carefully at the choices I make.
Seeing a lot of intriguing search patents published by Yahoo over the past few months, and that’s made me sad. I don’t know if they will end up in the graveyard of unfulfilled intellectual property, or migrate to Redmond, Washington, with Microsoft taking over Yahoo’s search results.
My favorite baseball team is in first place in their division after more than a decade straight of losing seasons (Go Reds!). Part of the reason for their winning comes from a few trades that have turned out better than expected, and part comes from an improved minor league system. I can't help thinking of that as I watch Yahoo search engineers move to Microsoft or begin startups of their own. Also wondering whether the Yahoo/Bing search merger has helped to make Google stronger, especially when observing things like Yahoo's Chief Scientist of Search choosing to join Google instead of Bing.
Seeing too many Search Engine Optimization tools that include keyword density calculators. Please stop.