Your website may be invaded by robots at any time. If you're lucky, that is: at least if you want people to visit you from places like Google or Yahoo or Bing, and if the visiting robots are polite.
In the early days of the Web, automated programs known as robots, or bots, were created to find information on the Web and to build indexes of that information. They would do this regardless of whether you wanted them to visit your pages, and you had no way to tell them to stay away from your website.
If you search through Usenet message boards from the early days of the Web, you might come across a document such as the World Wide Web Frequently Asked Questions (FAQ), Part 1/2 (December, 1994), which describes robots in those days:
4.10: Hey, I know, I’ll write a WWW-exploring robot! Why not?
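Today, of course, polite robots announce themselves and obey the Robots Exclusion Protocol. As a small sketch of what "polite" crawling looks like in practice, Python's standard library can parse a robots.txt file and answer the two questions a well-behaved bot asks before fetching a page (the rules and URLs below are invented for illustration):

```python
from urllib.robotparser import RobotFileParser

# A hypothetical robots.txt, supplied as a list of lines rather than
# fetched from a live site, so the sketch stays self-contained.
rules = [
    "User-agent: *",
    "Disallow: /private/",
    "Crawl-delay: 10",
]

rp = RobotFileParser()
rp.parse(rules)

# A polite crawler checks permission before fetching a URL...
print(rp.can_fetch("AnyBot", "https://example.com/private/page.html"))  # False
print(rp.can_fetch("AnyBot", "https://example.com/index.html"))         # True

# ...and honors the site's requested delay between successive requests.
print(rp.crawl_delay("AnyBot"))  # 10
```

The crawl-delay check is the part most relevant to the patent's notion of politeness: spacing out requests so a crawler doesn't overwhelm the server it is visiting.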
Continue reading Google Patent Granted on Polite Web Crawling
Google’s recent purchase of Metaweb, who run the Freebase directory left many wondering at the motivations behind the acquisition. Did Google buy the company for its technology, for its Freebase directory, for the expertise of its employees?
A Google patent application published today hints at one reason behind the deal, with a mention of Metaweb’s Freebase, and how it could be used by Google in a process that may expand the amount of information that the search giant shows us about specific people, places, and things (including ideas and concepts such as democracy) in search results.
It might also result in search results that are mashups of different information relating to queries involving named entities, such as seen in the image below:
Continue reading Google and Metaweb: Named Entities and Mashup Search Results?
Two Microsoft papers being presented at this week's SIGIR'10 conference in Geneva, Switzerland, explore the topic of search trails: the pages that a searcher travels through after submitting a query, before reaching a final destination page.
The idea of delivering searchers to a final destination page, a page where previous searchers for a specific query often end up before they either stop searching or change the focus of their search, is something that Microsoft has explored in the past.
I wrote about a patent filing from Microsoft a couple of years ago which explored how user-behavior signals, such as how searchers browse through pages to find information, might be used to rerank search results. The post, Search Trails: Destinations, Interactive Hubs, and Way Stations, took a look at how search trails, the pages browsed between an initial query and a final page visited, might offer useful query suggestions to searchers as well.
That patent filing, and the 2007 SIGIR best paper, Studying the Use of Popular Destinations to Enhance Web Search Interaction (pdf) by Ryen W. White, Mikhail Bilenko, and Silviu Cucerzan, focused more upon the final destination pages found than the pages visited along the way. Ryen White is listed as a co-author on that earlier paper and patent filing on search trails, and he is one of the authors listed on the papers presented this week in Switzerland as well.
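As a toy illustration of the destination-page idea (the trails, URLs, and counts below are all invented, and the real systems are far more sophisticated), popular destinations for a query can be surfaced simply by tallying the last page in each recorded trail:

```python
from collections import Counter

# Invented example trails for one query: each is the sequence of pages a
# searcher visited after the results page; the last page is the destination.
trails = [
    ["serp", "va-tourism.example/plan", "va-tourism.example/parks"],
    ["serp", "travel-blog.example/virginia", "va-tourism.example/parks"],
    ["serp", "va-tourism.example/parks"],
    ["serp", "wiki.example/Virginia"],
]

# Tally final pages; frequently reached destinations could then be offered
# to later searchers as shortcuts past the intermediate pages.
destinations = Counter(trail[-1] for trail in trails)
print(destinations.most_common(1))  # [('va-tourism.example/parks', 3)]
```

The point of the sketch is the shift in emphasis the papers describe: from the single destination page to the whole journey that leads there.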
Continue reading The Importance of the Journey: Search Trails and Destination Pages
Information about where searchers hover their mouse pointers over different parts of search results, as well as over advertisements and Google Onebox results, may be collected by the search engine as a ranking signal, helping to determine in part how relevant those items may be to Google users in response to a search query.
When I view the contents of a web page, I often find myself moving my mouse pointer along the areas that I am viewing. There are a couple of reasons for this. One is that it makes it easier to focus upon the part of the page that I’m looking at. Another is that it’s easier to click upon a link that I find interesting if my pointer is near what I’m viewing.
According to Google, I may not be alone in this kind of behavior. Google may track mouse movements on its search results pages to help rank pages that show up in search results, to determine the quality of sponsored ads within those search results, and to decide whether or not showing onebox results such as maps or definitions or news or stock quotes is appropriate for some search queries.
When Google ranks web pages, it considers a wide range of ranking signals, such as how relevant a page might be to keywords used by a searcher, the quality and quantity of links pointing to that page, and user-behavior data collected about that page.
A number of patent filings and whitepapers from Google have told us that Google might collect a fair amount of user-behavior data about how we browse web pages, such as how long we might spend on pages, how far we might scroll down those pages, which pages we might click upon in search results, which pages we might not click upon, which links we might follow when we visit pages, whether we print or bookmark or save pages, and more.
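To make the hover idea concrete, here's a minimal sketch (the events, URLs, and the 250 ms noise threshold are all invented, not anything from Google's patent) of how raw pointer-dwell events might be aggregated into a per-result attention score:

```python
from collections import defaultdict

# Invented hover events: (result_url, milliseconds the pointer lingered).
events = [
    ("a.example", 1200),
    ("b.example", 400),
    ("a.example", 800),
    ("c.example", 90),
]

# Sum hover time per result, discarding brief passes (under an arbitrary
# 250 ms threshold) as pointer-in-transit noise.
hover_ms = defaultdict(int)
for url, ms in events:
    if ms >= 250:
        hover_ms[url] += ms

# Normalize into a 0-1 attention share that could serve as one signal
# among many, never a ranking on its own.
total = sum(hover_ms.values())
scores = {url: ms / total for url, ms in hover_ms.items()}
print({url: round(score, 3) for url, score in scores.items()})
```

Any real system would also have to separate deliberate hovering from idle pointer parking, which is part of what makes this signal noisy.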
Continue reading Where you Point Your Mouse May Influence Google Search Rankings, Advertisement Placement, and Oneboxes
When you search on Bing, sometimes instead of seeing an ordered list of search results, you might see search results broken up into categories. For example, if you search for “Virginia,” your search results start off with an image and link to the state web site, as well as a map. You then see a couple of search results that look pretty relevant for the term.
What comes next is a little interesting. Instead of showing you just more links to web pages like you might see at Google or Yahoo, Bing starts showing you groupings of additional web pages organized by category. There’s a Virginia map category, then Virginia Tourism followed by Virginia Facts, then Virginia Jobs, and finally, Virginia History.
This diversification and grouping of search results is a departure from a paradigm commonly followed by many search engines. When a query term might have more than one meaning, or different categories of results might be equally useful to searchers, Bing may decide to present those search results in different categories, like it does on a search for Virginia. Here’s the first category shown in the Bing results on a search for Virginia:
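The grouping step itself is simple to picture. Here's an invented sketch (the URLs and category labels are made up, and Bing's actual classification is of course far more involved) of organizing ranked results into categories while preserving the order in which categories first appear:

```python
# Invented results tagged with a category; list order reflects rank.
results = [
    ("virginia.gov", "Official"),
    ("maps.example/virginia", "Maps"),
    ("visitva.example", "Tourism"),
    ("maps.example/va-roads", "Maps"),
    ("vahistory.example", "History"),
]

# Group results by category, keeping categories in first-appearance order,
# roughly how a categorized results page might be laid out.
groups: dict = {}
for url, category in results:
    groups.setdefault(category, []).append(url)

for category, urls in groups.items():
    print(category, urls)
```

The hard part, which the sketch skips entirely, is deciding which categories a query like "Virginia" warrants and assigning each result to one.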
Continue reading Bing’s Categorized Search Results
How does a search engine choose when to show news items in web search results, and when not to?
If you live in Bealton, Virginia, chances are that you may not be too interested in news of a car crash in Brooklyn, New York, when searching for information about Brooklyn. If you're from Brooklyn, and want to find vacation information about the parks in Wisconsin, you may not be very concerned about the latest winning numbers in the Wisconsin lottery. Yet, someone searching for information about one of the states bordering the Gulf of Mexico these days might be likely to want to see news about the oil spill in the region.
A Yahoo patent filing published recently describes how they might use a prediction system based upon the search engine’s query logs to decide whether or not to show news results. The prediction system uses a mixture of geographic information related to queries and to searchers as well as information about how “newsworthy” a location might be to make that determination. The patent tells us that it might create similar prediction models to determine whether or not to show other types of results as well. The patent application is:
System and Method of Geo-Based Prediction in Search Result Selection
Invented by Rosie Jones, Fernando Diaz, and Ahmed Hassan Awadallah
US Patent Application 20100161591
Published June 24, 2010
Filed December 22, 2008
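The patent's prediction system is learned from query logs, but the intuition behind it can be sketched with a toy decision rule (entirely invented here, including every weight and threshold, and not the patent's actual model):

```python
def show_news(newsworthiness: float, same_region: bool,
              national_interest: float, threshold: float = 0.5) -> bool:
    """Toy rule: show news when a location's current newsworthiness,
    boosted for searchers in the same region and by any national-level
    interest in the location, clears a threshold. All values invented."""
    proximity_boost = 1.5 if same_region else 1.0
    score = newsworthiness * proximity_boost + national_interest
    return score >= threshold

# A Brooklyn car crash: locally newsworthy, but not to a searcher in Virginia.
print(show_news(newsworthiness=0.3, same_region=False, national_interest=0.0))  # False
# The same event for a searcher in Brooklyn clears the bar.
print(show_news(newsworthiness=0.3, same_region=True, national_interest=0.2))   # True
# The Gulf oil spill: high national interest, shown regardless of region.
print(show_news(newsworthiness=0.4, same_region=False, national_interest=0.6))  # True
```

The patent replaces hand-set numbers like these with predictions derived from what past searchers actually clicked.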
Continue reading How Search Engines May Use Geography and Population Info in Deciding to Show News in Web Searches
Optical Character Recognition, or OCR, is a technology that enables a computer to look at pictures that include text, and to translate those visual representations into machine-readable text. If you have words within images on your web pages, there's a good chance that search engines are ignoring those words when it comes to indexing your pages.
But that might change sometime in the future.
While OCR has been around for a while, search engines haven’t been using the technology when crawling and indexing the content of Web pages. Google’s webmaster guidelines tell us:
Try to use text instead of images to display important names, content, or links. The Google crawler doesn’t recognize text contained in images. If you must use images for textual content, consider using the “ALT” attribute to include a few words of descriptive text.
Yahoo’s page, How to Improve the Position of Your Website in Yahoo! Search Results provides the following tip:
Continue reading Teaching Computers to Read Newspapers: How a Search Engine Might Use OCR to Index Complex Printed Pages
Earlier this month I wrote about a granted Google patent, and a continuation of that patent filed earlier this year, that describe How Google Might Suggest Topics for You to Write About by providing web publishers with information on queries and topics that are under-represented in search results, where there's more demand for information than there are search results to meet that demand.
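That demand-versus-supply idea reduces to a very simple shape. As a hedged illustration (the function, its inputs, and every number below are invented for this post, not taken from either company's patents), an "under-served topic" score might just compare searcher demand to the content available to meet it:

```python
def underserved_score(monthly_queries: int, matching_results: int) -> float:
    """Invented illustration: ratio of searcher demand to available supply.
    A high score marks a topic with more demand than content meeting it."""
    return monthly_queries / (1 + matching_results)

# Invented numbers: many searches, few pages -> a topic worth writing about.
print(underserved_score(monthly_queries=5000, matching_results=24))    # 200.0
# The same demand against abundant supply scores as already well covered.
print(underserved_score(monthly_queries=5000, matching_results=9999))  # 0.5
```

Content farms layer revenue estimates on top of a score like this; the patents discussed below suggest both Google and Demand Media have formalized versions of the same comparison.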
The topic struck home with a number of people, especially journalists, and I had a chance to have a conversation with Financial Times (FT.com) reporter Kenneth Li about Google's patents. The Financial Times ran two stories on the topic (Google shadow over new media groups, and Google eyes Demand Media's way with words), focusing primarily on how the technology involved in the patents could bring Google into competition with companies such as Demand Media, Associated Content, and AOL.
While searching through patent filings this morning, I came across an interesting newly published patent application from Demand Media. In the FT.com article on Demand Media, we’re told that:
Continue reading How Demand Media May Target Keywords for Profitability