A new patent application from Google looks at prefetching and preloading pages into a browser based upon mouse movements, and at the use of both a client-based cache and a server-based cache.
The inventors listed in the application (US Patent document number 20060047804) are Eric Russell Fredricksen, Paul Buchheit, and Jeffrey Glen Rennie. It was filed on June 30, 2004, assigned to Google on October 10, 2004, and published on March 2, 2006.
A client assistant, sometimes called a browser helper, runs on a client computer. The client assistant monitors a user’s browsing activities and infers one or more next documents that are most likely to be requested by the user. The client assistant attempts to locate a fresh copy of the inferred next document within a client cache. If a fresh copy of the inferred document is not found in the client cache, the client assistant submits a document download request to a document server.
What is interesting about this patent application is that it looks at mouse behavior to try to guess whether or not documents should be preloaded into a client-side cache. It also includes a persistent connection to a datastream that allows for two way communication between a client computer and a server, enabling streams of information to flow back and forth.
Chances are that the Google toolbar may be doing something like this already to feed PageRank indications to the toolbar. Collecting this type of information may increase the amount of data that Google gathers on user behavior, which could possibly be used to influence search results.
It is important to keep in mind that this is still just a patent application, and what it describes may not happen, or at least may not happen in quite the fashion described. And it possibly may not be used even if the application is granted. But the idea of such interaction between a search engine and its user is worth considering. Is it something that Google may already be doing?
We know that Google will prefetch pages at the top of search results when people search using Firefox or another Mozilla-based browser. The Google Web Accelerator also works to prefetch and preload pages.
Does this patent application partially describe how their Web Accelerator works? It’s difficult to tell, but it might give some insight into some of the processes involved. The webmaster help page for the Web Accelerator does tell us that they look at mouse movements:
Google Web Accelerator decides what links should be prefetched based on aggregate usage statistics as well as the user’s mouse movements.
It’s also important to note that it is possible for a webmaster to tell the Web Accelerator about pages that people might be interested in, and likely to visit, by using a link element like the one that follows:
Prefetch hints: As a webmaster, you know best what links users are likely to click on, and if you specify links to prefetch, these will be sped up for users who have Google Web Accelerator installed. For each link you’d like Google Web Accelerator to prefetch, simply add the following snippet somewhere in your page’s HTML source code: <link rel="prefetch" href="http://url/to/get/">. The href value should be the actual URL you want prefetched. Google will prefetch this page, and when your users click on this link, that page will load more quickly.
Related patent application included
This patent application is related to another one that the inventors have filed, “System and Method of Accessing a Document Efficiently Through Multi-Tier Web Caching”, which doesn’t appear to have been published yet by the US Patent office. Aspects of it are likely incorporated into this document and include the use of a local cache accessed through a browser helper, such as a toolbar or web accelerator, and a server-based cache that may run across multiple servers.
Prediction and prefetching based upon pointer movements
This patent application involves expediting a browser’s access to pages on a network by predicting a user’s next document selection. It may look in a local cache on the client computer before going to a server (or the web) to check for a cached copy of a document there.
Grabbing a copy of a document may be initiated before the mouse is clicked, based either upon how close the mouse pointer is to a link to the document, or upon whether the mouse pointer is moving towards the link. Prefetching could also start when a user hovers over a link, or when the mouse button is pressed down (instead of waiting for the person to release it).
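To make the pointer-based triggers a little more concrete, here is a minimal sketch of how nearness and direction of movement could be tested. The patent application does not give a formula, so the names (Link, should_prefetch) and the thresholds (near_px, min_cos) are my own assumptions, used only for illustration.

```python
import math
from dataclasses import dataclass


@dataclass
class Link:
    url: str
    x: float  # centre of the link on screen, in pixels (assumed layout data)
    y: float


def should_prefetch(link, pointer, velocity, near_px=40.0, min_cos=0.9):
    """Return True if the pointer is near the link or heading towards it.

    near_px and min_cos are illustrative thresholds, not values from the patent.
    """
    dx, dy = link.x - pointer[0], link.y - pointer[1]
    distance = math.hypot(dx, dy)
    if distance <= near_px:
        return True  # the pointer is already close to the link
    speed = math.hypot(velocity[0], velocity[1])
    if speed == 0:
        return False  # far away and not moving
    # Cosine of the angle between the pointer's motion and the direction to the link.
    cos_angle = (dx * velocity[0] + dy * velocity[1]) / (distance * speed)
    return cos_angle >= min_cos
```

A hover or mouse-down trigger would not need any of this geometry, since the browser already knows which link is under the pointer in those cases.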
Rather than storing the whole document in a cache, what may be located there could be:
1) A content fingerprint uniquely identifying a particular version of the document.
2) A URL fingerprint uniquely identifying the source of the document.
3) Content freshness parameters (such as the date field and expiration date in HTTP headers).
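The three items above could be pictured as a small record like the one below. This is only a sketch: the patent application does not say how the fingerprints are computed, so the use of SHA-256 and the names CacheRecord and make_record are assumptions on my part.

```python
import hashlib
from dataclasses import dataclass
from typing import Optional


def fingerprint(data: bytes) -> str:
    # SHA-256 is an assumed choice; the application does not name a hash function.
    return hashlib.sha256(data).hexdigest()


@dataclass
class CacheRecord:
    url_fingerprint: str       # identifies the source of the document
    content_fingerprint: str   # identifies one particular version of it
    date: Optional[str]        # freshness parameters taken from HTTP headers
    expires: Optional[str]


def make_record(url: str, body: bytes, headers: dict) -> CacheRecord:
    return CacheRecord(
        url_fingerprint=fingerprint(url.encode("utf-8")),
        content_fingerprint=fingerprint(body),
        date=headers.get("Date"),
        expires=headers.get("Expires"),
    )
```

Two versions of the same page would share a URL fingerprint but have different content fingerprints, which is what would let a cache tell whether its copy has changed.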
Some of this predictive behavior depends on balancing the amount of time saved in obtaining a document (latency reduction) against the effort of prefetching and preloading that document. That balance helps determine whether or not to start downloading a document based upon the movement or nearness of a mouse pointer (or other pointing device).
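One way to picture that balancing act is as a simple expected-value comparison. The patent application does not spell out a formula, so the function below, including its name and all three parameters, is a hypothetical illustration rather than the method described.

```python
def worth_prefetching(click_probability, latency_saved_ms, prefetch_cost_ms):
    """Compare the expected time saved with the expected wasted effort.

    All three inputs are assumed estimates; nothing here comes from the patent.
    """
    expected_saving = click_probability * latency_saved_ms
    expected_waste = (1 - click_probability) * prefetch_cost_ms
    return expected_saving > expected_waste
```

Under this sketch, a link the user is very likely to click is worth fetching early even if the download is expensive, while an unlikely link is only worth it when fetching costs almost nothing.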
Stale client cache documents
The document being loaded could already be in a cache on the computer a person is using. If it is, it will be reviewed to see if it should be considered stale, or potentially stale, based on some freshness parameters stored in that client cache.
One of those freshness parameters could be a host specified expiration date/time. If so, it could be considered stale when the current date/time is later than that specified expiration on the cached document.
Because many documents do not have a host specified expiration date/time, the client assistant may follow a policy to determine which cached documents to treat as stale.
1) Cached documents that have no specified expiration date/time could always be deemed to be stale.
2) Documents with no host specified expiration date/time and that are more than M minutes old are deemed to be stale (where M is any suitable value, such as a value between 5 and 60).
3) Staleness of cached documents without host specified expiration date/time is based, at least in part, on one or more additional freshness parameters stored in the client cache.
4) Staleness of cached documents having no host specified expiration date/time could be based, at least in part, on document type (e.g., html, doc, pdf, etc.).
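A rough sketch of how a couple of these policies might fit together follows. It checks a host-specified expiration first, then falls back to an age limit of M minutes that varies by document type. The specific limits in MAX_AGE_MINUTES and the function name is_stale are made up for illustration; the application only says M might be somewhere between 5 and 60.

```python
from datetime import datetime, timedelta
from typing import Optional

# Assumed per-type values of M, in minutes; not taken from the patent application.
MAX_AGE_MINUTES = {"html": 5, "pdf": 60}
DEFAULT_MAX_AGE_MINUTES = 30


def is_stale(now: datetime, fetched_at: datetime,
             expires: Optional[datetime], doc_type: str = "html") -> bool:
    if expires is not None:
        # Host-specified expiration: stale once the current time has passed it.
        return now > expires
    # No expiration given: fall back to an age limit based on document type.
    limit = MAX_AGE_MINUTES.get(doc_type, DEFAULT_MAX_AGE_MINUTES)
    return now - fetched_at > timedelta(minutes=limit)
```

The first policy in the list (always stale when no expiration is given) would amount to setting every limit to zero.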
A persistent connection
One version of the patent application would call for a persistent connection between the client assistant and a document server to reduce client-server communication latency. The always-on connection could include a control stream and multiple data streams back and forth.
When the document server responds to a document request, the client assistant receives the response, and if the response includes a copy of the requested document, the document is stored in the client cache. If the document copy in the client machine’s cache is the same as the copy about to be downloaded, the client assistant may update the document’s freshness parameters without grabbing a new copy of the document from the web.
Sometimes a document server’s response may also include one or more documents that are embedded within the document requested. If that happens, the client assistant may store those additional documents in the client cache also.
If there is a difference between the cached copy on the client, and the one on the document server, the client assistant may update the document in the local cache.
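The response handling described in the last few paragraphs might look something like the sketch below. The dictionary layout of the cache and the response, and the name handle_response, are my own assumptions; the patent application describes the behavior but not a data format.

```python
def handle_response(cache: dict, url: str, response: dict) -> str:
    """Update a client cache from a document server response (illustrative only)."""
    cached = cache.get(url)
    if cached is not None and cached["content_fingerprint"] == response["content_fingerprint"]:
        # Same version as the cached copy: refresh the freshness parameters only.
        cached["freshness"] = response["freshness"]
        result = "refreshed"
    else:
        # New or changed document: store the copy that came with the response.
        cache[url] = {
            "content_fingerprint": response["content_fingerprint"],
            "freshness": response["freshness"],
            "body": response.get("body"),
        }
        result = "stored"
    # Documents embedded in the response are cached alongside the main one.
    for embedded_url, embedded_doc in response.get("embedded", {}).items():
        cache[embedded_url] = embedded_doc
    return result
```

Comparing fingerprints rather than whole documents is what would let the client skip downloading a page it already holds.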
Actually clicking on a link, choosing one from a favorites list, or entering a URL directly into an address bar may produce similar results with the type of caching described here.
If the requested document isn’t found in the cache, or it isn’t fresh, then the document could be obtained from the document server or from a web host. It’s difficult to tell what type of impact this might have on measuring access to a site using analytics tools. When a page is served from a cache, the request for it wouldn’t reach the web host where the page is located on the web.
The patent application also tells us that after a document is presented, a user’s activities may be monitored for additional actions, but doesn’t detail what those actions may be.
Document Server processes
The other patent application mentioned within this one (System and Method of Accessing a Document Efficiently Through Multi-Tier Web Caching) focuses upon “how the document server responds to a document address.” It appears that we need to wait for that to get more details, but we do get a summary view within this document. Rather than describing that process, I’ll leave it to anyone interested to dig into the patent application for that information.
Why would Google want to help accelerate the web?
It’s important to note that neither patent application involves how a page is indexed in Google. Both instead concern how quickly a page can be loaded into a browser, possibly when someone is looking at search results, or even when visiting any other page on the web.
While there is a benefit to a user of something like the Google Web Accelerator, one question that needs to be asked is “why would they do this?”
One answer is that it provides a more enjoyable surfing experience to someone using a program like that, and could make them feel positive about the company.
Another answer is that it helps the search engine collect user behavior data, and clickstream information, even outside of direct interactions with the search interface. I’d imagine that these interactions would be logged somewhere, and they could be analyzed.
There could be a lot of useful data in knowing how people behave on the web, especially to a company trying to understand the intent of its users as they move about the web.
Linkage data, like that used in PageRank, or derived from looking at anchor text and the words that surround it, is one way of determining how important pages are. Actual usage data may be even more helpful.