When you arrive at a web page, the owner of that page might start collecting information about your visit for a number of reasons. One of the most commonly collected pieces of information is an internet protocol (or IP) address. An IP address is a number that can be associated with the way and the place that you access the Web.
The Difficulties of Using an IP Address as a Data Point
Your IP address might be assigned to a server or a router that you use to connect to the Web, or a proxy server or firewall that stands between the computer that you are using and the rest of the internet. You might go online on a computer that you share with other people at home or at a public place like a library, or at an office filled with other computers. You might share an IP address with roommates or family on the same computer, or use more than one computer through the same IP address.
A unique IP address might be assigned to your internet access every time you dial into the internet, or may be leased by your router on a weekly basis through your broadband provider and may change if that lease isn’t automatically renewed by logging in within a certain amount of time after the lease period is over. If you access the web through an office, your IP address that can be seen by the pages you visit might be that of your company’s firewall.
When you connect through a service such as AOL, you may share an IP address with many other people because you connect to the Web through a proxy server which may cache pages visited by others – so that if you visit a page that someone else has seen recently, you may see a cached copy of that page stored in the proxy server instead of visiting the server the page was published upon initially.
A patent granted to Google this week describes how the search engine might be able to estimate the number of people who might be accessing the web through individual IP addresses, using a number of different approaches.
Why a Search Engine Might Estimate Users Behind an IP Address
Why would Google want to be able to estimate how many people might be behind a single IP address and be able to distinguish between them if possible?
The patent tells us that there are a number of reasons – being able to estimate how many visitors come to your site from IP addresses can be useful in determining:
- Whether or not visits to a page from one or more IP addresses are from a single user, or from multiple users;
- Whether or not ads selected from one or more IP addresses are from a single user or from multiple users;
- Whether or not server resources from one or more IP addresses are from a single user or from multiple users;
- How many user access a Web page or Website;
- How many users viewed certain ad impressions;
- How frequently users visited pages.
There are a number of ways that Google might use this kind of information. For example, Google collects information about user-behavior during query sessions, so that they can see how a searcher might modify their queries when searching for information about a specific topic to try to understand the intent behind a search or a series of searches. There are a number of features that Google offers that benefit from being able to distinguish between different searchers, collecting that data from a large number of searchers, and aggregating and analyzing that information, such as:
- Spelling correction suggestions,
- Query refinement suggestions,
- Determining whether there is a geographical location intent behind a search (and possibly showing maps and local business suggestions),
- Personalization or customization of search results, and
- Diversification of search results
Being able to understand which searches and other interactions with Google originate, from IP addresses, and specific users behind IP addresses can also be useful in:
- Trying to determine if click fraud is happening,
- Determining whether searches, and clicks, and other interactions with search results and advertisements might be automated
- Deciding whether searches, and clicks, and other interactions with search results and advertisements might be manual but evidence an intent to manipulate user-behavior data
- Providing data to users of public tools from Google such as Google Analytics, Google Website Optimizer, Google’s Conversion Tracker
- Analyzing trends in searches, for use with tools like Google Insights for Search, Google Trends, Google Trends for Websites, and Google Hot Trends
- Analyzing trends for internal Google processes that might determine how popular (or bursty) some topics and some web sites might be, including news and blog results
- Determining how popular a web site or advertisement might be
- Determining how “Sticky” a site is
- Collecting user-data to determine which sitelinks to show for a site
- Running many other processes that rely upon distinguishing between individuals to track and measure user-behavior data
While Google uses information found on web pages and in links to web pages to determine the rankings of pages in search results, a number of patent filings and white papers over the past few years hint strongly that Google is also looking closely at user-behavior data to determine how much attention people are paying to different web pages, videos, news results, blogs, and other kinds of documents or objects on the Web. That public attention level may influence how well some sites may rank for different queries.
Cookies and Other Client Identifiers
When you visit a web page, the server it is on may send textual information to your browser, which is sent back to that server every time you access that page. This information is know as a cookie, and it can be used to authenticate your identity (so that you don’t have to log in every time you visit a site), as well as for user tracking, and to maintain specific information such as your preferences for a web site, and what you’ve entered into a shopping cart, and more.
But there are people who purposefully disable cookies on their browsers to avoid being tracked.
Cookies can be part of the solution of estimating how many people might be accessing the web through a specific IP address, but there are other approaches that can help when people have turned cookies off on their browser.
The patent refers to cookies and information about your browser and some other computer settings as client identifiers.
These browser parameters and “user-agent” parameters can include things such as:
- Screen setting information such as screen height/width, available height/width, and color depth,
- Time zone,
- History length,
- Whether or not Java is enabled,
- Number of plug-ins,
- Mime types,
- Type of connection device or program connecting to the web, whether desktop or mobile browser, screen reader or braille browser,
- Host operating System
- Language
- etc.
So, an estimate of the number of users who might be at a specific IP address could be created by looking at a ratio of unique sets of user agent and/or browser parameters for that IP address. Information about browser and other client parameters could be used to “differentiate different users.”
The Google Patent is:
Determining a number of users behind a set of one or more internet protocol (IP) addresses
Invented by Deepak Jindal, Rama Ranganath, Gokul Rajaram, and Fong Shen
Assigned to Google
US Patent 7,761,558
Granted July 20, 2010
Filed June 30, 2006
The patent provides a fair amount of detail on how they might attempt to analyze both cookie and client identifier data to estimate how many different users might be accessing the web at different IP addresses, and some of the assumptions and rules that they might use within that analysis. Some examples include:
- A single cookie at one IP address only is most likely a single user at that address, unless the computer is shared
- A single cookie appearing at a mix of IP addresses may be a single user whose IP address is dynamically changed each time they connect to the Web, or who moves between physical locations when connecting to the Web.
- A single IP address with multiple cookies:
- with a small number of cookies over a period of time, could be a single user visiting through different browsers or computers, or someone who clears or resets their cookies regularly
- with a large number of cookies over a period of time, could be a number of people visiting through a proxy server
Some cookies have a very short lifetime, of less than a day, and we’re told that those might be filtered out of this process because they may come from browsers that don’t accept cookies, or only accept session cookies, or are from first time visitors, or people who clear their cookies daily, or even spammers.
Speaking of spam, the patent tells us that a list of known spam proxies and IP addresses might be maintained that could be used to exclude information from estimates about how many people are behind a specific IP address.
This process could also be used to try to compile a list of suspicious IP addresses, such as an IP address which appears to have a single user behind it but an unusually large number of impressions or conversions. Such an IP address might be listed as a spam address. While the patent doesn’t describe other patterns that might associate addresses with spamming activity, it’s possible that a system like this could potentially look at many other signals as well.
Conclusion
As a searcher, or site owner, or SEO, why should you be concerned about how Google might estimate the number of users behind different IP addresses?
One major reason is that the information collected by a search engine about visits from different IP addresses may influence how a search engine like Google operates in areas such as identifying click fraud, incorporating user-behavior data into search rankings, removing search volume information from keyword tools from automated queries or from people checking rankings instead of searching for information on a specific topic or query term.
It’s also helpful to get a better sense of how much information, and what kinds of information Google might collect about people who use the tools it offers.
Turning cookies off on your browser doesn’t mean that Google might not be able to distinguish between your searches and those of someone else who may share an IP address with you – Google can and will likely use other information to get a sense of how many people are behind a different IP address that can include which browser you use and the add-ons you’ve installed upon it, the size of the window you use to browse with, your browsing preferences, and more.
I think separating genuine search queries from ranking checking queries is definitely good for the strength of the data in their keyword tools. I think it can also help in many digital agencies who are heavy on internet usage and possibly share the same IP… avoiding the “we’re sorry…but your query looks similar to a automated request” Google error page
Hi Matt,
Just how useful are the search volume numbers we see at Google’s keyword tools? We don’t know how effectively Google might be filtering out both automated and manual rank checking when presenting the numbers that they do. I don’t know how easy it is to avoide that “your query looks similar to an automated request” message. I know that I see it if I do more than ten or so “allintitle” type searches in a row.
I am pretty sure they don’t filter out the automated querying from keyword reports. As for the information they are analyzing, all of it can be spoofed. However, one must hope that relatively few if any such applications with so much sophistication exist.
Almost assuredly, there will be many that spoof this data within 2-6 months from the publication of this patent.
What about if the user uses a proxy.That can be misleading information.
Hi Michael,
It’s likely that you’re right about the lack of filtering of automated queries, and the possibility that many of the signals mentioned in this patent will be spoofed in the future. I’m not sure if the granting of the patent will trigger that. Google’s privacy policy FAQ notes many of the same kind of signals being recorded as well:
Hi Andrew,
The use of a proxy server can definitely be misleading. That is one of the things that they noted in the patent, and one of the reasons why they would use information other than just an IP address, such as cookies or characteristics associated with someone’s browser and user agent.
There are tons of proxy applications available today that offer to handle cookies for the user. Google’s ability to monitor this data will be much more accurate in the aggregate than in specific situations.
The EFF is running a large scale experiment called “Panopticlick” to get reasonable data about the uniqueness of each surfer by means of Fingerprinting. You can run the test yourself and see, what identifies your system at panopticklick.eff.org. The summary for my pretty normal notebook is: “Your browser fingerprint appears to be unique among the 1,100,076 tested so far”. If the EFF knows it, Google and Yahoo knows it as well…
Seems like they would use it more to detect click fraud then to personalize the results.
I don’t see them applying this to their search algo.
Hi Michael,
It does appear that if someone tries to find ways to keep information about themselves from being measured and recorded, that they do have some tools that might be helpful. I wonder what percentage of people who use the Web take advantage of applications such as the ones you point out.
Hi Martin,
Thanks for pointing out the Panopticlick pages. I remember running into the project some time back, and considered adding something about it herea, but couldn’t remember the name, and didn’t recall that it was associated with the EFF.
They point out another interesting site, that lets you test what kind of information might be gleaned from your browser at BrowserSpy
Both sites are pretty eye opening, especially pages like the BrowserSpy CSS Exploit page, which may be able to tell you if you visited a number of sites lately, such as twitter or Facebook or eBay.
Hi Montreal web designer,
We know that Google will sometimes customize or personalize search results based upon previous queries that you use, regardless of whether you are logged into Google or accepting a cookie. We don’t know if it will do that if your browser is set up to not accept cookies, but that might be worth testing.
As I noted above, Google does collect massive amounts of user-based data, and in many instances it can be really helpful for them to look at query sessions rather than just individual queries from searchers, which would require them to be able to identify where each of the queries in a query session were coming from. One question about that is whether they are collecting that information just from people logged into their Google accounts, or including people who are being tracked by cookies, or even including people who may not be logged in and may have cookies disabled, but are able to be indentified by the configurations of their browser and user agent. It’s possible that they could be.
I think this is some how a good news to SEO’s as there is nowhere it mentioned about back-links because we know links that are generic don’t value much. Hence, can we assume if an SEO do link methods to a website from an IP needn’t worry about Google?
Hi Australia SEO,
I’m not quite sure that I understand your question. Are you concerned about how Google might track the IP addresses involving links pointing to a site? I suspect that’s something that Google has been keeping an eye upon for a while, though it’s hard to tell how the search engine might or might not be using that information. It’s possible that if the search engine sees a link from an IP address that it has identified as a blacklisted spam site, that might influence how much weight a link might (or might not) pass along, for instance.
Great article.
When I check my stats at Google Analytics, I’m pretty sure that my own visits are counted also in the global number of visitors. So how is that Google differences the visits by IP? Wouldn’t it be more efficient if they substract my own visits?
Thank you
Hi Miguel,
I think Google set things up purposefully so that if someone using Google Analytics wanted to filter out their IP address, they would have to do that themselves. When you sign up for GA, the IP address you use might not be the one that you access your site with on a regular basis for a wide number of reasons. Rather than automatically filter out your IP, they’ve left that decision up to you.
Here’s a description of how to do that from one of Google’s help forums:
Where is the setting that allows Google Analytics to ignore my own views to my site?
Thanks cool how that get all that information from just an IP address.
P.S if you don’t want them to spy on you, you can use tor network (torproject.org).
Hi Iouri
You can turn off cookies as well.
Imagine large businesses with hundreds of employes behind one single IP addresses. That has to be headache for google. But you are right, they must not only check the IP, they must identify who is behind in order to offer a custom experience.