Value in Being Able to Classify Search Query Traffic From Robots and Humans
Some visitors to search engines are people looking for information. Others have different purposes and might not even be human.
Automated visitors may be attempting to check the rankings of pages in search results, conduct keyword research, provide results for games, or even be used to identify sites to spam or to alter click-through rates. Thus, it can be helpful for a search engine to be able to classify search query traffic and understand whether that traffic is coming from human searchers.
Non-human visitors can use up search engine resources and skew user data that a search engine might consider using to modify search rankings and search suggestions.
Google has asked its visitors not to use such programs for many years. In the Google Webmaster Guidelines, they tell us:
Don’t use unauthorized computer programs to submit pages, check rankings, etc. Such programs consume computing resources and violate our Terms of Service.
All of the major commercial search engines have likely developed ways to try to distinguish between human visitors and automated search query traffic or bots.
A recent patent application from Microsoft tells us about some of the ways that it may use to attempt to differentiate between manual and automated searches:
Classifying Search Query Traffic
Invented by Greg Buehrer, Kumar Chellapilla, and Jack W. Stokes
Assigned to Microsoft
US Patent Application 20090265317
Published October 22, 2009
Filed: April 21, 2008
Abstract
A method for classifying search query traffic can involve receiving a plurality of labeled sample search query traffic and generating a feature set partitioned into human physical limit features and query stream behavioral features. A model can be generated using the plurality of labeled sample search query traffic and the feature set. Search query traffic can be received, and the model can be utilized to classify the received search query traffic as generated by a human or automatically generated.
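In other words, the approach is supervised: take traffic samples already labeled as human or automated, compute a feature set split into human physical-limit features and query-stream behavioral features, and train a model that can then classify new traffic. A minimal sketch of that kind of pipeline, assuming scikit-learn is available and using hypothetical per-session feature values (the patent does not name a specific learning algorithm), might look like this:

```python
# A rough sketch of the supervised flow the abstract describes: features are
# computed per query session and a model is trained on labeled examples.
# The feature names and the choice of RandomForestClassifier are assumptions
# for illustration; the patent does not name a specific learning algorithm.
from sklearn.ensemble import RandomForestClassifier

def extract_features(session):
    """Combine physical-limit and query-stream behavioral features for one session."""
    return [
        session["queries_per_hour"],          # physical limit: query volume
        session["max_km_between_queries"],    # physical limit: location change
        session["click_through_rate"],        # behavioral: clicks per query
        session["keyword_entropy"],           # behavioral: query redundancy
        session["advanced_syntax_fraction"],  # behavioral: allintitle:, allinurl: use
    ]

def train_model(labeled_sessions):
    """labeled_sessions: list of (session_dict, label) pairs, label 'human' or 'bot'."""
    X = [extract_features(session) for session, _ in labeled_sessions]
    y = [label for _, label in labeled_sessions]
    model = RandomForestClassifier(n_estimators=100, random_state=0)
    model.fit(X, y)
    return model

def classify(model, session):
    """Label previously unseen search query traffic."""
    return model.predict([extract_features(session)])[0]
```

The individual features that might feed such a model are what the rest of the filing, and the rest of this post, are about.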
When a search engine tracks queries used by searchers, it can collect a fair amount of information related to that search query traffic.
That information can include the keywords themselves, as well as metadata about queries, such as the fields below (a rough record sketch follows the list):
- Search query strings,
- Search query traffic pages,
- The search query input source,
- A unique identifier identifying the user or device used to enter the search query,
- An IP (internet protocol) address,
- Query time,
- Click time, and/or
- Other information.
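One way to picture a single logged query is as a small record like the following; the field names are assumptions for illustration, not the patent's actual schema:

```python
# Illustrative record for one logged query; the field names are assumptions,
# not the patent's actual schema.
from dataclasses import dataclass
from datetime import datetime
from typing import Optional

@dataclass
class QueryLogRecord:
    query_string: str                      # the keywords submitted
    results_page: int                      # which page of results was requested
    input_source: str                      # e.g. web form, toolbar, API
    user_id: str                           # cookie or other unique identifier
    ip_address: str
    query_time: datetime
    click_time: Optional[datetime] = None  # None if no result was clicked
```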
As a search engine receives these queries, they may be analyzed and labeled based upon whether they were generated by a human searcher or by an automated process.
So, how do you consider whether a human or a bot submitted a query?
Physical Limitations of Human Searchers
One way to distinguish queries from humans and queries from robots is to keep some physical limitations of humans in mind. The patent filing tells us about a couple, and a rough code sketch of both checks follows these two descriptions:
Volume – Humans can only do so many searches in any one period of time. Someone submitting 100 queries in 10 seconds likely isn’t human. And 200 queries from the same searcher in the period of a day might also seem unlikely. We’re told about one user searching for the word “mynet” 12,061 times during the course of one day.
Location – It’s hard for a person to be in more than one place at the same time. A search engine might keep track of the IP addresses used by a user ID and see whether queries are made from that ID from locations separated by significant distances. It’s not unusual for someone to use different computers at different locations, such as home, work, or a mobile device. But queries made close together in time from locations great distances apart may be a good indication of a botnet being used, or of someone using an anonymous browsing tool without having disabled cookies.
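Assuming we have timestamped queries, and rough IP-based coordinates, for each user ID, the two checks above might look something like this. The thresholds and helper names are illustrative assumptions, not figures from the patent:

```python
import math
from datetime import timedelta

# Illustrative thresholds; the patent does not publish specific cut-offs.
MAX_QUERIES_PER_10_SECONDS = 10
MAX_QUERIES_PER_DAY = 200
MAX_PLAUSIBLE_SPEED_KMH = 1000  # faster than this between queries looks non-human

def exceeds_volume_limits(query_times):
    """query_times: sorted list of datetimes for one user ID."""
    for start in query_times:
        window = [t for t in query_times if start <= t <= start + timedelta(seconds=10)]
        if len(window) > MAX_QUERIES_PER_10_SECONDS:
            return True
    per_day = {}
    for t in query_times:
        per_day[t.date()] = per_day.get(t.date(), 0) + 1
    return any(count > MAX_QUERIES_PER_DAY for count in per_day.values())

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, in kilometres."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp, dl = math.radians(lat2 - lat1), math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * 6371.0 * math.asin(math.sqrt(a))

def implausible_location_change(events):
    """events: sorted list of (datetime, latitude, longitude) tuples for one user ID."""
    for (t1, la1, lo1), (t2, la2, lo2) in zip(events, events[1:]):
        km = haversine_km(la1, lo1, la2, lo2)
        hours = (t2 - t1).total_seconds() / 3600
        if hours == 0:
            if km > 50:  # queries at the same instant from far-apart locations
                return True
        elif km / hours > MAX_PLAUSIBLE_SPEED_KMH:
            return True
    return False
```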
Behavioral Features
While the physical limitations above might help identify automated search query traffic, automated queries can be throttled down so that they look more like queries from a human.
There might still be behavioral indications that queries are automated. I’ve included below a number of the patterns that the inventors tell us their system might use to determine whether a search comes from an automated program or a human searcher.
Click Through Rates
People tend to click on pages that show up in search results at least some of the time. We’re told that “typically users click at least once in ten queries.” Automated programs, however, often don’t click through on results at all, which may be something a search engine will look for.
Some bots collect additional information about some targeted URLs, and so a different set of patterns may show up for those.
We’re told that there are three “typical” bot click-through rates, and a rough scoring sketch follows the list:
- A bot that clicks on no links,
- A bot that clicks on every link, and
- A bot that only clicks on targeted links.
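Computing a click-through rate per user ID is straightforward, and the first two patterns show up at its extremes. The “at least once in ten queries” figure comes from the filing; the minimum session length is my own assumption, added so tiny samples aren’t flagged:

```python
def click_through_rate(queries):
    """queries: list of dicts with a boolean 'clicked' flag, for one user ID."""
    if not queries:
        return 0.0
    return sum(1 for q in queries if q["clicked"]) / len(queries)

def ctr_looks_automated(queries, min_session_length=20):
    """Flag the two extreme patterns: a bot clicking nothing, or clicking everything.

    Bots that click only on targeted links need a different check, since their
    overall rate can look human.
    """
    if len(queries) < min_session_length:
        return False
    ctr = click_through_rate(queries)
    return ctr == 0.0 or ctr == 1.0
```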
Bots Search in Alphabetical Order in Search Query Traffic (Sometimes)
Is there a pattern to the searches in search query traffic, such as searches for terms in alphabetical order? If so, it’s more likely that the searches are automated.
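Detecting that behavior can be as simple as checking whether the queries in a session arrive in alphabetical order. A small sketch, which only flags reasonably long sessions so a chance ordering isn’t misread:

```python
def queries_in_alphabetical_order(queries, min_session_length=5):
    """Flag sessions whose queries arrive in alphabetical order.

    Short sessions are ignored, since a handful of queries can easily land in
    alphabetical order by chance. The minimum length is an illustrative choice.
    """
    if len(queries) < min_session_length:
        return False
    normalized = [q.strip().lower() for q in queries]
    return all(a <= b for a, b in zip(normalized, normalized[1:]))
```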
Bots Search Using Spam Words
We’re told that some words tend to have higher spam scores than others, and that user IDs submitting queries containing large numbers of spam terms are more likely to be computer programs. The same goes for queries that tend to focus upon adult content.
Query Keyword Entropy
Queries that are extremely redundant may be signals of automated searches in search query traffic. For example, a stock quote bot searching for terms about the stock market may submit queries that all tend to be around the same length.
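Entropy here is just a measure of how varied the keywords in a user’s query stream are; highly repetitive queries produce low entropy. A minimal sketch using Shannon entropy over the keywords (the patent does not spell out this exact calculation):

```python
import math
from collections import Counter

def keyword_entropy(queries):
    """Shannon entropy (in bits) of the keywords across one user's queries.

    Very low entropy means the same few terms are being repeated over and over,
    which is one possible signal of automated traffic.
    """
    keywords = [word for query in queries for word in query.lower().split()]
    if not keywords:
        return 0.0
    total = len(keywords)
    counts = Counter(keywords)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())
```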
Query Time Periodicity
The time between queries from a specific searcher may be recorded, as may the time between individual queries and clicks upon results. A pattern found in those intervals may indicate requests from a bot.
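One way to look for that kind of pattern is to compute the gaps between consecutive queries and check how regular they are; near-constant intervals suggest a script running on a timer. A sketch, with an arbitrary regularity threshold:

```python
import statistics

def intervals_look_periodic(query_times, min_queries=10, max_relative_spread=0.05):
    """query_times: sorted datetimes for one user ID.

    Returns True when the gaps between queries are nearly constant, e.g. a
    script firing a query every 30 seconds. The 5% spread threshold is an
    illustrative choice, not a figure from the patent.
    """
    if len(query_times) < min_queries:
        return False
    gaps = [(b - a).total_seconds() for a, b in zip(query_times, query_times[1:])]
    mean_gap = statistics.mean(gaps)
    if mean_gap == 0:
        return True  # many queries at effectively the same instant
    return statistics.pstdev(gaps) / mean_gap < max_relative_spread
```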
Advanced Search Query Traffic Syntax
Many searches over the course of a day that use advanced search operators such as “allintitle:” or “allinurl:” might be seen as coming from automated traffic.
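Counting how often a user ID relies on those operators is a simple pattern match; a sketch with an illustrative threshold:

```python
import re

# Only the two operators named above; a real system would likely track more.
ADVANCED_OPERATORS = re.compile(r"\b(allintitle|allinurl)\s*:", re.IGNORECASE)

def heavy_advanced_syntax_use(queries, max_fraction=0.5):
    """Flag user IDs where most of a day's queries use advanced operators.

    The 50% cut-off is an illustrative assumption, not a figure from the patent.
    """
    if not queries:
        return False
    with_operators = sum(1 for q in queries if ADVANCED_OPERATORS.search(q))
    return with_operators / len(queries) > max_fraction
```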
Category Entropy
It might be possible to assign categories to specific queries. When there are many searches from a specific user ID that fit into a small number of categories, those searches may be from automated programs.
Reputations and Trends
Searches from blocklisted IP addresses or blocked user agents, or from particular country codes might indicate bot activity.
Some bots search for rare queries very frequently, and some bots may have sessions where each query appears nonsensical. Queries where users frequently click on results with very low probabilities of being clicked upon may also indicate automated searches and clicks.
Search Query Traffic Conclusion
I didn’t list all of the methods described in the patent filing for classifying search query traffic, and I expect that other patterns not included in the filing may also be used to tell the two apart.
The patent application tells us that it might label queries that it receives, but doesn’t tell us how it might use those classifications.
Like any webmaster, the people at search engines want to understand where their traffic is coming from and how their site is used. It’s possible that if a search engine believes queries are coming from an automated program, it might challenge the source of those queries with something like a CAPTCHA, to determine whether the searcher is a person or a program.
Comments

Good post, Bill! 🙂
However, the fact is that webmasters, SEOs, and researchers have always had, and will always have, a legitimate need to research search data. As long as the engines do not provide accurate and reliable APIs (paid or not), “we” will need to scrape. No TOS or smart systems will ever fully stop that.
The solution for the engines is to provide good APIs. Most pros will not mind paying for them. After all, structured and reliable access to the needed data via APIs saves a lot of development time compared to scraping.
Hi Mikkel,
Thank you. It was pretty tempting to discuss application programming interfaces, but I tried to restrain myself, and focus upon how search engines spend time analyzing the queries that they receive, and the kinds of pattern matching that they might do to understand where those queries might be coming from.
APIs are a pretty big topic by themselves, worth a number of posts, but I think you’re right that I should have at least given them a mention.
Who offers APIs? What are the expressed limitations on their use? Are there limitations on existing APIs that should be changed to make them more useful? What examples might there be of creative uses of those data interfaces? What makes an API a good one? (This document from a Google researcher starts that discussion, at least in the design of an API: How to Design a Good API and Why it Matters) Do you miss out on some important information when using something like a web search API that you would see if you did a manual search, such as rerankings because of domain collapsing, or because of searches being customized by location or a past history of searches on related topics or because of the inclusion of blended search results or other features/rerankings?
@mikkel totally agree. There’s value to webmasters beyond spamming SEs. I think of it like social networks. If Twitter didn’t have an API, people would have to scrape for mentions of their brand. It’s not an attempt to spam, just to understand how they’re being represented within that channel, beyond the traffic they get.
I ultimately think SEs aren’t going to risk serving an actual user false results just to block automated queries (beyond the CAPTCHA). They’re at a point right now which is good enough, if you will, and any further efforts to thwart bots would probably provide a pretty marginal return on those efforts.
What’s interesting though, is that I’m not even sure using Google’s AJAX search API for rank monitoring, for example, is even in accordance with their TOS.
Via http://code.google.com/apis/websearch/terms.html
“By way of example, and not as a limitation, You agree that when using the Service, You will not, and will not permit users or other third parties to:
# use any robot, spider, site search/retrieval application, or other device to retrieve or index any portion of Google Search Results or to collect information about users for any unauthorized purpose;”
Would definitely love to hear your thoughts on different SE API uses Bill.
In my opinion, automated rank-checking tools are less important in SEO work. Web analytics does the job. Keyword strength, conversion rate, and search behavior show up better in web analytics than in a ranking tool. Web analytics has become our daily tool.
Hi Chase,
I agree with you – there can be a lot of value to APIs. One of the APIs I like very much is the Google Maps API, which has led to the creation of lots of interesting mashups.
I think writing a post on different SE API uses would probably be pretty interesting – it’s something that I would rather do as a post on its own than in the comments to this post. Thanks for the idea.
Hi Renaud,
I’ve felt that way about automated rank checking for a number of years. There’s a lot of actionable information to be found in log analysis and web analytics that just doesn’t come through when spending time checking the rankings of pages for queries. And that rank checking misses some of the things that search engines are doing in search results, such as customizing some results based upon the geographic location of a searcher.
Automation is part of our daily life in this day and age of information technology. Automating certain processes, such as delegating intensive search work to bots, sometimes makes business sense. It all depends on what the ultimate goal or use of that data is.
Very interesting post – How much do you think that Google allows user tracking to influence their search results?
Hi Joel,
I’m not sure that I could tell you how much, but I can tell you that Google has published a lot of patent filings and whitepapers that discuss how they might, and there are signs that some of those processes are in place, such as customizing search results for searchers who aren’t even logged into personalized search, based upon search patterns from other searchers.
Hi Buzz,
I agree with you – automating business processes wisely does make sense, but it needs to be done in a way that doesn’t potentially harm the participants that own and maintain the data involved in those business processes. If a search engine makes information available under terms of service that prohibit certain activities, then attempting to access that information in a way that violates those terms may be a questionable approach. It can be a smart move for a search engine to provide APIs that help their users automate business processes, but even those have terms of service attached to them.
I don’t know if anyone else here has experienced this, but I work for a relatively small ecommerce site and we do all our keyword research in-house. We pay particular attention to “allintitle” keyword competition, but have never found a tool or software that can reliably automate that process for us. Since we’re in a niche market I do EVERY allintitle query by hand and yet I routinely get blocked by Google after running just a few.
Apparently I can copy-and-paste and type faster than a human.
At first Google started showing me an error page with a CAPTCHA test, but after a while even that option was gone and I would get blacklisted completely. Sometimes for hours or even days.
I have checked Google’s TOS and even tried to contact them repeatedly to find if there’s a limit to human-entered queries, because I do run a lot of them. I have been told there isn’t, but Google still thinks I’m automated software.
Now when I search I try to avoid patterns. I try to avoid getting into a “rhythm”, search at different times of the day, and will search for terms out of alphabetical order. It seems to help as the blocking has grown less frequent, but it still happens.
Oh, and just as an ironic confessional aside, we actually have automated software that scrapes for our SE rankings and NOT ONCE has it ever been blocked by any of the engines. I started using it AFTER I was already regularly getting blocked for the allintitle stuff, so I know it’s not the source of our problems.
Last, and sorry this is so long, I am kind of fed up with how Google seems to regard a natural desire to understand one’s traffic and competition as borderline evil. My company is not trying to spam the engines or “game” the system; we just want to know the landscape around our business. It’s like wanting to start a brick-and-mortar store and not being allowed to see a map of the town you want to build in, or being denied access to a population report.
The funny thing is, my company is really committed to properly optimizing our site for the engines AND our customers with great content. On top of that we use PPC so we’re also paying customers of Google. So basically I feel like we’re getting slapped in the face for doing precisely what Google wants us to do. Total Catch-22.
OK, getting off soap box now.
Great post, thanks Bill 🙂
Even though I am one of the webmasters with a great need to research search data automatically, I respect that the search engines have their terms of service. Besides, there is a way to monitor rankings using Google Analytics advanced filter segmentation.
Just providing an API like Yahoo Boss solves a lot of issues. Without that, analytics-hungry people like us will always keep trying to figure out better ways to see our own and our competitors’ rankings.
Hi Polly,
There is a lot of information worth looking at in analytics that can produce some interesting ideas for topics to write about, content to change, and so on. Also, some of the search data that we see just by looking at search results (whether automated or manually) may be misleading, especially with personalized results, and results that might be customized based upon things like location and search patterns during query sessions.
Hi Chris,
I’ve seen CAPTCHA tests after doing a number of manual queries for things like “allintitle:” searches, but haven’t been blocked.
I also understand the desire to try to understand why some results are shown where they are, and when and why others might change, when you are faced with a system like a search engine that can have a direct impact upon your business. As important as those search results might be, it’s never a bad idea to try to find other ways to bring visitors to your site that don’t necessarily rely upon search engines. Funny thing is, some of those approaches may actually help you increase your search traffic as well.
Hi RG,
There is value to having APIs, not only for people interested in the data, but also for the search engines themselves. Yahoo Boss is definitely a good example of how an API can spur innovation, and provide tools that make search better.
The patent filing I wrote about above really doesn’t delve into APIs, though comments in this thread have definitely turned that way. The future of search (and many other applications on the Web) may end up seeing some of its greatest growth through programming interfaces like this.
I think I probably am one of the few people who oppose the view that bots/automated search queries and especially ranking monitoring are a bad idea.
If I am working on a selection of keywords I want to know where I rank for them. I want to know the progress I am making and also want to be able to show anyone who I am working for this information.
All good SEO progress reports should include this information, and while clients can get hung up on it, and too much focus on it can be a bad thing, it should still be done.
I admit that it should not be the only focus point, and all the other metrics like time on site, conversions, etc. are vital, but I cannot get my head around why people often say rankings are a bad metric to measure by or focus on; they are the starting point of all other metrics.
Your conversion rate could be 100% but if you are drawing minimal traffic because you are not listed then you are not really getting where you want to be.
Hi Jimmy,
Search results are increasingly going to be different from viewer to viewer as the search engines introduce more and more personalization into those results.
Ranking reports really face an impossible task in providing realistic results – showing rankings when those change from one viewer to another. Analytics are going to be the tools to use to show clients the impacts of SEO progress.
I’ve just stumbled across this post and it’s something we’ve been thinking a lot about recently, particularly with regard to AdWords and the search query volumes we see. The question we have is what proportion of queries Google reports come from automated rank-checkers? When we see 10k impressions in a month are we actually looking at 5,000 real searches? The question that arises from this is how much our ad performance is impacted by this – if there are many automated queries going on, our ads will surely be penalised through no fault of ours. At a simple level it would seem relatively easy for Google to view click data and filter out the queries that never ever lead to clicks, which might be a reasonable indicator of a bot or script.
Hi Nick,
Without a doubt, that’s a question worth asking.
Google does have some approaches towards trying to understand when search queries are automated. Often automated programs checking rankings follow patterns that can be identified based upon a very wide mix of signals like the ones that I described in my post.
Mikkel’s suggestion in the first comment on this post, that search engines provide APIs for researchers to use, would be really helpful as well.
People often, but don’t always, perform more than one search, especially when their first search doesn’t have any results that they want to click upon. The search engines don’t just look at individual search queries; they often look at whole search sessions and use information in those sessions in a number of ways. While analyzing those sessions, they may also attempt to identify automated ones. Not all queries without clicks are from bots, but if you have a substantial string of searches in a search session and none of them lead to clicks, that might be a better indication of automated searches.
It’s funny though, about the automated searches thing: even if you’re a human doing regular searches, you can confuse Google and it asks you for a security code to verify you’re not a bot. I guess it just comes down to the speed at which you’re searching.
Hi San Diego SEO,
Good point. I’ve received a CAPTCHA message when manually doing a number of consecutive “site:” searches, even when I did them somewhat slowly. Speed may be one thing that they look at, but it’s not necessarily the only thing.