Value in Being Able to Classify Search Query Traffic From Robots and Humans
Some of the visitors to search engines are people looking for information. Other visitors may have other purposes for visiting search engines, and might not even be humans.
Instead, those automated visitors may be attempting to check the rankings of pages in search results, conduct keyword research, provide results for games, identify sites to spam, or alter click-through rates. It can be helpful for a search engine to be able to classify search query traffic, to understand whether that traffic is coming from human searchers.
These non-human visitors can use up a search engine’s resources, and can skew the user data that a search engine might consider using to modify search rankings and search suggestions.
Google has asked its visitors not to use programs like that for a number of years. In its Webmaster Guidelines, Google tells us:
Don’t use unauthorized computer programs to submit pages, check rankings, etc. Such programs consume computing resources and violate our Terms of Service.
It’s likely that all of the major commercial search engines have developed ways to try to distinguish between human visitors and automated visitors or bots.
A recent patent application from Microsoft describes some of the ways it may attempt to differentiate between manual and automated searches:
Classifying Search Query Traffic
Invented by Greg Buehrer, Kumar Chellapilla, and Jack W. Stokes
Assigned to Microsoft
US Patent Application 20090265317
Published October 22, 2009
Filed: April 21, 2008
A method for classifying search query traffic can involve receiving a plurality of labeled sample search query traffic and generating a feature set partitioned into human physical limit features and query stream behavioral features. A model can be generated using the plurality of labeled sample search query traffic and the feature set. Search query traffic can be received and the model can be utilized to classify the received search query traffic as generated by a human or automatically generated.
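To make the abstract a little more concrete, here is a minimal sketch in Python of training a model from labeled samples and then classifying new traffic. The feature names, the `TrafficFeatures` class, and the nearest-centroid approach are all my own assumptions for illustration; the patent filing doesn't say which features or which learning method a real system would use.

```python
from dataclasses import dataclass, astuple

@dataclass
class TrafficFeatures:
    # Human physical limit features (hypothetical examples).
    max_queries_per_10s: float
    distinct_locations: float
    # Query stream behavioral features (hypothetical examples).
    click_through_rate: float
    keyword_entropy: float

def train_centroids(labeled_samples):
    """Average the feature vectors for each label.

    labeled_samples: list of (TrafficFeatures, label) pairs.
    Returns a dict mapping each label to its mean feature vector.
    """
    sums, counts = {}, {}
    for features, label in labeled_samples:
        vec = astuple(features)
        acc = sums.setdefault(label, [0.0] * len(vec))
        for i, value in enumerate(vec):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [s / counts[label] for s in acc] for label, acc in sums.items()}

def classify(model, features):
    """Return the label whose centroid is closest to the feature vector."""
    vec = astuple(features)

    def squared_distance(centroid):
        return sum((a - b) ** 2 for a, b in zip(vec, centroid))

    return min(model, key=lambda label: squared_distance(model[label]))

# Made-up training data, purely for illustration.
samples = [
    (TrafficFeatures(2, 1, 0.3, 4.0), "human"),
    (TrafficFeatures(3, 1, 0.2, 5.0), "human"),
    (TrafficFeatures(90, 5, 0.0, 0.5), "bot"),
    (TrafficFeatures(120, 8, 1.0, 0.2), "bot"),
]
model = train_centroids(samples)
```

The features here aren't scaled, so the query volume number dominates the distance; a more careful version would normalize each feature first.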
When a search engine tracks queries used by searchers, it can collect a fair amount of information related to those searches.
That information can include the keywords themselves, as well as metadata about queries, such as:
- Search query strings,
- Search query results pages,
- The search query input source,
- A unique identifier identifying the user or device used to enter the search query,
- An IP (internet protocol) address,
- Query time,
- Click time, and/or;
- Other information.
As these queries are received by a search engine, they may be analyzed and labeled based upon whether they were generated by a human searcher or by an automated process.
So, how do you consider whether a query was submitted by a human or a bot?
Physical limitations of Human Searchers
One way to distinguish between queries from humans and queries from robots is to keep some physical limitations of humans in mind. The patent filing tells us about a couple:
Volume – Humans can only do so many searches in any one period of time. Someone submitting 100 queries in 10 seconds likely isn’t human. And 200 queries from the same searcher in the period of a day might also seem unlikely. We’re told about one user searching for the word “mynet” 12,061 times during the course of one day.
Location – It’s hard for a person to be in more than one place at the same time. But a search engine might keep track of the IP addresses used by a user ID, and see whether queries are made from that ID from different locations that might be separated by significant distances. It’s not unusual for someone to use different computers at different locations, such as from home or work or a mobile device. But queries close in time from great distances apart may be good indications of botnets being used, or someone using an anonymous browsing tool without having disabled cookies.
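Both of those physical limits could be checked with something like the Python sketch below. The function names, the 1,000 km/h speed ceiling, and the 50 km minimum distance are my own assumptions rather than anything from the patent filing, and the sketch assumes each query's IP address has already been mapped to a latitude and longitude.

```python
import math

def max_queries_in_window(timestamps, window_seconds):
    """Largest number of queries that fall inside any window of the given length."""
    times = sorted(timestamps)
    best = 0
    start = 0
    for end, t in enumerate(times):
        # Slide the window start forward until it is within window_seconds of t.
        while t - times[start] >= window_seconds:
            start += 1
        best = max(best, end - start + 1)
    return best

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two latitude/longitude points."""
    earth_radius_km = 6371.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * earth_radius_km * math.asin(math.sqrt(a))

def has_impossible_travel(events, max_speed_kmh=1000.0, min_distance_km=50.0):
    """Check whether consecutive queries imply travel faster than max_speed_kmh.

    events: (timestamp_seconds, latitude, longitude) tuples for one user ID.
    Short hops under min_distance_km are ignored, since IP geolocation is imprecise.
    """
    ordered = sorted(events)
    for (t1, lat1, lon1), (t2, lat2, lon2) in zip(ordered, ordered[1:]):
        distance = haversine_km(lat1, lon1, lat2, lon2)
        hours = (t2 - t1) / 3600.0
        if distance > min_distance_km and distance > max_speed_kmh * hours:
            return True
    return False
```

A user ID with queries from Seattle and London five minutes apart would trip `has_impossible_travel`, while the same two cities a day apart would not.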
While the physical limitations above might help identify automated queries, it’s possible for automated queries to be toned down so those queries seem to be more like those from a human.
There might also be behavioral indications that queries are automated. Below are a number of the patterns that the inventors tell us their system might use to determine whether a search comes from an automated program or a human searcher.
Click Through Rates
People tend to click on at least some of the pages that show up in search results. We’re told that “typically users click at least once in ten queries.” Automated programs often don’t click through results, so that may be something a search engine will look for.
Some bots collect additional information about some targeted URLs, and so a different set of patterns may show up for those.
We’re told that there are three “typical” bot click through rates:
- A bot that clicks on no links,
- A bot that clicks on every link, and;
- A bot that only clicks on targeted links.
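A rough Python sketch of a check for the first two patterns might look like the following. The thresholds and the minimum number of queries are assumptions I picked for illustration, and spotting the third pattern (targeted clicks) would need information about which URLs were clicked, which this sketch doesn't attempt.

```python
def click_through_rate(clicks_per_query):
    """Share of queries that led to at least one click.

    clicks_per_query: list with the number of result clicks for each query.
    """
    if not clicks_per_query:
        return 0.0
    clicked = sum(1 for clicks in clicks_per_query if clicks > 0)
    return clicked / len(clicks_per_query)

def ctr_looks_automated(clicks_per_query, min_queries=50, low=0.01, high=0.99):
    """Flag user IDs that almost never click or almost always click.

    Returns False when there are too few queries to judge.
    """
    if len(clicks_per_query) < min_queries:
        return False
    rate = click_through_rate(clicks_per_query)
    return rate <= low or rate >= high
```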
Bots Search in Alphabetical Order (Sometimes)
Is there a pattern to the searches, such as searches for terms in alphabetical order? If so, it’s more likely that the searches are automated.
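One simple way to measure that, sketched in Python under my own assumptions, is to look at how often each query sorts at or after the one before it. The `alphabetical_score` name is hypothetical, and the patent filing doesn't spell out how it would test for ordering.

```python
def alphabetical_score(queries):
    """Fraction of consecutive query pairs that appear in alphabetical order.

    A score near 1.0 suggests a sorted word list is being fed to the search box.
    Repeated queries count as in order.
    """
    normalized = [q.strip().lower() for q in queries]
    pairs = list(zip(normalized, normalized[1:]))
    if not pairs:
        return 0.0
    in_order = sum(1 for first, second in pairs if first <= second)
    return in_order / len(pairs)
```

Queries in no particular order should tend to score somewhere around one half, so only long sessions with scores close to 1.0 would be interesting.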
Bots Search Using Spam Words
We’re told that some words tend to have higher spam scores than others, and that user IDs that submit queries containing large numbers of spam terms are more likely to belong to a computer program. The same is true of queries that tend to focus upon adult content.
Query keyword entropy
Queries that tend to be extremely redundant may be signals of automated searches. For example, a stock quote bot searching for terms about the stock market may submit searches that all tend to be around the same length.
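As an illustration, here is a small Python sketch that computes the Shannon entropy of the words a user ID searches for. Measuring entropy over individual words is my assumption; the patent filing may calculate it differently.

```python
import math
from collections import Counter

def keyword_entropy(queries):
    """Shannon entropy, in bits, of the words used across a set of queries.

    Low entropy means the same few words are repeated over and over.
    """
    words = [word for query in queries for word in query.lower().split()]
    if not words:
        return 0.0
    counts = Counter(words)
    total = len(words)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())
```

A searcher who submits the same one-word query again and again has an entropy of zero, while a searcher using a wide variety of words scores much higher.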
Query time periodicity
A search engine may record the amount of time between queries from a specific searcher, or the time between individual queries and clicks upon results. A regular pattern in those times may indicate requests from a bot.
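One way to put a number on that regularity, sketched here in Python, is the coefficient of variation of the gaps between queries: the standard deviation of the gaps divided by their mean. Using this particular statistic is my assumption and not something the patent filing specifies.

```python
import statistics

def interval_variation(timestamps):
    """Coefficient of variation of the gaps between consecutive queries.

    Values near 0 mean the queries arrive at almost perfectly regular intervals.
    Returns None when there are fewer than two gaps to compare.
    """
    times = sorted(timestamps)
    gaps = [later - earlier for earlier, later in zip(times, times[1:])]
    if len(gaps) < 2:
        return None
    mean_gap = statistics.mean(gaps)
    if mean_gap == 0:
        return 0.0
    return statistics.pstdev(gaps) / mean_gap
```

A program that searches exactly every 30 seconds scores 0, while a person who pauses to read results would usually produce a much larger value.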
Advanced Query Syntax
Many searches over the course of a day that use advanced search operators such as “allintitle:” or “allinurl:” might be seen as coming from automated traffic.
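Counting how many of a user ID's queries use operators like those is straightforward. In the Python sketch below, the list of operators beyond “allintitle:” and “allinurl:” is my own addition, and the `advanced_syntax_share` name is hypothetical.

```python
import re

# "allintitle" and "allinurl" come from the patent filing; the rest are assumed examples.
OPERATOR_PATTERN = re.compile(
    r"\b(allintitle|allinurl|intitle|inurl|site|link):", re.IGNORECASE
)

def advanced_syntax_share(queries):
    """Fraction of queries that contain at least one advanced search operator."""
    if not queries:
        return 0.0
    hits = sum(1 for query in queries if OPERATOR_PATTERN.search(query))
    return hits / len(queries)
```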
It might be possible to assign categories to specific queries. When there are many searches from a specific user ID that fit into a small number of categories, it’s possible that those searches are from automated programs.
Reputations and trends
Searches from blacklisted IP addresses, blacklisted user agents, or particular country codes might be indications of bot activity.
Some bots search for rare queries very frequently, and some bots may have sessions where each query appears nonsensical. Queries where users frequently click on results that have very low probabilities of being clicked upon may also be indications of automated searches and clicks.
I didn’t list all of the methods described in the patent filing about how it might classify search query traffic, and I expect that there are other patterns that may be used to tell the two apart that weren’t included either.
The patent application tells us that it might label queries that it receives, but doesn’t tell us how it might use those classifications.
Like any webmaster, the people from search engines want to be able to understand where their traffic is coming from, and how their site is used. It’s possible that if a search engine believes that queries are being received from an automated program, it might challenge the source of those queries with something like a CAPTCHA for that searcher to fill out, to determine whether the searcher is a person or a program.