Value in Being Able to Classify Search Query Traffic From Robots and Humans
Some visitors to search engines are people looking for information. Others may have other purposes for visiting search engines and might not even be humans.
Instead, automated visitors may be checking the rankings of pages in search results, conducting keyword research, providing results for games, or even identifying sites to spam or trying to alter click-through rates. Thus, it can be helpful for a search engine to be able to classify search query traffic to understand whether that traffic is coming from human searchers.
Non-human visitors can use up search engine resources and skew the user data that a search engine might consider using to modify search rankings and search suggestions.
For many years, Google has asked its visitors not to use programs like that. In its Webmaster Guidelines, Google tells us:
Don’t use unauthorized computer programs to submit pages, check rankings, etc. Such programs consume computing resources and violate our Terms of Service.
All of the major commercial search engines have likely developed ways to try to distinguish between human visitors and automated search query traffic or bots.
A recent patent application from Microsoft tells us about some of the ways it may attempt to differentiate between manual and automated searches:
Classifying Search Query Traffic
Invented by Greg Buehrer, Kumar Chellapilla, and Jack W. Stokes
Assigned to Microsoft
US Patent Application 20090265317
Published October 22, 2009
Filed: April 21, 2008
A method for classifying search query traffic can involve receiving a plurality of labeled sample search query traffic and generating a feature set partitioned into human physical limit features and query stream behavioral features. A model can be generated using the plurality of labeled sample search query traffic and the feature set. Search query traffic can be received, and the model can be utilized to classify the received search query traffic as generated by a human or automatically generated.
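To make the abstract a little more concrete, here is a minimal sketch of the train-then-classify flow it describes. The patent filing doesn't commit to this particular model; I've swapped in a simple nearest-centroid classifier, and the feature names and sample values are invented purely for illustration.

```python
from statistics import mean

# Hypothetical feature names, grouped the way the abstract partitions them.
PHYSICAL_FEATURES = ("queries_per_day", "distinct_locations")
BEHAVIORAL_FEATURES = ("click_through_rate", "keyword_entropy")
FEATURES = PHYSICAL_FEATURES + BEHAVIORAL_FEATURES


def train_centroids(samples):
    """samples: list of (feature_dict, label). Returns (centroids, scales)."""
    # Scale each feature by its largest observed magnitude so no feature dominates.
    scales = {f: max(abs(row[f]) for row, _ in samples) or 1.0 for f in FEATURES}
    by_label = {}
    for row, label in samples:
        by_label.setdefault(label, []).append(row)
    centroids = {
        label: {f: mean(row[f] / scales[f] for row in rows) for f in FEATURES}
        for label, rows in by_label.items()
    }
    return centroids, scales


def classify(features, model):
    """Return the label whose centroid is closest to the scaled feature vector."""
    centroids, scales = model

    def squared_distance(centroid):
        return sum((features[f] / scales[f] - centroid[f]) ** 2 for f in FEATURES)

    return min(centroids, key=lambda label: squared_distance(centroids[label]))


# Toy labeled samples; the numbers are made up for this example.
labeled = [
    ({"queries_per_day": 25, "distinct_locations": 1,
      "click_through_rate": 0.4, "keyword_entropy": 5.0}, "human"),
    ({"queries_per_day": 40, "distinct_locations": 2,
      "click_through_rate": 0.3, "keyword_entropy": 6.0}, "human"),
    ({"queries_per_day": 12000, "distinct_locations": 1,
      "click_through_rate": 0.0, "keyword_entropy": 0.0}, "automated"),
    ({"queries_per_day": 5000, "distinct_locations": 9,
      "click_through_rate": 0.0, "keyword_entropy": 1.0}, "automated"),
]
model = train_centroids(labeled)
```

A real system would presumably use far more features and a stronger model; the point here is only the shape of the pipeline: labeled samples in, a model out, and new traffic classified against that model.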
When a search engine tracks queries used by searchers, it can collect a fair amount of information related to that search query traffic.
That information can include the keywords themselves, as well as metadata about queries, such as:
- Search query strings,
- Search query traffic pages,
- The search query input source,
- A unique identifier identifying the user or device used to enter the search query,
- An IP (internet protocol) address,
- Query time,
- Click time, and/or
- Other information.
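As a rough illustration, the fields in that list could be held in a small record like the one below. The class name, field names, and sample values are my own assumptions rather than anything from the patent filing.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class QueryRecord:
    """One logged query. Field names are illustrative, not from the patent."""
    query: str                          # the search query string
    results_page: int                   # page of results requested (my reading of "pages")
    input_source: str                   # e.g. "search box" or "toolbar"
    user_id: str                        # identifier for the user or device
    ip_address: str                     # IP address the query came from
    query_time: float                   # when the query arrived, in seconds
    click_time: Optional[float] = None  # None when no result was clicked


record = QueryRecord(
    query="mynet",
    results_page=1,
    input_source="search box",
    user_id="user-123",
    ip_address="203.0.113.7",
    query_time=1000.0,
)
```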
As a search engine receives these queries, they may be analyzed and labeled based upon whether they were generated by a human searcher or by an automated process.
So, how do you consider whether a human or a bot submitted a query?
Physical limitations of Human Searchers
One way to distinguish between queries from humans and queries from robots is to keep in mind some physical limitations of humans. The patent filing tells us about a couple:
Volume – Humans can only do so many searches in any one period of time. Someone submitting 100 queries in 10 seconds likely isn’t human. And 200 queries from the same searcher in the period of a day might also seem unlikely. We’re told about one user searching for the word “mynet” 12,061 times during the course of one day.
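A volume check like this can be sketched with a sliding time window. The function name and the thresholds are my own assumptions; the patent filing doesn't give specific cutoffs.

```python
from collections import deque


def exceeds_rate(timestamps, max_queries, window_seconds):
    """Return True if more than max_queries fall inside any span of window_seconds."""
    window = deque()
    for t in sorted(timestamps):
        window.append(t)
        # Drop queries that are too old to share a window with the current one.
        while t - window[0] > window_seconds:
            window.popleft()
        if len(window) > max_queries:
            return True
    return False


# 100 queries spread over roughly 10 seconds, as in the example above.
burst = [i * 0.1 for i in range(100)]
# A handful of queries spread over several minutes.
casual = [0, 30, 95, 400]
```

With an assumed limit of 20 queries per 10 seconds, `exceeds_rate(burst, 20, 10)` flags the burst while `exceeds_rate(casual, 20, 10)` does not.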
Location – It’s hard for a person to be in more than one place at the same time. But a search engine might keep track of the IP addresses used by a user ID, and see whether queries from that ID come from locations separated by significant distances. It’s not unusual for someone to use different computers at different locations, such as at home or work, or on a mobile device. But queries made close together in time from places far apart may be good indications of botnets being used or someone using an anonymous browsing tool without having disabled cookies.
While the physical limitations above might help identify automated search query traffic, automated queries can be toned down so that they look more like queries from a human.
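One way to picture the location check is as an "impossible travel" test: estimate how fast someone would have had to move between two queries. This sketch assumes the IP addresses have already been mapped to latitude and longitude by some lookup that isn't shown, and the 1,000 km/h speed limit is my own placeholder.

```python
import math


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two points on Earth."""
    earth_radius_km = 6371.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * earth_radius_km * math.asin(math.sqrt(a))


def implied_speed_kmh(first, second):
    """first and second are (timestamp_seconds, latitude, longitude) tuples."""
    (t1, lat1, lon1), (t2, lat2, lon2) = first, second
    distance = haversine_km(lat1, lon1, lat2, lon2)
    hours = abs(t2 - t1) / 3600.0
    if hours == 0:
        return math.inf if distance > 0 else 0.0
    return distance / hours


def looks_impossible(first, second, max_kmh=1000.0):
    """Flag a pair of queries that would need faster-than-airliner travel."""
    return implied_speed_kmh(first, second) > max_kmh


seattle = (0, 47.61, -122.33)
london_5_min_later = (300, 51.51, -0.13)
suburb_1_hour_later = (3600, 47.67, -122.12)
```

Here the Seattle-to-London pair is flagged, while the home-to-work style pair an hour apart is not.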
There might also be behavioral indications that queries are automated. I’ve included below a number of the patterns that the patent filing tells us its system might use to determine whether a search is from an automated program or a human searcher.
Click Through Rates
People tend to click on at least some of the pages that show up in search results. We’re told that “typically users click at least once in ten queries.” Automated programs, however, often don’t click through on results at all, which may be something a search engine will look for.
Some bots collect additional information about some targeted URLs, and so a different set of patterns may show up for those.
We’re told that there are three “typical” bot click through rates:
- A bot that clicks on no links,
- A bot that clicks on every link, and
- A bot that only clicks on targeted links.
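Those three patterns, plus the "once in ten queries" figure, suggest a simple profile of a searcher's clicks. In this sketch the function name, the labels, and the thresholds other than the one-in-ten rate are assumptions of mine.

```python
def click_profile(clicked_urls_per_query, results_per_page=10,
                  targeted_min_clicks=10, targeted_max_urls=2):
    """Label a searcher from a list holding the clicked URLs for each query."""
    if not clicked_urls_per_query:
        raise ValueError("need at least one query")
    all_clicks = [url for urls in clicked_urls_per_query for url in urls]
    if not all_clicks:
        return "no clicks"
    if all(len(urls) >= results_per_page for urls in clicked_urls_per_query):
        return "clicks every link"
    # Many clicks that all land on a tiny set of URLs look targeted.
    if len(all_clicks) >= targeted_min_clicks and len(set(all_clicks)) <= targeted_max_urls:
        return "targeted clicks"
    queries_with_click = sum(1 for urls in clicked_urls_per_query if urls)
    rate = queries_with_click / len(clicked_urls_per_query)
    return "human-like" if rate >= 0.1 else "low click-through"


silent_bot = [[] for _ in range(50)]
greedy_bot = [[f"result-{i}" for i in range(10)] for _ in range(5)]
targeted_bot = [["http://example.com/a"] for _ in range(12)]
person = [[], ["a"], [], ["b", "c"], [], [], ["d"], [], [], []]
```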
Bots Search in Alphabetical Order in Search Query Traffic (Sometimes)
Is there a pattern to the searches in search query traffic, such as searches for terms in alphabetical order? If so, it’s more likely that the searches are automated.
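A minimal way to measure that, assuming we have a searcher's queries in the order they were submitted, is to count how many neighboring pairs are already in alphabetical order. The function name is hypothetical.

```python
def sorted_fraction(queries):
    """Fraction of adjacent query pairs that appear in alphabetical order."""
    if len(queries) < 2:
        return 0.0
    lowered = [q.casefold() for q in queries]
    in_order = sum(1 for a, b in zip(lowered, lowered[1:]) if a <= b)
    return in_order / (len(lowered) - 1)


dictionary_bot = ["apple", "banana", "cherry", "date"]
person = ["weather", "cats", "news", "airline"]
```

A value near 1.0 over a long session would be hard to explain as ordinary human searching, though a short session could reach it by chance.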
Bots Search Using Spam Words
We’re told that some words tend to have higher spam scores than others, and that queries from user IDs that submit large numbers of spam terms are more likely to come from a computer program. The same is true of queries that tend to focus upon adult content.
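The patent filing doesn't publish its spam word list or scores, so the weights below are placeholders I made up. The sketch just averages a per-word spam weight over everything a user ID has searched for.

```python
# Hypothetical spam weights; a real list would come from the search engine's own data.
SPAM_WEIGHTS = {"casino": 0.9, "pills": 0.8, "loans": 0.6}


def spam_score(queries, weights=SPAM_WEIGHTS):
    """Mean spam weight across all terms in the queries; unknown terms count as 0."""
    terms = [term for q in queries for term in q.casefold().split()]
    if not terms:
        return 0.0
    return sum(weights.get(term, 0.0) for term in terms) / len(terms)
```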
Query keyword entropy
Queries that tend to be extremely redundant may be signals of automated searches in search query traffic. For example, a stock quote bot searching for terms about the stock market may submit queries that all tend to be around the same length.
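One common way to put a number on redundancy is Shannon entropy over the keywords a searcher uses. I'm assuming that standard definition here; the patent filing may calculate its entropy feature differently.

```python
import math
from collections import Counter


def keyword_entropy(queries):
    """Shannon entropy, in bits, of the keyword distribution across the queries."""
    counts = Counter(term for q in queries for term in q.casefold().split())
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


repetitive = ["msft quote"] * 4
varied = ["pizza recipe", "flights boston", "knee pain", "movie times"]
```

The repetitive session uses only two keywords equally often, which works out to 1 bit, while the varied session spreads across eight distinct keywords for 3 bits. Lower values mean more redundancy.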
Query time periodicity
The time between queries from a specific searcher, or between individual queries and clicks upon results, may be recorded. A regular pattern in those times may indicate requests from a bot.
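A simple stand-in for a periodicity test is the coefficient of variation of the gaps between queries: a bot on a timer produces gaps that barely vary. This is my own choice of measure, not necessarily the one in the patent filing.

```python
from statistics import mean, pstdev


def interval_variation(timestamps):
    """Coefficient of variation of gaps between consecutive queries, or None if too few."""
    ordered = sorted(timestamps)
    gaps = [later - earlier for earlier, later in zip(ordered, ordered[1:])]
    if len(gaps) < 2:
        return None
    average = mean(gaps)
    if average == 0:
        return 0.0
    return pstdev(gaps) / average


timer_bot = [0, 60, 120, 180, 240]   # one query exactly every minute
person = [0, 5, 95, 100, 400]        # irregular bursts and pauses
```

Values close to zero suggest clockwork timing, while human sessions tend to produce much larger values.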
Advanced Search Query Traffic Syntax
Many searches over the course of a day that use advanced search operators such as “allintitle:” or “allinurl:” might be seen as coming from automated traffic.
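Checking for that could be as simple as measuring what share of a searcher's queries contain an advanced operator. The sketch below only looks for the two operators named above; the pattern and function names are hypothetical, and any cutoff for "too many" would be a judgment call.

```python
import re

# Only the two operators mentioned in this post; a fuller list is left as an assumption.
OPERATOR_PATTERN = re.compile(r"\b(?:allintitle|allinurl):", re.IGNORECASE)


def operator_share(queries):
    """Fraction of queries that contain at least one advanced search operator."""
    if not queries:
        return 0.0
    hits = sum(1 for q in queries if OPERATOR_PATTERN.search(q))
    return hits / len(queries)
```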
It might be possible to assign categories to specific queries. When there are many searches from a specific user ID that fit into a small number of categories, those searches may be from automated programs.
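Assuming each query has already been given a category label by some classifier that isn't shown here, the concentration of a user ID's searches can be measured as the share taken up by its most common categories. The category names are invented for the example.

```python
from collections import Counter


def top_category_share(categories, k=2):
    """Share of queries that fall into the k most frequent categories."""
    if not categories:
        return 0.0
    counts = Counter(categories)
    in_top = sum(count for _, count in counts.most_common(k))
    return in_top / len(categories)


narrow = ["finance"] * 9 + ["sports"]
broad = ["finance", "sports", "travel", "health"]
```

A share near 1.0 across a large number of queries would point to the narrow focus described above.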
Reputations and trends
Searches from blocklisted IP addresses, blocked user agents, or particular country codes might indicate bot activity.
Some bots search for rare queries very frequently, and some bots may have sessions where each query appears nonsensical. Queries where users frequently click on results with very low probabilities of being clicked upon may also indicate automated searches and clicks.
Search Query Traffic Conclusion
I didn’t list all of the methods described in the patent filing for classifying search query traffic, and I expect that other patterns not included there may also be used to tell humans and bots apart.
The patent application tells us that it might label queries that it receives, but doesn’t tell us how it might use those classifications.
Like any webmaster, the people at search engines want to understand where their traffic is coming from and how their site is used. It’s possible that if a search engine believes that queries are being received from an automated program, it might challenge the source of those queries with something like a CAPTCHA to determine whether the searcher is a person or a program.