How Search Engines May Identify Malicious Web Sites
Unfortunately, there are web pages that can be harmful to visit. Google researchers discussed the identification of malicious code on web pages earlier this year in The Ghost In The Browser: Analysis of Web-based Malware (pdf).
The Google paper’s authors tell us that the focus of delivery of harmful code to computer users has shifted from software that someone installs, to software that is delivered directly to a browser via the Web.
Microsoft has also detailed some of the research that they've conducted on web-based malware in their Strider HoneyMonkey pages.
Search Engines and Malicious Web Sites
When a search engine delivers you to search results filled with a list of links to web pages, should it warn you about any potentially harmful or malicious code on those pages before you visit one of them?
If a search engine does scan pages for embedded code, what implications might that scanning have for site builders?
If a search engine were to show, within the search results listings, some kind of indication that a page contained embedded or potentially malicious code, would that affect which pages searchers visited?
What kinds of embedded code might a search engine look for, and how might it try to find it?
The Google paper discusses some of the efforts that Google undertakes to try to keep from delivering people to sites that attempt to serve malicious software to visitors:
Web sites that have been identified as malicious, using our verification procedure, are labeled as potentially harmful when returned as a search result. Marking pages with a label allows users to avoid exposure to such sites and results in fewer users being infected. Besides, we keep detailed statistics about detected web pages and keep track of identified malware binaries for later analysis.
Yahoo Patent Filing on Malicious Web Sites
A recent patent application from Yahoo goes into even more depth in exploring the identification of embedded and malicious code on web pages, and the display of warnings to searchers.
Search Early Warning
Invented by Edward F. Seitz
Assigned to Yahoo
United States Patent Application 20070294203
Published December 20, 2007
Filed: June 16, 2006
Abstract
Systems and methods for automatically delivering information to a user concerning the embedded code contained in a web page before the user downloads the web page are disclosed.
A search engine, in addition to performing a standard subject matter word search requested by a user, searches each web page to be listed to the user as part of the search results for information indicating that there is embedded code in the web page.
If it is determined that a web page contains embedded code, the search results graphical user interface is provided with additional information indicating to the user which web page in the results contains embedded code.
The user may also be alerted if a web page contains embedded code known to be malicious and the order of the search results may be modified based on the embedded code information of the web pages in the results.
Identifying Embedded Code on Malicious Web Sites
The patent application provides some interesting details on the process that they might follow to identify malicious code, including the use of a database of code that they’ve come across on the web that can be compared to code found upon newly crawled pages.
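The filing doesn't spell out that comparison at the code level, but a minimal sketch of the idea might fingerprint embedded scripts and look them up against stored signatures. The names here (known_bad_hashes, script_fingerprints) are mine, not the patent's, and the hash in the set is just a placeholder:

```python
import hashlib

# Hypothetical signature store: digests of script bodies previously verified
# as malicious. In practice this would be a database, not an in-memory set.
known_bad_hashes = {
    "9f86d081884c7d659a2feaa0c55ad015a3bf4f1b2b0b822cd15d6c15b0f00a08",
}

def script_fingerprints(scripts):
    """Hash each embedded script found on a crawled page."""
    for body in scripts:
        yield hashlib.sha256(body.encode("utf-8")).hexdigest()

def page_matches_known_malware(scripts):
    """True if any embedded script matches a signature seen before on the web."""
    return any(h in known_bad_hashes for h in script_fingerprints(scripts))
```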
The kind of scripts or other types of code that they are looking at can include ActiveX, Flash, Shockwave, JavaScript, and style sheets.
I was a little surprised by the inclusion of style sheets on that list, but they point out that embedded code might be contained within a style sheet, or be pointed to by a style sheet.
While there is a good amount of embedded code on web pages, the focus of the patent document is in finding malicious code that might cause some type of harm. The kinds of malicious code being referred to in this patent filing include the installation of dialers, spyware, or Trojan horses.
This process might begin with a web crawling program identifying elements such as an applet element, an object element, or many others. Identifying malicious code would also involve virtually rendering a web page, in addition to simply scanning the page for embedded code identifiers.
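As a rough illustration of that first scanning step, here is one way a crawler might flag embedded-code elements using Python's standard html.parser. The tag list is illustrative rather than taken from the filing, and the heavier virtual-rendering step is not shown:

```python
from html.parser import HTMLParser

# Element names that can signal embedded code; an illustrative list,
# not one taken from the patent filing.
EMBED_TAGS = {"script", "object", "applet", "embed", "style", "link"}

class EmbeddedCodeScanner(HTMLParser):
    """Records which embedded-code elements appear in a fetched page."""

    def __init__(self):
        super().__init__()
        self.found = set()

    def handle_starttag(self, tag, attrs):
        if tag in EMBED_TAGS:
            self.found.add(tag)

scanner = EmbeddedCodeScanner()
scanner.feed('<html><object data="movie.swf"></object><script>x=1;</script></html>')
print(scanner.found)  # e.g. {'object', 'script'}
```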
User Interface Icons
This system might use special icons or other indicators to show information to searchers on search results pages about what kind of code might be embedded upon a page. For instance, a page with an ActiveX control may use one icon, while a second icon might indicate that a Shockwave item is embedded in the page.
Under this system, there may be a way for searchers to decide whether they want to even see such icons. Another option might be for the search engine not to show search results that have certain types of icons and embedded code.
So, if a searcher didn't want to see pages that used Flash, JavaScript, or ActiveX components, they could purposefully filter those pages out of search results.
A searcher might also be able to choose to not be shown pages that contain “unsafe content.”
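Pulling those pieces together, a minimal sketch of how results might be annotated with per-code-type icons and filtered against a searcher's preferences could look like this. The icon labels and the result data model are my assumptions, not the patent's:

```python
# Hypothetical mapping from detected code types to result-page icons.
ICONS = {"activex": "[AX]", "shockwave": "[SW]", "flash": "[FL]", "javascript": "[JS]"}

def annotate_and_filter(results, blocked_types, hide_unsafe=True):
    """Attach icons for each embedded code type and drop results the searcher opted out of."""
    shown = []
    for page in results:  # each page: {"url": str, "code": set, "unsafe": bool}
        if hide_unsafe and page.get("unsafe"):
            continue                      # skip pages flagged with known-malicious code
        if page["code"] & blocked_types:
            continue                      # skip code types the searcher filtered out
        icons = [ICONS[c] for c in sorted(page["code"]) if c in ICONS]
        shown.append({**page, "icons": icons})
    return shown

results = [
    {"url": "https://example.com/a", "code": {"flash"}, "unsafe": False},
    {"url": "https://example.com/b", "code": {"activex"}, "unsafe": True},
]
print(annotate_and_filter(results, blocked_types={"javascript"}))
# [{'url': 'https://example.com/a', 'code': {'flash'}, 'unsafe': False, 'icons': ['[FL]']}]
```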
Conclusion
The use of cascading style sheets, JavaScript, Flash, and other code isn't that uncommon these days. What percentage of pages serve malicious software? The Google paper above tells us:
At the time of this writing, we have conducted an in-depth analysis of about 4.5 million URLs and found 450,000 URLs that were engaging in drive-by-downloads. Another 700,000 seemed malicious but had lower confidence. That means that about 10% of the URLs we analyzed were malicious and provides verification that our MapReduce created good candidate URLs.
But that was after a process that took several billion pages and pruned them down to millions. Many news reports and articles describing the Google paper took the numbers from the above quote and stated that 1 in 10 URLs were delivering malicious code to visitors, when in fact, measured against the Web as a whole, the percentage is much smaller than that.
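A quick back-of-the-envelope calculation shows why. The 4.5 million and 450,000 figures come from the paper; the billions figure below is an assumption, purely for illustration, since the paper only says "several billion":

```python
analyzed = 4_500_000        # URLs analyzed in depth (from the Google paper)
malicious = 450_000         # URLs confirmed doing drive-by downloads (from the paper)
crawled = 4_500_000_000     # assumed size of the original crawl ("several billion")

print(f"{malicious / analyzed:.0%}")   # 10% of the pre-filtered candidates
print(f"{malicious / crawled:.2%}")    # 0.01% of the assumed full crawl
```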
Regardless, a search engine delivering searchers to pages that install malware isn’t a good user experience.
Google added Badware Alerts to the Webmaster Central tools at the end of November last year, to let site owners know when the search engine was serving an interstitial malware warning to visitors. That post refers to the site StopBadware as a place that Google relies upon to identify sites that may be infected.
Earlier this year, when Google purchased the company GreenBorder, it appeared that they might have done so to address the downloading of malicious software.
Worth a read on this topic is a Google Groups thread discussing the interstitial malware warnings that appear after someone clicks on a link in search results, before the page is delivered to the searcher – Why did our site go through this headache?
Did you see that item about botnets filling the SERPs with malware hosted on .cn domain names? They can rank stuff at will, essentially. Scary as all hell… On a related note, I recently got hit by crap like that (which was ranking, and I clicked) and it cost me $260 in antivirus services… :(!
Pretty sure the Yahoo patent isn't worth the paper it's written on because of prior art such as McAfee's SiteAdvisor, which was founded in April 2005, a year before that patent was filed, and does basically the same thing for the visitor.
Additionally, I've been running a directory (with search) for 10 years, and my link scanner, which works much like a crawler, has detected some types of infected web pages for many years. The only difference from Yahoo's or Google's approach is that I delist the sites, so they simply don't exist in my listings until they are fixed.
Glad to see them cracking down but that patent is junk.
FYI, I can show you infected sites of Russian origin currently indexed in Google with no warning whatsoever and they’ve been indexed in Google for quite some time. It’s the garden variety scraper that redirects to pr0n sites with malicious trojans pretending to be “media viewers” to lock up your browser unless you click “YES” to accept the download. Quite nasty.
Hi Gab,
Yep. There seemed to be at least a couple of waves of that happening from sites hosted on .cn domains, too.
Sorry to hear about your experience. I know that it’s possible to be very diligent with anti-virus software and updates, and still have problems. I’m happy that search engines are trying to make an effort in this direction.
I probably should have mentioned the Google paper, The Anatomy of Clickbot.A. It's easy to focus upon the false clicks aspect of that, and ignore that the clicks were coming from compromised computers that had Browser Helper Objects (BHOs) installed upon them.
It shows another reason for a search engine having an interest in avoiding pages that might download malicious software – the kind that leads to widely distributed click fraud.
Hi Bill,
Thanks for stopping by and commenting. Always appreciate your posts on bots, and security on the web.
The SiteAdvisor plugins for IE and Firefox are definitely worth looking into for anyone concerned about safer browsing.
There’s somewhat of a chance that the Yahoo patent application may move on to become a granted patent, not so much because of how it identifies malicious code, but rather because of the way that it integrates detection into the crawling, indexing, and display processes that a search engine follows.
Regardless of whether it does or not, what I liked about the patent application was that it provided me with the opportunity to write about some of the issues surrounding how search engines may deal with malicious code, and with displaying (or not displaying) pages to searchers.
Should search engines feel responsible when they deliver searchers to pages that contain malicious code? If they do, will searchers feel safer if they know that a search engine is checking pages out before delivering them to those pages?
With the Google Clickbot.A paper, we see that search engines should also be concerned about malicious code that enables click fraud, so there's a very real concern that impacts them directly when it comes to code that might generate clicks on pages.
Interesting regarding the Russian sites. You would think that kind of redirect and installation of trojans should be detectable regardless of the language used.
Actually, the Russian pr0n pages are in English, claiming to be Pr0nTube, but it doesn't matter, as it's all visual, so a visitor would click what they wanted in any language. It's how the virus is hidden behind redirects, which is why I'm sure the search engines don't pick it up so easily.
Hopefully Yahoo won’t get this patent because I’m pretty sure this one will cause some ruckus in court.
One in ten sites (or pages?) being malicious is hard to believe. It sounds a bit like the Texas Sharpshooter, who would fire a rifle into the side of a barn, then draw a bulls-eye around his bullet holes. That the sample may have been deliberately chosen to provide a dramatic result, in other words.
Is it safe to assume there’s a difference between ’embedded’ code, and I guess you might call it regular code? Like you point out, javascript and plenty of other formats are very common these days.
Hi Forrest,
Good questions.
I should have written “URLs” there instead of “sites.” I imagine that some of the sites that contain malicious code may be dynamic sites, and probably have quite a few URLs attached to them.
The Google study described wasn't intended to report on how widespread malicious code was on the Web, but rather to see how effective that first filtering step (the MapReduce part) was – unfortunately, those results were taken up by the media and reported as if ten percent of the URLs on the Web contained harmful software.
For instance, the BBC reported that one in ten of the web pages scrutinized by Google contained malicious code.
If a page has regular code on it, I think under the Yahoo approach in the patent application, it’s still “embedded code.” It’s not necessarily harmful, or “malicious” code, but the process flags those pages that have some kind of code upon them, and may then follow up with another visit that might check to see if the code on those pages is malicious.
That may mean that they are checking a lot of websites for malicious code, especially if they are considering JavaScript and style sheets as being potentially harmful.
So why does Google ban directories that harm no one except Google's own purse (effect: fewer AdWords clicks), while leaving those malicious web sites alone, active, and turning up in the search results?
Ridiculous, if you ask me. This is clearly a case of it NOT being in their interest. If they can identify malicious software, why show a warning instead of simply banning those sites? The only answer could be that they are not so sure after all how to identify the software, and that showing or announcing an attempt at user protection sounds so holy.
Cheers,
Doris
I was checking my web logs and noticed traffic (45 visits) from an unknown site. I often click on these to see who has sent me traffic, and 45 visits was way ahead of any other site. When I got there, the site tried to download a video… Thank god I closed it out before that happened… Has anyone else noticed this?
“When a search engine delivers you to search results filled with a list of links to web pages, should it warn you about any potentially harmful or malicious code on those pages before you visit one of them?”
Definitely yes. We are not aware of malware that might be on a site before we open it. A warning from the search engines would be very useful. I see Google labeling some sites as harmful. It would be great if it actually changed the color of the link in the search results so that it could be seen very easily.
Hi rcplinks,
Some kind of warning would be nice. I know that the McAfee SiteAdvisor plugin puts buttons next to results providing some indication of what they feel about a page in search results.
I wonder how people would respond if they had an option in the Google tool bar to see warnings like that in their search results.
Some very good points made, but how many sites carry a warning and don't contain any potentially harmful or malicious code? What can be done to have an incorrect warning removed? This has to be handled correctly, as having a site labeled incorrectly would be a nightmare for the owner of the site.
Hi Opseo,
Unfortunately, false positives do happen when search engines carry warnings about sites and the potentially malicious code that may exist upon them. Such a warning can be harmful to a site that doesn't actually contain such code or engage in activities that can harm others.
What recourse do the search engines provide in that situation?
Google explains how a site might request a human review of their site on this page:
My site’s been hacked
Instructions for requesting a review are on that page.
Microsoft Live provides information about their malicious code warnings for pages, and instructions on how to stop warnings in this blog post:
Live Search Webmaster Center Fall Update
Yahoo has partnered with McAfee in their approach to identifying malicious code on Web pages. The instructions for having a review of a page that has been determined to be a security risk can be found here:
My Site is marked with a warning, why? What can I do about it?