How safe are search engines? Should they be warning about malicious sites? A recent answer might surprise you.
Back in May, Ben Edelman wrote about Search Engine Safety. In part, he was writing about how paid search engine advertising for some products, like screensavers, may lead to sites that would put spyware on the computers of visitors who download those screensavers. He wrote more on that practice in a January post titled Pushing Spyware through Search.
He was also announcing a study that he had worked on with McAfee, about The Safety of Internet Search Engines. If you missed this report in May, it’s worth a visit. It discusses the safety of organic results through search engines, as well as paid results.
What’s a search engine to do?
Viruses aren't anything new, and most internet-literate folks now recognize some of the potential ways that viruses may spread, and know the risks of opening unsolicited emails, opening attachments from unknown sources, and visiting the websites listed in those emails.
But most people don’t think twice about following a link from a search engine to a listed result. As the study above on search engine safety describes, maybe they should be concerned about visits to malicious sites.
Should search engines explore ways to avoid leading people to sites filled with harmful scripts and viruses? Do they have a responsibility in keeping people away from such sites? Does it help them maintain a level of trust with their users in doing so?
An Approach from Microsoft
A new Microsoft patent application about malicious sites explores search engine safety in the context of organic results.
System and method for utilizing a search engine to prevent contamination
Invented by Art Shelest and Eytan D. Seidman
Assigned to Microsoft
US Patent Application 20060136374
Published June 22, 2006
Filed: December 17, 2004
Abstract
A system and method are incorporated within a search engine for preventing proliferation of malicious searchable content. The system includes a detection mechanism for detecting malicious searchable content within searchable content traversed by a web crawler. The system additionally includes a presentation mechanism for handling the detected malicious searchable content upon determination that the malicious searchable content is included in search results provided by the search engine. The presentation mechanism handles the detected malicious searchable content in order to prevent proliferation of the malicious searchable content to a receiver of the search results.
How would a search engine go about trying to identify malicious sites and keeping searchers from being harmed by those webpages? There are a couple of different potential approaches, which could even be combined.
One way would be to look for malicious sites during a crawl of websites: a detection mechanism would look for potentially harmful pages and identify them. Another would be to notice harmful pages in real time, as the results of a search query are returned. At the time results are presented, the search engine might display a flagged link in a number of different ways to try to keep a searcher from being harmed. Some of these approaches might even keep the search engine from returning a link to the page at all.
A presentation module, as part of the process described in the patent, might do any of the following (a rough sketch of a few of these options appears after the list):
- Use a special code to tell the web browser to protect itself, for example by prepending an exclamation point to the malicious site link. So “http://www.example.com” would be served as “!http://www.example.com”, and the indicator could let the browser know to take protective actions, such as disabling selected macros.
- Modify the dangerous link to point to a proxy to shield the searcher’s computer from malicious activity.
- Modify the link to point to a disinfected cached copy of the page, stored by or on behalf of the search engine at the time of crawling.
- Present a modified link that points to a dynamically disinfected non-cached copy of the malicious site, where disinfecting occurs when the user selects the modified link.
- Create a warning to be shown to the user, indicating that the content behind the link may be malicious if accessed.
- Hide the dangerous link entirely, so that it isn't shown to the searcher.
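To make those options a little more concrete, here is a minimal sketch of how a presentation module might rewrite flagged links. None of this code comes from the patent; the function, the handling modes, and the proxy and cache URLs are all hypothetical illustrations.

```python
from urllib.parse import quote

PROXY_URL = "https://search.example.com/safe-proxy?target="   # hypothetical proxy endpoint
CACHE_URL = "https://search.example.com/clean-cache?target="  # hypothetical disinfected cache

def present_result(url: str, is_malicious: bool, mode: str = "warn") -> dict:
    """Return a search result entry adjusted according to the chosen handling mode."""
    if not is_malicious:
        return {"link": url}

    if mode == "browser_hint":
        # Prepend a marker the browser could interpret as "protect yourself".
        return {"link": "!" + url}
    if mode == "proxy":
        # Point the link at a proxy that shields the searcher's machine.
        return {"link": PROXY_URL + quote(url, safe="")}
    if mode == "cached_copy":
        # Point the link at a disinfected cached copy saved at crawl time.
        return {"link": CACHE_URL + quote(url, safe="")}
    if mode == "warn":
        # Keep the link but attach a warning to display next to it.
        return {"link": url, "warning": "This site may install harmful software."}
    # mode == "hide": drop the result entirely.
    return {}

print(present_result("http://www.example.com", True, mode="proxy"))
```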
An Active Detection Mechanism
The detection mechanism may include static analysis tools and also perform dynamic analysis. The static analysis tools would inspect visited pages and sites for known code patterns, and look for things like unnecessarily long HTML fields.
The dynamic analysis tools operate slightly differently, looking for known malicious behavior and traffic patterns. For example, they might see a malicious site initiating a connection back to a client computer on a port often associated with vulnerabilities, or the visited site might attempt to hack back into the search engine.
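As a rough illustration of what a static check during crawling might look like, here is a short sketch. The "known code patterns" and the length threshold are my own placeholders, not anything the patent specifies.

```python
import re

# Illustrative signatures only -- a real detector would rely on a maintained database.
KNOWN_BAD_PATTERNS = [
    re.compile(r"eval\(unescape\(", re.IGNORECASE),
    re.compile(r"document\.write\(unescape\(", re.IGNORECASE),
]
MAX_ATTRIBUTE_LENGTH = 2048  # arbitrary threshold for "unnecessarily long" HTML fields

def looks_malicious(html: str) -> bool:
    # Flag pages containing known code patterns.
    if any(pattern.search(html) for pattern in KNOWN_BAD_PATTERNS):
        return True
    # Flag unnecessarily long attribute values, which can hide overflow payloads.
    for match in re.finditer(r'=\s*"([^"]*)"', html):
        if len(match.group(1)) > MAX_ATTRIBUTE_LENGTH:
            return True
    return False
```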
According to the patent application:
The search engine should be well-defended and should be configured to appear as a regular user computer* to the visited web sites.
* My Emphasis.
A Presence Detection Method (Virtual Machine Approach)
Instead of a normal crawling approach, another method might be to use a disposable machine, such as a virtual machine, along with a disposable or virtual machine inspection mechanism. It would behave similarly to a Virtual PC program that allows Windows to run inside of Windows.
Something of a sandbox, it would operate independently of the host system, or primary machine, so that whatever happens inside the virtual machine doesn't have a detrimental impact upon the primary machine. The virtual machine includes a crawler that visits each website.
After each visit, a virtual machine inspection mechanism would check the crawler inside the virtual machine for infection or detrimental effects. Instead of looking for behavior on the visited websites, it examines the result of each visit to determine whether files or behaviors of the virtual machine have changed.
If the virtual machine is infected or compromised, then the visited web page or website is assumed to be malicious.
The disposable machine could also be a physical personal computer, but using a virtual machine has the advantage of recovering rapidly from an infected state.
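Here is how the snapshot-and-compare idea might look in code. This is a simplified sketch under the assumption of some sandbox or virtual machine wrapper; the sandbox object, its filesystem_root attribute, and its browse() and reset() methods are placeholders, not a real API.

```python
import hashlib
import os

def snapshot(root: str) -> dict:
    """Hash every file under root so later changes can be detected."""
    state = {}
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            with open(path, "rb") as f:
                state[path] = hashlib.sha256(f.read()).hexdigest()
    return state

def visit_and_inspect(sandbox, url: str) -> bool:
    """Visit a page inside the disposable machine and report whether it changed."""
    before = snapshot(sandbox.filesystem_root)
    sandbox.browse(url)            # let any malicious behavior play out in the sandbox
    after = snapshot(sandbox.filesystem_root)
    infected = before != after     # new, deleted, or altered files suggest compromise
    sandbox.reset()                # discard the disposable machine's state
    return infected
```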
Active and Presence Approaches Combined
The active approach prevents malicious behavior from occurring, while the presence approach allows the behavior to occur on the virtual machine and afterward ascertains whether the visited website was malicious. These methods could be combined, so that some visited sites that appear to be affected could be cached and analyzed after the crawling process.
Scope of Coverage
The web crawler might be set up to detect malicious behavior on a page-by-page or site-by-site basis, or it could look at individual web objects (e.g. embedded picture files), domain names, IP addresses, or other ways of grouping units of crawling.
For example, it used to be common that under a single domain name, many shared websites used a tilde to separate portions of a site into areas owned by individual users. So, http://www.example.com/users/~barney/demos/hack.htm is assumed to belong to user Barney, while http://www.example.com/users/~adam/index.htm is assumed to belong to user Adam. If something on Barney’s pages were determined to be malicious, that shouldn’t affect the area operated by Adam.
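A small sketch of that kind of grouping, purely as my own illustration: derive a "crawl unit" for each URL, so a malicious flag on Barney's area doesn't spill over to Adam's.

```python
from urllib.parse import urlparse

def crawl_unit(url: str) -> str:
    """Return the unit a URL belongs to: a per-user area if one is present, else the host."""
    parsed = urlparse(url)
    parts = [p for p in parsed.path.split("/") if p]
    for i, part in enumerate(parts):
        if part.startswith("~"):                      # e.g. /users/~barney/...
            return parsed.netloc + "/" + "/".join(parts[: i + 1])
    return parsed.netloc                              # default: the whole host is one unit

assert crawl_unit("http://www.example.com/users/~barney/demos/hack.htm") != \
       crawl_unit("http://www.example.com/users/~adam/index.htm")
```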
Detecting in Real Time
While the detection and presentation mechanisms can look for malicious activity during crawling and indexing, malicious activity can also be detected in real time. That's probably not a bad idea, considering that web pages can change after they've been indexed.
In a real-time approach, the presentation mechanism would present links redirecting the user to a proxy that dynamically detects and disinfects malicious web content. Pre-indexing detection and real-time detection could also be combined.
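A hedged sketch of that real-time variant: the result link points at a proxy, and the proxy fetches and "disinfects" the page at click time rather than at crawl time. The script-stripping here is deliberately crude and only illustrative of the idea.

```python
import re
import urllib.request

def strip_scripts(html: str) -> str:
    """A crude stand-in for 'disinfection': remove script blocks before serving the page."""
    return re.sub(r"<script.*?</script>", "", html, flags=re.IGNORECASE | re.DOTALL)

def proxy_fetch(target_url: str) -> str:
    """Fetch the target page at click time and return a cleaned copy to the searcher."""
    with urllib.request.urlopen(target_url) as response:   # fetched live, not from the index
        html = response.read().decode("utf-8", errors="replace")
    return strip_scripts(html)
```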
Conclusion
Looking for malicious sites on indexed pages sounds like it could be a resource-intensive process. Yet the McAfee study makes it sound like one that might be worth pursuing.
Is it the responsibility of search engines to protect us from potentially harmful and malicious sites, or does that fall to the makers of browsers or of anti-virus software?
Is there a potential risk to site owners if search engines adopt methods like this one? How likely are false positives?
What exactly is meant by a “malicious site”? The patent filing doesn't fully define its use of the word malicious. Does it mean viruses, or might it also include spyware?
Will paid search results also be included in a process like this? The patent application is silent on that subject.
If a search engine takes on the responsibility of filtering or limiting access to malicious sites, do they also then assume responsibility for notifying the owners of sites that they have identified as hosting malicious content?
On this subject, there’s an interesting service called Scandoo (http://www.scandoo.com) which is currently using the “Detecting in Real Time” method.
I blogged about this service a while ago –
http://www.interdigitalstrategies.com/blog/search-general/scandoo-making-your-searches-safe/
Interesting stuff!
Google is already taking a step into this area with their Safe Browsing option in the Google Toolbar. Limited in scope, but I doubt they'll stop working on it.
Also, McAfee acquired SiteAdvisor which is a service that rates web sites based on various safety measures. Every site gets assigned a color code e.g. green, yellow, red. Once you install the browser plug-in they provide, you can see these color codes in real-time in search engine results at Google, Yahoo, and MSN. You can also see the color code for the current site you’re on.
Thanks, Joe and Marios,
Scandoo is a great example, Joe. It’s pretty unintrusive, too. It goes a little further than what’s described here, in that it’s providing information about the kind of content on the pages to be visited in addition to information about potentially malicious software on the other side.
The safebrowsing option in the Google toolbar also seems like a good approach. I was reminded a little of the Netcraft anti-phishing toolbar, in that both are targeted at phishing as opposed to the possible presence of spyware or viruses.
The siteadvisor plugin allows for user comments and reviews, which leads to an interesting mix at sites like eBay.
It might not hurt, if you own a site, to check on how these programs view your site. Siteadvisor lets people take an online look at its results for pages it has tested.
Hi Bill,
Last May I posted on this matter and coined a new term: Saferank (after PageRank and TrustRank, why not?), since I think that safe search could really soon become the next big thing ( http://adscriptum.blogspot.com/2006/05/after-google-pagerank-and-google.html )
Jean-Marie
Hi Jean-Marie,
Nice article. I think that the siteadvisor report that Ben Edelman worked upon was an eye-opening one for many people, and the launch of Scandoo was a great idea.
I wonder how much importance Google and Yahoo will place on this type of research and filtering of the web. We did see a patent application come out from Google not very long ago involving identifying spam in emails, based upon the links in those emails.
Is the next reasonable step to look for viruses, spyware, and phishing attacks in emails and on web pages? They are making decisions about spam in emails based upon classifications they make about web pages – in effect some type of ranking. So, you may be right – some type of saferank could be part of the near future.
The Go Daddy patent application is interesting. In a number of ways, it's very similar to the Google patent application, but each has its own focus and goals. Should be a good post. I'll look forward to it.
Bill,
Actually, I'm writing a post about a few patents, including this one by GoDaddy: No. 20060129644, Email Filtering System and Method. Abstract: “Systems and methods of the present invention allow filtering out spam and phishing email messages based on the links embedded into the email messages. In a preferred embodiment, an Email Filter extracts links from the email message and obtains desirability values for the links. The Email Filter may route the email message based on desirability values. Such routing includes delivering the email message to a Recipient, delivering the message to a Quarantine Mailbox, or deleting the message.”
It’s another step.
J-M