Google Malware Detection Using Document Classification


In the Google paper Predicting Bounce Rates in Sponsored Search Advertisements (pdf), we’re told about an experiment in which Google researchers applied a document classification model to sponsored advertisements and their landing pages to predict how often people who click on an ad in Google’s search results then leave the landing page very quickly. The same experiment is described in another Google paper, PLANET: Massively Parallel Learning of Tree Ensembles with MapReduce (pdf), which explains how Google might take an extremely large amount of observational data and use it to build classifiers that, among other things, could potentially help rank pages in organic search, as we’ve been told Google’s Panda updates do.

A patent about Google Malware Detection was granted today that appears to use a similar approach to determine whether sponsored advertisements in Google might lead to malware. The patent describes malware as malicious software that might be deceptively or automatically installed on a visitor’s computer when they arrive at a page. In addition to trojan horses and viruses, this can include monitoring software. In some instances, a landing page may be the first in a series of one or more redirections, which can include malware on the page or pages being redirected to. The need for such a classification approach comes about because of the sheer volume of advertisements that Google shows.

We know that Google’s Panda updates look for features on websites that indicate “quality” in some manner. Under the document classification approach in this patent, “intrusion features” are tested and weighted on landing pages.

This process involves Google taking a large number of landing pages and using a machine learning approach to examine all of the features that appear on those pages that might indicate the possibility of malware. This training set then can be used by the system to classify other landing pages. The malware detection approach may test pages and take appropriate actions when malware is detected such as suspending an advertiser’s account, flagging ads associated with the landing pages, rechecking those landing pages to see if they have been cleaned up, and enabling advertisers to have their accounts unsuspended (keep in mind that malware may be introduced to a site through someone who may have hacked the site rather than the advertiser themselves).

The patent provides a fair amount of detail on how a malware detection system might be implemented by Google, but my interest is in looking at the kinds of “intrusion features” that might be used to indicate that a landing page might contain malware. The classification approach described in this patent would be used as a first step in evaluating pages to predict when a landing page might contain malware or redirect to it. We know that Google purchased Green Border in 2007 to use that technology to protect browsers from malware when surfing the Web, and if the classification approach from this patent predicts the presence of malware, the page could be further tested by approaches similar to those used in the Green Border technology.

What kind of intrusion features might Google look for on landing pages or any pages that might redirect from those landing pages?

Those could include specific iframe features, URL features, or script features that are known to be associated with landing pages that include malware. If a certain feature score is reached when a page is evaluated, it would be classified as a candidate for further evaluation.

Here are some examples of the kinds of features that might be evaluated to determine if a site should be further reviewed to see if it contains malware:

  • Small iFrames that may be indicative of an attempt to embed other HTML documents (e.g., malware-related) inside a main document.
  • A bad or suspicious URL that may match a URL on a known list of malware-infected domains.
  • Suspicious script language containing certain function calls or language elements that are known to be used in serving malware.
  • Multiple frames, scripts or iFrames appearing in unusual places, such as at the end of the HTML.
  • A domain name that has been shown to host malware on other pages.
  • Geographical features, such as a domain being from Russia or China or any other country statistically known to have higher rates of infected domains.
  • A page with a domain originating in one country with an iframe that contains content originating from a geographically remote location.
  • Relative age of a domain and URLs or links to that domain (malware is often distributed from new sites).
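
To make a few of these features concrete, here is a minimal sketch of how some of them might be pulled out of a page’s HTML. This is my own illustration and not code from the patent: the names IntrusionFeatureParser and extract_features, the 10-pixel cutoff for a “small” iframe, and the known_bad_domains set are all assumptions made up for the example.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


class IntrusionFeatureParser(HTMLParser):
    """Collects a few hypothetical intrusion features from raw HTML."""

    def __init__(self, small_px=10):
        super().__init__()
        self.small_px = small_px  # assumed cutoff for a "small" iframe
        self.small_iframes = 0
        self.script_count = 0
        self.iframe_srcs = []

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if tag == "iframe":
            self.iframe_srcs.append(attr.get("src") or "")
            if self._is_small(attr.get("width")) or self._is_small(attr.get("height")):
                self.small_iframes += 1
        elif tag == "script":
            self.script_count += 1

    def _is_small(self, value):
        # Only purely numeric sizes are checked; "100%" and similar are ignored.
        return value is not None and value.strip().isdigit() and int(value) <= self.small_px


def extract_features(html, known_bad_domains):
    """Return a dict of feature name -> value for one landing page."""
    parser = IntrusionFeatureParser()
    parser.feed(html)
    bad_iframe = any(
        urlparse(src).hostname in known_bad_domains for src in parser.iframe_srcs
    )
    return {
        "small_iframes": parser.small_iframes,
        "script_count": parser.script_count,
        "bad_iframe_domain": bad_iframe,
    }


sample_html = (
    '<html><body><p>Hi</p>'
    '<iframe src="http://bad.example/x" width="1" height="1"></iframe>'
    '<script>eval(x)</script></body></html>'
)
features = extract_features(sample_html, known_bad_domains={"bad.example"})
print(features)
```

A real system would presumably look at many more signals (script contents, redirects, domain age, geography), but the idea of turning a page into a set of named feature values is the same.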

The amount of weight that each of these features carries may be different and may change over time as the system goes through and learns from training data.
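
As a rough illustration of how weighted features and a score threshold could work together, here is a small sketch. The feature names, the weights, and the threshold value are hypothetical numbers I picked for the example; the patent does not publish actual weights.

```python
def intrusion_score(features, weights):
    """Sum each feature value multiplied by its weight (unknown features count as 0)."""
    return sum(weights.get(name, 0.0) * float(value) for name, value in features.items())


def needs_review(features, weights, threshold):
    """A page becomes a candidate for further evaluation once its score reaches the threshold."""
    return intrusion_score(features, weights) >= threshold


# Hypothetical weights; in the patent these would be learned from training data.
example_weights = {"small_iframes": 1.5, "bad_iframe_domain": 3.0, "new_domain": 0.5}
example_page = {"small_iframes": 1, "bad_iframe_domain": True, "new_domain": False}

score = intrusion_score(example_page, example_weights)
flagged = needs_review(example_page, example_weights, threshold=3.0)
print(score, flagged)
```

No single feature has to decide the outcome here. A page is only sent on for a closer look when the combined weighted score is high enough, and retraining can change the weights without changing this logic.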

The Google Malware Detection patent is:

Intrusive feature classification model
Invented by Mark Palatucci, Panayiotis Mavrommatis, Niels Provos, Christopher K. Monson, Yunkai Zhou, Kamal P. Nigam, Clayton W. Bavor, Jr., Eric L. Davis, Rachel Nakauchi
Assigned to Google Inc.
US Patent 7,991,710
Granted August 2, 2011
Filed: March 3, 2008

Abstract

Landing pages associated with advertisements are partitioned into training landing pages and testing landing pages. Iterative training and testing of a classification model on intrusion features of the partitioned landing pages is conducted until the occurrence of a cessation event. Feature weights are derived from the iterative training and testing and are associated with the intrusion features. The associated feature weights and intrusion features can be used to classify other landing pages.
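
The loop described in the abstract (partition, train, test, stop on a cessation event, keep the weights) might be sketched like this. The abstract doesn’t name a learning algorithm, so the logistic regression update below is my own assumption, as are the function names, the “bias” feature, and the two cessation events used here (a target accuracy on the testing pages, or a maximum number of rounds).

```python
import math
import random


def predict(weights, features):
    """Estimated probability that a page is malicious."""
    z = sum(weights.get(name, 0.0) * value for name, value in features.items())
    z = max(-30.0, min(30.0, z))  # clamp to keep math.exp from overflowing
    return 1.0 / (1.0 + math.exp(-z))


def accuracy(weights, pages):
    correct = sum((predict(weights, f) >= 0.5) == bool(label) for f, label in pages)
    return correct / len(pages)


def train(pages, test_fraction=0.25, learning_rate=0.5,
          max_rounds=100, target_accuracy=1.0, seed=0):
    """pages is a list of (features, label) pairs, with label 1 for malware and 0 for clean."""
    shuffled = pages[:]
    random.Random(seed).shuffle(shuffled)
    split = max(1, int(len(shuffled) * test_fraction))
    testing, training = shuffled[:split], shuffled[split:]

    weights = {}
    rounds = 0
    for rounds in range(1, max_rounds + 1):
        for features, label in training:
            error = label - predict(weights, features)
            for name, value in features.items():
                weights[name] = weights.get(name, 0.0) + learning_rate * error * value
        # Cessation event: the testing pages are classified well enough.
        if accuracy(weights, testing) >= target_accuracy:
            break
    return weights, rounds


# Toy labeled data: pages with a small iframe are treated as malicious.
labeled_pages = [
    ({"bias": 1.0, "small_iframe": float(i % 2)}, i % 2) for i in range(8)
]
learned_weights, rounds_used = train(labeled_pages)
print(learned_weights, rounds_used)
```

The learned weights play the same role as the feature weights in the abstract: once training stops, they can be applied to landing pages that were never part of the training or testing sets.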

It’s possible that some of the features in this malware detection system may be used on webpages that Google comes across that aren’t tied to advertisement and landing pages, as part of Google’s Safe Browsing Diagnostic Program, since Google wouldn’t want to deliver searchers to pages that contain malware.

Conclusion

It’s interesting to see how Google may use a document classification approach, as described in the Predicting Bounce Rates paper, to try to help advertisers build more effective ads and landing pages, and how Google may use a similar approach to evaluate landing pages for malware. The paper was originally published in 2009 and the patent was filed in 2008, and we’ve been told by Google’s Matt Cutts and Amit Singhal that this was roughly the same time period in which work on the document classification system behind Panda started.

There has been a lot of speculation and guessing about what types of features might be involved in the classification of pages for quality in Google’s Panda, including things like the number, size, and location of advertisements, the ratio of advertisements to content, reading levels and originality of content, and many others. The number of actual features could be fairly large, and, like the intrusion features in this Google malware detection classification approach, the weight given to each feature may change over time based upon the training data used.


11 thoughts on “Google Malware Detection Using Document Classification”

  1. Thanks Bill – another great update!

    I like the idea of the document classification system to predict the propensity of a destination site to be hosting malware. Considering the high profile occurrences of sites being tampered with by unauthorised 3rd parties it seems like a good idea to have a way of analysing the landing page to have a first port of call as a preventative measure.

    The scoring system would be an interesting one to watch evolve – independently the majority of highlighted factors wouldn’t necessarily trigger a malware warning but the scoring system should mean that in combination it’s enough to prevent Google pointing to a Malware laden site.

    I think this patent could form a nice indicative example at some point in the future!

    Tom

  2. Hi Bill,
    Coincidentally, right before I read this post, I was reading Blue Coat Systems’ mid-year report, which shows that search engines rank well above porn and e-mail as the number 1 place to get infected with malware. It listed the malware delivery networks Schnakule, Ishabor and Nakinakindu as being among the top culprits. Regardless of how fun it is to say ‘Nakinakindu’ three times really fast, it seems the filter process laid out in this patent could dramatically decrease malware delivery systems’ presence on SERPs. Anyway, the Blue Coat report I was reading before this wasn’t very hopeful sounding, but this patent at least tells me that Google’s working on it, and has been since at least 2008. So thanks Bill for bringing this to light!
    Corey

  3. Great post Bill,

    Anything they can do to make the internet a safer place is good. Reducing your chance of getting malware is good. What will be interesting is how this is used with the Panda updates and what changes will come in the future.

  4. What I find most interesting about this is the amount of time from patent filing to patent granted. You addressed this in your conclusion; dealing with these issues all the time, is this lengthy period standard or is the patent process always this slow? I’d be interested if you have any numbers about typical patent duration from filing to acceptance or denial.

  5. Hello Bill –
    One of the factors you mention is a domain that was shown to have malware installed on it. If such a site is found, what kind of penalty is enacted? Is it totally removed from the index? Because hypothetically, say I wanted to buy a really good domain like “baseballbats.com” but the domain, under its past owner, had malware signals all over it (and was found by this intrusive feature update, and punished accordingly – before I took ownership). If I clean it up and submit a re-inclusion request, or something like that, will it be able to move out from under the shadow of its shady past or will it be forever tarnished by the malware that was once on it? What are your thoughts on that? Thanks!

  6. Hi Corey,

    Thanks. It definitely doesn’t hurt to make sure that you have good antivirus and anti-malware software on your computer these days when you surf the Web.

    Recently Google sent out a message about a particular infestation on Windows machines where they also showed a special warning highlighted in yellow to people whose computers had been infected, along with links and instructions to help them remove the malware. It was encouraging to see them take those steps as well.

  7. Hi Tom,

    Thank you. The classification system seems like a fairly intelligent approach towards finding and fighting malware without having to go through the steps of scanning every site on the Web with a much more rigorous evaluation – I’d imagine that could be pretty costly in terms of bandwidth and processing power.

    I’m not sure how much of a chance we might have to see how this system evolves – I would suspect that Google would want to keep many aspects of it quiet.

  8. Hi Scot,

    Good to see you. I agree completely about making the Web safer.

    I’m not sure if malware detection will ever become part of Panda, but the classification approach that it uses sounds similar in many ways to how Panda appears to work. Given that, I suspect that people working upon classification approaches at Google, whether for sponsored search, organic search, email spam filtering, malware detection, and possibly even deciding which data center to send a searcher to, probably have opportunities to learn from each other on topics like optimizing approaches like these for very large data sets and perhaps in other ways as well.

  9. Hi Dave,

    I’m not sure that we can read too much into how much time passes from the time a patent is filed until the point where it is granted. Many patents have some claims rejected during the patent process with the opportunity to appeal and/or amend those, and the whole process can often take years. Often, when a patent is filed, it needs to be published as a pending patent application within 18 months, and then it often takes more than a year before it’s granted. Based upon that kind of timeline, this particular patent was granted within 3 1/2 years, which is fairly quick.

    On the USPTO’s Frequently Asked Questions for Kids page, the answer to the question about how long it takes to get a patent is:

    At this time it takes an average of 22 months.

  10. Hi Mike,

    It appears that a warning message might show up in search results about malware, or pages might be removed from search results in some cases.

    Google does have a reinclusion process for sites that have been flagged as containing malware. See their help page: Request a malware review of your site.

    I think Google recognizes that sites containing malware often do so because they have been hacked, or that situations exist where someone else might now be in control of the site in question.

    It’s possible that the past history of malware on a site might play a role in a future evaluation flagging the site for a more in-depth review, but if that further review doesn’t find any malware, it probably wouldn’t impact the site negatively.

  11. Bill,

    Glad you’ve been digging that paper…one question it raises that is really perplexing to me is:
    Why is there an initial hump in Figure 1 (from 0 to .1)?

    It almost looks to me like the curve is the sum of two different underlying functions, so maybe there’s two different behaviors of users affecting it(?!?) An interesting mystery.

    – Ted
