BinaryPig - Scalable Malware Analytics in Hadoop

Presented at Black Hat USA 2013, July 31, 2013, 5 p.m. (30 minutes).

Over the past 2.5 years Endgame received 20M samples of malware equating to roughly 9.5 TB of binary data. In this, we're not alone. McAfee reports that it currently receives roughly 100,000 malware samples per day and received roughly 10M samples in the last quarter of 2012 [1]. Its total corpus is estimated to be about 100M samples. VirusTotal receives between 300k and 600k unique files per day, and of those roughly one-third to half are positively identified as malware [2].

This huge volume of malware offers both challenges and opportunities for security research, especially applied machine learning. Endgame performs static analysis on malware to extract feature sets for large-scale machine learning. Because malware research has traditionally been the domain of reverse engineers, most existing malware analysis tools were designed to process a single binary, or multiple binaries on a single computer, and are unprepared to confront terabytes of malware at once. There is no easy way for security researchers to apply static analysis techniques at scale; companies and individuals that want to pursue this path are forced to create their own solutions.

Our early attempts to process this data did not scale well with the increasing flood of samples. As the size of our malware collection increased, the system became unwieldy and hard to manage, especially in the face of hardware failures. Over the past two years we refined this system into a dedicated framework based on Hadoop so that our large-scale studies are easier to perform and are more repeatable over an expanding dataset.

To address this problem, we will present our open framework, BinaryPig, as well as some example uses of this technology to perform a multiyear, multi-terabyte, multimillion-sample malware census. This framework is built over Apache Hadoop, Apache Pig, and Python. It addresses many issues of scalable malware processing, including dealing with increasingly large data sizes, improving workflow development speed, and enabling parallel processing of binary files with most pre-existing tools. It is also modular and extensible, in the hope that it will aid security researchers and academics in handling ever-larger amounts of malware.
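
As a rough illustration of the kind of per-sample static analysis such a pipeline could run in parallel, the Python sketch below computes a few simple features (SHA-256 digest, size, and byte entropy) for each binary. It is not BinaryPig's API: the function names `byte_entropy`, `extract_features`, and `analyze_batch` are hypothetical, and the feature set is an assumption chosen only to keep the example short.

```python
import hashlib
import math
from collections import Counter


def byte_entropy(data: bytes) -> float:
    """Shannon entropy of the byte distribution, in bits per byte (0.0 to 8.0)."""
    if not data:
        return 0.0
    total = len(data)
    counts = Counter(data)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def extract_features(name: str, data: bytes) -> dict:
    """Hypothetical per-sample feature extractor (not part of BinaryPig)."""
    return {
        "name": name,
        "sha256": hashlib.sha256(data).hexdigest(),
        "size": len(data),
        "entropy": byte_entropy(data),
    }


def analyze_batch(samples):
    """Apply the extractor to an iterable of (name, bytes) pairs.

    Each sample is processed independently, which is what makes this kind
    of work easy to distribute across many workers.
    """
    return [extract_features(name, data) for name, data in samples]
```

Because each sample is handled independently of the others, a function like `extract_features` could in principle be run once per binary by whatever distributed framework feeds it the file contents.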

In addition, we will demonstrate the results of our exploration and the techniques used to derive these results. The framework, analysis modules, and some example applications will be released as open source (Apache 2.0 License) at Black Hat.

[1] http://www.darkreading.com/identityandaccessmanagement/167901114/security/attacksbreaches/240006702/mcafeecloseto100knewmalwaresamplesperdayinq2.html

[2] https://www.virustotal.com/en/statistics/ as of 4/9/2013


Presenters:

  • Zachary Hanif - Endgame
    Zachary Hanif is a Senior Researcher at Endgame. He currently works to create powerful analytics within batch and real-time data processing engines through applied statistics and rapid correlation. His research interests revolve around applications of machine learning and graph mining within the realm of massive security data.
  • Telvis Calhoun - Endgame
    Telvis Calhoun is a software engineer at Endgame Inc. with many years of experience at commercial security companies. His expertise is in building distributed server-side applications. While completing his M.S., he was a member of the Communications Assurance and Performance Group at Georgia State University, where he published research on wireless security. Telvis blogs about mining twitter.com using Twitter Storm, Hadoop, Elasticsearch, and whatever data analytics hotness catches his eye. His goal is simply to be challenged by his work, work with great people, and build great products.
  • Jason Trost - Endgame
    Jason Trost is a Software Engineer at Endgame and is deeply interested in Big Data/cloud computing and machine learning. He has several years of experience working with Hadoop, MapReduce, Accumulo, and more recently Twitter's Storm project. He is currently focused on building highly scalable systems for processing, analyzing, and visualizing high-speed network/security events in real time, as well as systems for analyzing massive amounts of malware. He is a regular attendee of Big Data and security conferences, and he has spoken at Hadoop Summit and FloCon.
