Building a Benign Data Set

Presented at BSidesLV 2017, July 26, 2017, 11 a.m. (25 minutes).

Though featurization is important, the datasets used to draw conclusions are just as important, if not more so. Information security researchers often cannot release data, which results in a lack of benchmark datasets and leaves cross-dataset generalization understudied in this domain. Despite this, the presence of dataset bias (especially negative set bias) is now common knowledge in machine learning for malware classification. For these reasons, we have developed a standard for benign datasets to be used for machine learning in the malware classification domain. We are also releasing a sample benign dataset designed to minimize these problems.


Presenters:

  • Rob Brandon - Security Researcher - Booz Allen Hamilton
    Rob is currently a security researcher with Booz Allen Hamilton Dark Labs. He has over a decade of experience in the security field, primarily in the areas of network traffic analysis, forensics, reverse engineering, and machine learning. Rob holds a Ph.D. in Computer Science from the University of Maryland, Baltimore County and a B.S. in Computer Science from the University of Maryland, University College. His research interests include novel ways to represent cybersecurity data and machine augmentation of human cognition.
  • John Seymour / Delta Zero - University of Maryland, Baltimore County
    John Seymour is a Senior Data Scientist at ZeroFOX, Inc. by day and a Ph.D. student at the University of Maryland, Baltimore County by night. He researches the intersection of machine learning and InfoSec in both roles. He's mostly interested in dataset bias (seriously, do people still use malware datasets from 1998?). He's also an admin of his local Python meetup, judges local coding competitions, consults for his university's Ethics Bowl team, and has spoken at DEF CON 23 and 24, BSidesCharm, BSidesLV, Black Hat USA, and SecTor.
