Labeling the VirusShare Corpus: Lessons Learned

Presented at BSidesLV 2016, Aug. 3, 2016, 11:35 a.m. (55 minutes)

A machine learning researcher needs a good dataset to work with, but all of the publicly available malware datasets have major issues. We'll start by reviewing the basics of machine learning on malware: what works, what doesn't, and what data is out there. We'll introduce the VirusShare dataset, show how we fixed its labeling problem (using VirusTotal) so that it can be used for supervised machine learning, and discuss why this corpus should be used as a standard for machine learning research. Finally, we'll look at PySpark and how it can be used both to summarize the corpus and to find which chunks have high concentrations of particular malware families.
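
As a rough illustration of the chunk-level summary described above, the following sketch computes, for each chunk, the fraction of its samples belonging to each family, and then lists the chunks where a given family dominates. It uses plain Python as a stand-in for a PySpark job. The record layout (chunk name, family label), the sample values, and the function names are all assumptions made for illustration, not the talk's actual data or code.

```python
from collections import Counter, defaultdict

# Hypothetical labeled records: (chunk name, malware family label).
# Chunk names and family labels are illustrative placeholders only.
records = [
    ("chunk_a", "family_x"),
    ("chunk_a", "family_x"),
    ("chunk_a", "family_y"),
    ("chunk_b", "family_y"),
    ("chunk_b", "family_y"),
    ("chunk_b", "family_y"),
    ("chunk_b", "family_x"),
]


def family_concentration(records):
    """Return {chunk: {family: fraction of that chunk's samples}}."""
    counts = defaultdict(Counter)
    for chunk, family in records:
        counts[chunk][family] += 1

    result = {}
    for chunk, family_counts in counts.items():
        total = sum(family_counts.values())
        result[chunk] = {
            family: n / total for family, n in family_counts.items()
        }
    return result


def chunks_rich_in(records, family, threshold):
    """List chunks where `family` makes up at least `threshold` of samples."""
    concentration = family_concentration(records)
    return sorted(
        chunk
        for chunk, families in concentration.items()
        if families.get(family, 0.0) >= threshold
    )


if __name__ == "__main__":
    # Which chunks are at least half family_y?
    print(chunks_rich_in(records, "family_y", 0.5))
```

In PySpark, the counting step would more likely be expressed as a `groupBy` on the chunk and family columns followed by `count()`, which lets the same summary run across the full corpus rather than an in-memory list.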


Presenters:

  • John Seymour / Delta Zero - University of Maryland, Baltimore County
    John Seymour is a Senior Data Scientist at ZeroFOX, Inc. by day and a Ph.D. student at the University of Maryland, Baltimore County by night. He researches the intersection of machine learning and InfoSec in both roles. He's mostly interested in dataset bias (seriously, do people still use malware datasets from 1998?). He's also an admin of his local Python meetup, judges local coding competitions, consults for his university's Ethics Bowl team, and has spoken at DEF CON 23 and 24, BSidesCharm, BSidesLV, Black Hat USA, and SecTor.
