Detecting malware callouts in realtime network traffic

Presented at BSides Austin 2017, May 4, 2017, 10:30 a.m. (60 minutes)

Malware that uses a domain generation algorithm (DGA) makes callouts to unique web addresses to avoid detection by static rules engines. To counter this type of malware, we created an ensemble model that analyzes domains and evaluates whether they were generated by a machine and are thus potentially malicious. The model works entirely on the URL being accessed, thereby eliminating the need for DNS data, which can be difficult to access in large organizations. The ensemble consists of a transliteration pipeline to handle non-English-language domains, an NLP-based linguistic entropy algorithm, and a collocation and linear word embeddings algorithm to identify dictionary DGAs. We are also researching sequence-based machine learning analysis to detect dictionary DGAs. Our system analyzes enterprise-scale network traffic in real time, renders predictions, and raises alerts for cyber-security analysts to evaluate. This talk will discuss the machine learning algorithms that were used to build the model, the features that we found to be informative, and the tools used in model creation and testing. We will then present the tools leveraged in building our model-as-a-service architecture for low-latency stream processing of high-velocity, high-volume traffic. The talk will appeal to those interested in the modeling and data engineering challenges around building a machine learning pipeline for malware detection.
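
To make the idea of scoring a domain from the URL alone more concrete, here is a minimal sketch of one very simple signal: the Shannon entropy of the characters in a domain label. This is not the presenters' linguistic entropy algorithm, which the abstract does not specify; the function names (domain_label, char_entropy, entropy_score) are hypothetical, and picking the label just left of the TLD is a simplifying assumption that ignores multi-part suffixes such as co.uk.

```python
import math
from collections import Counter
from urllib.parse import urlsplit


def domain_label(url: str) -> str:
    """Return the label just left of the TLD.

    Assumption for illustration: a real system would consult a public
    suffix list so that suffixes like co.uk are handled correctly.
    """
    # urlsplit only finds a hostname when a scheme or leading "//" is present.
    host = urlsplit(url if "://" in url else "//" + url).hostname or ""
    labels = host.split(".")
    return labels[-2] if len(labels) >= 2 else host


def char_entropy(text: str) -> float:
    """Shannon entropy of the characters in text, in bits per character."""
    if not text:
        return 0.0
    counts = Counter(text)
    total = len(text)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def entropy_score(url: str) -> float:
    """Higher scores suggest a more random-looking domain label."""
    return char_entropy(domain_label(url))
```

A random-looking label such as xkqjzvbnwp scores higher than a label such as google, because its characters are spread more evenly. Character entropy alone would likely miss dictionary DGAs, whose domains are built from real words, which is presumably why the ensemble described above includes separate components for them.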


Presenters:

  • Domenic Puzio
    Domenic Puzio is a Data Engineer with Capital One. He graduated from the University of Virginia with degrees in Mathematics and Computer Science. On his current project, he is a core developer of a custom platform for ingesting, processing, and analyzing Capital One's cyber-security data sources. Built entirely from open-source tools (NiFi, Kafka, Storm, Elasticsearch, Kibana), this framework processes hundreds of millions of events per hour. Currently, his focus is on the creation and productionization of machine learning models that provide enrichment to the data being streamed through the system. He is a contributor to two Apache projects, and his research interests include natural language processing and deep learning.
