In a world where threat actors move fast and the Internet evolves in a non-deterministic fashion, turning threat intelligence into automated protection has proven to be a challenge for the information security industry. While traditional threat research methods will never go away, there is an increasing need for powerful decision models that can process data in a real-time fashion and scale to incorporate increasingly-rich sources of threat intel. This talk will focus on one way to build a scalable machine learning infrastructure in real-time on a massive amount of DNS data (approximately 80B queries per day).
In this talk, we will offer a sneak peek into how OpenDNS does scalable data science. We will touch on two core components, Big Data engineering and Big Data science, and specifically how they are used to implement a real-time threat detection systems for large-scale network traffic.
To begin, we will detail Avalanche, a stream processing framework that helps OpenDNS data scientists create their own data processing pipelines using a modular graph-oriented representation. Each node acts as a data stream processor running as a process, thread or EC2 instance. In this graph database, the edges represent streaming channels connecting the different inputs and outputs of the nodes. The whole data pipeline can then easily be scaled and deployed to hundreds of instances in an AWS cloud.
The Avalanche project's paradigm is to translate the approach that the finance world has been using for decades in high frequency or quantitative trading and apply it to traffic analysis. Applying intelligent detection models as close as possible to the data source holds the key to build a truly predictive security system, one where requests are classified and filtered on the fly. In our particular case at OpenDNS, we see a strong interest in integrating such a detection pipeline at the resolver level.
We will next discuss how we integrate our statistical model NLP-Rank (a model that does large scale phishing detection) with Avalanche, and show some benchmarks. At its core, NLP-Rank is a fraud detection system that applies machine learning to the HTML content of a domain's web page to extract relevant terms and identify whether the content is potentially malicious or not. In this sense we are automating the security analyst's decision-making process in judging whether a website is legitimate or not. Typically when an analyst performs a review for a domain or URL in question, the analyst visits the site in a TOR browser, analyzes the content, and identifies the themes/summarize the page before deciding whether it's a fake or a false positive.
In this talk, we will describe how we have automated this process at OpenDNS. We will also discuss the unique characteristics of NLP-Rank, including its machine learning techniques. Additionally, we will discuss the design and implementation of our phishing classification system. We will provide an overview of data preprocessing techniques and the information retrieval/natural language processing techniques used by our classifier. We will then discuss how Avalanche manages the results of NLP-Rank, how we add those results to our blocklists and our corpus, and Avalanche's overall performance.