The Security Wolf of Wall Street: Fighting Crime with High-Frequency Classification and Natural Language Processing

Presented at Black Hat Asia 2016

In a world where threat actors move fast and the Internet evolves in a non-deterministic fashion, turning threat intelligence into automated protection has proven to be a challenge for the information security industry. While traditional threat research methods will never go away, there is an increasing need for powerful decision models that can process data in real time and scale to incorporate increasingly rich sources of threat intel. This talk will focus on one way to build a scalable machine learning infrastructure that operates in real time on a massive amount of DNS data (approximately 80B queries per day).

In this talk, we will offer a sneak peek into how OpenDNS does scalable data science. We will touch on two core components, Big Data engineering and Big Data science, and specifically how they are used to implement a real-time threat detection system for large-scale network traffic.

To begin, we will detail Avalanche, a stream processing framework that helps OpenDNS data scientists create their own data processing pipelines using a modular, graph-oriented representation. Each node acts as a data stream processor running as a process, thread, or EC2 instance. In this graph, the edges represent streaming channels connecting the different inputs and outputs of the nodes. The whole data pipeline can then easily be scaled and deployed to hundreds of instances in the AWS cloud.
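
The node-and-channel idea described above can be illustrated with a minimal sketch. This is not Avalanche's actual API, which the abstract does not show. It assumes the simplest possible graph (a linear chain), runs each node as a thread, and uses in-memory queues as the streaming channels. The names `Node`, `run_pipeline`, `normalize`, `drop_internal`, and `tag_length` are all hypothetical.

```python
import queue
import threading

_STOP = object()  # sentinel marking the end of a stream


class Node(threading.Thread):
    """A stream processor: reads from an input channel, writes to an output channel."""

    def __init__(self, func, inbox, outbox):
        super().__init__(daemon=True)
        self.func = func
        self.inbox = inbox
        self.outbox = outbox

    def run(self):
        while True:
            item = self.inbox.get()
            if item is _STOP:
                # Propagate end-of-stream downstream, then exit.
                self.outbox.put(_STOP)
                return
            result = self.func(item)
            if result is not None:  # None means "drop this item"
                self.outbox.put(result)


def run_pipeline(funcs, items):
    """Wire the functions into a chain of nodes joined by queue edges."""
    channels = [queue.Queue() for _ in range(len(funcs) + 1)]
    nodes = [Node(f, channels[i], channels[i + 1]) for i, f in enumerate(funcs)]
    for node in nodes:
        node.start()
    for item in items:
        channels[0].put(item)
    channels[0].put(_STOP)
    results = []
    while True:
        item = channels[-1].get()
        if item is _STOP:
            break
        results.append(item)
    for node in nodes:
        node.join()
    return results


# Hypothetical processing steps for a stream of DNS query names.
def normalize(name):
    return name.strip().lower().rstrip(".")


def drop_internal(name):
    return None if name.endswith(".local") else name


def tag_length(name):
    return (name, len(name))


results = run_pipeline(
    [normalize, drop_internal, tag_length],
    ["Example.COM.", "printer.local", " login.example.net "],
)
print(results)
```

In a deployment like the one the abstract describes, the thread would be swapped for a process or an EC2 instance and the in-memory queue for a network channel, while the graph description itself stays the same.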

The Avalanche project's paradigm is to take the approach that the finance world has been using for decades in high-frequency and quantitative trading and apply it to traffic analysis. Applying intelligent detection models as close as possible to the data source is the key to building a truly predictive security system, one where requests are classified and filtered on the fly. In our particular case at OpenDNS, we see a strong interest in integrating such a detection pipeline at the resolver level.

We will next discuss how we integrate our statistical model NLP-Rank (a model that does large-scale phishing detection) with Avalanche, and show some benchmarks. At its core, NLP-Rank is a fraud detection system that applies machine learning to the HTML content of a domain's web page to extract relevant terms and identify whether the content is potentially malicious. In this sense, we are automating the security analyst's decision-making process in judging whether a website is legitimate. Typically, when an analyst reviews a domain or URL in question, the analyst visits the site in a Tor browser, analyzes the content, and identifies the themes or summarizes the page before deciding whether it is a fake or a false positive.
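
To make the "extract relevant terms from HTML, then judge the content" step concrete, here is a minimal sketch using only the Python standard library. It is not NLP-Rank itself: the abstract does not disclose the model, so a simple keyword ratio stands in for the machine learning step. The `STOPWORDS` and `SUSPICIOUS_TERMS` vocabularies and the function names are assumptions made up for illustration.

```python
import re
from collections import Counter
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects visible text, skipping the contents of script and style tags."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


# Hypothetical vocabularies; a real system would learn these from a corpus.
STOPWORDS = {"the", "a", "to", "your", "and", "of", "in", "for", "is", "please"}
SUSPICIOUS_TERMS = {"login", "password", "verify", "account", "bank", "confirm"}


def extract_terms(html):
    """Return a Counter of lowercase words from the page's visible text."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    words = re.findall(r"[a-z]+", " ".join(parser.parts).lower())
    return Counter(w for w in words if w not in STOPWORDS)


def suspicion_score(terms):
    """Fraction of extracted terms that appear in the suspicious vocabulary."""
    total = sum(terms.values())
    if total == 0:
        return 0.0
    hits = sum(count for term, count in terms.items() if term in SUSPICIOUS_TERMS)
    return hits / total


page = (
    "<html><head><title>Verify your account</title>"
    "<script>var x = 'password';</script></head>"
    "<body><p>Please confirm your password to login today.</p></body></html>"
)
terms = extract_terms(page)
score = suspicion_score(terms)
print(terms.most_common(3), round(score, 2))
```

The score here only mimics the first pass an analyst makes when skimming a page for credential-harvesting themes. A production classifier would weigh far more signals than term frequency.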

In this talk, we will describe how we have automated this process at OpenDNS. We will also discuss the unique characteristics of NLP-Rank, including its machine learning techniques. Additionally, we will discuss the design and implementation of our phishing classification system. We will provide an overview of data preprocessing techniques and the information retrieval/natural language processing techniques used by our classifier. We will then discuss how Avalanche manages the results of NLP-Rank, how we add those results to our blocklists and our corpus, and Avalanche's overall performance.


Presenters:

  • Jeremiah O'Connor - OpenDNS
    Jeremiah O'Connor is a security data scientist at OpenDNS, where he focuses on building scalable threat detection models and writing software to solve real-world security problems. His current interests are in machine learning, natural language processing, distributed systems, and big data engineering. Prior to joining OpenDNS, he worked at Evernote and at Mandiant/FireEye. Jeremiah earned a Master's degree in Computer Science from the University of San Francisco in 2014. He has presented his research at ISOI APT 2015, Source Boston 2015, Source Seattle 2015, and at the Open Late and Text Mining meetups in San Francisco.
  • Thibault Reuille - OpenDNS
    Thibault Reuille is a security researcher at OpenDNS and creator of OpenGraphiti, an open-source 3D data visualization engine. Prior to OpenDNS, he was a software engineer at Nvidia, where he helped develop the Nvidia Parallel Nsight integrated development environment for GPU computing and graphics applications. Thibault holds a Master's in information technology from EPITA in Paris. He has presented at many information security events, including Virus Bulletin, Black Hat, CanSecWest, BayThreat, DEF CON, BSides SF, and the NASA Ames Cyber Security Turbo Talks.
