Presented at Black Hat USA 2019, Aug. 8, 2019, 9 a.m. (25 minutes).
Humans cannot keep pace with the volume of Threat Intelligence being generated. While the security community has mastered the use of machine-readable feeds from OSINT systems or third-party vendors, these usually provide IOCs or IOAs without contextual information. On the other hand, we have rich textual data that describes the operations of cyber attackers and their tools, tactics and procedures, contained in internal incident response reports, public blogs and white papers. Today, we can't automatically consume or use these data because they are composed of unstructured text. Threat analysts manually go through them to extract information about the adversaries most relevant to their threat model, but that manual work is a bottleneck in both time and cost. <br><br>In this project we will automate this process using Machine Learning. We will share how we can use ML for Custom Entity Extraction to automatically extract entities specific to the cyber security domain from unstructured text. We will also share how this system can be used to generate insights such as:<br><ul><li>Identifying patterns of attacks an enterprise may have faced</li><li>Analyzing the most effective attacker techniques against the enterprise being defended</li><li>Extracting trends in the techniques used across the overall ecosystem or a specific industry vertical</li></ul><br>These insights can be used to make data-backed decisions about where to invest in the defenses of an enterprise. In this talk we will describe our solution for building an entity extraction system from public domain text specific to the security domain, using open-source ML tooling. 
The goal is to enable applied researchers to extract TI insights automatically, at scale and in real time.<br><br>We will cover:<br><ul><li>The importance of this process for threat intelligence, with some examples of actionable insights this research can provide</li><li>The overall architecture of the system and the ML principles used</li><li>How we automatically created a training dataset for our domain using a dictionary of entities</li><li>The supervised and unsupervised featurization methods we experimented with</li><li>Experiments and results from statistical modeling methods and deep learning methods</li><li>Recommendations and resources for applied researchers who may want to implement their own TI extraction pipeline</li></ul>
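To make the dictionary-based dataset creation step more concrete, here is a minimal sketch of one common way to do it: match a dictionary of known entities against raw text and emit BIO tags that a sequence model could later train on. This is an illustration under stated assumptions, not the presenter's actual pipeline. The dictionary contents, the label names (`MALWARE`, `TOOL`, `TECHNIQUE`), the tokenizer, and the function names are all hypothetical.

```python
import re

# Hypothetical entity dictionary: entries and label names are illustrative
# assumptions, not taken from the talk.
ENTITY_DICT = {
    "emotet": "MALWARE",
    "cobalt strike": "TOOL",
    "spear phishing": "TECHNIQUE",
}


def tokenize(text):
    """Split text into word tokens and single punctuation tokens."""
    return re.findall(r"\w+|[^\w\s]", text)


def label_tokens(text, entity_dict):
    """Return (token, BIO-label) pairs using case-insensitive dictionary matching."""
    tokens = tokenize(text)
    lowered = [t.lower() for t in tokens]
    labels = ["O"] * len(tokens)

    # Try longer dictionary phrases first so multi-word entities win.
    entries = sorted(
        ((tokenize(name.lower()), label) for name, label in entity_dict.items()),
        key=lambda entry: -len(entry[0]),
    )

    i = 0
    while i < len(tokens):
        for phrase, label in entries:
            n = len(phrase)
            if lowered[i:i + n] == phrase:
                labels[i] = "B-" + label          # first token of the entity
                for j in range(i + 1, i + n):
                    labels[j] = "I-" + label      # continuation tokens
                i += n
                break
        else:
            i += 1  # no dictionary entry starts at this token

    return list(zip(tokens, labels))


if __name__ == "__main__":
    sentence = "The actor used spear phishing to deliver Emotet."
    for token, label in label_tokens(sentence, ENTITY_DICT):
        print(token, label)
```

Labels produced this way are noisy: exact matching misses spelling variants and mislabels ambiguous words, so a real pipeline would likely need some filtering or review before training on them.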
Presenters:
- Bhavna Soman, Security Researcher, Microsoft
Bhavna Soman is a Security Researcher working for the Windows Defender Research Team. In her day job, she develops Machine Learning models to classify malware in real time. In the past she worked in the field of Threat Intelligence. This project combines her experience in Threat Intelligence with her expertise in Machine Learning and Natural Language Processing. Bhavna holds a master's degree in Computer Security from Georgia Tech and is also a trainer for Malware Reverse Engineering with Blackhoodie.