De-anonymizing Programmers: Large Scale Authorship Attribution from Executable Binaries of Compiled Code and Source Code

Presented at 32C3 (2015), Dec. 29, 2015, 5:15 p.m. (60 minutes).

Last year I presented research showing how to de-anonymize programmers based on their coding style. This is of immediate concern to open source software developers who would like to remain anonymous. On the other hand, being able to de-anonymize programmers can help in forensic investigations, or in resolving plagiarism claims or copyright disputes. I will report on our new research findings from the past year. We were able to increase the scale and accuracy of our methods dramatically and can now handle 1,600 programmers, reaching 94% de-anonymization accuracy. In ongoing research, we are tackling the much harder problem of de-anonymizing programmers from binaries of compiled code. This can help identify the author of a suspicious executable file and can potentially aid malware forensics. We demonstrate the efficacy of our techniques using a dataset collected from GitHub.

It is possible to identify individuals by de-anonymizing many types of large datasets. Once individuals are de-anonymized, personal details can be extracted from the data that belongs to them, and their identities can be linked across different platforms. This is done with machine learning methods that represent human-generated data as a numeric vector of features; a classifier then learns each individual's patterns and uses them to classify previously unseen feature vectors (a minimal sketch of this pipeline is given below). Tor users, social networks, underground cybercrime forums, and the Netflix dataset have all been de-anonymized in the past five years. Advances in machine learning and improvements in computational power, such as cloud computing services, make these large-scale de-anonymization tasks feasible. As data aggregators collect vast amounts of data from every digital channel and computing power becomes cheaper, de-anonymization threatens privacy on a daily basis.

Last year, we showed how to de-anonymize programmers from their source code. This is an immediate concern for programmers who would like to remain anonymous. (Remember Saeed Malekpour, who was sentenced to death after the Iranian government identified him as the web programmer of a porn site.) Since last year's talk on identifying source code authors via stylometry, we have scaled our method to 1,600 programmers, reaching 94% accuracy in correctly identifying the authors of 14,400 source code samples. These results are a breakthrough in accuracy and magnitude compared to related work.

This year we have been focusing on de-anonymizing programmers from binaries of compiled code. Identifying stylistic fingerprints in binaries is much more difficult than in source code: compilation loses some stylistic fingerprints in translation, while others survive. We reach 65% accuracy, again a breakthrough, in de-anonymizing binaries from 100 authors.

De-anonymization is a threat to privacy, but it has many security-enhancing applications. Identifying authors of source code helps resolve plagiarism claims, forensic investigations, and copyright-copyleft disputes. Identifying authors of binaries can help attribute a suspicious executable file, and could potentially be extended to malware classification. We show how source code and binary authorship attribution work on real-world datasets collected from GitHub.
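To make this pipeline concrete, here is a minimal sketch in Python. It is an illustration under simplifying assumptions, not the research code: the helper names (extract_features, train, attribute) are hypothetical, and the handful of lexical and layout ratios stand in for the much larger feature set (including syntactic features) that a real stylometry system would use.

    # Minimal stylometry sketch: map each source file to a numeric feature
    # vector, then train a classifier to attribute unseen samples.
    import re
    from sklearn.ensemble import RandomForestClassifier

    def extract_features(code: str) -> list[float]:
        """Turn one source file into a fixed-length stylistic feature vector."""
        lines = code.splitlines() or [""]
        n_chars = max(len(code), 1)
        return [
            len(lines),                                       # file length
            sum(len(l) for l in lines) / len(lines),          # mean line length
            code.count("\t") / n_chars,                       # tab density
            code.count(" ") / n_chars,                        # space density
            len(re.findall(r"//|/\*|#", code)) / len(lines),  # comment density
            len(set(re.findall(r"[A-Za-z_]\w*", code))) / n_chars,  # vocabulary richness
        ]

    def train(samples: list[str], authors: list[str]) -> RandomForestClassifier:
        """Learn per-author stylistic patterns from labeled code samples."""
        clf = RandomForestClassifier(n_estimators=300, random_state=0)
        clf.fit([extract_features(s) for s in samples], authors)
        return clf

    def attribute(clf: RandomForestClassifier, unknown: str) -> str:
        """Predict the most likely author of a previously unseen sample."""
        return clf.predict([extract_features(unknown)])[0]

A random forest is used here because tree ensembles tend to cope well with large, heterogeneous stylometric feature sets; any multi-class classifier could be substituted.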
I hope this talk raises awareness of the dangers of de-anonymization while showing how it can help resolve conflicts in other areas. Binary de-anonymization could potentially enhance security by identifying malicious actors such as malware writers or software thieves. I would like to conclude by mentioning two future directions. First, can binary de-anonymization be used for malware family classification and be incorporated into virus detectors? Second, obfuscators are not a countermeasure to de-anonymizing programmers: we can identify the authors of obfuscated code with high accuracy. There is therefore an immediate need for a code anonymization framework, especially for all the open source software developers who would like to remain anonymous.
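On the binary side, one feature family that survives compilation can be sketched the same way: disassemble the code section and count instruction-mnemonic n-grams. This is a minimal sketch assuming the raw x86-64 .text bytes have already been carved out of the executable (e.g., with pyelftools) and that the Capstone disassembler is available; the function name mnemonic_ngrams is illustrative, and a real pipeline would combine several further feature classes.

    # Count instruction-mnemonic n-grams in a disassembled code section;
    # normalized counts become one slice of a per-binary feature vector.
    from collections import Counter
    from capstone import Cs, CS_ARCH_X86, CS_MODE_64

    def mnemonic_ngrams(text_bytes: bytes, n: int = 2) -> Counter:
        """Disassemble raw x86-64 code bytes and count mnemonic n-grams."""
        md = Cs(CS_ARCH_X86, CS_MODE_64)
        mnemonics = [insn.mnemonic for insn in md.disasm(text_bytes, 0x0)]
        return Counter(
            tuple(mnemonics[i : i + n]) for i in range(len(mnemonics) - n + 1)
        )

Such vectors feed the same train-and-classify loop as the source code sketch above, with the classifier learning which instruction patterns a given author's compiled code tends to produce.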

Presenters:

  • Aylin
    Aylin Caliskan-Islam is a Postdoctoral Research Associate at the Center for Information Technology Policy (CITP) at Princeton University. Her work in the two main realms of security and privacy involves the use of machine learning and natural language processing. In her previous work, she demonstrated that de-anonymization is possible through analyzing linguistic style in a variety of textual media, including social media, cyber criminal forums, and source code. She is currently extending her de-anonymization work to non-textual data such as binary files and developing countermeasures against de-anonymization. Aylin's other research interests include quantifying and classifying human privacy behavior and designing privacy nudges that help avoid private information disclosure. At Princeton, she works on text sanitization of sensitive documents for public disclosure, which can enable researchers to share data with linguists, sociologists, psychologists, and computer scientists without breaching the research subjects' privacy. Her work has been featured at prominent privacy and security conferences such as the USENIX Security Symposium, the IEEE Symposium on Security and Privacy, the Privacy Enhancing Technologies Symposium, and the Workshop on Privacy in the Electronic Society. In addition, she has given lectures and talks on privacy, security, and machine learning at the Chaos Communication Congress and Drexel University. She holds a PhD in Computer Science from Drexel University and a Master of Science in Robotics from the University of Pennsylvania.
