Developers sometimes statically link libraries from other projects, maintain an internal copy of other software, or fork development of an existing project. This practice can lead to software vulnerabilities when the embedded code is not kept up to date with upstream sources. Linux vendors have therefore relied on manual techniques to track embedded code and identify vulnerabilities. We propose an automated solution to identify embedded packages, which we call package clones, without any prior knowledge of these relationships. Our approach finds similar source files based on file names and content in order to infer relationships between packages. We extract these and other features to perform statistical classification using machine learning. We evaluated our automated system, named Clonewise, against Debian's manually created database.
Clonewise had a 68% true positive rate and a false positive rate of less than 1%. Additionally, our system detected many package clones not previously known or tracked. Our results are now starting to be used by Linux vendors such as Debian and Red Hat to track embedded packages. Red Hat started to track clones in a new wiki, and Debian is planning to integrate Clonewise into the operating procedures used by its security team. Based on our work, over 30 previously unknown package clone vulnerabilities have been identified and patched.
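
As an illustration of the kind of file-level comparison described above, the following minimal sketch computes two candidate features for a pair of packages: the overlap of their file names and the number of same-named files with similar content. It is not Clonewise's actual feature set or matcher. The function names, the use of Python's `difflib.SequenceMatcher`, and the 0.8 similarity threshold are all assumptions made for illustration, and each package is modelled simply as a dictionary mapping file names to file contents.

```python
from difflib import SequenceMatcher


def name_overlap(files_a, files_b):
    """Jaccard similarity between the sets of file names in two packages."""
    names_a, names_b = set(files_a), set(files_b)
    if not names_a and not names_b:
        return 0.0
    return len(names_a & names_b) / len(names_a | names_b)


def similar_file_count(files_a, files_b, threshold=0.8):
    """Count same-named files whose contents are at least `threshold` similar.

    The threshold value is an illustrative assumption, not a tuned parameter.
    """
    count = 0
    for name in set(files_a) & set(files_b):
        ratio = SequenceMatcher(None, files_a[name], files_b[name]).ratio()
        if ratio >= threshold:
            count += 1
    return count


def clone_features(files_a, files_b):
    """Build a small (hypothetical) feature dictionary for a package pair."""
    return {
        "name_overlap": name_overlap(files_a, files_b),
        "similar_files": similar_file_count(files_a, files_b),
    }
```

Feature dictionaries of this kind, computed for every candidate package pair, could then be passed to a statistical classifier trained on known clone relationships such as those in Debian's manually created database.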