Code obfuscation by embedding a virtual machine with a secretinstruction set is one of the strongest types of obfuscation. Itrequires a great deal of tedious work by an analyst to recover thearchitecture and instruction set of the virtual machine by reverseengineering the VM interpreter code. However, some information can beobtained by analyzing the binary code of the program for the VM itself before the analysis of the virtual machine interpreter. We propose anumber of heuristics that could achieve this task. The heuristics arebased on reasonable assumptions regarding typical structure of binarycode, produced by a compiler. The obtained information may facilitatefurther analysis of the virtual machine itself.
Some of these heuristics are implemented in a heuristics binaryanalyzer module of SmartDec decompiler (http://decompilation.info)being developed by the authors.In this presentation we are going to perform partial reconstruction ofthe instruction set:· Initial markup of the binary program. Identification of datasections and code sections. Prior information about the purpose of theprogram or even some documentation hints may be used. At this step theentry point to the code (or several entry points) are identified.· Reconstruction of subroutine structure by identification ofthe subroutine borders. The subroutine return (RET) instruction isidentified. It is naturally to expect that the last instruction in thecode segment would be the return instruction. RET instruction normallyseparates subroutines, so we may expect, that CALL instruction shouldpass the control right after RET instruction in many (or even most)cases.· Identification of the unconditional jump (JMP) instructionusing the assumption, that code execution starts at some fixedaddress.· Identification of call instruction. Call instructions aresimilar to unconditional jumps. By investigation of initializationcode several candidates for the CALL instruction can be identified,and the one candidate remained after validation on the whole code.· Recovering of absolute and relative jumps and call by lookingat the bit encoding of instruction and checking whether it could be anoffset relative to the next instruction word. This way relative jumps,calls and candidates for conditional jumps were identified.· Identification of memory load and store instruction byobserving load-store patterns for memory-memory copy operations.· Observations on the virtual machine register structure andregister width. How many registers this VM probably has and how wideare they.· Observations on the arithmetics and logics operation bypairing with the identified conditional jumps.Then we will show some examples of binary code than can bedeobfuscated using the presented method and will discuss possibilitiesfor automatization.In the end of the talk the related features of SmartDec decompilersuch as partial decompilation of partial reconstructed assemblyprogram will be demonstrated.