Abstract
We address the problem of code search in executables. Given a function in binary form and a large code base, our goal is to statically find similar functions in the code base. Towards this end, we present a novel technique for computing similarity between functions. Our notion of similarity is based on decomposition of functions into tracelets: continuous, short, partial traces of an execution. To establish tracelet similarity in the face of low-level compiler transformations, we employ a simple rewriting engine. This engine uses constraint solving over alignment constraints and data dependencies to match registers and memory addresses between tracelets, bridging the gap between tracelets that are otherwise similar. We have implemented our approach and applied it to find matches in over a million binary functions. We compare tracelet matching to approaches based on n-grams and graphlets and show that tracelet matching obtains dramatically better precision and recall.
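The decomposition step described above can be illustrated with a minimal sketch, assuming a tracelet is a path of `k` consecutive basic blocks in a function's control-flow graph. The names `extract_tracelets`, `tracelet_similarity`, and `function_similarity`, the dictionary-based CFG encoding, and the `threshold` parameter are all hypothetical and are not taken from the authors' implementation. The sketch uses Python's `difflib.SequenceMatcher` as a stand-in for an alignment score and omits the rewriting engine entirely, so registers and memory addresses must already agree textually for instructions to match.

```python
from difflib import SequenceMatcher


def extract_tracelets(cfg, blocks, k):
    """Return every path of exactly k basic blocks, flattened to instructions.

    cfg maps a block name to a list of successor block names.
    blocks maps a block name to its list of instruction strings.
    (Both encodings are assumptions made for this sketch.)
    """
    tracelets = []

    def walk(path):
        if len(path) == k:
            # Flatten the k blocks into one instruction sequence.
            tracelets.append(tuple(ins for b in path for ins in blocks[b]))
            return
        for succ in cfg.get(path[-1], []):
            walk(path + [succ])

    for start in blocks:
        walk([start])
    return tracelets


def tracelet_similarity(a, b):
    """Score two instruction sequences in [0, 1]; a stand-in alignment score."""
    return SequenceMatcher(None, a, b).ratio()


def function_similarity(ref, tgt, threshold=0.8):
    """Fraction of reference tracelets that have a close match in the target."""
    if not ref:
        return 0.0
    matched = sum(
        1 for r in ref
        if any(tracelet_similarity(r, t) >= threshold for t in tgt)
    )
    return matched / len(ref)


# Toy function with a diamond-shaped control-flow graph (made-up instructions).
example_blocks = {
    "A": ["mov eax, 1", "cmp eax, ebx"],
    "B": ["add eax, 2"],
    "C": ["sub eax, 2"],
    "D": ["ret"],
}
example_cfg = {"A": ["B", "C"], "B": ["D"], "C": ["D"]}

example_tracelets = extract_tracelets(example_cfg, example_blocks, 2)
print(len(example_tracelets), function_similarity(example_tracelets, example_tracelets))
```

Because this sketch compares instruction text directly, two compilations of the same source that differ only in register allocation would score poorly here. Bridging that gap is the role the abstract assigns to the constraint-based rewriting engine.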
Tracelet-based code search in executables
PLDI '14: Proceedings of the 35th ACM SIGPLAN Conference on Programming Language Design and Implementation