research-article

Tracelet-based code search in executables

Published: 09 June 2014

Abstract

We address the problem of code search in executables. Given a function in binary form and a large code base, our goal is to statically find similar functions in the code base. Towards this end, we present a novel technique for computing similarity between functions. Our notion of similarity is based on decomposition of functions into tracelets: continuous, short, partial traces of an execution. To establish tracelet similarity in the face of low-level compiler transformations, we employ a simple rewriting engine. This engine uses constraint solving over alignment constraints and data dependencies to match registers and memory addresses between tracelets, bridging the gap between tracelets that are otherwise similar. We have implemented our approach and applied it to find matches in over a million binary functions. We compare tracelet matching to approaches based on n-grams and graphlets and show that tracelet matching obtains dramatically better precision and recall.
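
As a rough illustration of the decomposition described above, the sketch below enumerates fixed-length tracelets from a control-flow graph and scores two of them by edit distance over their instructions. The abstract does not specify any of these details, so everything here is an assumption made for illustration: the dictionary encoding of the CFG, the names `extract_tracelets`, `flatten`, `edit_distance`, and `similarity`, the choice of measuring tracelet length in basic blocks, and the normalized score. The constraint-based rewriting of registers and memory addresses is not modeled at all.

```python
from typing import Dict, List, Sequence, Tuple

# Hypothetical CFG encoding (an assumption, not the paper's representation):
# block name -> (instructions in the block, names of successor blocks).
CFG = Dict[str, Tuple[List[str], List[str]]]


def extract_tracelets(cfg: CFG, k: int) -> List[Tuple[str, ...]]:
    """Enumerate every path of exactly k basic blocks that follows CFG edges."""
    tracelets: List[Tuple[str, ...]] = []

    def extend(path: Tuple[str, ...]) -> None:
        if len(path) == k:
            tracelets.append(path)
            return
        # Paths are bounded by k, so cycles in the CFG cannot loop forever.
        for succ in cfg[path[-1]][1]:
            extend(path + (succ,))

    for block in cfg:
        extend((block,))
    return tracelets


def flatten(cfg: CFG, tracelet: Sequence[str]) -> List[str]:
    """Concatenate the instructions of the blocks along a tracelet."""
    return [ins for block in tracelet for ins in cfg[block][0]]


def edit_distance(a: Sequence, b: Sequence) -> int:
    """Classic dynamic-programming edit distance with unit costs."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # delete x
                           cur[j - 1] + 1,           # insert y
                           prev[j - 1] + (x != y)))  # substitute or match
        prev = cur
    return prev[-1]


def similarity(a: Sequence, b: Sequence) -> float:
    """Normalized score in [0, 1]; 1.0 means the sequences are identical."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - edit_distance(a, b) / longest
```

Under these assumptions, two functions could be compared by extracting their tracelets for the same `k` and counting how many tracelets of one have a counterpart in the other whose `similarity` exceeds a chosen threshold. A real matcher would first need to account for renamed registers and shifted memory offsets, which is the role the abstract assigns to the rewriting engine.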


Published in

ACM SIGPLAN Notices, Volume 49, Issue 6
PLDI '14
June 2014
598 pages
ISSN: 0362-1340
EISSN: 1558-1160
DOI: 10.1145/2666356
Editor: Andy Gill

PLDI '14: Proceedings of the 35th ACM SIGPLAN Conference on Programming Language Design and Implementation
June 2014
619 pages
ISBN: 9781450327848
DOI: 10.1145/2594291

Copyright © 2014 ACM

Publisher: Association for Computing Machinery, New York, NY, United States
