Abstract
Identifying free open-source software (FOSS) packages in binaries when the source code is unavailable is important for many security applications, such as malware detection, software infringement, and digital forensics. This capability improves both the accuracy and the efficiency of reverse engineering tasks by avoiding false correlations between irrelevant code bases. Although the FOSS package identification problem belongs to the field of software engineering, conventional approaches rely heavily on practical methods from data mining and database searching. However, various challenges in applying these methods prevent existing function identification approaches from being effective in the absence of source code. To make matters worse, obfuscation techniques, different compilers and compilation settings, and software refactoring techniques have made the automated detection of FOSS packages increasingly difficult. With very few exceptions, existing systems are not resilient to such techniques, and the exceptions are not sufficiently efficient.
To address this issue, we propose FOSSIL, a novel, resilient, and efficient system that incorporates three components. The first component extracts the syntactical features of functions by considering opcode frequencies and applying a hidden Markov model statistical test. The second component applies a neighborhood hash graph kernel to random walks derived from control-flow graphs, with the goal of extracting the semantics of the functions. The third component applies a z-score to the normalized instructions to extract the behavior of instructions in a function. The components are integrated using a Bayesian network model, which synthesizes their results to identify the FOSS function. Combining these components through the Bayesian network yields stronger resilience to code obfuscation.
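As a rough illustration of the z-score idea mentioned above, the following sketch computes relative opcode frequencies for one function and standardizes them against a small population of reference functions. It is not the FOSSIL implementation: the helper names `opcode_frequencies` and `z_scores`, the mnemonic sequences, and the choice to score raw opcode frequencies are all assumptions made for this example.

```python
from collections import Counter
from statistics import mean, pstdev


def opcode_frequencies(mnemonics):
    """Return the relative frequency of each opcode mnemonic in one function."""
    counts = Counter(mnemonics)
    total = sum(counts.values())
    return {op: n / total for op, n in counts.items()} if total else {}


def z_scores(target, population):
    """Standardize each opcode frequency in `target` against `population`.

    `population` is a list of frequency dictionaries, one per reference
    function. An opcode missing from a reference function counts as 0.0.
    """
    scores = {}
    for op, freq in target.items():
        values = [p.get(op, 0.0) for p in population]
        mu = mean(values)
        sigma = pstdev(values)
        # A zero standard deviation means the opcode carries no signal here.
        scores[op] = 0.0 if sigma == 0 else (freq - mu) / sigma
    return scores


# Hypothetical mnemonic sequences, invented for illustration only.
reference_functions = [
    ["push", "mov", "add", "pop", "ret"],
    ["push", "mov", "mov", "pop", "ret"],
    ["push", "mov", "xor", "xor", "pop", "ret"],
]
target_function = ["push", "mov", "xor", "xor", "xor", "pop", "ret"]

population = [opcode_frequencies(f) for f in reference_functions]
scores = z_scores(opcode_frequencies(target_function), population)

for op, z in sorted(scores.items(), key=lambda item: -item[1]):
    print(f"{op:5s} {z:+.2f}")
```

In this toy data the target function uses `xor` far more often than the reference functions, so `xor` receives a large positive score, while opcodes used at a typical rate score near or below zero.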
We evaluate our system on three datasets: real-world projects whose use of FOSS packages is known, malware binaries for which security and reverse engineering reports purport to describe their use of FOSS, and a large repository of malware binaries. We demonstrate that our system identifies FOSS packages in real-world projects with a mean precision of 0.95 and a mean recall of 0.85. Furthermore, FOSSIL discovers FOSS packages in malware binaries that match those listed in security and reverse engineering reports. Our results show that modern malware binaries contain 0.10 to 0.45 of FOSS packages.
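For readers unfamiliar with the metrics reported above, the sketch below shows how precision and recall could be computed for one binary when the set of FOSS packages it actually uses is known. The function name `precision_recall` and the package labels are hypothetical, and the numbers are unrelated to the paper's results.

```python
def precision_recall(predicted, actual):
    """Compute precision and recall for a set of predicted package labels."""
    predicted, actual = set(predicted), set(actual)
    true_positives = len(predicted & actual)
    # Precision: share of reported packages that are really present.
    precision = true_positives / len(predicted) if predicted else 0.0
    # Recall: share of packages really present that were reported.
    recall = true_positives / len(actual) if actual else 0.0
    return precision, recall


# Hypothetical ground truth and detector output for one binary.
actual = {"zlib", "libpng", "openssl", "sqlite"}
predicted = {"zlib", "libpng", "openssl", "libxml2"}

p, r = precision_recall(predicted, actual)
print(f"precision={p:.2f} recall={r:.2f}")
```

A mean precision or recall over a dataset would then be the average of these per-binary values.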
FOSSIL: A Resilient and Efficient System for Identifying FOSS Functions in Malware Binaries