Abstract
Successful software authorship de-anonymization has both software forensics applications and privacy implications. However, the process requires efficient extraction of authorship attributes. Extracting such attributes is very challenging because software comes in many formats, from executable binaries with different toolchain provenance to source code in different programming languages. Moreover, the quality of the attributes is bounded by the availability of software samples: the number of samples per author and the size of each sample. To this end, this work proposes a deep learning-based approach for software authorship attribution that facilitates large-scale, format-independent, language-oblivious, and obfuscation-resilient software authorship identification. The proposed approach combines learning deep authorship attribution with a recurrent neural network and an ensemble random forest classifier, used for scalability, to de-anonymize programmers. Comprehensive experiments are conducted to evaluate the proposed approach over the entire Google Code Jam (GCJ) dataset across all years (from 2008 to 2016) and over real-world code samples from 1,987 public repositories on GitHub. The results show high accuracy despite requiring a smaller number of samples per author. Experimenting with source code, our approach identifies 8,903 GCJ authors, the largest-scale dataset used so far, with an accuracy of 92.3%. Using the real-world dataset, we achieve an identification accuracy of 94.38% for 745 C programmers on GitHub. Moreover, the proposed approach is resilient to language specifics, and thus it can identify authors writing in four programming languages (C, C++, Java, and Python) as well as authors writing in mixed languages (e.g., Java/C++, Python/C++). Finally, our system is resistant to sophisticated obfuscation (e.g., using the Tigress C obfuscator), with an accuracy of 93.42% for a set of 120 authors.
Experimenting with executable binaries, our approach achieves an accuracy of 95.74% in identifying 1,500 programmers of software binaries. Similar results were obtained when the binaries were generated with different compilation options and optimization levels, and with symbol information removed. Moreover, our approach achieves an accuracy of 93.86% in identifying 1,500 programmers of binaries obfuscated using all features of the Obfuscator-LLVM tool.
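The two-stage design described in the abstract (a learned representation of each sample, followed by an ensemble classifier) can be illustrated with a minimal, self-contained sketch. It is not the authors' implementation. As labeled assumptions, hashed character n-gram counts stand in for the RNN-learned features, and a random-subspace ensemble of nearest-centroid voters stands in for the random forest. The names `featurize`, `train_ensemble`, `predict`, the constant `DIM`, and the authors "alice" and "bob" are all hypothetical.

```python
import random
import zlib
from collections import Counter, defaultdict

DIM = 64  # hashed feature dimension (arbitrary, hypothetical choice)


def featurize(code, n=3):
    """Stand-in for the learned representation: hashed character n-gram
    counts, L2-normalised. The RNN-based features are NOT reproduced here."""
    vec = [0.0] * DIM
    for i in range(max(len(code) - n + 1, 0)):
        vec[zlib.crc32(code[i:i + n].encode("utf-8")) % DIM] += 1.0
    norm = sum(v * v for v in vec) ** 0.5 or 1.0  # avoid dividing by zero
    return [v / norm for v in vec]


def train_ensemble(samples, n_members=25, subspace=16, seed=0):
    """samples: list of (author, code) pairs. Each member sees a bootstrap
    resample and a random subset of feature indices, and stores one centroid
    per author it has seen."""
    rng = random.Random(seed)
    data = [(author, featurize(code)) for author, code in samples]
    members = []
    for _ in range(n_members):
        idx = rng.sample(range(DIM), subspace)
        boot = [rng.choice(data) for _ in data]
        sums = defaultdict(lambda: [0.0] * subspace)
        counts = Counter()
        for author, vec in boot:
            counts[author] += 1
            for j, k in enumerate(idx):
                sums[author][j] += vec[k]
        centroids = {a: [s / counts[a] for s in sums[a]] for a in counts}
        members.append((idx, centroids))
    return members


def predict(members, code):
    """Majority vote over the members' nearest-centroid decisions."""
    vec = featurize(code)
    votes = Counter()
    for idx, centroids in members:
        sub = [vec[k] for k in idx]
        best = min(
            centroids,
            key=lambda a: sum((x - y) ** 2 for x, y in zip(sub, centroids[a])),
        )
        votes[best] += 1
    return votes.most_common(1)[0][0]


# Toy, made-up training data for two hypothetical authors.
train = [
    ("alice", "for(int i=0;i<n;i++){sum+=a[i];}"),
    ("alice", "for(int j=0;j<m;j++){tot+=b[j];}"),
    ("bob", "int i = 0;\nwhile (i < n)\n{\n    sum = sum + a[i];\n    i = i + 1;\n}"),
    ("bob", "int k = 0;\nwhile (k < m)\n{\n    tot = tot + b[k];\n    k = k + 1;\n}"),
]
model = train_ensemble(train)
print(predict(model, "for(int t=0;t<q;t++){acc+=c[t];}"))
```

Keeping the feature stage separate from the voting stage mirrors the split the abstract describes: the representation can be swapped (for example, for features taken from a trained recurrent network) without changing the classifier, and the ensemble can grow with the number of authors.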
Index Terms
Large-scale and Robust Code Authorship Identification with Deep Feature Learning