DOI: 10.1145/3424954.3424957
Research Article

Cross-domain meta-learning for bug finding in the source codes with a small dataset

Published: 12 January 2021

ABSTRACT

In application security, detecting security vulnerabilities in advance and fixing them is one of the most effective ways to prevent malicious activity. Finding security bugs, however, is highly reliant on human experts because of its complexity. Source code auditing, one of the main ways to find such bugs, is therefore expensive, and its quality varies considerably depending on who performs it. There have been many attempts to build automated systems for code auditing, but they have suffered from high rates of false positives and false negatives.

Meanwhile, machine learning has advanced dramatically in recent years and now outperforms humans on many tasks with high accuracy, so there have been many efforts to apply it to security research. In most cases, however, it is very difficult to obtain legitimate training data, and in security, rarer often means more lethal. As a result, it is not easy to build reliable machine learning systems for security defects, and we still rely heavily on human experts, who can learn from only a few examples.

To overcome this obstacle, this paper proposes a deep neural network model for finding security bugs that takes advantage of recent developments in machine learning: a language model that adapts sub-word tokenization and the self-attention-based Transformer from natural language processing to source code understanding, and a meta-learning technique from computer vision to compensate for the lack of legitimate vulnerability samples. The model is evaluated on finding DOM-based XSS bugs, which are prevalent but hard to spot with traditional detection methods. The results show that the model outperforms the baseline by 45% in F1 score.
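
To illustrate how these two ingredients can fit together, the sketch below combines a small self-attention (Transformer) encoder over sub-word token IDs with a Reptile-style first-order meta-learning loop, so that the classifier can be adapted to a new bug class (e.g. DOM-based XSS) from only a few labelled snippets. This is a minimal illustration rather than the authors' implementation: the model sizes, hyper-parameters, and the synthetic sample_task that stands in for real few-shot tasks of tokenized source snippets are all assumptions made for the example.

    # Minimal sketch (illustrative only): Transformer encoder over sub-word
    # token IDs + Reptile-style first-order meta-learning (PyTorch >= 1.9).
    import copy
    import torch
    import torch.nn as nn

    VOCAB_SIZE, MAX_LEN, D_MODEL = 1000, 64, 128   # assumed sub-word vocab / sizes

    class CodeClassifier(nn.Module):
        """Self-attention encoder over sub-word token IDs + bug/no-bug head."""
        def __init__(self):
            super().__init__()
            self.embed = nn.Embedding(VOCAB_SIZE, D_MODEL)
            layer = nn.TransformerEncoderLayer(d_model=D_MODEL, nhead=4,
                                               dim_feedforward=256, batch_first=True)
            self.encoder = nn.TransformerEncoder(layer, num_layers=2)
            self.head = nn.Linear(D_MODEL, 2)

        def forward(self, token_ids):                 # token_ids: (batch, MAX_LEN)
            h = self.encoder(self.embed(token_ids))   # (batch, MAX_LEN, D_MODEL)
            return self.head(h.mean(dim=1))           # mean-pool, then classify

    def sample_task(n=16):
        """Stand-in for one few-shot task (one bug class); synthetic data here."""
        x = torch.randint(0, VOCAB_SIZE, (n, MAX_LEN))
        y = torch.randint(0, 2, (n,))
        return x, y

    def reptile_train(meta_steps=100, inner_steps=5, inner_lr=1e-3, meta_lr=0.1):
        model = CodeClassifier()
        loss_fn = nn.CrossEntropyLoss()
        for _ in range(meta_steps):
            task_model = copy.deepcopy(model)                    # phi <- theta
            opt = torch.optim.SGD(task_model.parameters(), lr=inner_lr)
            x, y = sample_task()
            for _ in range(inner_steps):                         # adapt to the task
                opt.zero_grad()
                loss_fn(task_model(x), y).backward()
                opt.step()
            with torch.no_grad():                                # theta += eps * (phi - theta)
                for p, q in zip(model.parameters(), task_model.parameters()):
                    p.add_(meta_lr * (q - p))
        return model

    if __name__ == "__main__":
        meta_model = reptile_train(meta_steps=10)    # tiny run, just to show the loop

Reptile is used here instead of full MAML because its first-order update avoids back-propagating through the inner optimisation, which keeps the loop cheap even with a Transformer encoder; whether this matches the paper's exact choice of meta-learning algorithm is an assumption of this sketch.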



Published in

EICC '20: Proceedings of the 2020 European Interdisciplinary Cybersecurity Conference
November 2020, 72 pages
ISBN: 9781450375993
DOI: 10.1145/3424954

Copyright © 2020 ACM


Publisher

Association for Computing Machinery, New York, NY, United States
