Abstract
Deep neural networks (DNNs) have a wide range of applications, and software employing them must be thoroughly tested, especially in safety-critical domains. However, traditional software test coverage metrics cannot be applied directly to DNNs. In this paper, inspired by the MC/DC coverage criterion, we propose a family of four novel test coverage criteria tailored to the structural features of DNNs and their semantics. We validate the criteria by demonstrating that test inputs generated under their guidance are able to capture undesired behaviours in a DNN. Test cases are generated using a symbolic approach and a gradient-based heuristic search. By comparing our criteria with existing methods, we show that they strike a balance between bug-finding ability (proxied using adversarial examples and correlation with functional coverage) and the computational cost of test input generation. Our experiments are conducted on state-of-the-art DNNs trained on popular open-source datasets, including MNIST, CIFAR-10 and ImageNet.
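To illustrate the kind of gradient-based heuristic search mentioned above, the following is a minimal sketch, not the paper's actual implementation: it perturbs an input so that a chosen neuron's activation flips sign, a simplified stand-in for the MC/DC-style coverage conditions the criteria are built on. It assumes a Keras model with a dense hidden layer; the names `search_sign_change`, `layer_index`, `neuron` and `eps` are illustrative placeholders.

```python
# Illustrative sketch only: gradient-guided search for a test input that flips
# the sign of one neuron's activation. Assumes a Keras functional/sequential
# model whose chosen layer is dense (output shape: [batch, units]).
import tensorflow as tf

def search_sign_change(model, x, layer_index, neuron, steps=50, eps=0.01):
    # Sub-model exposing the chosen layer's output so we can read the neuron.
    probe = tf.keras.Model(model.inputs, model.layers[layer_index].output)
    x_adv = tf.Variable(tf.identity(x))
    # Aim for the opposite sign of the neuron's activation on the seed input.
    target_sign = -tf.sign(probe(x)[0, neuron])

    for _ in range(steps):
        with tf.GradientTape() as tape:
            act = probe(x_adv)[0, neuron]
            loss = -target_sign * act          # descending this moves act toward target_sign
        grad = tape.gradient(loss, x_adv)
        x_adv.assign_sub(eps * tf.sign(grad))  # FGSM-style signed-gradient step
        x_adv.assign(tf.clip_by_value(x_adv, 0.0, 1.0))  # keep pixels in range
        if tf.sign(probe(x_adv)[0, neuron]) == target_sign:
            break                              # coverage condition reached
    return x_adv
```

In practice such a search would be run from many seed inputs and combined with a check on the DNN's output label, so that inputs which both raise coverage and change the prediction can be flagged as candidate bugs.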