Abstract
In the past decade, Deep Learning (DL) systems have been widely deployed in various application domains to facilitate our daily life, e.g., natural language processing, healthcare, activity recognition, and autonomous driving. Meanwhile, it is extremely challenging to ensure the correctness of DL systems (e.g., due to their intrinsic nondeterminism), and bugs in DL systems can cause serious consequences and may even threaten human lives. In the literature, researchers have explored various techniques to test, analyze, and verify DL models, since their quality directly affects the corresponding system behaviors. Recently, researchers have also proposed novel techniques for testing the underlying operator-level DL libraries (such as TensorFlow and PyTorch), which provide general binary implementations for each high-level DL operator and are the foundation for running DL models on different hardware platforms. However, there is still limited work targeting the reliability of the emerging tensor compilers (also known as DL compilers), which aim to automatically compile high-level tensor computation graphs directly into high-performance binaries for better efficiency, portability, and scalability than traditional operator-level libraries. Therefore, in this paper, we target the important problem of tensor compiler testing, and have proposed Tzer, a practical fuzzing technique for the widely used TVM tensor compiler. Tzer focuses on mutating the low-level Intermediate Representation (IR) for TVM due to the limited mutation space for the high-level IR. More specifically, Tzer leverages both general-purpose and tensor-compiler-specific mutators guided by coverage feedback for diverse and evolutionary IR mutation; furthermore, since tensor compilers provide various passes (i.e., transformations) for IR optimization, Tzer also performs pass mutation in tandem with IR mutation for more effective fuzzing. Our experimental results show that Tzer substantially outperforms existing fuzzing techniques on tensor compiler testing, with 75% higher coverage and 50% more valuable tests than the 2nd-best technique. Also, different components of Tzer have been validated via ablation study. To date, Tzer has detected 49 previously unknown bugs for TVM, with 37 bugs confirmed and 25 bugs fixed (PR merged).
- Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, Manjunath Kudlur, Josh Levenberg, Rajat Monga, Sherry Moore, Derek G. Murray, Benoit Steiner, Paul Tucker, Vijay Vasudevan, Pete Warden, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. 2016. TensorFlow: A System for Large-Scale Machine Learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16). USENIX Association, Savannah, GA. 265–283. isbn:978-1-931971-33-1 https://www.usenix.org/conference/osdi16/technical-sessions/presentation/abadiGoogle Scholar
Digital Library
- Stefan Behnel, Robert Bradshaw, Craig Citro, Lisandro Dalcin, Dag Sverre Seljebotn, and Kurt Smith. 2010. Cython: The best of both worlds. Computing in Science & Engineering, 13, 2 (2010), 31–39.Google Scholar
Digital Library
- Petter Bjørstad, Fredrik Manne, Tor Sørevik, and Marian Vajteršic. 1992. Efficient matrix multiplication on SIMD computers. SIAM J. Matrix Anal. Appl., 13, 1 (1992), 386–401.Google Scholar
Digital Library
- Google Security Blog. 2016. Guided in-process fuzzing of Chrome components. https://security.googleblog.com/2016/08/guided-in-process-fuzzing-of-chrome.htmlGoogle Scholar
- Marcel Böhme, Valentin JM Manès, and Sang Kil Cha. 2020. Boosting fuzzer efficiency: An information theoretic perspective. In Proceedings of the 28th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering. 678–689.Google Scholar
Digital Library
- Marcel Böhme, Van-Thuan Pham, and Abhik Roychoudhury. 2017. Coverage-based greybox fuzzing as markov chain. IEEE Transactions on Software Engineering, 45, 5 (2017), 489–506.Google Scholar
Cross Ref
- Zhe Cao, Gines Hidalgo, Tomas Simon, Shih-En Wei, and Yaser Sheikh. 2019. OpenPose: realtime multi-person 2D pose estimation using Part Affinity Fields. IEEE transactions on pattern analysis and machine intelligence, 43, 1 (2019), 172–186.Google Scholar
Digital Library
- Tianqi Chen, Thierry Moreau, Ziheng Jiang, Lianmin Zheng, Eddie Yan, Haichen Shen, Meghan Cowan, Leyuan Wang, Yuwei Hu, and Luis Ceze. 2018. TVM: An automated end-to-end optimizing compiler for deep learning. In 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI 18). 578–594.Google Scholar
- Sharan Chetlur, Cliff Woolley, Philippe Vandermersch, Jonathan Cohen, John Tran, Bryan Catanzaro, and Evan Shelhamer. 2014. cudnn: Efficient primitives for deep learning. arXiv preprint arXiv:1410.0759.Google Scholar
- Koen Claessen, Jonas Duregård, and Michał H Pał ka. 2015. Generating constrained random data with uniform distribution. Journal of functional programming, 25 (2015).Google Scholar
- Apache TVM Community. 2020. tvm.relay.testing — tvm 0.8.dev0 documentation. https://tvm.apache.org/docs/api/python/relay/testing.htmlGoogle Scholar
- Maxime Dénès, Catalin Hritcu, Leonidas Lampropoulos, Zoe Paraskevopoulou, and Benjamin C Pierce. 2014. QuickChick: Property-based testing for Coq. In The Coq Workshop. 125, 126.Google Scholar
- Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, June 2-7, 2019, Volume 1 (Long and Short Papers), Jill Burstein, Christy Doran, and Thamar Solorio (Eds.). Association for Computational Linguistics, 4171–4186. https://doi.org/10.18653/v1/n19-1423 Google Scholar
Cross Ref
- Andre Esteva, Alexandre Robicquet, Bharath Ramsundar, Volodymyr Kuleshov, Mark DePristo, Katherine Chou, Claire Cui, Greg Corrado, Sebastian Thrun, and Jeff Dean. 2019. A guide to deep learning in healthcare. Nature medicine, 25, 1 (2019), 24–29.Google Scholar
- Andrea Fioraldi, Dominik Maier, Heiko Eiß feldt, and Marc Heuse. 2020. AFL++: Combining Incremental Steps of Fuzzing Research. In 14th USENIX Workshop on Offensive Technologies (WOOT 20). USENIX Association.Google Scholar
- Python Software Foundation. 2021. https://docs.python.org/3/library/ctypes.htmlGoogle Scholar
- Joshua Garcia, Yang Feng, Junjie Shen, Sumaya Almanee, Yuan Xia, and Qi Alfred Chen. 2020. A comprehensive study of autonomous vehicle bugs. In Proceedings of the ACM/IEEE 42nd International Conference on Software Engineering. 385–396.Google Scholar
Digital Library
- Google. 2015. Keras. https://keras.ioGoogle Scholar
- Google. 2016. XLA: Optimizing Compiler for Machine Learning. https://www.tensorflow.org/xlaGoogle Scholar
- Rahul Gopinath, Carlos Jensen, and Alex Groce. 2014. Code coverage for suite evaluation by developers. In Proceedings of the 36th International Conference on Software Engineering. 72–82.Google Scholar
Digital Library
- Sorin Grigorescu, Bogdan Trasnea, Tiberiu Cocias, and Gigel Macesanu. 2020. A survey of deep learning techniques for autonomous driving. Journal of Field Robotics, 37, 3 (2020), 362–386.Google Scholar
Cross Ref
- Yixiao Guo, Jiawei Liu, Guo Li, Luo Mai, and Hao Dong. 2021. Fast and Flexible Human Pose Estimation with HyperPose. arXiv preprint arXiv:2108.11826.Google Scholar
- Christian Holler, Kim Herzig, and Andreas Zeller. 2012. Fuzzing with Code Fragments. In 21st USENIX Security Symposium (USENIX Security 12). USENIX Association, Bellevue, WA. 445–458. isbn:978-931971-95-9 https://www.usenix.org/conference/usenixsecurity12/technical-sessions/presentation/hollerGoogle Scholar
- Intel. 2017. PlaidML is a framework for making deep learning work everywhere.. https://github.com/plaidml/plaidmlGoogle Scholar
- Zhihao Jia, Oded Padon, James Thomas, Todd Warszawski, Matei Zaharia, and Alex Aiken. 2019. TASO: Optimizing Deep Learning Computation with Automatic Generation of Graph Substitutions. In Proceedings of the 27th ACM Symposium on Operating Systems Principles (SOSP ’19). Association for Computing Machinery, New York, NY, USA. 47–62. isbn:9781450368735 https://doi.org/10.1145/3341301.3359630 Google Scholar
Digital Library
- Tian Jin, Gheorghe-Teodor Bercea, Tung D Le, Tong Chen, Gong Su, Haruki Imai, Yasushi Negishi, Anh Leu, Kevin O’Brien, and Kiyokuni Kawachiya. 2020. Compiling ONNX Neural Network Models Using MLIR. arXiv preprint arXiv:2008.08272.Google Scholar
- Kyungtae Kim, Dae Jeong, Chung Hwan Kim, Yeongjin Jang, Insik Shin, and Byoungyoung Lee. 2020. HFL: Hybrid Fuzzing on the Linux Kernel. https://doi.org/10.14722/ndss.2020.24018 Google Scholar
Cross Ref
- George Klees, Andrew Ruef, Benji Cooper, Shiyi Wei, and Michael Hicks. 2018. Evaluating Fuzz Testing. In Proceedings of the 2018 ACM SIGSAC Conference on Computer and Communications Security (CCS ’18). Association for Computing Machinery, New York, NY, USA. 2123–2138. isbn:9781450356930 https://doi.org/10.1145/3243734.3243804 Google Scholar
Digital Library
- Sven Kreiss, Lorenzo Bertoni, and Alexandre Alahi. 2019. PifPaf: Composite Fields for Human Pose Estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 11969–11978. https://doi.org/10.1109/CVPR.2019.01225 Google Scholar
Cross Ref
- Leonidas Lampropoulos, Michael Hicks, and Benjamin C Pierce. 2019. Coverage guided, property based testing. Proceedings of the ACM on Programming Languages, 3, OOPSLA (2019), 1–29.Google Scholar
Digital Library
- Leonidas Lampropoulos, Zoe Paraskevopoulou, and Benjamin C Pierce. 2017. Generating good generators for inductive relations. Proceedings of the ACM on Programming Languages, 2, POPL (2017), 1–30.Google Scholar
- Chris Lattner, Mehdi Amini, Uday Bondhugula, Albert Cohen, Andy Davis, Jacques Pienaar, River Riddle, Tatiana Shpeisman, Nicolas Vasilache, and Oleksandr Zinenko. 2020. MLIR: A compiler infrastructure for the end of Moore’s law. arXiv preprint arXiv:2002.11054.Google Scholar
- Chris Arthur Lattner. 2002. LLVM: An infrastructure for multi-stage optimization. Ph.D. Dissertation. University of Illinois at Urbana-Champaign.Google Scholar
- Vu Le, Mehrdad Afshari, and Zhendong Su. 2014. Compiler validation via equivalence modulo inputs. ACM Sigplan Notices, 49, 6 (2014), 216–226.Google Scholar
Digital Library
- Caroline Lemieux and Koushik Sen. 2018. FairFuzz: A Targeted Mutation Strategy for Increasing Greybox Fuzz Testing Coverage. Association for Computing Machinery, New York, NY, USA. 475–485. isbn:9781450359375 https://doi.org/10.1145/3238147.3238176 Google Scholar
Digital Library
- Jun Li, Bodong Zhao, and Chao Zhang. 2018. Fuzzing: a survey. Cybersecurity, 1, 1 (2018), 1–13.Google Scholar
Cross Ref
- Mingzhen Li, Yi Liu, Xiaoyan Liu, Qingxiao Sun, Xin You, Hailong Yang, Zhongzhi Luan, Lin Gan, Guangwen Yang, and Depei Qian. 2020. The deep learning compiler: A comprehensive survey. IEEE Transactions on Parallel and Distributed Systems, 32, 3 (2020), 708–727.Google Scholar
Cross Ref
- Jiawei Liu, Yuxiang Wei, Sen Yang, Yinlin Deng, and Lingming Zhang. 2022. Coverage-Guided Tensor Compiler Fuzzing with Joint IR-Pass Mutation. https://doi.org/10.5281/zenodo.6371291 Google Scholar
Digital Library
- Valentin Jean Marie Manès, HyungSeok Han, Choongwoo Han, Sang Kil Cha, Manuel Egele, Edward J Schwartz, and Maverick Woo. 2019. The art, science, and engineering of fuzzing: A survey. IEEE Transactions on Software Engineering, 47 (2019), 2312–2331.Google Scholar
Cross Ref
- Riccardo Miotto, Fei Wang, Shuang Wang, Xiaoqian Jiang, and Joel T Dudley. 2018. Deep learning for healthcare: review, opportunities and challenges. Briefings in bioinformatics, 19, 6 (2018), 1236–1246.Google Scholar
- Augustus Odena, Catherine Olsson, David Andersen, and Ian Goodfellow. 2019. TensorFuzz: Debugging Neural Networks with Coverage-Guided Fuzzing. In Proceedings of the 36th International Conference on Machine Learning, Kamalika Chaudhuri and Ruslan Salakhutdinov (Eds.) (Proceedings of Machine Learning Research, Vol. 97). PMLR, 4901–4911. https://proceedings.mlr.press/v97/odena19a.htmlGoogle Scholar
- David Pankratz. 2020. TVMFuzz: Fuzzing Tensor-level Intermediate Representation in TVM. https://github.com/dpankratz/TVMFuzzGoogle Scholar
- Neungsoo Park, Bo Hong, and Viktor K Prasanna. 2003. Tiling, block data layout, and memory hierarchy performance. IEEE Transactions on Parallel and Distributed Systems, 14, 7 (2003), 640–654.Google Scholar
Digital Library
- Soyeon Park, Wen Xu, Insu Yun, Daehee Jang, and Taesoo Kim. 2020. Fuzzing JavaScript Engines with Aspect-preserving Mutation. In 2020 IEEE Symposium on Security and Privacy (SP). 1629–1642. https://doi.org/10.1109/SP40000.2020.00067 Google Scholar
Cross Ref
- Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, and Luca Antiga. 2019. Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems, 32 (2019), 8026–8037.Google Scholar
- Kexin Pei, Yinzhi Cao, Junfeng Yang, and Suman Jana. 2019. DeepXplore: Automated Whitebox Testing of Deep Learning Systems. Commun. ACM, 62, 11 (2019), oct, 137–145. issn:0001-0782 https://doi.org/10.1145/3361566 Google Scholar
Digital Library
- Hung Viet Pham, Thibaud Lutellier, Weizhen Qi, and Lin Tan. 2019. CRADLE: Cross-backend validation to Detect and Localize bugs in Deep learning libraries. ICSE ’19. IEEE Press, 1027–1038. https://doi.org/10.1109/ICSE.2019.00107 Google Scholar
Digital Library
- Jonathan Ragan-Kelley, Connelly Barnes, Andrew Adams, Sylvain Paris, Frédo Durand, and Saman Amarasinghe. 2013. Halide: a language and compiler for optimizing parallelism, locality, and recomputation in image processing pipelines. Acm Sigplan Notices, 48, 6 (2013), 519–530.Google Scholar
Digital Library
- Qing Rao and Jelena Frtunikj. 2018. Deep Learning for Self-Driving Cars: Chances and Challenges. In 2018 IEEE/ACM 1st International Workshop on Software Engineering for AI in Autonomous Systems (SEFAIAS). 35–38.Google Scholar
Digital Library
- Nadav Rotem, Jordan Fix, Saleem Abdulrasool, Garret Catron, Summer Deng, Roman Dzhabarov, Nick Gibson, James Hegeman, Meghan Lele, and Roman Levenstein. 2018. Glow: Graph lowering compiler techniques for neural networks. arXiv preprint arXiv:1805.00907.Google Scholar
- Kosta Serebryany. 2016. Continuous fuzzing with libfuzzer and addresssanitizer. In 2016 IEEE Cybersecurity Development (SecDev). 157–157.Google Scholar
- Kostya Serebryany. 2017. OSS-Fuzz-Google’s continuous fuzzing service for open source software.Google Scholar
- Tyler M Smith, Robert Van De Geijn, Mikhail Smelyanskiy, Jeff R Hammond, and Field G Van Zee. 2014. Anatomy of high-performance many-threaded matrix multiplication. In 2014 IEEE 28th International Parallel and Distributed Processing Symposium. 1049–1059.Google Scholar
Digital Library
- Bjarne Stroustrup. 2017. Why doesn’t C++ provide a "finally" construct? https://www.stroustrup.com/bs_faq2.html#finallyGoogle Scholar
- Yuchi Tian, Kexin Pei, Suman Jana, and Baishakhi Ray. 2018. Deeptest: Automated testing of deep-neural-network-driven autonomous cars. In Proceedings of the 40th international conference on software engineering. 303–314.Google Scholar
Cross Ref
- Philippe Tillet, H. T. Kung, and David Cox. 2019. Triton: An Intermediate Language and Compiler for Tiled Neural Network Computations. Association for Computing Machinery, New York, NY, USA. 10–19. isbn:9781450367196 https://doi.org/10.1145/3315508.3329973 Google Scholar
Digital Library
- Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Ł ukasz Kaiser, and Illia Polosukhin. 2017. Attention is All you Need. In Advances in Neural Information Processing Systems, I. Guyon, U. Von Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (Eds.). 30, Curran Associates, Inc.. https://proceedings.neurips.cc/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdfGoogle Scholar
- Zan Wang, Ming Yan, Junjie Chen, Shuang Liu, and Dongdi Zhang. 2020. Deep learning library testing via effective model generation. In Proceedings of the 28th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering. 788–799.Google Scholar
Digital Library
- Anjiang Wei, Yinlin Deng, Chenyuan Yang, and Lingming Zhang. 2022. Free Lunch for Testing: Fuzzing Deep-Learning Libraries from Open Source. arXiv preprint arXiv:2201.06589.Google Scholar
- Glibc Wiki. 2016. Fuzzing libc. https://sourceware.org/glibc/wiki/FuzzingLibcGoogle Scholar
- Kanit Wongsuphasawat, Daniel Smilkov, James Wexler, Jimbo Wilson, Dandelion Mane, Doug Fritz, Dilip Krishnan, Fernanda B Viégas, and Martin Wattenberg. 2017. Visualizing dataflow graphs of deep learning models in tensorflow. IEEE transactions on visualization and computer graphics, 24, 1 (2017), 1–12.Google Scholar
- Mingyuan Wu, Ling Jiang, Jiahong Xiang, Yanwei Huang, Heming Cui, Lingming Zhang, and Yuqun Zhang. 2022. One Fuzzing Strategy to Rule Them All. In 2022 IEEE/ACM 44th International Conference on Software Engineering (ICSE).Google Scholar
Digital Library
- Xuejun Yang, Yang Chen, Eric Eide, and John Regehr. 2011. Finding and Understanding Bugs in C Compilers. In Proceedings of the 32nd ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI ’11). Association for Computing Machinery, New York, NY, USA. 283–294. isbn:9781450306638 https://doi.org/10.1145/1993498.1993532 Google Scholar
Digital Library
- Tom Young, Devamanyu Hazarika, Soujanya Poria, and Erik Cambria. 2018. Recent trends in deep learning based natural language processing. ieee Computational intelligenCe magazine, 13, 3 (2018), 55–75.Google Scholar
- Michal Zalewski. 2018. American Fuzzing Lop (AFL). https://lcamtuf.coredump.cx/afl/Google Scholar
- Mengshi Zhang, Yuqun Zhang, Lingming Zhang, Cong Liu, and Sarfraz Khurshid. 2018. DeepRoad: GAN-based metamorphic testing and input validation framework for autonomous driving systems. In 2018 33rd IEEE/ACM International Conference on Automated Software Engineering (ASE). 132–142.Google Scholar
Digital Library
- Qirun Zhang, Chengnian Sun, and Zhendong Su. 2017. Skeletal Program Enumeration for Rigorous Compiler Testing. In Proceedings of the 38th ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI 2017). Association for Computing Machinery, New York, NY, USA. 347–361. isbn:9781450349888 https://doi.org/10.1145/3062341.3062379 Google Scholar
Digital Library
- Jie Zhao, Bojie Li, Wang Nie, Zhen Geng, Renwei Zhang, Xiong Gao, Bin Cheng, Chen Wu, Yun Cheng, Zheng Li, Peng Di, Kun Zhang, and Xuefeng Jin. 2021. AKG: Automatic Kernel Generation for Neural Processing Units Using Polyhedral Transformations. PLDI 2021. Association for Computing Machinery, New York, NY, USA. 1233–1248. isbn:9781450383912 https://doi.org/10.1145/3453483.3454106 Google Scholar
Digital Library
- Yingquan Zhao, Zan Wang, Junjie Chen, Mengdi Liu, Mingyuan Wu, Yuqun Zhang, and Lingming Zhang. 2022. History-Driven Test Program Synthesis for JVM Testing. In 2022 IEEE/ACM 44th International Conference on Software Engineering (ICSE).Google Scholar
- Rui Zhong, Yongheng Chen, Hong Hu, Hangfan Zhang, Wenke Lee, and Dinghao Wu. 2020. Squirrel: Testing database management systems with language validity and coverage feedback. In Proceedings of the 2020 ACM SIGSAC Conference on Computer and Communications Security. 955–970.Google Scholar
Digital Library
- Husheng Zhou, Wei Li, Zelun Kong, Junfeng Guo, Yuqun Zhang, Bei Yu, Lingming Zhang, and Cong Liu. 2020. DeepBillboard: Systematic Physical-World Testing of Autonomous Driving Systems. In Proceedings of the ACM/IEEE 42nd International Conference on Software Engineering (ICSE ’20). Association for Computing Machinery, New York, NY, USA. 347–358. isbn:9781450371216 https://doi.org/10.1145/3377811.3380422 Google Scholar
Digital Library
Index Terms
Coverage-guided tensor compiler fuzzing with joint IR-pass mutation
Recommendations
Investigating Coverage Guided Fuzzing with Mutation Testing
Internetware '22: Proceedings of the 13th Asia-Pacific Symposium on InternetwareCoverage guided fuzzing (CGF) is an effective testing technique which has detected hundreds of thousands of bugs from various software applications. It focuses on maximizing code coverage to reveal more bugs during fuzzing. However, a higher coverage ...
Finding deep compiler bugs via guided stochastic program mutation
OOPSLA '15Compiler testing is important and challenging. Equivalence Modulo Inputs (EMI) is a recent promising approach for compiler validation. It is based on mutating the unexecuted statements of an existing program under some inputs to produce new equivalent ...
Fine-Grained Coverage-Based Fuzzing
Fuzzing is a popular software testing method that discovers bugs by massively feeding target applications with automatically generated inputs. Many state-of-art fuzzers use branch coverage as a feedback metric to guide the fuzzing process. The fuzzer ...






Comments