TensorRT-Based Framework and Optimization Methodology for Deep Learning Inference on Jetson Boards

Published: 08 October 2022

Abstract

As deep learning inference workloads grow on embedded devices, such devices increasingly include neural processing units (NPUs) in addition to a multi-core CPU and a GPU; the NVIDIA Jetson AGX Xavier is one example. For fast and efficient development of deep learning applications, NVIDIA provides TensorRT, an SDK for high-performance inference that includes an optimizer and a runtime delivering low latency and high throughput. Like most deep learning frameworks, however, TensorRT assumes that inference runs on a single processing element, a GPU or an NPU, but not both. In this article, we present a TensorRT-based framework that supports various optimization parameters to accelerate a deep learning application on an NVIDIA Jetson embedded platform with heterogeneous processors, including multi-threading, pipelining, buffer assignment, and network duplication. Since the design space of allocating layers to the diverse processing elements and tuning the other parameters is huge, we devise a parameter optimization methodology that consists of a heuristic for balancing pipeline stages among the heterogeneous processors and a fine-tuning process for the remaining parameters. On nine real-life benchmarks, we achieve 101% to 680% performance improvement and up to 55% energy reduction over the baseline inference that uses a GPU only.
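The pipelined, multi-threaded execution that the abstract describes can be sketched as a producer/consumer pipeline over a bounded queue. This is a minimal illustration, not the paper's actual framework: the stage bodies `gpu_stage` and `npu_stage` are hypothetical placeholders standing in for TensorRT execution contexts bound to the GPU and the NPU (DLA), and the bounded queue plays the role of the inter-stage buffer assignment.

```python
import threading
import queue

def gpu_stage(x):
    # Placeholder for the front partition of the network running on the GPU.
    return x + 1

def npu_stage(x):
    # Placeholder for the back partition of the network running on the NPU.
    return x * 2

def run_pipeline(inputs):
    q = queue.Queue(maxsize=2)   # bounded buffer between the two stages
    results = []

    def producer():
        # Stage 1: runs concurrently with stage 2, so a new frame can enter
        # the GPU while the previous frame is still on the NPU.
        for x in inputs:
            q.put(gpu_stage(x))
        q.put(None)              # sentinel: no more frames

    def consumer():
        # Stage 2: drains the buffer and produces the final outputs in order.
        while True:
            y = q.get()
            if y is None:
                break
            results.append(npu_stage(y))

    t1 = threading.Thread(target=producer)
    t2 = threading.Thread(target=consumer)
    t1.start(); t2.start()
    t1.join(); t2.join()
    return results

print(run_pipeline([0, 1, 2]))  # -> [2, 4, 6]
```

With stage times t1 and t2, steady-state throughput is bounded by max(t1, t2) rather than t1 + t2, which is why balancing the pipeline stages across the heterogeneous processors matters.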



Published in

ACM Transactions on Embedded Computing Systems, Volume 21, Issue 5 (September 2022), 526 pages
ISSN: 1539-9087
EISSN: 1558-3465
DOI: 10.1145/3561947
Editor: Tulika Mitra

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery, New York, NY, United States

Publication History

• Published: 8 October 2022
• Online AM: 26 January 2022
• Accepted: 26 December 2021
• Revised: 24 November 2021
• Received: 15 July 2021

            Qualifiers

            • research-article
            • Refereed
