Abstract
As deep learning inference applications proliferate on embedded devices, such devices increasingly combine neural processing units (NPUs) with a multi-core CPU and a GPU; the NVIDIA Jetson AGX Xavier is one example. For fast and efficient development of deep learning applications, NVIDIA provides TensorRT, an SDK for high-performance inference that includes an optimizer and a runtime delivering low latency and high throughput. Like most deep learning frameworks, however, TensorRT assumes that inference is executed on a single processing element, a GPU or an NPU, not both. In this article, we present a TensorRT-based framework that supports various optimization parameters to accelerate a deep learning application on an NVIDIA Jetson embedded platform with heterogeneous processors, including multi-threading, pipelining, buffer assignment, and network duplication. Since the design space of allocating layers to the diverse processing elements and optimizing the other parameters is huge, we devise a parameter optimization methodology that consists of a heuristic for balancing pipeline stages among heterogeneous processors and a fine-tuning process for the remaining parameters. With nine real-life benchmarks, we achieve a 101%~680% performance improvement and up to 55% energy reduction over the baseline inference using a GPU only.
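To illustrate the pipelining idea described above, the following is a minimal sketch, not the paper's actual implementation: a network is split into pipeline stages mapped to different processing elements (e.g., the GPU runs the front layers while the NPU runs the back layers), each stage runs in its own thread, and bounded queues act as the inter-stage buffers whose sizes are among the tunable parameters. The stage functions `front` and `back` are hypothetical stand-ins for the two halves of a network.

```python
import queue
import threading

def make_stage(fn, in_q, out_q):
    """Wrap a stage function in a worker thread that pulls from in_q,
    applies fn, and pushes results to out_q."""
    def run():
        while True:
            item = in_q.get()
            if item is None:          # sentinel: propagate shutdown downstream
                out_q.put(None)
                break
            out_q.put(fn(item))
    return threading.Thread(target=run)

def pipelined_inference(inputs, stage_fns, buffer_size=2):
    """Run inputs through a chain of stages concurrently.
    buffer_size models the inter-stage buffer assignment parameter."""
    qs = [queue.Queue(maxsize=buffer_size) for _ in range(len(stage_fns) + 1)]
    threads = [make_stage(fn, qs[i], qs[i + 1])
               for i, fn in enumerate(stage_fns)]
    for t in threads:
        t.start()
    for x in inputs:                  # feed frames into the first stage
        qs[0].put(x)
    qs[0].put(None)                   # end-of-stream sentinel
    results = []
    while (r := qs[-1].get()) is not None:
        results.append(r)
    for t in threads:
        t.join()
    return results

# Hypothetical stand-ins for two halves of a network split across PEs.
front = lambda x: x * 2               # e.g., GPU front-end layers
back = lambda x: x + 1                # e.g., NPU back-end layers
print(pipelined_inference(range(4), [front, back]))  # [1, 3, 5, 7]
```

With stages of comparable cost, the pipeline overlaps their execution so that steady-state throughput is bounded by the slowest stage rather than the sum of all stages; balancing those stage costs across the heterogeneous processors is exactly what the paper's heuristic targets.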
TensorRT-Based Framework and Optimization Methodology for Deep Learning Inference on Jetson Boards