Abstract
Critical real-time systems require strict resource provisioning in terms of memory and timing. The constant need for higher performance in these systems has recently led industry to adopt GPUs. However, GPU software ecosystems are closed source by nature, forcing system engineers to treat them as black boxes and complicating resource provisioning. In this work, we reverse engineer the internal operations of GPU system software to improve the understanding of its observed behaviour and of how resources are managed internally. We present our methodology, incorporated in GMAI (GPU Memory Allocation Inspector), a tool that allows system engineers to accurately determine the exact amount of resources required by their critical systems, avoiding underprovisioning. We first apply our methodology to a wide range of GPU hardware from different vendors, demonstrating its generality in obtaining the properties of GPU memory allocators. We then demonstrate the benefits of this knowledge for resource provisioning in two case studies from the automotive domain, where the actual memory consumption is up to 5.6× more than the memory requested by the application.
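The gap between requested and actually consumed memory arises because GPU allocators typically round each request up to an internal block granularity. As a minimal illustration (not GMAI's actual measurement method; the rounding policy and 4 KiB minimum block are assumptions, not measured properties of any real driver), the sketch below models an allocator that rounds every request up to the next power of two and computes the resulting overhead for a set of small buffers:

```python
# Hypothetical allocator model: every request is rounded up to the next
# power of two, with a 4 KiB minimum block. The parameters are assumptions
# chosen for illustration only.

def allocated_size(requested: int, min_block: int = 4096) -> int:
    """Return the internal size this hypothetical allocator reserves."""
    size = max(requested, min_block)
    # Round up to the next power of two.
    return 1 << (size - 1).bit_length()

# Example: an application requesting several small buffers.
requests = [100, 1000, 3000, 5000]
consumed = sum(allocated_size(r) for r in requests)
print(f"requested {sum(requests)} B, consumed {consumed} B "
      f"({consumed / sum(requests):.1f}x overhead)")
```

Under such a model, a tool like GMAI can infer the block granularity empirically, for instance by allocating buffers of increasing size and observing at which request sizes the reported free device memory drops by another block.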
GMAI: Understanding and Exploiting the Internals of GPU Resource Allocation in Critical Systems