Abstract

As Graphics Processing Units (GPUs) become a de facto solution for accelerating a wide range of applications, their reliable operation is becoming increasingly important. One of the major challenges in GPU reliability is accurately measuring GPGPU application error resilience. This challenge stems from the fact that a typical GPGPU application spawns a huge number of threads and uses a large amount of the potentially unreliable compute and memory resources available on the GPU. Because the number of possible fault locations can be in the billions, evaluating every fault and examining its effect on the application's error resilience is impractical. Instead, application resilience is evaluated via extensive fault injection campaigns that sample this vast fault site space. Typically, the larger the input of the GPGPU application, the longer the experimental campaign. In this work, we devise a methodology, SUGAR (Speeding Up GPGPU Application Resilience Estimation with input sizing), that dramatically speeds up the evaluation of GPGPU application error resilience through judicious input sizing. We show that analyzing a small fraction of the input is sufficient to estimate application resilience with high accuracy while dramatically reducing the duration of experimentation. Key to our estimation methodology is the discovery of repeating patterns as a function of the input size. Using the well-established fact that error resilience in GPGPU applications is mostly determined by the dynamic instruction count at the thread level, we identify the patterns that allow us to accurately predict application error resilience for arbitrarily large inputs. For the cases that we examine in this paper, this new resilience estimation mechanism provides significant speedups (up to 1336 times, and 97.0 times on average) while keeping estimation errors below 1%.
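The idea that thread-level dynamic instruction counts drive resilience can be illustrated with a small sketch. This is not the SUGAR implementation: the function names, the per-group masked-fault rates, and the assumption that fault sites are spread uniformly over dynamic instructions are all hypothetical, chosen only to show why a repeating pattern across input sizes lets a small input stand in for a large one.

```python
from collections import Counter

def group_weights(icnt_per_thread):
    """Share of fault sites contributed by each dynamic-instruction-count group.

    Assumption (ours): a group's fault sites are proportional to
    (number of threads in the group) x (instructions per thread).
    """
    threads_per_group = Counter(icnt_per_thread)
    sites = {icnt: n * icnt for icnt, n in threads_per_group.items()}
    total = sum(sites.values())
    return {icnt: s / total for icnt, s in sites.items()}

def estimate_resilience(icnt_per_thread, masked_rate_by_icnt):
    """Weighted average of per-group masked-fault rates."""
    weights = group_weights(icnt_per_thread)
    return sum(w * masked_rate_by_icnt[icnt] for icnt, w in weights.items())

# Hypothetical per-thread dynamic instruction counts for two input sizes.
# The large input repeats the small input's thread mix 100 times.
small_input = [10] * 6 + [20] * 2
large_input = [10] * 600 + [20] * 200

# Hypothetical masked-fault rates, as if measured by fault injection
# on one representative thread per group of the small input.
masked_rates = {10: 0.9, 20: 0.5}

small_estimate = estimate_resilience(small_input, masked_rates)
large_estimate = estimate_resilience(large_input, masked_rates)
```

In this toy setup both inputs have the same group shares, so the estimate obtained from the small input carries over to the large one without further fault injection. Real applications need the pattern in group shares to be discovered rather than assumed.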