SUGAR: Speeding Up GPGPU Application Resilience Estimation with Input Sizing

Published: 22 February 2021

Abstract

As Graphics Processing Units (GPUs) are becoming a de facto solution for accelerating a wide range of applications, their reliable operation is becoming increasingly important. One of the major challenges in the domain of GPU reliability is to accurately measure GPGPU application error resilience. This challenge stems from the fact that a typical GPGPU application spawns a huge number of threads and then utilizes a large amount of potentially unreliable compute and memory resources available on the GPUs. As the number of possible fault locations can be in the billions, evaluating every fault and examining its effect on the application error resilience is impractical. Instead, application resilience is evaluated via extensive fault injection campaigns that sample from a vast fault site space. Typically, the larger the input of the GPGPU application, the longer the experimental campaign. In this work, we devise a methodology, SUGAR (Speeding Up GPGPU Application Resilience Estimation with input sizing), that dramatically speeds up the evaluation of GPGPU application error resilience by judicious input sizing. We show that analyzing a small fraction of the input is sufficient to estimate the application resilience with high accuracy while dramatically reducing the duration of experimentation. Key to our estimation methodology is the discovery of repeating patterns as a function of the input size. Using the well-established fact that error resilience in GPGPU applications is mostly determined by the dynamic instruction count at the thread level, we identify the patterns that allow us to accurately predict application error resilience for arbitrarily large inputs. For the cases that we examine in this paper, this new resilience estimation mechanism provides significant speedups (up to 1,336 times, and 97.0 times on average), while keeping estimation errors to less than 1%.
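The core idea sketched in the abstract — group threads by their dynamic instruction count, measure per-group resilience on a small input via fault injection, and extrapolate to larger inputs under the assumption that the group populations repeat proportionally with input size — can be illustrated with a toy sketch. This is not SUGAR's actual implementation; the function names, the per-group masking rates, and the exact proportional-scaling assumption are all illustrative.

```python
from collections import Counter

def thread_groups(dyn_inst_counts):
    """Group threads by dynamic instruction count. Threads with the same
    count are assumed to share error-resilience behavior (a hypothesis the
    abstract attributes to prior work, simplified here)."""
    return Counter(dyn_inst_counts)

def predict_resilience(small_input_counts, group_resilience, scale):
    """Predict whole-application resilience for an input `scale` times
    larger, assuming each thread group's population grows proportionally
    with input size while its per-thread resilience stays fixed."""
    groups = thread_groups(small_input_counts)
    total_threads = sum(groups.values()) * scale
    masked = sum(n * scale * group_resilience[c] for c, n in groups.items())
    return masked / total_threads

# Toy example: two thread groups observed on a small input.
small = [100, 100, 250, 250, 250]       # per-thread dynamic instruction counts
resilience = {100: 0.90, 250: 0.60}     # illustrative masking rate per group
print(predict_resilience(small, resilience, scale=1000))  # → 0.72
```

Note that under exact proportional repetition the scale factor cancels, which is precisely why profiling a small input can suffice: the expensive fault-injection campaign only needs to cover one representative sample per repeating pattern, not the full large-input thread population.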

