research-article

CommGuard: Mitigating Communication Errors in Error-Prone Parallel Execution

Published: 14 March 2015

Abstract

As semiconductor technology scales towards ever-smaller transistor sizes, hardware fault rates are increasing. Since important application classes (e.g., multimedia, streaming workloads) are data-error-tolerant, recent research has proposed techniques that seek to save energy or improve yield by exploiting error tolerance at the architecture/microarchitecture level. Even seemingly error-tolerant applications, however, will crash or hang due to control-flow or memory-addressing errors. In parallel computation, errors involving inter-thread communication can have equally catastrophic effects. Our work explores techniques that mitigate the impact of potentially catastrophic errors in parallel computation, while still garnering power, cost, or yield benefits from data error tolerance. Our proposed CommGuard solution uses FSM-based checkers to pad and discard data in order to maintain semantic alignment between program control flow and the data communicated between processors. CommGuard techniques are low-overhead and exploit application information already provided by some parallel programming languages (e.g., StreamIt). By converting potentially catastrophic communication errors into potentially tolerable data errors, CommGuard allows important streaming applications like JPEG and MP3 decoding to execute without crashing and to sustain good output quality, even for errors as frequent as every 500 μs.
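The pad-and-discard idea can be illustrated with a small software sketch. It is not the CommGuard hardware design. It assumes a hypothetical token stream in which a producer marks frame boundaries with `("hdr", frame_id)` tokens and sends payload as `("data", value)` tokens, and a consumer that expects a fixed number of items per frame. The names `realign`, `items_per_frame`, and `pad_value` are invented for this illustration.

```python
from typing import Iterable, Iterator, List, Optional, Tuple

# Hypothetical token format (an assumption, not taken from the paper):
#   ("hdr", frame_id)  marks the start of a frame
#   ("data", value)    carries one payload item
Token = Tuple[str, int]


def realign(tokens: Iterable[Token], items_per_frame: int,
            pad_value: int = 0) -> Iterator[List[int]]:
    """Yield frames of exactly items_per_frame items each.

    Short frames are padded with pad_value; surplus items, and items that
    arrive before the first header, are discarded. The consumer therefore
    always sees the item count it expects, at the cost of possibly wrong
    data values.
    """
    frame: Optional[List[int]] = None  # None until the first header is seen
    for kind, value in tokens:
        if kind == "hdr":
            if frame is not None:
                # Close the previous frame, padding it if items were lost.
                yield frame + [pad_value] * (items_per_frame - len(frame))
            frame = []
        elif frame is not None and len(frame) < items_per_frame:
            frame.append(value)
        # Otherwise: an unaligned or surplus item, silently discarded.
    if frame is not None:
        # Flush the final frame at end of stream.
        yield frame + [pad_value] * (items_per_frame - len(frame))


# A stream with a stray leading item, one short frame, and one long frame.
stream = [("data", 99), ("hdr", 0), ("data", 1), ("data", 2), ("data", 3),
          ("hdr", 1), ("data", 4),
          ("hdr", 2), ("data", 5), ("data", 6), ("data", 7), ("data", 8)]
frames = list(realign(stream, items_per_frame=3))
```

In this sketch a lost or duplicated item corrupts at most the frame in which it occurs, and the next header restores alignment. This mirrors the abstract's point that a communication error becomes a bounded data error rather than a persistent misalignment.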



Published in

ACM SIGPLAN Notices, Volume 50, Issue 4 (ASPLOS '15)
April 2015
676 pages
ISSN: 0362-1340
EISSN: 1558-1160
DOI: 10.1145/2775054
Editor: Andy Gill

ASPLOS '15: Proceedings of the Twentieth International Conference on Architectural Support for Programming Languages and Operating Systems
March 2015
720 pages
ISBN: 9781450328357
DOI: 10.1145/2694344

Copyright © 2015 ACM

Publisher

Association for Computing Machinery, New York, NY, United States
