Abstract
As semiconductor technology scales towards ever-smaller transistor sizes, hardware fault rates are increasing. Since important application classes (e.g., multimedia, streaming workloads) are data-error-tolerant, recent research has proposed techniques that seek to save energy or improve yield by exploiting error tolerance at the architecture/microarchitecture level. Even seemingly error-tolerant applications, however, will crash or hang due to control-flow or memory-addressing errors. In parallel computation, errors involving inter-thread communication can have equally catastrophic effects. Our work explores techniques that mitigate the impact of potentially catastrophic errors in parallel computation, while still garnering power, cost, or yield benefits from data error tolerance. Our proposed CommGuard solution uses FSM-based checkers to pad and discard data in order to maintain semantic alignment between program control flow and the data communicated between processors. CommGuard techniques are low-overhead, and they exploit application information already provided by some parallel programming languages (e.g., StreamIt). By converting potentially catastrophic communication errors into potentially tolerable data errors, CommGuard allows important streaming applications like JPEG and MP3 decoding to execute without crashing and to sustain good output quality, even for errors as frequent as every 500 μs.
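As a rough illustration of the pad-and-discard idea, the following minimal Python sketch realigns a token stream into fixed-size frames. It is not the CommGuard hardware design: the token encoding (`("HDR", frame_id)` and `("DATA", value)` tuples), the function name `align_frames`, the fixed `items_per_frame`, and the zero pad value are all assumptions made for this example.

```python
from typing import List, Optional, Sequence, Tuple

# Hypothetical token encoding: ("HDR", frame_id) marks the start of a frame,
# ("DATA", value) carries one data item. Neither comes from the paper.
Token = Tuple[str, int]


def align_frames(tokens: Sequence[Token],
                 items_per_frame: int,
                 pad_value: int = 0) -> List[List[int]]:
    """Rebuild frames of exactly `items_per_frame` items from a faulty stream.

    Short frames are padded with `pad_value`; surplus items, and items seen
    before the first header, are discarded. Each header resynchronizes the
    consumer, so a communication error stays confined to one frame.
    """
    frames: List[List[int]] = []
    current: Optional[List[int]] = None  # None: still waiting for a header

    def close(frame: List[int]) -> None:
        # Pad a short frame up to the expected length, then emit it.
        frame.extend([pad_value] * (items_per_frame - len(frame)))
        frames.append(frame)

    for kind, value in tokens:
        if kind == "HDR":
            if current is not None:
                close(current)
            current = []
        elif current is not None and len(current) < items_per_frame:
            current.append(value)
        # Otherwise the item is discarded (no open frame, or frame is full).

    if current is not None:
        close(current)
    return frames


# Frame 1 lost an item and frame 2 gained a spurious one.
stream = [("HDR", 0), ("DATA", 1), ("DATA", 2), ("DATA", 3),
          ("HDR", 1), ("DATA", 4), ("DATA", 5),
          ("HDR", 2), ("DATA", 7), ("DATA", 8), ("DATA", 9), ("DATA", 99)]
result = align_frames(stream, items_per_frame=3)
```

In this sketch the consumer always receives three items per frame, so its control flow never runs ahead of or behind the data. The lost or spurious item shows up only as a data error inside a single frame, which an error-tolerant application may be able to absorb.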
CommGuard: Mitigating Communication Errors in Error-Prone Parallel Execution
ASPLOS '15: Proceedings of the Twentieth International Conference on Architectural Support for Programming Languages and Operating Systems