Reliability-Aware Adaptations for Shared Last-Level Caches in Multi-Cores

Published: 01 September 2016

Abstract

On account of their large footprint, on-chip last-level caches in multi-core systems are among the components most vulnerable to soft errors. However, vulnerability to soft errors depends strongly on the configuration and parameters of the last-level cache, especially when different applications execute concurrently. In this article, we propose a novel reliability-aware reconfigurable last-level cache architecture (R2Cache) and a cache vulnerability model for multi-cores. R2Cache supports various reliability-wise efficient cache configurations (i.e., cache parameter selection and cache partitioning) for different concurrently executing applications. The proposed vulnerability model takes into account the vulnerability of both the data and tag arrays as well as the active cache area for applications in different execution phases. To enable runtime adaptations, we introduce a lightweight online vulnerability predictor that exploits knowledge of performance metrics, such as the number of L2 misses, to accurately estimate the cache vulnerability to soft errors. Based on the predicted vulnerabilities of the concurrently executing applications in the current execution epoch, our runtime reliability manager reconfigures the cache such that, for the next execution epoch, the total vulnerability of all concurrently executing applications is minimized under user-provided tolerable performance/energy overheads. In scenarios where single-bit error correction for cache lines can be afforded, vulnerability-aware reconfigurations can be leveraged to increase the reliability of the last-level cache against multi-bit errors. Compared to state-of-the-art vulnerability-minimizing and reconfigurable caches, the proposed architecture provides 35.27% and 23.42% vulnerability savings, respectively, when averaged across numerous experiments, while reducing the vulnerability by more than 65% and 60%, respectively, for selected applications and application phases.
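The epoch-based selection step described in the abstract can be illustrated with a minimal sketch. This is not the R2Cache algorithm itself. It assumes a hypothetical predictor table that gives, per application and per candidate number of cache ways, a predicted vulnerability and a relative performance overhead. It also assumes that ways may be left unused and that an exhaustive search is affordable. The names `select_partition`, `predicted`, `total_ways`, and `max_overhead`, as well as all numbers, are illustrative assumptions.

```python
from itertools import product


def select_partition(predicted, total_ways, max_overhead):
    """Pick a way allocation that minimizes the total predicted vulnerability.

    predicted[app][ways] = (vulnerability, overhead) is a hypothetical
    predictor output for the current epoch (an assumption, not the paper's
    data format); overhead is the relative slowdown versus a baseline.
    """
    apps = sorted(predicted)
    best_alloc, best_vuln = None, float("inf")
    # Enumerate every combination of candidate way counts.
    # Exhaustive search is only practical for a handful of applications.
    for ways in product(*(sorted(predicted[a]) for a in apps)):
        if sum(ways) > total_ways:
            continue  # allocation exceeds the cache; unused ways are allowed
        entries = [predicted[a][w] for a, w in zip(apps, ways)]
        if any(overhead > max_overhead for _, overhead in entries):
            continue  # violates the user-provided tolerable overhead
        total = sum(vuln for vuln, _ in entries)
        if total < best_vuln:
            best_alloc, best_vuln = dict(zip(apps, ways)), total
    return best_alloc, best_vuln


# Made-up predictions for two applications sharing an 8-way cache.
example = {
    "app0": {2: (0.40, 0.08), 4: (0.55, 0.03), 6: (0.70, 0.00)},
    "app1": {2: (0.30, 0.12), 4: (0.32, 0.04), 6: (0.50, 0.00)},
}
allocation, vulnerability = select_partition(example, total_ways=8, max_overhead=0.10)
print(allocation, vulnerability)
```

With these made-up numbers, giving `app1` only two ways would have the lowest vulnerability for that application, but its 12% overhead exceeds the 10% tolerance, so the search settles on two ways for `app0` and four for `app1`. A real runtime manager would likely replace the exhaustive enumeration with a cheaper heuristic.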
