Abstract
Advances in technology scaling make emerging Chip MultiProcessor (CMP) platforms increasingly susceptible to failures, raising a range of reliability challenges. In such platforms, error-prone on-chip memories (caches) continue to dominate the chip area, and Network-on-Chip (NoC) fabrics are increasingly used to manage the scalability of these architectures. We present a novel solution for efficiently implementing fault-tolerant Last-Level Cache (LLC) designs in CMP architectures. The proposed approach leverages the interconnection network fabric to protect the LLC banks against permanent faults in an efficient and scalable way. During an LLC access to a faulty block, the network detects and corrects the faults and returns fault-free data to the requesting core. By leveraging the NoC fabric, designers can implement any cache fault-tolerance scheme in an efficient, modular, and scalable manner for emerging multicore and manycore platforms. We propose four policies for implementing a remapping-based fault-tolerant scheme on the NoC fabric in different settings. These policies expose design trade-offs between NoC traffic (packets sent through the network) and the intrinsic parallelism of these communication mechanisms, allowing designers to tune the system to their design constraints. We perform an extensive design space exploration on NoC benchmarks to demonstrate the usability and efficacy of our approach. In addition, we perform a sensitivity analysis to observe how the policies respond to improvements in the NoC architecture. The overheads of leveraging the NoC fabric are minimal: on an 8-core, 16-cache-bank CMP, we demonstrate reliable access to LLCs with additional overheads of less than 3% in area and less than 7% in power.
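The general idea of network-level remapping described in the abstract can be illustrated with a toy model. The sketch below is not the paper's design and does not reproduce any of its four policies; the class names (`Bank`, `RemappingNoC`), the per-access packet counts, and the choice of spare location are all hypothetical assumptions made only to show how an access to a faulty block might be redirected to a fault-free block in another bank.

```python
from dataclasses import dataclass, field


@dataclass
class Bank:
    """One LLC bank: block storage plus the indices of permanently faulty blocks."""
    blocks: dict = field(default_factory=dict)
    faulty: set = field(default_factory=set)


class RemappingNoC:
    """Hypothetical network interface that redirects accesses to faulty blocks."""

    def __init__(self, num_banks):
        self.banks = [Bank() for _ in range(num_banks)]
        self.remap = {}    # (bank, block) -> (spare_bank, spare_block)
        self.packets = 0   # crude traffic counter (assumed cost model)

    def mark_faulty(self, bank, block, spare_bank, spare_block):
        """Record a permanent fault and the fault-free block that replaces it."""
        self.banks[bank].faulty.add(block)
        self.remap[(bank, block)] = (spare_bank, spare_block)

    def _resolve(self, bank, block):
        self.packets += 1                      # request packet to the home bank
        if block in self.banks[bank].faulty:
            self.packets += 1                  # extra packet forwarded to the spare
            return self.remap[(bank, block)]
        return bank, block

    def write(self, bank, block, data):
        target_bank, target_block = self._resolve(bank, block)
        self.banks[target_bank].blocks[target_block] = data

    def read(self, bank, block):
        target_bank, target_block = self._resolve(bank, block)
        self.packets += 1                      # reply packet to the requesting core
        return self.banks[target_bank].blocks.get(target_block)


noc = RemappingNoC(num_banks=16)
noc.mark_faulty(bank=3, block=42, spare_bank=7, spare_block=42)
noc.write(3, 42, "payload")
value = noc.read(3, 42)
```

In this toy model the requesting core always addresses the home bank, and the redirection costs one extra packet per access to a faulty block. That extra traffic is the kind of cost the abstract says its policies trade against parallelism, although the real cost model is not given here.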
Index Terms
NoC-based fault-tolerant cache design in chip multiprocessors