Abstract
In this article, we present new heuristic-search methods and algorithms that enable highly efficient, adaptive, defect-tolerant multiprocessor arrays. We consider systems in which a homogeneous multiprocessor array lies on top of reconfigurable interconnects that allow the pipeline stages of the processors to be connected in all possible configurations. Treating the array as partitioned into substitutable units at the granularity of pipeline stages, we employ a variety of heuristic-search methods and algorithms to isolate and replace defective units. The proposed heuristics are designed for off-line execution and aim to minimize the performance overhead that the interconnects' latency necessarily introduces to the array. We then carry out an empirical evaluation of the designed algorithms to assess the targeted problem and the efficacy of our approach. Our findings indicate that this is an NP-complete computational problem. Nevertheless, for the problem sizes we searched exhaustively, our heuristic-search methods achieve 100% accuracy in finding the optimal solution among 10^19 possible candidates within 2.5 seconds. Alternatively, they can provide near-optimal solutions, with an accuracy that consistently exceeds 70% (compared to the optimal solution), in only 10^-4 seconds.
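To make the problem setting concrete, the following is a minimal sketch of a generic hill-climbing local search over stage assignments. It is not one of the article's algorithms. The cost model (the index distance between the cores that supply consecutive stages, used as a stand-in for interconnect latency), the names `pipeline_cost` and `hill_climb`, and the example defect map are all assumptions made for illustration.

```python
import random


def pipeline_cost(assignment):
    """Sum of core-index distances between consecutive stages of every pipeline.

    assignment[s][k] is the core supplying stage s of logical pipeline k.
    The distance is an assumed proxy for interconnect latency.
    """
    cost = 0
    for prev_stage, next_stage in zip(assignment, assignment[1:]):
        for a, b in zip(prev_stage, next_stage):
            cost += abs(a - b)
    return cost


def hill_climb(healthy, iterations=2000, seed=0):
    """Hypothetical local search: healthy[s] lists cores whose stage s works."""
    rng = random.Random(seed)
    # The scarcest stage bounds how many complete pipelines can be formed.
    n_pipelines = min(len(units) for units in healthy)
    if n_pipelines == 0:
        return [[] for _ in healthy], 0

    # Start from a random valid assignment of healthy units to pipelines.
    assignment = [rng.sample(units, n_pipelines) for units in healthy]
    best = pipeline_cost(assignment)

    for _ in range(iterations):
        s = rng.randrange(len(healthy))
        k = rng.randrange(n_pipelines)
        old = list(assignment[s])
        spare = [u for u in healthy[s] if u not in assignment[s]]
        if spare and rng.random() < 0.5:
            # Move 1: substitute an unused healthy unit of the same stage.
            assignment[s][k] = rng.choice(spare)
        else:
            # Move 2: swap the units of two pipelines within one stage.
            j = rng.randrange(n_pipelines)
            assignment[s][k], assignment[s][j] = assignment[s][j], assignment[s][k]
        cost = pipeline_cost(assignment)
        if cost <= best:
            best = cost           # keep moves that do not worsen the cost
        else:
            assignment[s] = old   # undo a worsening move
    return assignment, best


# Assumed example: 4 cores, 3 stages; stage 1 of core 1 and stage 2 of
# core 0 are defective, so at most 3 complete pipelines can be formed.
healthy = [[0, 1, 2, 3], [0, 2, 3], [1, 2, 3]]
assignment, cost = hill_climb(healthy)
print(assignment, cost)
```

Plain hill climbing can stall in a local minimum, which is one reason to consider several heuristic-search methods rather than a single greedy one.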
Index Terms
Heuristic search for adaptive, defect-tolerant multiprocessor arrays