Abstract
Asymmetric coherency is a new optimization method for cache coherency policies that supports nonuniform workloads in multicore processors. It assists in load-balancing a workload and is applicable to SoC multicores, where applications are not evenly spread among the processors and the coherency mechanism can be customized. Because asymmetric coherency is a policy change, our designs require little or no additional hardware over an existing system. We explore two types of asymmetric coherency policy. Our bus-based asymmetric coherency policy generated a 60% reduction in coherency cost (the latency incurred by coherency messages) for nonshared data. Our directory-based asymmetric coherency policy, using statically allocated asymmetry, showed up to a 5.8% improvement in execution time and up to a 22% improvement in average memory latency on the parallel benchmark SHA. Dynamically allocated asymmetry was found to generate further improvements in access latency, increasing the effectiveness of asymmetric coherency by up to 73.8% compared to the static asymmetric solution.
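The core saving for nonshared data can be illustrated with a toy message-counting model (this is an illustrative sketch, not the authors' simulator or protocol: the core count, workload mix, and the rule that statically private lines skip coherence actions are all assumptions made here for exposition). A symmetric policy pays invalidation traffic on every write, while an asymmetric policy lets lines marked private to one core skip coherence entirely:

```python
# Toy model: coherency messages generated by writes under a symmetric
# policy (every write probes/invalidates other caches) versus an
# asymmetric policy in which lines statically marked private to the
# writing core generate no coherence traffic at all.

NUM_CORES = 4  # assumed core count for this sketch


def messages_for_write(writer, line_owner, sharers, policy):
    """Return the number of coherency messages one write generates."""
    if policy == "asymmetric" and line_owner == writer and not sharers:
        return 0  # privately owned line: no invalidations needed
    # Symmetric case: invalidate the known sharers, or probe every
    # other core when the sharer set is unknown/empty.
    return len(sharers) if sharers else NUM_CORES - 1


def total_messages(policy, private_writes=80, shared_writes=20):
    """Message count for a workload where most writes hit private data."""
    total = 0
    for _ in range(private_writes):  # core 0 writing its own data
        total += messages_for_write(0, line_owner=0, sharers=set(), policy=policy)
    for _ in range(shared_writes):   # core 0 writing data shared with cores 1, 2
        total += messages_for_write(0, line_owner=0, sharers={1, 2}, policy=policy)
    return total


sym = total_messages("symmetric")    # 80*3 + 20*2 = 280 messages
asym = total_messages("asymmetric")  # 0    + 20*2 = 40 messages
print(sym, asym)
```

Under this (assumed) 80/20 private-to-shared write mix, the asymmetric policy eliminates all coherency messages for the nonshared writes, which is the effect the bus-based policy's 60% coherency-cost reduction captures for real workloads.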
Asymmetric Cache Coherency: Policy Modifications to Improve Multicore Performance