Abstract
Scaling modern architectures beyond a certain number of cores requires an on-chip network that is more scalable than conventional buses. However, employing an on-chip network in a manycore system makes the latencies of the data accesses issued by a core non-uniform, and this non-uniformity can play a significant role in shaping overall application performance. This work presents a novel compiler strategy that exposes architecture information to the compiler to enable an optimized computation-to-core mapping. Specifically, we propose a compiler-guided scheme that takes into account the relative positions of (and distances between) cores, last-level caches (LLCs), and memory controllers (MCs) in a manycore system, and generates a mapping of computations to cores with the goal of minimizing on-chip network traffic. Experimental data collected using a set of 21 multi-threaded applications reveal that, on average, our approach reduces the on-chip network latency in a 6×6 manycore system by 38.4% in the case of private LLCs and by 43.8% in the case of shared LLCs. These improvements translate to execution time improvements of 10.9% and 12.7% for the private-LLC and shared-LLC based systems, respectively.
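To make the idea of a location-aware mapping concrete, the following is a minimal sketch, not the scheme evaluated in the paper. It assumes a 6×6 2D mesh in which the distance between two nodes is their Manhattan hop count, and it uses a simple greedy heuristic chosen purely for illustration. All names (`hops`, `traffic_cost`, `greedy_map`, the `chunk_*` labels) and the access counts are hypothetical.

```python
from itertools import product

MESH = 6  # assumed 6x6 mesh, matching the system size quoted in the abstract


def hops(a, b):
    """Manhattan hop count between two (row, col) mesh nodes (assumed metric)."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def traffic_cost(core, accesses):
    """Hop count from `core` to each accessed node, weighted by access count.

    `accesses` maps a target node (an LLC bank or MC position) to the
    number of accesses a computation is estimated to issue to it.
    """
    return sum(count * hops(core, node) for node, count in accesses.items())


def greedy_map(computations, cores):
    """Illustrative greedy heuristic: place one computation per core.

    Computations with the most accesses are placed first, each on the
    free core that minimizes its weighted hop count.
    """
    free = set(cores)
    mapping = {}
    by_weight = sorted(computations, key=lambda c: -sum(computations[c].values()))
    for name in by_weight:
        # sorted(free) makes tie-breaking deterministic (lowest row, then column)
        best = min(sorted(free), key=lambda core: traffic_cost(core, computations[name]))
        mapping[name] = best
        free.remove(best)
    return mapping


cores = list(product(range(MESH), range(MESH)))

# Hypothetical access profiles: target node -> estimated access count
computations = {
    "chunk_a": {(0, 0): 100, (0, 1): 20},
    "chunk_b": {(5, 5): 80},
    "chunk_c": {(0, 0): 50},
}

mapping = greedy_map(computations, cores)
for name, core in mapping.items():
    print(name, core, traffic_cost(core, computations[name]))
```

In this sketch, `chunk_a` and `chunk_c` compete for the core nearest node (0, 0); the heavier computation wins it and the lighter one falls back to an adjacent core. A real compiler pass would derive the access profiles from program analysis and would also have to account for whether the LLC is private or shared, which this example does not model.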
Enhancing computation-to-core assignment with physical location information
PLDI 2018: Proceedings of the 39th ACM SIGPLAN Conference on Programming Language Design and Implementation