research-article

Enhancing computation-to-core assignment with physical location information

Published: 11 June 2018
Abstract

Going beyond a certain number of cores in modern architectures requires an on-chip network that is more scalable than conventional buses. However, employing an on-chip network in a manycore system (to improve scalability) makes the latencies of the data accesses issued by a core non-uniform, and this non-uniformity can play a significant role in shaping overall application performance. This work presents a novel compiler strategy that exposes architecture information to the compiler to enable an optimized computation-to-core mapping. Specifically, we propose a compiler-guided scheme that takes into account the relative positions of (and distances between) cores, last-level caches (LLCs), and memory controllers (MCs) in a manycore system, and generates a mapping of computations to cores with the goal of minimizing on-chip network traffic. Experimental data collected using a set of 21 multi-threaded applications reveal that, on average, our approach reduces the on-chip network latency in a 6×6 manycore system by 38.4% in the case of private LLCs and by 43.8% in the case of shared LLCs. These improvements translate to execution time improvements of 10.9% and 12.7% for the private-LLC and shared-LLC systems, respectively.
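As a rough illustration of what a location-aware mapping can look like, the sketch below places blocks of computation on a 6×6 mesh so that an estimated traffic cost (access count times hop distance to each memory controller) stays small. It is not the scheme proposed in the paper: the corner placement of the memory controllers, the per-block access counts, the greedy heaviest-first ordering, and the names `MC_POS`, `hops`, `cost`, and `map_blocks` are all assumptions made for this example.

```python
from itertools import product

MESH = 6  # 6x6 mesh, matching the system size quoted in the abstract

# Assumption: four memory controllers at the mesh corners (hypothetical layout).
MC_POS = [(0, 0), (0, MESH - 1), (MESH - 1, 0), (MESH - 1, MESH - 1)]


def hops(a, b):
    """Manhattan distance between two mesh nodes, used as a hop-count estimate."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def cost(core, mc_accesses):
    """Estimated traffic if a block with these per-MC access counts runs on `core`."""
    return sum(n * hops(core, mc) for n, mc in zip(mc_accesses, MC_POS))


def map_blocks(blocks):
    """Greedily assign each block to the cheapest core that is still free.

    `blocks` maps a block name to its per-MC access counts (hypothetical
    profile data). Blocks with more total accesses are placed first, and
    ties between equally cheap cores are broken by core coordinates.
    """
    free = set(product(range(MESH), repeat=2))
    mapping = {}
    for name, acc in sorted(blocks.items(), key=lambda kv: -sum(kv[1])):
        best = min(free, key=lambda c: (cost(c, acc), c))
        mapping[name] = best
        free.remove(best)
    return mapping


if __name__ == "__main__":
    demo = {"b0": [100, 0, 0, 0], "b1": [0, 0, 0, 50], "b2": [30, 30, 0, 0]}
    for name, core in sorted(map_blocks(demo).items()):
        print(name, "->", core, "cost", cost(core, demo[name]))
```

A greedy pass like this is only a baseline. It ignores LLC bank locations, link contention, and data sharing between blocks, all of which a real mapping strategy would need to weigh.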

Supplemental Material

p312-kislal.webm

    • Published in

      ACM SIGPLAN Notices, Volume 53, Issue 4
      PLDI '18
      April 2018
      834 pages
      ISSN: 0362-1340
      EISSN: 1558-1160
      DOI: 10.1145/3296979

      PLDI 2018: Proceedings of the 39th ACM SIGPLAN Conference on Programming Language Design and Implementation
      June 2018
      825 pages
      ISBN: 9781450356985
      DOI: 10.1145/3192366

      Copyright © 2018 ACM

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      • Published: 11 June 2018

      Qualifiers

      • research-article
