Research Article | Public Access

Mix and Match: Reorganizing Tasks for Enhancing Data Locality

Published: 04 June 2021

Abstract

Application programs that exhibit strong locality of reference incur fewer cache misses and achieve better performance across architectures. However, to maximize the performance of multithreaded applications running on emerging manycore systems, data movement in the on-chip network must also be minimized. Unfortunately, the way many multithreaded programs are written does not lend itself well to minimal data movement. Motivated by this observation, in this paper we target task-based programs (which cover a large set of available multithreaded programs) and propose a novel compiler-based approach that consists of four complementary steps. First, we partition the original tasks in the target application into sub-tasks and build a data reuse graph at sub-task granularity. Second, based on the intensity of temporal and spatial data reuse among sub-tasks, we generate new tasks, where each such (new) task includes a set of sub-tasks that exhibit high data reuse among them. Third, we assign the newly-generated tasks to cores in an architecture-aware fashion, using knowledge of data location. Finally, we re-schedule the execution order of sub-tasks within the new tasks such that sub-tasks that belong to different tasks but share data are executed close together in time. Detailed experiments show that, when targeting a state-of-the-art manycore system, our proposed compiler-based approach improves the performance of 10 multithreaded programs by 23.4% on average, and it also outperforms two state-of-the-art data access optimizations on all the benchmarks tested. Our results also show that the proposed approach i) improves the performance of multiprogrammed workloads, and ii) generates results that are close to the maximum savings achievable with perfect profiling information.
Overall, our experimental results emphasize the importance of dividing an application's original tasks into sub-tasks and constructing new tasks from the resulting sub-tasks in a data movement- and locality-aware fashion.
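The paper's actual algorithms are not reproduced here, but the core of steps one and two — building a data reuse graph over sub-tasks and grouping strongly-reusing sub-tasks into new tasks — can be sketched as follows. This is an illustration only: the sub-task names, data-block footprints, and the simple greedy merge heuristic are all hypothetical (the paper's own task construction is more elaborate, and classical partitioners such as Kernighan-Lin variants could be substituted for the greedy step).

```python
from itertools import combinations

# Hypothetical sub-tasks, each annotated with the set of data blocks it touches.
subtasks = {
    "s0": {"A", "B"},
    "s1": {"B", "C"},
    "s2": {"C", "D"},
    "s3": {"E", "F"},
    "s4": {"F", "G"},
}

def build_reuse_graph(subtasks):
    """Edge weight = number of data blocks shared by a pair of sub-tasks."""
    graph = {}
    for u, v in combinations(subtasks, 2):
        shared = len(subtasks[u] & subtasks[v])
        if shared:
            graph[(u, v)] = shared
    return graph

def cluster_subtasks(subtasks, graph, max_size=3):
    """Greedily merge the most strongly connected sub-tasks into new tasks,
    keeping every new task within a per-task size limit."""
    cluster_of = {s: {s} for s in subtasks}
    # Visit edges from heaviest to lightest reuse.
    for (u, v), _w in sorted(graph.items(), key=lambda e: -e[1]):
        cu, cv = cluster_of[u], cluster_of[v]
        if cu is not cv and len(cu) + len(cv) <= max_size:
            cu |= cv                      # merge cv's members into cu
            for s in cv:
                cluster_of[s] = cu
    # Deduplicate: each distinct cluster becomes one new task.
    seen, tasks = [], []
    for c in cluster_of.values():
        if c not in seen:
            seen.append(c)
            tasks.append(sorted(c))
    return tasks

graph = build_reuse_graph(subtasks)
new_tasks = cluster_subtasks(subtasks, graph)
```

On this toy input the greedy pass groups `s0`, `s1`, and `s2` (chained reuse through blocks `B` and `C`) into one new task and `s3`, `s4` into another; steps three and four of the paper would then map these new tasks to cores with knowledge of where their data blocks reside and reorder sub-tasks within them.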

