Abstract
Application programs that exhibit strong locality of reference lead to minimized cache misses and better performance in different architectures. However, to maximize the performance of multithreaded applications running on emerging manycore systems, data movement in on-chip network should also be minimized. Unfortunately, the way many multithreaded programs are written does not lend itself well to minimal data movement. Motivated by this observation, in this paper, we target task-based programs (which cover a large set of available multithreaded programs), and propose a novel compiler-based approach that consists of four complementary steps. First, we partition the original tasks in the target application into sub-tasks and build a data reuse graph at a sub-task granularity. Second, based on the intensity of temporal and spatial data reuses among sub-tasks, we generate new tasks where each such (new) task includes a set of sub-tasks that exhibit high data reuse among them. Third, we assign the newly-generated tasks to cores in an architecture-aware fashion with the knowledge of data location. Finally, we re-schedule the execution order of sub-tasks within new tasks such that sub-tasks that belong to different tasks but share data among them are executed in close proximity in time. The detailed experiments show that, when targeting a state of the art manycore system, our proposed compiler-based approach improves the performance of 10 multithreaded programs by 23.4% on average, and it also outperforms two state-of-the-art data access optimizations for all the benchmarks tested. Our results also show that the proposed approach i) improves the performance of multiprogrammed workloads, and ii) generates results that are close to maximum savings that could be achieved with perfect profiling information. Overall, our experimental results emphasize the importance of dividing an original set of tasks of an application into sub-tasks and constructing new tasks from the resulting sub-tasks in a data movement- and locality-aware fashion.
- 2012. minighost. https://mantevo.org/default.php.Google Scholar
- Ian F. Adams, Darrell D. E. Long, Ethan L. Miller, Shankar Pasupathy, and Mark W. Storer. 2009. Maximizing Efficiency by Trading Storage for Computation. In Proceedings of the 2009 Conference on Hot Topics in Cloud Computing.Google Scholar
Digital Library
- Junwhan Ahn, Sungjoo Yoo, Onur Mutlu, and Kiyoung Choi. 2015. PIM-enabled Instructions: A Low-overhead, Locality-aware Processing-in-memory Architecture. In Proceedings of the 42nd Annual International Symposium on Computer Architecture (ISCA).Google Scholar
Digital Library
- Berkin Akin, Franz Franchetti, and James C. Hoe. 2015. Data Reorganization in Memory Using 3D-stacked DRAM. In Proceedings of the 42nd Annual International Symposium on Computer Architecture (ISCA).Google Scholar
- Ismail Akturk and Ulya R. Karpuzcu. 2017. AMNESIAC: Amnesic Automatic Computer. In Proceedings of the Twenty-Second International Conference on Architectural Support for Programming Languages and Operating Systems.Google Scholar
- Jennifer M. Anderson and Monica S. Lam. 1993. Global Optimizations for Parallelism and Locality on Scalable Parallel Machines. In Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI).Google Scholar
- Alexandr Andoni, Huy L. Nguyen, Aleksandar Nikolov, Ilya Razenshteyn, and Erik Waingarten. 2017. Approximate Near Neighbors for General Symmetric Norms. In Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing (STOC 2017).Google Scholar
Digital Library
- Vishal Aslot, Max Domeika, Rudolf Eigenmann, Greg Gaertner, Wesley B. Jones, and Bodo Parady. 2001. SPEComp: A New Benchmark Suite for Measuring Parallel Computer Performance. In OpenMP Shared Memory Parallel Programming, Rudolf Eigenmann and Michael J. Voss (Eds.). Springer Berlin Heidelberg, Berlin, Heidelberg, 1--10.Google Scholar
- Rajeev Balasubramonian, Jichuan Chang, Troy Manning, Jaime H Moreno, Richard Murphy, Ravi Nair, and Steven Swanson. 2014. Near-data processing: Insights from a MICRO-46 Workshop. Micro, IEEE (2014).Google Scholar
- Antonio Barbalace, Anthony Iliopoulos, Holm Rauchfuss, and Goetz Brasche. 2017. It's time to think about an operating system for near data processing architectures. In Proceedings of the 16th Workshop on Hot Topics in Operating Systems. 56--61.Google Scholar
Digital Library
- Steve Carr, Kathryn S. McKinley, and Chau-Wen Tseng. 1994. Compiler Optimizations for Improving Data Locality. In Proceedings of International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS).Google Scholar
Digital Library
- J. Carter, W. Hsieh, L. Stoller, M. Swanson, Lixin Zhang, E. Brunvand, A. Davis, Chen-Chi Kuo, R. Kuramkote, M. Parker, L. Schaelicke, and T. Tateyama. 1999. Impulse: Building a Smarter Memory Controller. In Proceedings of the 5th International Symposium on High Performance Computer Architecture (ISCA).Google Scholar
Digital Library
- Dong Chen, Fangzhou Liu, Chen Ding, and Sreepathi Pai. 2018. Locality Analysis Through Static Parallel Sampling. In Proceedings of the 39th ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI).Google Scholar
Digital Library
- Dongjoo Choi, Myunghoon Jeon, Namgi Kim, and Byoung-Dai Lee. 2017. An enhanced data-locality-aware task scheduling algorithm for hadoop applications. IEEE Systems Journal, Vol. 12, 4 (2017), 3346--3357.Google Scholar
Cross Ref
- Michael L. Chu, Nuwan Jayasena, Dong Ping Zhang, and Mike Ignatowski. 2013. High-level Programming Model Abstractions for Processing in Memory. Proc. of 1st Workshop on Near-Data Processing in conjunction with MICRO (2013).Google Scholar
- R. Das, R. Ausavarungnirun, O. Mutlu, A. Kumar, and M. Azimi. 2013. Application-to-core mapping policies to reduce memory system interference in multi-core systems. In Proceedings of International Symposium on High Performance Computer Architecture (HPCA).Google Scholar
Digital Library
- Mohammad Dashti, Alexandra Fedorova, Justin Funston, Fabien Gaud, Renaud Lachaize, Baptiste Lepers, Vivien Quema, and Mark Roth. 2013. Traffic Management: A Holistic Approach to Memory Placement on NUMA Systems. In ASPLOS.Google Scholar
Digital Library
- Shantanu Dutt. 1993. New Faster Kernighan-Lin-type Graph-partitioning Algorithms. In Proceedings of the IEEE/ACM International Conference on Computer-aided Design (ICCAD).Google Scholar
Cross Ref
- Boncheol Gu, Andre S. Yoon, Duck-Ho Bae, Insoon Jo, Jinyoung Lee, Jonghyun Yoon, Jeong-Uk Kang, Moonsang Kwon, Chanho Yoon, Sangyeun Cho, Jaeheon Jeong, and Duckhyun Chang. 2016. Biscuit: A Framework for Near-data Processing of Big Data Workloads. In Proceedings of the 43rd International Symposium on Computer Architecture (ISCA).Google Scholar
Digital Library
- Mary H. Hall, Saman P. Amarasinghe, Brian R. Murphy, Shih-Wei Liao, and Monica S. Lam. 1995. Detecting Coarse-grain Parallelism Using an Interprocedural Parallelizing Compiler. In Supercomputing.Google Scholar
Digital Library
- Kevin Hsieh, Eiman Ebrahimi, Gwangsun Kim, Niladrish Chatterjee, Mike O'Connor, Nandita Vijaykumar, Onur Mutlu, and Stephen W Keckler. 2016. Transparent Offloading and Mapping (TOM): Enabling Programmer-Transparent Near-Data Processing in GPU Systems. In ISCA.Google Scholar
- Mahmut Kandemir, Alok Choudhary, J Ramanujam, and Prith Banerjee. 1999. A matrix-based approach to global locality optimization. J. Parallel and Distrib. Comput. (1999).Google Scholar
- M. Kandemir, J. Ramanujam, A. Choudhary, and P. Banerjee. 2001. A layout-conscious iteration space transformation technique. IEEE Trans. Comput. (2001).Google Scholar
- Gwangsun Kim, John Kim, Jung Ho Ahn, and Yongkee Kwon. 2014. Memory Network: Enabling Technology for Scalable Near-Data Computing. In Proceedings of the 2nd Workshop on Near-Data Processing.Google Scholar
- Nam Sung Kim and Pankaj Mehra. 2019. Practical Near-Data Processing to Evolve Memory and Storage Devices into Mainstream Heterogeneous Computing Systems. In Proceedings of the 56th Annual Design Automation Conference 2019 (DAC '19).Google Scholar
Digital Library
- Orhan Kislal, Jagadish Kotra, Xulong Tang, Mahmut Taylan Kandemir, and Myoungsoo Jung. 2018. Enhancing Computation-to-core Assignment with Physical Location Information. In Proceedings of the 39th ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI).Google Scholar
Digital Library
- Induprakas Kodukula, Nawaaz Ahmed, and Keshav Pingali. 1997. Data-centric Multi-level Blocking. In PLDI.Google Scholar
- Jagadish B Kotra, Diana Guttman, Mahmut T Kandemir, Chita R Das, et al. 2017a. Quantifying the potential benefits of on-chip near-data computing in manycore processors. In 2017 IEEE 25th International Symposium on Modeling, Analysis, and Simulation of Computer and Telecommunication Systems (MASCOTS). IEEE, 198--209.Google Scholar
Cross Ref
- J. B. Kotra, D. Guttman, N. C. N., M. T. Kandemir, and C. R. Das. 2017b. Quantifying the Potential Benefits of On-chip Near-Data Computing in Manycore Processors. In 25th International Symposium on Modeling, Analysis, and Simulation of Computer and Telecommunication Systems (MASCOTS).Google Scholar
- Monica D. Lam, Edward E. Rothberg, and Michael E. Wolf. 1991. The Cache Performance and Optimizations of Blocked Algorithms. In Proceedings of International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS).Google Scholar
- Chris Lattner and Vikram Adve. 2004. LLVM: A Compilation Framework for Lifelong Program Analysis & Transformation. In Proceedings of the International Symposium on Code Generation and Optimization: Feedback-directed and Runtime Optimization (CGO).Google Scholar
Cross Ref
- Shun-Tak Leung and John Zahorjan. 1995. Optimizing data locality by array restructuring .Department of Computer Science and Engineering, University of Washington.Google Scholar
- Ang Li, Shuaiwen Leon Song, Weifeng Liu, Xu Liu, Akash Kumar, and Henk Corporaal. 2017. Locality-Aware CTA Clustering for Modern GPUs. In Proceedings of the International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS).Google Scholar
Digital Library
- Amy W. Lim, Gerald I. Cheong, and Monica S. Lam. 1999. An Affine Partitioning Algorithm to Maximize Parallelism and Minimize Communication. In ICS.Google Scholar
Digital Library
- Qingda Lu, C. Alias, U. Bondhugula, T. Henretty, S. Krishnamoorthy, J. Ramanujam, A. Rountev, P. Sadayappan, Yongjian Chen, Haibo Lin, and Tin-Fook Ngai. 2009. Data Layout Transformation for Enhancing Data Locality on NUCA Chip Multiprocessors. In Proceedings of the Parallel Architectures and Compilation Techniques (PACT).Google Scholar
Digital Library
- Kyoji Mizoguchi, Shohei Kotaki, Yoshiaki Deguchi, and Ken Takeuchi. 2017. Lateral charge migration suppression of 3D-NAND flash by vth nearing for near data computing. In 2017 IEEE International Electron Devices Meeting (IEDM). IEEE, 19--2.Google Scholar
Cross Ref
- Ravi Teja Mullapudi, Vinay Vasista, and Uday Bondhugula. 2015. PolyMage: Automatic Optimization for Image Processing Pipelines. In Proceedings of the 20th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS).Google Scholar
Digital Library
- R Nair, SF Antao, C Bertolli, P Bose, JR Brunheroto, T Chen, C-Y Cher, CHA Costa, C Evangelinos, and BM Fleischer. 2015. Active Memory Cube: A processing-in-memory architecture for exascale systems. IBM Journal of Research and Development (2015).Google Scholar
- Amir Hossein Nodehi Sabet, Junqiao Qiu, and Zhijia Zhao. 2018. Tigr: Transforming Irregular Graphs for GPU-Friendly Graph Processing. In Proceedings of the 23rd International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS).Google Scholar
Digital Library
- OpenMP Architecture Review Board. 2011. OpenMP Application Program Interface. Specification. http://www.openmp.org/mp-documents/OpenMP3.1.pdfGoogle Scholar
- Ashutosh Pattnaik, Xulong Tang, Adwait Jog, Onur Kayiran, Asit K. Mishra, Mahmut T. Kandemir, Onur Mutlu, and Chita R. Das. 2016. Scheduling Techniques for GPU Architectures with Processing-In-Memory Capabilities. In Proceedings of the 2016 International Conference on Parallel Architectures and Compilation (PACT).Google Scholar
- Ashutosh Pattnaik, Xulong Tang, Onur Kayiran, Adwait Jog, Asit Mishra, Mahmut T. Kandemir, Anand Sivasubramaniam, and Chita R. Das. 2019. Opportunistic Computing in GPU Architectures. In Proceedings of the 46th International Symposium on Computer Architecture (ISCA).Google Scholar
- Arch D Robison. 2012. Cilk plus: Language support for thread and vector parallelism. Talk at HP-CAST, Vol. 18 (2012), 25.Google Scholar
- Gagandeep Singh, Lorenzo Chelini, Stefano Corda, Ahsan Javed Awan, Sander Stuijk, Roel Jordans, Henk Corporaal, and Albert-Jan Boonstra. 2018. A review of near-memory computing architectures: Opportunities and challenges. In 2018 21st Euromicro Conference on Digital System Design (DSD). IEEE, 608--617.Google Scholar
Cross Ref
- A. Sodani, R. Gramunt, J. Corbal, H. S. Kim, K. Vinod, S. Chinthamani, S. Hutsell, R. Agarwal, and Y. C. Liu. 2016. Knights Landing: Second-Generation Intel Xeon Phi Product. IEEE Micro (2016).Google Scholar
Digital Library
- Yonghong Song and Zhiyuan Li. 1999. New Tiling Techniques to Improve Cache Temporal Locality. In PLDI.Google Scholar
- Georgios L Stavrinides and Helen D Karatza. 2020. Orchestration of real-time workflows with varying input data locality in a heterogeneous fog environment. In 2020 Fifth International Conference on Fog and Mobile Edge Computing (FMEC). IEEE, 202--209.Google Scholar
Cross Ref
- Kirshanthan Sundararajah, Laith Sakka, and Milind Kulkarni. 2017. Locality Transformations for Nested Recursive Iteration Spaces. In Proceedings of the 22nd International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS).Google Scholar
Digital Library
- Xulong Tang, Mahmut Taylan Kandemir, Hui Zhao, Myoungsoo Jung, and Mustafa Karakoy. 2019 a. Computing with Near Data. In Proceedings of the 2019 ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Systems (SIGMETRICS).Google Scholar
Digital Library
- Xulong Tang, Orhan Kislal, Mahmut Kandemir, and Mustafa Karakoy. 2017. Data Movement Aware Computation Partitioning. In Proceedings of the 50th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).Google Scholar
Digital Library
- Xulong Tang, Ashutosh Pattnaik, Onur Kayiran, Adwait Jog, Mahmut Taylan Kandemir, and Chita Das. 2019 b. Quantifying Data Locality in Dynamic Parallelism in GPUs. ACM SIGMETRICS Performance Evaluation Review, Vol. 47, 1 (2019), 25--26.Google Scholar
Digital Library
- Didem Unat, Anshu Dubey, Torsten Hoefler, John Shalf, Mark Abraham, Mauro Bianco, Bradford L Chamberlain, Romain Cledat, H Carter Edwards, Hal Finkel, et al. 2017. Trends in data locality abstractions for HPC systems. IEEE Transactions on Parallel and Distributed Systems, Vol. 28, 10 (2017), 3007--3020.Google Scholar
Digital Library
- S. Verdoolaege, M. Bruynooghe, G. Janssens, and P. Catthoor. 2003. Multi-dimensional incremental loop fusion for data locality. In Proceedings of ASAP.Google Scholar
- Ben Verghese, Scott Devine, Anoop Gupta, and Mendel Rosenblum. 1996. Operating System Support for Improving Data Locality on CC-NUMA Compute Servers. In ASPLOS.Google Scholar
- Nandita Vijaykumar, Eiman Ebrahimi, Kevin Hsieh, Phillip B Gibbons, and Onur Mutlu. 2018. The locality descriptor: A holistic cross-layer abstraction to express data locality in GPUs. In 2018 ACM/IEEE 45th Annual International Symposium on Computer Architecture (ISCA). IEEE, 829--842.Google Scholar
Digital Library
- Thomas Willhalm and Nicolae Popovici. 2008. Putting intel® threading building blocks to work. In Proceedings of the 1st international workshop on Multicore software engineering. 3--4.Google Scholar
Digital Library
- Michael E. Wolf and Monica S. Lam. 1991. A Data Locality Optimizing Algorithm. In Proceedings of PLDI.Google Scholar
Digital Library
- Steven Cameron Woo, Moriyoshi Ohara, Evan Torrie, Jaswinder Pal Singh, and Anoop Gupta. 1995. The SPLASH-2 Programs: Characterization and Methodological Considerations. In Proceedings of International Symposium on Computer Architecture (ISCA).Google Scholar
Digital Library
- Navid Yaghmazadeh, Christian Klinger, Isil Dillig, and Swarat Chaudhuri. 2016. Synthesizing Transformations on Hierarchically Structured Data. In Proceedings of the 37th ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI).Google Scholar
Digital Library
- Dongping Zhang, Nuwan Jayasena, Alexander Lyashevsky, Joseph L. Greathouse, Lifan Xu, and Michael Ignatowski. 2014. TOP-PIM: Throughput-oriented Programmable Processing in Memory. In Proceedings of HPDC.Google Scholar
Digital Library
- Dan Zhang, Xiaoyu Ma, Michael Thomson, and Derek Chiou. 2018. Minnow: Lightweight Offload Engines for Worklist Management and Worklist-Directed Prefetching. In Proceedings of the 23rd International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS).Google Scholar
Digital Library
Index Terms
Mix and Match: Reorganizing Tasks for Enhancing Data Locality
Recommendations
Mix and Match: Reorganizing Tasks for Enhancing Data Locality
SIGMETRICS '21Application programs that exhibit strong locality of reference lead to minimized cache misses and better performance in different architectures. In this paper, we target task-based programs, and propose a novel compiler-based approach that consists of ...
Mix and Match: Reorganizing Tasks for Enhancing Data Locality
SIGMETRICS '21: Abstract Proceedings of the 2021 ACM SIGMETRICS / International Conference on Measurement and Modeling of Computer SystemsApplication programs that exhibit strong locality of reference lead to minimized cache misses and better performance in different architectures. In this paper, we target task-based programs, and propose a novel compiler-based approach that consists of ...
Reshaping cache misses to improve row-buffer locality in multicore systems
PACT '13: Proceedings of the 22nd international conference on Parallel architectures and compilation techniquesOptimizing cache locality has always been important since the emergence of caches, and numerous cache locality optimization schemes have been published in compiler literature. However, in modern architectures, cache locality is not the only factor that ...






Comments