Abstract
One cost that plays a significant role in shaping the overall performance of both single-threaded and multi-thread applications in modern computing systems is the cost of moving data between compute elements and storage elements. Traditional approaches to address this cost are code and data layout reorganizations and various hardware enhancements. More recently, an alternative paradigm, called Near Data Computing (NDC) or Near Data Processing (NDP), has been shown to be effective in reducing the data movements costs, by moving computation to data, instead of the traditional approach of moving data to computation. Unfortunately, the existing Near Data Computing proposals require significant modifications to hardware and are yet to be widely adopted.
In this paper, we present a software-only (compiler-driven) approach to reducing data movement costs in both single-threaded and multi-threaded applications. Our approach, referred to as Computing with Near Data (CND), is built upon a concept called "recomputation," in which a costly data access is replaced by a few less costly data accesses plus some extra computation, if the cumulative cost of the latter is less than that of the costly data access. If implemented carefully, CND can successfully trade off data access with computation, and considering the continuously increasing latency gap between the two, doing so can significantly reduce the execution latencies of both sequential and parallel application programs.
We i) quantify the intrinsic recomputability of a set of single-threaded and multi-threaded applications, ii) propose a practical, compiler-driven approach that automatically transforms a given application code fragment to a version that employs recomputation, iii) discuss an optimization strategy that increases recomputability; and iv) compare CND, both qualitatively and quantitatively, against NDC. Our experimental analysis of CND reveals that i) the average recomputability across our benchmarks is 51.1%, ii) our compiler-driven strategy is able to exploit 79.3% of the recomputation opportunities presented by our workloads, and iii) our enhancements increase the value of the recomputability metric significantly. As a result, our compiler-driven approach with the proposed enhancements brings an average execution time improvement of 40.1%.
- Ian F. Adams, Darrell D. E. Long, Ethan L. Miller, Shankar Pasupathy, and Mark W. Storer. 2009. Maximizing Efficiency by Trading Storage for Computation. In Proceedings of the 2009 Conference on Hot Topics in Cloud Computing. Google Scholar
Digital Library
- Junwhan Ahn, Sungpack Hong, Sungjoo Yoo, Onur Mutlu, and Kiyoung Choi. 2015a. A Scalable Processing-in-memory Accelerator for Parallel Graph Processing. In Proceedings of the 42Nd Annual International Symposium on Computer Architecture (ISCA). Google Scholar
Digital Library
- Junwhan Ahn, Sungjoo Yoo, Onur Mutlu, and Kiyoung Choi. 2015b. PIM-enabled Instructions: A Low-overhead, Locality-aware Processing-in-memory Architecture. In Proceedings of the 42Nd Annual International Symposium on Computer Architecture (ISCA). Google Scholar
Digital Library
- Berkin Akin, Franz Franchetti, and James C. Hoe. 2015. Data Reorganization in Memory Using 3D-stacked DRAM. In Proceedings of the 42Nd Annual International Symposium on Computer Architecture (ISCA). Google Scholar
Digital Library
- Ismail Akturk and Ulya R. Karpuzcu. 2017. AMNESIAC: Amnesic Automatic Computer. In Proceedings of the Twenty-Second International Conference on Architectural Support for Programming Languages and Operating Systems. Google Scholar
Digital Library
- Jennifer M. Anderson and Monica S. Lam. 1993. Global Optimizations for Parallelism and Locality on Scalable Parallel Machines. In Proceedings of the ACM SIGPLAN 1993 Conference on Programming Language Design and Implementation (PLDI). Google Scholar
Digital Library
- Rajeev Balasubramonian, Jichuan Chang, Troy Manning, Jaime H Moreno, Richard Murphy, Ravi Nair, and Steven Swanson. 2014. Near-data processing: Insights from a MICRO-46 Workshop. Micro, IEEE (2014).Google Scholar
- Nathan Beckmann, Po-An Tsai, and Daniel Sanchez. 2015. Scaling Distributed Cache Hierarchies through Computation and Data Co-Scheduling. In Proceedings of the 21st international symposium on High Performance Computer Architecture (HPCA).Google Scholar
Cross Ref
- Nathan Binkert, Bradford Beckmann, Gabriel Black, Steven K. Reinhardt, Ali Saidi, Arkaprava Basu, Joel Hestness, Derek R. Hower, Tushar Krishna, Somayeh Sardashti, Rathijit Sen, Korey Sewell, Muhammad Shoaib, Nilay Vaish, Mark D. Hill, and David A. Wood. 2011. The Gem5 Simulator. SIGARCH Comput. Archit. News (2011). Google Scholar
Digital Library
- Uday Bondhugula, J. Ramanujam, and et al. 2008. PLuTo: A practical and fully automatic polyhedral program optimization system. In Proceedings of Programming Language Design And Implementation (PLDI).Google Scholar
- Amirali Boroumand, Saugata Ghose, Brandon Lucia, Kevin Hsieh, Krishna Malladi, Hongzhong Zheng, and Onur Mutlu. 2016. LazyPIM: An Efficient Cache Coherence Mechanism for Processing-in-Memory. IEEE Computer Architecture Letters (2016).Google Scholar
- Steve Carr, Kathryn S. McKinley, and Chau-Wen Tseng. 1994. Compiler Optimizations for Improving Data Locality. In Proceedings of International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS). Google Scholar
Digital Library
- J. Carter, W. Hsieh, L. Stoller, M. Swanson, Lixin Zhang, E. Brunvand, A. Davis, Chen-Chi Kuo, R. Kuramkote, M. Parker, L. Schaelicke, and T. Tateyama. 1999. Impulse: Building a Smarter Memory Controller. In Proceedings of the 5th International Symposium on High Performance Computer Architecture (ISCA). Google Scholar
Digital Library
- Ping Chi, Shuangchen Li, Cong Xu, Tao Zhang, Jishen Zhao, Yongpan Liu, Yu Wang, and Yuan Xie. 2016. PRIME: A Novel Processing-in-memory Architecture for Neural Network Computation in ReRAM-based Main Memory. In Proceedings of the 43rd International Symposium on Computer Architecture. Google Scholar
Digital Library
- Michał Cierniak and Wei Li. 1995. Unifying Data and Control Transformations for Distributed Shared-memory Machines. In Proceedings of the ACM SIGPLAN 1995 Conference on Programming Language Design and Implementation (PLDI). Google Scholar
Digital Library
- Reetuparna Das, Rachata Ausavarungnirun, Onur Mutlu, Akhilesh Kumar, and Mani Azimi. 2012. Application-to-core Mapping Policies to Reduce Memory Interference in Multi-core Systems. In Proceedings of the 21st International Conference on Parallel Architectures and Compilation Techniques. Google Scholar
Digital Library
- Wei Ding, Xulong Tang, Mahmut Kandemir, Yuanrui Zhang, and Emre Kultursay. 2015. Optimizing Off-chip Accesses in Multicores. In Proceedings of the 36th ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI). Google Scholar
Digital Library
- Somnath Ghosh, Margaret Martonosi, and Sharad Malik. 1998. Precise Miss Analysis for Program Transformations with Caches of Arbitrary Associativity. In Proceedings of the Eighth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS VIII). ACM, New York, NY, USA, 228--239. Google Scholar
Digital Library
- Maya Gokhale, Bill Holmes, and Ken Iobst. 1995. Processing in Memory: the Terasys Massively Parallel PIM Array. IEEE Computer (1995). Google Scholar
Digital Library
- Boris Grot, Stephen W. Keckler, and Onur Mutlu. 2009. Preemptive Virtual Clock: A Flexible, Efficient, and Cost-effective QOS Scheme for Networks-on-chip. In Proceedings of the 42Nd Annual IEEE/ACM International Symposium on Microarchitecture. Google Scholar
Digital Library
- Mary H. Hall, Saman P. Amarasinghe, Brian R. Murphy, Shih-Wei Liao, and Monica S. Lam. 1995. Detecting Coarse-grain Parallelism Using an Interprocedural Parallelizing Compiler. In Supercomputing. Google Scholar
Digital Library
- Milad Hashemi, Khubaib, Eiman Ebrahimi, Onur Mutlu,, and Yale N. Patt. 2016. Accelerating Dependent Cache Misses with an Enhanced Memory Controller. In Proceedings of the 43rd International Symposium on Computer Architecture. Google Scholar
Digital Library
- Hasan Hassan, Gennady Pekhimenko, Nandita Vijaykumar, Vivek Seshadri, Donghyuk Lee, Oguz Ergin, and Onur Mutlu. 2016. ChargeCache: Reducing DRAM latency by exploiting row access locality. In 2016 IEEE International Symposium on High Performance Computer Architecture (HPCA).Google Scholar
Cross Ref
- Kevin Hsieh, Eiman Ebrahimi, Gwangsun Kim, Niladrish Chatterjee, Mike O'Connor, Nandita Vijaykumar, Onur Mutlu, and Stephen W Keckler. 2016. Transparent Offloading and Mapping (TOM): Enabling Programmer-Transparent Near-Data Processing in GPU Systems. In ISCA. Google Scholar
Digital Library
- Joe Jeddeloh and Brent Keeth. 2012. Hybrid memory cube new DRAM architecture increases density and performance. In 2012 Symposium on VLSI Technology (VLSIT).Google Scholar
Cross Ref
- Mahmut Kandemir, Alok Choudhary, J Ramanujam, and Prith Banerjee. 1999. A matrix-based approach to global locality optimization. J. Parallel and Distrib. Comput. (1999). Google Scholar
Digital Library
- M. Kandemir, J. Ramanujam, A. Choudhary, and P. Banerjee. 2001. A layout-conscious iteration space transformation technique. IEEE Trans. Comput. (2001). Google Scholar
Digital Library
- Mahmut Kandemir, Hui Zhao, Xulong Tang, and Mustafa Karakoy. 2015. Memory Row Reuse Distance and Its Role in Optimizing Application Performance. In Proceedings of the 2015 ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Systems (SIGMETRICS). Google Scholar
Digital Library
- Orhan Kislal, Jagadish Kotra, Xulong Tang, Mahmut Taylan Kandemir, and Myoungsoo Jung. 2018. Enhancing Computation-to-core Assignment with Physical Location Information. In Proceedings of the 39th ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI). Google Scholar
Digital Library
- Orhan Kislal, Jagadish Kotra, Xulong Tang, Mahmut Taylan Kandemir, and Myoungsoo Jung. 2017. POSTER: Location-Aware Computation Mapping for Manycore Processors.. In Proceedings of the 2017 International Conference on Parallel Architectures and Compilation.Google Scholar
Cross Ref
- Induprakas Kodukula, Nawaaz Ahmed, and Keshav Pingali. 1997. Data-centric Multi-level Blocking. In PLDI. Google Scholar
Digital Library
- Monica D. Lam, Edward E. Rothberg, and Michael E. Wolf. 1991. The Cache Performance and Optimizations of Blocked Algorithms. In Proceedings of International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS). Google Scholar
Digital Library
- Chang Joo Lee, Veynu Narasiman, Onur Mutlu, and Yale N. Patt. 2009. Improving Memory Bank-level Parallelism in the Presence of Prefetching. In Proceedings of the 42Nd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO 42). Google Scholar
Digital Library
- Shun-Tak Leung and John Zahorjan. 1995. Optimizing data locality by array restructuring.Department of Computer Science and Engineering, University of Washington.Google Scholar
- Shuangchen Li, Dimin Niu, Krishna T.Malladi, Hongzhong Zheng, Bob Brennan, and Yuan Xie. 2017. DRISA: A DRAM-based Reconfigurable In-Situ Accelerator. In Proceedings of the 50th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).Google Scholar
Digital Library
- Amy W. Lim, Gerald I. Cheong, and Monica S. Lam. 1999. An Affine Partitioning Algorithm to Maximize Parallelism and Minimize Communication. In ICS. Google Scholar
Digital Library
- G.J. Lipovski and C. Yu. 1999. The Dynamic Associative Access Memory Chip and Its Application to SIMD Processing and Full-Text Database Retrieval. In MTDT. Google Scholar
Digital Library
- Lei Liu, Zehan Cui, Mingjie Xing, Yungang Bao, Mingyu Chen, and Chengyong Wu. 2012. A Software Memory Partition Approach for Eliminating Bank-level Interference in Multicore Systems. In Proceedings of the 21st International Conference on Parallel Architectures and Compilation Techniques (PACT). Google Scholar
Digital Library
- Gabriel H. Loh. 2008. 3D-Stacked Memory Architectures for Multi-core Processors. In Proceedings of the 35th Annual International Symposium on Computer Architecture. Google Scholar
Digital Library
- Qingda Lu, C. Alias, U. Bondhugula, T. Henretty, S. Krishnamoorthy, J. Ramanujam, A. Rountev, P. Sadayappan, Yongjian Chen, Haibo Lin, and Tin-Fook Ngai. 2009. Data Layout Transformation for Enhancing Data Locality on NUCA Chip Multiprocessors. In Proceedings of the Parallel Architectures and Compilation Techniques (PACT). Google Scholar
Digital Library
- Asit K. Mishra, Onur Mutlu, and Chita R. Das. 2013. A Heterogeneous Multiple Network-on-chip Design: An Application-aware Approach. In Proceedings of the 50th Annual Design Automation Conference (DAC). Google Scholar
Digital Library
- Asit K. Mishra, N. Vijaykrishnan, and Chita R. Das. 2011. A Case for Heterogeneous On-chip Interconnects for CMPs. In Proceedings of the 38th Annual International Symposium on Computer Architecture (ISCA). Google Scholar
Digital Library
- R Nair, SF Antao, C Bertolli, P Bose, JR Brunheroto, T Chen, C-Y Cher, CHA Costa, C Evangelinos, and BM Fleischer. 2015. Active Memory Cube: A processing-in-memory architecture for exascale systems. IBM Journal of Research and Development (2015). Google Scholar
Digital Library
- M.F.P. O'Boyle and P.M.W. Knijnenburg. 2002. Integrating Loop and Data Transformations for Global Optimization. J. Parallel Distribute Computer (2002). Google Scholar
Digital Library
- Ashutosh Pattnaik, Xulong Tang, Adwait Jog, Onur Kayiran, Asit K. Mishra, Mahmut T. Kandemir, Onur Mutlu, and Chita R. Das. 2016. Scheduling Techniques for GPU Architectures with Processing-In-Memory Capabilities. In Proceedings of the 2016 International Conference on Parallel Architectures and Compilation (PACT). Google Scholar
Digital Library
- Erik Riedel, Christos Faloutsos, and David Nagle. 2000. Active disk architecture for databases. Technical Report. DTIC Document.Google Scholar
- Gabriel Rivera and Chau-Wen Tseng. 1998. Data Transformations for Eliminating Conflict Misses. In Proceedings of the ACM SIGPLAN 1998 Conference on Programming Language Design and Implementation. Google Scholar
Digital Library
- Jihyun Ryoo, Orhan Kislal, Xulong Tang, and Mahmut Taylan Kandemir. 2018. Quantifying and Optimizing Data Access Parallelism on Manycores. In Proceedings of the 26th IEEE International Symposium on the Modeling, Analysis, and Simulation of Computer and Telecommunication Systems (MASCOTS).Google Scholar
Cross Ref
- Akbar Shrifi, Wei Ding, Diana Guttman, Hui Zhao, Xulong Tang, Mahmut Kandemir, and Chita Das. 2017. DEMM: a Dynamic Energy-saving mechanism for Multicore Memories. In Proceedings of the 25th IEEE International Symposium on the Modeling, Analysis, and Simulation of Computer and Telecommunication Systems (MASCOTS).Google Scholar
Cross Ref
- A. Sodani, R. Gramunt, J. Corbal, H. S. Kim, K. Vinod, S. Chinthamani, S. Hutsell, R. Agarwal, and Y. C. Liu. 2016. Knights Landing: Second-Generation Intel Xeon Phi Product. IEEE Micro (2016). Google Scholar
Digital Library
- Yonghong Song and Zhiyuan Li. 1999. New Tiling Techniques to Improve Cache Temporal Locality. In PLDI. Google Scholar
Digital Library
- Harold S. Stone. 1970. IEEE Transaction Computing (1970).Google Scholar
- Xulong Tang, Mahmut Kandemir, Praveen Yedlapalli, and Jagadish Kotra. 2016. Improving Bank-Level Parallelism for Irregular Applications. In Proceedings of the 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO). Google Scholar
Digital Library
- Xulong Tang, Orhan Kislal, Mahmut Kandemir, and Mustafa Karakoy. 2017a. Data Movement Aware Computation Partitioning. In Proceedings of the 50th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO). Google Scholar
Digital Library
- Xulong Tang, Ashutosh Pattnaik, Huaipan Jiang, Onur Kayiran, Adwait Jog, Sreepathi Pai, Mohamed Ibrahim, Mahmut Kandemir, and Chita Das. 2017b. Controlled Kernel Launch for Dynamic Parallelism in GPUs. In Proceedings of the 23rd International Symposium on High-Performance Computer Architecture (HPCA).Google Scholar
Cross Ref
- Xulong Tang, Ashutosh Pattnaik, Onur Kayiran, Adwait Jog, Mahmut Taylan Kandemir, and Chita R. Das. 2019. Quantifying Data Locality in Dynamic Parallelism in GPUs. In Proceedings of the 2019 ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Systems (SIGMETRICS). Google Scholar
Digital Library
- S. Verdoolaege, M. Bruynooghe, G. Janssens, and P. Catthoor. 2003. Multi-dimensional incremental loop fusion for data locality. In ASAP.Google Scholar
- Ben Verghese, Scott Devine, Anoop Gupta, and Mendel Rosenblum. 1996. Operating System Support for Improving Data Locality on CC-NUMA Compute Servers. In ASPLOS. Google Scholar
Digital Library
- Michael E. Wolf and Monica S. Lam. 1991 a. A Data Locality Optimizing Algorithm. In PLDI. Google Scholar
Digital Library
- M. E. Wolf and M. S. Lam. 1991 b. A loop transformation theory and an algorithm to maximize parallelism. IEEE Transactions on Parallel and Distributed Systems (1991). Google Scholar
Digital Library
- Steven Cameron Woo, Moriyoshi Ohara, Evan Torrie, Jaswinder Pal Singh, and Anoop Gupta. 1995. The SPLASH-2 Programs: Characterization and Methodological Considerations. In Proceedings of International Symposium on Computer Architecture (ISCA). Google Scholar
Digital Library
- Lifan Xu, Dong Ping Zhang, and Nuwan Jayasena. 2015. Scaling Deep Learning on Multiple In-Memory Processors. In Proceedings of the 3rd Workshop on Near-Data Processing.Google Scholar
- Dongping Zhang, Nuwan Jayasena, Alexander Lyashevsky, Joseph L. Greathouse, Lifan Xu, and Michael Ignatowski. 2014. TOP-PIM: Throughput-oriented Programmable Processing in Memory. In HPDC. Google Scholar
Digital Library
- Haibo Zhang, Prasanna Venkatesh Rengasamy, Nachiappan Chidambaram Nachiappan, Shulin Zhao, Anand Sivasubramaniam, Mahmut Kandemir, and Chita R. Das. 2018. FLOSS: FLOw Sensitive Scheduling on Mobile Platforms. In In Proceedings of The Design Automation Conference (DAC). Google Scholar
Digital Library
- Haibo Zhang, Prasanna Venkatesh Rengasamy, Shulin Zhao, Nachiappan Chidambaram Nachiappan, Anand Sivasubramaniam, Mahmut T. Kandemir, Ravi Iyer, and Chita R. Das. 2017. Race-to-sleepGoogle Scholar
- Content CachingGoogle Scholar
- Display Caching: A Recipe for Energy-efficient Video Streaming on Handhelds. In Proceedings of the 50th Annual IEEE/ACM International Symposium on Microarchitecture.Google Scholar
- Zhao Zhang, Zhichun Zhu, and Xiaodong Zhang. 2002. Breaking Address Mapping Symmetry at Multi-levels of Memory Hierarchy to Reduce DRAM Row-buffer Conflicts. In The Journal of Instruction-Level Parallelism (JILP).Google Scholar
Index Terms
Computing with Near Data
Recommendations
Computing with Near Data
The cost of moving data between compute elements and storage elements plays a signiicant role in shaping the overall performance of applications.We present a compiler-driven approach to reducing data movement costs. Our approach, referred to as ...
Computing with Near Data
SIGMETRICS '19: Abstracts of the 2019 SIGMETRICS/Performance Joint International Conference on Measurement and Modeling of Computer SystemsThe cost of moving data between compute elements and storage elements plays a significant role in shaping the overall performance of applications. We present a compiler-driven approach to reducing data movement costs. Our approach, referred to as ...
Co-optimizing memory-level parallelism and cache-level parallelism
PLDI 2019: Proceedings of the 40th ACM SIGPLAN Conference on Programming Language Design and ImplementationMinimizing cache misses has been the traditional goal in optimizing cache performance using compiler based techniques. However, continuously increasing dataset sizes combined with large numbers of cache banks and memory banks connected using on-chip ...






Comments