Research article | Public Access

Computing with Near Data

Published: 21 December 2018

Abstract

One cost that plays a significant role in shaping the overall performance of both single-threaded and multi-threaded applications in modern computing systems is the cost of moving data between compute elements and storage elements. Traditional approaches to addressing this cost include code and data layout reorganizations and various hardware enhancements. More recently, an alternative paradigm, called Near Data Computing (NDC) or Near Data Processing (NDP), has been shown to be effective in reducing data movement costs by moving computation to data, instead of the traditional approach of moving data to computation. Unfortunately, existing Near Data Computing proposals require significant modifications to hardware and are yet to be widely adopted.

In this paper, we present a software-only (compiler-driven) approach to reducing data movement costs in both single-threaded and multi-threaded applications. Our approach, referred to as Computing with Near Data (CND), is built upon a concept called "recomputation," in which a costly data access is replaced by a few less costly data accesses plus some extra computation, provided that the cumulative cost of the latter is less than that of the former. If implemented carefully, CND can successfully trade off data access against computation, and, given the continuously increasing latency gap between the two, doing so can significantly reduce the execution latencies of both sequential and parallel application programs.

We i) quantify the intrinsic recomputability of a set of single-threaded and multi-threaded applications, ii) propose a practical, compiler-driven approach that automatically transforms a given application code fragment into a version that employs recomputation, iii) discuss an optimization strategy that increases recomputability, and iv) compare CND, both qualitatively and quantitatively, against NDC. Our experimental analysis of CND reveals that i) the average recomputability across our benchmarks is 51.1%, ii) our compiler-driven strategy is able to exploit 79.3% of the recomputation opportunities presented by our workloads, and iii) our enhancements significantly increase the value of the recomputability metric. As a result, our compiler-driven approach with the proposed enhancements brings an average execution time improvement of 40.1%.
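The recomputation trade-off described above can be illustrated with a toy cost model. This sketch is purely illustrative and not from the paper; the function name and all cost values are hypothetical cycle counts chosen for the example.

```python
# A toy cost model for the recomputation trade-off: replace one costly
# data access with a few cheaper accesses plus some extra computation,
# whenever the cumulative cost of the latter is lower.
# All cost values are hypothetical; they do not come from the paper.

def should_recompute(access_cost, operand_costs, compute_cost):
    """Recompute a value iff re-deriving it from its operands is
    cheaper than re-loading the value itself."""
    return sum(operand_costs) + compute_cost < access_cost

# Example: a value resident only in DRAM (assume ~200 cycles to load)
# whose two operands still sit in the L1 cache (~4 cycles each) and can
# be combined with a single ALU operation (~1 cycle): 4 + 4 + 1 = 9 < 200,
# so recomputation wins.
print(should_recompute(200, [4, 4], 1))  # True
print(should_recompute(5, [4, 4], 1))    # False: 9 >= 5
```

The same comparison, applied per access by a compiler with static cost estimates, is the kind of decision that determines how much of an application's intrinsic recomputability can actually be exploited.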



Published in

Proceedings of the ACM on Measurement and Analysis of Computing Systems (POMACS), Volume 2, Issue 3, December 2018, 248 pages
EISSN: 2476-1249
DOI: 10.1145/3301416
Copyright © 2018 ACM
Publisher: Association for Computing Machinery, New York, NY, United States
