Research Article | Public Access

Memory Space Recycling

Published: 28 February 2022

Abstract

Many programs from different application domains process very large amounts of data, making their cache memory behavior critical to high performance. Most existing work targeting cache memory hierarchies focuses on improving data access patterns, e.g., maximizing sequential accesses to program data structures via code and/or data layout restructuring. Prior work has addressed this data locality optimization problem in the context of both single-core and multi-core systems. Another dimension of optimization, which can be as important and beneficial as improving the data access pattern, is reducing the data volume (the total number of addresses) accessed by the program. Compared to data access pattern restructuring, this volume minimization problem has received much less attention. In this work, we focus on this volume minimization problem and address it in both single-core and multi-core execution scenarios. Specifically, we explore the idea of rewriting an application program to reduce its "memory space footprint". The main idea behind this approach is to reuse/recycle, for a given data element, a memory location originally assigned to another data element, provided that the lifetimes of the two data elements do not overlap. A unique aspect of our approach is that it is "distance aware": in identifying the memory/cache locations to recycle, it takes into account the physical distance between the core and the memory/cache location to be recycled. We present a detailed experimental evaluation of our proposed memory space recycling strategy using five metrics: memory space consumption, network footprint, data access distance, cache miss rate, and execution time. The experimental results show that our approach brings average improvements of 33.2%, 48.6%, 46.5%, 31.8%, and 27.9% in these metrics, respectively, for single-threaded applications. With the multi-threaded versions of the same applications, the achieved improvements are 39.5%, 55.5%, 53.4%, 26.2%, and 22.2%, in the same order.
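The core recycling idea described in the abstract, reusing a memory location for a new data element once the lifetime of the location's previous occupant has ended, can be illustrated with a small greedy slot allocator. This is a minimal sketch, not the paper's actual compiler pass; the element names and (first_use, last_use) lifetimes below are hypothetical, and the sketch ignores the paper's distance-aware location selection.

```python
def assign_slots(lifetimes):
    """Greedy lifetime-based slot assignment.

    A data element may recycle a memory slot whose previous occupant's
    lifetime has already ended; otherwise a fresh slot is allocated.
    `lifetimes` maps element name -> (first_use, last_use).
    Returns (assignment, number_of_slots_used).
    """
    slots = []       # slots[i] = last_use time of the element currently in slot i
    assignment = {}  # element name -> slot index
    # Process elements in order of first use, as a liveness scan would.
    for name, (start, end) in sorted(lifetimes.items(), key=lambda kv: kv[1][0]):
        for i, free_after in enumerate(slots):
            if free_after < start:   # previous occupant is dead: recycle its slot
                slots[i] = end
                assignment[name] = i
                break
        else:                        # no recyclable slot: allocate a new one
            slots.append(end)
            assignment[name] = len(slots) - 1
    return assignment, len(slots)

# Four elements whose lifetimes partially overlap: only two slots are needed,
# because c can recycle a's slot and d can recycle b's slot.
lifetimes = {"a": (0, 3), "b": (1, 5), "c": (4, 8), "d": (6, 9)}
assignment, total_slots = assign_slots(lifetimes)
```

In this example, a naive allocation would use four locations, while lifetime-based recycling needs only two, which is the kind of memory space footprint reduction the paper quantifies.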

