Abstract
Many programs from different application domains process very large amounts of data, making their cache memory behavior critical for high performance. Most existing work targeting cache memory hierarchies focuses on improving data access patterns, e.g., maximizing sequential accesses to program data structures via code and/or data layout restructuring strategies. Prior work has addressed this data locality optimization problem in the context of both single-core and multi-core systems. Another dimension of optimization, which can be as important and beneficial as improving the data access pattern, is reducing the data volume (the total number of addresses) accessed by the program. Compared to data access pattern restructuring, this volume minimization problem has received much less attention. In this work, we focus on the volume minimization problem and address it in both single-core and multi-core execution scenarios. Specifically, we explore the idea of rewriting an application's code to reduce its "memory space footprint". The main idea behind this approach is to reuse/recycle, for a given data element, a memory location that was originally assigned to another data element, provided that the lifetimes of the two data elements do not overlap. A unique aspect of our approach is that it is "distance aware": in identifying the memory/cache locations to recycle, it takes into account the physical distance between the core and the memory/cache location to be recycled. We present a detailed experimental evaluation of our proposed memory space recycling strategy using five metrics: memory space consumption, network footprint, data access distance, cache miss rate, and execution time. For single-threaded applications, our approach improves these metrics by 33.2%, 48.6%, 46.5%, 31.8%, and 27.9% on average, respectively.
With the multi-threaded versions of the same applications, the achieved improvements are 39.5%, 55.5%, 53.4%, 26.2%, and 22.2%, in the same order.
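To make the core idea concrete, the following is a minimal illustrative sketch (not the paper's actual algorithm) of lifetime-based location recycling, in the spirit of linear-scan register allocation: each data element has a lifetime interval, and two elements may share a memory location only if their lifetimes do not overlap. The element names and lifetime values are hypothetical.

```python
# Illustrative sketch of lifetime-based memory location recycling.
# Two data elements may share a location iff their lifetimes (inclusive
# [first_use, last_use] intervals) do not overlap.

def recycle_locations(lifetimes):
    """Map each element to a location index, recycling freed locations.

    lifetimes: dict name -> (first_use, last_use), both inclusive.
    Returns: dict name -> location index.
    """
    # Process elements in order of first use.
    events = sorted(lifetimes.items(), key=lambda kv: kv[1][0])
    free = []        # location indices whose occupants have expired
    active = []      # (last_use, location) pairs of live elements
    assignment = {}
    next_loc = 0
    for name, (start, end) in events:
        # Retire elements whose lifetimes ended before this one starts,
        # returning their locations to the free pool.
        for last, loc in list(active):
            if last < start:
                active.remove((last, loc))
                free.append(loc)
        # Recycle a free location if possible; otherwise allocate a new one.
        if free:
            loc = free.pop()
        else:
            loc = next_loc
            next_loc += 1
        assignment[name] = loc
        active.append((end, loc))
    return assignment

# Four elements whose lifetimes allow them to share just two locations.
lifetimes = {"a": (0, 3), "b": (1, 2), "c": (4, 6), "d": (3, 5)}
print(recycle_locations(lifetimes))  # four elements, two locations
```

The paper's distance-aware variant would additionally rank the free locations by their physical distance from the requesting core (e.g., hop count in the on-chip network) and recycle the nearest one, rather than popping an arbitrary entry from the free pool as this sketch does.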