Abstract
Hardware performance monitoring units (PMUs) are a standard feature of modern microprocessors and provide a rich set of microarchitectural event samplers. Recently, numerous profile-guided optimization (PGO) frameworks have exploited PMUs to achieve much lower profiling overhead than conventional instrumentation-based frameworks. However, existing PGO frameworks mainly focus on optimizing the layout of binaries; they overlook the rich information the PMU provides about data access behavior across the memory hierarchy. Thus, we propose MaPHeA, a lightweight
MaPHeA: A Framework for Lightweight Memory Hierarchy-aware Profile-guided Heap Allocation