Abstract
As the memory footprints of applications in areas such as data analytics grow, the latency overhead of translation lookaside buffer (TLB) misses grows with them, so TLB efficiency is increasingly critical to overall system performance. Analyzing TLB miss traces is useful for both hardware architecture design and software application optimization. However, tracing and profiling TLB misses with cycle-accurate simulators or instrumentation tools is very time-consuming, inaccurate, or both. In this article, we propose an efficient and precise tool for collecting and profiling last-level TLB misses. The tool uses a novel software method called Page Table Access Tracing (PTAT), which stores the last-level page table entries of selected workload processes in a reserved uncached memory region. As a result, each last-level TLB miss incurred by a user process corresponds to one uncached page table access to main memory, which a hardware memory bus monitor can capture and record. The captured information is then dumped to offline storage. In this manner, full TLB miss traces are collected and can be analyzed flexibly. Compared with previous software-based methods, this method achieves higher performance. Experiments show that, even though the state-of-the-art kernel instrumentation method used for comparison (BadgerTrap) does not dump complete traces, PTAT still achieves a speedup of up to 3.88-fold on memory-intensive benchmarks. Case studies confirm that the improved efficiency and completeness of tracing enable more flexible profiling, which is of great significance for TLB performance optimization. The accuracy of PTAT is verified with both a dedicated sequence and performance counters.
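The core idea of the abstract, that every bus access falling inside the reserved uncached region stands for one last-level TLB miss, can be illustrated with a small offline filter. This is only a sketch under stated assumptions, not the authors' tool: the region bounds (`RESERVED_BASE`, `RESERVED_SIZE`), the `(timestamp, phys_addr)` record format, and the 8-byte entry / 4 KiB table-page layout (typical of x86-64) are all hypothetical choices made for illustration.

```python
from collections import Counter

PTE_SIZE = 8          # bytes per page table entry (assumed x86-64 layout)
PT_PAGE_SIZE = 4096   # bytes per last-level page table page (assumed)

# Hypothetical bounds of the reserved uncached region that holds the
# last-level page tables of the traced processes.
RESERVED_BASE = 0x1_0000_0000
RESERVED_SIZE = 64 * 1024 * 1024


def is_page_table_access(phys_addr):
    """True if a physical address falls inside the reserved region."""
    return RESERVED_BASE <= phys_addr < RESERVED_BASE + RESERVED_SIZE


def extract_tlb_misses(bus_trace):
    """Keep only bus records that hit the reserved region.

    bus_trace is an iterable of (timestamp, phys_addr) records, a
    hypothetical format for what a memory bus monitor might dump.
    Each kept record is treated as one last-level TLB miss and is
    returned as (timestamp, table_page, pte_index).
    """
    misses = []
    for timestamp, phys_addr in bus_trace:
        if is_page_table_access(phys_addr):
            offset = phys_addr - RESERVED_BASE
            table_page = offset // PT_PAGE_SIZE
            pte_index = (offset % PT_PAGE_SIZE) // PTE_SIZE
            misses.append((timestamp, table_page, pte_index))
    return misses


def misses_per_pte(misses):
    """Count misses per (table_page, pte_index) pair."""
    return Counter((page, index) for _, page, index in misses)


# Example: three of the four records fall inside the reserved region.
trace = [
    (1, 0x1_0000_0000),
    (2, 0x2000),
    (3, 0x1_0000_1008),
    (4, 0x1_0000_1008),
]
misses = extract_tlb_misses(trace)
counts = misses_per_pte(misses)
```

Because the full trace is kept rather than a sampled count, the same records could be regrouped later by time window or by process, which is the kind of flexible offline profiling the abstract describes.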
PTAT: An Efficient and Precise Tool for Tracing and Profiling Detailed TLB Misses