
PTAT: An Efficient and Precise Tool for Tracing and Profiling Detailed TLB Misses

Published: 22 May 2018

Abstract

As the memory access footprints of applications in areas such as data analytics grow, the latency overhead of translation lookaside buffer (TLB) misses grows with them, so TLB efficiency becomes increasingly critical for overall system performance. Analyzing TLB miss traces is useful for both hardware architecture design and software application optimization. However, tracing and profiling TLB misses with cycle-accurate simulators or instrumentation tools is very time-consuming, inaccurate, or both. In this article, we propose an efficient and precise tool to collect and profile last-level TLB misses. The tool uses a novel software method called Page Table Access Tracing (PTAT), which stores the last-level page table entries of selected workload processes in a reserved uncached memory region. As a result, each last-level TLB miss incurred by a user process corresponds to one uncached page table access to main memory, which can be captured and recorded by a hardware memory bus monitor. The captured information is then dumped to offline storage. In this manner, full TLB miss traces are collected and can be analyzed flexibly. Compared with previous software-based methods, this method achieves higher performance. Experiments show that, even compared with a state-of-the-art kernel instrumentation method (BadgerTrap) that lacks a complete trace-dumping function, the speedup is up to 3.88-fold for memory-intensive benchmarks. Case studies validate that the improved efficiency and completeness of tracing enable more flexible profiling, which is of great significance for TLB performance optimization. The accuracy of PTAT is verified with both a dedicated sequence and performance counters.
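To make the idea concrete, the following is a minimal sketch of how a captured trace of last-level page table accesses might be turned into a per-page TLB miss profile. It is not the tool's actual implementation. It assumes x86-64 paging with 4 KiB pages and 8-byte page table entries, so the byte offset of an entry within its page-table page gives bits 12 to 20 of the virtual address. The names `pte_addr_to_va`, `profile_misses`, and `table_base_va` are hypothetical, and the mapping from each page-table page to the 2 MiB-aligned virtual region it covers is assumed to be recorded by software when the table is placed in the reserved region.

```python
from collections import Counter

PAGE_SIZE = 4096  # assumption: 4 KiB base pages
PTE_SIZE = 8      # x86-64 page table entries are 8 bytes wide


def pte_addr_to_va(pte_addr, table_base_va):
    """Map a captured PTE physical address to the virtual page it translates.

    table_base_va is a hypothetical dict: physical address of a last-level
    page-table page -> base virtual address of the 2 MiB region it covers.
    """
    table_page = pte_addr & ~(PAGE_SIZE - 1)            # page-table page holding the entry
    index = (pte_addr & (PAGE_SIZE - 1)) // PTE_SIZE    # entry index, 0..511
    return table_base_va[table_page] + index * PAGE_SIZE


def profile_misses(trace, table_base_va):
    """Count last-level TLB misses per virtual page from a list of PTE addresses."""
    return Counter(pte_addr_to_va(addr, table_base_va) for addr in trace)


if __name__ == "__main__":
    # Illustrative values only: one page-table page in the reserved region.
    table_base_va = {0x10000000: 0x7F0000000000}
    trace = [0x10000008, 0x10000008, 0x10000FF8]
    for va, misses in sorted(profile_misses(trace, table_base_va).items()):
        print(hex(va), misses)
```

Because every record in such a trace stands for one miss, the same pass could just as easily group misses by time window or by process, which is the kind of flexible offline analysis a full trace allows.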

References

  1. D. A. Bader, J. Feo, J. Gilbert, J. Kepner, D. Koester, E. Loh, K. Madduri, W. Mann, and Theresa Meuse. 2011. The Graph 500 List. Retrieved from http://graph500.org.
  2. Will Cohen, Suravee Suthikulpanit, Will Deacon, Gilles Allard, Daniel Hansel, and Robert Richter. 2017. Oprofile. Retrieved from http://oprofle.sourceforge.net.
  3. Yungang Bao, Mingyu Chen, Yuan Ruan, Li Liu, Jianping Fan, Qingbo Yuan, Bo Song, and Jianwei Xu. 2008. HMTT: A platform independent full-system memory trace monitoring system. Meas. Model. Comput. Syst. 36, 1 (2008), 229--240.
  4. Fabrice Bellard. 2005. QEMU, a fast and portable dynamic translator. In USENIX Annual Technical Conference, FREENIX Track. 41--46.
  5. Nathan Binkert, Bradford M. Beckmann, Gabriel Black, Steven K. Reinhardt, Ali G. Saidi, Arkaprava Basu, Joel Hestness, Derek R. Hower, Tushar Krishna, Somayeh Sardashti, et al. 2011. The GEM5 simulator. ACM SIGARCH Comput. Archit. News 39, 2 (2011), 1--7.
  6. Licheng Chen, Zehan Cui, Yungang Bao, Mingyu Chen, Yongbing Huang, and Guangming Tan. 2012. A lightweight hybrid hardware/software approach for object-relative memory profiling. In 2012 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS’12). IEEE, 46--57.
  7. Licheng Chen, Zhipeng Wei, Zehan Cui, Mingyu Chen, Haiyang Pan, and Yungang Bao. 2014. CMD: classification-based memory deduplication through page access characteristics. In ACM SIGPLAN Notices, Vol. 49. ACM, 65--76.
  8. Jayneel Gandhi, Arkaprava Basu, Mark D. Hill, and Michael M. Swift. 2014. BadgerTrap: A tool to instrument x86-64 TLB misses. ACM SIGARCH Comput. Archit. News 42, 2 (2014), 20--23.
  9. John L. Henning. 2006. SPEC CPU2006 benchmark descriptions. ACM SIGARCH Comput. Archit. News 34, 4 (2006), 1--17.
  10. Yongbing Huang. 2011. TopMC: An Approach for Understanding Architectural Characteristics (Technical Report). Retrieved from http://asg.ict.ac.cn/projects/topmc.
  11. Yongbing Huang, Licheng Chen, Zehan Cui, Yuan Ruan, Yungang Bao, Mingyu Chen, and Ninghui Sun. 2014. HMTT: A hybrid hardware/software tracing system for bridging the DRAM access trace’s semantic gap. ACM Trans. Archit. Code Optim. 11, 1 (2014), 7.
  12. Yongbing Huang, Zehan Cui, Licheng Chen, Wenli Zhang, Yungang Bao, and Mingyu Chen. 2012. HaLock: hardware-assisted lock contention detection in multithreaded applications. In 21st International Conference on Parallel Architectures and Compilation Techniques. ACM, 253--262.
  13. Intel. 2013. Intel-64 and IA-32 Architectures Software Developer’s Manual, Vol. 3A: System Programming Guide, Part 1, 64 (2013), (3--11--20)--(3--11--33).
  14. Intel. 2017. Intel VTune Amplifier 2017. Retrieved from http://software.intel.com/en-us/intel-vtune.
  15. Bruce L. Jacob and Trevor N. Mudge. 1998. A look at several memory management units, TLB-refill mechanisms, and page table organizations. In ACM SIGOPS Operating Systems Review, Vol. 32. ACM, 295--306.
  16. Teledyne LeCroy. 2017. Kibra 480 analyzer. Retrieved from http://teledynelecroy.com/protocolanalyzer.
  17. Yuhang Liu and Xian-He Sun. 2015. LPM: Concurrency-driven layered performance matching. In 2015 44th International Conference on Parallel Processing (ICPP’15). IEEE, 879--888.
  18. Yuhang Liu and Xian-He Sun. 2015. Reevaluating data stall time with the consideration of data access concurrency. J. Comput. Sci. Technol. 30, 2 (2015), 227--245.
  19. Yuhang Liu and Xian-He Sun. 2017. Evaluating the combined effect of memory capacity and concurrency for many-core chip design. ACM Trans. Model. Perform. Eval. Comput. Syst. 2, 2 (2017), 9.
  20. Peter S. Magnusson, Mattias Christensson, J. Eskilson, Daniel Forsgren, G. Hallberg, J. Hogberg, F. Larsson, Andreas Moestedt, and B. Werner. 2002. SIMICS: A full system simulation platform. IEEE Comput. 35, 2 (2002), 50--58.
  21. Shirley V. Moore. 2002. A comparison of counting and sampling modes of using performance monitoring hardware. In International Conference on Computational Science. Springer, 904--912.
  22. Nobuyuki Ohba, Seiji Munetoh, Atsuya Okazaki, and Yasunao Katayama. 2014. Non-intrusive scalable memory access tracer. In International Conference on Quantitative Evaluation of Systems. Springer, 245--248.
  23. Dhinakaran Pandiyan, Shin-Ying Lee, and Carole-Jean Wu. 2013. Performance, energy characterizations and architectural implications of an emerging mobile platform benchmark suite-MobileBench. In 2013 IEEE International Symposium on Workload Characterization (IISWC). IEEE, 133--142.
  24. Avadh Patel, Furat Afram, Shunfei Chen, and Kanad Ghose. 2011. MARSS: A full system simulator for multicore x86 CPUs. In Proceedings of the 48th Design Automation Conference. ACM, 1050--1055.
  25. Steven J. Plimpton, Ron Brightwell, Courtenay Vaughan, Keith Underwood, and Mike Davis. 2006. A simple synchronous distributed-memory algorithm for the HPCC randomaccess benchmark. In 2006 IEEE International Conference on Cluster Computing. IEEE, 1--7.
  26. Green Hills Software. 2017. Supertrace probe. Retrieved from http://www.ghs.com/products/supertraceprobe.html.
  27. Rich Uhlig, Gil Neiger, Dion Rodgers, Amy L. Santoni, Fernando C. M. Martins, Andrew V. Anderson, Steven M. Bennett, Alain Kagi, Felix H. Leung, and Larry Smith. 2005. Intel virtualization technology. Computer 38, 5 (2005), 48--56.
  28. Petrie Wong, Ziqiang Feng, Wenjian Xu, Eric Lo, and Ben Kao. 2015. TLB misses: The missing issue of adaptive radix tree? In Proceedings of the 11th International Workshop on Data Management on New Hardware. ACM, 6.
  29. Lixin Zhang, Zhen Fang, Mike Parker, Binu K. Mathew, Lambert Schaelicke, John B. Carter, Wilson C. Hsieh, and Sally A. McKee. 2001. The impulse memory controller. IEEE Trans. Comput. 50, 11 (2001), 1117--1132.
