Enabling On-the-Fly Hardware Tracing of Data Reads in Multicores

Published: 10 June 2019

Abstract

Software debugging is one of the most challenging aspects of embedded system development because of growing hardware and software complexity, limited visibility of system components, and tightening time-to-market constraints. To find software bugs faster, developers often rely on on-chip trace modules with large buffers that capture program execution traces with minimal interference with program execution. However, the high volume of trace data and the high cost of trace modules limit visibility into system operation to short program segments. This article introduces a new hardware/software technique for capturing and filtering read data value traces in multicores that enables a complete reconstruction of parallel program execution. The proposed technique tracks data reads in data caches and exploits cache coherence protocol states to minimize the number of trace messages streamed out of the target platform to the software debugger. Its effectiveness is evaluated by analyzing the required trace port bandwidth and trace buffer sizes as a function of the data cache size and the number of processor cores. The results show that the proposed technique reduces the required trace port bandwidth by a factor of 12.2 to 73.9 compared to Nexus-like read data value tracing, thus enabling continuous on-the-fly data tracing at modest hardware cost.
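The filtering idea in the abstract can be illustrated with a small software model. The sketch below is not the article's hardware design. It assumes, purely for illustration, that each core keeps a per-address "already reported" flag, that a read emits a trace message only when that flag is clear, that a write by one core clears the flag in every other core (standing in for a coherence invalidation), and that the debugger can reproduce a core's own stores by replay. The class name `TraceFilter` and its methods are hypothetical, and cache capacity and evictions are ignored.

```python
class TraceFilter:
    """Toy model of read-value trace filtering (illustrative assumption,
    not the article's actual mechanism)."""

    def __init__(self, n_cores):
        # known[c] holds addresses whose current value the debugger is
        # assumed to already know for core c.
        self.known = [set() for _ in range(n_cores)]
        self.messages = 0  # trace messages that would be streamed out

    def read(self, core, addr):
        """Return True if this read would emit a trace message."""
        if addr in self.known[core]:
            return False  # value is reconstructible, so the read is filtered
        self.known[core].add(addr)
        self.messages += 1
        return True

    def write(self, core, addr):
        """Model a store: other cores lose their copy, as on an invalidation."""
        for other, flags in enumerate(self.known):
            if other != core:
                flags.discard(addr)
        # Assumption: the debugger replays this core's own stores, so the
        # writer's copy stays known without a trace message.
        self.known[core].add(addr)


if __name__ == "__main__":
    f = TraceFilter(n_cores=2)
    reads = [(0, 0x10), (0, 0x10), (1, 0x10), (0, 0x10)]
    for core, addr in reads:
        f.read(core, addr)
    print(f"{len(reads)} reads, {f.messages} trace messages")
```

In this model an unfiltered tracer would emit one message per read, so the ratio of reads to emitted messages gives a rough sense of the kind of bandwidth reduction the abstract reports. The figures of 12.2 to 73.9 come from the article's own evaluation, not from this sketch.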
