Efficient hardware-based nonintrusive dynamic application profiling

Published: 05 May 2011

Abstract

Application profiling, the process of monitoring an application to determine the frequency of execution within specific regions, is an essential step in the design process for many software and hardware systems. Profiling is often a critical step in hardware/software partitioning, where it is used to determine the critical kernels of an application. In this article, we present a nonintrusive dynamic application profiler (DAProf) capable of profiling an executing application by monitoring the application's short backward branches, function calls, and function returns. The resulting profile information accurately characterizes the frequently executed loops within the application, providing a breakdown of loop executions versus loop iterations per execution. DAProf achieves an average accuracy of 98% for loop executions, 97% for average iterations per execution, and 95% for percentage of execution time. In addition, the profiler incurs as little as 11% area overhead compared to an ARM9 microprocessor. DAProf is well suited for rapidly profiling software applications and for dynamic optimization approaches such as dynamic hardware/software partitioning, in which detailed loop execution information is needed to provide accurate performance estimates.
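To make the distinction between loop executions and loop iterations per execution concrete, the following is a minimal software sketch of loop profiling driven by short backward branches. It is an illustration only, not the DAProf hardware design: the trace format `(pc, target, taken)`, the names `LoopStats` and `profile`, and the `MAX_SBB_OFFSET` threshold are all assumptions introduced here, and function calls and returns are ignored for brevity.

```python
from dataclasses import dataclass

# Hypothetical threshold (in bytes) for treating a backward branch as "short".
MAX_SBB_OFFSET = 256


@dataclass
class LoopStats:
    executions: int = 0   # times the loop was entered and then exited
    iterations: int = 0   # total loop-body iterations across all executions

    @property
    def avg_iterations(self) -> float:
        """Average iterations per completed execution (0.0 if none completed)."""
        return self.iterations / self.executions if self.executions else 0.0


def profile(trace):
    """Build per-loop statistics from an iterable of (pc, target, taken) events.

    Each event describes one executed branch instruction. A loop is
    identified by the address of its short backward branch.
    """
    loops = {}
    for pc, target, taken in trace:
        offset = pc - target
        if not (0 < offset <= MAX_SBB_OFFSET):
            continue  # forward branch or long backward branch: not a loop edge
        stats = loops.setdefault(pc, LoopStats())
        stats.iterations += 1      # reaching the branch ends one iteration
        if not taken:
            stats.executions += 1  # falling through ends this loop execution
    return loops
```

Under these assumptions, a loop that is entered twice and iterates three times on each entry is reported as 2 executions with an average of 3.0 iterations per execution, rather than as a single count of 6. An execution still in progress when the trace ends contributes iterations but no execution in this sketch.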
