Abstract
Application profiling, the process of monitoring an application to determine how frequently specific regions execute, is an essential step in the design of many software and hardware systems. In hardware/software partitioning, profiling is often the critical step used to identify an application's critical kernels. In this article, we present a nonintrusive dynamic application profiler (DAProf) that profiles an executing application by monitoring its short backward branches, function calls, and function returns. The resulting profile accurately characterizes the application's frequently executed loops, breaking down each loop's behavior into loop executions and iterations per execution. DAProf achieves average accuracies of 98% for loop executions, 97% for average iterations per execution, and 95% for percentage of execution time. In addition, DAProf incurs as little as 11% area overhead compared to an ARM9 microprocessor. DAProf is well suited to rapid profiling of software applications and to dynamic optimization approaches, such as dynamic hardware/software partitioning, in which detailed loop execution information is needed to provide accurate performance estimates.
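To make the distinction between loop executions and iterations per execution concrete, the following is a minimal software sketch of loop detection from short backward branches. It is not the DAProf hardware design. The trace format (`pc`, `target`, `taken`), the `MAX_BACKWARD_OFFSET` threshold, and the names `LoopStats` and `profile_loops` are all assumptions made for illustration, and the handling of function calls and returns mentioned in the abstract is omitted.

```python
from collections import defaultdict
from dataclasses import dataclass

# Hypothetical threshold (in bytes) for what counts as a "short" backward branch.
MAX_BACKWARD_OFFSET = 256


@dataclass
class LoopStats:
    executions: int = 0   # number of times the loop was entered
    iterations: int = 0   # total loop-body iterations across all executions
    active: bool = False  # True while an execution of the loop is in progress

    @property
    def avg_iterations(self) -> float:
        return self.iterations / self.executions if self.executions else 0.0


def profile_loops(trace):
    """Build per-loop statistics from (pc, target, taken) branch events.

    A loop is identified by the address of its short backward branch.
    A taken branch means another iteration follows; a not-taken branch
    (fall-through) closes the current execution of that loop.
    """
    loops = defaultdict(LoopStats)
    for pc, target, taken in trace:
        offset = pc - target
        if not (0 < offset <= MAX_BACKWARD_OFFSET):
            continue  # forward branch or long backward branch: not a candidate loop
        stats = loops[pc]
        if not stats.active:
            stats.executions += 1  # first branch event of a new execution
            stats.active = True
        stats.iterations += 1      # each event closes one pass through the body
        if not taken:
            stats.active = False   # fall-through ends this execution
    return dict(loops)


# Assumed example: one loop entered twice (3 iterations, then 2),
# plus one forward branch that is ignored.
example_trace = [
    (0x120, 0x100, True), (0x120, 0x100, True), (0x120, 0x100, False),
    (0x120, 0x100, True), (0x120, 0x100, False),
    (0x200, 0x400, True),
]
example_stats = profile_loops(example_trace)
```

In this sketch the loop at `0x120` is recorded with two executions and five iterations in total, which gives the executions-versus-iterations breakdown the abstract describes. A hardware profiler would have to track the same quantities without access to a stored trace.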
Index Terms
Efficient hardware-based nonintrusive dynamic application profiling