Abstract
All software in use today relies on libraries, including standard libraries (e.g., C, C++) and application-specific libraries (e.g., libxml, libpng). Most libraries are loaded in memory and dynamically linked when programs are launched, resolving symbol addresses across the applications and libraries. Dynamic linking has many benefits: it allows code to be reused between applications, conserves memory (because only one copy of a library is kept in memory for all the applications that share it), and allows libraries to be patched and updated without modifying programs. However, these benefits come at the cost of performance. For every call made to a function in a dynamically linked library, a trampoline is used to read the function address from a lookup table and branch to the function, incurring memory load and branch operations. Static linking avoids this performance penalty but loses all the benefits of dynamic linking. Given its benefits, dynamic linking is the predominant choice today, despite the performance cost. In this work, we propose a speculative hardware mechanism to optimize dynamic linking by avoiding executing the trampolines for library function calls, providing the benefits of dynamic linking with the performance of static linking. Speculatively skipping the memory load and branch operations of the library call trampolines improves performance by reducing the number of executed instructions, and gains additional performance by reducing pressure on the instruction and data caches, TLBs, and branch predictors. Because the indirect targets of library call trampolines do not change during program execution, our speculative mechanism never misspeculates in practice. We evaluate our technique on real hardware with production software and observe up to 4% speedup using only 1.5KB of on-chip storage.
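The trampoline cost described in the abstract can be illustrated with a toy software model. The sketch below is not the paper's hardware design: `ToyProcess`, `lookup_table`, `skip_table`, and the counters are hypothetical names chosen for illustration, and the model assumes, as the abstract states, that a trampoline's target does not change during execution, so no misspeculation recovery is modeled.

```python
"""Toy model: library calls through a trampoline vs. speculatively skipping it."""


def lib_square(x):
    """Stand-in for a function that lives in a shared library."""
    return x * x


class ToyProcess:
    def __init__(self):
        # Lookup table filled in by the dynamic linker when the program loads.
        self.lookup_table = {"square": lib_square}
        # Hypothetical small table that remembers each trampoline's target.
        self.skip_table = {}
        self.loads = 0              # memory loads performed by trampolines
        self.indirect_branches = 0  # indirect branches performed by trampolines

    def _trampoline(self, symbol):
        """Read the function address from the lookup table, then branch to it."""
        self.loads += 1
        target = self.lookup_table[symbol]
        self.indirect_branches += 1
        return target

    def call_dynamic(self, symbol, arg):
        """Ordinary dynamically linked call: every call runs the trampoline."""
        return self._trampoline(symbol)(arg)

    def call_speculative(self, symbol, arg):
        """Run the trampoline once, then go straight to the remembered target."""
        target = self.skip_table.get(symbol)
        if target is None:
            target = self._trampoline(symbol)
            self.skip_table[symbol] = target
        return target(arg)


if __name__ == "__main__":
    baseline, optimized = ToyProcess(), ToyProcess()
    for n in range(1000):
        assert baseline.call_dynamic("square", n) == optimized.call_speculative("square", n)
    print("baseline loads:", baseline.loads, "optimized loads:", optimized.loads)
```

In this model the baseline performs one load and one indirect branch per call (1000 each for the loop above), while the speculative path performs them only on the first call. The real savings reported in the abstract also come from reduced cache, TLB, and branch-predictor pressure, which this sketch does not capture.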
Architectural Support for Dynamic Linking. ASPLOS '15: Proceedings of the Twentieth International Conference on Architectural Support for Programming Languages and Operating Systems.