Abstract
In the design of mobile systems, hardware/software (HW/SW) co-design offers important advantages: specialized hardware can be created for performance or power optimization. Dynamic binary translation (DBT) is a key component of such co-designed systems. During translation, a dynamic optimizer in the DBT system applies various software optimizations to improve the quality of the translated code. Because optimization happens at run time, its cost is an exposed overhead, and useful analyses are often restricted by their high costs. A dynamic optimizer must therefore make smart decisions with limited analysis information, which complicates the design of optimization decision models and often causes human-made heuristics to fail. In mobile systems, the problem is even more challenging because of strict constraints on computing capability and memory size.
To overcome this challenge, we investigate the opportunity to build practical optimization decision models for DBT using machine learning techniques. As a first step, loop unrolling is chosen as the representative optimization. We base our approach on an industrial-strength DBT infrastructure and evaluate it with 17,116 unrollable loops collected from 200 benchmarks and real-life programs across various domains. Using all available features that are potentially important to the loop unrolling decision, we identify the best classification algorithm for our infrastructure, considering both prediction accuracy and cost. A greedy feature selection algorithm is then applied to that classifier to identify its significant features and cut down the feature space. Keeping only the significant features, the best affordable classifier, which satisfies the budgets allocated to the decision process, achieves 74.5% prediction accuracy for the optimal unroll factor and an average 20.9% reduction in dynamic instruction count during steady-state translated code execution. For comparison, the best baseline heuristic achieves 46.0% prediction accuracy and an average 13.6% instruction count reduction. Given that the infrastructure is already highly optimized and the ideal upper bound on instruction reduction is 23.8%, we believe this result is noteworthy.
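The feature-pruning step described above can be sketched as greedy forward selection: repeatedly add the feature that most improves the classifier's holdout accuracy, stopping when no remaining feature helps. The nearest-centroid classifier and the synthetic three-feature data below are illustrative assumptions for a self-contained sketch, not the paper's actual model or loop-feature set.

```python
import random

def centroid_accuracy(feats, train, test):
    """Holdout accuracy of a nearest-centroid classifier restricted to `feats`."""
    sums, counts = {}, {}
    for x, label in train:
        counts[label] = counts.get(label, 0) + 1
        acc = sums.setdefault(label, [0.0] * len(feats))
        for i, f in enumerate(feats):
            acc[i] += x[f]
    centroids = {c: [s / counts[c] for s in sums[c]] for c in sums}
    correct = 0
    for x, label in test:
        # Predict the class whose centroid is nearest in the selected subspace.
        pred = min(centroids, key=lambda c: sum(
            (x[f] - centroids[c][i]) ** 2 for i, f in enumerate(feats)))
        correct += (pred == label)
    return correct / len(test)

def greedy_select(n_features, train, test):
    """Add features one at a time while holdout accuracy keeps improving."""
    selected, best_acc = [], 0.0
    while True:
        gains = [(centroid_accuracy(selected + [f], train, test), f)
                 for f in range(n_features) if f not in selected]
        acc, f = max(gains) if gains else (best_acc, None)
        if f is None or acc <= best_acc:
            return selected, best_acc
        selected, best_acc = selected + [f], acc

# Synthetic data: feature 0 separates the classes, features 1-2 are noise.
random.seed(42)
def sample(label):
    return [4.0 * label + random.gauss(0, 1), random.gauss(0, 1), random.gauss(0, 1)]

train = [(sample(l), l) for l in [0, 1] * 30]
test = [(sample(l), l) for l in [0, 1] * 30]

selected, acc = greedy_select(3, train, test)
print(selected, round(acc, 2))  # feature 0 should be chosen first
```

In practice the same loop runs over the full loop-feature set, with the candidate classifier and its cross-validated accuracy in place of the toy classifier and holdout split used here.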
Multi-objective Exploration for Practical Optimization Decisions in Binary Translation