ABSTRACT
To evaluate software performance and detect regressions, many developers rely on automated performance tests. However, the results of such tests often contain noise that is not caused by actual performance changes in the programs under test, but by external factors such as operating system decisions or unexpected non-determinism inside the programs themselves. This makes the results difficult to interpret, since a result that differs from previous ones cannot easily be attributed to either a genuine change or noise. In this paper we analyse a subset of the factors likely to contribute to this noise, using the Mozilla Firefox browser as an example. In addition, we present a statistical technique for identifying outliers in Mozilla's automated testing framework. Our results show that a significant amount of noise is caused by memory randomization and other external factors, that there is variance in Firefox internals that does not appear to be correlated with variance in test results, and that our proposed statistical forecasting technique detects genuine performance changes more reliably than the method currently used by Mozilla.
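The abstract does not spell out the forecasting procedure, but the general idea of forecast-based outlier detection can be sketched as follows. This is a minimal illustration, assuming simple exponential smoothing and a residual-based threshold; the function name, the smoothing parameter `alpha`, and the threshold `z` are illustrative choices, not the authors' actual method.

```python
import math

def detect_outliers(series, alpha=0.3, z=3.0, warmup=3):
    """Flag test results that deviate sharply from a smoothed forecast.

    The forecast is an exponentially weighted moving average of past
    results; a result is flagged as an outlier when its residual exceeds
    z times the running standard deviation of past residuals.
    """
    forecast = series[0]
    residuals = []
    outliers = []
    for i, x in enumerate(series[1:], start=1):
        resid = x - forecast
        if len(residuals) >= warmup:
            sd = math.sqrt(sum(r * r for r in residuals) / len(residuals))
            if sd > 0 and abs(resid) > z * sd:
                outliers.append(i)
                continue  # do not let an outlier shift the smoothed level
        residuals.append(resid)
        forecast = alpha * x + (1 - alpha) * forecast
    return outliers
```

For example, a series of results hovering around 100 ms with one sudden jump to 130 ms would have only the jump flagged, while the ordinary run-to-run noise stays below the threshold.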
