Abstract
This paper presents a surprising result: changing a seemingly innocuous aspect of an experimental setup can cause a systems researcher to draw wrong conclusions from an experiment. What appears to be an innocuous aspect in the experimental setup may in fact introduce a significant bias in an evaluation. This phenomenon is called measurement bias in the natural and social sciences.
Our results demonstrate that measurement bias is significant and commonplace in computer system evaluation. By significant we mean that measurement bias can lead to a performance analysis that either over-states an effect or even yields an incorrect conclusion. By commonplace we mean that measurement bias occurs in all architectures that we tried (Pentium 4, Core 2, and m5 O3CPU), both compilers that we tried (gcc and Intel's C compiler), and most of the SPEC CPU2006 C programs. Thus, we cannot ignore measurement bias. Nevertheless, in a literature survey of 133 recent papers from ASPLOS, PACT, PLDI, and CGO, we determined that none of the papers with experimental results adequately consider measurement bias.
Inspired by similar problems and their solutions in other sciences, we describe and demonstrate two methods, one for detecting (causal analysis) and one for avoiding (setup randomization) measurement bias.
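To make the avoidance technique concrete, here is a minimal sketch of what setup randomization could look like in practice. It is not the authors' tool: the benchmark command, object-file list, trial count, and the choice of perturbed factors (UNIX environment size and link order, two sources of bias this line of work highlights) are all illustrative assumptions.

```python
# Sketch of setup randomization (illustrative, not the paper's harness):
# each trial perturbs experimental-setup factors that can cause measurement
# bias -- here, the size of the UNIX environment and the object link order --
# before relinking and timing the benchmark.
import os
import random
import statistics
import subprocess
import time

OBJECTS = ["a.o", "b.o", "c.o"]   # hypothetical object files for the benchmark
TRIALS = 30                       # assumed number of randomized trials

def run_one_trial():
    env = dict(os.environ)
    # Randomize environment size by padding a dummy variable (0..4095 bytes).
    env["PADDING"] = "x" * random.randrange(4096)

    # Randomize link order, then relink the benchmark binary.
    objs = OBJECTS[:]
    random.shuffle(objs)
    subprocess.run(["gcc", "-O2", "-o", "bench", *objs], env=env, check=True)

    # Time one run of the benchmark under the perturbed setup.
    start = time.perf_counter()
    subprocess.run(["./bench"], env=env, check=True)
    return time.perf_counter() - start

times = [run_one_trial() for _ in range(TRIALS)]
print(f"mean {statistics.mean(times):.3f}s  stdev {statistics.stdev(times):.3f}s")
```

Reporting the distribution over randomized setups, rather than a single measurement from one fixed setup, is what keeps an innocuous-looking choice of environment or link order from silently biasing the conclusion.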