The Truth, The Whole Truth, and Nothing But the Truth: A Pragmatic Guide to Assessing Empirical Evaluations

Published: 13 October 2016

Abstract

An unsound claim can misdirect a field, encouraging the pursuit of unworthy ideas and the abandonment of promising ones. An inadequate description of a claim can make it difficult to reason about the claim, for example, to determine whether the claim is sound. Many practitioners will acknowledge the threat that unsound claims and inadequately described claims pose to their field. We believe that this situation is exacerbated, and even encouraged, by the lack of a systematic approach to exploring, exposing, and addressing the sources of unsound claims and poor exposition.

This article proposes a framework that identifies three sins of reasoning that lead to unsound claims and two sins of exposition that lead to poorly described claims and evaluations. Sins of exposition obfuscate the objective of determining whether or not a claim is sound, while sins of reasoning lead directly to unsound claims.

Our framework provides practitioners with a principled way of critiquing the integrity of their own work and the work of others. We hope that this will help individuals conduct better science and encourage a cultural shift in our research community to identify and promulgate sound claims.



Published in: ACM Transactions on Programming Languages and Systems, Volume 38, Issue 4 (October 2016), 204 pages.
ISSN: 0164-0925
EISSN: 1558-4593
DOI: 10.1145/2982214
Copyright © 2016 ACM
Publisher: Association for Computing Machinery, New York, NY, United States

Publication History:
• Received: 1 March 2014
• Revised: 1 August 2015
• Accepted: 1 January 2016
• Published: 13 October 2016
