Abstract
An unsound claim can misdirect a field, encouraging the pursuit of unworthy ideas and the abandonment of promising ones. An inadequate description of a claim can make the claim difficult to reason about, for example, to determine whether it is sound. Many practitioners will acknowledge that unsound claims and inadequately described claims threaten their field. We believe this situation is exacerbated, and even encouraged, by the lack of a systematic approach to exploring, exposing, and addressing the sources of unsound claims and poor exposition.
This article proposes a framework that identifies three sins of reasoning that lead to unsound claims and two sins of exposition that lead to poorly described claims and evaluations. Sins of exposition obfuscate the objective of determining whether or not a claim is sound, while sins of reasoning lead directly to unsound claims.
Our framework provides practitioners with a principled way of critiquing the integrity of their own work and the work of others. We hope that this will help individuals conduct better science and encourage a cultural shift in our research community to identify and promulgate sound claims.
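To make the distinction concrete, the following is a minimal, hypothetical sketch (the timing numbers are made up for illustration and do not come from the paper) of one kind of sin of reasoning the framework targets: cherry-picking the most flattering pair of benchmark runs to claim a speedup that disappears under a mean-based comparison.

```python
import statistics

# Hypothetical benchmark timings in seconds (made-up data for illustration).
baseline_runs = [10.1, 10.3, 9.9, 10.2, 10.0]
new_system_runs = [9.8, 10.4, 10.1, 10.6, 10.2]

# Sin of reasoning: cherry-pick the slowest baseline run and the
# fastest new-system run, then report their ratio as "the" speedup.
cherry_picked_speedup = max(baseline_runs) / min(new_system_runs)

# Sounder comparison: use means over all runs (and, in a real
# evaluation, confidence intervals over many repetitions).
honest_speedup = statistics.mean(baseline_runs) / statistics.mean(new_system_runs)

print(f"cherry-picked speedup: {cherry_picked_speedup:.2f}x")  # looks like a ~5% win
print(f"mean-based speedup:    {honest_speedup:.2f}x")         # actually a slight slowdown
```

With these numbers the cherry-picked ratio suggests the new system is faster, while the means show it is marginally slower; the claim's soundness depends entirely on how the evidence was selected.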
The Truth, The Whole Truth, and Nothing But the Truth: A Pragmatic Guide to Assessing Empirical Evaluations