
On the Impact of Programming Languages on Code Quality: A Reproduction Study

Online: 12 October 2019

Abstract

In a 2014 article, Ray, Posnett, Devanbu, and Filkov claimed to have uncovered a statistically significant association between 11 programming languages and software defects in 729 projects hosted on GitHub. Specifically, their work answered four research questions relating to software defects and programming languages. With data and code provided by the authors, the present article first attempts to conduct an experimental repetition of the original study. The repetition is only partially successful, due to missing code and issues with the classification of languages. The second part of this work focuses on their main claim, the association between bugs and languages, and performs a complete, independent reanalysis of the data and of the statistical modeling steps undertaken by Ray et al. in 2014. This reanalysis uncovers a number of serious flaws that reduce the number of languages with a statistically significant association with defects from 11 to only 4. Moreover, the practical effect size is exceedingly small. These results thus undermine the conclusions of the original study. Correcting the record is important, as many subsequent works have cited the 2014 article and have asserted, without evidence, a causal link between the choice of programming language for a given task and the number of software defects. Causation is not supported by the data at hand; and, in our opinion, even after fixing the methodological flaws we uncovered, too many unaccounted sources of bias remain to hope for a meaningful comparison of bug rates across languages.
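One element of the reanalysis described above is correcting for multiple hypothesis testing when 11 languages are tested at once; the Benjamini-Hochberg false discovery rate procedure [1] is the standard tool for this. The following is a minimal sketch of that procedure in Python (the original study's analysis was done in R; the function name and the p-values below are hypothetical, for illustration only):

```python
# A minimal sketch of the Benjamini-Hochberg step-up procedure [1].
# Testing many languages at once inflates the chance of false positives;
# BH controls the expected fraction of false discoveries at level alpha.

def benjamini_hochberg(p_values, alpha=0.05):
    """Return the (0-based) indices of hypotheses rejected at FDR level alpha."""
    m = len(p_values)
    # Sort p-values ascending, remembering their original positions.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k (1-based) with p_(k) <= (k / m) * alpha.
    threshold_k = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            threshold_k = rank
    # Reject the hypotheses with the threshold_k smallest p-values.
    return sorted(order[:threshold_k])

# Hypothetical per-language p-values (not the paper's actual numbers):
pvals = [0.001, 0.004, 0.019, 0.03, 0.04, 0.2, 0.5, 0.7]
print(benjamini_hochberg(pvals))  # → [0, 1]
```

Note how the uncorrected test would reject the first five hypotheses at 0.05, while the FDR-adjusted procedure keeps only two; this is the same mechanism by which a multiplicity correction can shrink a set of "significant" languages.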

Supplemental Material

a21-vitek: Presentation at SIGPLAN SPLASH '19

References

  1. Yoav Benjamini and Yosef Hochberg. 1995. Controlling the false discovery rate: A practical and powerful approach to multiple testing. J. Roy. Stat. Soc. B 57, 1 (1995). DOI:https://doi.org/10.2307/2346101
  2. Christian Bird, Adrian Bachmann, Eirik Aune, John Duffy, Abraham Bernstein, Vladimir Filkov, and Premkumar Devanbu. 2009. Fair and balanced?: Bias in bug-fix datasets. In Proceedings of the Symposium on the Foundations of Software Engineering (ESEC/FSE’09). DOI:https://doi.org/10.1145/1595696.1595716
  3. Casey Casalnuovo, Yagnik Suchak, Baishakhi Ray, and Cindy Rubio-González. 2017. GitcProc: A tool for processing and classifying GitHub commits. In Proceedings of the International Symposium on Software Testing and Analysis (ISSTA’17). DOI:https://doi.org/10.1145/3092703.3098230
  4. David Colquhoun. 2017. The reproducibility of research and the misinterpretation of p-values. R. Soc. Open Sci. 4, 171085 (2017). DOI:https://doi.org/10.1098/rsos.171085
  5. Premkumar T. Devanbu. 2018. Research Statement. Retrieved from www.cs.ucdavis.edu/~devanbu/research.pdf.
  6. Robert Dyer, Hoan Anh Nguyen, Hridesh Rajan, and Tien Nguyen. 2013. Boa: A language and infrastructure for analyzing ultra-large-scale software repositories. In Proceedings of the International Conference on Software Engineering (ICSE’13). DOI:https://doi.org/10.1109/ICSE.2013.6606588
  7. J. J. Faraway. 2016. Extending the Linear Model with R: Generalized Linear, Mixed Effects and Nonparametric Regression Models. CRC Press.
  8. Dror G. Feitelson. 2015. From repeatability to reproducibility and corroboration. SIGOPS Oper. Syst. Rev. 49, 1 (Jan. 2015). DOI:https://doi.org/10.1145/2723872.2723875
  9. Omar S. Gómez, Natalia Juristo Juzgado, and Sira Vegas. 2010. Replication types in experimental disciplines. In Proceedings of the Symposium on Empirical Software Engineering and Measurement (ESEM’10). DOI:https://doi.org/10.1145/1852786.1852790
  10. Garrett Grolemund and Hadley Wickham. 2017. R for Data Science. O’Reilly.
  11. Lewis G. Halsey, Douglas Curran-Everett, Sarah L. Vowler, and Gordon B. Drummond. 2015. The fickle p-value generates irreproducible results. Nat. Methods 12 (2015). DOI:https://doi.org/10.1038/nmeth.3288
  12. Kim Herzig, Sascha Just, and Andreas Zeller. 2013. It’s not a bug, it’s a feature: How misclassification impacts bug prediction. In Proceedings of the International Conference on Software Engineering (ICSE’13). DOI:https://doi.org/10.1109/ICSE.2013.6606585
  13. John Ioannidis. 2005. Why most published research findings are false. PLoS Med. 2, 8 (2005). DOI:https://doi.org/10.1371/journal.pmed.0020124
  14. George Klees, Andrew Ruef, Benji Cooper, Shiyi Wei, and Michael Hicks. 2018. Evaluating fuzz testing. In Proceedings of the Conference on Computer and Communications Security (CCS’18). DOI:https://doi.org/10.1145/3243734.3243804
  15. Paul Krill. 2014. Functional languages rack up best scores for software quality. InfoWorld (Nov. 2014). https://www.infoworld.com/article/2844268/functional-languages-rack-up-best-scores-software-quality.html
  16. Shriram Krishnamurthi and Jan Vitek. 2015. The real software crisis: Repeatability as a core value. Commun. ACM 58, 3 (2015). DOI:https://doi.org/10.1145/2658987
  17. Michael H. Kutner, John Neter, Christopher J. Nachtsheim, and William Li. 2004. Applied Linear Statistical Models. McGraw–Hill Education, New York, NY.
  18. Crista Lopes, Petr Maj, Pedro Martins, Di Yang, Jakub Zitny, Hitesh Sajnani, and Jan Vitek. 2017. Déjà Vu: A map of code duplicates on GitHub. In Proceedings of the ACM SIGPLAN International Conference on Object-Oriented Programming, Systems, Languages, and Applications (OOPSLA’17). DOI:https://doi.org/10.1145/3133908
  19. Audris Mockus and Lawrence Votta. 2000. Identifying reasons for software changes using historic databases. In Proceedings of the International Conference on Software Maintenance (ICSM’00). DOI:https://doi.org/10.1109/ICSM.2000.883028
  20. Martin Monperrus. 2014. A critical review of “automatic patch generation learned from human-written patches”: Essay on the problem statement and the evaluation of automatic software repair. In Proceedings of the International Conference on Software Engineering (ICSE’14). DOI:https://doi.org/10.1145/2568225.2568324
  21. Sebastian Nanz and Carlo A. Furia. 2015. A comparative study of programming languages in Rosetta Code. In Proceedings of the International Conference on Software Engineering (ICSE’15). http://dl.acm.org/citation.cfm?id=2818754.2818848
  22. Roger Peng. 2011. Reproducible research in computational science. Science 334, 1226 (2011). DOI:https://doi.org/10.1126/science.1213847
  23. Dong Qiu, Bixin Li, Earl T. Barr, and Zhendong Su. 2017. Understanding the syntactic rule usage in Java. J. Syst. Softw. 123 (Jan. 2017), 160–172. DOI:https://doi.org/10.1016/j.jss.2016.10.017
  24. B. Ray and D. Posnett. 2016. A large ecosystem study to understand the effect of programming languages on code quality. In Perspectives on Data Science for Software Engineering. Morgan Kaufmann. DOI:https://doi.org/10.1016/B978-0-12-804206-9.00023-4
  25. Baishakhi Ray, Daryl Posnett, Premkumar T. Devanbu, and Vladimir Filkov. 2017. A large-scale study of programming languages and code quality in GitHub. Commun. ACM 60, 10 (2017). DOI:https://doi.org/10.1145/3126905
  26. Baishakhi Ray, Daryl Posnett, Vladimir Filkov, and Premkumar T. Devanbu. 2014. A large scale study of programming languages and code quality in GitHub. In Proceedings of the International Symposium on Foundations of Software Engineering (FSE’14). DOI:https://doi.org/10.1145/2635868.2635922
  27. Rolando P. Reyes, Oscar Dieste, Efraín R. Fonseca, and Natalia Juristo. 2018. Statistical errors in software engineering experiments: A preliminary literature review. In Proceedings of the International Conference on Software Engineering (ICSE’18). DOI:https://doi.org/10.1145/3180155.3180161
  28. Yuan Tian, Julia Lawall, and David Lo. 2012. Identifying Linux bug fixing patches. In Proceedings of the International Conference on Software Engineering (ICSE’12). DOI:https://doi.org/10.1109/ICSE.2012.6227176
  29. Jan Vitek and Tomas Kalibera. 2011. Repeatability, reproducibility, and rigor in systems research. In Proceedings of the International Conference on Embedded Software (EMSOFT’11), 33–38. DOI:https://doi.org/10.1145/2038642.2038650
  30. Ronald L. Wasserstein and Nicole A. Lazar. 2016. The ASA’s statement on p-values: Context, process, and purpose. Am. Stat. 70, 2 (2016). DOI:https://doi.org/10.1080/00031305.2016.1154108
  31. Jie Zhang, Feng Li, Dan Hao, Meng Wang, and Lu Zhang. 2018. How does bug-handling effort differ among different programming languages? CoRR abs/1801.01025 (2018). http://arxiv.org/abs/1801.01025


      • Published in

        ACM Transactions on Programming Languages and Systems, Volume 41, Issue 4
        December 2019
        186 pages
        ISSN: 0164-0925
        EISSN: 1558-4593
        DOI: 10.1145/3366632

        Copyright © 2019 ACM

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Publication History

        • Online: 12 October 2019
        • Published: 12 October 2019
        • Accepted: 1 June 2019
        • Revised: 1 May 2019
        • Received: 1 December 2018


        Qualifiers

        • research-article
        • Research
        • Refereed
