research-article

Declarative visitors to ease fine-grained source code mining with full history on billions of AST nodes

Published: 27 October 2013

Abstract

Software repositories contain a vast wealth of information about software development. Mining these repositories has proven useful for tasks such as detecting patterns in software development and testing hypotheses for new software engineering approaches. Specifically, mining source code has yielded significant insights into software development artifacts and processes. Unfortunately, mining source code at a large scale remains a difficult task. Because of both human and computational scalability issues, previous approaches had to limit the scope of the projects studied, restrict the mining task to a coarser granularity, or give up studying the history of the code. In this paper we address the substantial challenges of mining source code: a) at a very large scale; b) at a fine-grained level of detail; and c) with full history information.

To address these challenges, we present domain-specific language features for source code mining. Our language features are inspired by object-oriented visitors and provide a default depth-first traversal strategy along with two expressions for defining custom traversals. We provide an implementation of these features in the Boa infrastructure for software repository mining and describe a code generation strategy into Java code. To show the usability of our domain-specific language features, we reproduced over 40 source code mining tasks from two large-scale previous studies in just 2 person-weeks. The resulting code for these tasks shows a 2.0x to 4.8x reduction in code size. Finally, we perform a small controlled experiment to gain insights into how easily mining tasks written using our language features can be understood with no prior training. We show that a substantial proportion of tasks (77%) were understood by study participants, in about 3 minutes per task.
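
To make the visitor idea above concrete, the following is a minimal Java sketch of a visitor that traverses depth-first by default and lets a task override that traversal. It is not Boa syntax and not the Java code the paper's generator emits. The `Node`, `Visitor`, and `VisitorSketch` types, the `before`/`after` hooks, and the convention that returning `false` from `before` suppresses the default traversal are all assumptions made for illustration. The abstract does not name the two custom-traversal expressions; here they are modeled as "stop the default traversal" and "explicitly visit a chosen child".

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical AST node for this sketch: a kind label plus ordered children.
final class Node {
    final String kind;
    final List<Node> children = new ArrayList<>();

    Node(String kind, Node... kids) {
        this.kind = kind;
        for (Node k : kids) children.add(k);
    }
}

// Visitor with a built-in depth-first traversal (an assumption of this sketch).
abstract class Visitor {
    // Runs before a node's children; returning false skips the default traversal.
    protected boolean before(Node n) { return true; }

    // Runs after a node's children; in this sketch it runs even when before() returned false.
    protected void after(Node n) {}

    public final void visit(Node n) {
        if (before(n)) {
            for (Node c : n.children) visit(c); // default: depth-first, in child order
        }
        after(n);
    }
}

final class VisitorSketch {
    // Default traversal: count every node of the given kind anywhere in the tree.
    static int countKind(Node root, String kind) {
        int[] count = {0};
        new Visitor() {
            @Override protected boolean before(Node n) {
                if (n.kind.equals(kind)) count[0]++;
                return true; // keep the default depth-first traversal
            }
        }.visit(root);
        return count[0];
    }

    // Custom traversal: count methods, ignoring fields and anything nested in a method body.
    static int countOuterMethods(Node root) {
        int[] count = {0};
        new Visitor() {
            @Override protected boolean before(Node n) {
                if (n.kind.equals("Class")) {
                    for (Node c : n.children) {
                        if (c.kind.equals("Method")) visit(c); // explicit visit of chosen children
                    }
                    return false; // suppress the default traversal of this class
                }
                if (n.kind.equals("Method")) {
                    count[0]++;
                    return false; // do not descend into the method body
                }
                return true;
            }
        }.visit(root);
        return count[0];
    }

    // Sample tree: a class with two methods and a field; the first method holds a nested class.
    static Node sample() {
        Node nested = new Node("Class", new Node("Method"));
        return new Node("Class", new Node("Method", nested), new Node("Method"), new Node("Field"));
    }

    public static void main(String[] args) {
        Node tree = sample();
        System.out.println(countKind(tree, "Method"));  // prints 3
        System.out.println(countOuterMethods(tree));    // prints 2
    }
}
```

In this sketch a mining task only states what to do at the nodes it cares about. The traversal itself is inherited unless the task opts out, which is the kind of boilerplate reduction the abstract attributes to the language features.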



Published in

ACM SIGPLAN Notices, Volume 49, Issue 3 (GPCE '13)
March 2014
181 pages
ISSN: 0362-1340
EISSN: 1558-1160
DOI: 10.1145/2637365

GPCE '13: Proceedings of the 12th international conference on Generative programming: concepts & experiences
October 2013
198 pages
ISBN: 9781450323734
DOI: 10.1145/2517208

Copyright © 2013 ACM

Publisher

Association for Computing Machinery
New York, NY, United States
