Abstract
Software repositories contain a vast wealth of information about software development. Mining these repositories has proven useful for detecting patterns in software development, testing hypotheses for new software engineering approaches, etc. Specifically, mining source code has yielded significant insights into software development artifacts and processes. Unfortunately, mining source code at a large scale remains a difficult task. Because of both human and computational scalability issues, previous approaches had to limit the scope of the projects studied, limit the mining task to a more coarse-grained level, or give up studying the history of the code. In this paper we address the substantial challenges of mining source code: a) at a very large scale; b) at a fine-grained level of detail; and c) with full history information.
To address these challenges, we present domain-specific language features for source code mining. Our language features are inspired by object-oriented visitors and provide a default depth-first traversal strategy along with two expressions for defining custom traversals. We provide an implementation of these features in the Boa infrastructure for software repository mining and describe a code generation strategy into Java code. To show the usability of our domain-specific language features, we reproduced over 40 source code mining tasks from two large-scale previous studies in just 2 person-weeks. The resulting code for these tasks shows a 2.0x to 4.8x reduction in code size. Finally, we perform a small controlled experiment to gain insights into how easily mining tasks written using our language features can be understood with no prior training. We show that a substantial number of tasks (77%) were understood by study participants, in about 3 minutes per task.
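The idea of a visitor with a default depth-first traversal that a mining task can override may be easier to see in a small sketch. The Python below is not Boa syntax and not the paper's implementation; the `Node` type, the `before`/`after` hook names, and the convention of returning `False` to suppress the default traversal are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical AST node: a kind label plus ordered children."""
    kind: str
    children: list["Node"] = field(default_factory=list)


class Visitor:
    """Depth-first visitor with overridable before/after hooks (illustrative)."""

    def before(self, node: Node) -> bool:
        # Returning False suppresses the default traversal of node's children.
        return True

    def after(self, node: Node) -> None:
        # Called once the node (and any visited children) has been processed.
        pass

    def visit(self, node: Node) -> None:
        # Default strategy: depth-first, children in declaration order.
        if self.before(node):
            for child in node.children:
                self.visit(child)
        self.after(node)


class MethodCounter(Visitor):
    """Counts every method declaration, relying on the default traversal."""

    def __init__(self) -> None:
        self.count = 0

    def before(self, node: Node) -> bool:
        if node.kind == "Method":
            self.count += 1
        return True


class OuterMethodCounter(Visitor):
    """Custom traversal: does not descend into method bodies."""

    def __init__(self) -> None:
        self.count = 0

    def before(self, node: Node) -> bool:
        if node.kind == "Method":
            self.count += 1
            return False  # skip anything nested inside this method
        return True


# A toy tree: a class with two methods, one of which nests another class.
tree = Node("File", [
    Node("Class", [
        Node("Method", [Node("Class", [Node("Method")])]),
        Node("Method"),
    ]),
])

all_methods = MethodCounter()
all_methods.visit(tree)

outer_methods = OuterMethodCounter()
outer_methods.visit(tree)
```

In this sketch a task only states what to do at the node kinds it cares about; the traversal itself is inherited, and overriding it is a one-line decision inside the hook.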
Declarative visitors to ease fine-grained source code mining with full history on billions of AST nodes
GPCE '13: Proceedings of the 12th international conference on Generative programming: concepts & experiences