Abstract
Code embedding, an emerging paradigm for source code analysis, has attracted much attention over the past few years. It aims to represent code semantics through distributed vector representations, which can be used to support a variety of program analysis tasks (e.g., code summarization and semantic labeling). However, existing code embedding approaches are intraprocedural and alias-unaware, and they ignore the asymmetric transitivity of the directed graphs abstracted from source code. As a result, they remain ineffective at preserving the structural information of code.
This paper presents Flow2Vec, a new code embedding approach that precisely preserves interprocedural program dependence (a.k.a. value-flows). By approximating the high-order proximity, i.e., the asymmetric transitivity of value-flows, Flow2Vec embeds control-flows and alias-aware data-flows of a program in a low-dimensional vector space. Our value-flow embedding is formulated as matrix multiplication, which preserves context-sensitive transitivity through context-free language (CFL) reachability by filtering out infeasible value-flow paths. We have evaluated Flow2Vec using 32 popular open-source projects. Results from our experiments show that Flow2Vec successfully boosts the performance of two recent code embedding approaches, code2vec and code2seq, for two client applications, i.e., code classification and code summarization. For code classification, Flow2Vec improves code2vec with an average increase of 21.2%, 20.1% and 20.7% in precision, recall and F1, respectively. For code summarization, Flow2Vec outperforms code2seq by an average of 13.2%, 18.8% and 16.0% in precision, recall and F1, respectively.
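To make "high-order proximity" and "asymmetric transitivity" concrete, the following minimal sketch computes a truncated Katz-style proximity over a tiny directed graph by repeated matrix multiplication. It is an illustration only, not the Flow2Vec algorithm: the three-node chain, the decay factor `beta`, the path-length bound `max_len`, and the function names are all assumptions made for this example, and the CFL-reachability filtering of infeasible paths is not modeled.

```python
def mat_mul(x, y):
    """Multiply two square matrices given as lists of lists."""
    n = len(x)
    return [[sum(x[i][k] * y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def high_order_proximity(adj, beta=0.5, max_len=3):
    """Truncated Katz-style proximity: sum of beta^k * adj^k for k = 1..max_len.

    beta and max_len are illustrative parameters, not values from the paper.
    """
    n = len(adj)
    total = [[0.0] * n for _ in range(n)]
    power = [row[:] for row in adj]  # adj^1
    weight = beta
    for _ in range(max_len):
        for i in range(n):
            for j in range(n):
                total[i][j] += weight * power[i][j]
        power = mat_mul(power, adj)  # next power of the adjacency matrix
        weight *= beta               # longer paths contribute less
    return total


# Hypothetical value-flow chain p -> q -> r (indices 0, 1, 2).
adj = [
    [0, 1, 0],
    [0, 0, 1],
    [0, 0, 0],
]
prox = high_order_proximity(adj)
print(prox[0][2])  # p reaches r through a path of length 2
print(prox[2][0])  # r does not reach p, so the proximity is zero
```

The proximity from `p` to `r` is nonzero while the proximity from `r` to `p` is zero, which is the asymmetry that a symmetric similarity measure would lose. A low-dimensional embedding in this style would then approximate such a proximity matrix with separate source and target vectors per node.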
Index Terms
Flow2Vec: value-flow-based precise code embedding