Abstract
Static analysis has been used to find bugs in systems such as Linux for more than a decade. Most existing static analyses developed for these systems are simple checkers that find bugs by pattern matching. Although many sophisticated interprocedural analyses exist, few have been employed to improve checkers for systems code because of their complex implementations and poor scalability. In this paper, we revisit the scalability problem of interprocedural static analysis from a "Big Data" perspective: we turn sophisticated code analysis into Big Data analytics and leverage novel data processing techniques to solve this traditional programming language problem. We develop Graspan, a disk-based parallel graph system that uses an edge-pair centric computation model to compute dynamic transitive closures on very large program graphs.
We implement context-sensitive pointer/alias and dataflow analyses on Graspan. An evaluation of these analyses on large codebases such as Linux shows that their Graspan implementations scale to millions of lines of code and are much simpler than their original implementations. Moreover, we show that these analyses can be used to augment the existing checkers; these augmented checkers uncovered 132 new NULL pointer bugs and 1308 unnecessary NULL tests in Linux 4.4.0-rc5, PostgreSQL 8.3.9, and Apache httpd 2.2.18.
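The edge-pair centric idea mentioned in the abstract can be illustrated with a small in-memory sketch. This is not Graspan's implementation: it assumes the analysis is given as a grammar over edge labels, and it leaves out the disk-based partitioning and the parallelism entirely. The function name `closure` and the `unary`/`binary` grammar encodings are hypothetical, chosen only for this example. Whenever two consecutive edges `u -A-> v` and `v -B-> w` match a production `C ::= A B`, a new edge `u -C-> w` is added, and the process repeats until no new edge appears.

```python
from collections import defaultdict, deque


def closure(edges, unary, binary):
    """Grammar-guided transitive closure over labeled edges (illustrative only).

    edges:  iterable of (src, label, dst) triples
    unary:  dict mapping label A to a set of labels C, for productions C ::= A
    binary: dict mapping (A, B) to a set of labels C, for productions C ::= A B
    """
    result = set()
    out_edges = defaultdict(set)  # node -> {(label, dst)}
    in_edges = defaultdict(set)   # node -> {(label, src)}
    worklist = deque()

    def add(edge):
        # Record an edge once and schedule it for pairing.
        if edge not in result:
            src, label, dst = edge
            result.add(edge)
            out_edges[src].add((label, dst))
            in_edges[dst].add((label, src))
            worklist.append(edge)

    for edge in edges:
        add(edge)

    while worklist:
        src, label, dst = worklist.popleft()
        # Unary productions relabel a single edge.
        for new_label in unary.get(label, ()):
            add((src, new_label, dst))
        # Pair this edge with every edge leaving its destination.
        # list(...) copies the set because add() may mutate it.
        for next_label, next_dst in list(out_edges[dst]):
            for new_label in binary.get((label, next_label), ()):
                add((src, new_label, next_dst))
        # Pair every edge entering its source with this edge.
        for prev_label, prev_src in list(in_edges[src]):
            for new_label in binary.get((prev_label, label), ()):
                add((prev_src, new_label, dst))
    return result
```

For instance, with the toy grammar `T ::= e` and `T ::= T e`, calling `closure(edges, {"e": {"T"}}, {("T", "e"): {"T"}})` on a chain of `e` edges yields a `T` edge between every pair of nodes connected by a path, which is plain transitive closure. A pointer or dataflow analysis would supply a richer grammar over the same loop.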
Graspan: A Single-machine Disk-based Graph System for Interprocedural Static Analyses of Large-scale Systems Code
ASPLOS '17: Proceedings of the Twenty-Second International Conference on Architectural Support for Programming Languages and Operating Systems