Abstract
Logic programming languages such as Datalog have become popular as domain-specific languages (DSLs) for solving large-scale, real-world problems, in particular static program analysis and network analysis. The logic specifications that model these analysis problems process millions of tuples and contain hundreds of highly recursive rules. As a result, they are notoriously difficult to debug. While the database community has proposed several data provenance techniques that address the Declarative Debugging Challenge for Databases, these state-of-the-art techniques do not scale to such analysis problems.
In this article, we introduce a novel bottom-up Datalog evaluation strategy for debugging. Our provenance evaluation strategy relies on a new provenance lattice that includes proof annotations and on a new fixed-point semantics for semi-naïve evaluation. A debugging query mechanism supports arbitrary provenance queries by constructing partial proof trees of minimal height for the queried tuples. We integrate our technique into Soufflé, a Datalog engine that synthesizes C++ code, and achieve high performance by using specialized parallel data structures. In experiments with DOOP/DaCapo, our technique produces proof annotations for tens of millions of output tuples. We show that our method has a runtime overhead of 1.31× on average while being more flexible than existing state-of-the-art techniques.
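To make the idea of proof annotations concrete, the following is a minimal Python sketch, not the Soufflé implementation. It assumes a toy two-rule reachability program and a hypothetical annotation consisting of a (proof height, rule label) pair; the names `Annotation`, `evaluate_path`, `explain`, and `tree_height` are invented for illustration. Semi-naïve evaluation records the annotation of the first derivation of each tuple, and a proof tree is then rebuilt on demand from those annotations.

```python
from collections import namedtuple

# Hypothetical annotation: the rule that first derived a tuple and the
# height of that proof tree (input facts are leaves of height 0).
Annotation = namedtuple("Annotation", ["height", "rule"])


def evaluate_path(edges):
    """Semi-naive evaluation of the toy program
         r1: path(x, y) :- edge(x, y).
         r2: path(x, z) :- path(x, y), edge(y, z).
    Returns a dict mapping each path tuple to its annotation."""
    edge_set = set(edges)
    path = {e: Annotation(1, "r1") for e in edge_set}
    delta = dict(path)  # tuples that were new in the previous iteration
    while delta:
        new = {}
        for (x, y), ann in delta.items():
            for (a, z) in edge_set:
                # Only tuples not seen before are kept, so the stored
                # annotation belongs to the earliest derivation.
                if a == y and (x, z) not in path:
                    new.setdefault((x, z), Annotation(ann.height + 1, "r2"))
        path.update(new)
        delta = new
    return path


def explain(x, z, path, edges):
    """Rebuild one proof tree for path(x, z) from the stored annotations."""
    ann = path[(x, z)]
    if ann.rule == "r1":
        return ("path", x, z, "r1", [("edge", x, z)])
    # Rule r2: look for a body instantiation whose recursive premise has a
    # strictly smaller height, which guarantees termination.
    for (y, target) in sorted(edges):
        if target == z and (x, y) in path and path[(x, y)].height < ann.height:
            return ("path", x, z, "r2",
                    [explain(x, y, path, edges), ("edge", y, z)])
    raise ValueError("no derivation found for path(%s, %s)" % (x, z))


def tree_height(node):
    """Height of a proof tree; edge facts are leaves of height 0."""
    if node[0] == "edge":
        return 0
    return 1 + max(tree_height(child) for child in node[4])


edges = [("a", "b"), ("b", "c"), ("c", "d"), ("a", "c")]
annotated = evaluate_path(edges)
tree = explain("a", "d", annotated, edges)
print(annotated[("a", "d")])
print(tree)
```

In this sketch, `path(a, d)` is explained through the shortcut edge `a -> c` rather than through the longer chain `a -> b -> c -> d`, because bottom-up evaluation first derives the tuple in the earliest possible iteration. Storing only a small annotation per tuple and rebuilding the tree lazily is the design choice being illustrated; the lattice and data structures used in the article itself are not reproduced here.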
Debugging Large-scale Datalog: A Scalable Provenance Evaluation Strategy