Debugging Large-scale Datalog: A Scalable Provenance Evaluation Strategy

Published: 17 April 2020

Abstract

Logic programming languages such as Datalog have become popular as Domain Specific Languages (DSLs) for solving large-scale, real-world problems, in particular, static program analysis and network analysis. The logic specifications that model analysis problems process millions of tuples of data and contain hundreds of highly recursive rules. As a result, they are notoriously difficult to debug. While the database community has proposed several data provenance techniques that address the Declarative Debugging Challenge for Databases, in the cases of analysis problems, these state-of-the-art techniques do not scale.

In this article, we introduce a novel bottom-up Datalog evaluation strategy for debugging: Our provenance evaluation strategy relies on a new provenance lattice that includes proof annotations and a new fixed-point semantics for semi-naïve evaluation. A debugging query mechanism allows arbitrary provenance queries, constructing partial proof trees of tuples with minimal height. We integrate our technique into Soufflé, a Datalog engine that synthesizes C++ code, and achieve high performance by using specialized parallel data structures. Experiments are conducted with DOOP/DaCapo, producing proof annotations for tens of millions of output tuples. We show that our method has a runtime overhead of 1.31× on average while being more flexible than existing state-of-the-art techniques.
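The idea of annotating tuples during semi-naïve evaluation and later rebuilding a minimal-height proof tree can be illustrated with a small sketch. The code below is not the Soufflé implementation. It assumes a toy two-rule transitive-closure program, uses the proof-tree height as the only annotation, and introduces the hypothetical helper names `evaluate`, `explain`, and `tree_height` purely for illustration.

```python
from typing import Dict, List, Set, Tuple

Pair = Tuple[int, int]
# A proof tree node: (relation name, tuple, list of sub-proofs).
Proof = Tuple[str, Pair, list]


def evaluate(edges: Set[Pair]) -> Dict[Pair, int]:
    """Semi-naive evaluation of the toy program (an assumption of this sketch):
         path(x, y) :- edge(x, y).
         path(x, z) :- path(x, y), edge(y, z).
    Each derived path tuple is annotated with the height of its shortest
    proof tree; input edge facts are treated as height 0."""
    height: Dict[Pair, int] = {e: 1 for e in edges}   # rule 1: edge height 0, plus 1
    delta: Dict[Pair, int] = dict(height)
    while delta:
        new: Dict[Pair, int] = {}
        for (x, y), h in delta.items():               # only tuples new in the last round
            for (y2, z) in edges:
                if y2 != y:
                    continue
                cand = max(h, 0) + 1                  # rule 2: 1 + max of premise heights
                t = (x, z)
                # Keep the first (and therefore smallest) annotation for a tuple.
                if t not in height and cand < new.get(t, cand + 1):
                    new[t] = cand
        height.update(new)
        delta = new
    return height


def explain(t: Pair, edges: Set[Pair], height: Dict[Pair, int]) -> Proof:
    """Rebuild one minimal-height proof tree for path tuple t by searching
    for premises whose annotation is exactly one less than t's annotation."""
    h = height[t]
    if h == 1 and t in edges:
        return ("path", t, [("edge", t, [])])
    x, z = t
    for (y, z2) in sorted(edges):
        if z2 == z and height.get((x, y)) == h - 1:
            return ("path", t, [explain((x, y), edges, height), ("edge", (y, z), [])])
    raise ValueError(f"no proof found for path{t}")


def tree_height(proof: Proof) -> int:
    """Height of a proof tree; leaves (input facts) have height 0."""
    children: List[Proof] = proof[2]
    return 0 if not children else 1 + max(tree_height(c) for c in children)


if __name__ == "__main__":
    edges = {(1, 2), (2, 3), (3, 4), (1, 3)}
    heights = evaluate(edges)
    print(explain((1, 4), edges, heights))
```

In this sketch, `path(1, 4)` has two derivations, one through `edge(1, 3)` and a longer one through `edge(1, 2)` and `edge(2, 3)`. Because the annotation records the smallest height seen, `explain` reconstructs the shorter proof without the evaluation ever storing full proof trees.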


Published in

ACM Transactions on Programming Languages and Systems, Volume 42, Issue 2
June 2020
286 pages
ISSN: 0164-0925
EISSN: 1558-4593
DOI: 10.1145/3395960

Copyright © 2020 ACM

Publisher

Association for Computing Machinery
New York, NY, United States

Publication History

• Published: 17 April 2020
• Accepted: 1 January 2020
• Revised: 1 November 2019
• Received: 1 March 2019
