research-article

Efficient Stream Provenance via Operator Instrumentation

Published: 07 August 2014

Abstract

Managing fine-grained provenance is a critical requirement for data stream management systems (DSMS), not only for addressing complex applications that require diagnostic capabilities and assurance, but also for providing advanced functionality, such as revision processing or query debugging. This article introduces a novel approach that uses operator instrumentation, that is, modifying the behavior of operators, to generate and propagate fine-grained provenance through several operators of a query network. In addition to applying this technique to compute provenance eagerly during query execution, we also study how to decouple provenance computation from query processing to reduce runtime overhead and avoid unnecessary provenance retrieval. Our proposals include computing a concise superset of the provenance (to allow lazily replaying a query and reconstructing its provenance) as well as lazy retrieval (to avoid unnecessary reconstruction of provenance). We develop stream-specific compression methods to reduce the computational and storage overhead of provenance generation and retrieval. Ariadne, our provenance-aware extension of the Borealis DSMS, implements these techniques. Our experiments confirm that Ariadne manages provenance with minor overhead and clearly outperforms query rewrite, the current state of the art.
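
The eager variant of operator instrumentation mentioned in the abstract can be pictured with a small sketch. The Python below is not Ariadne or Borealis code. All names (`StreamTuple`, `annotate_source`, `instrumented_filter`, `instrumented_window_sum`) are hypothetical, and representing provenance as a set of input tuple identifiers is an assumption made only for illustration. It shows operators that, besides computing their normal result, attach to each output tuple the identifiers of the input tuples that contributed to it.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet, Iterable, Iterator, List


@dataclass(frozen=True)
class StreamTuple:
    tid: int                    # hypothetical tuple identifier
    value: float
    provenance: FrozenSet[int]  # assumed representation: ids of contributing inputs


def annotate_source(values: Iterable[float]) -> Iterator[StreamTuple]:
    """Give every input tuple an id; its provenance is just itself."""
    for tid, value in enumerate(values):
        yield StreamTuple(tid, value, frozenset({tid}))


def instrumented_filter(
    stream: Iterable[StreamTuple], predicate: Callable[[float], bool]
) -> Iterator[StreamTuple]:
    """A filter only forwards tuples, so provenance passes through unchanged."""
    for t in stream:
        if predicate(t.value):
            yield t


def instrumented_window_sum(
    stream: Iterable[StreamTuple], size: int
) -> Iterator[StreamTuple]:
    """Tumbling-window sum whose output carries the union of its inputs' provenance."""
    window: List[StreamTuple] = []
    out_id = 0
    for t in stream:
        window.append(t)
        if len(window) == size:
            merged = frozenset().union(*(w.provenance for w in window))
            yield StreamTuple(out_id, sum(w.value for w in window), merged)
            out_id += 1
            window = []
    # A trailing, incomplete window is dropped in this sketch.


def run_example() -> List[StreamTuple]:
    source = annotate_source([1.0, 5.0, 7.0, 2.0, 9.0, 8.0])
    filtered = instrumented_filter(source, lambda v: v > 4.0)
    return list(instrumented_window_sum(filtered, size=2))
```

Calling `run_example()` yields two aggregate tuples, each of which can be traced back through both operators to the source tuples that produced it. The sketch covers only eager propagation. The lazy replay, lazy retrieval, and compression techniques named in the abstract are not modeled here.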

Published in

ACM Transactions on Internet Technology, Volume 14, Issue 1
Special Issue on Event Recognition
July 2014
161 pages
ISSN: 1533-5399
EISSN: 1557-6051
DOI: 10.1145/2659232
Editor: Munindar P. Singh

Copyright © 2014 ACM

Publisher

Association for Computing Machinery
New York, NY, United States

Publication History

• Published: 7 August 2014
• Accepted: 1 April 2014
• Revised: 1 March 2014
• Received: 1 November 2013

Qualifiers

• research-article
• Research
• Refereed