Abstract
Managing fine-grained provenance is a critical requirement for data stream management systems (DSMS), not only for addressing complex applications that require diagnostic capabilities and assurance, but also for providing advanced functionality, such as revision processing or query debugging. This article introduces a novel approach that uses operator instrumentation, that is, modifying the behavior of operators, to generate and propagate fine-grained provenance through several operators of a query network. In addition to applying this technique to compute provenance eagerly during query execution, we also study how to decouple provenance computation from query processing to reduce runtime overhead and avoid unnecessary provenance retrieval. Our proposals include computing a concise superset of the provenance (to allow lazily replaying a query and reconstructing its provenance) as well as lazy retrieval (to avoid unnecessary reconstruction of provenance). We develop stream-specific compression methods to reduce the computational and storage overhead of provenance generation and retrieval. Ariadne, our provenance-aware extension of the Borealis DSMS, implements these techniques. Our experiments confirm that Ariadne manages provenance with minor overhead and clearly outperforms query rewrite, the current state of the art.
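As a rough illustration of the eager variant of operator instrumentation described above, the following Python sketch annotates every tuple with the set of source-tuple identifiers it derives from and has each operator propagate that annotation. It is a minimal sketch under stated assumptions, not Ariadne's or Borealis's actual implementation: the names `PTuple`, `source`, `instrumented_filter`, and `instrumented_window_sum` are hypothetical, and the count-based tumbling window is chosen only to keep the example short.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet, Iterable, Iterator, List


@dataclass(frozen=True)
class PTuple:
    """A stream tuple carrying its provenance annotation (hypothetical type)."""
    value: float
    provenance: FrozenSet[int]  # ids of the source tuples this tuple derives from


def source(values: Iterable[float]) -> Iterator[PTuple]:
    """Assign each input tuple an id; a source tuple's provenance is itself."""
    for tid, v in enumerate(values):
        yield PTuple(v, frozenset({tid}))


def instrumented_filter(pred: Callable[[float], bool],
                        stream: Iterable[PTuple]) -> Iterator[PTuple]:
    """Filter instrumented for provenance: the annotation passes through unchanged."""
    for t in stream:
        if pred(t.value):
            yield t


def instrumented_window_sum(size: int,
                            stream: Iterable[PTuple]) -> Iterator[PTuple]:
    """Tumbling count-based sum: the output provenance is the union of the
    provenance of all tuples in the window."""
    window: List[PTuple] = []
    for t in stream:
        window.append(t)
        if len(window) == size:
            prov = frozenset().union(*(w.provenance for w in window))
            yield PTuple(sum(w.value for w in window), prov)
            window = []


# A two-operator query network: filter, then windowed aggregation.
results = list(
    instrumented_window_sum(
        2,
        instrumented_filter(lambda v: v > 4, source([5, 1, 7, 9, 2, 8])),
    )
)
for r in results:
    print(r.value, sorted(r.provenance))
```

Each result names exactly the input tuples that contributed to it, even though the filter dropped some inputs in between. Carrying full identifier sets on every tuple is the kind of overhead that the lazy and compressed variants mentioned in the abstract aim to reduce.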
Index Terms
Efficient Stream Provenance via Operator Instrumentation