DOI: 10.1145/3076246.3076248
research-article

Versioning for End-to-End Machine Learning Pipelines

Published: 14 May 2017

Abstract

End-to-end machine learning pipelines that run in shared environments are challenging to implement. Production pipelines typically consist of multiple interdependent processing stages. Between stages, intermediate results are persisted to reduce redundant computation and to improve robustness. These results may take the form of datasets in data processing pipelines or of model coefficients in model training pipelines. Reusing persisted results improves efficiency but also creates complicated dependencies. Whenever a processing stage changes, whether through a code change or a parameter change, it becomes difficult to determine which datasets can be reused and which must be recomputed.
In this paper, we build upon previous work to produce derivations of datasets, ensuring that multiple versions of a pipeline can run in parallel while minimizing redundant computation. Our extensions include partial derivations to simplify navigation and reuse, explicit support for schema changes of pipelines, and a central registry of running pipelines to coordinate pipeline upgrades between teams.
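
The reuse problem described above can be illustrated with a minimal sketch. It is not the derivation scheme of the paper, whose details the abstract does not give. It assumes that a stage's output is fully determined by its code version, its parameters, and the identifiers of its upstream results, and it hashes those three things into an identifier. The function names `derivation_id` and `plan`, the three-stage pipeline, and the in-memory `store` are all hypothetical.

```python
import hashlib
import json


def derivation_id(code_version, params, input_ids):
    """Hash everything assumed to determine a stage's output into one identifier."""
    payload = json.dumps(
        {"code": code_version, "params": params, "inputs": list(input_ids)},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:16]


def plan(stages, store):
    """Return (ids, to_recompute) for stages listed in dependency order."""
    ids = {}
    to_recompute = []
    for name, spec in stages.items():
        upstream = [ids[dep] for dep in spec["deps"]]
        ids[name] = derivation_id(spec["code"], spec["params"], upstream)
        if ids[name] not in store:
            to_recompute.append(name)
    return ids, to_recompute


# Hypothetical three-stage pipeline: clean -> features -> train.
pipeline_v1 = {
    "clean": {"code": "1.0.0", "params": {}, "deps": []},
    "features": {"code": "1.0.0", "params": {"buckets": 1024}, "deps": ["clean"]},
    "train": {"code": "1.0.0", "params": {"l2": 0.1}, "deps": ["features"]},
}

store = set()  # stands in for the persisted intermediate results
ids_v1, todo_v1 = plan(pipeline_v1, store)
store.update(ids_v1.values())  # pretend every stage ran and was persisted

# Version 2 changes only one parameter of the feature stage.
pipeline_v2 = {name: dict(spec) for name, spec in pipeline_v1.items()}
pipeline_v2["features"] = {**pipeline_v1["features"], "params": {"buckets": 4096}}
ids_v2, todo_v2 = plan(pipeline_v2, store)

print(todo_v1)  # ['clean', 'features', 'train']
print(todo_v2)  # ['features', 'train']
```

Because the identifier of `clean` is unchanged in the second version, its persisted result can be shared, while `features` and everything downstream of it receive new identifiers and are recomputed. Both versions can therefore coexist in the same store without overwriting each other's results.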


Published In

DEEM'17: Proceedings of the 1st Workshop on Data Management for End-to-End Machine Learning
May 2017
36 pages
ISBN: 9781450350266
DOI: 10.1145/3076246

Publisher

Association for Computing Machinery

New York, NY, United States


Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

SIGMOD/PODS'17

Acceptance Rates

Overall Acceptance Rate 44 of 67 submissions, 66%
