Tuplex: robust, efficient analytics when Python rules

Published: 01 August 2019

Abstract

Spark has become the de facto industry standard execution engine for data preparation, cleaning, distributed machine learning, streaming, and warehousing over raw data. However, with the success of Python, the landscape is shifting again: there is strong demand for tools that integrate better with the Python ecosystem and avoid the impedance mismatch that Spark exhibits. In this paper, we demonstrate Tuplex (short for tuples and exceptions), a Python-native data preparation framework that allows users to develop and deploy pipelines faster and more robustly while providing bare-metal execution times through code compilation whenever possible.


Published in

Proceedings of the VLDB Endowment, Volume 12, Issue 12
August 2019
547 pages

Publisher

VLDB Endowment