Abstract
Spark has become the de facto industry standard as an execution engine for data preparation, cleaning, distributed machine learning, streaming, and warehousing over raw data. However, with the success of Python, the landscape is shifting again: there is strong demand for tools that integrate better with the Python ecosystem and avoid the impedance mismatch that Spark introduces. In this paper, we demonstrate Tuplex (short for tuples and exceptions), a Python-native data preparation framework that allows users to develop and deploy pipelines faster and more robustly, while providing bare-metal execution times through code compilation whenever possible.
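As a rough illustration of the "tuples and exceptions" idea behind the name, the plain-Python sketch below runs a user-defined function over rows and keeps failing rows apart from successful results instead of aborting the whole pipeline. This is not Tuplex's API: the helper `map_with_exceptions`, the sample rows, and the parsing UDF are all hypothetical, and the sketch involves no compilation.

```python
from typing import Any, Callable, Iterable, List, Tuple

Row = Tuple[Any, ...]


def map_with_exceptions(
    rows: Iterable[Row],
    udf: Callable[[Row], Row],
) -> Tuple[List[Row], List[Tuple[Row, Exception]]]:
    """Apply udf to each row; collect failing rows instead of stopping.

    Hypothetical helper for illustration only, not part of any library.
    """
    results: List[Row] = []
    failures: List[Tuple[Row, Exception]] = []
    for row in rows:
        try:
            results.append(udf(row))
        except Exception as exc:
            # Keep the offending row and its exception for later inspection.
            failures.append((row, exc))
    return results, failures


# Made-up sample data: the second row cannot be parsed as an integer.
rows = [("a", "1"), ("b", "x"), ("c", "3")]
results, failures = map_with_exceptions(rows, lambda r: (r[0], int(r[1])))
```

Here `results` holds the rows that were processed successfully, and `failures` pairs each rejected row with the exception it raised, so a user can fix the data or the UDF and rerun only what failed.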