Abstract
We are currently witnessing the rapid evolution and adoption of data science frameworks that operate outside the database. Support for data science applications in conventional RDBMS implementations has been limited to procedural paradigms such as user-defined functions (UDFs), which lack support for exploratory programming. As a result, during the exploratory phase data scientists usually treat the database system as the "data storage" layer of the data science framework, and most of the computation and analysis is performed outside the database, e.g., at the client node. We demonstrate AIDA, an in-database framework for data scientists. AIDA allows users to write interactive Python code in a development environment such as a Jupyter notebook. The execution itself takes place inside the database (near-data), where a server component of AIDA, which resides inside the embedded Python interpreter of the RDBMS, manages the data sets and computations. The demonstration will also show the visualization capabilities of AIDA, in which the progress of a computation can be observed through live updates. Our evaluations show that AIDA is several times faster than contemporary external data science frameworks and much easier to use for exploratory development than database UDFs.