Abstract
Big data analytics frameworks like Apache Spark and Flink enable users to implement queries over large, distributed databases using functional APIs. In recent years, these APIs have grown in popularity because their functional interfaces abstract away much of the minutiae of distributed programming required by traditional query languages like SQL. However, the convenience of these APIs comes at a cost because functional queries are often less efficient than their SQL counterparts. Motivated by this observation, we present a new technique for automatically transpiling functional queries to SQL. While our approach is based on the standard paradigm of counterexample-guided inductive synthesis, it uses a novel column-wise decomposition technique to split the synthesis task into smaller subquery synthesis problems. We have implemented this approach as a new tool called RDD2SQL for translating Spark RDD queries to SQL and empirically evaluate the effectiveness of RDD2SQL on a set of real-world RDD queries. Our results show that (1) most RDD queries can be translated to SQL, (2) our tool is very effective at automating this translation, and (3) performing this translation offers significant performance benefits.
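To make the translation target concrete, here is a minimal sketch of one query written twice: as a functional filter-then-reduce-by-key pipeline, and as the SQL query it corresponds to. It is an illustration only, not output of RDD2SQL. It assumes plain Python collections stand in for a Spark RDD and an in-memory SQLite database stands in for the SQL engine, and the `orders` data, `functional_query`, and `sql_query` names are hypothetical.

```python
import sqlite3
from functools import reduce
from itertools import groupby

# Hypothetical toy data: (customer, amount) order records.
orders = [("alice", 30), ("bob", 5), ("alice", 20), ("carol", 12), ("bob", 50)]


def functional_query(rows):
    """Functional pipeline: filter rows, then sum amounts per customer.

    Plain Python stands in for RDD operators such as filter and reduceByKey.
    """
    kept = filter(lambda r: r[1] >= 10, rows)
    # groupby only merges adjacent rows, so sort by key first.
    by_key = sorted(kept, key=lambda r: r[0])
    return {
        key: reduce(lambda acc, r: acc + r[1], group, 0)
        for key, group in groupby(by_key, key=lambda r: r[0])
    }


# The SQL counterpart of the pipeline above.
SQL_QUERY = """
    SELECT customer, SUM(amount)
    FROM orders
    WHERE amount >= 10
    GROUP BY customer
"""


def sql_query(rows):
    """Load the rows into an in-memory SQLite table and run SQL_QUERY."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (customer TEXT, amount INTEGER)")
    conn.executemany("INSERT INTO orders VALUES (?, ?)", rows)
    result = dict(conn.execute(SQL_QUERY).fetchall())
    conn.close()
    return result


functional_result = functional_query(orders)
sql_result = sql_query(orders)
# The two formulations should agree on every input table.
print(functional_result == sql_result)
```

Agreement on a single input is only a test, not a proof of equivalence. In a counterexample-guided loop, an input on which the two results differ would be the counterexample used to refine the candidate SQL query.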
Supplemental Material
Appendices
- A) Supported Spark RDD APIs
- B) Translating Target Language to SQL
- C) Proof of Theorem 5.4
- D) Proof of Theorem 5.6
- E) Proof of Theorem 5.7
- F) Proof of Theorem 5.8
Index Terms
Automated Translation of Functional Big Data Queries to SQL