Abstract
Many data processing systems allow SQL queries that call user-defined functions (UDFs) written in conventional programming languages. While such SQL extensions provide convenience and flexibility to users, queries involving UDFs are not as efficient as their pure SQL counterparts that invoke SQL’s highly optimized built-in functions. Motivated by this problem, we propose a new technique for translating SQL queries with UDFs to pure SQL expressions. Unlike prior work in this space, our method is not based on syntactic rewrite rules and can handle a much more general class of UDFs. At a high level, our method is based on counterexample-guided inductive synthesis (CEGIS) but employs a novel compositional strategy that decomposes the synthesis task into simpler sub-problems. However, because there is no universal decomposition strategy that works for all UDFs, we propose a novel lazy inductive synthesis approach that generates a sequence of decompositions corresponding to increasingly harder inductive synthesis problems. Because most realistic UDF-to-SQL translation tasks are amenable to a fine-grained decomposition strategy, our lazy inductive synthesis method scales significantly better than traditional CEGIS.
We have implemented our proposed technique in a tool called CLIS for optimizing Spark SQL programs containing Scala UDFs. To evaluate CLIS, we manually study 100 randomly selected UDFs and find that 63 of them can be expressed in pure SQL. Our evaluation on these 63 UDFs shows that CLIS can automatically synthesize equivalent SQL expressions in 92% of the cases and that it solves 2.4× more benchmarks than a baseline that does not use our compositional approach. We also show that CLIS yields an average speed-up of 3.5× for individual UDFs and speed-ups of 1.3× to 3.1× in end-to-end application performance.
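For readers unfamiliar with CEGIS, the basic loop that the abstract builds on can be sketched as follows. This is an illustrative toy in Python, not the CLIS implementation: the `udf`, the three-entry `CANDIDATES` list, and the bounded `verify` check over a finite domain are all assumptions made for the example, and the compositional and lazy decomposition described in the abstract is not shown.

```python
from typing import Callable, Iterable, List, Optional, Tuple

# Hypothetical UDF to translate: clamps negative inputs to zero.
def udf(x: int) -> int:
    if x < 0:
        return 0
    return x

# Hypothetical candidate space: (SQL text, Python model of its semantics).
# A real synthesizer would enumerate a grammar instead of a fixed list.
CANDIDATES: List[Tuple[str, Callable[[int], int]]] = [
    ("x", lambda x: x),
    ("ABS(x)", lambda x: abs(x)),
    ("CASE WHEN x < 0 THEN 0 ELSE x END", lambda x: 0 if x < 0 else x),
]

Example = Tuple[int, int]

def synthesize(examples: List[Example]) -> Optional[Tuple[str, Callable[[int], int]]]:
    """Inductive step: return the first candidate consistent with all examples."""
    for sql, model in CANDIDATES:
        if all(model(i) == o for i, o in examples):
            return sql, model
    return None

def verify(model: Callable[[int], int], domain: Iterable[int]) -> Optional[int]:
    """Bounded check (an assumption): return an input where model and UDF differ."""
    for x in domain:
        if model(x) != udf(x):
            return x
    return None

def cegis(domain: Iterable[int]) -> Optional[str]:
    """Alternate between synthesis and verification until a candidate survives."""
    examples: List[Example] = []
    while True:
        candidate = synthesize(examples)
        if candidate is None:
            return None  # candidate space exhausted
        sql, model = candidate
        counterexample = verify(model, domain)
        if counterexample is None:
            return sql  # no disagreement found on the checked domain
        # Each counterexample rules out at least the current candidate.
        examples.append((counterexample, udf(counterexample)))

if __name__ == "__main__":
    print(cegis(range(-3, 4)))
```

Because the check here only covers a finite domain, a returned expression is equivalent to the UDF on that domain only; a full equivalence guarantee would need a real verifier.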
Index Terms
UDF to SQL translation through compositional lazy inductive synthesis