Abstract
This paper presents a new technique for automatically synthesizing SQL queries from natural language (NL). At the core of our technique is a new NL-based program synthesis methodology that combines semantic parsing techniques from the NLP community with type-directed program synthesis and automated program repair. Starting with a program sketch obtained using standard parsing techniques, our approach involves an iterative refinement loop that alternates between probabilistic type inhabitation and automated sketch repair. We use the proposed idea to build an end-to-end system called SQLIZER that can synthesize SQL queries from natural language. Our method is fully automated, works for any database without requiring additional customization, and does not require users to know the underlying database schema. We evaluate our approach on over 450 natural language queries concerning three different databases, namely MAS, IMDB, and YELP. Our experiments show that the desired query is ranked within the top 5 candidates in close to 90% of the cases and that SQLIZER outperforms NALIR, a state-of-the-art tool that won a best paper award at VLDB'14.
- I Androutsopoulos, G Ritchie, and P Thanisch. 1993. Masque/sql: An Efficient and Portable Natural Language Query Interface for Relational Databases. Tech report, University of Edinburgh (1993).Google Scholar
- Ion Androutsopoulos, Graeme D. Ritchie, and Peter Thanisch. 1995. Natural language interfaces to databases - An Introduction. Natural Language Engineering (1995).Google Scholar
- Thomas Ball, Mayur Naik, and Sriram K. Rajamani. 2003. From symptom to cause: localizing errors in counterexample traces. In POPL. 97–105. Google Scholar
Digital Library
- Daniel W. Barowy, Sumit Gulwani, Ted Hart, and Benjamin G. Zorn. 2015. FlashRelate: extracting relational data from semi-structured spreadsheets using examples. ACM, 218–228.Google Scholar
- Jonathan Berant, Andrew Chou, Roy Frostig, and Percy Liang. 2013. Semantic Parsing on Freebase from Question-Answer Pairs. In EMNLP. 1533–1544.Google Scholar
- Bob Carpenter. 1997. Type-logical semantics. MIT press.Google Scholar
- Alvin Cheung, Armando Solar-Lezama, and Samuel Madden. 2013. Optimizing database-backed applications with query synthesis. In PLDI. 3–14. Google Scholar
Digital Library
- E. F. Codd. 1974. Seven Steps to Rendezvous with the Casual User. In IFIP Working Conference Data Base Management. 179–200.Google Scholar
- Aditya Desai, Sumit Gulwani, Vineet Hingorani, Nidhi Jain, Amey Karkare, Mark Marron, Sailesh R, and Subhajit Roy. 2016. Program synthesis using natural language. In ICSE.Google Scholar
- Ramez Elmasri and Shamkant B. Navathe. 2011. Fundamentals of Database Systems. Addison-Wesley.Google Scholar
- Yu Feng, Ruben Martins, Jacob Van Geffen, Isil Dillig, and Swarat Chaudhuri. 2017a. Component-based Synthesis of Table Consolidation and Transformation Tasks from Examples. In Programming Language Design and Implementation. Google Scholar
Digital Library
- Yu Feng, Ruben Martins, Yuepeng Wang, Isil Dillig, and Thomas W. Reps. 2017b. Component-based synthesis for complex APIs. In POPL. 599–612. Google Scholar
Digital Library
- John K Feser, Swarat Chaudhuri, and Isil Dillig. 2015. Synthesizing data structure transformations from input-output examples. In PLDI. 229–239.Google Scholar
- Claire Le Goues, Michael Dewey-Vogt, Stephanie Forrest, and Westley Weimer. 2012. A systematic study of automated program repair. In ICSE. 3–13.Google Scholar
- Alex Groce and Willem Visser. 2003. What Went Wrong: Explaining Counterexamples. In SPIN.Google Scholar
- Barbara J. Grosz, Douglas E. Appelt, Paul A. Martin, and Fernando C. N. Pereira. 1987. TEAM: An Experiment in the Design of Transportable Natural-Language Interfaces. Artificial Intelligence 32, 2 (1987), 173–243. Google Scholar
Digital Library
- Sumit Gulwani. 2011. Automating string processing in spreadsheets using input-output examples. In ACM SIGPLAN Notices, Vol. 46. ACM, 317–330. Google Scholar
Digital Library
- Sumit Gulwani and Mark Marron. 2014. NLyze: interactive programming by natural language for spreadsheet data analysis and manipulation. In SIGMOD. Google Scholar
Digital Library
- Tihomir Gvero, Viktor Kuncak, Ivan Kuraj, and Ruzica Piskac. 2013. Complete completion using types and weights. In ACM SIGPLAN Notices , Vol. 48. ACM, 27–38. Google Scholar
Digital Library
- Gary G. Hendrix, Earl D. Sacerdoti, Daniel Sagalowicz, and Jonathan Slocum. 1978. Developing a Natural Language Interface to Complex Data. TODS 3, 2 (1978), 105–147. Google Scholar
Digital Library
- James A. Jones and Mary Jean Harrold. 2005. Empirical evaluation of the tarantula automatic fault-localization technique. In ASE. 273–282. Google Scholar
Digital Library
- James A. Jones, Mary Jean Harrold, and John T. Stasko. 2002. Visualization of test information to assist fault localization. In ICSE . 467–477. Google Scholar
Cross Ref
- Manu Jose and Rupak Majumdar. 2011a. Bug-Assist: assisting fault localization in ANSI-C programs. In CAV. Google Scholar
Cross Ref
- Manu Jose and Rupak Majumdar. 2011b. Cause clue clauses: error localization using maximum satisfiability. PLDI (2011).Google Scholar
Digital Library
- Rohit J. Kate and Raymond J. Mooney. 2006. Using String-Kernels for Learning Semantic Parsers. In ACL. Google Scholar
Digital Library
- Rohit J. Kate, Yuk Wah Wong, and Raymond J. Mooney. 2005. Learning to Transform Natural to Formal Languages. In AAAI. 1062–1068.Google Scholar
- Vu Le and Sumit Gulwani. 2014. FlashExtract: a framework for data extraction by examples. ACM, 542–553.Google Scholar
- Vu Le, Sumit Gulwani, and Zhendong Su. 2013. Smartsynth: Synthesizing smartphone automation scripts from natural language. In MobiSys. 193–206.Google Scholar
Digital Library
- Fei Li and H. V. Jagadish. 2014. Constructing an Interactive Natural Language Interface for Relational Databases. PVLDB 8, 1 (2014), 73–84. Google Scholar
Digital Library
- Yunyao Li, Huahai Yang, and H. V. Jagadish. 2006. Constructing a Generic Natural Language Interface for an XML Database. In EDBT. 737–754. Google Scholar
Digital Library
- Percy Liang and Christopher Potts. 2015. Bringing machine learning and compositional semantics together. Annual Review of Linguistics 1, 1 (2015), 355–376. Google Scholar
Cross Ref
- Fan Long and Martin Rinard. 2015. Staged program repair with condition synthesis. In ESEC/FSE. Google Scholar
Digital Library
- Fan Long and Martin Rinard. 2016. Automatic patch generation by learning correct code. In POPL. Google Scholar
Digital Library
- Bill MacCartney and Christopher D Manning. 2009. An extended model of natural logic. In IWCS. 140–156.Google Scholar
- Christopher D. Manning, Mihai Surdeanu, John Bauer, Jenny Finkel, Steven J. Bethard, and David McClosky. 2014. The Stanford CoreNLP Natural Language Processing Toolkit. In ACL System Demonstrations. 55–60. Google Scholar
Cross Ref
- Tomas Mikolov, Ilya Sutskever, Kai Chen, Gregory S. Corrado, and Jeffrey Dean. 2013. Distributed Representations of Words and Phrases and their Compositionality. In NIPS.Google Scholar
- Scott Miller, David Stallard, Robert J. Bobrow, and Richard M. Schwartz. 1996. A Fully Statistical Approach to Natural Language Interfaces. In ACL. 55–61. Google Scholar
Digital Library
- Hoang Duong Thien Nguyen, Dawei Qi, Abhik Roychoudhury, and Satish Chandra. 2013. SemFix: program repair via semantic analysis. In ICSE. 772–781.Google Scholar
- Peter-Michael Osera and Steve Zdancewic. 2015. Type- and example-directed program synthesis. In PLDI.Google Scholar
- Nadia Polikarpova, Ivan Kuraj, and Armando Solar-Lezama. 2016. Program synthesis from polymorphic refinement types. In PLDI. 522–538. Google Scholar
Digital Library
- Oleksandr Polozov and Sumit Gulwani. 2015. FlashMeta: A framework for inductive program synthesis. ACM, 107–126.Google Scholar
- Ana-Maria Popescu, Alex Armanasu, Oren Etzioni, David Ko, and Alexander Yates. 2004. Modern Natural Language Interfaces to Databases: Composing Statistical Parsing with Semantic Tractability. In COLING.Google Scholar
- Ana-Maria Popescu, Oren Etzioni, and Henry A. Kautz. 2003. Towards a theory of natural language interfaces to databases. In IUI. 149–157.Google Scholar
- Chris Quirk, Raymond Mooney, and Michel Galley. 2015. Language to code: Learning semantic parsers for if-this-then-that recipes. In ACL. 878–888.Google Scholar
- Mohammad Raza, Sumit Gulwani, and Natasa Milic-Frayling. 2015. Compositional Program Synthesis from Natural Language and Examples. In IJCAI.Google Scholar
- Armando Solar Lezama. 2008. Program Synthesis By Sketching. Ph.D. Dissertation. EECS Department, University of California, Berkeley.Google Scholar
- Armando Solar-Lezama, Rodric M. Rabbah, Rastislav Bodík, and Kemal Ebcioglu. 2005. Programming by sketching for bit-streaming programs. In PLDI. Google Scholar
Digital Library
- Armando Solar-Lezama, Liviu Tancau, Rastislav Bodík, Sanjit A. Seshia, and Vijay A. Saraswat. 2006. Combinatorial sketching for finite programs. In ASPLOS. 404–415. Google Scholar
Digital Library
- Lappoon R Tang and Raymond J Mooney. 2000. Automated construction of database interfaces: Integrating statistical and relational learning for semantic parsing. In EMNLP. 133–141.Google Scholar
- Quoc Trung Tran, Chee-Yong Chan, and Srinivasan Parthasarathy. 2009. Query by output. In SIGMOD. 535–548. Google Scholar
Digital Library
- Chenglong Wang, Alvin Cheung, and Ras Bodik. 2017. Synthesizing Highly Expressive SQL Queries from Input-Output Examples. In Programming Language Design and Implementation. Google Scholar
Digital Library
- David H. D. Warren and Fernando C. N. Pereira. 1982. An Efficient Easily Adaptable System for Interpreting Natural Language Queries. American Journal of Computational Linguistics 8, 3-4 (1982), 110–122.Google Scholar
- Westley Weimer, ThanhVu Nguyen, Claire Le Goues, and Stephanie Forrest. 2009. Automatically finding patches using genetic programming. In ICSE. 364–374.Google Scholar
- Ben Wiedermann, Ali Ibrahim, and William R. Cook. 2008. Interprocedural query extraction for transparent persistence. In OOPSLA . 19–36. Google Scholar
Digital Library
- William A Woods, Ronald M Kaplan, and Bonnie Nash-Webber. 1972. The lunar sciences natural language information system. Bolt, Beranek and Newman.Google Scholar
- Navid Yaghmazadeh, Christian Klinger, Isil Dillig, and Swarat Chaudhuri. 2016. Synthesizing transformations on hierarchically structured data. In Proceedings of the 37th ACM SIGPLAN Conference on Programming Language Design and Implementation . ACM, 508–521. Google Scholar
Digital Library
- John M. Zelle and Raymond J. Mooney. 1993. Learning Semantic Grammars with Constructive Inductive Logic Programming. In AAAI. 817–822.Google Scholar
- John M. Zelle and Raymond J. Mooney. 1996. Learning to Parse Database Queries Using Inductive Logic Programming. In AAAI . 1050–1055.Google Scholar
- Sai Zhang and Yuyin Sun. 2013. Automatically synthesizing SQL queries from input-output examples. In ASE. Google Scholar
Digital Library
- Moshé M. Zloof. 1975. Query-by-Example: the Invocation and Definition of Tables and Forms. In VLDB. Google Scholar
Digital Library
Index Terms
SQLizer: query synthesis from natural language
Recommendations
Query Interoperation Among Object-Oriented and Relational Databases
ICDE '95: Proceedings of the Eleventh International Conference on Data EngineeringWe develop an efficient algorithm for the query interoperation among existing heterogeneous object-oriented and relational databases. Our algorithm utilizes a canonical deductive database as a uniform representation of object-oriented schema and data. ...
Type-directed synthesis of visualizations from natural language queries
We propose a new technique based on program synthesis for automatically generating visualizations from natural language queries. Our method parses the natural language query into a refinement type specification using the intents-and-slots paradigm and ...
Access to Indexed Hierarchical Databases Using a Relational Query Language
An efficient means of accessing indexed hierarchical databases using a relational query language is presented. The purpose is to achieve an effective sharing of heterogeneous distributed databases. Translation of hierarchical data to an equivalent ...






Comments