Research Article · Open Access

Multi-modal program inference: a marriage of pre-trained language models and component-based synthesis

Published: 15 October 2021

Abstract

Multi-modal program synthesis refers to the task of synthesizing programs (code) from their specification given in different forms, such as a combination of natural language and examples. Examples provide a precise but incomplete specification, and natural language provides an ambiguous but more "complete" task description. Machine-learned pre-trained models (PTMs) are adept at handling ambiguous natural language, but struggle with generating syntactically and semantically precise code. Program synthesis techniques can generate correct code, often even from incomplete but precise specifications, such as examples, but they are unable to work with the ambiguity of natural language. We present an approach that combines PTMs with component-based synthesis (CBS): PTMs are used to generate candidate programs from the natural language description of the task, which are then used to guide the CBS procedure to find the program that matches the precise examples-based specification. We use our combination approach to instantiate multi-modal synthesis systems for two programming domains: the domain of regular expressions and the domain of CSS selectors. Our evaluation demonstrates the effectiveness of our domain-agnostic approach in comparison to a state-of-the-art specialized system, and the generality of our approach in providing multi-modal program synthesis from natural language and examples in different programming domains.
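As a loose illustration of the candidate-guided combination the abstract describes (not the paper's actual CBS algorithm), the following Python sketch works through the regular-expression domain: hypothetical PTM candidates supply a pool of components, an enumerative search composes them, and the precise input-output examples act as the correctness check. All names, the component set, and the toy task are assumptions made for this sketch.

```python
import itertools
import re

def matches_spec(pattern, positives, negatives):
    """Check a candidate regex against the precise examples-based specification:
    it must fully match every positive example and no negative example."""
    try:
        rx = re.compile(pattern)
    except re.error:  # syntactically invalid candidate
        return False
    return (all(rx.fullmatch(s) for s in positives)
            and not any(rx.fullmatch(s) for s in negatives))

def guided_search(components, positives, negatives, max_len=3):
    """Enumerate concatenations of components, smallest programs first,
    and return the first one consistent with all examples (or None)."""
    for n in range(1, max_len + 1):
        for combo in itertools.product(components, repeat=n):
            pattern = "".join(combo)
            if matches_spec(pattern, positives, negatives):
                return pattern
    return None

# Hypothetical PTM output for "three digits followed by lowercase letters".
# A real system would mine components from these candidates automatically;
# here the component pool is hand-picked for brevity.
ptm_candidates = [r"\d{3}[a-z]+", r"[0-9]+[a-zA-Z]*"]
components = [r"\d{3}", r"[0-9]+", r"[a-z]+", r"[a-zA-Z]*"]

result = guided_search(components,
                       positives=["123ab", "456xyz"],
                       negatives=["12ab", "abc"])
print(result)  # -> \d{3}[a-z]+
```

The point of the sketch is the division of labor: the PTM handles the ambiguous natural language (narrowing the component pool), while the enumerative search guarantees that whatever is returned satisfies the examples exactly.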


Supplemental Material

Auxiliary Presentation Video




• Published in

  Proceedings of the ACM on Programming Languages, Volume 5, Issue OOPSLA
  October 2021 · 2001 pages
  EISSN: 2475-1421
  DOI: 10.1145/3492349

  Copyright © 2021 Owner/Author

  Publisher: Association for Computing Machinery, New York, NY, United States
