Abstract
Cleaning spreadsheet data types is a common problem faced by millions of spreadsheet users. Data types such as date, time, name, and units are ubiquitous in spreadsheets, and cleaning transformations on these data types involve parsing and pretty-printing their string representations. This presents many challenges to users because cleaning such data requires background knowledge about the data itself, and the data is typically non-uniform, unstructured, and ambiguous. Spreadsheet systems and programming languages provide some UI-based and programmatic solutions to this problem, but these are either insufficient for users' needs or beyond their expertise. In this paper, we present a programming-by-example methodology for cleaning data types that learns the desired transformation from a few input-output examples. We propose a domain-specific language with probabilistic semantics that is parameterized by declarative data type definitions. The probabilistic semantics is based on three key aspects: (i) approximate predicate matching, (ii) joint learning of data type interpretation, and (iii) weighted branches. This probabilistic semantics enables the language to handle non-uniform, unstructured, and ambiguous data. We then present a synthesis algorithm that learns the desired program in this language from a set of input-output examples. We have implemented our algorithm as an Excel add-in and present its successful evaluation on 55 benchmark problems obtained from online help forums and the Excel product team.
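To make the idea of learning a transformation from input-output examples concrete, here is a minimal Python sketch. It is not the paper's language or synthesis algorithm: it is a brute-force toy with no probabilistic weighting, and every name in it (`FIELD_PREDICATES`, `SEPARATORS`, `learn`, `apply_program`) is hypothetical. It assumes dates are written as three numeric tokens and only illustrates how a declarative field description plus several examples can jointly resolve an ambiguous interpretation such as day/month versus month/day.

```python
import re
from itertools import permutations

# Hypothetical declarative description of a "date" data type:
# each field is paired with a validity predicate over integers.
FIELD_PREDICATES = {
    "day": lambda v: 1 <= v <= 31,
    "month": lambda v: 1 <= v <= 12,
    "year": lambda v: 1 <= v <= 9999,
}
SEPARATORS = ["-", "/", "."]


def tokens(text):
    """Split a date string into its numeric tokens (kept as strings)."""
    return re.findall(r"\d+", text)


def is_valid(text, in_order):
    """Check that reading the tokens in `in_order` satisfies every predicate."""
    toks = tokens(text)
    return len(toks) == 3 and all(
        FIELD_PREDICATES[field](int(tok)) for field, tok in zip(in_order, toks)
    )


def apply_program(program, text):
    """Parse `text` with the input field order, then print it in the output order."""
    in_order, out_order, sep = program
    values = dict(zip(in_order, tokens(text)))
    return sep.join(values[field] for field in out_order)


def learn(examples):
    """Enumerate every (input order, output order, separator) program
    that is consistent with all of the input-output examples."""
    programs = []
    for in_order in permutations(FIELD_PREDICATES):
        for out_order in permutations(FIELD_PREDICATES):
            for sep in SEPARATORS:
                program = (in_order, out_order, sep)
                if all(
                    is_valid(inp, in_order) and apply_program(program, inp) == out
                    for inp, out in examples
                ):
                    programs.append(program)
    return programs


# One example leaves the day/month reading ambiguous; a second one resolves it.
ambiguous = learn([("03/04/2015", "2015-04-03")])
resolved = learn([("03/04/2015", "2015-04-03"), ("25/12/2014", "2014-12-25")])
print(len(ambiguous), len(resolved))
print(apply_program(resolved[0], "07/08/2016"))
```

With a single example, both a day-first and a month-first reading of the input are consistent; the second example rules out the month-first reading because 25 is not a valid month. A realistic system would rank the remaining candidates rather than enumerate them exhaustively.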
Transforming spreadsheet data types using examples
POPL '16: Proceedings of the 43rd Annual ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages