Abstract
Data filtering in spreadsheets is a common problem faced by millions of end-users. The task of data filtering requires a computational model that can separate intended positive and negative string instances. We present a system, FIDEX, that can efficiently learn desired data filtering expressions from a small set of positive and negative string examples.
Our approach rests on two key ideas. First, we design an expressive DSL to represent the disjunctive filter expressions needed for several real-world data filtering tasks. Second, we develop an efficient synthesis algorithm that incrementally learns consistent filter expressions in the DSL from very few positive and negative examples. A DAG-based data structure succinctly represents a large number of filter expressions, and two corresponding operators, intersection and subtraction, handle positive and negative examples respectively. FIDEX is able to learn data filters for 452 out of 460 real-world data filtering tasks in real time (0.22s), using only 2.2 positive string instances and 2.7 negative string instances on average.
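The roles of the intersection and subtraction operators can be illustrated with a minimal sketch. This is not FIDEX's DSL or its DAG representation: it assumes a hypothetical toy filter language with only `startswith`, `endswith`, and `contains` predicates, no disjunction, and it stores the candidate filters as an explicit Python set rather than a compact DAG. All names (`candidates`, `matches`, `learn`) are invented for illustration.

```python
from typing import Iterable, Set, Tuple

# A candidate filter is a (kind, text) pair. The three kinds below are a
# hypothetical toy language, not the DSL described in the paper.
Filter = Tuple[str, str]


def candidates(s: str) -> Set[Filter]:
    """Return every toy filter that accepts the string s."""
    out: Set[Filter] = set()
    for i in range(1, len(s) + 1):
        out.add(("startswith", s[:i]))   # every non-empty prefix
        out.add(("endswith", s[-i:]))    # every non-empty suffix
    for i in range(len(s)):
        for j in range(i + 1, len(s) + 1):
            out.add(("contains", s[i:j]))  # every non-empty substring
    return out


def matches(f: Filter, s: str) -> bool:
    """Evaluate a toy filter on a string."""
    kind, text = f
    if kind == "startswith":
        return s.startswith(text)
    if kind == "endswith":
        return s.endswith(text)
    return text in s


def learn(positives: Iterable[str], negatives: Iterable[str]) -> Set[Filter]:
    """Return all toy filters consistent with the examples."""
    positives = list(positives)
    if not positives:
        return set()
    space = candidates(positives[0])
    for p in positives[1:]:
        space &= candidates(p)   # intersection: keep filters accepting p
    for n in negatives:
        space -= candidates(n)   # subtraction: drop filters accepting n
    return space


# Usage: two positive examples and one negative example.
consistent = learn(["INV-001", "INV-204"], ["PO-001"])
print(sorted(consistent))
```

Each positive example shrinks the candidate space to filters that accept it, and each negative example removes filters that would also accept it, so the examples can be supplied one at a time. An explicit set grows quadratically with string length; a shared graph structure, as the abstract describes, is one way to avoid enumerating every candidate.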
FIDEX: filtering spreadsheet data using examples
OOPSLA 2016: Proceedings of the 2016 ACM SIGPLAN International Conference on Object-Oriented Programming, Systems, Languages, and Applications