research-article

FIDEX: filtering spreadsheet data using examples

Published: 19 October 2016

Abstract

Data filtering in spreadsheets is a common problem faced by millions of end-users. The task of data filtering requires a computational model that can separate intended positive and negative string instances. We present a system, FIDEX, that can efficiently learn desired data filtering expressions from a small set of positive and negative string examples.

Our approach rests on two key ideas. First, we design an expressive DSL to represent the disjunctive filter expressions needed for several real-world data filtering tasks. Second, we develop an efficient synthesis algorithm that incrementally learns consistent filter expressions in the DSL from very few positive and negative examples. A DAG-based data structure succinctly represents a large number of filter expressions, and two operators defined on it, intersection and subtraction, handle positive and negative examples, respectively. FIDEX learns data filters for 452 out of 460 real-world data filtering tasks in real time (0.22s), using only 2.2 positive string instances and 2.7 negative string instances on average.
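To make the roles of intersection and subtraction concrete, here is a minimal sketch in Python. It is not FIDEX's DSL or its DAG-based algorithm: it assumes a toy filter language with only three atomic predicates (`startswith`, `endswith`, `contains`) and stores the candidate set explicitly rather than in a DAG. The names `candidates`, `matches`, and `learn` are hypothetical and chosen for illustration only.

```python
from typing import Iterable, Set, Tuple

# A toy atomic filter: (kind, literal), kind in {"startswith", "endswith", "contains"}.
# This is an assumed stand-in for a real filter DSL, not the one used by FIDEX.
Filter = Tuple[str, str]


def candidates(s: str) -> Set[Filter]:
    """Return every atomic filter (with a non-empty literal) that accepts s."""
    out: Set[Filter] = set()
    n = len(s)
    for i in range(1, n + 1):
        out.add(("startswith", s[:i]))      # every non-empty prefix
        out.add(("endswith", s[n - i:]))    # every non-empty suffix
    for i in range(n):
        for j in range(i + 1, n + 1):
            out.add(("contains", s[i:j]))   # every non-empty substring
    return out


def matches(f: Filter, s: str) -> bool:
    """Evaluate an atomic filter on a string."""
    kind, lit = f
    if kind == "startswith":
        return s.startswith(lit)
    if kind == "endswith":
        return s.endswith(lit)
    return lit in s


def learn(positives: Iterable[str], negatives: Iterable[str]) -> Set[Filter]:
    """Keep filters that accept every positive and reject every negative."""
    pos = list(positives)
    if not pos:
        raise ValueError("at least one positive example is required")
    space = candidates(pos[0])
    for p in pos[1:]:
        space &= candidates(p)   # intersection: must also accept this positive
    for neg in negatives:
        space -= candidates(neg)  # subtraction: drop anything accepting this negative
    return space


if __name__ == "__main__":
    learned = learn(["INV-001", "INV-204"], ["PO-001", "INVALID"])
    for f in sorted(learned):
        print(f)
```

Each positive example shrinks the candidate set by intersection, and each negative example removes the filters that would wrongly accept it. The explicit sets here grow quadratically with string length, which is the kind of blow-up a shared, succinct representation is meant to avoid.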


  • Published in

    ACM SIGPLAN Notices, Volume 51, Issue 10
    OOPSLA '16
    October 2016
    915 pages
    ISSN:0362-1340
    EISSN:1558-1160
    DOI:10.1145/3022671
    • ACM Conferences
      OOPSLA 2016: Proceedings of the 2016 ACM SIGPLAN International Conference on Object-Oriented Programming, Systems, Languages, and Applications
      October 2016
      915 pages
      ISBN:9781450344449
      DOI:10.1145/2983990

    Copyright © 2016 ACM

    Publisher

    Association for Computing Machinery

    New York, NY, United States

