Abstract
We address the problem of learning a syntactic profile for a collection of strings, i.e. a set of regex-like patterns that succinctly describe the syntactic variations in the strings. Real-world datasets, typically curated from multiple sources, often contain data in various syntactic formats. Thus, any data processing task is preceded by the critical step of data format identification. However, manual inspection of data to identify the different formats is infeasible in standard big-data scenarios.
Prior techniques are restricted to a small set of pre-defined patterns (e.g. digits, letters, words etc.), and provide no control over granularity of profiles. We define syntactic profiling as a problem of clustering strings based on syntactic similarity, followed by identifying patterns that succinctly describe each cluster. We present a technique for synthesizing such profiles over a given language of patterns, that also allows for interactive refinement by requesting a desired number of clusters.
Using a state-of-the-art inductive synthesis framework, PROSE, we have implemented our technique as FlashProfile. Across 153 tasks over 75 large real datasets, we observe a median profiling time of only ∼ 0.7s. Furthermore, we show that access to syntactic profiles may allow for more accurate synthesis of programs, i.e. using fewer examples, in programming-by-example (PBE) workflows such as Flash Fill.
Supplemental Material
- Ziawasch Abedjan, Lukasz Golab, and Felix Naumann. 2015. Profiling Relational Data: A Survey. VLDB J. 24, 4 (2015), 557–581. Google Scholar
Digital Library
- Dana Angluin. 1987. Learning Regular Sets from Queries and Counterexamples. Inf. Comput. 75, 2 (1987), 87–106. Google Scholar
Digital Library
- David Arthur and Sergei Vassilvitskii. 2007. k-means++: The Advantages of Careful Seeding. In Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2007, New Orleans, Louisiana, USA, January 7-9, 2007, Nikhil Bansal, Kirk Pruhs, and Clifford Stein (Eds.). SIAM, 1027–1035. http://dl.acm.org/citation.cfm?id=1283383.1283494 Google Scholar
Digital Library
- Ataccama. 2017. Ataccama One Platform. https://www.ataccama.com/ .Google Scholar
- Alberto Bartoli, Giorgio Davanzo, Andrea De Lorenzo, Marco Mauri, Eric Medvet, and Enrico Sorio. 2012. Automatic Generation of Regular Expressions from Examples with Genetic Programming. In Genetic and Evolutionary Computation Conference, GECCO ’12, Philadelphia, PA, USA, July 7-11, 2012, Companion Material Proceedings, Terence Soule and Jason H. Moore (Eds.). ACM, 1477–1478. Google Scholar
Digital Library
- Osbert Bastani, Rahul Sharma, Alex Aiken, and Percy Liang. 2017. Synthesizing program input grammars. In Proceedings of the 38th ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI 2017, Barcelona, Spain, June 18-23, 2017, Albert Cohen and Martin T. Vechev (Eds.). ACM, 95–110. Google Scholar
Digital Library
- Arka Aloke Bhattacharya, Dezhi Hong, David E. Culler, Jorge Ortiz, Kamin Whitehouse, and Eugene Wu. 2015. Automated Metadata Construction to Support Portable Building Applications. In Proceedings of the 2nd ACM International Conference on Embedded Systems for Energy-Efficient Built Environments, BuildSys 2015, Seoul, South Korea, November 4-5, 2015, David Culler, Yuvraj Agarwal, and Rahul Mangharam (Eds.). ACM, 3–12. Google Scholar
Digital Library
- Christopher M. Bishop. 2016. Pattern Recognition and Machine Learning. Springer New York. http://www.worldcat.org/ oclc/1005113608Google Scholar
- Leo Breiman. 2001. Random Forests. Machine Learning 45, 1 (2001), 5–32. Google Scholar
Digital Library
- Graham Cormode, Minos N. Garofalakis, Peter J. Haas, and Chris Jermaine. 2012. Synopses for Massive Data: Samples, Histograms, Wavelets, Sketches. Foundations and Trends in Databases 4, 1-3 (2012), 1–294. Google Scholar
Digital Library
- Colin De la Higuera. 2010. Grammatical inference: learning automata and grammars. Cambridge University Press. Google Scholar
Digital Library
- Xin Luna Dong and Divesh Srivastava. 2013. Big Data Integration. PVLDB 6, 11 (2013), 1188–1189. http://www.vldb.org/ pvldb/vol6/p1188- srivastava.pdf Google Scholar
Digital Library
- Kevin Ellis and Sumit Gulwani. 2017. Learning to Learn Programs from Examples: Going Beyond Program Structure. In Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence, IJCAI 2017, Melbourne, Australia, August 19-25, 2017, Carles Sierra (Ed.). ijcai.org, 1638–1645. Google Scholar
Digital Library
- Kathleen Fisher, David Walker, and Kenny Qili Zhu. 2008. LearnPADS: automatic tool generation from ad hoc data. In Proceedings of the ACM SIGMOD International Conference on Management of Data, SIGMOD 2008, Vancouver, BC, Canada, June 10-12, 2008, Jason Tsong-Li Wang (Ed.). ACM, 1299–1302. Google Scholar
Digital Library
- Wael H Gomaa and Aly A Fahmy. 2013. A survey of text similarity approaches. International Journal of Computer Applications 68, 13 (April 2013), 13–18.Google Scholar
Cross Ref
- Google. 2017. OpenRefine: A free, open source, powerful tool for working with messy data. http://openrefine.org/ .Google Scholar
- Ronald L. Graham, Donald E. Knuth, and Oren Patashnik. 1994. Concrete Mathematics - A Foundation for Computer Science, 2nd Edition). Addison-Wesley. http://www.worldcat.org/oclc/992331503 Google Scholar
Digital Library
- Sumit Gulwani. 2011. Automating String Processing in Spreadsheets Using Input-Output Examples. In Proceedings of the 38th ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, POPL 2011, Austin, TX, USA, January 26-28, 2011. 317–330. Google Scholar
Digital Library
- Sumit Gulwani, Oleksandr Polozov, and Rishabh Singh. 2017. Program Synthesis. Foundations and Trends in Programming Languages 4, 1-2 (2017), 1–119.Google Scholar
Cross Ref
- Peter J. Haas, Jeffrey F. Naughton, S. Seshadri, and Lynne Stokes. 1995. Sampling-Based Estimation of the Number of Distinct Values of an Attribute. In VLDB’95, Proceedings of 21th International Conference on Very Large Data Bases, September 11-15, 1995, Zurich, Switzerland., Umeshwar Dayal, Peter M. D. Gray, and Shojiro Nishio (Eds.). Morgan Kaufmann, 311–322. http://www.vldb.org/conf/1995/P311.PDF Google Scholar
Digital Library
- Maria Halkidi, Yannis Batistakis, and Michalis Vazirgiannis. 2001. On Clustering Validation Techniques. J. Intell. Inf. Syst. 17, 2-3 (2001), 107–145. Google Scholar
Digital Library
- Steve Hanneke. 2014. Theory of Disagreement-Based Active Learning. Found. Trends Mach. Learn. 7, 2-3 (June 2014), 131–309. Google Scholar
Digital Library
- Yannis E. Ioannidis. 2003. The History of Histograms (abridged). In VLDB 2003, Proceedings of 29th International Conference on Very Large Data Bases, September 9-12, 2003, Berlin, Germany, Johann Christoph Freytag, Peter C. Lockemann, Serge Abiteboul, Michael J. Carey, Patricia G. Selinger, and Andreas Heuer (Eds.). Morgan Kaufmann, 19–30. http: //www.vldb.org/conf/2003/papers/S02P01.pdf Google Scholar
Digital Library
- Anil K. Jain, M. Narasimha Murty, and Patrick J. Flynn. 1999. Data Clustering: A Review. ACM Comput. Surv. 31, 3 (1999), 264–323. Google Scholar
Digital Library
- Panagiotis Karras and Nikos Mamoulis. 2007. The Haar+ Tree: A Refined Synopsis Data Structure. In Proceedings of the 23rd International Conference on Data Engineering, ICDE 2007, The Marmara Hotel, Istanbul, Turkey, April 15-20, 2007, Rada Chirkova, Asuman Dogac, M. Tamer Özsu, and Timos K. Sellis (Eds.). IEEE Computer Society, 436–445.Google Scholar
Cross Ref
- Dileep Kini and Sumit Gulwani. 2015. FlashNormalize: Programming by Examples for Text Normalization. In Proceedings of the Twenty-Fourth International Joint Conference on Artificial Intelligence, IJCAI 2015, Buenos Aires, Argentina, July 25-31, 2015, Qiang Yang and Michael Wooldridge (Eds.). AAAI Press, 776–783. http://ijcai.org/Abstract/15/115 Google Scholar
Digital Library
- Tessa Lau. 2009. Why Programming-By-Demonstration Systems Fail: Lessons Learned for Usable AI. AI Magazine 30, 4 (2009), 65–67.Google Scholar
Digital Library
- Vu Le and Sumit Gulwani. 2014. FlashExtract: a framework for data extraction by examples. In ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI ’14, Edinburgh, United Kingdom - June 09 - 11, 2014, Michael F. P. O’Boyle and Keshav Pingali (Eds.). ACM, 542–553. Google Scholar
Digital Library
- Vladimir I Levenshtein. 1966. Binary Codes Capable of Correcting Deletions, Insertions, and Reversals. In Soviet Physics Doklady, Vol. 10. 707–710. http://adsabs.harvard.edu/abs/1966SPhD...10..707LGoogle Scholar
- Yunyao Li, Rajasekar Krishnamurthy, Sriram Raghavan, Shivakumar Vaithyanathan, and H. V. Jagadish. 2008. Regular Expression Learning for Information Extraction. In 2008 Conference on Empirical Methods in Natural Language Processing, EMNLP 2008, Proceedings of the Conference, 25-27 October 2008, Honolulu, Hawaii, USA, A meeting of SIGDAT, a Special Interest Group of the ACL. ACL, 21–30. http://www.aclweb.org/anthology/D08- 1003 Google Scholar
Digital Library
- Henry Lieberman. 2001. Your wish is my command: Programming by example. Morgan Kaufmann.Google Scholar
- Steve Lohr. 2014. For Big-Data Scientists, ‘Janitor Work’ Is Key Hurdle to Insights. New York Times 17 (2014). https: //www.nytimes.com/2014/08/18/technology/for- big- data- scientists- hurdle- to- insights- is- janitor- work.htmlGoogle Scholar
- James MacQueen et al. 1967. Some methods for classification and analysis of multivariate observations. In Proceedings of the fifth Berkeley symposium on mathematical statistics and probability, Vol. 1. Oakland, CA, USA., 281–297.Google Scholar
- Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze. 2008. Introduction to information retrieval. Cambridge University Press. Google Scholar
Digital Library
- Arkady Maydanchik. 2007. Data Quality Assessment. Technics Publications. https://technicspub.com/ data- quality- assessment/ Google Scholar
Digital Library
- Mikaël Mayer, Gustavo Soares, Maxim Grechkin, Vu Le, Mark Marron, Oleksandr Polozov, Rishabh Singh, Benjamin G. Zorn, and Sumit Gulwani. 2015. User Interaction Models for Disambiguation in Programming by Example. In Proceedings of the 28th Annual ACM Symposium on User Interface Software & Technology, UIST 2015, Charlotte, NC, USA, November 8-11, 2015, Celine Latulipe, Bjoern Hartmann, and Tovi Grossman (Eds.). ACM, 291–301. Google Scholar
Digital Library
- Microsoft. 2017a. Azure Machine Learning By-Example Data Transform. https://www.youtube.com/watch?v=9KG0Sc2B2KI .Google Scholar
- Microsoft. 2017b. Data Transformations "By Example" in the Azure ML Workbench. https://blogs.technet.microsoft.com/ machinelearning/2017/09/25/by- example- transformations- in- the- azure- machine- learning- workbench/ .Google Scholar
- Microsoft. 2017c. Microsoft SQL Server Data Tools (SSDT). https://docs.microsoft.com/en- gb/sql/ssdt .Google Scholar
- Microsoft. 2017d. Program Synthesis using Examples SDK. https://microsoft.github.io/prose/ .Google Scholar
- José Oncina and Pedro García. 1992. Identifying regular languages in polynomial time. Advances in Structural and Syntactic Pattern Recognition 5, 99-108 (1992), 15–20.Google Scholar
Cross Ref
- Oleksandr Polozov and Sumit Gulwani. 2015. FlashMeta: A Framework for Inductive Program Synthesis. In Proceedings of the 2015 ACM SIGPLAN International Conference on Object-Oriented Programming, Systems, Languages, and Applications, OOPSLA 2015, part of SPLASH 2015, Pittsburgh, PA, USA, October 25-30, 2015, Jonathan Aldrich and Patrick Eugster (Eds.). ACM, 107–126. Google Scholar
Digital Library
- Vijayshankar Raman and Joseph M. Hellerstein. 2001. Potter’s Wheel: An Interactive Data Cleaning System. In VLDB 2001, Proceedings of 27th International Conference on Very Large Data Bases, September 11-14, 2001, Roma, Italy, Peter M. G. Apers, Paolo Atzeni, Stefano Ceri, Stefano Paraboschi, Kotagiri Ramamohanarao, and Richard T. Snodgrass (Eds.). Morgan Kaufmann, 381–390. http://www.vldb.org/conf/2001/P381.pdf Google Scholar
Digital Library
- Veselin Raychev, Pavol Bielik, Martin T. Vechev, and Andreas Krause. 2016. Learning programs from noisy data. In Proceedings of the 43rd Annual ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, POPL 2016, St. Petersburg, FL, USA, January 20 - 22, 2016, Rastislav Bodík and Rupak Majumdar (Eds.). ACM, 761–774. Google Scholar
Digital Library
- Rishabh Singh. 2016. BlinkFill: Semi-supervised Programming By Example for Syntactic String Transformations. PVLDB 9, 10 (2016), 816–827. http://www.vldb.org/pvldb/vol9/p816- singh.pdf Google Scholar
Digital Library
- Armando Solar-Lezama, Liviu Tancau, Rastislav Bodík, Sanjit A. Seshia, and Vijay A. Saraswat. 2006. Combinatorial sketching for finite programs. In Proceedings of the 12th International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS 2006, San Jose, CA, USA, October 21-25, 2006, John Paul Shen and Margaret Martonosi (Eds.). ACM, 404–415. Google Scholar
Digital Library
- Thorvald Sørensen. 1948. A method of establishing groups of equal amplitude in plant sociology based on similarity of species and its application to analyses of the vegetation on Danish commons. Biol. Skr. 5 (1948), 1–34.Google Scholar
- Borge Svingen. 1998. Learning Regular Languages using Genetic Programming. Proc. Genetic Programming (1998), 374–376.Google Scholar
- Andrei N Tikhonov. 1963. Solution of Incorrectly Formulated Problems and the Regularization Method. In Dokl. Akad. Nauk., Vol. 151. 1035–1038.Google Scholar
- Trifacta. 2017. Trifacta Wrangler. https://www.trifacta.com/products/wrangler/ .Google Scholar
- William E Winkler. 1999. The State of Record Linkage and Current Research Problems.Google Scholar
- Ian H Witten, Eibe Frank, Mark A Hall, and Christopher J Pal. 2017. Data Mining: Practical Machine Learning Tools and Techniques, 4th Edition. Elsevier Science & Technology. http://www.worldcat.org/oclc/1007085077 Google Scholar
Digital Library
- Rui Xu and Donald C. Wunsch II. 2005. Survey of Clustering Algorithms. IEEE Trans. Neural Networks 16, 3 (2005), 645–678. Google Scholar
Digital Library
- Kenny Qili Zhu, Kathleen Fisher, and David Walker. 2012. LearnPADS++ : Incremental Inference of Ad Hoc Data Formats. In Practical Aspects of Declarative Languages - 14th International Symposium, PADL 2012, Philadelphia, PA, USA, January 23-24, 2012. Proceedings (Lecture Notes in Computer Science), Claudio V. Russo and Neng-Fa Zhou (Eds.), Vol. 7149. Springer, 168–182. Google Scholar
Digital Library
Index Terms
FlashProfile: a framework for synthesizing data profiles
Recommendations
Synthesizing transformations on hierarchically structured data
PLDI '16This paper presents a new approach for synthesizing transformations on tree-structured data, such as Unix directories and XML documents. We consider a general abstraction for such data, called hierarchical data trees (HDTs) and present a novel example-...
Data Profiling: A Tutorial
SIGMOD '17: Proceedings of the 2017 ACM International Conference on Management of Datais to understand the dataset at hand and its metadata. The process of metadata discovery is known as data profiling. Profiling activities range from ad-hoc approaches, such as eye-balling random subsets of the data or formulating aggregation queries, to ...
Can reactive synthesis and syntax-guided synthesis be friends?
PLDI 2022: Proceedings of the 43rd ACM SIGPLAN International Conference on Programming Language Design and ImplementationWhile reactive synthesis and syntax-guided synthesis (SyGuS) have seen enormous progress in recent years, combining the two approaches has remained a challenge. In this work, we present the synthesis of reactive programs from Temporal Stream Logic ...






Comments