ABSTRACT
An ad hoc data source is any semistructured data source for which useful data analysis and transformation tools are not readily available. Such data must be queried, transformed and displayed by systems administrators, computational biologists, financial analysts and hosts of others on a regular basis. In this paper, we demonstrate that it is possible to generate a suite of useful data processing tools, including a semistructured query engine, several format converters, a statistical analyzer and data visualization routines, directly from the ad hoc data itself, without any human intervention. The key technical contribution of the work is a multi-phase algorithm that automatically infers the structure of an ad hoc data source and produces a format specification in the PADS data description language. Programmers wishing to implement custom data analysis tools can use such descriptions to generate printing and parsing libraries for the data. Alternatively, our software infrastructure will push these descriptions through the PADS compiler, creating format-dependent modules that, when linked with format-independent algorithms for analysis and transformation, result in fully functional tools. We evaluate the performance of our inference algorithm, showing that it scales linearly in the size of the training data, completing in seconds, as opposed to the hours or days it takes to write a description by hand. We also evaluate the correctness of the algorithm, demonstrating that generating accurate descriptions often requires less than 5% of the available data.
From dirt to shovels: fully automatic tool generation from ad hoc data
POPL '08