DOI: 10.1145/1328438.1328488 (POPL conference proceedings, research article)

From dirt to shovels: fully automatic tool generation from ad hoc data

Published: 07 January 2008

ABSTRACT

An ad hoc data source is any semistructured data source for which useful data analysis and transformation tools are not readily available. Such data must be queried, transformed and displayed by systems administrators, computational biologists, financial analysts and hosts of others on a regular basis. In this paper, we demonstrate that it is possible to generate a suite of useful data processing tools, including a semistructured query engine, several format converters, a statistical analyzer and data visualization routines, directly from the ad hoc data itself, without any human intervention. The key technical contribution of the work is a multi-phase algorithm that automatically infers the structure of an ad hoc data source and produces a format specification in the PADS data description language. Programmers wishing to implement custom data analysis tools can use such descriptions to generate printing and parsing libraries for the data. Alternatively, our software infrastructure will push these descriptions through the PADS compiler, creating format-dependent modules that, when linked with format-independent algorithms for analysis and transformation, result in fully functional tools. We evaluate the performance of our inference algorithm, showing it scales linearly in the size of the training data, completing in seconds, as opposed to the hours or days it takes to write a description by hand. We also evaluate the correctness of the algorithm, demonstrating that generating accurate descriptions often requires less than 5% of the available data.
