research-article

A context-free markup language for semi-structured text

Published: 05 June 2010

Abstract

An ad hoc data format is any nonstandard, semi-structured data format for which robust data processing tools are not easily available. In this paper, we present ANNE, a new kind of markup language designed to help users generate documentation and data processing tools for ad hoc text data. More specifically, given a new ad hoc data source, an ANNE programmer edits the document to add a number of simple annotations, which serve to specify its syntactic structure. Annotations include elements that specify constants, optional data, alternatives, enumerations, sequences, tabular data, and recursive patterns. The ANNE system uses a combination of user annotations and the raw data itself to extract a context-free grammar from the document. This context-free grammar can then be used to parse the data and transform it into an XML parse tree, which may be viewed through a browser for analysis or debugging purposes. In addition, the ANNE system generates a PADS/ML description, which may be saved as lasting documentation of the data format or compiled into a host of useful data processing tools.
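As a rough illustration of the grammar extraction step described above, the following Python sketch turns a small tree of annotations into context-free productions. It is a minimal sketch under stated assumptions, not the ANNE implementation: the node types `Const`, `Opt`, `Alt`, and `Seq`, the function `extract_grammar`, and the generated nonterminal names are all hypothetical, and the sketch uses only the annotations, whereas the abstract says the real system also draws on the raw data itself.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple, Union

# Hypothetical annotation nodes; these names are illustrative only and
# are not ANNE's actual element names.
@dataclass
class Const:
    text: str                 # literal text that must appear verbatim

@dataclass
class Opt:
    body: "Node"              # data that may be absent

@dataclass
class Alt:
    choices: List["Node"]     # exactly one of these appears

@dataclass
class Seq:
    items: List["Node"]       # all of these appear, in order

Node = Union[Const, Opt, Alt, Seq]
Rules = Dict[str, List[List[str]]]

def extract_grammar(root: Node) -> Tuple[str, Rules]:
    """Return (start symbol, productions). Each nonterminal maps to a list
    of alternatives; each alternative is a list of symbols. Terminals are
    quoted strings; an empty alternative stands for the empty string."""
    rules: Rules = {}

    def fresh() -> str:
        name = f"N{len(rules)}"
        rules[name] = []      # reserve the name before visiting children
        return name

    def visit(node: Node) -> str:
        if isinstance(node, Const):
            return repr(node.text)            # terminal symbol
        name = fresh()
        if isinstance(node, Opt):
            rules[name] = [[visit(node.body)], []]
        elif isinstance(node, Alt):
            rules[name] = [[visit(c)] for c in node.choices]
        else:                                  # Seq
            rules[name] = [[visit(i) for i in node.items]]
        return name

    return visit(root), rules

# Hypothetical annotated record: the constant "id=", then "0" or "1",
# then an optional ";".
doc = Seq([Const("id="), Alt([Const("0"), Const("1")]), Opt(Const(";"))])
start, rules = extract_grammar(doc)
for lhs, alternatives in rules.items():
    print(lhs, "->", " | ".join(" ".join(a) or "ε" for a in alternatives))
```

Each annotation becomes one nonterminal, so the resulting grammar mirrors the shape of the annotated document. Enumerations, tabular data, and recursive patterns, which the abstract also lists, are left out of this sketch.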

In addition to designing and implementing ANNE, we have devised a semantic theory for the core elements of the language. This semantic theory describes the editing process, which translates a raw, unannotated text document into an annotated document, and the grammar extraction process, which generates a context-free grammar from an annotated document. We also present an alternative characterization of system behavior by drawing upon ideas from the field of relevance logic. This secondary characterization, which we call relevance analysis, specifies a direct relationship between unannotated documents and the context-free grammars that our system can generate from them. Relevance analysis allows us to prove important theorems concerning the expressiveness and utility of our system.

    • Published in

      ACM SIGPLAN Notices, Volume 45, Issue 6
      PLDI '10
      June 2010
      496 pages
      ISSN: 0362-1340
      EISSN: 1558-1160
      DOI: 10.1145/1809028

      PLDI '10: Proceedings of the 31st ACM SIGPLAN Conference on Programming Language Design and Implementation
      June 2010
      514 pages
      ISBN: 9781450300193
      DOI: 10.1145/1806596

      Copyright © 2010 ACM

      Publisher

      Association for Computing Machinery

      New York, NY, United States
