Abstract
An ad hoc data format is any nonstandard, semi-structured data format for which robust data processing tools are not easily available. In this paper, we present ANNE, a new kind of markup language designed to help users generate documentation and data processing tools for ad hoc text data. More specifically, given a new ad hoc data source, an ANNE programmer edits the document to add a number of simple annotations, which serve to specify its syntactic structure. Annotations include elements that specify constants, optional data, alternatives, enumerations, sequences, tabular data, and recursive patterns. The ANNE system uses a combination of user annotations and the raw data itself to extract a context-free grammar from the document. This context-free grammar can then be used to parse the data and transform it into an XML parse tree, which may be viewed through a browser for analysis or debugging purposes. In addition, the ANNE system generates a PADS/ML description, which may be saved as lasting documentation of the data format or compiled into a host of useful data processing tools.
In addition to designing and implementing ANNE, we have devised a semantic theory for the core elements of the language. This semantic theory describes the editing process, which translates a raw, unannotated text document into an annotated document, and the grammar extraction process, which generates a context-free grammar from an annotated document. We also present an alternative characterization of system behavior by drawing upon ideas from the field of relevance logic. This secondary characterization, which we call relevance analysis, specifies a direct relationship between unannotated documents and the context-free grammars that our system can generate from them. Relevance analysis allows us to prove important theorems concerning the expressiveness and utility of our system.
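The annotate, extract, and parse pipeline described above can be illustrated with a toy sketch. It is not ANNE's syntax or implementation: the node classes (`Const`, `Token`, `Opt`, `Seq`), the functions `extract_grammar` and `parse`, and the sample log line are all hypothetical names chosen for illustration. The sketch covers only constants, optional data, and sequences, and its parser is a greedy matcher without backtracking rather than a general context-free parser.

```python
import re
import xml.etree.ElementTree as ET
from dataclasses import dataclass

# Hypothetical annotation nodes; these are NOT ANNE's real element names.
@dataclass
class Const:            # literal text that must appear verbatim
    text: str

@dataclass
class Token:            # a named field described by a regular expression
    name: str
    pattern: str

@dataclass
class Opt:              # optional data
    name: str
    body: object

@dataclass
class Seq:              # a named sequence of parts
    name: str
    parts: list

def extract_grammar(node, rules):
    """Record one production per named node; return the symbol that stands for node."""
    if isinstance(node, Const):
        return repr(node.text)
    if isinstance(node, Token):
        rules[node.name] = f"/{node.pattern}/"
    elif isinstance(node, Opt):
        rules[node.name] = f"{extract_grammar(node.body, rules)} | <empty>"
    elif isinstance(node, Seq):
        rules[node.name] = " ".join(extract_grammar(p, rules) for p in node.parts)
    else:
        raise TypeError(node)
    return node.name

def parse(node, text, pos=0):
    """Return (xml_element_or_None, new_pos) on success, or None on failure."""
    if isinstance(node, Const):
        ok = text.startswith(node.text, pos)
        return (None, pos + len(node.text)) if ok else None
    if isinstance(node, Token):
        m = re.compile(node.pattern).match(text, pos)
        if m is None:
            return None
        elem = ET.Element(node.name)
        elem.text = m.group(0)
        return elem, m.end()
    if isinstance(node, Opt):
        elem = ET.Element(node.name)
        result = parse(node.body, text, pos)
        if result is not None:          # body matched: consume it
            child, pos = result
            if child is not None:
                elem.append(child)
        return elem, pos                # body absent: empty element, pos unchanged
    if isinstance(node, Seq):
        elem = ET.Element(node.name)
        for part in node.parts:
            result = parse(part, text, pos)
            if result is None:
                return None
            child, pos = result
            if child is not None:
                elem.append(child)
        return elem, pos
    raise TypeError(node)

# An "annotated" log line: method, path, status, and an optional size field.
entry = Seq("entry", [
    Token("method", r"[A-Z]+"), Const(" "),
    Token("path", r"\S+"), Const(" "),
    Token("status", r"\d{3}"),
    Opt("size_opt", Seq("size_part", [Const(" "), Token("size", r"\d+")])),
])

rules = {}
start = extract_grammar(entry, rules)
tree, end = parse(entry, "GET /index.html 200 1024")
xml_text = ET.tostring(tree, encoding="unicode")
```

Here `rules` maps each nonterminal to its production (for example, `rules["entry"]` is a sequence of the field names and quoted constants), and `xml_text` holds the parse of the sample line as nested XML elements that could be opened in a browser.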
A context-free markup language for semi-structured text. In PLDI '10: Proceedings of the 31st ACM SIGPLAN Conference on Programming Language Design and Implementation.