ABSTRACT
An intrinsic part of information extraction is the creation and manipulation of relations extracted from text. In this paper, we develop a foundational framework where the central construct is what we call a spanner. A spanner maps an input string into relations over the spans (intervals specified by bounding indices) of the string. The focus of this paper is on the representation of spanners. Conceptually, there are two kinds of such representations. Spanners defined in a primitive representation extract relations directly from the input string; those defined in an algebra apply algebraic operations to the primitively represented spanners. This framework is driven by SystemT, an IBM commercial product for text analysis, where the primitive representation is that of regular expressions with capture variables.
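As a rough illustration of the spanner notion described above, the following Python sketch computes a one-variable relation of spans from an input string. It is a minimal sketch under stated assumptions, not SystemT's API or the paper's formal definition: the function name `single_variable_spanner` is hypothetical, spans are encoded as 0-based half-open `(start, end)` pairs, and the spanner is taken to select every span whose substring fully matches a given regular expression.

```python
import re
from typing import List, Tuple

# Assumption: a span is a 0-based, half-open interval (start, end) of the input string.
Span = Tuple[int, int]


def single_variable_spanner(pattern: str, text: str) -> List[Span]:
    """Return every span of `text` whose substring fully matches `pattern`.

    Unlike re.finditer, this brute-force enumeration also reports
    overlapping and nested spans.
    """
    compiled = re.compile(pattern)
    n = len(text)
    return [
        (i, j)
        for i in range(n + 1)
        for j in range(i, n + 1)
        if compiled.fullmatch(text[i:j])
    ]


spans = single_variable_spanner(r"a+b", "ab aab")
print(spans)
```

The enumeration takes quadratically many regex checks and is meant only to make the input-string-to-relation mapping concrete.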
We define additional types of primitive spanner representations by means of two kinds of automata that assign spans to variables. We prove that the first kind has the same expressive power as regular expressions with capture variables; the second kind expresses precisely the algebra of the regular spanners, that is, the closure of the first kind under standard relational operators. The core spanners extend the regular ones by string-equality selection (an extension used in SystemT). We give some fundamental results on the expressiveness of regular and core spanners. As an example, we prove that regular spanners are closed under difference (and complement), but core spanners are not. Finally, we establish connections with related notions in the literature.
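The algebraic layer can be sketched in the same spirit. The Python below is a hedged illustration, assuming a span relation is a list of dictionaries from variable names to 0-based half-open `(start, end)` spans. The names `natural_join` and `select_string_equal` are hypothetical. It shows one standard relational operator (natural join) and a string-equality selection, which keeps a row only if two of its spans cover equal substrings.

```python
from typing import Dict, List, Tuple

# Assumption: spans are 0-based, half-open (start, end) pairs; a row maps variables to spans.
Span = Tuple[int, int]
Row = Dict[str, Span]


def natural_join(left: List[Row], right: List[Row]) -> List[Row]:
    """Combine rows that agree on every variable they share."""
    joined = []
    for l_row in left:
        for r_row in right:
            shared = l_row.keys() & r_row.keys()
            if all(l_row[v] == r_row[v] for v in shared):
                joined.append({**l_row, **r_row})
    return joined


def select_string_equal(text: str, rows: List[Row], x: str, y: str) -> List[Row]:
    """Keep rows whose spans for x and y cover equal substrings of text."""
    def substring(span: Span) -> str:
        return text[span[0]:span[1]]

    return [row for row in rows if substring(row[x]) == substring(row[y])]


text = "ab cd ab"
word_spans = [(0, 2), (3, 5), (6, 8)]
x_rows = [{"x": s} for s in word_spans]
y_rows = [{"y": s} for s in word_spans]

pairs = natural_join(x_rows, y_rows)  # no shared variables, so this is a cross product
equal_pairs = select_string_equal(text, pairs, "x", "y")
print(equal_pairs)
```

The selection compares the substrings rather than the spans themselves, so the two occurrences of `ab` at different positions are paired with each other.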
Spanners: a formal framework for information extraction