DOI: 10.1145/2463664.2463665
research-article

Spanners: a formal framework for information extraction

Published: 22 June 2013

ABSTRACT

An intrinsic part of information extraction is the creation and manipulation of relations extracted from text. In this paper, we develop a foundational framework where the central construct is what we call a spanner. A spanner maps an input string into relations over the spans (intervals specified by bounding indices) of the string. The focus of this paper is on the representation of spanners. Conceptually, there are two kinds of such representations. Spanners defined in a primitive representation extract relations directly from the input string; those defined in an algebra apply algebraic operations to the primitively represented spanners. This framework is driven by SystemT, an IBM commercial product for text analysis, where the primitive representation is that of regular expressions with capture variables.
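As a minimal sketch of the idea that a spanner maps a string to a relation over its spans, the Python below enumerates every span whose substring fully matches a regular expression. The function name `spans_matching` is hypothetical, spans are written as 0-based half-open `(start, end)` pairs for convenience (an assumption of this sketch, not necessarily the paper's notation), and the brute-force enumeration is for illustration only; it is not how SystemT evaluates its rules.

```python
import re

def spans_matching(pattern, text):
    """Return every span (start, end) of `text` whose substring fully matches `pattern`.

    Spans are 0-based and half-open, so the substring is text[start:end].
    All qualifying spans are returned, including overlapping ones.
    """
    compiled = re.compile(pattern)
    n = len(text)
    return {
        (i, j)
        for i in range(n + 1)
        for j in range(i, n + 1)
        if compiled.fullmatch(text[i:j])
    }

text = "ab a"
relation = spans_matching(r"[a-z]+", text)
print(sorted(relation))  # [(0, 1), (0, 2), (1, 2), (3, 4)]
```

Unlike `re.finditer`, which reports only non-overlapping leftmost matches, this sketch returns the overlapping spans for "a", "ab", and "b" as separate tuples of a unary relation.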

We define additional types of primitive spanner representations by means of two kinds of automata that assign spans to variables. We prove that the first kind has the same expressive power as regular expressions with capture variables; the second kind expresses precisely the algebra of the regular spanners, that is, the closure of the first kind under standard relational operators. The core spanners extend the regular ones by string-equality selection (an extension used in SystemT). We give some fundamental results on the expressiveness of regular and core spanners. As an example, we prove that regular spanners are closed under difference (and complement), but core spanners are not. Finally, we establish connections with related notions in the literature.
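The algebraic layer described above can be sketched in Python by treating a spanner's output as a list of rows that map variable names to spans. Everything here is an illustrative assumption: the helper names (`unary_spanner`, `natural_join`, `project`, `select_string_equal`), the 0-based half-open span encoding, and the brute-force evaluation are not taken from the paper or from SystemT.

```python
import re
from itertools import product

def unary_spanner(var, pattern, text):
    """Primitive spanner: bind `var` to every span whose substring matches `pattern`."""
    compiled = re.compile(pattern)
    n = len(text)
    return [
        {var: (i, j)}
        for i in range(n + 1)
        for j in range(i, n + 1)
        if compiled.fullmatch(text[i:j])
    ]

def natural_join(left, right):
    """Combine rows that agree on every shared variable."""
    joined = []
    for a, b in product(left, right):
        if all(a[v] == b[v] for v in a.keys() & b.keys()):
            joined.append({**a, **b})
    return joined

def project(relation, variables):
    """Keep only `variables`, dropping duplicate rows."""
    kept = []
    for row in relation:
        reduced = {v: row[v] for v in variables}
        if reduced not in kept:
            kept.append(reduced)
    return kept

def select_string_equal(relation, x, y, text):
    """String-equality selection: keep rows whose x and y spans cover equal substrings."""
    def substring(span):
        return text[span[0]:span[1]]
    return [row for row in relation if substring(row[x]) == substring(row[y])]

text = "ab a"
words_x = unary_spanner("x", r"[a-z]+", text)
words_y = unary_spanner("y", r"[a-z]+", text)
pairs = natural_join(words_x, words_y)        # no shared variables, so a cross product
same_text = select_string_equal(pairs, "x", "y", text)
print(len(pairs), len(same_text))
```

The selection compares the substrings that two spans cover rather than the spans themselves, so the row with `x = (0, 1)` and `y = (3, 4)` survives because both spans cover "a". In the abstract's terms, join and projection belong to the regular spanners, while string-equality selection is the extension that yields the core spanners.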


Published in

PODS '13: Proceedings of the 32nd ACM SIGMOD-SIGACT-SIGAI symposium on Principles of database systems
June 2013
334 pages
ISBN: 9781450320665
DOI: 10.1145/2463664
General Chair: Richard Hull
Program Chair: Wenfei Fan

Copyright © 2013 ACM

Publisher: Association for Computing Machinery, New York, NY, United States

Acceptance Rates

PODS '13 paper acceptance rate: 24 of 97 submissions, 25%
Overall acceptance rate: 476 of 1,835 submissions, 26%
