Kleenex: compiling nondeterministic transducers to deterministic streaming transducers

Published: 11 January 2016

Abstract

We present and illustrate Kleenex, a language for expressing general nondeterministic finite transducers, and its novel compilation to streaming string transducers with essentially optimal streaming behavior, worst-case linear-time performance and sustained high throughput. Its underlying theory is based on transducer decomposition into oracle and action machines: the oracle machine performs streaming greedy disambiguation of the input; the action machine performs the output actions. In use cases Kleenex achieves consistently high throughput rates around the 1 Gbps range on stock hardware. It performs well, especially in complex use cases, in comparison to both specialized and related tools such as GNU awk, GNU sed, GNU grep, RE2, Ragel and regular-expression libraries.
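
The oracle/action split described in the abstract can be illustrated with a small sketch. This is not Kleenex syntax and not the paper's algorithm: the tuple-based grammar encoding, the names `GRAMMAR`, `oracle` and `action`, and the use of a list of choice indices as the interface between the two machines are all assumptions made for illustration. In particular, the oracle here finds the greedy (leftmost-first) path by plain backtracking over the whole input, whereas the abstract describes a streaming oracle with worst-case linear-time behavior.

```python
# Hypothetical toy grammar (NOT Kleenex syntax): rewrite each "a" to "A", drop each "b".
# Terms: ("read", c) consumes c, ("write", s) emits s, ("seq", [...]),
# ("alt", [...]) is an ordered choice, ("call", name) refers to a named rule.
GRAMMAR = {
    "main": ("alt", [
        ("seq", [("read", "a"), ("write", "A"), ("call", "main")]),
        ("seq", [("read", "b"), ("call", "main")]),
        ("seq", []),  # stop
    ]),
}


def oracle(grammar, start, text):
    """Return the greedy list of choice indices accepting all of text, or None.

    Assumption: naive backtracking stands in for the streaming simulation.
    """
    def run(term, pos, choices, k):
        kind = term[0]
        if kind == "read":
            if pos < len(text) and text[pos] == term[1]:
                return k(pos + 1, choices)
            return None
        if kind == "write":  # output actions are irrelevant to the oracle
            return k(pos, choices)
        if kind == "call":
            return run(grammar[term[1]], pos, choices, k)
        if kind == "seq":
            def step(i, pos, choices):
                if i == len(term[1]):
                    return k(pos, choices)
                return run(term[1][i], pos, choices,
                           lambda p, c: step(i + 1, p, c))
            return step(0, pos, choices)
        if kind == "alt":
            # Greedy disambiguation: the first alternative that succeeds wins.
            for index, branch in enumerate(term[1]):
                result = run(branch, pos, choices + [index], k)
                if result is not None:
                    return result
            return None
        raise ValueError(f"unknown term kind: {kind}")

    accept = lambda pos, choices: choices if pos == len(text) else None
    return run(("call", start), 0, [], accept)


def action(grammar, start, choices):
    """Replay the oracle's choices, performing only the output actions."""
    remaining = iter(choices)
    out = []

    def run(term):
        kind = term[0]
        if kind == "write":
            out.append(term[1])
        elif kind == "call":
            run(grammar[term[1]])
        elif kind == "seq":
            for part in term[1]:
                run(part)
        elif kind == "alt":
            run(term[1][next(remaining)])
        elif kind != "read":  # reads only matter to the oracle
            raise ValueError(f"unknown term kind: {kind}")

    run(("call", start))
    return "".join(out)


path = oracle(GRAMMAR, "main", "abba")
print(path, action(GRAMMAR, "main", path))
```

The point of the split is that the action machine never inspects the input: everything it needs is in the oracle's choice sequence. The sketch recurses once per input symbol, so it is only suitable for short strings.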

Published in

ACM SIGPLAN Notices, Volume 51, Issue 1 (POPL '16), January 2016, 815 pages
ISSN: 0362-1340
EISSN: 1558-1160
DOI: 10.1145/2914770
Editor: Andy Gill

POPL '16: Proceedings of the 43rd Annual ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, January 2016, 815 pages
ISBN: 9781450335492
DOI: 10.1145/2837614

Copyright © 2016 ACM

Publisher: Association for Computing Machinery, New York, NY, United States
