Abstract
We present and illustrate Kleenex, a language for expressing general nondeterministic finite transducers, and its novel compilation to streaming string transducers with essentially optimal streaming behavior, worst-case linear-time performance, and sustained high throughput. The underlying theory is based on decomposing a transducer into an oracle machine and an action machine: the oracle machine performs streaming greedy disambiguation of the input; the action machine performs the output actions. In use cases, Kleenex achieves consistently high throughput rates around the 1 Gbps range on stock hardware. It performs well, especially in complex use cases, in comparison to both specialized and related tools such as GNU awk, GNU sed, GNU grep, RE2, Ragel, and regular-expression libraries.
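The oracle/action decomposition can be illustrated with a small Python sketch. This is not Kleenex's compilation scheme: the toy transducer, the names `TRANSITIONS`, `ACCEPTING`, `oracle`, `action` and `transduce` are all assumptions made for illustration, and the oracle here is a plain backtracking search, so it is neither streaming nor linear-time. It only shows the division of labor: the oracle reads the input and returns the highest-priority accepting path as a list of choice indices, and the action machine replays those choices to produce the output without looking at the input again.

```python
# Hypothetical toy transducer (not taken from the paper).
# state -> ordered list of (input symbol, output text, next state).
# Earlier entries have higher priority, which models greedy disambiguation.
TRANSITIONS = {
    "start": [
        ("a", "", "need_b"),   # choice 0: drop 'a', but only if a 'b' follows
        ("a", "a", "start"),   # choice 1: otherwise copy 'a'
        ("b", "b", "start"),   # choice 2: copy 'b'
    ],
    "need_b": [
        ("b", "X", "start"),   # choice 0: finish rewriting "ab" to "X"
    ],
}
ACCEPTING = {"start"}


def oracle(text, state="start", pos=0):
    """Return the highest-priority accepting path as choice indices, or None.

    Backtracking stand-in for the oracle machine: it tries transitions in
    priority order and keeps the first one that leads to acceptance.
    """
    if pos == len(text):
        return [] if state in ACCEPTING else None
    for index, (symbol, _output, target) in enumerate(TRANSITIONS[state]):
        if symbol == text[pos]:
            rest = oracle(text, target, pos + 1)
            if rest is not None:
                return [index] + rest
    return None


def action(path, state="start"):
    """Replay a path of choice indices and concatenate the output actions."""
    pieces = []
    for index in path:
        _symbol, output, state = TRANSITIONS[state][index]
        pieces.append(output)
    return "".join(pieces)


def transduce(text):
    """Compose the two machines: input -> path -> output."""
    path = oracle(text)
    return None if path is None else action(path)


if __name__ == "__main__":
    print(oracle("aab"))     # [1, 0, 0]
    print(transduce("aab"))  # aX
```

On the input `aab`, the first `a` cannot take choice 0 because no `b` follows it, so the oracle falls back to choice 1; the remaining `ab` is rewritten to `X`. Separating the path from the output in this way is what lets the two concerns, disambiguation and output, be handled by different machines.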
Kleenex: compiling nondeterministic transducers to deterministic streaming transducers
POPL '16: Proceedings of the 43rd Annual ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages