Abstract
Regular expressions (regexes) are ubiquitous in modern software. Implementation techniques for regex matching fall roughly into two categories: (1) those relying on backtracking search and (2) those based on finite-state automata. The implementations that use backtracking are often chosen due to their ability to support advanced pattern-matching constructs. Unfortunately, they are known to suffer from severe performance problems. For some regular expressions, the running time for matching can be exponential in the size of the input text. To provide stronger guarantees of matching efficiency, automata-based regex matching is the preferred choice. However, even these regex engines may exhibit severe performance degradation for some patterns. The main reason for this is that regexes used in practice are not exclusively built from the classical regular constructs, i.e., concatenation, nondeterministic choice and Kleene's star. They involve additional constructs that provide succinctness and convenience of expression. The most common such construct is bounded repetition (also called counting), which describes the repetition of a pattern a fixed number of times.
In this paper, we propose new algorithms for the efficient matching of regular expressions that involve bounded repetition. Our algorithms are based on a new model of automata, which we call nondeterministic bit vector automata (NBVA). This model is chosen to be expressively equivalent to nondeterministic counter automata with bounded counters, a very natural model for expressing patterns with bounded repetition. We show that there is a class of regular expressions with bounded repetition that can be matched in time that is independent of the repetition bounds. Our algorithms are general enough to cover the vast majority of challenging bounded repetitions that arise in practice. We provide an implementation of our approach in a regex engine, which we call BVA-Scan. We compare BVA-Scan against state-of-the-art regex engines on several real datasets.
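To give some intuition for why bit vectors are a natural fit for bounded repetition, the following minimal sketch tracks all active counter values for the pattern `.*a.{n}` in a single integer used as a bit vector. It is not the NBVA construction or the BVA-Scan implementation from the paper; it is a classical shift-based simulation, and the function name `match_ends` and the choice of pattern are assumptions made purely for illustration.

```python
def match_ends(text: str, n: int) -> list[int]:
    """Return the end indices of all matches of the pattern .*a.{n} in text.

    Illustrative sketch only (hypothetical helper, not the paper's algorithm).
    Bit i of `v` is set exactly when an 'a' was read i characters ago, so one
    integer stands in for the whole set of active counter values 0..n.
    """
    mask = (1 << (n + 1)) - 1   # keep only counter values 0..n
    accept = 1 << n             # an 'a' followed by exactly n characters
    v = 0
    ends = []
    for i, c in enumerate(text):
        v = (v << 1) & mask     # every active counter advances by one
        if c == "a":
            v |= 1              # start a new counter at value 0
        if v & accept:
            ends.append(i)
    return ends


print(match_ends("xaxxbx", 2))
```

Each input character costs one shift, one mask, and at most one OR, regardless of how many counters are active at once. In this sketch the cost still grows with the width of the bit vector once `n` exceeds the machine word size, so it should not be read as achieving the bound-independent matching time described in the abstract.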
Index Terms
Regular Expression Matching using Bit Vector Automata