research-article
Open Access
Artifacts Evaluated & Functional / v1.1

Regular Expression Matching using Bit Vector Automata

Published: 06 April 2023

Abstract

Regular expressions (regexes) are ubiquitous in modern software. There is a variety of implementation techniques for regex matching, which can be roughly categorized as (1) relying on backtracking search, or (2) being based on finite-state automata. Implementations that use backtracking are often chosen for their ability to support advanced pattern-matching constructs. Unfortunately, they are known to suffer from severe performance problems: for some regular expressions, the running time for matching can be exponential in the size of the input text. To provide stronger guarantees of matching efficiency, automata-based regex matching is the preferred choice. However, even these regex engines may exhibit severe performance degradation for some patterns. The main reason is that regexes used in practice are not built exclusively from the classical regular constructs, i.e., concatenation, nondeterministic choice, and Kleene star. They involve additional constructs that provide succinctness and convenience of expression. The most common such construct is bounded repetition (also called counting), which describes the repetition of a pattern a fixed number of times.
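To make the notion of bounded repetition concrete, the following minimal sketch uses Python's standard `re` module. It is an illustration only and not part of the paper; the helper `unfold` is a hypothetical name introduced here. It shows that a counted pattern accepts the same strings as its explicit unfolding, and that the unfolded form grows linearly with the repetition bound.

```python
import re

def unfold(sub: str, n: int) -> str:
    """Hypothetical helper: rewrite (sub){n} as n explicit copies of sub."""
    return f"(?:{sub})" * n

# The counted form and its unfolding describe the same language.
counted = re.compile(r"(?:ab){3}")
unfolded = re.compile(unfold("ab", 3))

samples = ["ababab", "abab", "abababab", ""]
agree = all(
    bool(counted.fullmatch(s)) == bool(unfolded.fullmatch(s)) for s in samples
)

# The unfolded pattern text grows with the bound, while the counted
# form only needs a few more digits inside the braces.
unfolded_len = len(unfold("ab", 1000))
counted_len = len("(?:ab){1000}")
```

The counted form stays short as the bound grows, which is the succinctness the abstract refers to. An engine that handles the bound by unfolding it pays for that succinctness with a larger pattern.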

In this paper, we propose a new algorithm for the efficient matching of regular expressions that involve bounded repetition. The algorithm is based on a new model of automata, which we call nondeterministic bit vector automata (NBVA). This model is chosen to be expressively equivalent to nondeterministic counter automata with bounded counters, a very natural model for expressing patterns with bounded repetition. We show that there is a class of regular expressions with bounded repetition that can be matched in time that is independent of the repetition bounds. Our approach is general enough to cover the vast majority of challenging bounded repetitions that arise in practice. We provide an implementation of our approach in a regex engine, which we call BVA-Scan. We compare BVA-Scan against state-of-the-art regex engines on several real datasets.
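The general idea of tracking many counter values at once with a bit vector can be sketched for one specific pattern, `.*a.{n}`. The sketch below is an assumption-laden illustration and is not the NBVA construction or the BVA-Scan implementation described in the paper; the function name `matches_a_then_n` is hypothetical. Bit i of the vector records that an `a` was read exactly i characters ago, so each input character costs one shift and at most one OR.

```python
import re

def matches_a_then_n(text: str, n: int) -> bool:
    """Whole-string match for .*a.{n} using an integer as a bit vector.

    Illustrative only: bit i is set iff some 'a' has been followed
    by exactly i characters so far.
    """
    mask = (1 << (n + 1)) - 1      # keep only bits 0..n
    vec = 0
    for ch in text:
        vec = (vec << 1) & mask    # every pending count advances by one
        if ch == "a":
            vec |= 1               # a fresh 'a' starts a count at 0
    return bool((vec >> n) & 1)    # accept iff some count reached n

def reference(text: str, n: int) -> bool:
    """Same question answered by Python's re module, for comparison."""
    return re.fullmatch(r"(?s).*a.{%d}" % n, text) is not None

checks = [("xxaxyz", 3), ("xxaxy", 3), ("aaaa", 3), ("a", 0), ("", 0), ("bab", 1)]
consistent = all(matches_a_then_n(t, n) == reference(t, n) for t, n in checks)
```

When n + 1 fits in a machine word, the per-character work in a lower-level language would not grow with n. Python integers are arbitrary precision, so this version shows only the logic, not the performance.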

