Abstract
Regular expressions are a classical concept in formal language theory. Regular expressions in programming languages (RegEx), such as those in JavaScript, feature non-standard operator semantics (e.g., greedy/lazy Kleene star) as well as additional features such as capturing groups and references. While symbolic execution of programs containing RegExes calls for string solvers that natively support the important features of RegEx, such a string solver has so far been missing. In this paper, we propose the first string theory and string solver that natively provide such support. The key idea of our string solver is to introduce a new automata model, called prioritized streaming string transducers (PSSTs), to formalize the semantics of RegEx-dependent string functions. PSSTs combine priorities, which were previously introduced in prioritized finite-state automata to capture greedy/lazy semantics, with string variables, as in streaming string transducers, to model capturing groups. We validate the consistency of the formal semantics with the actual JavaScript semantics through extensive experiments. Furthermore, to solve the string constraints, we show that PSSTs enjoy good closure and algorithmic properties, in particular the regularity-preserving property (i.e., pre-images of regular constraints under PSSTs are regular), and we introduce a sound sequent calculus that exploits these properties and propagates regular constraints by taking post-images or pre-images. Although satisfiability of the string constraint language is undecidable in general, we show that our approach is complete for the so-called straight-line fragment. We evaluate the performance of our string solver on over 195,000 string constraints generated from an open-source RegEx library. The experimental results show the efficacy of our approach, which drastically improves on existing methods (via symbolic execution) in both precision and efficiency.
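To make the greedy/lazy and capturing-group semantics mentioned in the abstract concrete, the following is a minimal sketch. It uses Python's `re` module rather than JavaScript, on the assumption that the two engines agree on these simple patterns; the input strings and variable names are purely illustrative, and the snippet is not the PSST construction itself. It shows that two patterns accepting the same language can still assign different values to their capturing groups, which is why a solver must model match priorities and not only language membership.

```python
import re

text = "<a><b>"

# Greedy star: .* consumes as much as possible, so the capture spans both tags.
greedy = re.match(r"<(.*)>", text)

# Lazy star: .*? consumes as little as possible, so the capture stops at the first '>'.
lazy = re.match(r"<(.*?)>", text)

print(greedy.group(1))  # a><b
print(lazy.group(1))    # a

# Two patterns over the same language (all strings of a's) with different captures:
# the priority of the first star decides how the input is split between the groups.
greedy_split = re.match(r"(a*)(a*)", "aaa").groups()
lazy_split = re.match(r"(a*?)(a*)", "aaa").groups()

print(greedy_split)  # ('aaa', '')
print(lazy_split)    # ('', 'aaa')

# A RegEx-dependent string function: replace with references to capturing groups.
swapped = re.sub(r"(\w+)=(\w+)", r"\2=\1", "key=value")
print(swapped)  # value=key
```

In the terminology of the abstract, the choice between the greedy and lazy split corresponds to the priorities of a transducer, and the group values correspond to its string variables.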
Index Terms
Solving string constraints with Regex-dependent functions through transducers with priorities and variables