Abstract
String-manipulating programs are an important class of programs with applications in malware detection, graphics, input sanitization for Web security, and large-scale HTML processing. This paper extends prior work on BEK, an expressive domain-specific language for writing string-manipulating programs, with algorithmic insights that make BEK both analyzable and data-parallel. By analyzable we mean that, unlike in most general-purpose programming languages, many algebraic properties of a BEK program are decidable (e.g., one can check whether two programs commute, or compute the inverse of a program). By data-parallel we mean that a BEK program can compute on arbitrary subsections of its input in parallel, thus exploiting parallel hardware. This latter requirement is particularly important for programs that operate on large data: without data parallelism, a programmer cannot hide the latency of reading data from various storage media (e.g., reading a terabyte of data from a modern hard drive takes about 3 hours). With a data-parallel approach, the system can split data across multiple disks and thus hide the latency of reading the data.
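The "about 3 hours" figure can be checked with a back-of-the-envelope calculation. The sketch below is illustrative only: the sustained throughput of 100 MB/s and the perfectly linear scaling across disks are assumptions, not numbers from the paper, and `read_hours` is a hypothetical helper name.

```python
def read_hours(size_bytes, throughput_bytes_per_s, disks=1):
    """Hours needed to read size_bytes when the data is striped evenly across
    `disks` drives that are read in parallel (idealized: no overhead)."""
    return size_bytes / (throughput_bytes_per_s * disks) / 3600


TB = 10**12                       # one terabyte, decimal convention
ASSUMED_THROUGHPUT = 100 * 10**6  # assumption: 100 MB/s sustained sequential read

single_disk_hours = read_hours(TB, ASSUMED_THROUGHPUT)
eight_disk_hours = read_hours(TB, ASSUMED_THROUGHPUT, disks=8)

print(f"1 disk:  {single_disk_hours:.2f} h")
print(f"8 disks: {eight_disk_hours:.2f} h")
```

Under these assumptions a single disk needs a little under three hours, and the read time shrinks in proportion to the number of disks, provided the program consuming the data can work on each disk's portion independently.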
A BEK program is expressive: a programmer can use conditionals, switch statements, and registers (that is, local variables) to implement common string-manipulating programs. Unfortunately, this expressivity induces data dependencies, which are an obstacle to parallelism. The key contribution of this paper is an algorithm that automatically removes these data dependencies by mapping a BEK program into an intermediate format consisting of symbolic transducers, which extend classical transducers with symbolic predicates and symbolic assignments. We present a novel algorithm, which we call exploration, that performs symbolic loop unrolling of these transducers to obtain simplified versions of the original program. We show how these simplified versions can then be lifted to a stateless form, and from there compiled to data-parallel hardware.
To evaluate the efficacy of our approach, we demonstrate speedups of up to 8x for a number of real-world BEK programs (e.g., an HTML encoder and decoder) on data-parallel hardware. To the best of our knowledge, these are the first data-parallel implementations of these programs. To validate that our approach is correct, we use an automatic testing technique to compare our generated code to the original implementations and find no semantic deviations.
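To make the data dependency concrete, the following Python sketch uses a hypothetical space-collapsing program with a single Boolean register. It is not BEK code and it is not the exploration algorithm: it breaks the dependency with a simpler, well-known trick that only works when the register has a small finite set of values, namely running each chunk from every possible start state and stitching the results together afterwards. All names (`step`, `run`, `summarize`, `collapse_spaces_parallel`) are invented for illustration.

```python
from concurrent.futures import ThreadPoolExecutor

# Possible register values: was the previous character a space?
STATES = (False, True)


def step(prev_space, ch):
    """One transducer step: return (new register value, emitted output)."""
    if ch == " ":
        # Emit a space only if the previous character was not a space.
        return True, ("" if prev_space else " ")
    return False, ch


def run(chunk, state):
    """Run the transducer sequentially over chunk from a given start state."""
    out = []
    for ch in chunk:
        state, emitted = step(state, ch)
        out.append(emitted)
    return state, "".join(out)


def summarize(chunk):
    """For every possible start state, record the end state and the output."""
    return {s: run(chunk, s) for s in STATES}


def collapse_spaces_parallel(text, chunk_size=4):
    chunks = [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    # Chunks are summarized independently of each other, so this map can run
    # in parallel (threads here; CPython's GIL limits the actual speedup).
    with ThreadPoolExecutor() as pool:
        summaries = list(pool.map(summarize, chunks))
    # Cheap sequential pass: thread the real register value through the chunks.
    state, pieces = False, []
    for summary in summaries:
        state, out = summary[state]
        pieces.append(out)
    return "".join(pieces)


text = "a  b   c    d"
assert collapse_spaces_parallel(text) == run(text, False)[1]
```

The register is what ties each chunk to its predecessor: without knowing whether the previous chunk ended in a space, a chunk cannot decide whether to emit its own leading space. Enumerating start states sidesteps this for a two-valued register, but the cost grows with the number of possible register values, so it does not scale to registers over large or unbounded domains.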
Index Terms
Data-Parallel String-Manipulating Programs