
Data-Parallel String-Manipulating Programs

Published: 14 January 2015
Abstract

String-manipulating programs are an important class of programs with applications in malware detection, graphics, input sanitization for Web security, and large-scale HTML processing. This paper extends prior work on BEK, an expressive domain-specific language for writing string-manipulating programs, with algorithmic insights that make BEK both analyzable and data-parallel. By analyzable we mean that, unlike in most general-purpose programming languages, many algebraic properties of a BEK program are decidable (e.g., one can check whether two programs commute, or compute the inverse of a program). By data-parallel we mean that a BEK program can compute on arbitrary subsections of its input in parallel, thus exploiting parallel hardware. This latter requirement is particularly important for programs that operate on large data: without data parallelism, a programmer cannot hide the latency of reading data from various storage media (e.g., reading a terabyte of data from a modern hard drive takes about 3 hours). With a data-parallel approach, the system can split data across multiple disks and thus hide the latency of reading the data.
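The read-latency figure can be checked with a back-of-the-envelope calculation. The sketch below assumes a sustained sequential throughput of 100 MB/s for a single drive and perfectly even striping across disks; both numbers are illustrative assumptions, not measurements from the paper.

```python
def read_hours(total_bytes, bytes_per_second, disks=1):
    """Hours needed to read total_bytes striped evenly across `disks` drives."""
    return total_bytes / (bytes_per_second * disks) / 3600


TERABYTE = 10**12                  # decimal terabyte
ASSUMED_THROUGHPUT = 100 * 10**6   # assumption: 100 MB/s sustained per drive

one_disk = read_hours(TERABYTE, ASSUMED_THROUGHPUT)
eight_disks = read_hours(TERABYTE, ASSUMED_THROUGHPUT, disks=8)

print(f"1 disk:  {one_disk:.2f} h")
print(f"8 disks: {eight_disks:.2f} h")
```

Under these assumptions a single drive needs a little under 3 hours, and the time falls in proportion to the number of disks, provided the program consuming the data can itself work on independent pieces of the input.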

A BEK program is expressive: a programmer can use conditionals, switch statements, and registers (local variables) to implement common string-manipulating programs. Unfortunately, this expressivity induces data dependencies, which are an obstacle to parallelism. The key contribution of this paper is an algorithm that automatically removes these data dependencies by mapping a BEK program into an intermediate format consisting of symbolic transducers, which extend classical transducers with symbolic predicates and symbolic assignments. We present a novel algorithm, which we call exploration, that performs symbolic loop unrolling of these transducers to obtain simplified versions of the original program. We show how these simplified versions can then be lifted to a stateless form, and from there compiled to data-parallel hardware.
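To make the data-dependency problem concrete, the following minimal sketch shows a quote-escaping routine with a single Boolean register and one generic way to run it chunk by chunk. It is not the exploration algorithm described above: it simply runs every chunk from each possible register value and stitches the results together afterwards, which is only feasible when the register ranges over a small finite set. The functions `step`, `run`, and `run_chunked` are hypothetical names introduced for illustration.

```python
def step(escaped, ch):
    """One transition: return (next register value, output for this character)."""
    if escaped:
        return False, ch            # character after a backslash is copied as-is
    if ch == "\\":
        return True, ch             # remember that the next character is escaped
    if ch in "'\"":
        return False, "\\" + ch     # unescaped quote: emit a backslash first
    return False, ch


def run(chunk, escaped=False):
    """Sequential reference: the register threads through every character."""
    out = []
    for ch in chunk:
        escaped, piece = step(escaped, ch)
        out.append(piece)
    return escaped, "".join(out)


def run_chunked(text, size):
    """Process fixed-size chunks independently, then stitch the results."""
    chunks = [text[i:i + size] for i in range(0, len(text), size)]
    # Phase 1: each chunk is processed from every possible register value.
    # The chunks do not depend on each other, so this loop could run in parallel.
    summaries = [{start: run(c, start) for start in (False, True)} for c in chunks]
    # Phase 2: a cheap sequential pass picks the summary matching the real register.
    escaped, out = False, []
    for summary in summaries:
        escaped, piece = summary[escaped]
        out.append(piece)
    return "".join(out)


sample = "a'b\\'c\"d\\\\\""
assert run_chunked(sample, 3) == run(sample)[1]
```

The register `escaped` is what forces the sequential version to read characters in order. Enumerating both of its values per chunk removes that dependency at the cost of doing the work twice.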

To evaluate the efficacy of our approach, we demonstrate up to 8x speedups for a number of real-world BEK programs (e.g., an HTML encoder and decoder) on data-parallel hardware. To the best of our knowledge, these are the first data-parallel implementations of these programs. To validate that our approach is correct, we use an automatic testing technique to compare our generated code to the original implementations and find no semantic deviations.
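The abstract does not say which automatic testing technique is used. As one plausible illustration, the sketch below applies random differential testing: a chunked HTML encoder is compared against a reference on randomly generated inputs. Python's standard `html.escape` stands in for the original implementation, and `encode_chunked` and `find_deviation` are hypothetical names; none of this is taken from the paper.

```python
import html
import random


def encode_chunked(text, size):
    """Encode each fixed-size chunk on its own and concatenate the results."""
    return "".join(html.escape(text[i:i + size]) for i in range(0, len(text), size))


def find_deviation(trials=1000, seed=0):
    """Return the first (input, chunk size) where the two encoders disagree, else None."""
    rng = random.Random(seed)
    alphabet = "<>&\"'ab "          # biased towards characters that need encoding
    for _ in range(trials):
        text = "".join(rng.choice(alphabet) for _ in range(rng.randint(0, 40)))
        size = rng.randint(1, 8)
        if encode_chunked(text, size) != html.escape(text):
            return text, size
    return None


assert find_deviation() is None
```

HTML encoding maps each character independently of its neighbours, so chunk boundaries cannot change the output and no deviation is expected. A decoder is harder to split, because an entity such as `&amp;` may straddle a chunk boundary.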
