Synthesizing program input grammars

Published: 14 June 2017

Abstract

We present an algorithm for synthesizing a context-free grammar encoding the language of valid program inputs from a set of input examples and blackbox access to the program. Our algorithm addresses shortcomings of existing grammar inference algorithms, which both severely overgeneralize and are prohibitively slow. Our implementation, GLADE, leverages the grammar synthesized by our algorithm to fuzz test programs with structured inputs. We show that GLADE substantially increases the incremental coverage on valid inputs compared to two baseline fuzzers.
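As a rough illustration of the two ingredients the abstract names (blackbox access to the program, and a grammar used to generate structured inputs), the following toy sketch may help. It is not GLADE's algorithm. The `oracle` function is a hypothetical stand-in for running the program on an input, `repetition_is_safe` is an assumed helper that checks one candidate generalization of a seed input, and `GRAMMAR` is a hand-written example grammar rather than a synthesized one.

```python
import random


def oracle(s: str) -> bool:
    """Hypothetical blackbox: accepts exactly the balanced-parenthesis strings."""
    depth = 0
    for ch in s:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                return False
        else:
            return False
    return depth == 0


def repetition_is_safe(prefix, middle, suffix, accepts, max_reps=3):
    """Ask the blackbox whether repeating `middle` 0..max_reps times stays valid.

    A passing check is only evidence for the generalization, not a proof.
    """
    return all(accepts(prefix + middle * k + suffix) for k in range(max_reps + 1))


# Hand-written toy grammar (an assumption for this sketch): S -> "" | "(" S ")" S
GRAMMAR = {"S": [[], ["(", "S", ")", "S"]]}


def sample(symbol, grammar, rng, depth=0, max_depth=6):
    """Draw a random string from `grammar`, forcing the shortest rule when deep."""
    if symbol not in grammar:
        return symbol  # terminal symbol
    productions = grammar[symbol]
    if depth >= max_depth:
        production = min(productions, key=len)
    else:
        production = rng.choice(productions)
    return "".join(sample(s, grammar, rng, depth + 1, max_depth) for s in production)


if __name__ == "__main__":
    rng = random.Random(0)
    print(repetition_is_safe("", "()", "", oracle))    # repeating "()" stays valid
    print(repetition_is_safe("(", "(", ")", oracle))   # repeating "(" breaks validity
    print([sample("S", GRAMMAR, rng) for _ in range(5)])
```

Every string drawn from the grammar here is valid by construction, which is the property a grammar-based fuzzer relies on to reach code beyond the input parser.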

    • Published in

      ACM SIGPLAN Notices, Volume 52, Issue 6
      PLDI '17
      June 2017
      708 pages
      ISSN: 0362-1340
      EISSN: 1558-1160
      DOI: 10.1145/3140587
      • PLDI 2017: Proceedings of the 38th ACM SIGPLAN Conference on Programming Language Design and Implementation
        June 2017
        708 pages
        ISBN: 9781450349888
        DOI: 10.1145/3062341

      Copyright © 2017 ACM

      Publisher

      Association for Computing Machinery

      New York, NY, United States
