research-article

Extracting Formats of Service Messages with Varying Payloads

Published: 01 February 2022

Abstract

Having precise specifications of service APIs is essential for many Software Engineering activities. Unfortunately, available documentation of services is often inadequate and/or imprecise and, hence, cannot be fully relied upon. Generating service documentation manually is a tedious and error-prone task, especially in light of changes to services. Therefore, there is a need for automated support in generating service documentation. In this work, we present a novel approach to infer the API of a service by analyzing recorded messages sent to and received from this service. Our approach includes a novel, two-level technique to cluster messages, a step that many existing approaches to inferring message formats fail to perform precisely when the payload information of the available messages varies significantly. We have evaluated our approach on message traces from four different real-world services. The experimental results show that our approach is more effective than existing techniques in extracting correct message formats from recorded messages.


Published in

ACM Transactions on Internet Technology, Volume 22, Issue 3
August 2022
631 pages
ISSN: 1533-5399
EISSN: 1557-6051
DOI: 10.1145/3498359
Editor: Ling Liu

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery, New York, NY, United States

Publication History

• Published: 1 February 2022
• Accepted: 1 November 2021
• Revised: 1 September 2021
• Received: 1 April 2021

Qualifiers

• research-article
• Refereed