Abstract
Having precise specifications of service APIs is essential for many Software Engineering activities. Unfortunately, available documentation of services is often inadequate and/or imprecise and, hence, cannot be fully relied upon. Generating service documentation manually is a tedious and error-prone task, especially in light of changes to services. Therefore, there is a need for automated support in generating service documentation. In this work, we present a novel approach to infer the API of a service by analyzing recorded messages sent to and received from this service. Our approach includes a novel, two-level clustering technique to cluster messages, a step that many existing approaches to inferring message formats fail to perform precisely when the payload information of the available messages varies significantly. We have evaluated our approach on message traces from four different real-world services. The experimental results show that our approach is more effective than existing techniques in extracting correct message formats from recorded messages.
Extracting Formats of Service Messages with Varying Payloads