PPMexe: Program compression

Published: 01 January 2007

Abstract

With the emergence of software delivery platforms, code compression has become an important system component that strongly affects performance. This article presents PPMexe, a compression mechanism for program binaries that analyzes their syntax and semantics to achieve superior compression ratios. We use the generic paradigm of prediction by partial matching (PPM) as the foundation of our compression codec. PPMexe combines PPM with two preprocessing steps: (i) instruction rescheduling to improve prediction rates and (ii) heuristic partitioning of a program binary into streams with high autocorrelation. We improve the traditional PPM algorithm by (iii) using an additional alphabet of frequent variable-length supersymbols extracted from the input stream of fixed-length symbols. In addition, PPMexe features (iv) a low-overhead mechanism that enables decompression starting from an arbitrary instruction of the executable, a property pivotal for runtime software delivery. We implemented PPMexe for x86 binaries and tested it on several large applications. Binaries compressed using PPMexe were 18--24% smaller than files created using off-the-shelf PPMD, one of the best available compressors.
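
For readers unfamiliar with prediction by partial matching, the following is a minimal sketch of the generic PPM idea only, not of PPMexe itself. It omits instruction rescheduling, stream partitioning, supersymbols, and the random-access mechanism, none of which are specified in enough detail in the abstract to reproduce. The function name `ppm_cost_bits`, the parameter `max_order`, and the escape estimate (a simple "method C"-style count of distinct symbols, without exclusion) are assumptions made for illustration. Instead of emitting a bitstream through an arithmetic coder, the sketch only sums the ideal code length, -log2(p), of each symbol.

```python
import math
from collections import defaultdict


def ppm_cost_bits(data: bytes, max_order: int = 2) -> float:
    """Estimate the ideal code length (in bits) of `data` under a toy PPM model.

    Hypothetical helper for illustration; not the PPMexe or PPMD codec.
    """
    # counts[k][context][symbol]: how often `symbol` followed the k-byte `context`.
    counts = [defaultdict(lambda: defaultdict(int)) for _ in range(max_order + 1)]
    total_bits = 0.0

    for i, sym in enumerate(data):
        coded = False
        # Try the longest available context first, then fall back to shorter ones.
        for k in range(min(max_order, i), -1, -1):
            seen = counts[k][data[i - k:i]]
            if not seen:
                # Context never seen before: the decoder knows this too,
                # so falling back costs nothing.
                continue
            total = sum(seen.values())
            distinct = len(seen)
            if sym in seen:
                # Symbol predicted in this context: pay for its probability.
                total_bits += -math.log2(seen[sym] / (total + distinct))
                coded = True
                break
            # Symbol is novel in this context: pay for an escape, then fall back.
            total_bits += -math.log2(distinct / (total + distinct))
        if not coded:
            # Order -1 fallback: uniform distribution over all 256 byte values.
            total_bits += 8.0

        # Update the statistics of every context order with the new symbol.
        for k in range(min(max_order, i) + 1):
            counts[k][data[i - k:i]][sym] += 1

    return total_bits
```

Calling `ppm_cost_bits(code_bytes) / 8` gives a rough estimate, in bytes, of how far such a model could shrink the input. Highly repetitive input costs far fewer than 8 bits per byte, which is the effect the preprocessing steps in the abstract aim to strengthen by making program streams more predictable.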
