Abstract
With the emergence of software delivery platforms, code compression has become an important system component that strongly affects performance. This article presents PPMexe, a compression mechanism for program binaries that analyzes their syntax and semantics to achieve superior compression ratios. We use the generic paradigm of prediction by partial matching (PPM) as the foundation of our compression codec. PPMexe combines PPM with two preprocessing steps: (i) instruction rescheduling to improve prediction rates and (ii) heuristic partitioning of a program binary into streams with high autocorrelation. We improve the traditional PPM algorithm by (iii) using an additional alphabet of frequent variable-length supersymbols extracted from the input stream of fixed-length symbols. In addition, PPMexe features (iv) a low-overhead mechanism that enables decompression starting from an arbitrary instruction of the executable, a property pivotal for runtime software delivery. We implemented PPMexe for x86 binaries and tested it on several large applications. Binaries compressed using PPMexe were 18--24% smaller than files created using off-the-shelf PPMD, one of the best available compressors.
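To give a rough feel for the PPM paradigm the abstract builds on, the sketch below estimates how many bits an idealized arithmetic coder would spend on a byte string under a heavily simplified PPM-style model: try the longest context first, and "escape" to a shorter one when the symbol has not been seen there. This is not the PPMexe codec and not PPMD. The function name `ppm_cost_bits`, the escape estimate (distinct symbols over total plus distinct), the missing exclusion step, and the uniform 256-symbol fallback are all assumptions chosen for brevity.

```python
import math
from collections import defaultdict


def ppm_cost_bits(data: bytes, max_order: int = 2) -> float:
    """Estimate the ideal code length in bits under a simplified PPM-style model.

    Hypothetical illustration only: no exclusion, no supersymbols, no stream
    partitioning, and no instruction rescheduling.
    """
    # counts[k][context] maps the next symbol to how often it followed context
    counts = [defaultdict(lambda: defaultdict(int)) for _ in range(max_order + 1)]
    total_bits = 0.0
    for i, sym in enumerate(data):
        prob = 1.0
        coded = False
        # Try the longest available context first, then fall back.
        for k in range(min(max_order, i), -1, -1):
            table = counts[k].get(data[i - k:i])
            if not table:
                continue  # context never seen, so no escape needs coding
            n = sum(table.values())
            distinct = len(table)
            if sym in table:
                prob *= table[sym] / (n + distinct)
                coded = True
                break
            prob *= distinct / (n + distinct)  # escape to a shorter context
        if not coded:
            prob *= 1.0 / 256  # order -1: uniform over all byte values
        total_bits += -math.log2(prob)
        # Adaptive update: record the symbol in every context order.
        for k in range(min(max_order, i) + 1):
            counts[k][data[i - k:i]][sym] += 1
    return total_bits


if __name__ == "__main__":
    sample = b"ab" * 200
    print(ppm_cost_bits(sample, max_order=0))
    print(ppm_cost_bits(sample, max_order=2))
```

On highly repetitive input, the higher-order model should report a much smaller cost than the order-0 model, which is the effect that preprocessing for better prediction aims to strengthen.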
Index Terms
PPMexe: Program compression