Abstract
Differencing compressed archives is a common task in file management and synchronization, with applications in source code distribution, application updates, and document synchronization. General-purpose binary differencing tools can create and apply patches to compressed archives, but they consider neither the archive’s internal structure nor the file lifecycle, so they miss opportunities to save space that the structure and metadata offer. To address this gap, we develop a content-aware, format-independent theory of differencing for compressed archives and propose a canonical form and digest for such archives. Building on both, we present Donag, a content-aware differencing and patching algorithm that exploits the internal structure of versioned archives to produce smaller patches than general-purpose binary differencing tools. Donag uses the VCDiff and BSDiff engines internally. We compare Donag’s patches with those produced by bsdiff, xdelta3, and Delta++ on three classes of compressed archives: open-source code repositories, large and small applications, and office productivity documents (DOCX, XLSX, PPTX). Donag’s patches are typically 10% to 89% smaller than those of the three baseline tools, with reasonable memory overhead and throughput on commodity hardware. In the worst case, Donag’s patches are negligibly larger.
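To illustrate why a content-aware digest can help, the following minimal Python sketch hashes a ZIP archive by its entry names and decompressed contents rather than by its raw bytes. It is not Donag's canonical form or digest, which the abstract does not specify; the function names `canonical_digest` and `build_zip`, the choice of SHA-256, and the sorted-by-name ordering are all assumptions made for illustration.

```python
import hashlib
import io
import zipfile


def canonical_digest(archive_bytes: bytes) -> str:
    """Hypothetical content-aware digest (an assumption, not Donag's).

    Hashes each entry's name and decompressed content in sorted name
    order, so compression method and entry order do not affect the result.
    """
    outer = hashlib.sha256()
    with zipfile.ZipFile(io.BytesIO(archive_bytes)) as zf:
        for name in sorted(zf.namelist()):
            content_hash = hashlib.sha256(zf.read(name)).digest()
            name_bytes = name.encode("utf-8")
            # Length-prefix the name so entry boundaries are unambiguous.
            outer.update(len(name_bytes).to_bytes(4, "big"))
            outer.update(name_bytes)
            outer.update(content_hash)
    return outer.hexdigest()


def build_zip(entries, compression):
    """Helper for the example: build an in-memory ZIP from (name, data) pairs."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", compression=compression) as zf:
        for name, data in entries:
            zf.writestr(name, data)
    return buf.getvalue()


if __name__ == "__main__":
    files = [("a.txt", b"hello " * 100), ("b.txt", b"world " * 100)]
    stored = build_zip(files, zipfile.ZIP_STORED)
    deflated = build_zip(list(reversed(files)), zipfile.ZIP_DEFLATED)
    # The raw archives differ, but the content-aware digests agree.
    print(stored == deflated)
    print(canonical_digest(stored) == canonical_digest(deflated))
```

Two archives that hold the same files but were written with different compression settings or entry orders differ byte for byte, which is what a general-purpose binary differ sees, yet they share the same digest under this sketch.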
Donag: Generating Efficient Patches and Diffs for Compressed Archives