research-article

Donag: Generating Efficient Patches and Diffs for Compressed Archives

Published: 21 September 2022

Abstract

Differencing between compressed archives is a common task in file management and synchronization. Applications include source code distribution, application updates, and document synchronization. General purpose binary differencing tools can create and apply patches to compressed archives, but don’t consider the internal structure of the compressed archive or the file lifecycle. Therefore, they miss opportunities to save space based on the archive’s internal structure and metadata. To address the gap, we develop a content-aware, format independent theory for differencing on compressed archives and propose a canonical form and digest for compressed archives. Based on them, we present Donag, a content-aware differencing and patching algorithm that produces smaller patches than general purpose binary differencing tools on versioned archives by exploiting the compressed archives’ internal structure. Donag uses the VCDiff and BSDiff engines internally. We compare Donag’s patches to ones produced by bsdiff, xdelta3, and Delta++ on three classes of compressed archives: open-source code repositories, large and small applications, and office productivity documents (DOCX, XLSX, PPTX). Donag’s patches are typically 10% to 89% smaller than those produced by bsdiff, xdelta3, and Delta++, with reasonable memory overhead and throughput on commodity hardware. In the worst case, Donag’s patches are negligibly larger.
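The idea of a content-aware digest can be illustrated with a minimal sketch. It is not the canonical form or digest defined in the paper, and it is not part of Donag; the function names `content_digest` and `build_zip` are hypothetical. The sketch assumes ZIP archives and hashes only entry names and decompressed contents, so two archives that differ byte-for-byte (entry order, compression method, timestamps) but hold the same files produce the same digest.

```python
import hashlib
import io
import zipfile


def content_digest(archive_bytes: bytes) -> str:
    """Hash entry names and decompressed contents in sorted name order.

    Illustrative assumption only: compression method, compression level,
    entry order, and timestamps are deliberately ignored.
    """
    h = hashlib.sha256()
    with zipfile.ZipFile(io.BytesIO(archive_bytes)) as zf:
        for name in sorted(zf.namelist()):
            data = zf.read(name)  # decompressed bytes of the entry
            encoded = name.encode("utf-8")
            # Length prefixes keep (name, data) boundaries unambiguous.
            h.update(len(encoded).to_bytes(8, "big"))
            h.update(encoded)
            h.update(len(data).to_bytes(8, "big"))
            h.update(data)
    return h.hexdigest()


def build_zip(entries, compression, level=None) -> bytes:
    """Helper for the example: build an in-memory ZIP from (name, data) pairs."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", compression=compression,
                         compresslevel=level) as zf:
        for name, data in entries:
            zf.writestr(name, data)
    return buf.getvalue()


if __name__ == "__main__":
    files = [("a.txt", b"hello" * 100), ("b.txt", b"world" * 100)]
    stored = build_zip(files, zipfile.ZIP_STORED)
    deflated = build_zip(list(reversed(files)), zipfile.ZIP_DEFLATED, 9)
    print(stored == deflated)                                   # raw bytes differ
    print(content_digest(stored) == content_digest(deflated))   # contents match
```

A general-purpose binary differ sees the two archives above as largely unrelated byte streams, whereas a comparison at the level of decompressed entries recognizes that nothing changed.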


Published in

ACM Transactions on Storage, Volume 18, Issue 3
August 2022
244 pages
ISSN: 1553-3077
EISSN: 1553-3093
DOI: 10.1145/3555792
Editor: Sam H. Noh

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery, New York, NY, United States

Publication History

• Published: 21 September 2022
• Online AM: 27 July 2022
• Accepted: 21 December 2021
• Revised: 10 December 2021
• Received: 8 April 2021

Qualifiers

• research-article
• Refereed