research-article

WAN-optimized replication of backup datasets using stream-informed delta compression

Published: 06 December 2012

Abstract

Replicating data off site is critical for disaster recovery, but the current approach of transferring tapes is cumbersome and error-prone. Replicating across a wide area network (WAN) is a promising alternative, but fast network connections are expensive or impractical in many remote locations, so improved compression is needed to make WAN replication truly practical. We present a new technique for replicating backup datasets across a WAN that not only eliminates duplicate regions of files (deduplication) but also compresses similar regions of files with delta compression. The technique is available as a feature of EMC Data Domain systems.
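To make the distinction between deduplication and delta compression concrete, the following is a minimal Python sketch, not the system described in this article. It assumes data arrives as already-formed chunks, uses SHA-1 fingerprints to detect exact duplicates, and uses a brute-force similarity scan (the hypothetical helper `find_similar`) as a stand-in for whatever resemblance detection a production system would use. The message names `ref`, `delta`, and `raw` are likewise illustrative assumptions.

```python
import hashlib
from difflib import SequenceMatcher


def fingerprint(chunk: bytes) -> str:
    """Content fingerprint used to detect exact duplicates."""
    return hashlib.sha1(chunk).hexdigest()


def delta_encode(base: bytes, target: bytes) -> list:
    """Encode target as copy/insert operations against base."""
    ops = []
    matcher = SequenceMatcher(None, base, target, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            ops.append(("copy", i1, i2 - i1))       # reuse bytes from base
        elif j2 > j1:                                # "replace" or "insert"
            ops.append(("insert", target[j1:j2]))   # send only the new bytes
        # "delete" needs no operation: those base bytes are simply skipped
    return ops


def delta_decode(base: bytes, ops: list) -> bytes:
    """Rebuild the target from base plus the copy/insert operations."""
    out = bytearray()
    for op in ops:
        if op[0] == "copy":
            _, offset, length = op
            out += base[offset:offset + length]
        else:
            out += op[1]
    return bytes(out)


def find_similar(chunk: bytes, store: dict, threshold: float = 0.5):
    """Hypothetical brute-force similarity search (illustration only)."""
    best_fp, best_ratio = None, threshold
    for fp, candidate in store.items():
        ratio = SequenceMatcher(None, candidate, chunk, autojunk=False).ratio()
        if ratio > best_ratio:
            best_fp, best_ratio = fp, ratio
    return best_fp


def replicate(chunks, remote_store: dict) -> list:
    """Return the messages sent; remote_store maps fingerprint -> chunk."""
    messages = []
    for chunk in chunks:
        fp = fingerprint(chunk)
        if fp in remote_store:                       # exact duplicate
            messages.append(("ref", fp))
            continue
        base_fp = find_similar(chunk, remote_store)
        if base_fp is not None:                      # similar chunk: send delta
            ops = delta_encode(remote_store[base_fp], chunk)
            messages.append(("delta", base_fp, ops))
        else:                                        # nothing to exploit
            messages.append(("raw", chunk))
        remote_store[fp] = chunk
    return messages
```

In this sketch a duplicate chunk costs only a fingerprint on the wire, while a chunk that differs slightly from one already at the destination costs a reference plus the changed bytes. The linear scan in `find_similar` would not scale and is used here only to keep the example short.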

Our main contribution is an architecture that adds stream-informed delta compression to existing deduplication systems and eliminates the need for new, persistent indexes. Unlike techniques that rely on knowing a file's version or that use a memory cache, our approach achieves delta compression across all data replicated to a server at any time in the past. In a detailed analysis of datasets and statistics from hundreds of customers using our product, we find that delta compression achieves an additional 2X compression beyond deduplication and local compression, which enables customers to replicate data that would otherwise fail to complete within their backup window.
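The effect of an additional 2X compression on a backup window can be illustrated with simple arithmetic. The Python sketch below is only a worked example: the function name `replication_hours`, the dataset size, the deduplication and local compression factors, the link speed, and the 12-hour window are all hypothetical values chosen for illustration. Only the 2X delta factor comes from the abstract.

```python
def replication_hours(logical_gb: float, dedup_factor: float,
                      local_factor: float, delta_factor: float,
                      wan_mbps: float) -> float:
    """Hours needed to send a backup over a WAN after compression.

    All compression factors are multiplicative reductions (2.0 means 2X).
    Uses decimal units: 1 GB = 8000 megabits.
    """
    gb_on_wire = logical_gb / (dedup_factor * local_factor * delta_factor)
    seconds = gb_on_wire * 8000 / wan_mbps
    return seconds / 3600


# Hypothetical site: 2000 GB logical backup, 10X dedup, 2X local compression,
# 10 Mbit/s WAN link, 12-hour backup window.
window_hours = 12
without_delta = replication_hours(2000, 10, 2, 1, 10)
with_delta = replication_hours(2000, 10, 2, 2, 10)

print(f"without delta: {without_delta:.1f} h, fits: {without_delta <= window_hours}")
print(f"with delta:    {with_delta:.1f} h, fits: {with_delta <= window_hours}")
```

Under these assumed numbers, replication takes roughly 22 hours without delta compression and roughly 11 hours with it, so only the second case completes inside the window.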

Published in

ACM Transactions on Storage, Volume 8, Issue 4
November 2012
82 pages
ISSN: 1553-3077
EISSN: 1553-3093
DOI: 10.1145/2385603

Copyright © 2012 ACM

Publisher

Association for Computing Machinery
New York, NY, United States

Publication History

• Published: 6 December 2012
• Received: 1 August 2012
• Accepted: 1 August 2012

Qualifiers

• research-article
• Research
• Refereed
