Abstract
Replicating data off-site is critical for disaster recovery, but the current approach of transferring tapes is cumbersome and error-prone. Replicating across a wide area network (WAN) is a promising alternative, but fast network connections are expensive or impractical in many remote locations, so improved compression is needed to make WAN replication truly practical. We present a new technique for replicating backup datasets across a WAN that not only eliminates duplicate regions of files (deduplication) but also compresses similar regions of files with delta compression. The technique is available as a feature of EMC Data Domain systems.
Our main contribution is an architecture that adds stream-informed delta compression to existing deduplication systems and eliminates the need for new, persistent indexes. Unlike techniques that rely on knowing a file's version or on a memory cache, our approach achieves delta compression across all data replicated to a server at any time in the past. A detailed analysis of datasets and statistics from hundreds of customers using our product shows that delta compression achieves an additional 2X compression beyond deduplication and local compression, which enables customers to complete replication that would otherwise not finish within their backup window.
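The difference between deduplication and delta compression can be illustrated with a minimal sketch. It is not the architecture described above: every name in it (`Replica`, `plan`, `sketch`, `delta_encode`) is hypothetical, the sketch lookup is a plain in-memory dictionary rather than a stream-informed cache, the sender reads the replica's state directly instead of querying it over the network, and Python's `difflib` stands in for a real delta encoder. It only shows the three possible outcomes for a chunk: send a reference (exact duplicate), send a delta against a similar chunk, or send the raw bytes.

```python
import hashlib
from difflib import SequenceMatcher


def fingerprint(chunk: bytes) -> str:
    """Exact-match identity of a chunk (used for deduplication)."""
    return hashlib.sha1(chunk).hexdigest()


def sketch(chunk: bytes, window: int = 8) -> int:
    """Resemblance sketch (assumed form): the maximum hash over all sliding
    windows. Chunks sharing most of their content often share this value."""
    starts = range(max(1, len(chunk) - window + 1))
    return max(
        int.from_bytes(
            hashlib.blake2b(chunk[i:i + window], digest_size=8).digest(), "big"
        )
        for i in starts
    )


def delta_encode(base: bytes, target: bytes) -> list:
    """Encode target as copy-from-base and insert-literal operations."""
    ops = []
    matcher = SequenceMatcher(None, base, target, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            ops.append(("copy", i1, i2 - i1))
        elif j2 > j1:  # "replace" or "insert": ship the new bytes literally
            ops.append(("insert", target[j1:j2]))
        # "delete": nothing to send, the base bytes are simply skipped
    return ops


def delta_decode(base: bytes, ops: list) -> bytes:
    """Rebuild the target chunk from the base chunk and the operations."""
    out = bytearray()
    for op in ops:
        if op[0] == "copy":
            _, start, length = op
            out += base[start:start + length]
        else:
            out += op[1]
    return bytes(out)


class Replica:
    """Toy remote store; both indexes live in memory (a simplification)."""

    def __init__(self):
        self.chunks = {}    # fingerprint -> chunk bytes
        self.sketches = {}  # sketch -> fingerprint of one chunk with it

    def receive(self, message: tuple) -> bytes:
        kind = message[0]
        if kind == "ref":
            chunk = self.chunks[message[1]]
        elif kind == "delta":
            _, base_fp, ops = message
            chunk = delta_decode(self.chunks[base_fp], ops)
        else:  # "raw"
            chunk = message[1]
        fp = fingerprint(chunk)
        self.chunks[fp] = chunk
        self.sketches.setdefault(sketch(chunk), fp)
        return chunk


def plan(chunk: bytes, replica: Replica) -> tuple:
    """Choose what to send for one chunk: reference, delta, or raw bytes."""
    fp = fingerprint(chunk)
    if fp in replica.chunks:
        return ("ref", fp)  # deduplication: exact duplicate
    base_fp = replica.sketches.get(sketch(chunk))
    if base_fp is not None:
        base = replica.chunks[base_fp]
        return ("delta", base_fp, delta_encode(base, chunk))  # similar chunk
    return ("raw", chunk)
```

In this sketch, sending the same chunk twice produces a `"ref"` message the second time, while a lightly edited chunk produces a `"delta"` message only when its sketch happens to match a stored chunk; otherwise it falls back to `"raw"`. Whichever message is chosen, `Replica.receive` reconstructs the original bytes.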
Index Terms
WAN-optimized replication of backup datasets using stream-informed delta compression