Oasis: Controlling Data Migration in Expansion of Object-based Storage Systems

Abstract
Object-based storage systems are widely used in scenarios such as file storage, block storage, and blob storage (e.g., large videos), where data is distributed across a large number of object storage devices (OSDs). Data placement is critical to the scalability of decentralized object-based storage systems. The state-of-the-art CRUSH placement method is a decentralized algorithm that deterministically places object replicas onto storage devices without relying on a central directory. While enjoying the benefits of decentralization, such as high scalability, robustness, and performance, CRUSH-based storage systems suffer from uncontrolled data migration when the capacity of the storage cluster is expanded (i.e., when new OSDs are added); this migration is inherent to CRUSH and causes significant performance degradation when the expansion is nontrivial.
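The migration problem can be seen in a minimal, hypothetical sketch of hash-based placement. The `straw_select` function below is an assumption for illustration, a simplified stand-in for CRUSH's straw-style bucket selection (real CRUSH uses a hierarchical map and the straw2 algorithm): each placement group (PG) ranks all OSDs by a per-PG hash and takes the top replicas. Because every OSD competes in every draw, adding OSDs re-ranks the draws and remaps a large fraction of PGs.

```python
import hashlib

def straw_select(pg_id: int, osds: list[str], replicas: int = 3) -> list[str]:
    """Pick `replicas` OSDs for a PG by highest-hash ("straw"-style)
    selection -- a simplified, hypothetical stand-in for CRUSH."""
    ranked = sorted(
        osds,
        key=lambda osd: hashlib.sha256(f"{pg_id}:{osd}".encode()).hexdigest(),
        reverse=True,
    )
    return ranked[:replicas]

# Expand a 10-OSD cluster to 15 OSDs and count remapped PGs.
before = {pg: straw_select(pg, [f"osd.{i}" for i in range(10)]) for pg in range(1024)}
after = {pg: straw_select(pg, [f"osd.{i}" for i in range(15)]) for pg in range(1024)}
moved = sum(1 for pg in before if set(before[pg]) != set(after[pg]))
print(f"{moved / 1024:.0%} of PGs have at least one replica remapped")
```

With 5 of 15 OSDs being new, most PGs see at least one replica move: the migration volume is dictated by the hash function, not by the administrator, which is the "uncontrolled" behavior the article targets.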
This article presents MapX, a novel extension to CRUSH that uses an extra time-dimension mapping (from object creation times to cluster expansion times) to control data migration after cluster expansions. Each expansion is viewed as a new layer of the CRUSH map, represented by a virtual node beneath the CRUSH root. MapX controls the mapping from objects onto layers by manipulating the timestamps of the intermediate placement groups (PGs). MapX is applicable to a large variety of object-based storage scenarios in which object timestamps can be maintained as higher-level metadata. We have applied MapX to the state-of-the-art Ceph-RBD (RADOS Block Device) to implement a migration-controllable, decentralized object-based block store called Oasis. Oasis extends the RBD metadata structure to maintain and retrieve approximate object creation times (for migration control) at the granularity of expansion layers. Experimental results show that the MapX-based Oasis block store outperforms the CRUSH-based Ceph-RBD (which is busy migrating objects after expansions) by 3.17×–4.31× in tail latency, and by 76.3% (respectively, 83.8%) in IOPS for reads (respectively, writes).
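The time-dimension mapping can be sketched as follows. This is a hypothetical illustration, not MapX's implementation: the epoch list, `layers`, and `place` are invented names, and real MapX maps objects to layers via PG timestamps inside the CRUSH map rather than a flat table. The key property it demonstrates is that an object's creation time selects an expansion layer, and hashing then runs only over that layer's OSDs, so later expansions add new layers without remapping existing objects.

```python
import bisect
import hashlib

# Hypothetical layer table: the cluster starts at t=0 and is expanded
# twice; each expansion contributes a new layer of OSDs.
expansion_epochs = [0, 1000, 2000]
layers = [
    [f"osd.{i}" for i in range(0, 10)],    # initial OSDs
    [f"osd.{i}" for i in range(10, 15)],   # OSDs added at t=1000
    [f"osd.{i}" for i in range(15, 20)],   # OSDs added at t=2000
]

def place(obj_name: str, created_at: int, replicas: int = 2) -> list[str]:
    """Map creation time -> expansion layer, then hash only within
    that layer's OSDs (a sketch of MapX's extra time dimension)."""
    layer = bisect.bisect_right(expansion_epochs, created_at) - 1
    ranked = sorted(
        layers[layer],
        key=lambda osd: hashlib.sha256(f"{obj_name}:{osd}".encode()).hexdigest(),
        reverse=True,
    )
    return ranked[:replicas]

# An object created before the first expansion is placed on the initial
# layer's OSDs, and stays there even after new OSDs join later layers.
print(place("vm-disk/chunk-17", created_at=500))
```

Under this scheme migration becomes a policy decision (e.g., deliberately retiming a PG to rebalance load) instead of an automatic side effect of changing the OSD set.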