Abstract
In this article, we present an approach to systematically examine the schedulability of distributed storage systems, identify their scheduling problems, and enable effective scheduling in these systems. We use Thread Architecture Models (TAMs) to describe the behavior and interactions of different threads in a system, and show how to construct TAMs for existing systems and how to use them to identify critical scheduling problems. We specify three schedulability conditions that a schedulable TAM should satisfy: completeness, local enforceability, and independence. Meeting these conditions enables a system to easily support different scheduling policies. We identify five common problems that prevent a system from satisfying the schedulability conditions, and show that these problems arise in existing systems such as HBase, Cassandra, MongoDB, and Riak, making it difficult or impossible to realize various scheduling disciplines. We demonstrate how to address these schedulability problems using both direct and indirect solutions, each with different trade-offs. To show how our approach enables scheduling in realistic systems, we develop Tamed-HBase and Muzzled-HBase, two sets of modifications to HBase that realize the desired scheduling disciplines, including fairness and priority scheduling, even when presented with challenging workloads.
Principled Schedulability Analysis for Distributed Storage Systems Using Thread Architecture Models