The Cost of Simplicity: Understanding Datacenter Scheduler Programming Abstractions

Schedulers are a crucial component in datacenter resource management. Each scheduler offers different capabilities, and users use them through their APIs. However, there is no clear understanding of what programming abstractions they offer, nor why they offer some and not others. Consequently, it is difficult to understand their differences and the performance costs imposed by their APIs. In this work, we study the programming abstractions offered by industrial schedulers, their shortcomings, and their related performance costs. We propose a general reference architecture for scheduler programming abstractions. Specifically, we analyze the programming abstractions of five popular industrial schedulers, understand the differences in their APIs, and identify the missing abstractions. Finally, we carry out exemplary experiments using trace-driven simulation demonstrating that an API extension, such as container migration, can improve total execution time per task by 81%, highlighting how schedulers sacrifice performance by implementing simpler programming abstractions. All the relevant software and data artifacts are publicly available at https://github.com/atlarge-research/quantifying-api-design.


INTRODUCTION
Society's increasing dependence on digital technologies and infrastructure has led to the widespread use of datacenters for deploying digital services [20,28]. Schedulers play a vital role in orchestrating datacenter resources to meet the demands of these services [24,45]. The interfaces schedulers offer to users determine the limits of the users' ability to mold the orchestration process to support their application needs. Different schedulers offer different levels of programmability and control to users [27,47,51,58]. For example, some schedulers provide restricted programming abstractions, minimizing user input, while others offer more flexible interfaces that empower users with greater control over resource allocation and job placement [47,58]. This spectrum of scheduler programming abstractions raises questions about the impact of design choices on the performance, simplicity, and control that users can achieve.
The first question we raise about scheduler abstraction design is: What programming abstractions are common in current schedulers? Knowledge of programming abstractions in existing industrial schedulers informs designers of what is currently available to the users. The programming abstractions available in academic research schedulers can also suggest to designers which abstractions are necessary to incorporate the latest resource management techniques proposed by the research community.
The second question is: What programming abstractions are sacrificed for simplicity? Usually, academic schedulers offer a wide set of programming abstractions, allowing the users to customize several aspects of scheduler operational behavior. On the other hand, industrial schedulers usually implement a restricted subset for increased security and robustness [46].
The third question is: What is the performance cost of the sacrificed abstractions? Despite their security and robustness benefits, simpler abstractions have a performance cost. The performance cost is usually in the form of underutilized resources and slow-to-complete application jobs. To shed light on this issue, we conduct three different experiments. Figure 1 depicts an exemplary result with the median execution time of workflows in a trace from Google [55]. We consider a scheduler that implements a crucial abstraction lacking in many industrial schedulers: metadata access to the data stored on datacenters' object storage service (e.g., AWS S3). Comparing it against a scheduler lacking this abstraction, we observe a 24% reduction in median workflow runtime when using the abstraction.
To address these questions and enhance our understanding of scheduler programming abstractions, we develop a comprehensive and structured reference architecture that provides a unified view of the programming interfaces offered by schedulers. This reference architecture complements earlier work on scheduler internals [5,30]. It guides developers and researchers in designing and implementing scheduling APIs, capturing the essential abstractions in task scheduling and resource management within datacenter environments.
Establishing a common reference architecture brings several benefits. First, the reference architecture provides a common framework for analyzing and comparing existing industrial and academic schedulers. The comparison helps identify similarities, differences, and potential shortcomings, thus enabling the assessment of different implementations and design alternatives [5]. Second, it serves as a knowledge base for designing better schedulers that can meet the demands of modern applications by addressing shortcomings [5,11,22,36]. Finally, establishing a common reference model reduces the risk of a scheduler being specialized to the current interface by providing a view of all possible programming interfaces. This helps avoid non-extensible designs that must be re-engineered at great development cost, as was the case with Condor [51] and Borg [10] when the need for a new design arose.
To understand datacenter scheduler programming abstractions and the cost of missing ones, we make a four-fold contribution:
(1) We design a reference architecture for datacenter scheduler programming abstractions (Section 3). We propose a set of design principles and, with them, design an architecture that considers different stakeholders and the programming abstractions of existing schedulers.
(2) We analyze existing industrial and academic schedulers by mapping them to the reference architecture (Section 4). This mapping allows us to compare them using a common language. The comparison reveals abstractions proposed in literature but missing from industrial schedulers.
(3) We analyze the effect of missing abstractions on the performance of modern schedulers (Section 5). To this end, we implement three missing abstractions in an event-driven simulator and conduct simulations using real-world traces collected by major datacenter operators, e.g., Google and Microsoft.
(4) We contribute to open science and reproducibility by releasing data and software artifacts. To enable the experiments in this work, we have significantly extended OpenDC [39], a state-of-the-art simulator. We release the code enabling this work's capabilities through GitHub: https://github.com/atlarge-research/quantifying-api-design. The repository has been archived using Zenodo at: https://zenodo.org/doi/10.5281/zenodo.10605424.

DATACENTER SCHEDULER SYSTEM MODEL
This section contextualizes this work by describing the common datacenter scheduling concepts depicted in Figure 2.

Workload
The workload is executed using the resources the scheduler assigns to the user. Following the taxonomy proposed by Andreadis et al. [5], we consider four types of workloads: (1) Batch workflows are workloads comprising several tasks with dependencies between them. (2) Bag-of-tasks are jobs formed by several tasks without any dependency between them. (3) Long-running tasks run for a very long time and are usually inside a host such as a VM. (4) Managed jobs are workloads where a manager coordinates all the tasks, such as Spark.
The users specify the requirements to execute the workload. Usually, these comprise the amount of CPU and memory. However, in some cases, other requirements, such as the start time, the dependencies between the tasks, the scalability of the resources, etc., are also specified. To submit the workload requirements, users interact with the scheduler through its API.

Scheduling
A user submits a workload to use the resources through a central component, the scheduler [10,41]. The scheduler takes care of several tasks: finding resources to assign to the workload based on the specified requirements, transferring the workload to the resources, starting the execution of the workload, managing the workload through its lifecycle (from placement to workload cleanup), and notifying the user about lifecycle events.
Throughout the execution of a workload, the resource requirements of the workload and the number of resources available to the scheduler can change. Therefore, the scheduler must adapt to changing workload requirements by dynamically increasing or decreasing the allocated resources. This is usually done through a specific subcomponent (e.g., the autoscaler in Kubernetes [2]). A scheduler can also preempt, recover, and migrate workloads when the amount of available resources changes.
Schedulers can be monolithic [35] and run in a single process that handles all tasks. They can also be distributed, where tasks are split across other components, such as the autoscaler [2]. In the same way, the scheduler and its components can be replicated in several processes running in parallel. Still, they must coordinate among themselves when assigning resources to the workloads. In addition, schedulers can be centralized [42], where a single entity implements the scheduler and dictates the policies and mechanisms, or decentralized [51], where several entities implement the scheduler, each with different policies and mechanisms. When the scheduler is decentralized, the instances must coordinate through a common protocol and sometimes use a central matchmaker.

Scheduler resources
The workloads are executed on top of the resources that the scheduler manages. Resources typically refer to physical machines usually located within a datacenter. These datacenters consist of multiple clusters, each housing several hosts, with each host functioning as a node within a rack. It is important to note that while our discussion primarily focuses on virtualized resources such as VMs or containers running on hosts through a hypervisor, it is also possible to manage bare metal resources. However, virtualized environments are more prevalent and present a wider range of interesting phenomena for modeling and analysis.
In this work, we model the resources of a host as the combination of CPU, memory (RAM), and storage. CPUs can have different frequencies and numbers of cores. Memory and storage can have different sizes. We model resource consumption using a discrete model, where the workload reports how many resources it requires and for how long. The hypervisor consolidates the consumption of the different workloads through a fair-sharing policy.

Programming abstraction
Schedulers offer a set of programming abstractions for users to interact with. Programming abstractions are the API offered by schedulers and are the language by which the user submits workloads and modifies the workload's requirements during the workload's life cycle. Programming abstractions are offered through a GUI, CLI, or a protocol such as HTTP.
The API includes both the interactions of the scheduler with the applications and the resources. In this work, we investigate API extensions that allow the scheduler to interact with applications and the resources allocated after the initial resource allocation.
Resource management systems, such as autoscalers, interact with schedulers and other resource managers in a completely automated manner without any user intervention. We consider the API between these different systems a part of the scheduler programming abstraction. The API constrains the actions available to these systems. Obtaining system data and performing actions not supported by the API is difficult for the systems we analyze in this work.

REFERENCE ARCHITECTURE FOR SCHEDULER PROGRAMMING ABSTRACTIONS
We propose a reference architecture to understand and describe standard programming abstractions available in current schedulers.
Through systematic categorization and organization, the reference architecture offers a framework for analyzing and comparing existing schedulers and a comprehensive view of the range of common abstractions that a scheduler can implement. This helps us answer the question: What are the programming abstractions common in current schedulers?
Our process for designing the reference architecture has the following steps:
(1) Stakeholder and use case identification
(2) Requirements analysis
(3) Model industrial schedulers
(4) Model emerging concepts from academia
(5) Unify industrial schedulers with emerging concepts
We describe our requirements in Section 3.1. We identify five popular schedulers in the industry, and we analyze their APIs. Consulting experts in the field, we select the following schedulers: Kubernetes [3], SLURM [35], Spark [56], Condor [51], and Airflow [1]. We further analyze these schedulers in Section 4.
After analyzing industrial and emerging scheduler designs from academia, we extract, filter, generalize, and unify them into a reference architecture.

Requirements
We identify the requirements that the reference architecture must meet. The reference architecture has to be:
R1 Understandable. Different stakeholders should be able to easily understand the different components that make up the reference architecture, how they relate to each other, and their high-level meaning. We enable this through the principles in Section 3.2 and the description language in Section 3.3.
R2 Actionable. The design must take into account whether users can use it to take concrete actions. We use the architecture in Section 4 to identify missing abstractions in industrial schedulers. We quantify the cost of missing abstractions in Section 5.
R3 Pragmatic. The reference architecture concepts can be implemented in code and used to evaluate different programming abstractions comparatively. The reference architecture has been realized in the OpenDC simulator and used for experiments in Section 5.
R4 Comprehensive. The reference architecture can represent all known concepts used in industrial schedulers and emerging concepts from academia. We map five industrial schedulers to the reference architecture and compare them in Section 4. The reference architecture was built by analyzing 15 research prototypes from the community.

Design principles
For the design of the scheduling programming abstractions reference architecture, we identify the following design principles.
P1 Separation of objects from actions. We distinguish between the actions that can be performed and the objects, which represent the system's state, that are used as input to the actions. This separation facilitates comprehension (R1).
P2 Grouping of related actions. There may be several actions that are related to each other. Therefore, to facilitate comprehension, related actions are grouped.
P3 Avoidance of concrete technologies in objects. We keep the objects as high-level as possible to avoid strong coupling to a specific technology.

Reference Architecture Design
We analyze industrial and emerging scheduler designs from academia for scheduling abstractions. Then we extract, filter, generalize, and unify them into a reference architecture. In this process, we follow the requirements and design principles we set out in the previous subsections. The reference architecture allows us to describe the different abstractions provided by the schedulers we analyzed using a common language. This common language enables us to compare the schedulers' APIs to each other in Section 4.
The reference architecture is depicted in Figure 3. The high-level components of the reference architecture are the actions and objects that comprise the scheduler API. Objects describe the current or desired state of the system. Actions describe physical events (such as leasing a VM) that are executed when certain conditions are met. The conditions use objects in their specification. Each action must have three types of conditions: WHAT, WHEN, and WHERE, and for each condition, there can be one or more objects. This way, programming abstractions can be understood through the following syntactic structure: <action> <object> IN <object> WHEN <object>, where the objects and actions are filled using the reference architecture.
Listing 1: Example scheduler action.
    Provision:Lease UserResource<type: job, runtime: 5 days>
    IN SchedulerResource<type: vm, cpu: 2.4Ghz, memory: 16Gb>
    WHEN Event<day: 11, month: 12, year: 2023>
Consider the scheduler interaction in Listing 1; the action is "Provision:Lease", indicating the provisioning and leasing of resources. The objects involved are "UserResource", with specific characteristics such as job type and a runtime of 5 days, and "SchedulerResource", with attributes like VM type, a CPU of 2.4GHz, and memory of 16GB. The condition "IN" specifies that the "UserResource" is allocated within the "SchedulerResource". Lastly, the "WHEN" condition indicates an event occurring on December 11, 2023.
Tables 1 and 2 define and describe the actions and objects within the reference architecture.These tables serve as a resource for understanding the specific elements of the reference architecture and their respective functionalities.
In addition to the visual representation of the reference architecture for scheduling programming abstractions shown in Figure 3, we have also formally defined a syntax, which we use in Listing 1. The syntax is based on the Extended Backus-Naur Form (EBNF) and provides a structured and consistent way to express conditions using actions and objects in the programming abstractions. Due to space constraints, we do not present the formal syntax definition here but will add it as an appendix. The formally defined syntax enables precise communication using the reference architecture.
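As an illustration of the syntactic structure <action> <object> IN <object> WHEN <object>, the following minimal Python sketch builds and renders the action from Listing 1. The class names `Obj` and `Action` and their fields are assumptions made for this example; they are not part of the formal EBNF definition or of the OpenDC implementation.

```python
from dataclasses import dataclass, field


@dataclass
class Obj:
    """An object of the reference architecture (hypothetical helper class)."""
    kind: str                      # e.g. "UserResource", "SchedulerResource", "Event"
    attrs: dict = field(default_factory=dict)

    def render(self) -> str:
        inner = ", ".join(f"{key}: {value}" for key, value in self.attrs.items())
        return f"{self.kind}<{inner}>"


@dataclass
class Action:
    """An action with its WHAT, WHERE (IN), and WHEN conditions."""
    group: str   # action group, e.g. "Provision"
    name: str    # sub-action, e.g. "Lease"
    what: Obj
    where: Obj
    when: Obj

    def render(self) -> str:
        return (f"{self.group}:{self.name} {self.what.render()} "
                f"IN {self.where.render()} WHEN {self.when.render()}")


# The example action from Listing 1.
listing_1 = Action(
    group="Provision",
    name="Lease",
    what=Obj("UserResource", {"type": "job", "runtime": "5 days"}),
    where=Obj("SchedulerResource", {"type": "vm", "cpu": "2.4Ghz", "memory": "16Gb"}),
    when=Obj("Event", {"day": 11, "month": 12, "year": 2023}),
)
print(listing_1.render())
```

Keeping objects as plain key-value containers reflects principle P3: the sketch does not tie an object to any concrete technology.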

The mapping process
For each considered scheduler, we consult its official documentation, source code, and articles we find online. Then, using these resources, for each component of the reference architecture, we identify if there is a complete, partial, or no match. The meaning of a match is different for objects than for actions. In the case of actions, a complete match is when the scheduler offers the action. A partial match is when the action is offered in a limited way; that is, the action may only be offered at a specific moment in the lifecycle, e.g., scaling is only allowed when the CPU utilization is more than 80%, or the parameters with which the action can be performed are limited, e.g., a service can only be scaled by adding VMs with the same type of resources. A no-match is when the scheduler does not offer the action. In the case of objects, a full match means that the scheduler does not restrict the object parameters, and the user can flexibly specify whatever parameters they need. For example, the user can add any metadata information. A partial match means the scheduler allows the user to specify only a limited set of object parameters. For example, the user can only specify CPU constraints, not any other resource type. A no-match means that the scheduler does not allow that object type.
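The matching rules above can be summarized as a small decision procedure. The following Python sketch is an illustration only: the mapping in this work was done manually from documentation and source code, and the function names and boolean inputs are hypothetical.

```python
from enum import Enum


class Match(Enum):
    FULL = "full"
    PARTIAL = "partial"
    NONE = "none"


def classify_action(offered: bool,
                    lifecycle_limited: bool = False,
                    params_limited: bool = False) -> Match:
    """Classify an action of the reference architecture for one scheduler.

    lifecycle_limited: the action is only available at specific moments,
        e.g. scaling only when CPU utilization exceeds 80%.
    params_limited: the action only accepts a restricted set of parameters,
        e.g. scaling only by adding VMs of the same resource type.
    """
    if not offered:
        return Match.NONE
    if lifecycle_limited or params_limited:
        return Match.PARTIAL
    return Match.FULL


def classify_object(supported: bool, params_restricted: bool = False) -> Match:
    """Classify an object: full means the user may specify arbitrary parameters."""
    if not supported:
        return Match.NONE
    return Match.PARTIAL if params_restricted else Match.FULL


# A scheduler that scales only above a CPU threshold is a partial match.
print(classify_action(offered=True, lifecycle_limited=True))
# An object type that only accepts CPU constraints is a partial match.
print(classify_object(supported=True, params_restricted=True))
```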

Mapping results
Using the reference architecture, we analyze the shortcomings of the selected group of five industrial schedulers. Currently, it is not known when or why one should use some schedulers and not others. It is also unclear whether any scheduler has a clear gap or how to fill it. To answer this, it is necessary to analyze the scheduling APIs. We map their APIs into the reference architecture and aggregate the results in two tables. In Table 3, we map the actions, and in Table 4, the objects. We specify whether each action and object is a full, partial, or no match.

Access input data
Access to data that user jobs take as input.

Access intermediate data
Access to data that user jobs generate during their runtime.

Access metadata
Access to the information about the user data.

Replicate
Replication of the user data.

Partition
Partitioning of the user data so that a subset of the data is placed in different scheduler resources.

Recover
Recovery of the user data after the failure of execution or the storage system.

Communicate
Communication with the user resources, scheduler resources, or even the scheduler, such as setting a callback for getting notified about scheduling events.
Table 2: Objects in the reference architecture.

Event
Representation of objects in time or instantiations of properties in objects, such as concrete date-times (00:00 on the 31st of December 2022) or an instantiation of a property like a metric reaching a numeric value (CPU utilization is greater than 80%).

User resource
Representation of any kind of input from the user. This includes execution units like a job, task, etc., but also data such as a file, environment variable, etc.

Scheduler resource
Representation of resources owned and managed by the scheduler. Resources can be virtual machines, containers, storage systems, databases, etc.

Communication process
Representation of the process of communication, such as a signal, message, callback, etc.
The results indicate that industrial schedulers have several shortcomings. Several actions are under-implemented. There is a very clear pattern, where most schedulers implement three actions: lease / release, configure scheduler, and access input data. In most cases, all others are either partially or not implemented. The biggest shortcoming is in the manage data action and its objects, where most sub-actions and objects are not implemented. Overall, the industrial schedulers examined in our study do not provide data management abstractions to the user. This means that users have less control over the data and, consequently, less chance to optimize performance. For example, if the user has several unordered data items to process, consulting the metadata and obtaining information about the placement and request load of the storage systems where the data is stored could optimize how and when the data is processed.
The communicate action is only partially implemented in every scheduler except SLURM. Similarly, most communication objects are partial matches. This might imply a lower performance, since it does not allow the user to inform the scheduler about application-level insights during runtime, nor, vice versa, the scheduler to inform the user about scheduling-level insights. Moreover, partial matches imply that actions and objects are limited to a particular subset and do not allow the user to specify arbitrary inputs. For example, the Condor API only provides communication actions with user jobs, not the scheduler. Therefore, the user can dynamically communicate application-level insights to their jobs but not to the scheduler, reducing the scope of potential performance improvements.
Key Takeaway: Many actions and objects have partial or no matches, meaning the schedulers' APIs are under-implemented. Consequently, they reduce users' ability and scope to optimize their applications' performance. The main shortcomings are found in the manage data action and its objects but also, to a lesser extent, in communicate actions and their objects. Sub-actions related to provisioning other than lease, such as scale, migrate, and recover, are also not well supported by schedulers.

EVALUATING THE PERFORMANCE COST OF SIMPLE SCHEDULING ABSTRACTIONS
In this study, we address the limited programmability of industrial schedulers and highlight the need for greater user programmability to improve user-application performance. We identify under-implemented programming abstractions in scheduler APIs in Section 4. In this section, we design experiments to quantify the performance cost of these missing abstractions. The experiments focus on three specific use cases: 1) reservations, 2) migration requests, and 3) metadata access. We analyze the shortcomings of various industrial schedulers in implementing these abstractions and propose extensions to address them. This answers the question raised in Section 1: What is the performance cost of the sacrificed abstractions?
A comprehensive overview of these experiments can be found in Table 5, which outlines the API extensions, parameters, traces, and metrics for each use-case.

Implementation, Input Setup, and Open-Sourcing
Software: The reproducibility of the experiments is ensured through the use of the OpenDC datacenter discrete-event simulator [39], which is deterministic. We performed multiple runs with different random seeds to capture variations in the results. For each experiment run, we calculated the empirical cumulative distribution function (ECDF) to analyze the distribution of the measured metrics. This approach allowed us to assess the behavior and performance of the proposed extensions across different scenarios and obtain comprehensive insights.
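For readers unfamiliar with the ECDF, the following Python sketch shows the standard definition: the ECDF at x is the fraction of samples less than or equal to x. The function name and the sample values are made up for illustration and are not taken from the experiments.

```python
def ecdf_at(samples, x):
    """Empirical CDF: fraction of samples that are less than or equal to x."""
    if not samples:
        raise ValueError("ECDF is undefined for an empty sample")
    return sum(1 for s in samples if s <= x) / len(samples)


# Hypothetical waiting times (hours) from one simulation run.
waiting_times = [1, 2, 2, 5]
print(ecdf_at(waiting_times, 2))   # share of tasks that waited at most 2 hours
```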
Input data: Traces from private and public cloud environments, Azure [13], Google [55], and Bitbrains [49] (a Dutch ICT provider), were selected to provide realistic and diverse workload data for evaluating the proposed extensions. By leveraging real-world traces, our research captures the variability and complexity of cloud workloads, ensuring the relevance and validity of our findings.
These traces are open source, and the simulator has parsers for the respective formats. The Azure and Bitbrains traces were used as they were provided, while the first 2.5 days were used from a 30-day Google trace. The characteristics of the different traces are outlined in Table 6.

Simulated environment:
The number of machines in the simulated environment differs across traces and utilization levels. The environments have 35 machines for the Google trace, 102 machines for the Azure trace, and 1039 machines for the Bitbrains trace when simulating the workloads at 75% utilization. The machines are heterogeneous, having 4 to 32 cores depending on the configuration. The precise environment specifications for each experiment are described in topology files located in the experiment's folder in the application's Git repository.

Reservation
Goal: Schedulers utilize resources better if they know when tasks arrive and their resource requirements.We investigate if a scheduler with an API that accepts this additional information performs better for three different traces and by how much.
In the context of scheduling and resource allocation in datacenters, there is a specific category of jobs that are long-running and periodically submitted, which are provisioned into VMs (1 and 2 in Figure 4). These jobs exhibit predictable patterns, as they recur regularly and have well-defined resource requirements. Examples of such jobs include data processing pipelines, scientific simulations, and batch processing tasks.
Since their resource requirements and execution patterns are known in advance, schedulers could use this knowledge to allocate resources more efficiently and reduce waiting times. However, in practice, existing schedulers often do not effectively utilize the predictability of these long-running and predictable jobs [54]. As a result, these jobs may be subject to sub-optimal resource allocation and longer waiting times than necessary.
We propose an extension to datacenter schedulers that enhances the scheduling of long-running and predictable jobs by incorporating reservation programmability. This extension enables schedulers to be aware of these jobs' recurring nature and resource requirements, allowing for more optimized resource allocation and scheduling.
To enable reservations, we extend the system by modifying the lease action to include two additional parameters: a runtime estimate and a specified provisioning time for future reservations. When a user submits a reservation request, instead of immediately provisioning it, the scheduler adds the request to a reservation queue (3 in Figure 4) alongside other pending reservations. During this time, the scheduler applies algorithmic optimizations to improve future provisioning (4 in Figure 4). In our experiment, we employ a simple Earliest Finish Time (EFT) scheduling policy [52] to optimize the reservation queue by prioritizing tasks with earlier estimated finish times. Tasks without a reservation are scheduled according to the FIFO policy. Once the specified provisioning time arrives, the scheduler provisions the reserved resources into a VM (5 in Figure 4), fulfilling the user's reservation request. In Listing 2, we provide an example of the extension, showcasing the syntax for reservations.
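The reservation flow described above can be sketched in a few lines of Python. This is a simplified illustration and not the OpenDC implementation: the class and field names are hypothetical, time is a plain number of hours, and placing due reservations ahead of unreserved tasks at the same instant is an assumption made for this example.

```python
from collections import deque
from dataclasses import dataclass
from typing import Optional


@dataclass
class LeaseRequest:
    name: str
    runtime_estimate: Optional[float] = None   # extension parameter (hours)
    provision_time: Optional[float] = None     # extension parameter (hours)

    @property
    def is_reservation(self) -> bool:
        return self.runtime_estimate is not None and self.provision_time is not None

    @property
    def estimated_finish(self) -> float:
        return self.provision_time + self.runtime_estimate


class ReservationScheduler:
    def __init__(self):
        self.reserved = []     # pending reservations, not yet due
        self.fifo = deque()    # tasks without a reservation

    def submit(self, request: LeaseRequest) -> None:
        """Queue a request instead of provisioning it immediately."""
        if request.is_reservation:
            self.reserved.append(request)
        else:
            self.fifo.append(request)

    def provision(self, now: float) -> list:
        """Return the requests to provision at time `now`.

        Due reservations come first, ordered by Earliest Finish Time;
        unreserved tasks follow in FIFO order (assumed ordering).
        """
        due = [r for r in self.reserved if r.provision_time <= now]
        self.reserved = [r for r in self.reserved if r.provision_time > now]
        due.sort(key=lambda r: r.estimated_finish)
        ready = due + list(self.fifo)
        self.fifo.clear()
        return ready


scheduler = ReservationScheduler()
scheduler.submit(LeaseRequest("a", runtime_estimate=5, provision_time=10))
scheduler.submit(LeaseRequest("b", runtime_estimate=2, provision_time=8))
scheduler.submit(LeaseRequest("c"))                       # no reservation
scheduler.submit(LeaseRequest("d", runtime_estimate=1, provision_time=20))
print([r.name for r in scheduler.provision(now=10)])
```

In this example, request "b" is provisioned before "a" because its estimated finish time (hour 10) is earlier than that of "a" (hour 15), while "d" stays in the reservation queue until hour 20.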
Listing 2: API for reservations using syntax from Section 3.3, with the extension highlighted in green.
We take a scheduler that does not implement reservations as our baseline and investigate the effects of incorporating reservation capabilities into this scheduler. We utilize real-world workload traces from Google, Azure, and Bitbrains to evaluate the performance. We sample a fraction (reservation ratio) of the trace to reserve. The experiment configurations involve variations in resource utilization and reservation ratio (the proportion of reserved resources compared to the total available resources). The resource utilization levels are set at 75%, 80%, and 85%, and the reservation ratios at 0, 0.5, and 1.0 to observe the impact of reservation programmability. These resource utilizations are common in datacenters with high resource utilization [6]. Metrics collected in the experiment include waiting time (the duration tasks spend in the queue before execution) and slowdown (the decrease in task execution speed).
Figure 5 depicts the Azure trace's waiting time and slowdown under nine different configurations. Slowdown, calculated as the ratio of execution time plus waiting time to execution time, represents the overall task performance. In the Azure trace data, we observe a clear relationship between reservation ratios, waiting times, and slowdowns. Specifically, when the system utilization reaches 85%, the system with reservations has a 43% (35-hour) shorter 50th percentile waiting time than the system without reservations (ratio=0.0 means no reservations). In the same scenario, reservations reduce slowdown by 70% (68 units) compared to not using reservations. However, at a lower utilization of 80%, there is an increase in waiting time of 2.5 hours (50th percentile) and a 12-unit (60th percentile) increase in slowdowns. In the other traces examined, there is no significant impact on the waiting times and slowdowns with varying reservation ratios. This could be due to workload characteristics, resource utilization levels, or the configuration of the scheduling system. Further investigation is needed to determine the underlying reasons for the lack of impact.
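The slowdown metric defined above, (execution time + waiting time) / execution time, can be computed as in the following Python sketch. The function name and the example values are hypothetical and not taken from the traces.

```python
def slowdown(execution_time: float, waiting_time: float) -> float:
    """Slowdown = (execution time + waiting time) / execution time."""
    if execution_time <= 0:
        raise ValueError("execution_time must be positive")
    return (execution_time + waiting_time) / execution_time


# A task that runs for 2 hours after waiting 6 hours has a slowdown of 4.
print(slowdown(execution_time=2.0, waiting_time=6.0))
# A task that never waits has the minimum slowdown of 1.
print(slowdown(execution_time=2.0, waiting_time=0.0))
```

Because the waiting time is divided by the execution time, the same queueing delay penalizes short tasks far more than long ones, which is one reason slowdown depends on the task durations in a trace.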
The results are not as promising for the Google and Bitbrains traces. The Azure trace differs from the other traces as it has a multi-hour task duration. The Google trace has short tasks lasting seconds, and the Bitbrains trace has long jobs lasting weeks. The full analysis for the other traces is available in the technical report.
Key Takeaway: Reservations reduce slowdown by as much as 70% for the Azure trace, but not as much for the other traces. The results are dependent on the durations of the tasks in the trace.

Migration
Goal: We investigate whether offloading migration (to mitigate interference) to container orchestrators running on top of VMs leased from a datacenter scheduler is better than the datacenter scheduler itself performing VM migration. We investigate this for three traces.
Datacenter operators oversubscribe their machines, as tenants often do not utilize all the allocated resources. Oversubscription means allocating more resources to tenants than are physically available. Oversubscription leads to interference between tenants if tenants allocated to the same physical machine fully utilize their allocated resources. In such cases, the datacenter operator can migrate one or more tenants to less utilized physical machines to reduce interference.
Migration has a cost proportional to the size of the VM migrated [16,37]. Therefore, it is efficient to migrate only part of a VM if possible. Nowadays, tenants use container orchestrators (K1 in Figure 6), such as Kubernetes, making partial migration possible. The orchestrator requests resources from the datacenter scheduler.
We propose an extension to datacenter schedulers that enables partial migration by making them aware of the tenants' orchestrators. The key to enabling partial migration is to enable bidirectional communication between the datacenter scheduler and the orchestrator. The orchestrator registers a remote callback with the datacenter scheduler before it requests any VM allocations. The datacenter scheduler uses this callback (4 in Figure 6) to request the orchestrator to migrate (5 in Figure 6) some containers when its monitoring detects interference. In Listing 3, we provide an example of the extension, showcasing the syntax for migrations.
We use three real-world workload traces from Google, Azure, and Bitbrains for our evaluation.
For each trace, we evaluate the impact of migrations at three oversubscription ratios: 3, 4, and 5. An oversubscription ratio of 3 means that each physical CPU was fully available to three tenants. Oversubscription ratios ranging from 3 to 16 are common in datacenters whose users have low utilization [34,43]. We model the cost of migration as the time it takes to migrate the RAM used by the VM/container at a conservative rate of 512 Mbps. The RAM-based cost model and the migration bandwidth are supported by existing literature [37]. Our hypothesis is that migrating a container takes less time than migrating a VM running multiple containers.
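The RAM-based cost model above amounts to a single division. The sketch below assumes that 1 Mbps means 10^6 bits per second and that RAM sizes are given in bytes; the function name and the example sizes are illustrative and not taken from the simulator.

```python
def migration_time_seconds(ram_bytes: float, bandwidth_mbps: float = 512.0) -> float:
    """Time to copy `ram_bytes` of RAM at `bandwidth_mbps` megabits per second.

    Assumption: 1 Mbps = 1e6 bits/s (decimal units).
    """
    if bandwidth_mbps <= 0:
        raise ValueError("bandwidth must be positive")
    bits = ram_bytes * 8
    return bits / (bandwidth_mbps * 1e6)


# Illustrative sizes (not from the traces): a VM using 8 GB of RAM
# versus a single container using 1 GB.
vm_cost = migration_time_seconds(8e9)         # 125.0 seconds
container_cost = migration_time_seconds(1e9)  # 15.625 seconds
```

Under this model the cost is linear in the RAM used, which is why migrating one container out of a VM is expected to be cheaper than migrating the whole VM.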
We simulate 5 Kubernetes clusters simultaneously using the datacenter. We configure the datacenter topology such that the traces run at 85% average utilization. The metrics we use are total workload execution time and packing efficiency. We calculate packing efficiency by summing the CPU utilization of each virtual machine (VM) and dividing it by the total number of VMs. This metric provides insights into how effectively the resources allocated to the VMs were utilized. A higher packing efficiency indicates better utilization of resources, while a lower value suggests potential inefficiencies or underutilization. By analyzing packing efficiency, we can assess the effectiveness of the scheduling mechanisms in optimizing resource allocation and maximizing overall system performance.
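The packing efficiency metric defined above is the mean CPU utilization across VMs. A minimal sketch follows, assuming each utilization is expressed as a fraction between 0 and 1; the function name and sample values are illustrative.

```python
from typing import Sequence


def packing_efficiency(vm_cpu_utilization: Sequence[float]) -> float:
    """Sum of per-VM CPU utilization divided by the number of VMs."""
    if not vm_cpu_utilization:
        raise ValueError("at least one VM is required")
    return sum(vm_cpu_utilization) / len(vm_cpu_utilization)


# Illustrative values: three VMs at 50%, 75%, and 25% CPU utilization.
example = packing_efficiency([0.5, 0.75, 0.25])  # 0.5
```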
Figures 7 and 8 depict the packing efficiency and the total execution time (90th percentile) of the Azure trace under six different configurations, respectively. In the Azure trace, the highest oversubscription ratio of 5.0 achieved a 15% improvement in packing efficiency compared to configurations without the API extension. Additionally, using the API led to improved performance in terms of total time per task. For example, with the highest oversubscription ratio of 5.0, the 90th percentile (P90) of total time per task in the Azure trace was reduced by 81% when container-level migrations were employed.
In the remaining Google and Bitbrains traces, using the API resulted in shorter total time per task, indicating higher performance. The 99th percentile total time per task in the Google trace showed a reduction of 73% (4.4 hours) with the highest oversubscription ratio of 5.0. However, not all configurations yield better performance with container-level migrations: in the Bitbrains trace, no significant improvement in performance is observed, indicating that container-level migrations have minimal impact on this particular trace.

Key Takeaway: Offloading migration to container orchestrators benefited the Azure and the Google traces, but not the Bitbrains trace. The Bitbrains trace differs from the other traces as it has extremely long task durations, with tasks running for weeks.

Metadata access
Goal: We investigate whether providing datacenter schedulers access to additional information about task data accesses and storage subsystem busyness has a performance impact. We analyze the impact using a trace from IBM object storage [21] combined with the compute trace from Google.
Datacenters offer object storage services that enable users to store and retrieve data efficiently. Services like AWS S3 provide a scalable and reliable solution for storing large amounts of data. In the context of data analysis workloads, users often deploy applications that require accessing multiple objects from the storage (1 and 2 in Figure 9). These workloads (e.g., data analytics [4], ML [19]) are often "bag of tasks" workloads, where tasks are executed independently and the objects to read are known in advance. Such workloads benefit from reordering their storage access based on the prevailing resource utilization at the time of access.
Without access to fine-grained information about object placement and load levels, users cannot optimize their data retrieval process. As a result, the workload takes longer to complete. The inefficiencies in object access lead to increased latency, reduced throughput, and decreased overall system performance [40,57].
We propose an extension that empowers users to access object metadata to address this limitation. This extension allows users to make informed decisions regarding the order in which they retrieve data items. By introducing the accessMetadata action in the scheduler's programming model, users can query the metadata for specific object IDs and obtain estimates of retrieval times. The scheduler retrieves this information by monitoring the storage servers (3). This capability enables users to strategically postpone the retrieval of objects from congested storage servers, allowing them to process those objects later when congestion levels have subsided. In Listing 4, we provide an example of the extension, showcasing the syntax for metadata access.

[...] determine the impact of adding metadata access to that scheduler. Our evaluation is based on a combination of real-world workload traces, specifically a trace from Google and an IBM object storage trace [21]. We chose to focus on the Google trace for this experiment due to its availability of detailed information about workflows. We use the interarrival time, duration, and resource usage of tasks from the Google trace. For each task, we associate an object identifier from the IBM trace. We read identifiers from the IBM trace sequentially. This maintains the popularity distribution of object identifiers and their temporal locality. We assume each task reads from distributed storage at 1 Gbps [8]. We simulate a 10-node distributed object storage system, with objects accessed by their identifiers.
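How a client might use the accessMetadata action to postpone reads from congested servers can be sketched as follows. The function `retrieval_order` and the shape of the `access_metadata` callable (object ID in, estimated retrieval time in seconds out) are assumptions made for illustration; the actual syntax of the extension is the one given in Listing 4.

```python
from typing import Callable, Iterable, List


def retrieval_order(
    object_ids: Iterable[str],
    access_metadata: Callable[[str], float],
) -> List[str]:
    """Greedily fetch the object with the lowest estimated retrieval time.

    The estimate is re-queried before every pick, so objects on congested
    servers are postponed until their estimate drops or nothing else is left.
    """
    pending = list(object_ids)
    order: List[str] = []
    while pending:
        next_id = min(pending, key=access_metadata)
        pending.remove(next_id)
        order.append(next_id)
    return order


# Illustrative, static estimates in seconds (hypothetical values).
estimates = {"obj-1": 9.0, "obj-2": 1.5, "obj-3": 4.0}
order = retrieval_order(estimates, estimates.__getitem__)
# order == ["obj-2", "obj-3", "obj-1"]
```

With static estimates this reduces to a sort; re-querying in each round only matters when the estimates change as congestion subsides.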
We analyze the impact by activating and deactivating metadata access while maintaining a fixed workload trace and storage service utilization. The workload trace utilization is set at 80%. We capture two key metrics to evaluate the system's performance: buffer sizes of the object storage service and total workflow times. The buffer sizes provide insights into the waiting line and load balancing across servers. Smaller buffer sizes indicate lower system load and more efficient workload distribution across servers. Additionally, we measure the total time for each workflow, which encompasses both the waiting time and the execution time.
Figure 10 displays the normalized buffer sizes and total execution times of the trace. The results demonstrate that activating the metadata access API reduces buffer sizes within the object storage service by approximately 27% (70 GB), resulting in improved performance. Furthermore, metadata-aware workflow execution reduces total time per workflow, with a 24% (26-hour) decrease in the median value. These findings emphasize the role of metadata access in optimizing object retrievals and enhancing overall performance.

Key Takeaway: The performance improvements observed in reduced buffer sizes and shorter execution times highlight the value of exposing storage metadata using an API.

THREATS TO VALIDITY
The reference architecture we proposed has two main limitations.
First, the reference architecture design is limited to the objects we define. In our reference architecture, we identify only five distinct objects and do not specify sub-objects for each. For example, our Scheduler Resource object does not differentiate between an API that offers VMs or Edge mobile devices. While this is a limitation, we have deliberately chosen to keep our objects at a high level of abstraction to future-proof our architecture. As the types of resources available for scheduling are constantly changing, we believe it is more important to differentiate objects by what they represent at the highest level of abstraction than by their specific content.
However, to fully leverage the power of our reference architecture, it will be necessary to build more specific models that differentiate between schedulers with different requirements. For example, Spark-like schedulers have different scheduling requirements than Kubernetes-like schedulers. These models must differentiate between objects based on their specific content rather than just their highest level of abstraction.
Second, the simulation scenarios we use and the simulator itself are not a replacement for real-world systems. However, the simulator we use, OpenDC, has been validated for VM and container scheduling for the Bitbrains and Azure traces [39]. The storage part of the simulator and the Google trace have not yet been validated. But we do use realistic models for migration [37] and storage accesses [8]. These models, based on measurements from real systems, ensure that our results are indicative of real-world performance.

RELATED WORK
Schopf's multi-stage model of the grid scheduling process [30], the Global Grid Forum [26], and the datacenter scheduler reference architecture [5] offer conceptual models of the internal workings of schedulers. Our work complements these models by specifically addressing the external-facing aspects of scheduling, the programming interface.
Conceptual models of APIs have been proposed for specific computing environments, such as grid computing and cloud computing. Foster et al. presented a reference architecture for grid computing [22], and the National Institute of Standards and Technology (NIST) introduced models for cloud computing [36]. While these models provide valuable guidance for designing APIs in their respective domains, they do not deal with the concrete API needs of schedulers like Spark and Kubernetes, which have unique characteristics and requirements.
Efforts have been made to develop schedulers that combine multiple scheduling abstractions into a single system, such as Ghost [29] and ESCHER [7]. Ghost delegates OS kernel scheduling decisions to users, granting them greater control over the scheduling process. ESCHER allows users to express arbitrary scheduling constraints as resource requirements, enabling fine-grained control over the scheduling process. Apache Beam [23] and CWL [14] allow users to specify a workflow and run it on multiple resource managers. But they do not allow control over the scheduling mechanism apart from simple labels.

CONCLUSION
In this work, we designed a reference architecture for datacenter scheduler APIs (Section 3). Our reference architecture covers APIs implemented in 5 industrial schedulers (Kubernetes, SLURM, Spark, Condor, Airflow) and 15 academic schedulers. We use the reference architecture to identify abstractions not implemented or under-implemented in the five industrial schedulers (Section 4). We find that the industrial schedulers do not implement abstractions for data management, task migration, and autoscaling.
We evaluate the performance impact of missing abstractions related to resource reservation, container migration, and storage metadata access in Section 5. We find a 27% improvement in resource usage and a 24% reduction in median workflow runtime when implementing metadata access, a 15% increase in utilization and an 81% improvement in total execution time per task (90th percentile) for container migrations, and a 43% reduction in waiting times (50th percentile) for reservations.
For future work, we intend to provide a toolkit for users to experiment with different designs using the OpenDC simulator. We also plan to validate our simulations beyond the basic validation with VMs, including validation with containers and storage services.

Figure 1: Performance penalty due to a missing programming abstraction: storage metadata access.
Provision: Lease UserResource<type:app, id:1, runtime:1h> IN SchedulerResource<type:vm, cores:8, cpu-freq:2.4Ghz, memory:32Gb> WHEN Event<day:11, month:12, year:2023>

Figure 5: ECDFs of waiting time and slowdown per task of the Azure trace using the reservation extension. We evaluate the system at different utilization levels and with a different fraction of the trace being reserved in advance (ratio). Ratio 0.0 implies no reservations.

Figure 8: 90th percentile (P90) total runtime per task of the Azure trace using different migration techniques. Each bar represents a different <Oversubscription ratio>/<Migrations API> configuration.

Figure 10: Comparison of buffer sizes in the object storage service (left) and ECDF analysis of total execution times per workflow (right) between the configuration with and without the metadata access API. This uses the Google Compute trace combined with the IBM object storage trace.

Table 1: Description of the actions that compose the reference architecture.
Preempt: Abortion of execution or assignment of a user resource, putting it back in the scheduler queue.
Recover: Recover a task after failure, restart execution, or put it back into the scheduler queue.
Configure scheduler: Configuration of the behavior of the scheduler.

Table 3: Full overview of programming abstraction actions of schedulers mapped to the reference architecture. Legend:

Table 5: Summary of evaluation experiments.

Table 6: Characteristics of the traces used in the experiments.