Exemplary Determination of Cgroups-Based QoS Isolation for a Database Workload

Effective isolation among workloads within a shared and possibly contended compute environment is crucial for industry and academia alike to ensure optimal performance and resource utilization. Modern ecosystems offer a wide range of approaches and solutions to ensure isolation for a multitude of different compute resources. Past experiments have verified the effectiveness of this resource isolation with micro benchmarks. The effectiveness of QoS isolation for intricate workloads beyond micro benchmarks, however, remains an open question. This paper addresses this gap by introducing a specific example involving a database workload isolated using Cgroups from a disruptor contending for CPU resources. Despite the even distribution of CPU isolation limits among the workloads, our findings reveal a significant impact of the disruptor on the QoS of the database workload. To illustrate this, we present a methodology for quantifying this isolation, accompanied by an implementation incorporating essential instrumentation through eBPF. This not only highlights the practical challenges in achieving robust QoS isolation but also emphasizes the need for additional instrumentation and realistic scenarios to comprehensively evaluate and address these challenges.


INTRODUCTION
In the ever-evolving landscape of computing, the paradigm shift toward cloud computing and larger-scaled compute environments has revolutionized the way organizations deploy, manage, and utilize computing resources. Cloud computing, in particular, offers unparalleled scalability, flexibility, and cost-effectiveness, enabling businesses and scientists to work on challenges that were unattainable without it [14,17]. However, the shared nature of resources in such environments introduces inherent challenges, necessitating robust mechanisms to ensure isolation among disparate workloads and tenants. These challenges can be imposed by the desire to consolidate physical hardware, overbooking or overcommitting as a business model, or by misbehaving disruptive tenants acting as "noisy neighbors".
There is a wide range of solutions that aim to solve these isolation challenges. One building block of such solutions is virtualization technology. The options range from classic hypervisor-based implementations over manifold container-based solutions to more recent developments in application sandboxing. Many of them pursue different strategies to achieve adequate isolation; however, they do share some commonalities. A frequently used strategy is the utilization of Cgroups [7,23].
Cgroups are provided by the Linux kernel. They enable an operator to distribute processes into groups and subsequently assign resource limits to those groups. These mechanisms have proven to work very well, specifically when solely observing the isolated and limited resource. However, the patterns of resource usage of real-world applications are often more complex [5]. Their QoS is not necessarily directly dependent on a few distinct resources, as it is a measure of end-to-end performance that inherently involves any number of resources [4,16].
In this paper, we focus on Cgroup-based CPU isolation. For this specific case, we investigate whether one tenant's QoS is impacted by another disrupting tenant, even though their CPU limits are evenly shared with no overbooking in place. With this, we aim to answer the following two research questions:
RQ 1 (Isolation measurement). How can QoS isolation be challenged in complex deterministic scenarios?
RQ 2 (Cgroup sufficiency). Is Cgroup-based isolation enough for reliable QoS isolation?
Answering these questions, we provide several contributions. First, we provide a strategy to measure isolation between two tenants considering the impact on QoS. Second, we suggest metrics that quantify the degree of isolation. Finally, we provide a tool developed for this work that enables low-overhead instrumentation of compute resources for isolated process trees.
The remainder of this paper is structured as follows. In section 2 we discuss the fundamentals of this work. This includes eBPF profiling, Cgroups, and a discussion of isolation and its quantification for QoS. This is followed by a description of the applied methodology in section 3, which lays the foundation for the answer to RQ 1. The methodology is followed by important details of the implementation in section 4, which gives a brief overview of the technologies involved in the experimental setup and the workflow of the measured scenarios. Section 5 discusses the observations and, in the process, answers RQ 2. We close with a review of related work in section 6 and a final summary in section 7.

BACKGROUND
This section describes important background aspects for the subsequent progression of this work. This includes low-overhead instrumentation, Cgroups, and isolation considerations.

Linux Profiling with eBPF
The Linux profiling subsystem efficiently gathers and collects performance data, enabling developers and operators to pinpoint and enhance resource utilization patterns. Retrieving this data comes with a performance penalty that depends on the method of accessing it.
eBPF facilitates the execution of verified code within a dedicated Virtual Machine (VM) integrated in the Linux kernel, extending the capabilities of the original Berkeley Packet Filter (BPF) [13]. Beyond executing functions upon receiving network packets, eBPF can observe and respond to various event sources as part of the Linux profiling subsystem, including Performance Monitoring Counters (PMCs), tracepoints, and both kernel and user functions.
Although these events are not technically part of eBPF, it provides an accessible means of leveraging them. Specifically, the instrumentation and processing of profiling data directly within the kernel space can reduce instrumentation overhead, since frequent interactions between kernel and userspace are kept to a minimum.
The typical lifecycle of an eBPF program is depicted in fig. 1, as presented by Gregg [6]. A typical first step is the (i) generation of BPF byte-code by arbitrary eBPF tooling. Upon this generation, the byte-code is (ii) loaded into the kernel for a verifying step before being passed to the eBPF VM. For exchanging data between kernel and userspace, the (iii) perf_output and (iii) async read channels can be utilized. Within the scope of this work, we employ instrumentation on (a) Tracepoints. Tracepoints are static points of kernel instrumentation [19], established and implemented by kernel developers to trigger an event upon a specific call. They also incorporate hardware-specific counters, such as CPU cycles per core since boot time.
eBPF-based instrumentation is naturally tightly coupled with the currently loaded Linux kernel. The BPF Type Format (BTF) aims to improve the portability of eBPF-based tools by providing a metadata format, which encodes debug information related to the functions and structures of the kernel referenced in the eBPF programs. The profiling tool trac, developed during this work, utilizes this format. The tool itself is still under development.
This section only briefly outlines eBPF and Linux profiling, with a more detailed exposition available in the previous work of fellow authors [2,20].

Cgroups
Control groups are a Linux feature that enables precise control over the utilization of various system resources [8]. The Linux kernel ensures that the processes assigned to such a group adhere to the limits specified for the Cgroup. Additionally, Cgroups can be unique, shared, and nested, essentially creating a hierarchical structure.
Cgroups offer powerful measures to control, limit, and possibly isolate resources. Used in conjunction with namespaces, they act as an essential enabler for virtualization, particularly in the context of container virtualization [20].
The Cgroups project underwent a significant restructuring effort, resulting in the recent release of Cgroups v2. This effort was first merged into the kernel with version 4.5 and has been able to fully replace v1 since kernel version 5.6 [3]. At the time of writing, the list of Cgroup controllers includes (i) cpu, (ii) cpuset, (iii) freezer, (iv) hugetlb, (v) io, (vi) memory, (vii) perf, (viii) pids, and (ix) rdma.
This work focuses on the usage of the (i) cpu controller implemented by Cgroups v2. It enables setting a limit on the number of CPU cycles per second.
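To make the limit concrete, the following is a minimal sketch, not the configuration used in this work. In Cgroups v2 the cpu controller expresses such a limit in the `cpu.max` file as a quota of CPU time per period, both in microseconds. The helper names, the group path, and the example of 32 logical CPUs split in half are assumptions for illustration.

```python
from pathlib import Path

# Default period of the cgroup v2 cpu controller: 100 ms, given in microseconds.
PERIOD_US = 100_000


def cpu_max_value(logical_cpus: int, share: float, period_us: int = PERIOD_US) -> str:
    """Build the content of cpu.max that grants `share` of `logical_cpus` CPUs."""
    if logical_cpus < 1:
        raise ValueError("logical_cpus must be at least 1")
    if not 0 < share <= 1:
        raise ValueError("share must be in (0, 1]")
    # The quota is the CPU time the whole group may consume within one period.
    quota_us = round(logical_cpus * share * period_us)
    return f"{quota_us} {period_us}"


def apply_limit(cgroup_dir: Path, value: str) -> None:
    """Write the limit into the group's cpu.max file (needs sufficient privileges)."""
    (cgroup_dir / "cpu.max").write_text(value + "\n")


if __name__ == "__main__":
    # Hypothetical example: one of two groups that evenly share 32 logical CPUs.
    print(cpu_max_value(32, 0.5))
```

Two groups configured this way together account for exactly the capacity of the machine, which corresponds to an even split without overbooking.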

Isolation Terminology
Isolation is a condition that occurs when two workloads share a resource and compete for it. The degree to which they interfere with each other characterizes isolation. If their influence on each other is pronounced, the isolation is considered low, and vice versa. This concept is discussed by several authors [10,12,22]. This study follows the definition of isolation provided by Krebs et al. [10]:
Definition 1 (Isolation). Performance isolation is the ability of a system to ensure that tenants working within their assigned quota (i.e., abiding tenants) will not suffer performance degradation due to other tenants exceeding their quotas (i.e., disruptive tenants).
In a similar context, particularly in cloud computing, the term "noisy neighbor" is often used in related literature. This term refers to a disruptive tenant that adversely affects another tenant. According to the definition provided by Longbottom [11], a noisy neighbor is described as follows:
Definition 2 (Noisy Neighbor). A workload within a shared environment is utilizing one or more resources in a way that it impacts other workloads operating around it.

QoS Isolation Quantification
Performance degradation is a measure of how strongly an abiding workload $W_a$ is affected by a disruptive workload $W_d$. It can be determined as the "performance loss rate" $P_l$ [9,12,18,22], referred to as eq. (1).
Here, $P_1$ represents a workload in an undisrupted environment, whereas $P_2$ represents the same workload impacted by a disruptive workload $W_d$.
Krebs et al. extend this simple model with one specifically targeted at QoS isolation determination [10]. We apply and slightly adapt this model to fit our measured parameters.
Taking eq. (1) as a basis, we can determine the actual performance ratios $q_{W_a}$ and $q_{W_d}$ by calculating $1 - P_l$. This leads to the simplified eq. (2) and eq. (3).
Using eq. (2) and eq. (3), we can then determine the remaining relative performance $\rho$ at a certain performance ratio $q_{W_d}$ of the disruptive workload as $\rho(q_{W_d})$, referred to as eq. (4). As Krebs et al. further state, values of this kind represent only a distinct point at which the disruption has a specific degree. To address this, we can try to reduce the resulting series of eq. (4) to a single isolation metric $I$.
An approach is to limit the number of samples of $q_{W_d}$ to $n$ equidistant points and subsequently compute the arithmetic mean of $\rho$ over them, which yields the metric $I_{avg}$ of eq. (5). As this likely neglects the maximum amount of degradation, we can further derive another metric that describes the maximum isolation impact, $I_{max}$, referred to as eq. (6). Naturally, employing either eq. (5) or eq. (6) might overlook the inherent curve of $\rho$, potentially introducing a bias to the outcome. Further considerations on deriving a metric that avoids this are left for future work.
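As a sketch under stated assumptions, the quantities described in this section can be written as follows. The notation is assumed wherever the text does not fix it: $P_1$ and $P_2$ denote the undisrupted and the disrupted measurement, $P_l$ the performance loss rate, $W_a$ and $W_d$ the abiding and the disruptive workload, and $q_1, \dots, q_n$ the $n$ equidistant samples of $q_{W_d}$. For the disruptive workload, taking $P_1^{W_d}$ as its consumption at the full CPU limit makes $q_{W_d}$ the fraction of the limit in use; this reading is an assumption. Expressing the maximum isolation impact as the lowest remaining relative performance is likewise an assumption, chosen because lower values of $\rho$ indicate worse isolation.

```latex
\begin{align}
P_l &= \frac{P_1 - P_2}{P_1} \\
q_{W_a} &= 1 - P_l^{W_a} = \frac{P_2^{W_a}}{P_1^{W_a}} \\
q_{W_d} &= 1 - P_l^{W_d} = \frac{P_2^{W_d}}{P_1^{W_d}} \\
\rho(q_{W_d}) &= q_{W_a} \\
I_{avg} &= \frac{1}{n} \sum_{i=1}^{n} \rho(q_i) \\
I_{max} &= \min_{1 \le i \le n} \rho(q_i)
\end{align}
```

With this reading, $I_{avg}$ and $I_{max}$ both lie in $[0, 1]$, and $I_{max} \le I_{avg}$ always holds.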

METHOD
This section presents the method behind the conducted experiments and thus elaborates on the scenarios, instrumentation, and isolation quantification. These aspects are adapted from previous work [20,21].
Goal. As mentioned in section 1, we aim to measure the Cgroup QoS isolation for the CPU resource. According to section 2.4, we need at least two distinct measurements to analyze the isolation capability of a technology. One is the reference workload in an uncontended environment, and the other is the same workload under contention.
Scenarios. As we are interested in whether QoS-based isolation is as high as the specific isolation for a certain resource, we choose an appropriate isolation scenario. Earlier work has shown that resource-specific isolation is high for fairly distributed resources where no overbooking, overcommitting, or aggressive resource stealing happens [20,21]. Volpert et al. show that this is particularly true for the "harmony" scenario.
Therefore, this work analyzes the isolation of two scenarios: (i) baseline and (ii) harmony. These are itemized in table 1. Here, $W_a$ and $W_d$ describe the abiding and the disruptive workload, each performed within its respective imposed limit. $W_a$ is considered static in both scenarios and is instrumented regarding its consumed resources and QoS status. It is further supposed to resemble a realistic workload and is thus realized as a macro or synthetic benchmark [9]. For the (ii) harmony scenario, $W_d$ gradually increases over time and is also instrumented for its consumed resources. Its purpose is to specifically stress the single resource that is being isolated.
Instrumentation. Again, the resource instrumentation approach follows the principles outlined in previous work by the authors [20,21]. In summary, it is independent of isolation technology and performed outside of the isolation group. This is achieved with eBPF.
Isolation quantification. In section 2.4, we introduce and briefly examine metrics for quantifying QoS isolation. Utilizing eBPF and the QoS metrics reported by the abiding workload $W_a$, we can quantify the isolation at specific degrees of contention caused by the disruptive workload $W_d$.

EXPERIMENT DESIGN
In this section, we describe the abstract workflows of the experiments. These are followed by a presentation and reasoning behind the choices for the tools and instrumentation points selected.

Experiment workflow
As described in section 3, the experimental workflow follows two scenarios. The execution of a scenario is highlighted in fig. 2. The process begins with (i) the initialization of an isolation group. Next, (ii) load is generated by the abiding workload $W_a$ and the disruptive workload $W_d$. The (iii) profiling process on the host system is initiated concurrently. This profiling monitors the isolation groups. Upon completion, data is (iv) collected and (v) stored on external storage.
Each run takes 5 minutes and is repeated 3 times. After each run, the physical systems are fully reset and pruned to avoid unintended side effects from residue of past experiments and to improve reproducibility.

Approach and Implementation
The following briefly describes how the method from section 3 is realized.
Load generation. As stated in section 3, we need two distinct workloads, the abiding workload $W_a$ and the disruptive workload $W_d$. $W_a$ is supposed to act as a realistic workload. Here, we choose to run a YCSB benchmark on a remote host against a Postgres database [1]. The throughput in operations per second, and thus the QoS of workload $W_a$, is determined for this database. For the sake of simplicity, we choose an insert-only workload stressing the database for 5 minutes. In those 5 minutes, YCSB tries to execute 100,000,000 inserts of 500 bytes with 90 threads. After its run-time, it reports a list of all operations with timestamp and latency. Operations per second can be derived by resampling to a desired frequency and counting the operations. These operations per second are considered to be the QoS metric of $W_a$.
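The resampling step can be sketched as follows. This is a minimal illustration, not the evaluation code of this work; it assumes that each operation is available as a completion timestamp in milliseconds, and the function name and input layout are hypothetical.

```python
from collections import Counter


def operations_per_second(timestamps_ms: list[int]) -> dict[int, int]:
    """Count operations per one-second bucket, relative to the first operation."""
    if not timestamps_ms:
        return {}
    start = min(timestamps_ms)
    # Integer division maps every timestamp to the second it falls into.
    counts = Counter((ts - start) // 1000 for ts in timestamps_ms)
    # Report seconds without any operation as zero so that stalls stay visible.
    last_second = max(counts)
    return {second: counts.get(second, 0) for second in range(last_second + 1)}
```

Keeping empty buckets as explicit zeros matters for a QoS metric, since a second without any completed insert is exactly the kind of degradation the measurement is meant to reveal.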
The Postgres database is continually instrumented with respect to its CPU cycles and operations per second. Its configuration is generated by PGtune, optimizing Postgres for half of the total resources available on the physical server described in section 5.1 [15].
The disruptive workload $W_d$ is considered to be a micro benchmark that continuously stresses the CPU. We use the stress-ng implementation to realize that load. It is set up such that it increases its utilization over time, until it fully utilizes its granted resources. To achieve a linear load generation behavior, we partition this load generation into multiple intervals with configurable resolution.
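One way to picture this partitioning is the sketch below. It is an illustration under assumptions, not the scheduler used in this work: the number of intervals, the run length, and the worker count are hypothetical, and the command line relies on the stress-ng options `--cpu`, `--cpu-load`, and `--timeout`.

```python
def load_steps(total_seconds: int, intervals: int, max_load: int = 100) -> list[tuple[int, int]]:
    """Split a run into equal intervals with linearly increasing CPU load (percent)."""
    if intervals < 1 or total_seconds < intervals:
        raise ValueError("need at least one interval of at least one second")
    step_seconds = total_seconds // intervals
    # Interval i (counted from 1) runs at i/intervals of the maximum load.
    return [
        (step_seconds, round(max_load * (i + 1) / intervals))
        for i in range(intervals)
    ]


def stress_ng_command(workers: int, seconds: int, load_percent: int) -> list[str]:
    """Build the argument list for one interval of CPU stress."""
    return [
        "stress-ng",
        "--cpu", str(workers),
        "--cpu-load", str(load_percent),
        "--timeout", f"{seconds}s",
    ]


if __name__ == "__main__":
    # Hypothetical ramp: 200 seconds in 4 steps on 16 workers.
    for seconds, load in load_steps(200, 4):
        print(" ".join(stress_ng_command(16, seconds, load)))
```

A higher number of intervals brings the stepwise ramp closer to a linear one, at the cost of restarting the stressor more often.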
Assuming ideal isolation, the measures of both workloads resemble a progression, as illustrated in fig. 3.
Instrumentation. Since we focus on CPU isolation, we select an instrumentation point as outlined in section 2.1 that gives a detailed view on CPU utilization. Modern CPUs provide hardware-based counters that report the cycles that are executed on each core. The progression of the counters over time, along with the maximum number of possible cycles per core, can be used to derive CPU utilization.
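The derivation of utilization from cycle counters can be sketched as follows. The sketch assumes cumulative per-core counter readings taken at known timestamps and approximates the maximum number of cycles per second by a fixed clock frequency; the function name and sample layout are hypothetical.

```python
def cpu_utilization(
    samples: list[tuple[float, int]], max_cycles_per_second: float
) -> list[float]:
    """Derive per-interval utilization of one core from cumulative cycle counts.

    `samples` holds (timestamp in seconds, cumulative cycles) pairs sorted by time.
    """
    utilization = []
    for (t0, c0), (t1, c1) in zip(samples, samples[1:]):
        elapsed = t1 - t0
        if elapsed <= 0:
            raise ValueError("timestamps must be strictly increasing")
        # Cycles consumed in the interval relative to the cycles that were possible.
        utilization.append((c1 - c0) / (elapsed * max_cycles_per_second))
    return utilization
```

For example, a core that can execute 2.4e9 cycles per second and advances its counter by 1.2e9 cycles within one second was utilized at 50% during that interval.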
In order to keep the instrumentation overhead as low as possible, we opt to utilize eBPF instrumentation. This allows us to gain fine-grained control over the sampling frequency and efficient profiling inside the kernel space. To leverage eBPF instrumentation, we built a profiling tool named "trac".
Trac can be attached to a root process. This root process and any process invoked by it are subsequently instrumented for either CPU cycles, resident memory, disk I/O, or network I/O. The gathering of those metrics happens inside the kernel space, where they are collected in a data structure called a BPF map. After profiling, these maps can be accessed by the user-space counterpart of the profiling tool. The collected data are processed and presented as CSV time series with a resolution of up to 1.
Isolation. To isolate processes with Cgroups, we leverage the isolation tool "nsJail". NsJail is a Linux process isolation tool that utilizes the Linux namespace subsystem, Cgroup resource limits, and seccomp-bpf syscall filters to achieve process isolation.
In particular, we use the tool's Cgroup capabilities to isolate the stress-ng CPU load generator, as well as the Postgres database.
The stress-ng CPU load generator itself does not utilize other system resources such as memory, disk I/O, and network I/O. As a consequence, we do not isolate these between workloads. Moreover, memory, disk I/O, and network I/O utilized by the Postgres database are negligible for the configuration and workload applied.

EVALUATION
This section systematically discusses the results of the evaluation outlined in the sections before.

Evaluation Environment
The experimental configuration encompasses a pair of physical servers, symmetrically arranged and equipped with identical components. Both servers feature two Intel CPUs, specifically the "Intel(R) Xeon(R) CPU E5-2630 v3", operating at a base clock frequency of 2.40 GHz with 32 logical cores in total. Memory associated with these CPUs totals 16 × 16 GiB = 256 GiB of DDR4 memory clocked at 2133 MHz. The physical storage disk for the database state is a Samsung SM843TN, which exhibits an Input Output Operations Per Second (IOPS) performance of 15,000 for "random write" operations.
Communication for the actual workload between all nodes is separated and facilitated by a Mellanox Technologies Network Interface Card (NIC) from the "MT27800 ConnectX-5" family, capable of a network throughput of 50 Gbit/s.
Figure 4 visualizes the interaction between the pair of physical servers mentioned above. Here, YCSB is responsible for generating and controlling the load on the abiding workload $W_a$ (Postgres) from a remote host. We do so to limit possible interference on $W_a$ by the load generated by YCSB. The latter should not be accounted for, as it would act as a "noisy neighbor". The disruptive workload $W_d$ generates its own workload and is not externally controlled. The complete experiment set-up includes additional auxiliary servers responsible for workflow automation. Notable involved software components are itemized in table 2.

Results
In the following, we iteratively discuss the results of the scenarios presented in table 1. Each scenario is represented by a plot. For the actual isolation metric determination, we present an additional graph highlighting the impact on the QoS isolation of the abiding workload $W_a$ at every observed degree of stress imposed by the disruptive workload $W_d$.
Figure 5 visualizes the baseline scenario. The y-axis shows the abiding workload $W_a$ in operations per second at a given interval in seconds. As described above, this information is provided by YCSB. As each experiment is repeated multiple times and actual operations per second are volatile, we adapted the visualization accordingly. Every measured data point is plotted as a small circle, resulting in a scatter plot. An overlay as a smoothed thicker line highlights the trend of those data points. The smoothing algorithm applied implements the Locally Estimated Scatterplot Smoothing (LOESS) method. This results in a trend for this baseline graph that settles roughly at 175,000 operations per second.
The visualization method for the abiding workload $W_a$ in fig. 6 follows the same principle. However, the visualization of the disruptive workload $W_d$ does not apply said algorithm. Instead, it plots an overlay as the mean.
Adding this disruptive workload $W_d$ has a significant impact on the behavior of $W_a$. After an initial pause of 100 seconds, we can see an immediate degradation of the operations per second of $W_a$. This gets worse as $W_d$ reaches its full utilization and results in a degradation of $W_a$ of almost 50%.
Most importantly, neither workload ever exceeds 50% of the physical system capacity, as defined by its assigned CPU cycle limit. This means that the isolation of CPU cycles works well in the sense that no workload is able to exceed its limit. This is in direct conflict with the 50% QoS degradation observed. It is evident that, even under a harmonic split of the seemingly available total CPU cycles, the workloads can affect each other's performance.
A more detailed visualization with a specific focus on the impact on isolation is presented in fig. 7. Here, the x- and y-axes represent the relative degradation ratio of the workloads as defined in section 2.4, with the dimension of time completely removed. Therefore, this graph represents the remaining relative performance $\rho(q_{W_d})$ for every performance ratio $q_{W_d}$ of the disruptive workload. Again, because of the volatile nature of the measured points, we present the graph as a trend overlay over a scatter plot. Here, we can see a slight change in the degree of degradation above 50% of $q_{W_d}$. What is also easily visible here is that good isolation between the abiding and the disruptive workload is represented by a higher value, while worse isolation is represented by a lower value within the interval of [0, 1]. In table 3, discrete interesting values of fig. 7 are presented. Taking into account the equidistant $q_{W_d}$ values in the interval [0.1, 0.9] of this table results in $I_{avg} = 0.83$ for eq. (5). Furthermore, we can also calculate $I_{max} = 0.60$. These values are naturally different from each other, as they describe different properties of the isolation function $\rho$.
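How the two metrics follow from a set of sampled points can be sketched as below. The sample values are made up for illustration and are not the values of table 3. The sketch further assumes that the maximum isolation impact corresponds to the lowest remaining relative performance among the samples, which fits the convention that lower values indicate worse isolation.

```python
from statistics import mean


def isolation_metrics(rho_by_q: dict[float, float]) -> tuple[float, float]:
    """Return (I_avg, I_max) for remaining relative performance sampled per q."""
    if not rho_by_q:
        raise ValueError("at least one sample is required")
    values = list(rho_by_q.values())
    # I_avg: arithmetic mean over the equidistant samples.
    # I_max: strongest impact, assumed here to be the lowest remaining performance.
    return mean(values), min(values)


if __name__ == "__main__":
    # Made-up samples: disruptor ratio q mapped to remaining relative performance.
    illustrative = {0.25: 1.0, 0.5: 0.9, 0.75: 0.5}
    i_avg, i_max = isolation_metrics(illustrative)
    print(f"I_avg = {i_avg:.2f}, I_max = {i_max:.2f}")
```

Because the mean smooths over the worst sample, the average metric can never fall below the maximum-impact metric, which matches the relation between the two values reported above.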
The observations above lead to the following interpretation.
Interpretation. The reasons why the disruptive workload $W_d$ can have such a large impact on the abiding workload $W_a$, even though they should not impact each other, can be manifold. However, two aspects seem to play an important role here. The system we execute our experiment on features two hyperthreading-enabled CPUs. Theoretically speaking, they are able to fully utilize all logical cores with maximum cycles if the workload fits. This was observed in previous work of the authors [20]. However, the QoS of the workload decreases significantly when the actual physical cores are fully utilized. This assumption is indicated by the slight change of slope in fig. 7 at $q_{W_d} \approx 50\%$. Although more cycles could be utilized by the respective workloads while staying within their cycle limit, they are not able to use them to maintain their QoS.
Another limiting factor could be the saturation of resources that are only loosely related to CPU cycles. This could be due to the overhead induced by process scheduling. Thus, CPU cycles could be increasingly reserved for such essential tasks, leading to even more starvation of the abiding workload $W_a$.

RELATED WORK
A prevalent method for assessing the isolation capability involves calculating the performance loss rate as outlined in eq. (1). In line with this approach, previous studies commonly determine this on a per-resource basis [12,18,22,23]. We extend these findings with considerations regarding QoS.
Silva et al. reviewed the effectiveness of resource isolation for QoS isolation in the past [16]. They state that providing QoS for application performance requires more than just guaranteeing a certain allocation of CPU, memory, or I/O resources. We support their findings for the more recent Cgroups v2 and extend them with further measurements and an isolation quantification model.

CONCLUSION
Over the course of this work, we designed and implemented a sophisticated experimental setup that allowed us to execute two workloads against each other in order to measure their isolation from each other. We have deliberately chosen a very specific scenario, where a synthetic "abiding" database under constant workload competes against a "disruptive" stressor that utilizes the CPU as much as possible.
We determine that those two workloads influence each other even when their CPU limits are evenly shared across the available resources without any overbooking. Neither workload exceeds its limit, but the impact on the QoS of the abiding database is clearly visible.
As a consequence, we can see that mere CPU isolation is insufficient for workloads that are more complex than micro-benchmarks trying to exceed their limits. Aspects like hyper-threading and CPU scheduling overhead are CPU-related resources that are not isolated as might be expected. Applying these findings to real-world scenarios requires in situ system tests to determine the actual impact on QoS when co-locating tenants.
The results presented in this work can be considered a first preliminary step towards more effective QoS isolation. From this point on, we see various possible future directions. One is the configuration of stricter isolation environments with limited hyper-threading and possibly the system call filtering mechanisms of sandboxes. Another direction could be the improvement of instrumentation to pinpoint the actual saturated resource resulting in a drop in QoS. Lastly, those considerations could be repeated for other Cgroup controllers, different isolation technologies, or other workloads.

Figure 2: Flow of an abstract measurement (steps: Spawn, Execute, Acquire, Store)

Figure 4: Workload generation and controlling across hosts

Table 2: Software version list

Table 3: Isolation metrics comparison