Pushing Performance Isolation Boundaries into Application with pBox

Modern applications are highly concurrent with a diverse mix of activities. One activity can adversely impact the performance of other activities in an application, leading to intra-application interference. Providing fine-grained performance isolation is desirable. Unfortunately, the extensive performance isolation solutions today focus on mitigating coarse-grained interference among multiple applications. They cannot adequately address intra-app interference, because such issues are typically not caused by contention on hardware resources. This paper presents an abstraction called pBox for developers to systematically achieve strong performance isolation within an application. Our insight is that intra-app interference involves application-level virtual resources, which are often invisible to the OS. We define pBox APIs that allow an application to inform the OS about a few general types of state events. Leveraging this information, we design algorithms that effectively predict imminent interference and carefully apply penalties to the noisy pBoxes to achieve a specified isolation goal. We apply pBox to five large applications. We evaluate the pBox-enhanced applications with 16 real-world performance interference cases. pBox successfully mitigates 15 cases, with an average of 86.3% reduction of the interference.


Introduction
Applications in production demand strong performance isolation: the ability to maintain consistent and predictable performance despite potential sources of interference.
Extensive research [10,11,26,47,58,64,68,71,78] has focused on achieving performance isolation among multiple applications running on the same server. These solutions broadly fall into two categories: (1) partitioning hardware resources [10,47,48], and (2) dynamically adjusting CPU core assignments [18,26]. They can mitigate interference between applications because the interference is caused by direct contention on hardware resources, e.g., a batch job overuses CPU or network and causes a slowdown to a latency-critical job.
What receives less attention is performance isolation within an application, which ensures that distinct activities in an application do not adversely affect each other's performance.
Providing fine-grained performance isolation is increasingly desired by users [3,23] and developers [19,72]. Modern applications have a high degree of concurrency with a diverse mix of activities, such as one thread to handle each request and various background tasks, making them susceptible to intra-application interference. For example, in processing a query from a client, a thread overuses the UNDO log defined in the application, significantly slowing down another client's requests (Figure 1). Such issues lead to unpredictable performance and poor user experience. They cannot be well addressed by adjusting hardware resources like CPU cores. Indeed, they can occur when hardware resources are sufficient.
The lack of principled solutions for fine-grained performance isolation forces developers to rely on ad-hoc code, such as splitting data structures, inserting timeouts, and tuning concurrency levels, which is not only difficult and time-consuming to implement, but also ineffective. Intra-app interference is often triggered by complex interactions among activities, which are difficult to anticipate during coding. There are also many program points that can suffer from interference, so it is almost infeasible to insert isolation code everywhere.
An alternative strategy is using resource quotas. The resource container OS abstraction [7] facilitates accurate accounting of resources consumed by an application activity, e.g., functions associated with handling a request. Linux control group [53] supports thread-level resource control. However, production applications exhibit fluctuating resource usage, making it difficult to decide on a suitable quota.
To address the current gaps, this paper proposes an OS abstraction called pBox that allows developers to systematically and conveniently achieve performance isolation within an application. pBox does not enforce resource quotas. Instead, it focuses on the ultimate objective of reducing interference. Developers add pBox creation code in the application activity boundaries and specify a high-level isolation goal. At runtime, the kernel monitors if any pBox's isolation goal is in danger of being violated, and reacts to satisfy the goal.
The design of pBox is informed by our observation that intra-app interference involves application-level virtual resources, such as shared buffers, queues, tickets, and logs. In contrast to hardware resources directly managed by the OS, virtual resources are usually invisible to the OS and exhibit diverse representations. Moreover, applying resource reallocation, the common approach to mitigating interference, poses challenges in this context. Reallocating virtual resources at the system level is non-trivial and can cause side effects to applications.
Fine-grained performance isolation thus requires coordination between the OS and the application, but how can this coordination accommodate the wide variety in virtual resources and their usage among different applications? Through analyzing real-world intra-app interference issues, our insight is that despite their variety, they can be reduced to a small set of what we call state events. By exposing these events, it is feasible for the OS to recognize and mitigate interference effectively and safely.
Based on this insight, we design a few general pBox APIs for an application to communicate its state events to the kernel. A kernel manager leverages the state events and other information to provide performance isolation at pBox granularity.
At the algorithmic level, we address two key challenges. First, the pBox manager needs to proactively detect imminent interference. Compared to current cross-app isolation solutions that reactively detect interference from the overall SLO metrics, we face a stricter requirement. This is because our performance isolation targets a finer granularity, namely each pBox. Intra-app interference may also occur among a few activities, thereby escaping SLO monitors. Also importantly, since we cannot reclaim a contended virtual resource, we need to detect interference early (ideally before it occurs). This early detection is crucial to minimize a noisy pBox's impact.
To tackle this challenge, we design an algorithm that uses a worst-case style analysis to predict whether the specified isolation goal of any pBox might be violated. If so, the algorithm additionally identifies the victim and noisy pBoxes.
The second challenge is taking effective action. For safety, the pBox manager does not reallocate virtual resources. It instead simply applies a delay penalty to the noisy pBox. Because we can detect imminent interference early, the penalty typically can prevent the noisy pBox from causing more severe contention. We design an algorithm that uses an adaptive penalty length and carefully chooses the penalty timing.
Since pBox is activated during regular execution, it should not incur significant overhead. We design the pBox detection and prediction algorithms to be lightweight yet effective. We delegate some pre-processing of the application state events to a user-level library and minimize the kernel boundary crossings. We also track certain application state events when the application makes regular system calls.
Like any OS abstraction, using pBox in application code requires developers' involvement. We design the pBox APIs to be intuitive and support typical application architectures. Our focus on high-level isolation goals relieves developers of specifying hard-to-set resource quotas or reasoning about the complex relationship between virtual resources and end-to-end performance. Developers do need to annotate the state events of a virtual resource. In our experience, such efforts are moderate. We also design a static analyzer that can automatically identify many of the state events in a codebase.
We implement pBox in the Linux kernel 5.4 along with a user-level library. For evaluation, we choose five large server applications (MySQL, Apache, PostgreSQL, Varnish, and Memcached) and integrate pBox APIs into these complex codebases without significant effort. To test the performance isolation capabilities of the added pBox code, we reproduce 16 real-world intra-application performance interference issues in the five applications. pBox reduces the performance interference for 15 cases, by an average of 86.3% and up to 113.6%. We compare with four state-of-the-art solutions [10,18,48,53]. They at best only reduce the performance interference for five cases by 38.8% on average, and would make the interference worse in the majority of the cases.
This paper makes the following contributions:
• We propose pBox, an abstraction that pushes the performance isolation boundaries into an application to address the intra-app interference issues facing modern applications.
• We address several design challenges, including identifying a small set of general state events to support diverse virtual resources, and designing algorithms to proactively detect imminent interference and take effective actions.
• We implement pBox in the Linux kernel and a library, along with a companion analyzer. We show pBox's effectiveness on real intra-app interference issues in large applications.

Background and Motivation
Intra-app interference refers to an application activity experiencing severe performance issues due to some other independent activity in that application. We discuss three real issues from MySQL to show the characteristics of this problem.
Although client A does not hold an exclusive table lock and thus would not block client B, B's latencies are still impacted when A commits transactions. As Figure 1 shows, 10 seconds after client A joins, client B's latencies increase by about 4×.
The source of this interference is a virtual resource: the UNDO log. The write queries result in the rapid growth of the UNDO log, which increases the cleaning cost. In turn, read queries are severely impacted because the UNDO log is frequently held by the purge thread (iterating log entries).
Case 2: Buffer Pool. MySQL keeps a buffer pool to cache the accessed table and index data. While it generally improves performance, as reported by users [66], a backup task using mysqldump can use many blocks in the buffer pool and cause severe interference to other activities.
To reproduce this case, we create a small table (200 MB) that fits in the InnoDB buffer pool (512 MB), and a larger table (4 GB) that does not fit in the buffer pool. We run four clients with uniform sysbench [42] OLTP on the small table and, at time 30 seconds, we run a background mysqldump task on the second table. As Figure 2 shows, the throughput of the four clients is initially around 300 req/sec, but the interference from the backup task causes their throughput to drop by 10×.
The virtual resources involved in the interference are the buffer pool and its free blocks. When the dump activity takes many blocks from the buffer pool, it forces the activities of the four clients to frequently evict their old pages, which in turn leads to additional I/O costs for their requests.
Case 3: Tickets. MySQL uses separate threads in its InnoDB engine to process requests from client connections. To minimize context switches, it limits the number of concurrent threads by the innodb_thread_concurrency parameter. While such a design is justified, it can cause performance interference among client connections as reported by users [57,74,76].
To reproduce this issue, we create a database with 5 tables (10 records per table). The thread concurrency is set to 4. We run three clients performing write-intensive workloads and one client performing read-intensive workloads. After around 90 seconds, a fifth client joins and issues write-intensive queries. Each client only queries one dedicated table. Figure 3 shows the latencies of client 4 (executing the read-intensive workload). In the first 90 seconds, this client's average request latency is around 0.3 ms. When the fifth client connects, even though it operates on a different table, the latency of the fourth client increases to around 0.9 ms, which is 3× slower than the non-interference case.
The interference involves two virtual resources: an integer n_active and tickets. If a thread tries to enter InnoDB, it checks whether the number of threads inside InnoDB has reached the concurrency limit, by comparing the integer n_active with the innodb_thread_concurrency parameter. If so, it needs to wait and check again. Otherwise, n_active is incremented and the thread is given a number of tickets. The thread can then enter and leave InnoDB freely until the tickets are used up.
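The admission scheme in Case 3 can be sketched as a small single-threaded model. This is an illustration only, not MySQL code: the class name TicketGate, the method try_enter, and the choice to release the slot as soon as the last ticket is consumed are all assumptions made for brevity. Only n_active and the concurrency limit come from the description above.

```python
class TicketGate:
    """Toy model of the ticket-based admission described above (hypothetical, not MySQL code)."""

    def __init__(self, concurrency_limit, tickets_per_entry):
        self.limit = concurrency_limit          # plays the role of innodb_thread_concurrency
        self.tickets_per_entry = tickets_per_entry
        self.n_active = 0                       # threads currently admitted
        self.tickets = {}                       # thread id -> remaining tickets

    def try_enter(self, tid):
        """Return True if the thread may enter now, False if it must wait and retry."""
        if tid not in self.tickets:
            if self.n_active >= self.limit:
                return False                    # limit reached: caller waits and checks again
            self.n_active += 1
            self.tickets[tid] = self.tickets_per_entry
        self.tickets[tid] -= 1                  # each entry consumes one ticket
        if self.tickets[tid] == 0:
            # Simplification: the slot is given back once the tickets are used up.
            del self.tickets[tid]
            self.n_active -= 1
        return True


gate = TicketGate(concurrency_limit=2, tickets_per_entry=2)
print(gate.try_enter("A"), gate.try_enter("B"), gate.try_enter("C"))
```

In this model a newcomer is turned away while existing ticket holders keep entering freely, even if it touches unrelated data, which mirrors the interference observed for the read-intensive client.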

Observations
Intra-app interference issues are often not strictly a bug but a design trade-off. Even after developers become aware of such a trade-off, they may find the issue difficult to fix and keep the design as is. For example, the InnoDB thread concurrency regulation in case 3 can reduce context switches and improve scalability. Its limitation has been known for more than 10 years, but developers still keep it as a hard-to-tune parameter [76]. As a result, performance interference issues can exist in an application for a long time. We need solutions that can dynamically mitigate performance interference.
In addition, while intra-app performance interference involves contention, the issues are more complex than typical poor synchronization. For example, for case 2, as Figure 4 shows, while a lock is used when accessing the buffer pool, the lock is released soon after a block is obtained. Thus, the real contended virtual resources are the free blocks, which are used by the noisy activities without the lock. Similarly, the core issue in case 1 is the rapid growth of the UNDO log rather than an unfair lock. Thus, simply optimizing locks or other synchronization mechanisms is ineffective.

Challenges and Gaps
Existing interference mitigation solutions use the allocation of hardware resources as the control mechanism. They are ineffective in addressing the intra-application interference issues shown earlier, which can occur even when many idle hardware resources are available. Blindly adjusting hardware resources may even aggravate the interference. In case 1, if we lower the CPU quota for the read requests or purge thread, it would cause even worse write latencies, because the victim activities were waiting for a virtual resource from the noisy activity and would need to wait longer.
Dropping noisy requests is not a solution either. Production workloads are unpredictable, so it is difficult to know in advance which requests will cause interference. For example, both write (case 3) and read (case 1) queries can cause interference in MySQL. Moreover, users expect applications to provide strong performance isolation instead of dropping requests. It is also common for the interference to be caused by a background activity instead of a request.
Complex applications may implement custom mechanisms that attempt to mitigate performance interference. For example, MySQL allows limiting resources at the user account level, such as the number of queries an account can issue per hour [19]. Such mechanisms help prevent overload but are ineffective in addressing normal interference, which occurs even when a client sends a small number of requests.
In summary, despite the extensive effort into mitigating performance interference, there is a lack of an effective and systematic mechanism to provide strong performance isolation within applications that users expect and desire.

Overview of pBox
Motivated by the observations from Section 2, we propose a new OS abstraction called pBox that pushes the boundary of performance isolation into the application for developers to systematically minimize intra-app interference.
Insight. Our insight is that the essence of intra-app performance interference is different application activities contending on virtual resources, such as buffers and tickets. Thus, it is invisible to the OS and cannot be simply mitigated through adjusting hardware resources. pBox tackles this characteristic by making the OS aware of virtual resource contention.
Abstraction. pBox is a performance isolation domain within an application that logically divides an application's execution into independent activities, preventing activities within one domain from suffering poor performance due to the execution of activities in other domains. Existing abstractions such as resource container [7] can capture activities within an application. However, they focus on delineating resource principals while treating each activity separately. In comparison, pBox focuses on interactions across different activities and their scheduling. It monitors the interactions to detect contention and applies scheduling actions to achieve a performance isolation goal.
Usage. Developers create a pBox around code that represents an application activity boundary. They directly specify a high-level performance isolation goal for this pBox, e.g., a maximum interference level x. The runtime then aims to achieve the goal for activities executed within this pBox. pBox supports flexible granularity. For example, in a request-based application, developers can define a pBox for each request. They can also define a pBox for each client connection. In this case, the pBox is created when a client connection is established and destroyed when the corresponding connection is closed. Note that one connection may send $n$ consecutive requests of different types, e.g., write requests followed by read requests. This pBox will be activated $n$ times to provide performance isolation for $n$ activities: the handling of each request from this connection. For $m$ concurrent client connections, there can be $m$ pBoxes. Besides request handling, developers also create pBoxes for other activities, e.g., one pBox for each background thread.
Architecture. Figure 5 shows the pBox system overview. pBox exposes a few general APIs (Section 4.1) for application developers to use. A user-level library will be linked with the application. The library traces critical state events (Section 4.2) about application virtual resources and communicates them to a kernel-level manager. The manager monitors the execution of all the pBoxes. Using the state events along with other information, the manager runs a detection algorithm (Section 4.3) to determine if any pBox might suffer from interference soon, and detect the potential noisy pBox(es) and victim pBox(es). It then carefully applies penalty actions on the noisy pBox(es) (Section 4.4) to achieve the isolation goal.

Design of pBox
In this section, we describe the interfaces of the pBox abstraction, the system components for supporting pBox, and the algorithms for pBox to mitigate performance interference.

Main APIs and Usage
As an abstraction, pBox should be general enough to support a wide range of applications with different architectures and programming paradigms. As Figure 6 shows, there are three common application architectures: (a) multi-threading; (b) event-driven; (c) multi-process. In (a) and (c), one request or task is typically handled by one thread or process. For (b), multiple requests or tasks share the same thread. pBox provides a few APIs (Figure 7) that support all three architectures.
An application calls create_pbox in a region that represents an activity boundary to be protected. Such boundaries are well-defined. For example, in MySQL, if developers want to create a pBox for each client connection, they add a call at the start of function do_handle_one_connection (Figure 8). At runtime, the kernel creates a new pBox instance and binds it with the current thread that handles an incoming connection.
The create_pbox API takes an IsolationRule argument for developers to specify an isolation goal. A typical type of isolation rule specifies the relative performance behavior, particularly latency increase, compared to the ideal, non-interference execution. For example, a rule of 50% indicates that the pBox's execution latency should not be more than 50% worse than the latency if there was no interference (no other pBoxes slowing it down). In Section 4.3.1, we discuss how pBox enforces a relative isolation rule even though the ground truth of non-interference performance is unknown.
When the application starts a new activity in a pBox, developers can activate the pBox by calling activate_pbox, which causes the manager to start tracing this pBox and provide performance isolation for it. Once the activity finishes, the application calls freeze_pbox, which stops the tracing for this pBox. For example, if a pBox represents a client connection thread, activate_pbox and freeze_pbox can be called when the thread starts and finishes processing one request from the connection, respectively (Figure 8).
When the condition of a virtual resource changes, the application signals a state event (Section 4.2) by calling update_pbox. To support diverse virtual resources, pBox names a virtual resource with a generic key, which is typically the address of the resource object. The manager does not need to understand the semantics of a virtual resource. It only needs the key to group recorded information such as the state events.
For event-driven applications, multiple pBoxes share the same thread and only one pBox owns a thread at one time. To support these applications, we provide two ownership transfer APIs. The unbind_pbox API detaches the pBox bound with the current thread and then associates this pBox with a key (different from the resource key). The bind_pbox API finds the pBox associated with a given key and binds it with the current thread. For example, in a typical event-driven application, when a request finishes processing, the connection will be put into the event queue. Before the queuing, developers add an unbind_pbox call with the connection IP as the key. At the place where a new request from a connection is executed in the worker thread, developers add a bind_pbox call.

Table 1. Four state events for application virtual resources. A virtual resource can be mutually exclusive, or exclusive with multiple units. It can also be composed of multiple parts.
PREPARE: The pBox is deferred by a virtual resource that is currently held by another pBox.
ENTER: The pBox is no longer deferred by the resource.
HOLD: The pBox is holding a virtual resource.
UNHOLD: The pBox has released the virtual resource.
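To make the call order of the APIs in this section concrete, the following sketch uses an in-memory stand-in for the pBox library in an event-driven setting. It is a mock under stated assumptions: the real APIs are part of a C library and kernel interface whose signatures (Figure 7) are not reproduced here, so the Python signatures, the MockPBoxRuntime class, and the worker_step helper are all hypothetical.

```python
class MockPBoxRuntime:
    """In-memory stand-in for the pBox library; it only records calls (hypothetical)."""

    def __init__(self):
        self.trace = []       # recorded API calls, in order
        self.next_id = 0
        self.current = None   # pBox bound to the (single) current thread
        self.parked = {}      # ownership key -> detached pBox id

    def create_pbox(self, rule):
        pid = self.next_id
        self.next_id += 1
        self.current = pid    # a new pBox is bound to the creating thread
        self.trace.append(("create", pid, rule))
        return pid

    def activate_pbox(self, pid):
        self.trace.append(("activate", pid))

    def freeze_pbox(self, pid):
        self.trace.append(("freeze", pid))

    def update_pbox(self, key, event):
        self.trace.append(("update", self.current, key, event))

    def unbind_pbox(self, key):
        # Detach the current pBox and remember it under an ownership key.
        self.parked[key] = self.current
        self.current = None

    def bind_pbox(self, key):
        # Re-attach the pBox associated with the key to the current thread.
        self.current = self.parked.pop(key)
        return self.current


def worker_step(rt, conn_key, handle_request):
    """One request of one connection, executed on a shared worker thread."""
    pid = rt.bind_pbox(conn_key)
    rt.activate_pbox(pid)
    handle_request()
    rt.freeze_pbox(pid)
    rt.unbind_pbox(conn_key)  # connection goes back to the event queue


rt = MockPBoxRuntime()
rt.create_pbox(0.5)           # isolation rule of 50%
rt.unbind_pbox("10.0.0.1")    # connection IP used as the ownership key
worker_step(rt, "10.0.0.1", lambda: rt.update_pbox("queue", "HOLD"))
print(rt.trace)
```

The same pBox can be activated and frozen once per request, while bind and unbind move it between the event queue and whichever worker thread picks up the connection.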

State Event
We now introduce a key concept in pBox: state events.

Rationale.
pBox's insight is to make the OS aware of application virtual resources. However, informing the OS of every change in application virtual resource usage would be overwhelming and impose too much overhead. In addition, application virtual resources have a wide range of semantics and characteristics, so the OS lacks the knowledge to transparently manage different virtual resources for an application.
Through analyzing real-world cases, we summarize four general types of conditions that apply to all kinds of virtual resources. Recognizing these conditions is necessary for addressing interference. We call them state events: (1) PREPARE; (2) ENTER; (3) HOLD; (4) UNHOLD. Table 1 lists their semantics.
An alternative is the traditional resource acquire-release model. However, that model does not capture the key characteristics of performance interference: one activity is causing delay to or deferred by another activity.
The PREPARE/ENTER events can capture how long an activity is deferred when it tries to acquire a resource or during the usage of the resource. The reason we distinguish the ENTER and HOLD state events is that a virtual resource may consist of multiple parts, and an activity may be unblocked from a partial resource but still not hold the full resource.

Finding state events.
A state event is about the usage status of an application virtual resource. Identifying it therefore requires domain knowledge. Developers (not users) possess this knowledge to find code places to call update_pbox. Leveraging state events from these API calls, the pBox manager automatically detects and mitigates performance interference.
One approach to finding state events is based on the types of objects that may cause contention, e.g., queues and buffers. However, applications have many custom implementations of these types, which can be easily missed.
We observe a more robust heuristic. Intra-app performance interference usually comes down to the application using waiting-related syscalls to block a victim task, such as sleep, futex, or select. Thus, developers can first find call sites of such a syscall. Then they can check whether a shared variable accessed by multiple activities is used to determine the control paths to a call site. If so, this shared variable is likely a critical virtual resource of interest. Developers can then add the four state events for this resource. In comparison, if the paths to a blocking call site only involve variables that are accessed by one activity, it is likely self-waiting (e.g., a periodic task or retries on I/O errors) that can be skipped. Figure 9 shows an example of adding the update_pbox APIs to the MySQL InnoDB code based on the above heuristics. The shared variable srv_conc.n_active is a virtual resource being contended by multiple activities and the sleep call at line 281 represents an activity being blocked.
Note that developers are not expected to do a perfect job in finding state events. As we later show (Section 6.8), pBox can tolerate incomplete or inaccurate update_pbox calls and still effectively mitigate interference.
We further design a companion static analyzer tool (Section 4.5) to help developers. The tool implements an algorithm based on the above heuristics and automatically analyzes the codebase to find potential virtual resources.
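As an illustration of where the four state events might be placed around a blocking wait on a shared counter, consider the sketch below. It is not the MySQL code from Figure 9: the functions enter_engine and exit_engine, the emit callback (standing in for update_pbox), and the wait callback (standing in for the blocking sleep) are hypothetical, and emitting PREPARE only on the contended path is an assumption.

```python
def enter_engine(state, limit, emit, wait):
    """Admit the caller once the shared counter drops below the limit."""
    key = "n_active"                      # key naming the virtual resource
    if state["n_active"] >= limit:
        emit(key, "PREPARE")              # caller is now deferred by other activities
        while state["n_active"] >= limit:
            wait()                        # stands in for the blocking sleep call
        emit(key, "ENTER")                # caller is no longer deferred
    state["n_active"] += 1
    emit(key, "HOLD")                     # caller now occupies one unit of the resource


def exit_engine(state, emit):
    """Release the unit taken in enter_engine."""
    state["n_active"] -= 1
    emit("n_active", "UNHOLD")


# Single-threaded demo: the wait callback simulates another activity leaving.
state = {"n_active": 1}
events = []
enter_engine(state, 1, lambda k, e: events.append(e),
             lambda: state.__setitem__("n_active", 0))
exit_engine(state, lambda k, e: events.append(e))
print(events)
```

The shared counter decides whether the blocking call is reached, which is exactly the pattern the heuristic looks for; a wait loop guarded only by activity-local variables would receive no annotations.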

Prediction and Early Detection of Interference
To achieve strong performance isolation, the manager must monitor each pBox's execution and proactively detect if a violation is imminent. Early detection is especially important because of the fine granularity of performance isolation and the fact that we cannot reclaim a contended virtual resource.
A fundamental challenge, however, is that virtual resource usage is low-level information, while the isolation rule is about end-to-end latency. During an activity's execution, we do not know what its final latency will be, nor how much each virtual resource will contribute to the final latency. Given a relative isolation rule (Section 4.1), we also need to know the baseline (interference-free) performance, which is usually unavailable.

Metrics and
We use deferring time as the metric. Our rationale is that interference occurs when an activity is deferred for a long time. We define the deferring time for one activity to be the additional execution time caused by other activities. Assume an activity uses a set of resources $r_1, r_2, \ldots, r_n$, and a list of PREPARE and ENTER state events is received for each resource. We denote the times of receiving the PREPARE events as $tp_1, tp_2, \ldots$ and the times of receiving the ENTER events as $te_1, te_2, \ldots$. The deferring time is calculated as $T_d = \sum_{i=1}^{n} (te_i - tp_i)$. We did not choose holding time as a metric, because holding a virtual resource for long does not mean the pBox is noisy.
To connect the deferring time metric to the end-to-end isolation goal, we treat the unknown baseline (interference-free) as an ideal execution with zero deferring time.
Assume the total execution time of an activity in a pBox is $T_e$ and its total deferring time is $T_d$. With the baseline taken as $T_e - T_d$, its interference level is $I = T_d / (T_e - T_d)$. The problem is that we do not know $T_e$ before this activity finishes, so we need to approximately compute $I$.
To achieve early detection, i.e., predicting whether a pBox's execution so far is in danger of violating the performance isolation goal, we use a worst-case analysis inspired by the worst-case execution time (WCET) analysis [82]. In particular, using the current deferring time $t_d$ and the current execution time $t_e$, we can compute a simple approximation $I' = t_d / (t_e - t_d)$. If $I'$ already exceeds the isolation goal, we can have confidence that if the activity's later execution still maintains the same ratio, the pBox cannot achieve its goal. Thus, it would be a good time to take action.
Algorithm 1 shows the core interference detection algorithm. When an UNHOLD event is received (line 14), the manager first checks whether the current pBox is the holder of the virtual resource. If so, it iterates through all the waiting pBoxes (line 17). If it finds a pBox whose deferring time $t_d$ is too long and the current pBox was the holder before the waiting pBox, we detect potential interference and find both the noisy pBox and the victim pBox.
The aforementioned detection logic is about one activity executed in a pBox. Due to the fundamentally limited information, we may miss detecting and mitigating interference in one activity. Thus, the pBox manager also monitors the overall performance of a pBox and detects interference at the pBox level. To do so, it keeps a history of $T_d$ as well as $T_e$. It calculates the average interference level $\bar{I}$ from this history. If the manager finds that one pBox's $\bar{I}$ is close to the isolation goal (by default, within 90% of it), it will also take action at the end of the activity.
Besides calculating the average, the manager supports other metrics including tail and max based on the same principle.
Note that our algorithm does not assume a pBox accesses only one resource at a time. It uses unique keys to identify state events for different resources, so it tracks them separately and concurrently. It also places no requirements on resource dependencies. The deferring time is calculated based on the timing of the state events. A different order of events would change the time but not the accuracy of detection.
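A small worked example may help connect deferring time to a relative isolation rule. The sketch below assumes one reading of the definitions in this section: the baseline is the execution time minus the deferring time, so the interference level is the deferring time divided by that baseline. The function names and the event tuple layout are hypothetical, and this is not Algorithm 1.

```python
def deferring_time(events):
    """Sum ENTER minus PREPARE timestamps per resource key.

    events: list of (timestamp, resource_key, kind), kind in {"PREPARE", "ENTER"}.
    """
    pending = {}   # resource key -> time of the outstanding PREPARE
    total = 0.0
    for ts, key, kind in events:
        if kind == "PREPARE":
            pending[key] = ts
        elif kind == "ENTER" and key in pending:
            total += ts - pending.pop(key)
    return total


def interference_level(defer, elapsed):
    """Latency increase relative to an assumed zero-deferring baseline."""
    baseline = elapsed - defer
    return float("inf") if baseline <= 0 else defer / baseline


def goal_in_danger(defer_so_far, elapsed_so_far, goal):
    """Early check: would the goal be violated if the current ratio persisted?"""
    return interference_level(defer_so_far, elapsed_so_far) > goal


events = [(0.0, "log", "PREPARE"), (2.0, "log", "ENTER"),
          (3.0, "buf", "PREPARE"), (4.0, "buf", "ENTER")]
d = deferring_time(events)                 # 2.0 on "log" plus 1.0 on "buf"
print(d, goal_in_danger(d, 8.0, goal=0.5))
```

With 3 time units of deferring out of 8 elapsed, the assumed baseline is 5 units and the estimated slowdown is 60%, so a 50% rule would already be flagged before the activity finishes. Keying the pending map by resource lets overlapping waits on different resources be tracked independently.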

Tracking Execution Information.
The manager tracks each pBox's execution information to both support the detection algorithm and facilitate mitigation actions (Section 4.4).
It tracks four statuses for each pBox: start (e.g., a new client connection is established), active (e.g., a new request from the connection is received), freeze (e.g., the request handling finishes), and destroy (e.g., the connection is closed).
The manager begins to trace state events after a pBox is in an active status, and ends tracing once it is in a freeze status.
When a PREPARE event is received, the manager notes this pBox in a deferred state about a virtual resource and adds it to a competitor map (list of pBoxes waiting for a resource). When the pBox receives an ENTER event on the same resource, the deferred state is ended and the manager calculates the deferring time. If a pBox receives a HOLD event, the manager records it in a holder map. The two maps are used in the interference detection and mitigation logic.
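The bookkeeping described above can be sketched as a small event tracker. This is an illustrative user-space model, not the kernel manager: the EventTracker class and its field names are hypothetical, and removing a pBox from the holder map on UNHOLD is an assumption that the text implies but does not state.

```python
from collections import defaultdict


class EventTracker:
    """Toy model of the competitor map and holder map (hypothetical names)."""

    def __init__(self):
        self.competitors = defaultdict(dict)  # resource key -> {pbox: PREPARE time}
        self.holders = defaultdict(set)       # resource key -> pboxes holding it
        self.defer = defaultdict(float)       # pbox -> accumulated deferring time

    def on_event(self, pbox, key, kind, now):
        if kind == "PREPARE":
            self.competitors[key][pbox] = now        # pbox starts waiting on key
        elif kind == "ENTER":
            start = self.competitors[key].pop(pbox, None)
            if start is not None:
                self.defer[pbox] += now - start      # deferred state ends
        elif kind == "HOLD":
            self.holders[key].add(pbox)
        elif kind == "UNHOLD":
            self.holders[key].discard(pbox)          # assumption: drop on release


t = EventTracker()
t.on_event("A", "queue", "HOLD", 0.0)
t.on_event("B", "queue", "PREPARE", 1.0)
t.on_event("A", "queue", "UNHOLD", 4.0)
t.on_event("B", "queue", "ENTER", 4.5)
print(dict(t.defer))
```

At the moment A releases the queue, the competitor map still lists B as waiting since time 1.0, which is the information a detection step needs to pair a holder with a potential victim.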

Prevention and Mitigation of Interference
After detecting potential interference, the pBox manager needs to take action. Unlike cross-app interference, where the kernel can transparently adjust hardware resources, directly reallocating a contended virtual resource can easily introduce dangerous side effects to an application. For example, directly revoking a lock object from one activity and granting it to another activity can easily violate critical section safety.

Action and Timing.
We penalize the noisy pBox as the main control action so that we can achieve performance isolation without breaking application logic. There are multiple ways to apply the penalty, such as reducing scheduling slices (giving more to the victim pBox) and lowering priority.
We choose a simple type of penalty: adding a delay to slow down the noisy pBox. In the Linux kernel, it is done by calling schedule_hrtimeout. Compared to other penalties, it introduces a simpler effect, which in turn makes it easier to predict the mitigation effectiveness and makes the interference mitigation algorithm (Section 4.4.2) less complex. Also, this simple penalty avoids conflicting with the main OS scheduler.
Applying penalty actions to a noisy pBox might violate the noisy pBox's own isolation goal and trigger additional penalty actions. To avoid such cascaded penalties, the detection algorithm (Algorithm 1) only uses the deferring time on virtual resources to determine the interference level. Thus, a violation caused by a penalty action is not considered interference.
The timing of the penalty action requires care. If a virtual resource is still held by a noisy pBox, penalizing the noisy pBox would cause the victim pBox to wait even longer for the virtual resource. Thus, the manager waits until the noisy pBox no longer holds the virtual resource to apply the penalty.
Another caveat is nested state events. A noisy pBox may hold multiple virtual resources at the same time, so during its penalty, it may still cause interference for other pBoxes. To avoid this situation, the manager conservatively waits for the noisy pBox to release all the virtual resources and takes action at once. As a result, the noisy pBox can be penalized without causing more performance interference.
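The timing rule (penalize only after the noisy pBox has released every virtual resource) can be sketched as a small gate. This is an illustrative model under stated assumptions: `PenaltyGate` and its methods are hypothetical names, and the penalty is only recorded in a list here, whereas the text states that the kernel applies the delay via schedule_hrtimeout.

```python
class PenaltyGate:
    """Defers a penalty until the noisy pBox holds no virtual resource."""

    def __init__(self):
        self.held = {}     # pbox_id -> set of resources currently held
        self.pending = {}  # pbox_id -> penalty length waiting to be applied
        self.applied = []  # (pbox_id, length) in the order penalties fired

    def hold(self, pbox_id, resource):
        self.held.setdefault(pbox_id, set()).add(resource)

    def unhold(self, pbox_id, resource):
        self.held.get(pbox_id, set()).discard(resource)
        self._maybe_apply(pbox_id)

    def request_penalty(self, pbox_id, length):
        self.pending[pbox_id] = length
        self._maybe_apply(pbox_id)

    def _maybe_apply(self, pbox_id):
        # Fire only when a penalty is pending and nothing is held,
        # which also covers the nested (multi-resource) case.
        if pbox_id in self.pending and not self.held.get(pbox_id):
            self.applied.append((pbox_id, self.pending.pop(pbox_id)))
```

With two nested resources held, a requested penalty stays pending after the first release and fires only on the second, so the victim never waits on a resource whose holder is being delayed.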

Adaptive Penalty.
The mitigation effectiveness depends on the penalty action's length. An improper length may exacerbate the interference. Rather than using a fixed length, which is hard to set, we adaptively adjust the length.
When the manager detects a noisy pBox (Algorithm 1), it checks the action history. If this pBox has not been penalized for the contended virtual resource before, the manager sets an initial value $p_1$ as the penalty length. Otherwise, the length is adjusted based on the effect of the previous penalty.
We evaluate whether a penalty is good or not by comparing the victim pBox's performance before and after the penalty. In particular, we calculate $r(n) = d_n / e_n$ for the victim pBox, where $d_n$ and $e_n$ are the victim's average deferring time and execution time until the $n$-th action, respectively. This ratio reflects the interference level (Section 4.3).
We design two adaptive policies. The first one is score-based. If $r(n+1)$ is larger than $r(n)$, which means the penalty does not reduce the interference level, we increment the score $s$ by one. Otherwise, we decrement the score by one if it is positive. The penalty length for the next action is set to $p_{n+1} = p_1 \times (1 + s/\alpha)$, where $\alpha$ by default is 5, so each ineffective action would increase the next penalty time. This policy's convergence to the optimal penalty may be slow.
Thus, we design a second policy inspired by the gradient descent algorithm. We measure the gap from the isolation goal ($g$), $\epsilon = r(n+1) - g$, and the delta $\delta = 1 - r(n)/r(n+1)$. The next penalty length is set to $p_{n+1} = p_n \times \epsilon/\delta$. This policy is faster but a step may be too large to reach the optimal value.
The manager dynamically chooses between the two policies. If the deferring time is much larger than the penalty, it chooses the second policy. Otherwise, it chooses the first policy.
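One reading of the two update rules can be sketched as follows. The function names are hypothetical, and several details are assumptions not given in the text: the gap-based step is taken to be the current length scaled by gap over delta, a zero delta is rejected explicitly, and "much larger" is modeled with an arbitrary `factor` of 10.

```python
def update_score(score, level_prev, level_next):
    """Score rises after an ineffective penalty, falls (not below 0) otherwise."""
    if level_next > level_prev:
        return score + 1
    return max(score - 1, 0)


def score_based_length(initial_length, score, alpha=5):
    """Score-based policy: scale the initial length by (1 + score / alpha)."""
    return initial_length * (1 + score / alpha)


def gap_based_length(current_length, level_prev, level_next, goal):
    """Gap-based policy: scale the current length by gap / delta."""
    gap = level_next - goal                # distance from the isolation goal
    delta = 1 - level_prev / level_next    # relative change in interference level
    if delta == 0:
        # Assumption: the text does not say what happens when the level is unchanged.
        raise ValueError("interference level unchanged; gap-based step undefined")
    return current_length * gap / delta


def choose_policy(avg_defer_time, penalty_length, factor=10):
    """Pick the gap-based policy when deferring time dwarfs the penalty.

    `factor` is a hypothetical threshold; the text only says "much larger".
    """
    return "gap" if avg_defer_time > factor * penalty_length else "score"
```

For example, with an initial length of 1.0 and a score of 2, the score-based policy yields 1.4; with a current length of 2.0, levels 0.8 then 1.0, and a goal of 0.5, the gap-based policy yields 5.0, which illustrates why its steps can overshoot.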
To choose the initial penalty $p_1$, we assume a simple but representative interference model: one noisy pBox and one victim pBox. The pBox manager derives a formula to calculate the optimal penalty length under this model, so that $p_1$ is not far from the real optimal value.

Is The Action Too Late?
A limitation of using delay as the penalty action is that the action might be too late. For instance, if the interference is caused by a noisy pBox holding a virtual resource for a long time near the end of an activity's execution, by the time the pBox manager can safely act, the penalty may be useless.
While adding complex transaction mechanisms may address this limitation, our design has the advantages of simplicity and safety. As we later show (Section 6), it is quite effective.
Several reasons explain its effectiveness. First, our detection algorithm (Section 4.3) is proactive: it can find imminent interference before it reaches the level of violating the performance isolation goal. As a result, the penalty action(s) can be applied early on to prevent the violation or at least minimize the interference impact.
Second, we find that in real-world intra-application interference cases, a noisy pBox often creates contention on some virtual resource more than once, either within one activity or across a sequence of activities. Take case 1 in Section 2.1 as an example: the virtual resource is the UNDO log, and the noisy activity keeps adding or cleaning up entries in it. Similarly, in case 2, the virtual resource is the buffer pool, and the noisy activity frequently obtains blocks from it.
Third, lateness in taking action is more probable when an activity finishes execution quickly. This can impose demanding requirements on the detection and mitigation, but we are targeting performance interference. In such a scenario, a noisy activity typically requires a longer execution time to cause severe interference.
For these reasons, despite the potential limitation, in practice, there are still many opportunities to effectively intervene.

Static Analyzer
We designed a companion static analyzer to help developers find state events when adding pBox to their applications. The analyzer is built on top of the LLVM framework [43]. Algorithm 2 lists the core logic. The analyzer takes as input a list of standard library functions or syscalls that perform waiting, such as semaop, pthread_sleep, pthread_cond_wait, pthread_yield, and apr_sleep. Many applications also implement custom waiting functions that are wrappers of a standard function. The analyzer identifies such a wrapper (isWrapper at line 8) by checking whether a function calls some waiting function in all the paths. Specifically, it checks the Control Flow Graph (CFG) to see if this call instruction's basic block is a post-dominator [14] for the function's entry basic block.
The analyzer then finds all callsites for these waiting functions and the wrappers. Next, it checks whether a callsite is in a loop (line 13). If so, it checks whether the loop condition uses some variables shared by multiple activities (line 15).
A callsite that matches these conditions is a candidate location to add a state event. The analyzer outputs all locations and the associated shared variables (likely virtual resources). The output guides developers to add update_pbox calls.
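The wrapper check (a block calling a waiting function post-dominates the entry block) can be illustrated on a toy CFG. This sketch is not the LLVM-based analyzer: the CFG is a plain dictionary, `post_dominators` and `is_wrapper` are hypothetical helper names, and the code assumes a single exit block that every block can reach.

```python
def post_dominators(cfg, exit_node):
    """Iterative post-dominator sets for a CFG given as node -> successor list."""
    nodes = set(cfg)
    pdom = {n: set(nodes) for n in nodes}
    pdom[exit_node] = {exit_node}
    changed = True
    while changed:
        changed = False
        for n in nodes - {exit_node}:
            # A block is post-dominated by itself plus whatever
            # post-dominates all of its successors.
            new = set(nodes)
            for succ in cfg[n]:
                new &= pdom[succ]
            new |= {n}
            if new != pdom[n]:
                pdom[n] = new
                changed = True
    return pdom


def is_wrapper(cfg, entry, exit_node, blocks_calling_wait):
    """True if some block that calls a waiting function lies on every path."""
    pdom = post_dominators(cfg, exit_node)
    return any(block in pdom[entry] for block in blocks_calling_wait)


# Every path from entry passes through "wait": treated as a wrapper.
always_waits = {"entry": ["a", "b"], "a": ["wait"], "b": ["wait"],
                "wait": ["exit"], "exit": []}
# One path skips "wait": not a wrapper.
sometimes_waits = {"entry": ["wait", "exit"], "wait": ["exit"], "exit": []}
```

Calling `is_wrapper(always_waits, "entry", "exit", {"wait"})` returns True, while the same call on `sometimes_waits` returns False because the entry block can reach the exit without waiting.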

Implementation
We have implemented a prototype of pBox in Linux kernel 5.4.1 and a user-level runtime library.
Lightweight Tracing. Since a pBox is activated during an activity's execution, we need to minimize the overhead of tracing and management. To make the tracing lightweight, we optimize the cost of each pBox operation. A major cost is allocating bookkeeping data structures such as the state event hash table and competitor map. We reduce this cost by using pre-allocation. For example, for the bind_pbox operation, one pBox would normally only bind to one key (variable) at a time. Thus, we allocate a small array in the pBox's struct during its creation phase. In later binding operations, we just find a free slot in the array and allocate only if all slots are used.
After optimizing the core operation itself, the syscall overhead dominates. We further reduce the number of syscalls, especially for update_pbox. The user-level library checks whether HOLD has a matched UNHOLD event and only calls update_pbox when there is a match. Since the two events are used to locate the noisy pBox and take action if needed, we can skip the syscall upon redundant events. This decision needs to find the associated pBox, which still requires a syscall. To avoid this syscall, we use thread local storage to record the pBox id when it is created. Then we keep an array in each pBox to check the ownership of the virtual resource.
Supporting Event-driven Model. Event-driven applications can be single-threaded or multi-threaded (thread pool). The bind and unbind pBox APIs take a flags argument that can indicate whether the currently bound thread is a shared thread or a dedicated thread. If a noisy pBox is bound with a dedicated thread, the manager takes action immediately. If the bound thread is shared, penalizing the noisy pBox with a delay would prevent other pBoxes from using this thread. The manager instead makes the following activities from the noisy pBox wait in the task queue for a while. Specifically, the manager keeps a penalty timestamp. If an activity from the noisy pBox is selected to execute next before the timestamp expires, the activity is put back into the task queue.
One challenge is how to manipulate the application task queue without causing side effects. We observe that event-driven applications commonly leverage kernel-level queues for task management by using syscalls such as accept and epoll. In such cases, the pBox manager traces state events at the application level but takes action in the kernel queues by modifying the syscall implementations to achieve transparent mitigation. If applications do not leverage kernel-level queues, developers need to annotate the task queues.
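The penalty-timestamp mechanism for shared threads can be sketched with an ordinary user-level queue. This is a simplified model, not the modified accept/epoll paths: `next_task` and the `penalty_until` mapping are hypothetical names, and tasks are represented as `(pbox_id, payload)` tuples.

```python
from collections import deque


def next_task(queue, penalty_until, now):
    """Return the next runnable (pbox_id, payload), or None if all are penalized.

    Tasks whose pBox is still inside its penalty window are put back at the
    tail of the queue instead of blocking the shared thread.
    """
    for _ in range(len(queue)):
        pbox_id, payload = queue.popleft()
        if now < penalty_until.get(pbox_id, 0.0):
            queue.append((pbox_id, payload))  # requeue the penalized activity
            continue
        return pbox_id, payload
    return None
```

With pBox 1 penalized until time 5.0, a call at time 3.0 skips its task and serves pBox 2 instead; once time passes 5.0, the requeued task becomes runnable again.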
Lazy Unbind. In high-performance event-driven applications, we observe that the same thread might frequently bind and unbind to the same pBox. We introduce a lazy unbind optimization to reduce the number of syscalls. Under this mode, when the library receives an unbind_pbox call, it marks the pBox as detached and pauses its state event tracing, but does not make a syscall. At the kernel side, the pBox is still bound with the current thread. In the next bind_pbox call, the library checks if it is about the same detached pBox. If so, the library removes the detached flag, also without making a syscall. Otherwise, it makes a syscall for the manager to unbind the last pBox and bind the new one.
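The lazy unbind logic can be modeled with a small per-thread state machine. The sketch below is illustrative only: `LazyBinder` is a hypothetical class, the real syscall is replaced by a counter, and it assumes that rebinding to a different pBox costs exactly one syscall.

```python
class LazyBinder:
    """Per-thread user-level state for lazy unbind (illustrative only)."""

    def __init__(self):
        self.bound = None      # pBox the kernel still considers bound
        self.detached = False  # set by a lazy unbind
        self.syscalls = 0      # stand-in for real bind/unbind syscalls

    def bind(self, pbox_id):
        if self.detached and self.bound == pbox_id:
            # Same pBox as the detached one: just clear the flag.
            self.detached = False
            return
        # Different pBox (or first bind): the kernel must be told.
        self.syscalls += 1
        self.bound = pbox_id
        self.detached = False

    def unbind(self):
        if self.bound is not None:
            # Pause tracing locally; the kernel binding stays in place.
            self.detached = True
```

A thread that repeatedly binds and unbinds the same pBox pays for one syscall in total under this model, and pays again only when it switches to a different pBox.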

Evaluation
We evaluate pBox to answer several questions: 1) Can pBox reduce intra-app interference? 2) How does pBox compare to state-of-the-art solutions? 3) Is pBox robust? 4) What is the overhead? 5) How much effort is needed to use pBox? To measure the performance of these applications, for MySQL and PostgreSQL, we use sysbench as the benchmark tool [42]. For Apache and Varnish, we use the official Apache benchmark tool [25]. For Memcached, we use Mutilate [44] as the benchmark tool.

Microbenchmark
We measure the costs of pBox operations with a microbenchmark. We write a test app that invokes different pBox APIs 10 million times. We run the app 10 times and calculate the average latency for each operation.
Figure 10 shows the results. The pBox creation on average takes 8.8 μs, which is much faster than the pthread creation. For the other operations, the latency is around 420 ns to 500 ns, which is close to the getpid syscall latency.

Mitigating Real-World Issues
To evaluate the effectiveness of pBox, as Table 3 shows, we collect 16 real-world intra-application performance interference issues in the five applications. All cases are collected from blog posts, ServerFaults [67], and application bug trackers. Only 4 cases are marked as bugs by developers. The rest do not have associated bug reports and are usually design trade-offs.
We reproduce these cases, measure their performance on vanilla Linux, and compare it with running them on the pBox versions. In the pBox creation APIs, we use a relative isolation rule (interference tolerance level) of 50%. We choose 50% because contention is inevitable in modern applications and this goal is more realistic given the complexities of performance behavior in our evaluated applications. We evaluate the impact of different rule settings in Section 6.5.
Figure 11 shows the normalized latencies of the activities (threads or processes) that originally suffered from performance interference. pBox successfully mitigates (reduces the latencies of) 15 of 16 cases.
The degree of mitigation matters. Let $P_i$ denote the performance with interference, $P_n$ denote the performance without interference, and $P_s$ denote the performance under a solution. Then, the original interference level is $I = \frac{P_i}{P_n} - 1$. The last column in Table 3 lists $I$ for each case. Most cases experience severe interference. The interference level under a solution is $I_s = \frac{P_s}{P_n} - 1$. Thus, the interference reduction ratio is $R = \frac{I - I_s}{I} = \frac{P_i - P_s}{P_i - P_n}$. pBox significantly reduces the interference, by an average of 86.3% and as large as 113.6%. For case c16, pBox does not achieve effective mitigation, because the contention on the particular application resource is not heavy. In addition, since Memcached is a high-performance in-memory system, even one or two additional syscalls can be costly. The overhead of pBox exceeds the overall performance benefit from its mitigation actions. Note that pBox's improvements for cases c2 and c15 are not negligible as Figure 11 might suggest. For readability, the normalization in Figure 11 is calculated as $P_s / P_i$, which does not always reflect $R$. For instance, in case c2, $P_i$ is 23.95 ms; $P_n$ is 21.67 ms; $P_s$ for pBox is 21.99 ms. The normalized latency in Figure 11 is 0.91 ($\frac{21.99}{23.95}$), which seems a small improvement. However, pBox improves the victim activity's latency to be close to its non-interference latency (21.67 ms vs. 21.99 ms), achieving an 86% reduction ratio.
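The arithmetic for case c2 can be checked with a few lines of code. The function names below are hypothetical; the inputs are the three latencies quoted above for c2 (with interference, without interference, and under pBox).

```python
def interference_level(latency, latency_no_interference):
    """Relative slowdown compared with the interference-free run."""
    return latency / latency_no_interference - 1


def reduction_ratio(lat_interference, lat_clean, lat_solution):
    """Fraction of the original interference removed by a solution."""
    before = interference_level(lat_interference, lat_clean)
    after = interference_level(lat_solution, lat_clean)
    return (before - after) / before


# Case c2 latencies in ms, as reported in the text.
c2_ratio = reduction_ratio(23.95, 21.67, 21.99)
# Normalization used for plotting: solution latency over interference latency.
c2_normalized = 21.99 / 23.95
```

Here `c2_ratio` evaluates to roughly 0.86 while `c2_normalized` is roughly 0.92, which shows how a normalized latency close to 1 can still correspond to most of the interference being removed when the original interference level is small.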
For tail latency, pBox reduces the 95th percentile for 13 cases (Figure 12), with an average reduction ratio of 54.6%.
As for the impact on the noisy pBox, its latency is only increased by an average of 34.1%.
pBox does not guarantee that the specified isolation goal can always be achieved.From measuring the first five cases, we observe that 94.6% of the activities meet the goal with pBox, whereas this number drops to 48.2% without pBox.
For standard cgroup, we use a script to dynamically identify threads that handle different types of workloads and put them into different cgroups. It also identifies background task threads and assigns them to one cgroup. Then the script configures an even CPU usage quota among the cgroups. In this way, a noisy workload or background task would not impact the CPU usage in other groups. For PARTIES, we modify its monitoring component to trace each client's latency. We use a script to identify threads that handle each client and configure them as PARTIES' control targets. PARTIES can then control resource usage at the client level. Retro is designed for Java-based distributed systems. We use the pBox codebase to re-implement its core design to apply to C/C++ programs. We trace each activity's resource usage including lock and CPU, calculate the slowdown and load factor, and run Retro's BFAIR policy to throttle noisy requests. DARC provides request-level scheduling. We extend its request classifiers to support four request types for MySQL/PostgreSQL (Read, Write, Insert, Delete) and two request types for Apache/Varnish/Memcached (Post, Get). We implement a worker for each application to translate a PSP request into an app request.
In Figure 11, the compared solutions are (1) pBox, (2) cgroup, (3) PARTIES [10], (4) DARC [18], and (5) Retro [48]. The interference is reduced if the normalized latency is below 1. The lower it is, the higher the interference reduction ratio. Normalized latency above 1 means the interference becomes worse. The numbers above the red bars are the absolute latencies for the interference performance.
Figure 11 shows the result. Cgroup reduces the interference for 3 cases by 33.6% on average and a max of 77.8%. In the remaining 13 cases, it makes the interference worse by -22.5% on average and worst by -94.6%. DARC helps 3 cases by 61.6% on average and a max of 90.8%. In the remaining 13 cases, it makes the performance worse by 535.8% on average and a max of 5716.5%. The reason is that DARC and cgroup limit a noisy activity's hardware resources, but the victim activities are waiting for virtual resources from the noisy activity and need to wait longer. Retro helps 5 cases by 38.8% on average and a max of 57.8%. In the remaining 11 cases, it makes the performance worse by -48.6% on average and worst by -280.8%. Retro can help more cases than the other baselines partially because we implemented its control points and throttling on top of the pBox abstraction, and our pBox calls avoid bad penalty timing. PARTIES helps 3 cases by 13.5% on average and a max of 28.6%. It makes 13 cases worse by -176.2% on average and worst by -716.7%.

Penalty Action
To understand the internals of pBox's mitigation, we measure the number of penalty actions in 8 cases. Figure 13 shows the result. In general, pBox takes more penalty actions under a high interference level (listed in Table 3). However, if the interference level is too high, fewer penalty actions may occur, because that can cause pBox to choose the gap-based adaptive policy, which increases the penalty length for each action and thus decreases the number of actions. Figure 13 also shows the average number of steps that the adaptive penalty policy takes to converge (the penalty length reaches a fixed point). In cases where the gap-based policy is primarily chosen, the number of convergence steps is 10 times smaller than in cases where the score-based policy is chosen.
Figure 14 shows the penalty length distribution in the 8 cases. The cases choosing the gap-based policy have longer penalty lengths than the cases choosing the score-based policy.

Adaptive Penalty and Rule Sensitivity
We compare our adaptive penalty design (Section 4.4.2) with using fixed penalties of 10 ms and 100 ms. Table 4 shows that the adaptive penalty performs better for 7 out of 9 cases.
Users specify an isolation rule (goal) when creating pBox (Section 4.1). This setting can affect the detection and mitigation decisions. The experiment in Figure 11 uses the default 50%. We test 10 cases under different settings. Figure 15 shows the result. In general, a larger (more relaxed) isolation level can decrease the mitigation effectiveness.
The case c2 shows higher sensitivity to the rule settings. This is because the interference in this case is less severe: the interference level is lower than two for c2 but greater than five for other cases. More relaxed isolation rules would cause fewer penalty actions, resulting in a lower reduction ratio.

Overhead
We measure the end-to-end overhead of pBox on an application's performance in normal conditions. We use the same application versions and configurations as Section 6.2, but we run normal workloads instead, which are assumed to not introduce significant performance interference. Specifically, we generate OLTP read-only and write-only workloads for MySQL and PostgreSQL using sysbench [42], with an initial database of 64 tables and 1 K records per table. For Memcached, we generate read-intensive and write-intensive workloads based on Facebook's USR and VAR request distribution [5]. Each workload has eight settings with varying numbers of clients (Figure 16). We run Apache and Varnish under settings r1 to r64. The workload is serving HTML pages based on the Varnish high-availability benchmark [77]. We run each setting for 90 s and compare the average latency with and without pBox.
The overhead does not significantly increase as the concurrency level increases. We use hashtables to store virtual resources. For each resource, we use a list to store the current waiters. Adding a pBox to this list has a constant cost. Removing a pBox and finding a victim pBox have costs linear in the number of waiters. If many pBoxes are waiting on a virtual resource, it is likely that performance interference already occurs. In this case, the cost of finding a victim is outweighed by the gains of mitigating the interference.

Usage Effort
Table 5 shows the SLOC we add to the five applications for using pBox. MySQL's changes are the largest, mainly because it defines a number of custom virtual resource types that we need to cover. But the changes overall are small, especially considering the applications' large codebase sizes (Table 2).
Table 5. Functions we inspected to use pBox, state events we manually found to add update_pbox calls, and total SLOC added to the app code. Detected is the number of state events found by our analyzer.
Since we are not the application developers, we need to read the source code first. Table 5 shows the number of functions we inspected to determine the places for using pBox. It takes a graduate student a few days to complete the task for each application. Developers can likely use pBox more quickly.
We also test our static analyzer (Section 4.5). Table 5 reports the state events detected by the static analyzer. On average, the analyzer detects 81% of our manually found state events. For PostgreSQL, the analyzer detects four more points that we did not find during our manual porting.
The remaining 19% of state events match the same heuristic as the others. Our static analyzer fails to identify them because it only checks direct wrappers of waiting functions, but in these cases the callchain to a waiting function is deep. Additionally, some loop condition variables are the return values of function calls, and our current analyzer does not support checking whether a returned variable is shared.
When an application evolves, if its activity boundaries and virtual resource usage code are changed (typically in a major upgrade), developers need to update the pBox calls accordingly. This is similar to how developers need to adjust the synchronization points when they make major changes to a multi-threaded program. Developers can re-run our static analyzer to assist them with updating the pBox calls.

Mistake Tolerance
We evaluate whether pBox can tolerate mistakes in using pBox APIs. We randomly remove 10% of the update_pbox calls in our pBox-version MySQL and rerun the experiment in Section 6.2. This process is repeated five times. On average, 4 cases (out of 5 cases) show positive mitigation, with an average interference reduction ratio of 92.1%, which is slightly lower than the result (93.9%) under correct usage.

Discussions
Kernel vs. user level. Where to implement pBox (application, library, and kernel) has performance, transparency, control, and flexibility trade-offs. Our current kernel-heavy implementation is motivated by several considerations:
• pBox is essentially an effort to improve the scheduling of application activities for performance isolation, which is an important property that the OS should provide to applications. Many works [26,29,38,51,53] have been implemented in the kernel to achieve performance isolation. However, they are insufficient to address the prevalent intra-application performance interference issues.
• Intra-app performance interference can occur due to contention on system-level resources such as futex and network queues. In Table 3, five cases are contending on such resources. Certain application virtual resources are proxies for system resources. For example, the table lock in MySQL is implemented using pthread_mutex, which relies on the futex syscall. The kernel-level pBox can directly modify the corresponding kernel code (e.g., futex implementation), allowing us to transparently trace state events without requiring developers to add update_pbox calls in application code.
• The timing of pBox actions is easier and more effective to enforce in the kernel. The actions' impact on the Linux scheduler is more predictable than user-level actions. The kernel-level implementation also allows future extensions using other scheduling actions, such as changing priorities.
However, for certain applications that use pure user-level queues for task management, it is beneficial to provide a library-heavy pBox implementation or design upcall APIs similar to scheduler activations [2].
Testing. To test whether the added pBox API calls are effective, developers can create performance benchmarks that reproduce past interference issues. Another testing strategy is to use a strict isolation goal in performance testing. Large software often experiences minor forms of intra-app performance interference. The pBox traces should show that some mitigation actions have been taken. Additionally, since the pBox APIs are designed to be simple, developers can easily add pBox code to missed code regions during or after a production performance issue, which will benefit future performance isolation. This flexibility enables iterative instrumentation.
Future Work. Like other performance interference mitigation work, pBox is a best-effort solution. It only reduces interference and does not guarantee that a given isolation goal will always be satisfied. How to provide strict performance isolation for large software is an open challenge. A related area of improvement is to provide a more rigorous analysis of pBox's actions, such as applying queuing theory [31]. pBox currently does not support distributed systems. Supporting them requires extensions to the tracing and detection algorithm as well as coordination on mitigation actions.
Numerous solutions [10,11,26,34,35,47,54,58,64,68,86] are proposed to mitigate interference by adjusting hardware resources. For instance, PerfIso [34] dynamically restricts the cores for batch jobs to protect the performance of latency-sensitive jobs. PARTIES [10] boosts allocation of hardware resources for latency-critical services upon detecting QoS violations. Caladan [26] uses memory bandwidth and request processing times as the control signals to detect memory and CPU interference, and restricts CPU cores for antagonist jobs.
We focus on intra-application performance interference, which is caused by internal activities contending on application-level resources such as buffers or tickets. The contention can be invisible to existing solutions.
Fine-grained Resource Management. A long history of supporting fine-grained resource management exists in the context of real-time and multimedia operating systems [13,24,37,55,56]. Much of the work focuses on charging resource consumption to an application activity that is across the process or thread boundary. Similar efforts exist in general-purpose operating systems and software [7,36,46,48,53,65]. A representative work is the resource container abstraction [7], which allows developers to limit an application activity's resource usage. It is modernized by Linux cgroup [53]. All these efforts still mainly target hardware resources, while pBox is about contention on virtual resources. Moreover, pBox focuses on cross-activity interference instead of managing each activity independently. It uses virtual-resource-aware scheduling to minimize the interference.
Retro [48] attributes the resource usage to different workflows and allows developers to write their own scheduling policies to control resource allocations. While Retro can trace some application resources (locks and thread pools), it mainly targets conventional interference due to multi-tenancy. pBox covers a wide variety of virtual resources. It does not target resource allocation, but instead focuses on fine-grained performance isolation and may take mitigation action at any time during an application activity's execution.
Server Overload Control. Applications may experience performance overload due to excessive requests. Solutions typically use admission control techniques [9,11,12,20,22,30,40,79] that apply rate limiting on the client side or drop requests at the proxy or server side. Intra-application performance interference is an orthogonal problem. It can happen even when the server is not overloaded. pBox does not throttle requests in providing performance isolation.
Application-Specific Scheduling. Customizing scheduling based on an application's workload characteristics can greatly improve performance, thus motivating works to provide this capability [18,33,38,39,50,59,61]. For example, Syrup [39] allows developers to easily write application-specific scheduling policies. DARC [18] profiles application requests and leaves some cores idle when there are no short requests. pBox is orthogonal to these efforts. It is not a scheduler to allocate CPU and other hardware resources. It only takes action when an activity's isolation goal is in danger of being violated. Also, these solutions often assume independent requests, so hardware resources can be arbitrarily scheduled. But requests (and background tasks) involved in intra-app interference have dependencies on virtual resources, thus simply adjusting hardware resources does not help and can worsen the interference.
SLO Guarantees. Some projects target SLO enforcement in multi-tenancy. PSLO [45] enforces tail latency and throughput for consolidated VM storage by controlling I/O concurrency level and arrival rate for each VM. FIRM [63] uses machine learning methods to detect SLO violations in microservices, upon which it adjusts the hardware resource provisioning. pBox aims for fine-grained performance isolation. Overall SLO may not detect interference among application activities. Reacting after SLO violation can also be too late, because the contended virtual resources cannot be directly reclaimed.
Synchronization Optimization. Extensive efforts optimize locks and other synchronization primitives, e.g., scalable spin locks [52], NUMA-aware locks [16], user-defined kernel locks [60]. Many intra-app interference issues are not simply due to poor synchronization or scalability bottlenecks. While locks often appear in them, the virtual resources that cause the interference are diverse and involve complex interactions among application activities. Thus, optimizing locks is insufficient. pBox focuses on end-to-end performance and isolation for an activity instead of an individual lock.
Performance Debugging. It is notoriously difficult to debug complex performance issues in large software. Many profilers and analyzers [6,8,15,32,73,80,81,84] are therefore proposed to help developers with this task. pBox targets performance issues caused by interference among application activities and provides performance isolation at runtime. The log traces from pBox can provide useful insights for developers to understand a performance interference issue.

Conclusion
This paper explores pushing the performance isolation boundaries into an application. We propose an abstraction called pBox. pBox captures general state events about diverse virtual resources, detects imminent interference among application activities, and carefully chooses actions to achieve the performance isolation goal. We apply pBox on five large applications and evaluate it with 16 real-world intra-app interference issues. pBox significantly reduces the interference for most cases.

Figure 1. A real-world intra-application performance interference issue from MySQL. Details are described in Section 2.1.

Figure 2. Throughput of all foreground clients.

Figure 3. Avg. latency of requests from client 4. A fifth write-intensive client connects around time 90 s.

Figure 4. Finding a free block from the buffer pool in MySQL.

Figure 8. Example of using pBox in MySQL.

Figure 13. The number of penalty actions, interference level, and steps for the penalty length to converge to a fixed point. The score-based and gap-based adaptive policies are dynamically chosen.

Figure 16. Overhead under different workload settings. r1 to r64: read-intensive workloads with one to 64 clients. w1 to w64: write-intensive workloads with one to 64 clients.

2.1 Real Intra-App Interference Cases Case 1: UNDO log.
[75,85] as different transaction isolation levels. The default setting establishes a snapshot at the first read. While this is convenient, users found it can cause severe performance interference in production [75,85]. InnoDB is a multi-version concurrency control (MVCC) storage engine, which uses an UNDO log that keeps transaction history. If there are long transactions with old versions, the UNDO log can grow large. As a result, when the old transactions are committed, MySQL's purge thread needs to spend a long time cleaning up the UNDO log, blocking other activities. To reproduce this case, we create a database with 1 table and run two clients: A performs reads and B performs writes. A issues each read request in a transaction, sleeps for 10 seconds after the request finishes, then commits the transaction. By doing so, we have a long transaction that keeps an old version of the table. Consequently, each write request from client B needs to update the UNDO log, which causes the UNDO log to grow large.
Example usage of update_pbox API in MySQL, which can mitigate interference issues such as case 3 in Section 2.1.

Table 2. Evaluated software.
Setup. The experiments are conducted on servers with 10-core (20 hyper-threads) Intel Xeon E5-2640 CPUs at 2.4 GHz, 64 GB DRAM, and a 480 GB SSD, running Ubuntu 20.04. We evaluate pBox on five large, open-source applications (Table 2): MySQL, PostgreSQL, Apache, Varnish and Memcached. We choose them because they are widely used, and cover different functionalities and architectures. They are complex enough to test pBox's generality.

Table 3. Description of 16 real-world intra-application interference cases we collected and reproduced in the five evaluated applications. Y means the interference case is from a bug report; N means the interference case is from some user post without a corresponding bug report.
Figure 11. Avg. latency for each case normalized by the interference performance in the original application, compared to running the application with (1) pBox, (2) cgroup, (3) PARTIES [10], (4) DARC [18], and (5) Retro [48].

Table 4. Average latency (ms) for nine evaluated cases using a fixed penalty versus using the default adaptive penalty design.