Snowcat: Efficient Kernel Concurrency Testing using a Learned Coverage Predictor

Random-based approaches and heuristics are commonly used in kernel concurrency testing due to the massive scale of modern kernels and the corresponding interleaving space. The lack of accurate and scalable approaches to analyze concurrent kernel executions makes existing testing approaches rely heavily on expensive dynamic executions to measure the effectiveness of a new test. Unfortunately, the high cost incurred by dynamic executions limits the breadth of the exploration and puts latency pressure on finding effective concurrent test inputs and schedules, hindering the overall testing effectiveness. This paper proposes Snowcat, a kernel concurrency testing framework that generates effective test inputs and schedules using a learned kernel block-coverage predictor. Using a graph neural network, the coverage predictor takes a concurrent test input and scheduling hints and outputs a prediction on whether certain important code blocks will be executed. Using this predictor, Snowcat can skip concurrent tests that are likely to be fruitless and prioritize the promising ones for actual dynamic execution. After testing the Linux kernel for over a week, Snowcat finds ~17% more potential data races, by prioritizing tests of more fruitful schedules than existing work would have chosen. Snowcat can also find effective test inputs that expose new concurrency bugs with higher probability (1.4× to 2.6×), or reproduce known bugs more quickly (15×) than state-of-the-art testing tools. More importantly, Snowcat is shown to be more efficient at reaching a desirable level of race coverage in the continuous setting, as the Linux kernel evolves from version to version. In total, Snowcat discovered 17 new concurrency bugs in Linux kernel 6.1, of which 13 are confirmed and 6 are fixed.


Introduction
Finding kernel concurrency bugs is challenging due to the extensive search space of kernel test inputs and thread interleavings. Finding a concurrency bug first requires choosing effective concurrent test inputs, i.e., two or more sequential test inputs that concurrently invoke sequences of system calls [29]. Second, finding a concurrency bug requires choosing error-inducing schedules of the concurrently-executing kernel threads [16,17,39,41,43]. Hence, the search space is at least quadratic in the number of sequential inputs, and exponential in the number of instructions that can interleave among concurrent threads [39,40], making kernel concurrency testing particularly daunting.
Although generating good kernel concurrent tests is a broadly explored area of research and practice [17,19,25,68], deciding how fruitful such a generated test is towards a testing goal is a common and important challenge. Typically, testing campaigns target a reward such as an increase in code coverage, data-race discovery, or triggered undesirable behaviors (e.g., kernel panics and deadlocks) [19,21,25,28,68]. However, computing such a reward by executing the candidate test is expensive. These kernel executions run in heavyweight environments (e.g., VMs) with expensive instrumentation [2,19-21,28,36,53,71], yielding particularly low execution throughput (e.g., tens to hundreds of executions per minute).
This naturally limits the efficiency of concurrency testing since the vast majority of random tests do not increase coverage [23,30,46]. In fact, even sequential test input generation, which generally has a smaller search space, is haunted by efficiency challenges. For example, Syzkaller, a mature feedback-based kernel fuzzing tool, may need to execute thousands of random sequential inputs before finding one that increases coverage [53,61]. Moreover, the chance of finding tests that increase the overall coverage drops significantly throughout a fuzzing campaign, making the test execution stage even more wasteful [30,53,61].
Thus, there is an opportunity to prioritize interesting concurrency tests and filter out less interesting ones to curtail the large waste in CPU and wall-clock time from fruitless dynamic executions. Instruction schedules likely to exercise previously uncovered code blocks can be prioritized over schedules that exercise previously observed blocks. This is the quest we embark on in this work. In prior work, one plan of attack is to choose likely-to-be-fruitful concurrency tests at construction, and another is to estimate a reward metric for a test after construction but before execution, and filter accordingly. We summarize examples of each.
Since it is known that effective concurrent inputs should exercise diverse kernel inter-thread data flows [19,36,42,68], static analysis is sometimes used to reason about data flows that would be triggered by an input. Unfortunately, because real-world kernels are complex, traditional analysis approaches face limitations in either accuracy [12,25,44,60] or scalability [7,23,30,37,48,70].
Thus, heuristics that do not require heavy analysis are used [19,34,68]. For example, Snowboard [19], our previous system, prioritizes the tests of concurrent test inputs whose two constituent sequential inputs both touched the same memory when executing single-threaded, as those are likely to exhibit inter-thread data flow when run together. Finding effective schedules is also challenging due to the massive interleaving space [39,41,42]. A kernel concurrent test can concurrently run tens of thousands of instructions from each thread [17], making it infeasible to enumerate all possible interleavings. Hence, a targeted approach that finds and prioritizes interleavings exercising unique concurrent behaviors is necessary. Exhaustively analyzing the consequences (e.g., coverage, data flows, etc.) of interleaved instructions from multiple threads requires formal approaches, such as model checking [33,58,67]. However, these approaches do not scale well to low-level, complex systems, such as the kernel. In practice, constrained random schedulers [6,17,18] that only invoke a fixed number of thread switches per execution are commonly used. The limit on thread switches helps prune the interleaving space, enabling more systematic exploration.
Our work is inspired by the dramatic advances of machine learning (ML) towards code understanding. We propose general and automatic techniques that estimate whether a concurrency test is likely to be fruitful. ML approaches have been used before for software and hardware testing. Neuzz [50] showed that neural networks can learn and predict application edge coverage given the test input. Given the byte sequences of the test input, Neuzz identifies promising bytes that should be mutated for higher coverage. Design2vec [56] further shows that the control and data flow graph of the testing target (hardware in Verilog RTL [66]) can be used by a model to predict test coverage. The success of Neuzz and Design2vec on hardware designs and small-scale applications suggests that ML models may accurately and efficiently predict the execution of concurrent kernels.
However, new challenges arise when applying an ML approach to analyze concurrent kernels. First, representing kernel test inputs (recall, these are userspace programs invoking sequences of system calls) as byte sequences, as Neuzz and Design2vec do, would make it hard for the model to learn because of the extremely long execution paths of kernel APIs. Before analyzing the consequences of interleaved instructions, the model would have to infer the system call specification [22], entry points of different APIs [1,8,30], execution paths of system calls [4,30,67], and then potential interactions between threads [19], among others, exclusively from the plain input byte sequence. Every task in this pipeline is known to be notoriously challenging and often requires specialized approaches to address; it is unrealistic to expect current ML techniques to solve them all in one shot.
Second, presenting the whole kernel's control and data flow graph, which contains millions of code blocks, to a model, as Design2vec does, will incur severe scalability and latency problems. As shown previously [69], when the input graph is large (e.g., over 2M nodes), one model inference can take almost 3 seconds. This time cost is already close to the time of a dynamic execution, which takes about 2.8 seconds per run (§5.2.2), and would cancel most of the benefits of a predictive technique. These two challenges motivate a new design of both input representation and model to target the unique case of kernel concurrency testing.
This paper proposes Snowcat, a kernel concurrency testing framework that prioritizes schedules and test inputs using a learned kernel coverage predictor. The predictor uses a graph neural network model trained to predict whether certain concurrency-sensitive kernel blocks will execute, if the kernel runs a concurrent test input under a given schedule. Importantly, the predictor is designed to be efficient so that it can perform many predictions in the time it takes to execute a single concurrency test, and can, therefore, yield higher end-to-end testing effectiveness than state-of-the-art tools, even when considering the model training cost. Snowcat identifies a set of concurrency-sensitive code blocks that are particularly interesting to predict. Our key observation is that traditional analysis approaches usually struggle to analyze uncovered reachable blocks when a concurrent test input runs under different interleavings. These are kernel code blocks that are reachable but not actually reached when the constituent sequential input runs single-threaded: a (short) control-flow path to them exists, but is not triggered in a single-threaded execution of the test input. Naturally, such blocks might be reached when two sequential inputs run concurrently and interfere with each other, as illustrated in Figure 1. Coverage prediction on such uncovered reachable blocks is useful because it can guide a concurrency testing tool to prioritize tests that exercise control flows different from those covered in sequential executions. Similarly, Snowcat predicts when sequentially-covered blocks are not covered in a concurrent test since this also signifies a new behavior.
The coverage predictor enables Snowcat to efficiently address the challenges in both concurrent input and schedule generation. Thanks to its fast inference (§5.2.2), Snowcat can consider a large pool of candidates and only select a few likely-to-be-fruitful tests to execute; without a coverage predictor, for the same budget of executions, only a far smaller pool of candidates could be considered, potentially missing bugs. Furthermore, when the testing target is a specific part of the kernel (e.g., two instructions that potentially cause a data race), the coverage predictor enables directed testing to find triggering test inputs faster. Snowcat is evaluated on mature versions of the Linux kernel. First, Snowcat can select more effective schedules given random concurrent inputs: it finds 17% more potential data races than the state-of-the-art approach on Linux kernel 6.1 in a week-long search. Second, Snowcat can find concurrent test inputs that trigger bugs faster or with higher probability. For 6 new concurrency bugs in Linux kernel 6.1, Snowcat can find them with a probability 1.4× to 2.6× higher on average than related work. For 6 known data races in Linux kernel 5.12, Snowcat can find the error-inducing test inputs 15× faster on average than an existing tool. More importantly, an analysis of the end-to-end time cost (including model training) shows Snowcat is more efficient: it reaches the same number of potential data races nearly a hundred hours faster than the state of the art.
This paper makes the following contributions: a representation method for concurrent test inputs and schedules, and a model architecture that can learn to predict kernel code block coverage when the given concurrent test input is executed under the given schedule.

Motivation
We consider the problem of coverage-driven concurrency testing for large software systems; we instantiate the problem specifically to the Linux kernel due to its pervasive adoption and utmost significance.
In particular, consider the general workflow shown in Figure 2a. In this workflow, Sequential Test Inputs (STIs) are identified (i.e., what each thread will execute), put together into a Concurrent Test Input (CTI), and a candidate interleaving schedule is chosen; then a dynamic test is run with the chosen Concurrent Test (CT) (i.e., CTI + interleaving). After the dynamic execution, some termination condition is checked, for example a race detector in bug-finding use cases, a cumulative coverage collector in coverage-driven testing, or a time limit otherwise. If the termination condition is not met, another candidate is considered.
In this workflow, the dynamic-test execution consumes most of the computational resources (§1). Yet, most dynamic tests are not fruitful: they might not get closer to the termination condition, either because they do not increase the cumulative coverage metric, or because they do not trigger some exceptional code blocks (e.g., an assertion). Our work seeks to reduce the number of dynamic tests that need to be executed before the termination condition is met, moving to a workflow that looks like Figure 2b instead. To be useful, the green diamond in Figure 2b must meet certain challenges:
High Speed. It must be fast. In particular, it must be faster than a dynamic execution. Otherwise, it would be more cost effective (and simpler) to just execute the dynamic test.
Low False Positives. When it deems a dynamic test helpful, it must be correct much of the time; otherwise it would cause fruitless dynamic executions, defeating the purpose of such a component.
Low False Negatives. Such a component would be fast indeed if it always said that a test will be fruitless. However, that would preclude any dynamic tests from ever executing, and consequently would make no progress towards the termination condition.
We are considering such a component built using machine learning. ML models typically require considerable resources to (a) collect training data and (b) train the model. This leads to an additional challenge:
Low End-to-End Cost. Training data generation and training time must be low enough for the end-to-end cost of the approach to be lower than that of the original workflow.
To put these challenges in perspective, consider this workflow in a simplified setting (that models a race-driven exploration), in which an infinite stream of test candidates is available, and the termination condition is a simple boolean predicate, e.g., "was a target data race triggered during the execution?" (see Figure 3). The (current) exhaustive approach would execute all candidate tests, resulting in 4 test executions in this scenario. An omniscient model that perfectly predicts whether a test will be fruitful (positive in this scenario) would only execute a single test. A more realistic model would have some false positives (executing test 2 even though it is fruitless), and some false negatives (not executing test 4, even though it would have been fruitful), and would keep going until the second fruitful test, test 7, is encountered. §A.6 explores this analytically.
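The scenario above can be replayed with a small simulation. This is only an illustrative sketch: the function name `executions_until_hit`, the 7-candidate stream, and the exact predictions of the "realistic" filter are assumptions chosen to match the description of Figure 3, not data from the evaluation.

```python
from typing import Sequence


def executions_until_hit(fruitful: Sequence[bool], selected: Sequence[bool]) -> int:
    """Count dynamic executions until the first fruitful test is actually run.

    fruitful[i] is the hidden ground truth for candidate i;
    selected[i] says whether the filter lets candidate i execute.
    """
    executed = 0
    for is_fruitful, is_selected in zip(fruitful, selected):
        if not is_selected:
            continue  # filtered out: no execution cost (may be a false negative)
        executed += 1
        if is_fruitful:
            return executed
    raise ValueError("no fruitful test was executed in this stream")


# Hypothetical stream: tests 4 and 7 (1-indexed) are the fruitful ones.
truth = [False, False, False, True, False, False, True]

exhaustive = [True] * len(truth)   # run every candidate
omniscient = list(truth)           # run exactly the fruitful candidates
# Assumed realistic filter: false positive on test 2, false negative on test 4.
realistic = [False, True, False, False, False, False, True]

print(executions_until_hit(truth, exhaustive))  # 4
print(executions_until_hit(truth, omniscient))  # 1
print(executions_until_hit(truth, realistic))   # 2
```

Even with one false positive and one false negative, the assumed filter reaches a fruitful test in fewer executions than the exhaustive baseline, which is the trade-off the challenges above are meant to capture.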
Finally, we are considering the steady state of keeping Linux kernels properly tested as the code evolves from version to version. It is not only important to get to a high level of test validation quality for a single kernel version; it is also important to be able to adapt quickly to the next version, and the one after that. This brings the final challenge:
Generalization. Especially with frequent kernel updates, the cost of training can add up. An ML-based test evaluator should be able to generalize from version to version, with limited additional data-gathering and training cost, possibly by building upon prior training datasets and models.
To summarize, we seek to build a rejection filter for candidate concurrency tests. To be effective given a goal (e.g., target code coverage), such a filter must be cheap enough to build, and to update as new kernels arrive, that its cost is outweighed by the savings from the dynamic tests it skips.

Design
To address the challenges outlined in §2, we consider the design of a learned coverage predictor for CTs. In particular, we build our system around the following design principles:
Train on multimodal data. Many types of kernel information including syntax, semantics, and single-thread executions can be used to train effective models for concurrency testing. Notably, many of these data sources are byproducts of other kernel testing procedures such as kernel fuzzing and static analysis, making their collection cost-efficient.
Predict block coverage. Although several coverage metrics exist (e.g., alias coverage [68]), in kernel testing, which involves copious error-handling code [35] and assertions, the basic code-block coverage can be representative of exceptional kernel behavior [20].
One interleaving at a time. Much kernel-testing work considers separately the choice of STIs to execute on different threads, or the interleaving to combine them into a CT. Even when exploring interleavings, some are useful and some are not, providing an opportunity to save wasted effort. Therefore, we focus on predicting the coverage of a CT, i.e., a CTI plus a target schedule.
Predict coverage of uncovered reachable blocks and sequentially-covered blocks. ML systems are usually constrained by the size of their inputs and outputs, so it is important to reduce the context of any ML prediction. In the case of a block-coverage predictor, a single CT is likely to cover only a small number of code blocks of the entire kernel. On the other hand, all we know about a CT is that its constituent STIs already covered a limited set of sequentially-covered blocks, but the major point of concurrency testing is to see what else we can get those tests, run concurrently, to cover. Therefore, we structure our learning objective as one of also predicting the coverage of uncovered reachable blocks (URBs) and sequentially-covered blocks (SCBs) from the test's code. "Reachable" means blocks that could be reached via a constrained control-flow path from the code run by the test threads. "Uncovered" means blocks that are not covered by each test thread when run sequentially. Sequentially-covered blocks are still interesting to predict because they can be affected by concurrency as well (e.g., a function skipped because of a changed control flow). This design choice limits the size of the task examples to just the blocks covered during sequential runs and those reachable within a small number of hops, making our predictor feasible and efficient.
Putting all these design directions together, Snowcat trains an ML coverage predictor that takes as input a CT (two STIs and a target interleaving), and predicts a Boolean ("covered" or "not covered") for every code block, including SCBs and URBs of the two STIs. In a sense, Snowcat learns to predict the block coverage of a CT from the block coverage of its constituent STIs and the target interleaving.
Specifically, Snowcat fits in the following workflow, typical of feedback-based fuzzing systems [19,20,25,36,68]:
1. Similar to SKI [17], Razzer [25], and Snowboard [19], it assumes a source of STIs, generated by a tool such as Syzkaller [20] or fsstress [3].
2. Similar to Snowboard and Razzer, it uses information already collected during the single-thread execution of STIs (e.g., control flow) to prime a downstream CT generator.
3. It uses a static analysis tool to build a control flow graph (CFG) of the whole kernel, so that uncovered reachable blocks (within k hops) can be easily identified from those sequential executions.
4. It collects a training dataset of concurrent executions, by running an existing tool, such as SKI, to generate and execute CTIs under target schedules. Before the execution, it records the test input and the target schedule. After the execution, it collects the block coverage.
5. Given this training dataset, it trains a per-interleaving coverage predictor model (PIC) that, given the CT candidate including its assembly code, data-flow, control-flow, and schedule edges, predicts the coverage of URBs and SCBs.
6. At run time, the PIC model is used to filter out candidate CTs, by predicting their coverage, and using a strategy to judge whether that coverage would be fruitful towards some goal (e.g., towards increasing overall coverage, or towards reaching an interesting block).
7. When a new version of the kernel is available, data generation and training steps are repeated to fine-tune the existing PIC model to the new kernel version.
We describe the details of our approach next.

CT Data Representation
At the core of Snowcat lies an ML model that predicts the coverage of blocks during the execution of a CT. We now explain how we present a CT to the model. Since we use a variety of information (code, static analysis, dynamic behavior) about a CT, we chose a graph representation. This is similar to prior ML work that focuses on testing, e.g., Design2vec [56], or that aims to learn deep static analyses, e.g., ProGraML [10].
A CT consists of a pair of STIs and scheduling hints. Each STI is a sequence of system calls that will be invoked from one application process, and the scheduling hint tells the executor how to schedule the two kernel threads (e.g., "switch to thread B when thread A executes the i-th instruction"). Figure 4 shows an overview of the graph representation in Snowcat. The CT graph is made of vertices corresponding to kernel basic blocks (i.e., sequences of assembly instructions uninterrupted by control-flow entry or exit).
There are two types of vertices: sequentially-covered blocks (SCBs), i.e., blocks that were covered during the sequential execution of the two constituent STIs; and uncovered reachable blocks (URBs), i.e., blocks that are statically reachable from the sequentially-covered blocks, within a small number of control-flow hops, but that were not reached during the sequential execution. We use the whole-kernel CFG to identify URBs. We limit identification to 1-hop URBs to avoid path explosion and maintain a reasonable number of nodes per CT graph (§5.1.1). Each vertex holds two features: the vertex type (URB or SCB) and the code in the basic block (assembly instructions as text).
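As a minimal sketch of URB identification, the following breadth-first search collects blocks that are statically reachable from the sequentially-covered blocks within a hop limit but were not themselves covered. The CFG encoding (a dictionary from block ID to successor IDs), the integer block IDs, and the function name `find_urbs` are hypothetical choices for illustration; the source only states that a whole-kernel CFG and a 1-hop limit are used.

```python
from collections import deque
from typing import Dict, Iterable, Set


def find_urbs(cfg: Dict[int, Iterable[int]], scbs: Set[int], k: int = 1) -> Set[int]:
    """Return blocks reachable from any SCB within k control-flow hops
    that were not covered during the sequential executions."""
    urbs: Set[int] = set()
    seen: Set[int] = set(scbs)                # SCBs are at distance 0
    frontier = deque((block, 0) for block in scbs)
    while frontier:
        block, depth = frontier.popleft()
        if depth == k:
            continue                          # hop limit reached on this path
        for succ in cfg.get(block, ()):
            if succ not in seen:              # neither an SCB nor already found
                seen.add(succ)
                urbs.add(succ)
                frontier.append((succ, depth + 1))
    return urbs


# Toy CFG: blocks 1, 2, 4 were covered sequentially; 3, 5, 6 were not.
toy_cfg = {1: [2, 3], 2: [4], 3: [5], 5: [6]}
toy_scbs = {1, 2, 4}
print(find_urbs(toy_cfg, toy_scbs, k=1))  # {3}
print(find_urbs(toy_cfg, toy_scbs, k=2))  # {3, 5}
```

Raising `k` grows the URB set quickly on a real kernel CFG, which is consistent with the path-explosion concern that motivates the 1-hop limit.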
The vertices are connected by five types of edges. SCB control-flow edges represent the control flow taken during the sequential execution of the constituent STIs. URB control-flow edges represent static control-flow edges connecting the SCBs to URBs. A third edge type represents intra-thread data flow during the sequential execution and connects blocks in the same thread. A fourth edge type connects blocks from different threads that have a potential data flow; potential data flow occurs between two instructions in different threads, of which one is a write and the other a read, and that address overlapping memory ranges [19,36]. Finally, a fifth edge type represents the candidate schedule as a pair of scheduling hints: proposed yield points from a block in one thread to a block in another thread (see next few paragraphs). A summary of the parts of a CT graph can be found in the appendix (Table 7).
Scheduling-Hint Edges. In our CT graphs, the scheduling hints are enforced via virtualization tools such as SKI, which implements a uni-processor scheduler (i.e., only one thread runs at a time). Given threads A and B, we consider two scheduling hints marked by thread-switch-point instructions A.x and B.y. The concurrent-test execution system will (try to) enforce these two hints by starting with the first instruction of A, executing up to instruction x, and then yielding to thread B. Thread B will then execute from its first instruction up until it reaches instruction y, at which point it will yield back to A. A will then continue from the instruction following x. It is meaningful to have more than two scheduling hints, but we configure Snowcat to set two scheduling hints per CT because they are sufficient for discovering most concurrency bugs [42]. A similar setting is also used in related work [15,17,25].
We call these scheduling instructions "scheduling hints" because the actual interleaving exercised might be different from what was hinted. For example, SKI [17] will skip a thread switch if the thread-switch-point is not encountered, or will invoke additional switches if it detects deadlocks (e.g., a thread switch happens in the critical section).
Our graph representation captures the scheduling hints in the above example, by connecting a scheduling hint edge from the block containing instruction A.x to the first block of B, and a second edge from the block containing B.y back to the block containing A.x; note that this is a simplification of the case where the successor instruction of A.x lies in a different block from A.x, and it essentially tells the model to "finish the block it was executing before the yield".
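The construction of the two scheduling-hint edges can be sketched as follows, writing x for the switch-point instruction in the first thread (A) and y for the one in the second thread (B). The `(thread, instruction index)` keys, the integer block IDs, the `"sched_hint"` label, and the function name are hypothetical; only the edge endpoints follow the description above.

```python
from typing import Dict, List, Tuple

Instr = Tuple[str, int]       # (thread name, instruction index); assumed encoding
Edge = Tuple[int, int, str]   # (source block, destination block, edge type)


def scheduling_hint_edges(
    block_of: Dict[Instr, int],  # maps an instruction to its basic block
    first_block_b: int,          # first block executed by thread B
    x: int,                      # switch point in thread A
    y: int,                      # switch point in thread B
) -> List[Edge]:
    """Build the two scheduling-hint edges for the hints (A.x, B.y)."""
    block_ax = block_of[("A", x)]
    block_by = block_of[("B", y)]
    return [
        # A runs up to x, then yields to the start of B.
        (block_ax, first_block_b, "sched_hint"),
        # B runs up to y, then yields back to the block that contained x.
        (block_by, block_ax, "sched_hint"),
    ]


toy_block_of = {("A", 7): 10, ("B", 3): 21}
print(scheduling_hint_edges(toy_block_of, first_block_b=20, x=7, y=3))
```

Pointing the second edge at the block containing x, rather than at the block of x's successor instruction, mirrors the simplification noted in the text.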

PIC Model Architecture
The goal of the per-interleaving coverage (PIC) predictor is to predict which, if any, of the URBs and SCBs of the two threads are covered. The model is trained using actual dynamic tests and their observed coverage upon completion. Specifically, Snowcat uses a model architecture that consists of two major modules. First, a sequence model (BERT) [11] that is responsible for generating embeddings of code blocks based on their assembly code. Second, a graph neural network (GNN) [49] that takes the graph as input, learns relationships between embeddings of code blocks and performs a binary classification on every node (code block) in the graph.
Since the graph neural network architectures we use are standard, we just outline the GNN "interface" here. It can be seen as a parametric function that predicts targets $\hat{y}$ from input graphs $G$: $\mathrm{GNN}(G; \mathrm{Emb}(\cdot); \theta_{\mathrm{GNN}}) = \hat{y}$, where $\theta_{\mathrm{GNN}}$ are the learnable parameters of the model, and $\mathrm{Emb}$ is an input embedding function of the graph features into vectors of floating-point numbers, so that they can be used readily by the GNN. Recall that our graph features (besides the graph structure itself) are the vertex and edge types, and the text representation of a vertex (block) as assembly.
To embed vertex and edge types, we use a simple learnable embedding matrix that maps types to learnable parameters $\theta_{\mathrm{Emb}}$, one per type (2 types of vertices, 5 types of edges).
To embed the assembly text asm, we use a standard BERT-like encoder (an instance of a Transformer [57]), pre-trained on all assembly code in the Linux kernel. We treat all assembly as text, but elide any numerical tokens, such as register offsets, since they do not provide much useful signal to the model, and their semantics (e.g., memory accesses) are captured by other features in our graphs already. We then pre-train BERT with this preprocessed assembly text, to learn a $\mathrm{BERT}_{\mathrm{asm}}(\mathrm{asm}; \theta_{\mathrm{BERT}})$ function in the standard way [11] (i.e., training on a masked language model objective). We use this BERT-on-assembly encoder as the embedding function for the assembly feature of every vertex in the graph.
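One way to elide numerical tokens before feeding assembly text to a sequence model is sketched below. The tokenization rule, the AT&T-style operand syntax in the examples, and the function name `elide_numeric_tokens` are assumptions for illustration; the source does not describe the tokenizer, and this sketch only recognizes decimal and `0x`-prefixed hexadecimal literals.

```python
import re

# Identifiers (mnemonics, register names), hex literals, decimal literals,
# or any other single non-space character (punctuation such as %, $, commas).
_TOKEN = re.compile(r"[A-Za-z_.][\w.]*|0x[0-9a-fA-F]+|\d+|\S")
_NUMERIC = re.compile(r"0x[0-9a-fA-F]+|\d+")


def elide_numeric_tokens(asm: str) -> str:
    """Tokenize one assembly line and drop purely numeric tokens."""
    tokens = _TOKEN.findall(asm)
    kept = [tok for tok in tokens if not _NUMERIC.fullmatch(tok)]
    return " ".join(kept)


print(elide_numeric_tokens("mov 0x10(%rsp),%rax"))  # mov ( % rsp ) , % rax
print(elide_numeric_tokens("add $8,%r8"))           # add $ , % r8
```

Register names that contain digits, such as `r8`, are matched as identifiers and therefore kept; only stand-alone numeric literals such as offsets and immediates are removed.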
In summary, the learnable parameters of our model are $\theta_{\mathrm{GNN}}$ for the GNN itself, $\theta_{\mathrm{BERT}}$ for the assembly encoder, and $\theta_{\mathrm{Emb}}$ for the 2 vertex and 5 edge types. Note that $\theta_{\mathrm{BERT}}$ is pre-trained once, since what looks like "natural" assembly code does not change much from kernel version to version. However, we do fine-tune these parameters during the training of the GNN whenever a new PIC model is trained on a new kernel version.
We train the GNN by minimizing the binary cross-entropy loss between the predicted coverage $\hat{y}_i$ and the ground truth $y_i$ of all blocks. We compute the binary cross entropy between target and prediction in each graph example first, and the model minimizes it across the examples of the training population.
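The two-level loss described above can be written out in plain Python as a sketch. Averaging over the blocks of a graph and then over the graphs is an assumption (the text does not say whether sums or means are used), and the function names and the clamping constant `eps` are illustrative.

```python
import math
from typing import Sequence, Tuple


def graph_bce(pred: Sequence[float], truth: Sequence[int], eps: float = 1e-12) -> float:
    """Mean binary cross entropy over the blocks of one CT graph.

    pred[i] is the predicted coverage probability of block i,
    truth[i] is 1 if block i was covered in the dynamic execution, else 0.
    """
    total = 0.0
    for p, y in zip(pred, truth):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(pred)


def dataset_loss(examples: Sequence[Tuple[Sequence[float], Sequence[int]]]) -> float:
    """Average the per-graph losses across the training examples."""
    return sum(graph_bce(p, y) for p, y in examples) / len(examples)


# An uninformative predictor (0.5 everywhere) has a loss of ln 2 per block.
print(graph_bce([0.5, 0.5], [1, 0]))
```

In a real training loop the same quantity would typically come from a framework's built-in loss so that gradients can flow to the GNN, type-embedding, and encoder parameters.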

Predicted-Coverage-Guided Concurrency Testing
Once the PIC model is trained, Snowcat can use it to predict the block coverage of new CT candidates that consist of new CTIs and schedules. This section introduces how Snowcat selects interesting schedules and CTIs for dynamic executions based on the predicted block coverage. Snowcat can use an external interleaving exploration tool to propose new schedules (scheduling hints) and then Snowcat generates the CT graphs of these new candidates to get the predicted block coverage from the PIC model. Finally, Snowcat applies a prioritization strategy on the predicted coverage and only executes the CT if it is interesting under the strategy. Snowcat uses one of three strategies to select interesting CT candidates based on the predicted SCB and URB coverage. Their effectiveness in finding effective schedules and CTIs is evaluated in §5.3 and §5.6.2.
S1: New set of positive blocks. Under this strategy, a CT is interesting if it can trigger a new predicted coverage bitmap (sequence of block-coverage Booleans) that has not been observed before. The intuition is that new coverage roughly indicates a control-flow change, even if it does not necessarily cover any new individual blocks. To avoid future CTs that produce the same coverage, Snowcat remembers the predicted block coverage of each previously chosen CT.
S2: New positive blocks. Under this strategy, a CT is interesting and selected if the predicted URB and SCB coverage contains at least one code block that has not been observed before. Similar to S1, Snowcat remembers predicted-to-be-covered code blocks of every CT it selects, so future CTs can be evaluated.
S3: Positive blocks with limited trials. This strategy limits the number of executions in which each positive code block can be attempted. On the one hand, a trial limit higher than 1 can encourage a code block to be attempted several times (e.g., in different calling stacks). On the other hand, the trial limit will prevent Snowcat from trying too many CTs on blocks that might be false positives produced by the model.
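The three strategies can be sketched as small stateful selectors. This is an interpretation rather than the actual implementation: the class names are hypothetical, a prediction is simplified to the set of block IDs predicted as covered (standing in for the coverage bitmap), and for S3 we assume a CT is selected when at least one of its predicted-positive blocks is still under the trial limit.

```python
from collections import Counter
from typing import FrozenSet, Set


class S1NewBitmap:
    """Select a CT if its whole predicted-positive set has not been seen."""
    def __init__(self) -> None:
        self.seen: Set[FrozenSet[int]] = set()

    def select(self, positive: FrozenSet[int]) -> bool:
        if positive in self.seen:
            return False
        self.seen.add(positive)
        return True


class S2NewBlocks:
    """Select a CT if it is predicted to cover at least one unseen block."""
    def __init__(self) -> None:
        self.seen: Set[int] = set()

    def select(self, positive: FrozenSet[int]) -> bool:
        if not (positive - self.seen):
            return False
        self.seen |= positive
        return True


class S3LimitedTrials:
    """Select a CT while some predicted-positive block is under its trial limit."""
    def __init__(self, limit: int) -> None:
        self.limit = limit
        self.trials: Counter = Counter()

    def select(self, positive: FrozenSet[int]) -> bool:
        if not any(self.trials[b] < self.limit for b in positive):
            return False
        for b in positive:
            self.trials[b] += 1  # every positive block in the CT uses one trial
        return True


s1, s2, s3 = S1NewBitmap(), S2NewBlocks(), S3LimitedTrials(limit=2)
for cand in [frozenset({1, 2}), frozenset({1, 2}), frozenset({1})]:
    print(sorted(cand), s1.select(cand), s2.select(cand), s3.select(cand))
```

In this toy run the third candidate, `{1}`, is accepted by S1 (a new bitmap) but rejected by S2 (no new block) and by S3 (block 1 has already used both trials), which illustrates how the strategies differ in selectiveness.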

Implementation
Concurrent test candidate representation. Snowcat uses Syzkaller to generate and execute sequential test inputs (STIs). During the STI execution, Snowcat collects necessary information such as the SCB control flow. Snowcat uses Angr [52] to build the kernel CFG, which is necessary for URB identification. In total, we wrote ~2.5K LOC in Python for converting a concurrent test candidate to an input graph to the model.

Graph dataset collection. To label graphs for training and evaluation, Snowcat modifies SKI (a customized QEMU emulator that applies PCT [6] interleaving exploration on the guest kernel) to dynamically execute and profile the concurrent test candidates, so the block coverage can be collected and used for labeling all nodes in the graph. ~0.5K LOC in C is added to SKI to instrument the guest kernel executions for trace collection. Around 1K LOC of Python code and 0.2K LOC of Bash scripts are implemented for automating and distributing data collection.
Model training and evaluation. The assembly code embedding module is a RoBERTa model trained using the fairseq framework [45]. The GNN module is a GCN [32] implementation from the PyTorch Geometric framework [14]. In total, about 1K LOC in Python is implemented for training the PIC model, plus 5K LOC of Python code and 0.5K LOC of Bash scripts for the evaluation.
Kernel concurrency testing. The evaluation of Snowcat uses existing kernel concurrency testing tools including SKI, Razzer and Snowboard. About 1K LOC of Python code and 500 LOC of C code are implemented to integrate the coverage predictor and perform concurrent test candidate selection.

Evaluation
This section evaluates the effectiveness and efficiency of Snowcat in kernel concurrency testing with respect to existing testing tools. Specifically, it seeks to answer four questions:
RQ1: Can the PIC model accurately predict the coverage of URBs in concurrent kernel executions? (§5.2)
RQ2: Can Snowcat identify more effective test candidates given a budget by using the coverage predictor? (§5.3)
RQ3: Can the cost of Snowcat amortize well as kernels evolve? (§5.4)
RQ4: Is the PIC model beneficial to existing testing workflows? (§5.6)
Setup overview. We evaluated Snowcat on Linux kernels 5.12, 5.13, and 6.1. First, we focus on Linux kernel 5.12 for the initial proof of concept. We train, tune, and evaluate PIC models on Linux 5.12 data. Second, Linux kernels 6.1 and 5.13 are used to study the generalization ability of PIC, in which different retraining trade-offs are studied. The experimental platform details are described in §A.1.

PIC Model Training
We now describe our training methodology, given the PIC architecture ( §3.2).

Dataset Construction.
Although it is important to produce datasets to evaluate RQ1 (a typical ML microbenchmark evaluation), we are also interested in how Snowcat can be used in "practical" settings, as per the remaining RQs. This means that the "test" period for the model is significantly longer than the training and validation period. We have therefore constructed training/validation/evaluation datasets that deviate from the typical 90%/5%/5% example mix in ML research. Specifically, we collected 44,686 concurrent test inputs (CTIs) (i.e., random pairs of sequential test inputs (STIs)) from SKI, on Linux kernel 5.12, and we split them into 21,621 training CTIs, 2,702 validation CTIs, and 20,363 evaluation CTIs. We then produced 64 interleavings for each of the training and validation CTIs, and 1,000 for each of the evaluation CTIs; the much higher number of interleavings for evaluation CTIs was meant to facilitate experiments where we want to give Snowcat the ability to search for good schedules for a long time, beyond what a typical tool like SKI might do. When projected to our block-oriented graphs, this resulted in, on average, 64 unique interleavings per CTI for training and validation, and 953 unique interleavings for evaluation, for a total of 1.37M, 0.17M, and 19.05M graphs across the three dataset splits. A CT graph contains, on average, 9.7K vertices (2.4K URBs and 7.3K SCBs), and 14.1K edges (8.4K SCB control flow, 4.2K URB control flow, 1K intra-thread data flow, 0.6K inter-thread data flow, and always 2 scheduling hint edges).
We also augmented our graphs with shortcut edges (edges that connect vertices that are k sequential control-flow edges apart), which is a common "densification" technique that improves model performance on code GNNs [56].
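The shortcut-edge augmentation can be sketched as follows. This is a minimal illustration rather than the paper's implementation: the function name `add_shortcut_edges`, the edge-list input format, and the reading of "k edges apart" as "reachable by a walk of exactly k control-flow edges" are all assumptions.

```python
from collections import defaultdict


def add_shortcut_edges(control_flow_edges, k):
    """Return (src, dst) pairs for vertices exactly k control-flow hops apart.

    control_flow_edges: iterable of (src, dst) sequential control-flow edges.
    k: hop distance (hypothetical parameter name).
    """
    successors = defaultdict(set)
    for src, dst in control_flow_edges:
        successors[src].add(dst)

    shortcuts = set()
    for start in list(successors):
        frontier = {start}
        # Advance the frontier one control-flow hop at a time, k times.
        for _ in range(k):
            frontier = {nxt for node in frontier
                        for nxt in successors.get(node, ())}
        shortcuts.update((start, end) for end in frontier if end != start)
    return shortcuts
```

For a straight-line chain of four blocks 0, 1, 2, 3 and k = 2, the sketch adds the shortcuts (0, 2) and (1, 3), so a GNN layer can pass information across two control-flow steps at once.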
We then chose the model training checkpoint with the highest Average Precision (AP) [63]; AP computes the mean precision (true positive predictions divided by all positive predictions) over all recall values. This gives a metric of the "goodness" of a model across all tuning points. The model with the highest AP is called PIC-5. To favor positive predictions on "surprising" blocks, we computed AP over URBs only when selecting hyperparameters.
One interesting observation from this hyperparameter exploration is that PIC models with deeper GNN modules achieve higher performance; the number of GNN layers roughly determines how far away in the graph information is gathered from before a decision is made about a vertex. In our case, this observation indicates that analyzing concurrent executions requires considering broader control and data flows.
PIC-5 was then tuned to choose a threshold for the predicted classification probability. We chose the threshold with the highest mean F2 score [64] on graph URBs over the validation dataset. We chose F2 because it favors a higher recall over a higher precision.
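As an illustration of the checkpoint-selection metric, the following sketch computes non-interpolated Average Precision from per-vertex labels and scores, optionally restricted to URB vertices. The function names and the list-based inputs are assumptions made for this example; the evaluation code used in the paper may differ.

```python
def average_precision(labels, scores):
    """Mean of the precision values measured at each true positive,
    walking the predictions from highest to lowest score."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    total_positives = sum(1 for _, label in ranked if label)
    if total_positives == 0:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label:
            hits += 1
            precision_sum += hits / rank  # precision at this recall level
    return precision_sum / total_positives


def urb_average_precision(labels, scores, is_urb):
    """AP restricted to the vertices flagged as URBs."""
    kept = [(y, s) for y, s, urb in zip(labels, scores, is_urb) if urb]
    return average_precision([y for y, _ in kept], [s for _, s in kept])
```

With labels `[1, 0, 1, 0]` and scores `[0.9, 0.8, 0.7, 0.1]`, the two positives sit at ranks 1 and 3, giving precisions 1 and 2/3 and an AP of about 0.83.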
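The threshold tuning step can be sketched as a sweep over candidate thresholds that keeps the one with the highest mean F2 over the validation graphs. The helper names and the input layout (one `(labels, scores)` pair per graph, URB vertices only) are assumptions for illustration.

```python
def f_beta(tp, fp, fn, beta=2.0):
    """F-beta from confusion counts; beta=2 weights recall above precision."""
    b2 = beta * beta
    denominator = (1 + b2) * tp + b2 * fn + fp
    return (1 + b2) * tp / denominator if denominator else 0.0


def pick_threshold(graphs, candidates):
    """Return (threshold, mean F2) for the best candidate threshold.

    graphs: list of (labels, scores) pairs, one per validation graph.
    candidates: iterable of probability thresholds to try.
    """
    best_threshold, best_mean = None, -1.0
    for threshold in candidates:
        per_graph = []
        for labels, scores in graphs:
            tp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y)
            fp = sum(1 for y, s in zip(labels, scores) if s >= threshold and not y)
            fn = sum(1 for y, s in zip(labels, scores) if s < threshold and y)
            per_graph.append(f_beta(tp, fp, fn))
        mean = sum(per_graph) / len(per_graph)
        if mean > best_mean:
            best_threshold, best_mean = threshold, mean
    return best_threshold, best_mean
```

The recall bias is visible in the counts: one missed positive (`f_beta(1, 0, 1)`, about 0.56) costs more than one false alarm (`f_beta(1, 1, 0)`, about 0.83), which is why a lower threshold tends to win under F2.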
PIC Model Performance

Model accuracy.
The performance of PIC-5, under the tuned threshold, is evaluated using several binary classification metrics [64,65] on the evaluation dataset (§5.1.1). Due to the lack of advanced analysis approaches that are comparable to the PIC model, several baseline approaches are proposed and used for comparison:
• All blocks as positive predictor (All pos) predicts every node in the graph as positive. This predictor represents a simple static analysis approach.
• Random binary predictor (Fair coin) predicts every node in the graph as positive with a probability of 50%.
Table 1 presents results on URBs in each graph. All pos has extremely low accuracy, while Fair coin and Biased coin have much trouble with precision. The root cause of their bad performance is that positive/negative labels are extremely skewed for URBs. In other words, most URBs are actually not covered during the concurrent executions, so naive baselines cannot predict accurately.
PIC-5 achieves much better performance across metrics. First, its accuracy is satisfactory. Because accuracy is dominated by the true negative rate under the skewed label distribution, the high accuracy indicates that PIC-5 has a high true negative rate: it can accurately identify URBs that are actually not covered during concurrent executions. Second, PIC-5 outperforms the baselines by two-digit margins on precision and recall. It is expected that precision and recall are a bit lower than accuracy, because these two metrics reflect how well the predictor identifies the actually-covered URBs, which is more challenging than identifying the actually-not-covered ones.
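To make the effect of label skew on the naive baselines concrete, the sketch below computes their expected accuracy, precision, and recall analytically. The graph size and positive rate in the usage note are hypothetical values chosen for illustration, not numbers from Table 1, and the function names are assumptions.

```python
def metrics(tp, fp, fn, tn):
    """Accuracy, precision and recall from (possibly fractional) counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall


def expected_coin_metrics(num_positive, num_negative, p_predict_positive):
    """Expected metrics of a predictor that labels each node positive
    independently with probability p_predict_positive.
    p_predict_positive = 1.0 models the All pos baseline."""
    tp = num_positive * p_predict_positive
    fp = num_negative * p_predict_positive
    fn = num_positive * (1 - p_predict_positive)
    tn = num_negative * (1 - p_predict_positive)
    return metrics(tp, fp, fn, tn)
```

For a hypothetical graph with 1,000 URBs of which 20 are covered, `expected_coin_metrics(20, 980, 1.0)` gives accuracy and precision of 0.02 with recall 1.0, while `expected_coin_metrics(20, 980, 0.5)` gives accuracy 0.5 but precision still 0.02: a coin flip cannot raise precision above the positive rate.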
We show results here for just URBs, because they are a harder target subpopulation, but we also show results on the full set of blocks in §A.3, and they look similar.

Inference cost.
The PIC model can make predictions fast once trained and deployed for inference. On our inference machines, it takes on average 0.015 seconds to predict the coverage for one CT candidate. In contrast, one dynamic execution of a candidate takes 2.8 seconds because of the heavy instrumentation for thread serialization and bug detection. In other words, Snowcat is able to predict coverage for roughly 190 test candidates in the time it takes to run one dynamic execution. This favorable performance asymmetry, balanced with reasonable precision and recall of PIC, explains our positive end-to-end efficiency results in the rest of this section.
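The asymmetry can be checked with a few lines of arithmetic. The two timings are the ones reported above; the function names are assumptions, and the per-CTI budget in the usage note (1,600 inferences, 50 executions) is the one used in the per-CTI experiment later in this section.

```python
# Average timings reported in the text, in seconds.
INFERENCE_SECONDS = 0.015  # one PIC prediction for one CT candidate
EXECUTION_SECONDS = 2.8    # one instrumented dynamic execution


def predictions_per_execution():
    """Number of predictions that fit in the time of one dynamic execution."""
    return EXECUTION_SECONDS / INFERENCE_SECONDS


def per_cti_cost(num_inferences, num_executions):
    """Seconds spent on inference and on dynamic execution for one CTI."""
    return (num_inferences * INFERENCE_SECONDS,
            num_executions * EXECUTION_SECONDS)
```

The ratio is about 187, which the text rounds to 190. Under a budget of 1,600 inferences and 50 executions, `per_cti_cost(1600, 50)` shows that all inferences together take about 24 seconds, against about 140 seconds for the executions.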

Selecting Interesting Schedules with PIC
Snowcat integrates the PIC model into SKI, which uses the PCT algorithm to explore interleavings. By combining PCT with PIC to select promising interleavings according to our selection strategies (§3.3), we build the MLPCT exploration algorithm. The effectiveness of Snowcat hinges both on the predictive power of PIC and on the choice of the selection strategy. This section studies the impact of MLPCT on achieving high coverage, compared to SKI using PCT. Two metrics are proposed:
• Data-race-coverage measures the number of unique possible data races found by a data race detector (an implementation of DataCollider [13]) in explored interleavings.
• Schedule-dependent block coverage measures the number of unique code blocks covered under concurrent executions, excluding all SCBs of the concurrent test. Higher block coverage implies both that more kernel behaviors are explored and that concurrency-dependent behaviors are being explored, which is the goal of this work.
We study both kinds of exploration goodness under two scenarios: (a) given a CTI, what is the maximum coverage metric we can get (§5.3.1), and (b) given a stream of CTIs, what is the maximum cumulative coverage we can get (§5.3.2). Both experiments focus on testing Linux kernel 5.12, and PIC-5 is used by MLPCT.

Coverage Improvement Per CTI.
For this experiment, we choose a random CTI, and then explore it using SKI as well as our MLPCT/SKI variants. We use a budget of 50 dynamic executions for both, but also cap the number of PIC inferences to 1,600. We do this for 1.3K CTIs, and we report coverage increase averaged over all inputs.
Most MLPCT strategies perform better than PCT (10% to 20% more data races and 6.5% to 25.8% more covered schedule-dependent code blocks), showing that MLPCT can identify more fruitful interleavings for actual dynamic executions.
We have also studied how increasing this budget all the way to 200 affects the MLPCT benefit. As the original PCT now executes more dynamic tests, it gets closer to a saturation point, so MLPCT has less headroom to shine. This is consistent with observations in the original SKI work [17] about the number of useful unique schedules for a single CTI. Appendix A.4 shows more detail.
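The budgeted selection loop can be sketched as below. This is a simplified illustration under stated assumptions, not the MLPCT implementation: the strategies are defined in §3.3, and the two rules here only follow how this section describes them (S1 keeps a schedule whose predicted coverage set has not been seen before; S2 keeps a schedule predicted to reach at least one block not yet covered). The function name `mlpct_select`, the `predict` callable, and the choice to update the coverage bookkeeping from predictions rather than from execution results are all hypothetical.

```python
def mlpct_select(schedules, predict, strategy, max_exec=50, max_infer=1600):
    """Pick schedules for dynamic execution under both budgets.

    schedules: iterable of candidate schedules (e.g. produced by PCT).
    predict:   callable mapping a schedule to the set of blocks that the
               coverage predictor expects to be covered (hypothetical API).
    strategy:  "S1" or "S2" (simplified readings, see lead-in).
    """
    selected = []
    seen_coverage_sets = set()   # used by S1
    covered_blocks = set()       # used by S2
    for num_inferences, schedule in enumerate(schedules):
        if num_inferences >= max_infer or len(selected) >= max_exec:
            break
        predicted = frozenset(predict(schedule))
        if strategy == "S1":
            keep = predicted not in seen_coverage_sets
        elif strategy == "S2":
            keep = bool(predicted - covered_blocks)
        else:
            raise ValueError("unknown strategy: %r" % (strategy,))
        if keep:
            selected.append(schedule)
            seen_coverage_sets.add(predicted)
            covered_blocks |= predicted
    return selected
```

Because S2 rejects every schedule that adds no new block, it can exhaust the inference cap before filling the execution budget, whereas S1 also accepts schedules that recombine already-covered blocks in a new way.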

Cumulative Coverage Improvement.
In this experiment, we seek to achieve the highest coverage by running SKI and our ML-enabled variants on a stream of PCT-generated CTIs, each receiving a budget of 50 dynamic test executions. Unlike the experiment in §5.3.1, earlier CTIs have a larger "unexplored" coverage map, but as more CTIs and their interleavings are tested, that coverage saturates.
As shown in Figure 5a, most MLPCT strategies reach higher coverage in terms of unique data races sooner than SKI (up to 10%). As an illustrative example, SKI took 304 hours to reach 3,500 unique possible data races, whereas the best Snowcat strategy, S1, took only 155 hours. Strategy S2 seems to be overly conservative: it only selects schedules that are predicted to execute at least one uncovered code block (§3.3), but because we cap inferences at 1,600, it runs out before it reaches all 50 dynamic executions. This is understandable considering the skewed distribution of positive URBs. Other strategies explore new, unexplored coverage maps (i.e., combinations of covered URBs and SCBs), achieving higher coverage faster.
[Figure 5 panels: (a) Testing Linux kernel 5.12 using PIC-5; (b) Testing Linux kernel 6.1 using PIC-5; (c) Testing Linux kernel 6.1 using PIC-6.ft.sml; (d) Testing Linux kernel 6.1 using PIC-6.ft.med; (e) Testing Linux kernel 6.1 using PIC-6.fs.med; (f) Testing Linux kernel 5.13 using PIC-5 and PIC-5.13.ft.sml]
Generally, SKI/PCT requires 100-200 more hours to reach the same Data-race-coverage size as MLPCT. While this result is very encouraging, it comes with a high start-up cost: PIC-5 took 240 hours of data collection and training to achieve its performance. We next turn to understanding how this start-up cost can be amortized as the kernel evolves.

Adapting to Newer Kernels
If, every time a new Linux kernel comes out, we had to spend 240 hours training to save 100 hours of data-race discovery, the cost/benefit balance would not be favorable. In this section we seek to understand how little (re)training we can get away with as we move from kernel version to kernel version, hoping to achieve an amortization point. We conduct an experiment that tests a new kernel, Linux kernel 6.1 (released about 18 months after version 5.12 with numerous changes), using 4 new PIC models. These 4 models are trained on newly collected training data. We collect such data by generating new CTIs for Linux kernel 6.1, selecting a number of those, exploring their interleavings, and running dynamic executions, as we did to get the original training dataset. However, the new datasets are collected at a smaller scale than the dataset for Linux kernel 5.12, since full-sized data collection and training are not favorable in incremental training settings. We use the same hyperparameters as for PIC-5. The validation performance of the retrained models is analyzed in §A.5. Specifically, we select the variants detailed in the top part of Table 2 (we show PIC-5 for comparison). Those include two fine-tuned variants, but also two from-scratch variants, where PIC-5 is discarded and a fresh model for Linux 6.1 is trained.
Several interesting observations are made. First, Figure 5c and Figure 5d show that fine-tuning PIC-5 with modest new 6.1 data and training time is a feasible and efficient approach to increase the testing effectiveness on the new kernel. In fact, considering that MLPCT is faster than PCT by 50-100 hours, Snowcat not only finds more data races in 6.1 than PCT (17% more races after a week), but also incurs a similar (with PIC-6.ft.med) or even lower (with PIC-6.ft.sml) end-to-end time cost. What's more, this amortization scales with further kernel versions and fine-tuning.
Second, the from-scratch variants (Figure 5e and Figure 10 in §A.5) do not perform well, since they do not have the knowledge already gleaned from many hours of training on Linux kernel 5.12, which does instruct the model usefully about the structure and semantics of kernel code, regardless of the version. In fact, PIC-5 performs better without the benefit of Linux 6.1 data (Figure 5b) than the from-scratch 6.1 models (Figure 5e), finding 300 more possible data races; this is a reminder that dataset size trumps all other scaling factors with large deep models [26].
Motivated by the promising results achieved by PIC-5 on Linux 6.1, we later conduct another experiment to test Linux kernel 5.13 (released about 2 months after 5.12) to verify the effectiveness of PIC-5 on a third kernel using two models. One is PIC-5 and the other is PIC-5.13.ft.sml (Table 2), which is trained by fine-tuning PIC-5 with a small amount of new data collected on Linux 5.13, under the same data collection and training settings as PIC-6.ft.sml. We configure MLPCT to use the S1 strategy, which shows the best overall results in the previous exploration, and run MLPCT under the guidance of PIC-5 and PIC-5.13.ft.sml separately but on the same CTI stream.
Figure 5f compares the Data-race-coverage history between MLPCT, using the two models, and PCT. First, both models enable MLPCT to outperform PCT, strengthening the advantage of the model-guided approach. Second, PIC-5.13.ft.sml helps MLPCT find possible data races faster than PIC-5 by up to 40 hours. However, they achieve a similar coverage level in the end. This result reveals that, when testing a new kernel that has fewer changes since Linux kernel 5.12, PIC-5 remains highly effective: it reaches a similar level of overall Data-race-coverage as the fine-tuned model, while fine-tuning PIC-5 is more useful for increasing the data race discovery speed.

Finding New Concurrency Bugs
To see if MLPCT helps discover new kernel concurrency bugs, we analyzed all data races found by MLPCT in Linux kernel 6.1.We manually pruned benign data races [42] that are annotated [28] or commented as tolerable by developers in the source code or commit messages, data races caused by synchronization primitives (e.g., locks), and data races involving only kernel variables that are not sensitive to correctness (e.g., timers).We spent about 100 person-hours total on manual inspection and reproduction.
We arrived at 14 new data races that are likely to be concurrency bugs and reported them to developers. Of those, 9 are confirmed to be bugs (3 patched), 3 are considered to be harmless and 2 are pending confirmation, as shown in Table 3a. These new bugs reside in different subsystems of the kernel and can cause data loss, wrong system-call return values, inconsistent kernel state, etc. All 9 confirmed new concurrency bugs are found only by MLPCT; PCT alone could not expose any of them in the time allotted to it, which shows that testing random schedules is not effective in finding new kernel concurrency bugs. A possible reason is that mainstream kernels, such as Linux, are already extensively tested under random schedules by existing fuzzing tools such as Syzkaller, which can perform basic concurrency testing by invoking system calls simultaneously in different threads.
Moreover, the effectiveness of MLPCT implies that prioritizing the testing of interleavings that trigger new URB control flows is helpful in finding new bugs. Taking bug #7 as an example (shown in Figure 6), this bug only manifests when two kernel threads run functions vivid_fop_release() and vivid_radio_rx_read() concurrently under very complex interleavings. First, vivid_fop_release() must acquire and release a shared mutex lock before vivid_radio_rx_read() grabs this lock (① → ②). If the lock is acquired in the opposite order (② → ①), the lock-protected code in vivid_radio_rx_read() will execute strictly before ①, therefore masking the bug. Second, vivid_radio_rx_read() needs to update dev->rds_owner before vivid_fop_release() checks its value in the if statement (② → ③), so that the true branch will be taken. Third, vivid_fop_release() should change dev->rds_owner to NULL before vivid_radio_rx_read() reads it (③ → ④), so that vivid_radio_rx_read() will be induced to call vivid_radio_rds_init() again, which is a double initialization and makes dev's state inconsistent. Our analysis indicates that bug #7 has existed for over 9 years, even though testing tools (e.g., Syzkaller) have exercised the associated code extensively. The challenge in finding this bug is discovering schedules that satisfy all order constraints: ① → ②, ② → ③, ③ → ④. With testing tools that only perform random interleaving exploration (e.g., SKI), the chance of discovering this bug is extremely low. However, Snowcat managed to find this bug because it rejected many schedules that do not trigger new URBs and finally selected one that can exercise URBs in the code region ③. The particular schedule consists of two scheduling hints that enforce the thread switches ① → ② and ② → ③. Once the first two order constraints are satisfied, the third one can be satisfied easily. Because the thread scheduler, unless given extra scheduling hints, is unlikely to call a thread switch after the read of dev->rds_owner but before the write to dev->rds_last_block in code region ③, code region ④ can almost always see the dev->rds_last_block updated by ③, which triggers the bug.
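A toy model helps show why three chained ordering constraints are hard to hit by chance. The sketch below treats the four code regions of bug #7 as atomic steps numbered 1 to 4, assumes that regions 1 and 3 run in program order in one thread and regions 2 and 4 in the other, and enumerates the interleavings. This region-level abstraction, the numbering, and all names are assumptions for illustration; real schedules interleave at instruction granularity, where the odds are far lower.

```python
from itertools import permutations

# Assumed program order: one thread runs region 1 then 3,
# the other thread runs region 2 then 4.
PROGRAM_ORDER = [(1, 3), (2, 4)]
# Ordering constraints that must all hold for the bug to manifest.
BUG_CONSTRAINTS = [(1, 2), (2, 3), (3, 4)]


def respects(order, constraints):
    """True if every (before, after) pair is ordered that way in `order`."""
    position = {region: index for index, region in enumerate(order)}
    return all(position[a] < position[b] for a, b in constraints)


def interleavings():
    """All region-level interleavings that preserve per-thread order."""
    return [order for order in permutations((1, 2, 3, 4))
            if respects(order, PROGRAM_ORDER)]


def bug_triggering():
    """Interleavings that satisfy all three bug constraints."""
    return [order for order in interleavings()
            if respects(order, BUG_CONSTRAINTS)]
```

Even in this coarse model only one of the six legal interleavings triggers the bug, and it requires a thread switch after every region.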

PIC Integration Case Studies
This section presents the exploration of using the PIC model to improve existing kernel concurrency testing tools, Razzer and Snowboard, by selecting more effective concurrent test inputs (CTIs) for execution.

Table 3. Overall testing results by Snowcat, which include 17 concurrency bugs and 4 benign data races. DR denotes "data race", OV denotes "order violation [42]" and AV denotes "atomicity violation [42,43]". Bugs confirmed as harmful are in bold type.

Find race-inducing CTIs with Razzer.
Razzer [25] is a kernel concurrency testing tool that uses static analysis to identify possible data races and then tries to generate CTIs to trigger the data race. Given a possible data race, Razzer will generate many sequential test inputs (STIs) through fuzzing, and then inspect their execution code coverage to find pairs of STIs that can separately execute the two data-racing instructions. Such pairs of STIs will be executed concurrently, hoping that the two instructions will race over the shared memory.
However, if a data-racing instruction resides in a URB of the STI, Razzer will not even attempt to trigger the data race.Extending Razzer to allow pairs of STIs that contain the data-racing instructions in either SCBs or URBs is a feasible solution but might encourage too many test candidates.Naturally, the PIC model is a good option to prune this search space.We evaluate this idea using 6 known harmful data races in the Linux kernel 5.12.
We extend the approach of Razzer to Razzer-Relax, which reports STI pairs as CTI candidates if the data-racing instructions reside in SCBs or URBs. Razzer-Relax is able to find more potentially useful CTIs, therefore increasing the chance of finding bugs. In addition, Razzer-PIC is designed based on Razzer-Relax. Razzer-PIC uses PIC-5 to evaluate CTIs identified by Razzer-Relax and only keeps CTIs that are predicted to cover the two corresponding code blocks under some random schedules. For each of the evaluated data races, we let these three variants of Razzer propose CTIs and then dynamically execute the selected CTIs with 5K random schedules using SKI to verify if the data race can be reproduced.
Table 4 presents the results of this experiment. First, it shows that Razzer cannot reproduce 5 data races with the most conservative CTI search algorithm. This finding motivates the use of Razzer-Relax and Razzer-PIC. However, while Razzer-Relax can successfully reproduce all data races, it incurs a significant time cost. For instance, it takes over  [9,27,42].
Additionally, we observe that many CTIs selected by Razzer-PIC actually trigger the two data race instructions to run in dynamic executions, which means PIC-5 does correctly predict the execution of their corresponding blocks. However, the target data race is not reproduced by these inputs because the two instructions triggered by them do not access the same memory, which is another requirement for two concurrent memory accesses to be a data race. This observation suggests the opportunity of training PIC to predict the inter-thread data flows between code blocks (§6). PIC trained on this task can further reduce the time for concurrency bug reproduction and possibly assist points-to analysis on the Linux kernel, which is of limited use in practice due to the high false positive rate.
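The three candidate filters can be contrasted in a short sketch. The data layout (one profile per STI holding the blocks it executes sequentially, `scb`, and its uncovered reachable blocks, `urb`), the `predict` callable, and the function names are assumptions made for this example and do not describe Razzer's or Snowcat's actual interfaces.

```python
def candidate_pairs(profiles, racing_a, racing_b, allow_urb):
    """STI pairs whose members can reach the two racing blocks.

    profiles:  dict mapping STI name -> {"scb": set, "urb": set}.
    allow_urb: False models Razzer (SCBs only),
               True models Razzer-Relax (SCBs or URBs).
    """
    def reaches(profile, block):
        return block in profile["scb"] or (allow_urb and block in profile["urb"])

    return [(a, b) for a in profiles for b in profiles
            if a != b
            and reaches(profiles[a], racing_a)
            and reaches(profiles[b], racing_b)]


def razzer_pic(profiles, racing_a, racing_b, predict, schedules):
    """Keep relaxed pairs predicted to cover both racing blocks under at
    least one of the sampled schedules.

    predict: callable (pair, schedule) -> set of blocks predicted as covered
             (hypothetical stand-in for the coverage predictor).
    """
    relaxed = candidate_pairs(profiles, racing_a, racing_b, allow_urb=True)
    return [pair for pair in relaxed
            if any({racing_a, racing_b} <= predict(pair, schedule)
                   for schedule in schedules)]
```

When a racing block appears only among an STI's URBs, the strict filter returns no candidates at all, the relaxed filter returns every pair that might reach it, and the predictor-backed filter trims that list before any dynamic execution is spent.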

Better Clustering of Similar CTIs in Snowboard.
Snowboard [19] is a kernel concurrency testing framework that builds on SKI by clustering CTIs that trigger "similar" kernel behaviors using various heuristics, and then only sampling a fixed number (1 as published) of exemplar CTIs from each cluster for dynamic executions, assuming the remaining CTIs trigger similar kernel executions and therefore are unnecessary to test. Here we explore if choosing exemplars from a cluster can be improved using PIC. In contrast to §5.6.1, we seek to use PIC to only select CTIs that trigger different kernel executions, rather than CTIs that trigger a specific data race. We first study whether the number of CTIs sampled per cluster affects the effectiveness of Snowboard. We run Snowboard twice to test Linux kernel 6.1 with different CTI sampling sizes but the same INS-PAIR clustering strategy, which clusters CTIs by whether the two constituent STIs separately trigger a kernel instruction to respectively read from and write to a shared kernel memory region in their single-thread executions. In the first run, we use the default CTI sampling size in Snowboard (1 CTI per cluster) and find 1 new bug in Linux kernel 6.1 after testing 322,570 unique clusters. In the second run, we disable CTI sampling so that Snowboard will execute all CTIs in each cluster and we find 6 new bugs (Table 3b) after testing the same number of clusters, demonstrating that the choice of cluster exemplars might determine whether exploration will bear fruit in a fertile cluster. We call the 6 INS-PAIR clusters where the exhaustive application of Snowboard finds bugs the 6 buggy clusters.
We consider an application of test candidate selection strategies (§3.3) to choosing exemplar CTIs from a CTI cluster, by relaxing Snowboard's one-exemplar-per-cluster policy to allow multiple samples. Specifically, whereas Snowboard chooses exemplars from the cluster at random, we invoke PIC for each CTI in the cluster, with a single scheduling hint that forces the write instruction from the instruction pair to yield to the read instruction of the pair. By passing the CTI with a synthetic scheduling hint to PIC, we predict the coverage, and select CTIs that, cumulatively, exercise unique block coverage (S1) or increase total block coverage (S2). The selected exemplars from the two sampling approaches are then tested by the regular interleaving exploration mechanism of Snowboard. We call these sampling approaches SB-PIC (S1) and SB-PIC (S2), and use PIC-6.ft.med (Table 2) for them. We compare SB-PIC to a relaxed Snowboard sampling approach we call SB-RND, which samples a fixed percentage of CTIs from the cluster.
We seek to compare how likely the two sampling approaches are to find bugs, and the number of CTIs they need to execute, using the 6 buggy INS-PAIR clusters we found above. On each buggy cluster, we run SB-PIC and SB-RND separately configured with 25%, 50% and 75% sampling percentages. If the CTIs sampled by each sampler lead Snowboard to uncover the bug, we call it a bug-finding run. Since this experiment is non-deterministic, we run 1000 trials for each buggy cluster and each sampling approach. We report the percentage of bug-finding runs out of 1000 trials as bug finding probability. Additionally, we report the number of executed CTIs per cluster and the percentage of executed CTIs as sampling rate.

Table 5. Results of finding bugs using different sampling methods in Snowboard. Each Snowboard instance is repeated 1000 times on every buggy INS-PAIR cluster. "Bug finding probability" is the number of runs in which the bug was found divided by 1000, "# executed CTIs" is the average number of sampled/executed CTIs, and "sampling rate" is the percentage of CTIs sampled from the cluster. "Bug ID" refers to the bugs listed in Table 3b.

As reported in Table 5, SB-PIC (S1) achieves a perfect bug finding probability. However, it is not a useful sampling approach because it often executes all CTIs in the cluster, which incurs significant testing costs. Fortunately, SB-PIC (S2) produces promising results. On average, it finds each of the 6 bugs with a probability of 77.6% but only needs to execute 44.8% of the CTIs per cluster. In contrast, SB-RND (25%) and SB-RND (50%) sample 25% and 50% of the CTIs per cluster but only achieve bug finding probabilities of 29.5% and 54.6% on average; SB-PIC (S2) is respectively 2.6× and 1.4× better.
Furthermore, when compared with SB-RND (75%), which has an average bug finding probability of 78.5%, SB-PIC (S2) requires fewer dynamic executions (only 44.8% of the CTIs per cluster) to achieve a similar bug finding capability (77.6%). This high efficiency is valuable, as a low sampling rate can save significant testing resources in a real-world testing campaign where abundant clusters need to be tested to uncover a few buggy ones. For instance, testing 322,570 clusters using SB-RND (50%) and SB-RND (75%) would take about 5,662 and 8,443 hours, respectively, on a 30-vCPU VM. Therefore, sampling CTIs using SB-PIC (S2) can significantly improve the effectiveness and efficiency of Snowboard.
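The quoted ratios follow directly from the averages reported in the text, as the short calculation below shows. The dictionary layout and function names are assumptions; the numbers are the averages quoted above.

```python
# (average bug finding probability, average sampling rate) over the
# 6 buggy clusters, as quoted in the text.
RESULTS = {
    "SB-PIC (S2)": (0.776, 0.448),
    "SB-RND (25%)": (0.295, 0.25),
    "SB-RND (50%)": (0.546, 0.50),
    "SB-RND (75%)": (0.785, 0.75),
}


def probability_gain(method, baseline):
    """How many times more likely `method` is to find the bug."""
    return RESULTS[method][0] / RESULTS[baseline][0]


def executions_saved(method, baseline):
    """Fraction of the baseline's executed CTIs that `method` avoids."""
    return 1.0 - RESULTS[method][1] / RESULTS[baseline][1]
```

`probability_gain` reproduces the 2.6× and 1.4× figures against SB-RND (25%) and SB-RND (50%), and `executions_saved("SB-PIC (S2)", "SB-RND (75%)")` shows that SB-PIC (S2) executes roughly 40% fewer CTIs than SB-RND (75%) for a comparable bug finding probability.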

Discussion
Useful prediction tasks for concurrency testing. Snowcat chooses to predict the coverage of 1-hop URBs and SCBs for the concurrent test candidate. However, there might be other prediction tasks that can improve concurrency testing, such as predicting the inter-thread data flows and interrupt handler coverage. In particular, predicting the coverage of multi-hop URBs (e.g., 5-hop URBs) may provide Snowcat more details about the concurrent test execution. However, it is unlikely that this extension would yield significant improvements. First, 1-hop URB coverage is already sufficient to identify test candidates that trigger diverging kernel executions: any control flow changes during the concurrent execution will trigger 1-hop URBs. Second, adding multi-hop URBs to the concurrent test graph will greatly increase the graph size and consequently decrease the efficiency of the coverage predictor (e.g., higher training and inference cost). Thus, extending Snowcat to support new prediction tasks should be motivated by a study that compares the concurrency testing effectiveness of different coverage metrics.
CT graph enhancements. Adding more concurrency-related information to test graphs could help Snowcat train more accurate PIC models. For instance, information about data races that might happen when the concurrent test is executed and special code blocks that belong to kernel synchronization primitives can be encoded in the graph by adding edges of new types and adding new node types.
Guide test input and schedule generation using PIC. Neuzz and Design2vec use the trained model to perform input mutation. A similar algorithm can be applied to the PIC model to identify promising test candidates that trigger new URBs. For instance, the PIC model can suggest that certain SCB control flows are needed to trigger a specific URB. However, it is still challenging to generate sequential test inputs that can trigger arbitrary SCB control flows.
Predict concurrent executions on weak memory models.The PIC model is trained using kernel concurrent execution traces collected under the sequential memory model.While it is possible to train new PIC models using traces under weak memory models, it is unclear how the hardware optimization (e.g., out-of-order execution) can be represented in the concurrent test graph.

Related work
Kernel concurrency testing. SKI [17] takes a CTI as input and executes CTs that explore various interleavings of the CTI using the PCT algorithm [6]. Snowboard [19] generates effective CTIs by predicting the inter-thread data flows that could happen when running two STIs concurrently and prioritizes the testing of CTIs that trigger less-tested data flows. Then Snowboard exercises different interleavings of the predicted data flows to test their impact on the kernel.
Razzer [25] uses static analysis [54] to identify possible kernel data races. Razzer filters out the false positive data races using dynamic executions. It uses a fuzzing tool to find CTIs that may trigger the possible data race and executes those CTIs to check if the data race can actually happen. Krace [68] proposes a new coverage metric called alias coverage for filesystem concurrency fuzzing, which measures the pairs of instructions that touch the shared memory during the execution. It executes every randomly-generated CTI under random interleavings and then measures the alias coverage. If a CTI increases the overall alias coverage, Krace will further mutate it to generate new CTIs. Snowcat differs from the aforementioned tools in that it introduces a new workflow for concurrency testing. Given the CT candidates, Snowcat efficiently evaluates their potential in exercising new kernel behaviors using the coverage predictor and only selects the more promising CTs for dynamic executions. As shown in §5.3 and §5.6, the new workflow helps Snowcat outperform SKI, Snowboard and Razzer with higher testing effectiveness and efficiency.
Kernel testing. Feedback-based fuzzing has been shown to be effective in generating STIs and finding kernel bugs. Syzkaller [20] tries to maximize the code/edge coverage of the kernel using a feedback-based mechanism: it keeps mutating STIs that can increase the coverage. Moonshine [46] improves Syzkaller by extracting system call sequences from real-world applications. HFL [30] uses symbolic executions to guide the mutation of STIs. Snowcat benefits from the development of such tools in that more effective CTIs can be generated from better STIs.
Machine learning for software testing. NEUZZ [50] trains a neural network model to predict the application edge coverage given a test input. Then NEUZZ uses the trained model to guide input mutation: it computes the model gradients to find out which part of the test input needs to be mutated to increase the coverage. FuzzGuard [72] explores the use of ML for directed fuzzing, in which only a specific set of code blocks (target blocks) is of interest rather than all blocks. FuzzGuard trains a model to learn the reachability of target blocks given a test input and then uses the model to predict and skip inputs that cannot hit target blocks.
Design2vec [56] uses a GCN model to predict the coverage of the hardware. In addition to the test input, Design2vec inputs the whole control data flow graph of the hardware in RTL. In a similar vein, ProGraML [10] uses a graph representation of LLVM IR, at a finer granularity (individual instructions) including data- and control-flow edges, towards predicting static properties of code. Snowcat uses a similar model architecture to Design2vec, and a slightly coarser granularity than ProGraML (basic blocks). However, Snowcat can take the scheduling information of the concurrent test into consideration and predict the coverage of the test input when it is executed under a specific interleaving.
Machine learning for kernel testing. SyzVegas [62] uses reinforcement learning to schedule different kernel fuzzing tasks (e.g., test generation/mutation/selection), which otherwise would be scheduled under fixed manually-written policies. It proposes a reward assessment model to learn the costs and benefits of different fuzzing tasks over time and then makes better arrangements of tasks in the following runs. Snowcat is in general orthogonal to SyzVegas as it predicts the concurrent kernel executions and improves the test selection using the predicted coverage.
Healer [55] proposes an algorithm to learn the system call influence relations, i.e., the influence of a system call A on the execution path of another system call B. It finds such relations by running STIs in which system call B is called right after A or without A and comparing the coverage of these STIs. The system call A is concluded to have influence on the call B if A helps B trigger new coverage. Healer keeps learning influence relations and generating new STIs that encourage learned influence relations to test more kernel execution paths. Compared to Healer, Snowcat learns the influence between interleaved instructions triggered by concurrently-running system calls and predicts their coverage, which is more challenging and requires more efficient and accurate approaches such as deep learning.

Conclusion
This work introduces Snowcat, a kernel concurrency testing framework that uses a kernel code block coverage predictor to identify and prioritize interesting concurrent test candidates. The coverage predictor is realized as a GNN model named PIC that takes the concurrent test input and scheduling hints and predicts whether certain concurrency-sensitive blocks will be executed. The coverage predictor enables Snowcat to use a new testing workflow in which new concurrency test candidates are evaluated based on the predicted coverage and executed dynamically only if they are interesting (e.g., cover a new set of code blocks). The evaluation of Snowcat shows that this workflow is both effective and efficient. Snowcat can find more potential data races, reproduce known bugs quickly, and find new bugs with high probability. More importantly, the coverage predictor generalizes across different kernel versions, showing that Snowcat can scale well to rapidly evolving kernels.
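The predictor-based workflow summarized above can be sketched as a simple filter. This is an illustrative sketch under stated assumptions, not Snowcat's actual selection strategy: `predict` is a hypothetical callable standing in for the PIC model, and "interesting" is assumed here to mean that a candidate is predicted to cover at least one block not yet predicted for an earlier selected candidate.

```python
from typing import Callable, Hashable, Iterable, List, Set


def filter_by_predicted_coverage(
    candidates: Iterable[Hashable],
    predict: Callable[[Hashable], Set[int]],
) -> List[Hashable]:
    """Keep only candidates whose predicted coverage adds a new block.

    `predict` is a hypothetical stand-in for the learned coverage predictor.
    """
    covered_so_far: Set[int] = set()
    selected: List[Hashable] = []
    for candidate in candidates:
        predicted = set(predict(candidate))
        if predicted - covered_so_far:
            # Predicted to reach something new: worth a dynamic execution.
            selected.append(candidate)
            covered_so_far |= predicted
    return selected


# Toy predictions keyed by candidate name (made-up values for illustration).
toy_predictions = {"ct1": {1, 2}, "ct2": {2}, "ct3": {2, 3}}
chosen = filter_by_predicted_coverage(
    ["ct1", "ct2", "ct3"], toy_predictions.__getitem__
)
```

Only the candidates in `chosen` would be passed on to dynamic execution; the skipped candidate is the one whose predicted blocks were already accounted for.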

Figure 1. Example of how uncovered reachable blocks can be triggered under concurrent executions.
Filtering out dynamic tests that are unlikely to be fruitful.

Figure 2. Comparison between the general workflow and a predictor-based workflow.

Figure 3. Filtering a stream of test candidates with a perfect and an imperfect filter.

Figure 4. The Snowcat graph representation of a CT candidate example.

Figure 5. Data-race-coverage history comparison between PCT and MLPCT. The total numbers of data races vary between figures because the random CTI and schedule generation is non-deterministic.

Figure 6. A concurrency bug found by Snowcat, which involves four shared variables and is exposed only when at least three ordering constraints are satisfied. This is bug #7 in Table 3a.
• A new workflow for efficient kernel concurrency testing. Snowcat proposes a new kernel testing workflow in which newly generated test candidates are first evaluated based on their predicted coverage. Only the interesting test candidates are then selected for dynamic execution.

Training examples consist of input/target pairs $\langle G_i, y_i \rangle$, where $G_i$ is the CT graph (§3.1), and $y_i$ is an assignment of COVERED/UNCOVERED to the vertices of $G_i$. More precisely, $G_i$ is a graph $(V_i, E_i)$, where the vertices are $V_i = C_i \cup U_i$ (the SCBs and URBs, respectively), and the edges $E_i$ are the union of five edge sets: the sequential control flow edges $S_i$, the intra-thread data flow edges $P_i$, the possible control flow edges to uncovered blocks $D_i$, the inter-thread possible data flow edges, and the scheduling hint edges. Similarly, $\forall v \in V_i$, $y_i[v] \in \{\text{COVERED}, \text{UNCOVERED}\}$ (covered/uncovered under the concurrent execution).
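The training example definition above can be mirrored in a small data structure. The sketch below is only an illustration: the class, field, and enum names are hypothetical and are not taken from the Snowcat implementation. It assumes integer block identifiers and stores the labels as a mapping from vertex to a boolean (True for COVERED).

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, Set, Tuple


class EdgeKind(Enum):
    # Hypothetical names for the five edge sets listed in the text.
    SEQ_CONTROL_FLOW = 1
    INTRA_THREAD_DATA_FLOW = 2
    POSSIBLE_CONTROL_FLOW = 3
    INTER_THREAD_DATA_FLOW = 4
    SCHEDULING_HINT = 5


@dataclass
class TrainingExample:
    scbs: Set[int]                                 # sequentially covered blocks
    urbs: Set[int]                                 # uncovered reachable blocks
    edges: Dict[EdgeKind, Set[Tuple[int, int]]]    # typed directed edges
    covered: Dict[int, bool]                       # target label per vertex

    def vertices(self) -> Set[int]:
        """The vertex set is the union of SCBs and URBs."""
        return self.scbs | self.urbs

    def is_well_formed(self) -> bool:
        """Every edge endpoint is a vertex and every vertex has a label."""
        verts = self.vertices()
        endpoints_ok = all(
            src in verts and dst in verts
            for pairs in self.edges.values()
            for src, dst in pairs
        )
        return endpoints_ok and set(self.covered) == verts


example = TrainingExample(
    scbs={0, 1},
    urbs={2},
    edges={
        EdgeKind.SEQ_CONTROL_FLOW: {(0, 1)},
        EdgeKind.POSSIBLE_CONTROL_FLOW: {(1, 2)},
    },
    covered={0: True, 1: True, 2: False},
)
```

Keeping the edge sets separate by kind, rather than merging them into one adjacency list, preserves the edge type information that a GNN with typed edges would need.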

Table 1. URBs predictor performance. Average metrics across all graphs. BA stands for balanced accuracy.

Table 2. Retraining time cost (hours) of PIC models that are used to test Linux kernel 6.1 and 5.13.

Table 4. Data race reproduction results when using Razzer, Razzer-Relax, and Razzer-PIC. "# CTIs" shows the number of CTIs selected by each approach. "# TP CTIs" shows the number of true positive inputs. The worst case for identifying a true positive occurs when it is at the end of the schedule queue. The average time to reproduce is computed by shuffling the CTI execution queue 1000 times and averaging the time taken to reach the true positive.

547 hours to reproduce data race #F in the worst case. In contrast, Razzer-PIC can reproduce all the races that Razzer-Relax can, but at a much lower time cost. In the worst case, Razzer-PIC can find all 6 bugs 15× faster than Razzer-Relax on average. On the most challenging races, #C, #D, #E, and #F, Razzer-PIC reduces the time by 22%-94%, saving hundreds of hours in total. These results show the potential of the PIC model for finding error-inducing CTIs, and suggest that Razzer-PIC would assist developers with bug reproduction, where latency is crucial.
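The averaging procedure described in the Table 4 caption can be sketched as follows. This is a minimal illustration, not the evaluation script used in the paper: the function names are hypothetical, and the sketch assumes that every CTI costs the same number of hours to execute, which the source does not state.

```python
import random
from typing import List


def average_time_to_reproduce(is_true_positive: List[bool],
                              hours_per_cti: float,
                              trials: int = 1000,
                              seed: int = 0) -> float:
    """Average hours until the first true positive over shuffled queues.

    Assumes a uniform execution cost per CTI (an assumption of this sketch).
    """
    if not any(is_true_positive):
        raise ValueError("queue contains no true positive CTI")
    rng = random.Random(seed)
    total_hours = 0.0
    for _ in range(trials):
        order = list(is_true_positive)
        rng.shuffle(order)
        # 1-based position of the first true positive in this shuffled queue.
        position = order.index(True) + 1
        total_hours += position * hours_per_cti
    return total_hours / trials


def worst_case_time(is_true_positive: List[bool], hours_per_cti: float) -> float:
    """Hours needed if the true positive sits at the end of the queue."""
    return len(is_true_positive) * hours_per_cti
```

For a queue with a single true positive among ten CTIs, the average lands between the cost of one execution and the worst case returned by `worst_case_time`, which is why shrinking the queue to likely true positives cuts reproduction time.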