One Simple API Can Cause Hundreds of Bugs: An Analysis of Refcounting Bugs in All Modern Linux Kernels

Reference counting (refcounting) is widely used in the Linux kernel. However, it requires manual operations on the related APIs. In practice, missing or improperly invoking these APIs has introduced many bugs, known as refcounting bugs. To evaluate the severity of these bugs in the past and in the future, this paper presents a comprehensive study of them. In detail, we study 1,033 refcounting bugs in Linux kernels, present a set of their characteristics, and find that most of the bugs can eventually cause severe security impacts. Besides, we analyze the root causes on the implementation and developer sides (i.e., human factors), which shows that careless usage of find-like refcounting-embedded APIs alone can introduce hundreds of bugs. Finally, we propose a set of anti-patterns to summarize and expose them. On the latest kernel releases, we found 351 new bugs in total, and 240 of them have been confirmed. We believe this study can motivate more proactive research on refcounting problems and improve the quality of the Linux kernel.


Introduction
As a simple but efficient programming technique to manage critical resources, reference counting (refcounting hereafter) [24,25] is heavily used in modern programs. For example, 93.5% of the files in the Linux kernel involve refcounting [39]. Unfortunately, refcounting requires developers to invoke its APIs manually. Across modules in large-scale programs, it becomes too complex to stay free of bugs caused by missing or improper invocations. According to our statistics, Linux kernels are suffering from more and more severe refcounting bugs. Given that plenty of pattern-based or semantic-based detection methods and tools have been proposed to defeat refcounting bugs [20,23,26,27,35,39,44,45,49], this raises three questions: why are increasingly many refcounting bugs still reported, what are the characteristics and root causes of these bugs, and what can be done to proactively reduce or eliminate them?
To answer these questions, we collected and analyzed the refcounting bugs reported in the historical releases of the Linux kernel [13]. In total, as shown in Figure 1, we found 1,033 refcounting bugs in 753 versions of Linux kernels from 2005 to 2022. For each of them, we carefully ascertained the bug details from the commit log, the patch code, the function-level context and even the discussions between the patch authors and developers [36].
With the bugs and dissections, our first goal is to understand the real security impacts of the bugs, which can explain why more and more new refcounting bugs are detected in recent security research [27,35,39,49]. Then, we aim to explore the distribution and lifetime characteristics by conducting a thorough statistical analysis, which can help future research put more effort into error-prone subsystems and long-lived kinds of bugs. Finally, we hope to correctly infer the root causes, which can not only explain why there are so many refcounting bugs in the Linux kernel, but also help us propose meaningful anti-patterns for detecting new bugs in the latest releases.
Security Impacts. Our first two findings show that almost every refcounting bug can lead to a security bug. Specifically, 71.7% of the bugs have caused memory leaks [9] and the other ones eventually introduced use-after-free (UAF hereafter) bugs [10], both of which can easily make a running kernel crash. Based on the security impacts and the related patch code, we gave a more reasonable classification of these bugs.
Listing 1. A Missing-Refcounting Bug.
Distributions and Lifetimes. First, we found that refcounting bugs exist in every corner of the kernel. Besides, we found that the bugs follow a long-tailed distribution: more than 80.0% of the bugs exist in only three subsystems. Finally, we found that 70.0% of the bugs needed more than one year to be fixed after they were first introduced, 19 bugs existed for more than 10 years, and 23 bugs existed from the first Git release (v2.6) to recent ones (v5.x and v6.x).
Unexplored Root Causes and Anti-Patterns. Unlike all existing protection or detection methods [27,35,39,45,49], we tried to explore the root causes of the refcounting problems. As one of our main contributions, we found that if a refcounting API is implemented with a subtle deviation, i.e., some extra behaviors or operations that deviate from most implementations, it can introduce hundreds of bugs. Besides, we also inferred other critical problems that have also introduced hundreds of refcounting bugs. Finally, we proposed 9 anti-patterns and detected 351 new bugs by matching them in the latest releases.
Specifically, for refcounting bugs leading to memory leaks, there are two potential root causes. First, when the developers introduce any subtle deviation from the standard refcounting algorithm, it causes many bugs. For example, we found that a special API, pm_runtime_get_sync(), has caused 94 bugs. The main reason is that this API increases the reference counter (refcounter hereafter) even when it encounters an error and returns an error code, in which case the callers often miss the decreasing operation, causing memory leaks. Second, when the developers, for code conciseness, add refcounter-increasing operations into macros or find-like APIs, new and even skilled kernel developers often miss the decreasing operations, which eventually leads to refcounting bugs. This problem has also caused a large number of bugs.
For the bugs leading to UAF bugs, we also found two critical reasons. First, when the developers access an object after invoking the decrement refcounting API on it, referred to as the use-after-decrease (UAD hereafter) problem in this paper, there is a potential UAF bug. Second, as refcounting omission is encouraged by escape analysis [52], the kernel developers choose to complete the optimization manually for performance. Unfortunately, they often wrongly omit increasing refcounters when new references are created, which can then lead to premature frees [27]. Each of the above two problems has caused many refcounting bugs.
Listing 2. A Misplacing-Refcounting Bug.
Based on the above root causes, we proposed 9 anti-patterns described by semantic templates [19,37]. Then, we designed and implemented a static checker for each of the anti-patterns. The checkers detected 351 new bugs in total in several of the latest kernel releases, and 240 of them had been confirmed at the time of writing. By deeply analyzing the new bugs, we found that their characteristics were very similar to those of the historical ones. Based on this fact, we believe refcounting bugs can be greatly diminished if developers and researchers pay more attention to the characteristics and lessons learned from this study.
The main contributions of this work are three-fold: (1) conducting the first thorough study of refcounting bugs in modern Linux kernels; (2) exploring the latent problems leading to so many refcounting bugs, which can help proactively defeat such bugs in future development; (3) extracting anti-patterns from those bugs, which help to detect hundreds of new ones.

Refcounting
Basically, to implement refcounting, programmers use an integer, i.e., a refcounter, to record the number of references to a memory block. When a reference is created or destroyed, the refcounter should be incremented or decremented. When no reference remains, i.e., the refcounter becomes zero, the memory block is deallocated. However, if any refcounting operation is missed or not performed properly, i.e., a refcounting bug [11], it will lead to severe security impacts, e.g., memory leaks and UAF bugs.
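The mechanism described above can be sketched in a few lines of user-space C. This is only an illustration: the names obj, obj_alloc, obj_get, obj_put and live_objects are hypothetical, and a plain int stands in for the atomic counters that kernel code would use.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical refcounted object; not a kernel structure. */
struct obj {
    int refcount;   /* number of live references */
    int payload;
};

static int live_objects;   /* makes leaks and frees observable */

static struct obj *obj_alloc(int payload)
{
    struct obj *o = malloc(sizeof(*o));

    if (!o)
        return NULL;
    o->refcount = 1;        /* the creator holds the first reference */
    o->payload = payload;
    live_objects++;
    return o;
}

/* Reference creation: increment the refcounter. */
static void obj_get(struct obj *o)
{
    o->refcount++;
}

/* Reference destruction: decrement, and deallocate on reaching zero. */
static void obj_put(struct obj *o)
{
    if (--o->refcount == 0) {
        live_objects--;
        free(o);
    }
}
```

Every obj_get must eventually be matched by exactly one obj_put; one put too few leaves live_objects above zero (a leak), and one put too many frees the object while a reference is still in use.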

Refcounting Bugs
In this paper, we mainly consider the following two kinds of refcounting bugs, which can easily cause memory leaks and UAF bugs. While there are indeed other kinds of refcounting bugs, e.g., the ones caused by race problems [8], we leave them as future work.
Missing-Refcounting Bugs are bugs in which programmers miss the increment or decrement that should accompany a reference creation or destruction. Specifically, a missing decrement will cause a memory leak because the refcounter will never reach zero, and a missing increment will cause a UAF bug because the refcounter will prematurely reach zero. Listing 1 presents a missing-decrement bug in the NVMEM driver [30]. Specifically, the developers only focus on the dev returned from the bus_find_device function (Line 4) and do not realize that the find-like API embeds a refcounter increment. When an error occurs, the code terminates without any paired decrement operation (Line 7), which inevitably leads to a memory leak.
Misplacing-Refcounting Bugs are bugs in which programmers do not place the increment or decrement APIs in the proper places. Listing 2 shows a real-world misplacing-refcounting bug in the USB serial drivers [17]. Specifically, the developers only consider protecting the related operations on the target object in a locked code block to avoid race problems. However, they do not realize the potential UAF bug: usb_serial_put, a decreasing API, will release all the resources attached to the serial object and then free the object itself if its refcounter is one when the API is called.
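The missing-decrement case can be illustrated with a user-space sketch. It is not the NVMEM code from Listing 1: the registry, find_dev, open_dev_buggy and open_dev_fixed are hypothetical names chosen only to mimic a find-like API that returns its result with the refcounter already increased.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct dev {
    const char *name;
    int refcount;
    int usable;
};

/* Hypothetical device registry; each entry starts with one reference. */
static struct dev registry[] = {
    { "nvmem0", 1, 0 },
    { "nvmem1", 1, 1 },
};

static void dev_put(struct dev *d)
{
    d->refcount--;
}

/* Find-like API: the match is returned with its refcounter increased. */
static struct dev *find_dev(const char *name)
{
    for (size_t i = 0; i < sizeof(registry) / sizeof(registry[0]); i++) {
        if (strcmp(registry[i].name, name) == 0) {
            registry[i].refcount++;   /* the embedded increment */
            return &registry[i];
        }
    }
    return NULL;
}

static int open_dev_buggy(const char *name)
{
    struct dev *d = find_dev(name);

    if (!d)
        return -1;
    if (!d->usable)
        return -2;   /* BUG: the reference taken by find_dev() is never dropped */
    dev_put(d);
    return 0;
}

static int open_dev_fixed(const char *name)
{
    struct dev *d = find_dev(name);
    int ret = 0;

    if (!d)
        return -1;
    if (!d->usable)
        ret = -2;
    dev_put(d);      /* paired decrement on every path after a successful find */
    return ret;
}
```

In the buggy variant the error path leaves the refcounter one higher after every failed call, so the object can never be released.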

Our Goals
Considering the severity, several efficient protection and detection strategies have been proposed to defeat refcounting bugs (see §8). However, unlike existing works, the main goals of this paper are to explain the increasing number of refcounting bugs and to serve as an instructive study for future proactive solutions. Specifically, our first goal is to collect all (or most) of the historical refcounting bugs fixed in Linux kernels and to build a sizable bug dataset. Then, based on it, we conduct a statistical study of the security impacts, classifications, distributions and lifetime characteristics. Finally, we want to infer the latent reasons, i.e., root causes, for these bugs, which can be useful for future kernel development. While we also apply our study results to design and implement static checkers to detect new refcounting bugs, we only aim to demonstrate the reasonableness of our findings and root causes. Pursuing a sound and complete detection solution is not the goal of this work.

Refcounting Bug Dataset
To disclose the characteristics and latent reasons, as our first primary goal, we conducted a thorough study that covers about 753 versions of Linux kernels released from 2005 to 2022. Specifically, we use a two-level filtering method to identify historical refcounting bugs. First, by extracting key words from refcounting API names, e.g., "get", "take", "hold", "grab" for increasing APIs and "put", "drop", "unhold", "release" for decreasing APIs, we identified the committed patches that add, delete or move APIs whose names contain the above words. Then, as the second filtering stage, we conducted a deeper analysis by checking the implementations of the related APIs, which helps us select only the patches related to refcounting bugs.
After the above filtering, we collected 1,825 candidate refcounting bugs in total from more than one million commit logs. By manually confirming each of them, e.g., trying to understand the real meaning of the commit messages, checking the contexts of the patch code and even the discussions between the patch authors and the developers, we finally extracted 1,033 bugs as our studied dataset.
Threats to Validity. Like all characteristic studies, on the one hand, we may miss other kinds of refcounting bugs when we use the above methods. For example, in Listing 2, if the developers chose to move the mutex_unlock after usb_serial_put, no change to a refcounting API would appear in the patch and we would miss it. Besides, the patch authors may not use any of the above key words in their commit messages, which is a challenge for our current collecting strategy. Reducing the potential false negatives and collecting the missed historical bugs is left as future work.
On the other hand, we may also collect wrong patch commits as false positives. For example, commit-dcb4b8ad [7] has been proved wrong by commit-0a96fa64 [1]. Specifically, the former patch, adding a "missing" decrement operation, actually introduced an extra decrement, which can cause a premature free, i.e., a UAF bug. To filter out such false positives, we turned to the "Fixes" tags [16] in commit logs and removed the selected bugs whose commit IDs are the tag values of other commits.
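The first filtering level can be approximated by a substring check on API names. The sketch below is an assumption about how such a filter might look, not the tooling used in the study; classify_api_name and the enum values are hypothetical, while the key words are the ones quoted above.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum api_kind { API_OTHER = 0, API_INCREASING, API_DECREASING };

static const char *const inc_words[] = { "get", "take", "hold", "grab" };
static const char *const dec_words[] = { "put", "drop", "unhold", "release" };

/* Returns 1 if `name` contains any of the `n` key words. */
static int contains_any(const char *name, const char *const *words, size_t n)
{
    for (size_t i = 0; i < n; i++)
        if (strstr(name, words[i]))
            return 1;
    return 0;
}

/* First-level filter: classify an API purely by key words in its name. */
static enum api_kind classify_api_name(const char *name)
{
    /* Check decreasing words first, because "unhold" also contains "hold". */
    if (contains_any(name, dec_words, sizeof(dec_words) / sizeof(dec_words[0])))
        return API_DECREASING;
    if (contains_any(name, inc_words, sizeof(inc_words) / sizeof(inc_words[0])))
        return API_INCREASING;
    return API_OTHER;
}
```

Plain substring matching over-approximates: a name such as input_event matches "put" although it has nothing to do with refcounting, which is one reason a second stage that inspects the API implementation is needed.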

Semantic Template based Bug Description
Considering refcounting bugs as a specific kind of semantic bug [34,48,51] and inspired by a recent work [37], we adopt a semantic template-driven method to describe the studied refcounting bugs. In this paper, we mainly consider the following semantic elements to build a reasonable template.
Semantic Operators. First, we use two symbols to refer to refcounting operators: G for a refcounter increment and P for a refcounter decrement. Then, we also use A, D, L and U to refer to the common assignment, dereference, lock and unlock operations. Finally, we can use the operators as functions with one or more parameters p, which are the object pointers. For example, we use G(p_0) to mean the increasing operation on the pointer p_0. Note that two or more operators can be nested, with •, to represent complicated behaviors. For example, we use U • D to mean an unlock operation nested with a pointer dereference.

Bug Semantic Templates
Specifically, the first template (for Listing 1) tells us that there is a potential execution path in which the developers only call the refcounter increment and then jump into error-handling code without any paired decrement operation. From the second template (for Listing 2), we learn that there is a potential execution path in which a refcounter decrement is called with the pointer p_0 and then the same pointer is dereferenced within a nested unlock statement. Based on the above two templates, we can not only easily infer the refcounting bugs but also be motivated to design and implement static checkers to detect new similar bugs.

General Findings
In this section, we describe the general findings from the analysis of selected refcounting bugs, including impacts, classification, distributions and lifetimes.

Security Impacts
To understand the real-world security impacts, we searched the patch descriptions for key words that can reveal the potential impacts, e.g., "leak", "use-after-free", "uaf", "crash", "out of memory". From this search, we draw the following two findings.
Finding 1. About 71.7% of the studied bugs can lead to memory leaks.
Table 2. The percentage of different kinds of refcounting bugs. The underlined percentages mark the kinds for which we infer the root causes. The total number is 1,033.
Based on whether the decreasing APIs were added in the same functions as the increasing ones, as shown in the top half of Table 2, we divided the missing-decreasing bugs into Intra-Unpaired and Inter-Unpaired. Other special bugs, e.g., operating on a wrong API, are categorized into Others and not considered in the following analysis. While memory leaks usually involve small structures, attackers can easily trigger the leaks many times through loop-based scripts [4], which eventually leads to serious security impacts.
Note that most of the misplacing-refcounting bugs are actually missing-refcounting ones. For example, we first identified the bug commit-bf4a9b24 [6] as a misplacing-refcounting bug because there is an explicit movement. However, after manually checking the patch description, we realized that this is actually an intra-unpaired bug, as the premature exit leaves the increasing operation unpaired.
Finding 2. About 28.3% (292/1033) of the studied bugs can lead to UAF bugs, and most of them are caused by misplacing-refcounting. Specifically, about 9.1% (94/1033) of all bugs can be detected by checking whether there is any reference access after the decreasing operations.
First of all, it is a real surprise that misplacing-refcounting is the major kind of refcounting bug leading to UAF bugs, as recent UAF-related research [27] only considers missing-refcounting bugs. Furthermore, among these bugs, we find a simple but frequently reported bug type, i.e., UAD, which we describe in more detail in §5.4.1.
Then, we identify two reasons why fewer UAF-causing bugs are reported. On the one hand, based on our patch-committing experience, the kernel developers are very strict about accepting UAF bug patches, as these patches will be applied to the stable releases. Usually, they require the committer to provide a real crash report or even a Proof-of-Concept (PoC hereafter). On the other hand, it is difficult to produce a UAF-triggering PoC even after finding a real refcounting bug, which usually needs long-running fuzzing work [47].

Bug Distributions
To explore the location distributions of the bugs and clarify what kind of source code is error-prone, we analyzed the source files where each of the bugs was detected. Overall, refcounting bugs can happen in almost every corner, even in the kernel init code [12]. By counting the number of bugs in each subsystem, we present two kinds of results in the bar charts of Figure 2.
Finding 3. The refcounting bugs follow a long-tailed distribution in the Linux kernel. About 82.4% (851/1033) of the refcounting bugs were detected within the "drivers", "net" and "fs" subsystems, and more than half of all bugs (588/1033, about 56.9%) occurred in "drivers".
In fact, the long-tailed distribution is not specific to refcounting bugs. An early work [22] has already reported this kind of distribution for many types of operating system errors. Here we use another metric, bug density, i.e., the number of bugs per thousand lines of code (KLOC), which is presented in the right half of Figure 2. Unlike the early work [22], it is the "block" subsystem, not "drivers", that has the highest density. While only 18 bugs exist in "block", it has only 65 KLOC. We hope these findings can motivate researchers to select proper targets to check [35,38,39,42,49] and fuzz [21,31,32,46,47].
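As a worked instance of the density metric, using only the two "block" figures quoted above (the rounded result is our own arithmetic, not a value read from Figure 2):

```latex
\[
\text{density}_{\text{block}}
  = \frac{18\ \text{bugs}}{65\ \text{KLOC}}
  \approx 0.28\ \text{bugs per KLOC}
\]
```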

Bug Lifetimes
To expose the latent periods of our studied refcounting bugs, which helps us understand the difficulty of detecting them, we analyzed the lifetime of each bug. Specifically, we again turned to the "Fixes" tags, from which we can calculate the lifetime of a bug, i.e., from the time the bug was first introduced to the time it was fixed by the current commit. We use the bug-introduced time (version) and the bug-fixed time (version) to draw the lines [22].
However, as not all kernel developers require the "Fixes" tags, we found only 567 bugs with such tags.
Finding 4. Surprisingly, about 75.7% (429/567) of the refcounting bugs needed more than one year to be fixed after they were first introduced into the kernel. There are even 19 bugs that existed for more than 10 years, including 7 bugs that can eventually lead to UAF bugs.
The main reason for the long lifetimes is that refcounting bugs are a kind of silent semantic bug [37], which is difficult to detect with mainstream fuzzing-based methods. Besides, this finding suggests that there may still be many bugs in unpatched running kernels. To support this point, we further analyze how many kernels a refcounting bug can affect based on its lifetime. In Figure 3, we present the details of our analysis. Note that the bugs are sorted by the time (version) they were introduced, and the length of each line represents the lifetime of the bug.
Finding 5. There are 23 bugs that existed from the first major release (v2.6.y) to the recent ones (v5.x and v6.x), which means these bugs can affect most of the Internet machines running Linux kernels.
As a representative example, the refcounting bug commit-0711f0d7 during boot was introduced in "the pre-KASAN stone age of v2.6.19" [12] and can make most modern Linux machines crash. Besides, in Figure 3, there are also many bugs whose lifetimes span two or more major releases. For example, there are about 135 bugs spanning from v4.x to v5.x and 80 bugs from v3.x to v5.x.
Finally, we can also see that more and more bugs are introduced in newer versions, but more and more are also fixed within the same major releases (the black dots), e.g., 189 bugs were introduced and fixed within the v5.x kernels. A possible reason is that researchers and developers have begun to realize the serious impacts, and more and more methods [26,35,39,49] have been proposed to detect and fix potential refcounting bugs.

Root Causes
We conducted our root cause inference only for the representative bugs. Specifically, for leak-related bugs, we chose all missing-decreasing bugs (67.2%). For UAF-related bugs, we chose the UAD bugs (9.1%) and intra-unpaired bugs (5.1%). These types have been introduced in §4.1. We leave other complicated and valuable bugs, e.g., race problems, as future work.
First of all, we carefully classify the APIs that involve refcounting operations. Specifically, based on our manual analysis of the implementations of the APIs that have caused bugs, we classify them into the following three categories.
General Refcounting APIs. These APIs directly increase or decrease the refcounter of objects of basic structures, e.g., refcount_t, kref and kobject. Correspondingly, refcount_inc / refcount_dec, kref_get / kref_put, and kobject_get / kobject_put are three common pairs of general refcounting APIs. These general APIs are used widely across the whole kernel code base.
Specific Refcounting APIs. These APIs operate on specific objects, e.g., device_node, which use the basic object kobject as one of their field members. The developers usually implement these specific APIs by directly invoking the general ones and passing the specific object as the parameter. These wrapped APIs are usually used in a specific subsystem. For example, of_node_get / of_node_put for device_node are only used to support the DeviceTree-related code.
Refcounting-Embedded APIs. These APIs are mainly used to complete non-refcounting tasks, but embed refcounting operations. For example, in Listing 1, the developers provide the bus_find_device API mainly to find a proper device based on the specific bus parameter. In the implementation, they want to keep the returned object alive and therefore embed a specific increasing API, e.g., get_device, into the algorithm. Note that this kind of API, especially the find-like APIs, has caused hundreds of missing-refcounting bugs.
Now we introduce several reasonable root causes inferred along three dimensions: (1) the differences among refcounting APIs' implementations; (2) the locations of the unpaired and missed APIs; (3) the potential risks of calling or not calling these APIs. To benefit the Linux community and other researchers, we have listed the APIs prone to refcounting bugs in Appendix A.
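The three categories can be pictured as layers. The following user-space sketch uses hypothetical names (counter, node, find_node_by_id) that merely stand in for the general, specific and refcounting-embedded layers; it is not kernel code.

```c
#include <assert.h>
#include <stddef.h>

/* General layer: a bare counter, standing in for kref or refcount_t. */
struct counter {
    int refs;
};

static void counter_get(struct counter *c) { c->refs++; }
static void counter_put(struct counter *c) { c->refs--; }

/* Specific layer: wrappers for one object type that embeds the counter. */
struct node {
    struct counter ref;
    int id;
    struct node *next;
};

static struct node *node_get(struct node *n)
{
    if (n)
        counter_get(&n->ref);
    return n;
}

static void node_put(struct node *n)
{
    if (n)
        counter_put(&n->ref);
}

/*
 * Embedded layer: the main task is the lookup, but the match is handed back
 * with an extra reference that the caller has to drop with node_put().
 */
static struct node *find_node_by_id(struct node *head, int id)
{
    for (struct node *n = head; n; n = n->next)
        if (n->id == id)
            return node_get(n);
    return NULL;
}
```

Nothing in the name find_node_by_id hints at the increment, which is the property that makes the third category error-prone for callers.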

Implementation Deviation
Overall, a subtle deviation in a refcounting implementation, compared with the most common ones, can cause many unexpected bugs because callers may not easily notice it. In this paper, we introduce two kinds of deviating refcounter-increasing APIs that together have caused hundreds of bugs.

Return-Error Problem.
While a regular refcounting API has no need to return an error code, there are always exceptions. For example, commit-87710394 [5] reports a refcounting bug that involves an internal power management API, i.e., __pm_runtime_resume, which is used in a helper API, pm_runtime_get_sync [28] (we also found another similar API, kobject_init_and_add [33]). In the implementation of this API, the refcounter is increased no matter whether an error occurs. In other words, once the refcounting API is invoked, the caller must invoke the corresponding decreasing API (e.g., kobject_put after kobject_init_and_add) in every potential code path to avoid memory leaks.
For clarity, in the top part of Listing 3, we present the implementation details of the __pm_runtime_resume function. Note that this API adds an extra invocation of rpm_resume, which can be treated as a deviation from other cases. Based on this implementation, we can see that the API always increases the refcounter and the callers should add the decreasing API in every potential code path. Unfortunately, a common usage pattern for error-returning APIs is that, on error, the caller jumps directly into the error path without pairing the refcounting operation. Specifically, we present in the bottom of Listing 3 the buggy behavior reported in commit-87710394. In total, we found 106 bugs in our dataset caused by the above two APIs.
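The return-error deviation can be reproduced in a few lines of user-space C. The sketch is a simplified model rather than the kernel implementation: pm_dev, dev_get_sync, dev_put and the two callers are hypothetical names, and the error value -5 is arbitrary.

```c
#include <assert.h>

struct pm_dev {
    int usage_count;
    int resume_fails;   /* set to 1 to simulate a failing resume */
};

/*
 * Deviating increase API: the counter is bumped first and stays bumped
 * even when the operation that follows fails.
 */
static int dev_get_sync(struct pm_dev *d)
{
    d->usage_count++;
    return d->resume_fails ? -5 : 0;
}

static void dev_put(struct pm_dev *d)
{
    d->usage_count--;
}

static int do_io_buggy(struct pm_dev *d)
{
    int ret = dev_get_sync(d);

    if (ret < 0)
        return ret;     /* BUG: usage_count is left elevated */
    dev_put(d);
    return 0;
}

static int do_io_fixed(struct pm_dev *d)
{
    int ret = dev_get_sync(d);

    if (ret < 0) {
        dev_put(d);     /* pair the increment on the error path as well */
        return ret;
    }
    dev_put(d);
    return 0;
}
```

The buggy caller follows the usual convention that a failed call has no side effects to undo, which is exactly the assumption that the deviating API breaks.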

Return-NULL Problem.
In a standard implementation, the increasing API only accepts an object pointer and increases its refcounter without returning anything. However, in many cases, the developers prefer to return the same object pointer, which the caller can then pass to the corresponding decreasing API. As a result, there will be potential null-pointer-dereference (NPD hereafter) bugs, as the returned pointer, which may be NULL, will be directly accessed or dereferenced without any NULL-check. We found 7 new bugs caused by the return-NULL problem in the latest release, and 3 of them have been confirmed by the developers.
Listing 4. A SmartLoop and A Bug Caused by Loop Break.
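A minimal sketch of the return-NULL pattern, under the assumption of a hypothetical node_get that hands back its argument, might look as follows; read_value shows the NULL-check that the buggy callers omit.

```c
#include <assert.h>
#include <stddef.h>

struct node {
    int refs;
    int value;
};

/* Increase API that returns its argument; a NULL input yields NULL. */
static struct node *node_get(struct node *n)
{
    if (n)
        n->refs++;
    return n;
}

static void node_put(struct node *n)
{
    if (n)
        n->refs--;
}

/* Careful caller: checks the returned pointer before dereferencing it. */
static int read_value(struct node *maybe, int fallback)
{
    struct node *n = node_get(maybe);
    int v;

    if (!n)
        return fallback;   /* a buggy caller would read n->value unconditionally */
    v = n->value;
    node_put(n);
    return v;
}
```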

Lessons and Anti-Patterns.
While subtle deviations of refcounting APIs are accepted in kernel development, considering the various requirements of the implemented functions, callers should be very careful when they use such APIs. One way to defeat the potential refcounting bugs is to provide a detailed explanation in the API comments, which has been adopted by current releases. Another way, as important future work, is to proactively detect such deviations and then make them publicly known to new and even skilled kernel developers.
For the bugs caused by implementation deviations, we propose the following anti-patterns: Here, we use G  to indicate the APIs that increase refcounters no matter whether an error occurs, use G  to indicate the APIs that may return a NULL pointer, and use D  to mean a pointer dereference without any NULL-check.

Hidden Refcounting
In this paper, we use the word hidden to mean two things: (1) the invocation locations of the refcounting APIs inside the bug-causing APIs are hard to notice; (2) the semantic similarity between the key words of refcounting API names and those of bug-causing API names is very low.

Complete-Hidden Problems.
To make it clear, we use a real-world bug reported by commit-1085f508 [2], shown in the bottom of Listing 4. The macro-defined for_each_matching_node, referred to as a SmartLoop, invokes a refcounting API, i.e., of_find_matching_node, at the end of each iteration. It is worth noting that this find API accepts an object pointer from (Line 6), whose refcounter it decreases (Line 11), and returns a new object pointer whose refcounter has been increased (Line 9). However, all of the above operations are hidden from callers by the SmartLoop definition, as the developers only care about the iteration purpose. Consequently, when the break condition is satisfied (Line 18), most developers choose to break out directly (Line 19) and thus miss the chance to pair the increasing operation. From another angle, as we can see from Table 3 (details in the following subsection), the key word foreach has a very low similarity with the key words of general refcounting API names.
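The SmartLoop behavior can be modeled in user space as below. This is a simplified stand-in for the macro and find API discussed above: find_next, for_each_node, has_id_buggy and has_id_fixed are hypothetical names, and the list is a plain singly linked list rather than a device tree.

```c
#include <assert.h>
#include <stddef.h>

struct node {
    int refs;
    int id;
    struct node *next;
};

static void node_put(struct node *n)
{
    if (n)
        n->refs--;
}

/*
 * Find-like iterator step: returns the node after `from` (or the head when
 * `from` is NULL) with its refcounter increased, and drops the reference
 * held on `from`.
 */
static struct node *find_next(struct node *head, struct node *from)
{
    struct node *next = from ? from->next : head;

    if (next)
        next->refs++;
    node_put(from);
    return next;
}

/* SmartLoop: the get/put pair is hidden inside the loop header. */
#define for_each_node(head, n) \
    for ((n) = find_next((head), NULL); (n); (n) = find_next((head), (n)))

static int has_id_buggy(struct node *head, int id)
{
    struct node *n;

    for_each_node(head, n) {
        if (n->id == id)
            break;          /* BUG: leaves the loop still holding a reference on n */
    }
    return n != NULL;
}

static int has_id_fixed(struct node *head, int id)
{
    struct node *n;

    for_each_node(head, n) {
        if (n->id == id) {
            node_put(n);    /* drop the hidden reference before breaking out */
            break;
        }
    }
    return n != NULL;
}
```

A loop that runs to completion is balanced automatically, because the final call to find_next drops the last reference and returns NULL; only the early exit needs the explicit put.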

Increasing-/Decreasing-Hidden Problems.
To understand the real impacts of the low semantic similarities, we select and analyze the API names involved in the intra-unpaired bugs, in which the developers fail to pair the refcounting operation in some potential execution path within the same function; this can be easily inferred from the patch description. In total, we found 254 such bugs in our dataset. In Table 3, we present the key words in the names of the bug-causing and general refcounting APIs together with their semantic similarities.
In this work, we use a learning-based method, word2vec [41], to obtain the word vectors and then calculate their cosine similarities [18] as semantic similarities. Note that we obtain the word vectors by training the CBOW [40] model on more than one million historical commit logs, including the code and comment text.
From the semantic similarity results in Table 3, we can see that all the key words of bug-causing APIs, e.g., "find", "parse", "open", have very low similarities with "refcount", "increase" and "decrease". Besides, there are also very low similarities between the key words of bug-causing APIs and those of the general refcounting APIs, e.g., "get"/"put", "hold"/"unhold", "grab"/"drop" and "retain"/"release". This explains why developers who use find-like or parse-like APIs usually do not realize the existence of the refcounting behavior. Note, however, that there is a relatively high similarity between "find" and "get" (0.73), and even "put" (0.58). The main reason is that find-like APIs always call get-named or even put-named refcounting APIs. For example, in Listing 4, there is an of_node_put invocation, which means that an of_node_get should be added if the from parameter is not NULL. In fact, we have detected 16 new missing-increasing bugs of this kind.
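For reference, the standard cosine similarity between two word vectors can be written as follows, where the symbols u, v and the dimension d are our own notation rather than the paper's:

```latex
\[
\operatorname{sim}(u, v)
  = \frac{u \cdot v}{\lVert u \rVert \, \lVert v \rVert}
  = \frac{\sum_{i=1}^{d} u_i v_i}
         {\sqrt{\sum_{i=1}^{d} u_i^{2}} \; \sqrt{\sum_{i=1}^{d} v_i^{2}}}
\]
```

Values close to 1 indicate that two key words appear in similar contexts, and values close to 0 indicate unrelated usage.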

Lessons and Anti-Patterns.
The developers should consider adding key words that imply the refcounting behavior to the names of functions that actually contain refcounting operations. Otherwise, other developers who call these functions will have a high probability of overlooking the refcounting operation, which eventually leads to missing-refcounting bugs. Consequently, we believe that detecting the missing-refcounting bugs caused by low semantic similarities can be an interesting research direction.
For the bugs caused by hidden refcounting problems, we propose the following anti-patterns: We use M SL to mean the macro-defined SmartLoop, and use G  |P  to indicate the hidden refcounting APIs.

Overlooked Location

Error-Handle Problems.
Error-handling blocks are locations where developers usually pay more attention to undoing other important things, e.g., resource deallocation. In our dataset, there are 110 bugs in total in which the developers added the paired decreasing APIs in all paths except the error-handling blocks.

Indirect-Call Problems.
We find that leak-related refcounting bugs are frequently reported in inter-paired functions. For example, the developers often call the increasing API in the open function of a specific file_operations structure but fail to call the decreasing API in the release function. The paired probe and disconnect of usb_driver operations, connect and shutdown of proto_ops operations, and probe and remove of platform_driver operations are the common cases where such bugs exist.

Direct-Free Problems.
If the developers are sure that they are removing the last reference to an object, they prefer to free the target object directly with the kfree function instead of using the decreasing API. However, many decreasing APIs not only decrease the refcounter, but also release other pre-allocated resources. As a result, the direct-free operation leaks those resources, as they have no chance to be freed. For example, commit-258ad2fe [3] fixes a direct-free bug that leaks a name string allocated during object initialization. In total, 44 bugs are caused by the direct-free problem.
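The direct-free leak can be demonstrated with a small user-space model. The allocation wrappers xalloc and xfree, the counter live_allocs and the obj type are hypothetical; they exist only so that the leak becomes observable.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

static int live_allocs;   /* number of allocations not yet freed */

static void *xalloc(size_t n)
{
    void *p = malloc(n);

    if (p)
        live_allocs++;
    return p;
}

static void xfree(void *p)
{
    if (p) {
        live_allocs--;
        free(p);
    }
}

struct obj {
    int refs;
    char *name;   /* separately allocated during initialization */
};

static struct obj *obj_create(const char *name)
{
    struct obj *o = xalloc(sizeof(*o));

    if (!o)
        return NULL;
    o->name = xalloc(strlen(name) + 1);
    if (!o->name) {
        xfree(o);
        return NULL;
    }
    strcpy(o->name, name);
    o->refs = 1;
    return o;
}

/* Decreasing API: also releases the resources attached to the object. */
static void obj_put(struct obj *o)
{
    if (--o->refs == 0) {
        xfree(o->name);
        xfree(o);
    }
}
```

Calling xfree(o) on the last reference releases only the object itself and leaves the name string allocated, whereas obj_put(o) releases both.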

5.3.4 Lessons and Anti-Patterns. Developers should never call kfree directly on a refcounted object. The main difficulty is identifying refcounted objects, because an object may have no refcounter of its own and instead embed another refcounted object; e.g., the device structure has no refcounter of its own but embeds the refcounted kobject structure. Fortunately, such objects often come with dedicated refcounting APIs, e.g., get_device/put_device for device objects.
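One way to recognize such embedded refcounters is to walk nested structure definitions up to a depth limit, similar in spirit to the parsing threshold mentioned in §6. The sketch below is a simplified assumption: the field lists are heavily abbreviated, foo_priv is hypothetical, and only refcount_t fields are treated as refcounters.

```python
# Abbreviated structure definitions: each entry lists "type name" fields.
STRUCTS = {
    "kref": ["refcount_t refcount"],
    "kobject": ["const char *name", "struct kref kref"],
    "device": ["struct kobject kobj", "struct device *parent"],
    "foo_priv": ["int id"],  # hypothetical structure without a refcounter
}

REFCOUNT_TYPES = {"refcount_t"}


def embeds_refcounter(name, max_depth):
    """True if a refcounter is found within max_depth levels of embedding."""
    if max_depth < 0:
        return False
    for field in STRUCTS.get(name, []):
        parts = field.split()
        if parts[0] in REFCOUNT_TYPES:
            return True
        # Follow embedded structures only; pointer members are not embedded.
        if parts[0] == "struct" and "*" not in field:
            if embeds_refcounter(parts[1], max_depth - 1):
                return True
    return False
```

With a depth limit of 2 the device structure is recognized as refcounted through kobject and kref, while a limit of 1 stops at kobject and misses it, which is why the limit has to be chosen with the nesting of real structures in mind.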
We use the following anti-patterns to describe the bugs that are caused by overlooked locations: Here, we use $S_P \mid B$ to denote the two paths, one containing the decreasing API and the other containing the error-handling code. We use $F_{\top}$ and $F_{\bot}$ to refer to two inter-paired APIs, e.g., the initialization API xx_probe and the destruction API xx_destroy. Finally, we use $S$ to denote the kfree invocation.
5.4 Future Risk
5.4.1 Potential Deallocation Problems. This problem is a surprising finding: it is an important root cause of refcounting bugs that can potentially lead to high-risk UAF bugs. Specifically, when developers drop a reference, they take it for granted that it is still safe to access the object through the reference that has just been used to decrease the refcounter. The main reason is that some developers firmly believe that the refcounter cannot reach zero on any path of the current release (see the details in Figure 4(1) and the replies from developers in Figure 4(3)). Unfortunately, a future developer can call the UAD-caused API in a new module without a preceding increasing operation (e.g., module E in our example), which frees the object if the reference counter is one when the API is called. There have been 94 bugs caused by UAD problems.

Figure 4. Future Risks. The bug will be introduced when a new module is added without considering the refcounting operations. Panel (3), Different Reactions from Developers, includes the reply: "still has a reference, so not really going to hit a UAF… it does not read correctly to 'put' a reference then continue using the object…"

5.4.2 Reference Escape Problems. The escape problem has indeed been shown to introduce potential UAF bugs [35,52]. Specifically, as shown in Figure 4(2), if a new assignment makes the reference escape from the current function, it is better to add a corresponding increasing operation around the escape point rather than outside the function. Otherwise, the escaped reference can cause a UAF bug when a new buggy path is added in a new module (e.g., module F in our example). In total, there have been 74 bugs caused by escape problems.
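A minimal sketch of an escape check, under strong assumptions: escape targets (globals or out parameters) are supplied by hand, each statement sits on its own line, and "around the escape point" is approximated by a window of neighboring lines. The C functions foo_cache and foo_cache_fixed are hypothetical; get_device is the real increasing API for device objects.

```python
import re

C_SNIPPET = """\
static struct device *cached_dev;   /* file-scope global */
void foo_cache(struct device *dev)
{
    cached_dev = dev;
}
void foo_cache_fixed(struct device *dev)
{
    get_device(dev);
    cached_dev = dev;
}
"""

ASSIGN_RE = re.compile(r"(\w+(?:->\w+)*)\s*=\s*(\w+)\s*;")


def find_unguarded_escapes(source, escape_targets, get_api, window=1):
    """Report (line, target, pointer) for escapes with no nearby increment."""
    lines = source.splitlines()
    findings = []
    for idx, line in enumerate(lines):
        match = ASSIGN_RE.search(line)
        if not match or match.group(1) not in escape_targets:
            continue
        ptr = match.group(2)
        get_re = re.compile(
            rf"\b{re.escape(get_api)}\s*\(\s*{re.escape(ptr)}\s*\)"
        )
        nearby = lines[max(0, idx - window): idx + window + 1]
        if not any(get_re.search(text) for text in nearby):
            findings.append((idx + 1, match.group(1), ptr))
    return findings


escapes = find_unguarded_escapes(C_SNIPPET, {"cached_dev"}, "get_device")
```

Only the assignment in foo_cache (line 4) is reported; the one in foo_cache_fixed is guarded by the get_device call on the preceding line. Because of refcounting omissions, hits from a check like this still need manual review.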

5.4.3 Lessons and Anti-Patterns. Developers should finish using a reference before decrementing it, and should add the increment around escape points, e.g., where a new reference is stored into a global variable or an out parameter. While several methods have been proposed to solve the escape problem [35,52], completely defeating these bugs remains a challenge because of refcounting optimizations [14,15,27]. Besides, automatically generating PoCs for UAD bugs is also an interesting research direction.
We use the following anti-patterns to describe the bugs that are caused by future risks:
Anti-Pattern 8: $F \to S_{P(p_0)} \to S_{D(p_0)} \to F$.
Anti-Pattern 9: $F \to S_{A} \to F$.
Here, we use $S_{P(p_0)}$ and $S_{D(p_0)}$ to refer to the decrement and the dereference of the same pointer $p_0$. We use $S_{A}$ to indicate an assignment to a global variable or an out parameter, which creates an escaped reference.
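Anti-Pattern 8 can be approximated with a few lines of Python. This is a toy sketch, not the CPG-based checker: it uses source line order as a stand-in for execution order, matches pointers by name only, and does not track aliases. The input is the ping_unhash excerpt from Listing 6, with its elisions kept as they are.

```python
import re

LISTING_6 = """\
void ping_unhash(struct sock *sk) {
    sock_put(sk);
    isk->inet_num = 0;
    isk->inet_sport = 0;
    sock_prot_inuse_add(..., sk->sk_prot, ...);
}
"""


def find_use_after_decrease(source, put_api):
    """Report (pointer, put_line, deref_line) for dereferences after a put."""
    put_re = re.compile(rf"\b{re.escape(put_api)}\s*\(\s*(\w+)\s*\)")
    dropped = {}   # pointer name -> line number of its decrement
    findings = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        # A later "ptr->" on a pointer that was already put is a UAD candidate.
        for ptr, put_line in dropped.items():
            if re.search(rf"\b{re.escape(ptr)}\s*->", line):
                findings.append((ptr, put_line, lineno))
        match = put_re.search(line)
        if match:
            dropped.setdefault(match.group(1), lineno)
    return findings


uad = find_use_after_decrease(LISTING_6, "sock_put")
```

The sketch flags the sk->sk_prot dereference that follows sock_put(sk). The isk accesses are not reported because name matching cannot see that isk may refer to the same object, which shows how much a purely syntactic check leaves out.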
6 Anti-Pattern Instance Detection

Static Checker Based Detection
Driven by the anti-patterns extracted from the root causes, we have designed and implemented a static checker for each anti-pattern. Our checkers will be released at https://github.com/windhl/checkers_sosp23. To build the checkers and apply them to the Linux kernel source code, we use the following three steps.

Lexer Parsing (G, P, $M_{SL}$). First, by extending the existing lexer-parsing tool PLY [43], we implement three lexer parsers to extract refcounting-related structures, refcounting APIs, and macro-defined smartloops. The refcounting-related structures are used to confirm the refcounting APIs, i.e., to check whether a function contains an instance of such a structure and operates on (increases or decreases) its refcounter. The refcounting APIs and smartloops are used to generate the critical semantic operators, i.e., G, P, and $M_{SL}$. Note that the structure parser relies on a threshold to control the parsing depth, because a refcounted object can be embedded in other structures, which can themselves be nested. For example, as introduced in §5.3.4, the refcounted kobject is embedded in the device structure, whose refcounting APIs directly operate on the refcounter of the kobject.

Graph Generation (A, D, U, S, B, F). Second, we transform all kernel source files into information-rich graphs, namely Code Property Graphs (CPGs) [50], with the tool JOERN [29], and use the embedded Abstract Syntax Trees (ASTs) to directly identify the other critical semantic operators and contexts, e.g., A, D, B, F. Besides, we directly use the line numbers embedded in the graph nodes as an imprecise representation of the execution order.

Bug Detection. Finally, we construct the nine static checkers based on the proposed anti-patterns and the related semantic operators and contexts. We then detect new refcounting bugs by applying each static checker to all source files to search for instances of the corresponding anti-pattern.

Table 4. The new refcounting bugs. NPD means null pointer dereference, CFM means confirmed, PR means patch reject, and FP means false positive. More details can be seen in Table 5. Note that we do not count the 5 FP bugs in the total number.
Why not LLVM. We did try LLVM extensively, but failed to compile all architecture-specific and subsystem code, which requires many combinations of compilation flags.

We present the details of the bug distribution in Table 5. We have sent a patch for each of the bugs; at the time of writing, 240 have been confirmed by the developers, 3 have been rejected, and 5 have been proved to be false positives (not included in the total number). For the other 111 bugs, we have received no response.
The second column shows the number of new refcounting bugs in each subsystem. First, we unexpectedly detected more than one hundred new bugs in each of arch and drivers, which together contain 338 new bugs, i.e., about 96% of all new bugs. Within the buggy modules, as shown in Table 5, the bugs also follow a long-tailed distribution, consistent with our Finding 3. Second, we detected a total of 13 bugs in the other subsystems and directories. Note that we found 2 bugs in header files in the include directory. One is in include/linux/hypervisor.h, which involves virtualization. The other is in include/linux/firmware/trusted_foundation.h, which involves the Trusted Foundation of some ARM devices. Besides, we detected two bugs in the net subsystem, which involve the legacy appletalk and the core ping sub-modules. Finally, we detected 9 bugs in the soc module of the sound subsystem.

Security Impacts
The third column of Table 4 shows the security impacts of the new refcounting bugs. Specifically, 296 (84.3%) of the bugs can eventually lead to leaks, 48 (13.7%) to UAF, and 7 (2.0%) to NPD. The distribution of leak and UAF bugs is also consistent with our previous findings. The NPD bugs are caused by increasing APIs that may return a NULL pointer.

Patch Committing
The Status columns present the details of the bug patches. We have sent a patch for each of the bugs. At the time of writing, 240 bugs have been confirmed and their patches have been applied in the mainline. Besides, 111 bugs have not received a reply from the developers, and 3 bugs were rejected because the developers do not consider them real bugs unless we can provide proper PoCs. Finally, 5 bugs have been proved to be false positives.

Listing 6. A Patch Reject Example.
1 // net/ipv4/ping.c
2 void ping_unhash(struct sock *sk) {
3     sock_put(sk);
4     isk->inet_num = 0;
5     isk->inet_sport = 0;
6     sock_prot_inuse_add(..., sk->sk_prot, ...);
7 }

False Positives. The main reason for the false positives of our checkers is that we do not analyze the semantics of specific structures and operations. We use a real-world case, shown in Listing 5, to explain this. First, our checkers identified the refcounting operation in Line 4, i.e., lpfc_bsg_event_ref, as the increasing API for the evt object. Then, our tool identified the replacement of evt in Line 10 and reported a missing-decreasing bug. However, the developers told us that the if condition in Line 8, before the replacement, ensures correctness. Specifically, when execution enters the if block, evt is a NULL pointer and the increasing operation has had no chance to execute.

Patch Rejects. In total, we received three patch rejects. As mentioned above, the main reason is that the developers do not consider them real bugs, e.g., "only not read correctly" (Figure 4(3)). For example, as shown in Listing 6, our checkers detected a UAD bug in the ping_unhash function of net/ipv4/ping.c. Specifically, although the refcounter of the sk object has already been decreased in Line 3, its pointer is still dereferenced in Line 6. While two of the patch rejects are for UAD bugs, the other 3 new UAD bugs have already been confirmed. Generating PoCs for UAD bugs is an important direction for future work.

Potential False Negatives. While we propose anti-patterns for the most common refcounting bugs, we will miss bugs caused by other, more complicated reasons, e.g., race problems.

Lessons From New Bugs
Implementation Deviation Caused Bugs. It is surprising that we can still find a new bug in the mfd module caused by pm_runtime_get_sync. The detection of this bug suggests that new bugs cannot be prevented until developers become aware of the implementation deviation problems. We also detected 7 Return-NULL caused bugs, i.e., the P2 counts in the arch and drivers subsystems. At the time of writing, 3 of them have been confirmed and fixed in the mainline.

Hidden API Caused Bugs. There are 62 bugs caused by hidden increasing or decreasing functions or macros in the arch and drivers subsystems. Specifically, 23 bugs were caused by hidden decreasing functions and 39 by breaking out of a smartloop. We list the related APIs and macros in Appendix A.
Overlooked Location Caused Bugs. There are 24 new bugs caused by the three kinds of overlooked location problems. First, 9 bugs are caused by the error-handling problem. In our detection, we consider two kinds of error-handling locations: one is a premature exit (return) inside a specific if-condition block, and the other is the code reached through error labels. Second, we detected bugs caused by inter-unpaired refcounting APIs. Specifically, we simplify the detection by using our lexer parser to identify all initializations of global variables whose member fields contain paired function pointers, e.g., struct i2c_driver and struct platform_driver. We then detect paired functions by simply matching their names against paired words, e.g., register/unregister, create/destroy, or init/uninit. Next, we check whether the developers added only an increasing API and missed the decreasing one. In this way, we detected 12 new bugs. Finally, 3 new bugs were caused by directly using the kfree function.

Future Risk Caused Bugs. There are 5 new bugs caused by potential deallocation problems. While 3 of them have been confirmed and fixed by the developers, the other 2 have not, as the developers firmly believe there can be no UAF bug in the current version, as shown in Figure 4(3). For the escape problem, we use the AST to detect all assignment statements that involve refcounted objects and search nearby for invocations of an increasing refcounting API. Considering refcounting omissions (optimizations), we manually filtered and confirmed the escape-caused bugs; we finally found 17 new bugs, all of which have been confirmed and fixed in the mainline.
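The name-based pairing step used for inter-unpaired APIs can be approximated as follows. This is a hedged sketch: the paired words are the ones listed above, while the substitution rule and the function names in NAMES are assumptions for illustration.

```python
# Paired words taken from the text; a real list would likely be longer.
PAIRED_WORDS = [("register", "unregister"), ("create", "destroy"), ("init", "uninit")]

# Hypothetical function names collected from operations structures.
NAMES = [
    "foo_register", "foo_unregister",
    "bar_init", "bar_uninit",
    "baz_create", "baz_destroy",
    "qux_init",
]


def match_paired_functions(function_names):
    """Pair functions whose names differ only by a known paired word."""
    names = set(function_names)
    pairs = []
    for name in sorted(names):
        for first, second in PAIRED_WORDS:
            # Skip names that already carry the second word, since
            # "unregister" contains "register" and "uninit" contains "init".
            if second in name or first not in name:
                continue
            partner = name.replace(first, second, 1)
            if partner in names:
                pairs.append((name, partner))
    return pairs


pairs = match_paired_functions(NAMES)
```

Each resulting pair can then be checked for an increasing API in the first function without a decreasing API in the second. Functions such as qux_init, which have no counterpart, are simply left unpaired.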

Related Work
Inconsistent-Refcounting Methods. The key insight of these methods is straightforward [39,49]: whenever there is a refcounter increment, there should be a paired decrement; otherwise, there must be a refcounting bug. Although simple, this approach has detected many refcounting bugs, mainly missing-refcounting bugs. Its main limitation is that there are many refcounting omissions (optimizations), which break the consistency rules. A recent work [27], aiming to diagnose UAF bugs caused by refcounting problems, presents a refcounting omission-aware model and detects refcounting bugs based on dynamic analysis.

Invariant-Analysis Methods. Methods of this kind were first proposed for compiler optimizations [52], aiming to reduce redundant refcounting operations and improve system performance. The basic idea is to guarantee the invariant that the number of escaped references equals the number of increments of the refcounter. Based on this idea, methods such as [26,35,45] report a potential refcounting bug whenever the invariant is violated. However, this strategy is not effective in large-scale programs such as OS kernels, in which many inlined APIs or macro-defined functions break the invariant rule. For example, the recent work [35], which adopts invariant checking, suffers from a high false positive rate (about 60%) when applied to Linux kernels.

Cross-Checking Methods. Cross-checking-based methods [32,38,42] are commonly used to detect many kinds of semantic bugs. In fact, the work [49] also uses cross-checking as its second strategy to detect refcounting bugs. Specifically, when a refcounting operation appears to be missing, these methods cross-check other similar places where a reference to the same kind of object is created or destroyed. They then infer whether there is a refcounting bug based on the common behavior of most cases. While this strategy has also been proven effective in detecting refcounting bugs, it also suffers from many false positives because of refcounting optimizations.

Template-based Methods. While methods of this kind are very effective in detecting shallow refcounting bugs, their lack of deep analysis means they cannot detect inter-unpaired operations in different module functions. Worse, these functions are often called indirectly through function pointers. Besides, such methods usually require much manual work to build meaningful semantic templates, e.g., Coccinelle-based methods or tools [20,23,44]. While a recent work [37] automatically generates semantic templates, it is a general dynamic method that focuses only on semantic failures, not semantic bugs.

Conclusion
We present the first in-depth analysis of refcounting bugs in all modern Linux kernels. We find that the majority of refcounting bugs can have security impacts, leading to leaks or UAF bugs. Besides inferring the root causes, we also propose meaningful anti-patterns, based on which we design and implement static checkers that helped us detect 351 new refcounting bugs, 240 of which have been confirmed. We hope the lessons and the anti-patterns will help motivate future proactive solutions to refcounting problems.

Figure 1. The growth trend of refcounting bugs in Linux kernels from 2005 to 2022.

Figure 2. Distributions of refcounting bugs. The left subfigure presents the number of refcounting bugs in different subsystems and the right one presents the bug density.

Figure 4(3), Different Reactions from Developers: "While in current version there is no use-after-free …, we should better unlock before dropping the reference in…"

Table 1. Semantic templates for the two listed bugs. We use → to mean a potential execution path.
Listing 3. An Intra-Missing Bug Caused by Return-Error.

Table 3. The semantic similarities between the keywords of refcounting API names and bug-caused API names. All results are calculated with word2vec over more than one million commit logs from 2005 to 2022.

Table 5. The details of new bugs. Due to space limitations, we only list the Top-2 bug-caused APIs. [N] means the bug number, P means anti-pattern, NR means no response, and PR means patch reject. * means the number of buggy modules.

Overall, as shown in Table 4, our checkers have detected a total of 351 new refcounting bugs. The new bugs exist in five subsystems or directories: arch, drivers, include, net, and sound.