PP-CSA: Practical Privacy-Preserving Software Call Stack Analysis

A software call stack is a sequence of function calls executed during the runtime of a software program. Software call stack analysis (CSA) is widely used in software engineering to analyze the runtime behavior of software, which can be used to optimize software performance, identify bugs, and profile the software. Despite the benefits of CSA, it has recently come under scrutiny due to concerns about privacy. Today, software is often deployed on user-side devices like mobile phones and smart watches. The collected call stacks may thus contain privacy-sensitive information, such as health information or locations, depending on the software functionality. Leaking such information to third parties may cause serious privacy concerns such as discrimination and targeted advertising. This paper presents PP-CSA, a practical and privacy-preserving CSA framework that can be deployed in real-world scenarios. Our framework leverages local differential privacy (LDP) as a principled privacy guarantee, to mutate the collected call stacks and protect the privacy of individual users. Furthermore, we propose several key design principles and optimizations in the technical pipeline of PP-CSA, including an encoder-decoder scheme to properly enforce LDP over software call stacks, and several client/server-side optimizations to largely improve the efficiency of PP-CSA. Our evaluation over real-world Java and Android programs shows that our privacy-preserving CSA pipeline can achieve high utility and privacy guarantees while maintaining high efficiency. We have released our implementation of PP-CSA as an open-source project at https://github.com/wangzhaoyu07/PP-CSA for results reproducibility. We will provide more detailed documentation to support usage and extension by the community.


INTRODUCTION
A software call stack is an event sequence that records the functions called during the runtime of a software program. When a software program is deployed on an end-user device like a mobile phone, each user interaction with the program yields a call stack, which is then sent to remote servers for analysis. Call stack analysis (CSA) is a widely used technique in software engineering [Glerum et al. 2009]. It facilitates analyzing the runtime behavior of software in order to optimize software performance, identify defects, and profile software [Han et al. 2012; Hao et al. 2021; Jin and Orso 2012]. To date, many software applications are deployed on remote end-user devices such as mobile phones. Hence, collecting user call stacks enables developers to continuously gain insights into how users interact with their software, thereby better optimizing the software's functionality and performance. By gathering call stacks across various daily software usages, developers can perform different analytic tasks and create software that is more user-friendly and efficient.
Although CSA offers several advantages, it has recently faced privacy concerns. Call stacks are essentially event sequences that illustrate the software's runtime behavior. In real-world situations, these behaviors can be proprietary and confidential, holding valuable information about the software users. As aforementioned, real-world CSA is frequently performed over software apps deployed on user-side devices, such as mobile phones or smart watches. Given that on-device scenarios often involve private user data, there is concern that the collected call stacks may reveal a non-trivial amount of user privacy [Hao et al. 2021], such as user health conditions or locations, imposing compliance issues with privacy laws such as the GDPR or CCPA [Goldman 2020; Voigt and Von dem Bussche 2017]. To illustrate the privacy risks of CSA in practice, we study and present a motivating example over real-world Android apps in Sec. 3.
Current Solutions. Existing solutions, although laying a solid foundation for this emerging problem, may not adequately address the needs of privacy-preserving CSA in practice. First, despite a significant volume of contributions from the NLP community that aim to privatize texts under the local differential privacy (LDP) scheme [Habernal 2021; Igamberdiev and Habernal 2023; Krishna et al. 2021], they are not directly applicable to privatizing call stacks. This is primarily because privatized text often manifests lexical and syntactic features distinct from the original text and only retains the semantic meaning. For instance, under the LDP scheme, the text "Nice to meet you" can be privatized as "Good to see you". While this may be generally acceptable for text analysis, it is not the case for call stack analysis. Call stacks, as sequences of function calls, must adhere to the inherent caller-callee associations and the chronological order of function calls. A small perturbation of a call stack may result in a significant change in its semantics, compromising the utility of the privatized call stack.
Second, we are also aware of recent work [Hao et al. 2021] that employs the randomized response technique to protect the privacy of call stacks from individual users. Despite the encouraging results, we note that [Hao et al. 2021] ignores the inherent semantics-level constraints of program call stacks, resulting in a considerable privacy cost while simultaneously compromising accuracy. We will discuss the details and compare our method with theirs from a conceptual perspective in Sec. 3. Accordingly, we present empirical comparisons in Sec. 6.1, illustrating the encouraging performance of our proposed solution.
Our Solution. We present a practical framework, PP-CSA, for privacy-preserving CSA, which offers principled privacy guarantees while maintaining high utility and accuracy for real-world software scenarios. In particular, we design an encoder-decoder pipeline that can privatize call stacks using standard differential privacy (DP) mechanisms (such as the Laplace or Gaussian mechanism). We further propose several key optimizations on both the client and the server sides to improve the performance of PP-CSA. On the client side, we design several call stack compression schemes to reduce the length of call stacks and simplify the privatization process. Moreover, on the server side, we propose several optimizations, including reachability-enhanced training, call graph-guided decoding, and probabilistic distribution decoding, to largely improve the accuracy of the decoded call stacks without undermining the privacy guarantees. Finally, we aggregate the results from the sampled call stacks to obtain the final analysis result, i.e., to identify frequently used call stacks.
Evaluation Highlight. We implement PP-CSA and evaluate it on real-world Java programs and Android applications. Our evaluation shows that PP-CSA can achieve high accuracy and efficiency, while maintaining high privacy under active attackers. PP-CSA consistently outperforms the state-of-the-art solutions [Hao et al. 2021; Igamberdiev and Habernal 2023] across different settings. We conducted evaluations using different privacy budgets (ε = 1, 10, 20, 30, 40, 100). Notably, PP-CSA achieves an average F1 score of 0.816 on the DaCapo dataset and 0.795 on the Android dataset, whereas [Hao et al. 2021] and [Igamberdiev and Habernal 2023] achieve only about 0.623 and 0.503 F1 scores, respectively. We also show that PP-CSA's optimizations are highly effective and have a "synergistic effect"; enabling all optimizations offers the highest accuracy (on average a 9.2% F1 score improvement over disabling all optimizations). When considering adversaries who aim to infer user privacy from the privatized call stacks, attackers can only achieve around 0.2 adversarial uncertainty on accuracy [Hua et al. 2022] under common privacy budgets (e.g., ε = 40), a sufficiently high level of user privacy protection. In sum, our work makes the following contributions:
• This research presents a practical and privacy-preserving CSA framework that can be deployed in real-world scenarios. Our framework offers principled privacy guarantees and is scalable to real-world software call traces.
• Our presented framework, PP-CSA, offers a local differential privacy (LDP) guarantee. PP-CSA features several key optimizations in the technical pipeline, including the LDP mechanism, the trace compression scheme, and several trace decoding schemes. Our optimizations are tailored for CSA and can effectively enhance the performance.
• Our evaluation over real-world Java/Android programs shows that our privacy-preserving CSA pipeline can achieve high accuracy and efficiency, while preserving high privacy to mitigate the privacy risks of CSA.
Artifact Availability.We have released our artifacts at [artifact 2023].We will maintain them for future research comparison and usage.

PRELIMINARY
In this section, we present the necessary background of this research, including the local differential privacy (LDP) scheme and call stack analysis (CSA). For complete details of DP basics, we direct interested readers to relevant resources [Dwork 2006; Dwork et al. 2014].

Local Differential Privacy (LDP)
Differential Privacy (DP) has emerged as a standard framework to ensure privacy guarantees in data-processing algorithms [Dwork 2006; Dwork et al. 2014]. This framework allows for statistical analysis of data while safeguarding the privacy of individuals within the dataset. Importantly, DP provides reliable privacy protections, even when faced with arbitrary side information. Despite its robust privacy safeguards, DP typically presupposes a trusted data curator (e.g., a central server) to execute necessary data analysis and noise addition tasks. This assumption can be limiting in situations where users distrust a centralized authority or when individual privacy protection is paramount. For instance, in CSA, the traces are often collected from user-side devices like mobile phones and sent back to the software vendor's central server for analysis. In such scenarios, the central server is not trusted by the users, let alone trusted to perform the necessary DP protection over user data. To tackle these issues, Local Differential Privacy (LDP) was proposed as a localized approach to DP.
LDP is a prominent technique in data privacy that gives privacy assurances at the level of the individual while permitting correct data aggregation and statistical analysis. Essentially, LDP imposes the addition of random noise to each data point prior to sharing or analysis. Properly chosen random noise not only assures anonymity, but also ensures the accuracy of statistical analysis. Formally, LDP is defined as follows.
Definition 1 (Local Differential Privacy). Let D be a domain of data values, and let R be a domain of responses. A randomized mechanism (or function) M : D → R is said to be (ε, δ)-locally differentially private if for any two inputs v, v′ ∈ D and for all S ⊆ R, the following inequality holds:

Pr[M(v) ∈ S] ≤ e^ε · Pr[M(v′) ∈ S] + δ,

where ε > 0 and 0 ≤ δ < 1.
Here, ε is referred to as the privacy budget, and δ is the failure probability. A smaller value of ε indicates a stronger privacy guarantee. Likewise, a smaller value of δ more tightly bounds the probability that the mechanism violates the pure ε-guarantee, again indicating stronger privacy.
LDP uses a randomized mechanism M to convert a data value v into a response y, such that the probability of obtaining y from v is close to that of obtaining y from a different data value v′. Using this mechanism, adversaries cannot deduce the original data value from the response with high confidence, ensuring indistinguishability. Compared to centralized DP, LDP offers more potent privacy guarantees [Balle et al. 2019] (for the same ε, δ) and assumes a more practical stance towards the data curator.

Sample LDP.
To illustrate the indistinguishability property of LDP, we introduce a simple implementation of LDP known as Randomized Response (RR) [Warner 1965]. RR is a conventional technique in statistics, enabling the collection of sensitive information from respondents while preserving their privacy. Suppose a survey asks a sensitive yes/no question, such as "Have you ever used illegal drugs?" Each respondent flips a coin in secret. If the coin lands on heads, they answer truthfully; if it lands on tails, they flip a second coin in secret and answer "yes" if the second coin lands on heads, and "no" if it lands on tails. The respondent then reports the answer to the surveyor. Although the surveyor cannot determine whether any individual respondent answered truthfully, they can estimate the proportion of respondents who have used illegal drugs. Thus, RR enables statistical analysis of the data while preserving the privacy of individual respondents. It has been proven that RR is a well-formed (log 3, 0)-LDP mechanism [Dwork et al. 2014], meaning its ε = log 3 (a very low privacy budget) and δ = 0.
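The coin-flip procedure above can be sketched in a few lines of Python (an illustrative sketch, not part of PP-CSA; the function names are ours):

```python
import random

def randomized_response(truth: bool) -> bool:
    """Warner's randomized response: answer truthfully on a first-coin heads;
    otherwise report the outcome of a second coin flip."""
    if random.random() < 0.5:        # first coin: heads -> answer truthfully
        return truth
    return random.random() < 0.5     # second coin: heads -> "yes", tails -> "no"

def estimate_true_rate(responses) -> float:
    """Unbiased estimate of the true 'yes' rate p.
    Since Pr[report yes] = 0.5 * p + 0.25, invert: p = 2 * Pr[report yes] - 0.5."""
    observed = sum(responses) / len(responses)
    return 2 * observed - 0.5
```

The mechanism is (log 3, 0)-LDP because the likelihood ratio between any two true answers is at most Pr[yes | truth = yes] / Pr[yes | truth = no] = 0.75 / 0.25 = 3.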
Despite the simplicity and appealing privacy guarantees of RR, it is primarily used for categorical data (e.g., yes/no questions) and not advisable for sequence data (e.g., call stack traces) [Wang et al. 2019]. As a consequence, existing LDP-based CSA [Hao et al. 2021] relies on a large corpus of call stack traces (e.g., 10K) and a high privacy budget (e.g., ε > 100) to achieve plausible utility. Given that call stacks are usually collected during software crashes, expecting a large corpus is not practical, and an excessive privacy budget risks privacy loss.
In addition to RR, another representative approach is the Gaussian Mechanism [Dwork et al. 2014]. We now define L2 sensitivity and the Gaussian Mechanism in further detail.
Definition 2 (L2 Sensitivity and Gaussian Mechanism). For a function f : D → ℝ^d, the L2 sensitivity of f is Δ₂f = max_{v,v′∈D} ‖f(v) − f(v′)‖₂. The Gaussian mechanism perturbs the output of f as M(v) = f(v) + N(0, σ²I_d), where σ = Δ₂f · √(2 ln(1.25/δ)) / ε; this mechanism satisfies (ε, δ)-DP.

The Gaussian mechanism was originally designed for central DP, but it can be extended to LDP by applying f to exactly one data point. Thus, Δ₂f is the L2 sensitivity of f on a single data point. As a result, the noise scale σ is proportional to Δ₂f and 1/ε. And, loosely speaking, Δ₂f grows with the dimension of f's output (i.e., d).
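As a concrete sketch, a client can clip its vector to bound the single-point sensitivity and then add per-dimension Gaussian noise with the classic σ = Δ₂ · √(2 ln(1.25/δ)) / ε calibration from [Dwork et al. 2014] (valid for ε ≤ 1). This is an illustration under our own assumptions, not PP-CSA's exact implementation:

```python
import numpy as np

def local_gaussian_mechanism(vec, epsilon, delta, clip_norm=1.0, rng=None):
    """Gaussian mechanism applied locally to a single client-side vector.
    Clipping bounds the L2 sensitivity: any two clipped inputs differ by
    at most 2 * clip_norm in L2 norm."""
    rng = rng if rng is not None else np.random.default_rng()
    v = np.asarray(vec, dtype=float)
    norm = np.linalg.norm(v)
    if norm > clip_norm:                 # clip to bound sensitivity
        v = v * (clip_norm / norm)
    sensitivity = 2.0 * clip_norm        # Delta_2 over a single data point
    sigma = sensitivity * np.sqrt(2.0 * np.log(1.25 / delta)) / epsilon
    return v + rng.normal(0.0, sigma, size=v.shape)
```

Note how the noise scale grows linearly in the sensitivity and in 1/ε, matching the discussion above.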

Call Stack Analysis (CSA)
CSA is a common concept in software engineering [Glerum et al. 2009]. CSA often considers a software execution trace composed of a sequence of runtime events. To date, CSA has been widely used to optimize software performance, identify bugs, and profile software. For example, the Android ecosystem provides a tool called systrace [developer manual 2023] to collect the call stacks of Android apps. The collected call stacks can be used to identify performance bottlenecks or bugs in Android apps. Similarly, FlowDroid [Arzt et al. 2014], a popular analysis framework for Android apps, uses the collected call stacks to identify information leakage bugs in Android apps.
Depending on the specific CSA tasks, the "runtime events" can be function calls, system calls, or other program runtime behaviors. Aligned with prior works [Hao et al. 2021], this paper considers an important and cornerstone CSA task, i.e., hot call stack analysis, where each event on the trace is a function call. The identified hot traces can be used to optimize software performance, or to identify bugs in the software. Formally,

Definition 3 (Hot Call Stack Analysis). Consider a software distributed over n users, denoted as u₁, · · · , uₙ. For each user uᵢ, let sᵢ represent the call stack when executing the software on his end device. Then, the frequency freq(s) of a call stack s among all users is computed as freq(s) = cnt(s)/n, where cnt(s) denotes the number of s's occurrences among the call stacks collected over all users. Then, given a threshold T, a call stack s is deemed "hot" if freq(s) ≥ T.

It is clear that when sending the local data to the untrusted server, the privacy of each user may be leaked (see an example in Sec. 3). Hence, we aim to enable the clients to perturb their local data and share it with an untrusted analysis server without leaking their privacy or undermining the analysis result.
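Absent any privacy mechanism, the server-side computation of Definition 3 is a simple frequency count; a minimal sketch (the function names in the usage below are hypothetical):

```python
from collections import Counter

def hot_call_stacks(stacks, threshold):
    """Hot call stack analysis (Definition 3): compute freq(s) = cnt(s) / n
    over the n collected stacks and keep those meeting the threshold.
    Each stack is represented as a tuple of function names."""
    n = len(stacks)
    counts = Counter(stacks)
    return {s: c / n for s, c in counts.items() if c / n >= threshold}
```

PP-CSA's goal is to let the server run exactly this aggregate, but over LDP-perturbed reports rather than the raw stacks.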
Hot call stack analysis is not only a common but also a cornerstone CSA task. Moreover, considering this task eases an apples-to-apples comparison with prior works [Hao et al. 2021]. Nevertheless, our proposed framework is general and can be applied to other group statistics-based CSA tasks. We leave the exploration of other CSA tasks as future work. See Sec. 7 for more discussions.
Call Trace vs. Call Stack. We are also aware that call trace analysis [Hao et al. 2021] is a concept related to CSA. In general, a call trace refers to the whole sequence of function calls that occurs during the execution of a program, whereas a call stack contains only the currently active function calls on the stack. In this regard, call traces can be lengthy, and maintaining them during runtime can be resource-intensive. In contrast, call stacks offer a standard and rather concise representation of the currently active function calls; they are practically valuable for debugging and performance optimization purposes, especially in scenarios where software is deployed on client-side devices. Nevertheless, our proposed technique itself is generally applicable to call trace analysis as well.

RESEARCH MOTIVATION
This section introduces the motivation of this research. We begin by presenting several real-world examples that illustrate the privacy concerns of CSA in practice in Sec. 3.1. Furthermore, we provide our formulation of privacy in the context of CSA and explain the necessity of securing CSA in Sec. 3.2. From a technical perspective, we also discuss the limitations of existing solutions in Sec. 3.3.

Real-World Examples of Privacy Concerns in CSA
Collecting user call stacks allows developers to understand how users interact with their apps, thereby optimizing both functionality and performance. By gathering these call stacks, vendors can perform various analytics tasks, leading to more efficient and effective software. However, a significant "side effect" is the potential leakage of sensitive data [Hao et al. 2021].
In real-world scenarios, CSA is frequently performed on software apps deployed on user-side devices like mobile phones. Given that on-device scenarios often involve user data, there is concern that the collected call stacks may reveal a non-trivial amount of sensitive data, such as user health conditions or location, depending on the software's functionality. To illustrate this point, the authors spent manual effort to study a real-world health app collected from the Google Play Store. The analyzed app is a widely used mobile application designed to support and empower patients throughout their treatment and recovery process. It allows users to log basic health records and vital signs, and also offers a feature to swiftly share this health data with their doctors. We collected this app and performed a manual analysis to identify potential privacy concerns after first reverse engineering the app code. In particular, upon exploring the app, we identified two particular call stacks that could potentially leak user health information. These two cases are detailed in Listings 1 and 2.

Listing 1. Code for medicine reminders (simplified to ease understanding).
Our first observed concern arises from the medicine reminder feature. As illustrated in Listing 1, when the designated time for taking medicine approaches, the onReceive function of the AlarmReceiver class is activated. This function, in turn, invokes the showNotification method, which is in charge of displaying the medicine reminder. Within showNotification, the function fetchMedicineInfo is called to pull the medicine record. Thus, if the call stack successively lists onReceive, showNotification, and fetchMedicineInfo, it hints that the user might have a chronic illness and is on regular medication, possibly exposing sensitive health details. When seeing such a call stack, the server can infer that the user is taking medicine, and may even be able to infer the specific medicine that the user is taking, thus resulting in the leakage of sensitive health information.

Listing 2. Code for medical appointment (simplified to ease understanding).
Similarly, we observe another privacy issue tied to the management of medical appointments. As depicted in Listing 2, the AppointmentActivity class oversees the creation of new doctor appointments. When a user chooses to schedule an appointment, the createAppointment function is triggered. This function then prompts the onCreate method of the AppointmentEditActivity class. As a result, when createAppointment and onCreate consecutively appear in the call stack, it evidently implies that the user is arranging a new doctor's appointment. When collecting call stacks like this, the server can infer that the user is scheduling a new doctor's appointment, and may even be able to infer the specific doctor that the user is visiting. This presumably results in the leakage of sensitive health information, and also the user's location information, to the server side.

Formulation and Clarification on "Privacy" in the CSA Context
In this section, we present a formulation and clarification of the notion of "privacy" in the context of CSA. We begin by presenting the threat model in our study and introducing the privacy guarantee provided by PP-CSA. Subsequently, we clarify the necessity of securing CSA to further motivate our research. Moreover, we briefly discuss other privacy techniques that may be applicable in this context and explain the distinct privacy guarantee provided by PP-CSA.
Threat Model. In this study, our goal is to address the privacy concerns highlighted by the examples presented in Sec. 3.1. Our threat model is aligned with previous research in this field [Hao et al. 2021; Zhang et al. 2020a,b], and this work does not make unrealistic assumptions. In particular, the threat model shared by ours and previous works considers a software system with a client-server architecture, where the client is a user's end device (e.g., a mobile phone) and the server is a remote entity responsible for collecting and analyzing call stacks from the client. It is assumed that the server is honest-but-curious, i.e., the server will follow the protocol but may try to infer the user's private information (e.g., his health information) from the LDP-protected call stack data submitted by the client. It is also assumed that the client is trusted, i.e., the client will follow the protocol and submit the LDP-perturbed data to the server.
Privacy Guarantee of PP-CSA. Our proposed solution, PP-CSA, ensures that the central server cannot infer the sensitive attributes of a specific user (e.g., whether a specific user is taking medicine) with high confidence, while still being able to conduct various normal statistical analyses over the collected call stacks, e.g., for the purpose of software optimization. This guarantee is rigorously provided by LDP, with the degree of privacy assurance quantified by the privacy budget ε, as noted in Sec. 2.1. However, we cannot prevent the server from inferring statistical facts across all users, such as nearly 60% of users regularly taking medicine; this is not the focus of our study. It is also important to note that the unit of privacy in our framework primarily focuses on instance-level differential privacy, where each user is expected to report only one call stack. However, our work can be easily extended to support user-level privacy, where the server makes repeated queries to the client to collect a sequence of call stacks. See Sec. 7 for more details.
Despite the potential privacy risks from call stacks, we treat other possible attacks as beyond our research scope. For example, following the standard client-server model, it is easy for the client and server to mutually authenticate each other (thus avoiding man-in-the-middle attacks), and all communication between the client and server is encrypted. All these operations are widely supported by existing secure communication protocols, such as HTTPS and TLS. As a result, data confidentiality, integrity, and authenticity can be guaranteed, and our focus is on facilitating the privacy guarantee of the call stacks.
Why is Securing CSA Particularly Important? Given the aforementioned privacy guarantee, careful readers might challenge the necessity of securing CSA, especially in an enterprise setting.
① First, one may question that when a user installs a health-related app, the vendor may already have a prior belief that the user may be taking medicine or scheduling doctor appointments. However, we clarify that merely installing health-related apps does not necessarily indicate that a user has a specific issue. In contrast, the patterns observed in call stacks related to doctor appointments and medication usage can provide valuable and concrete information, leaking highly sensitive data about users' health conditions.
② Also, one may challenge that the vendor may already be aware of the health conditions of its users, especially if all doctor appointments are coordinated through the vendor. Nevertheless, we underscore that transactional data (e.g., doctor appointment records) is often stringently protected to meet compliance standards. Thus, it is very unlikely that the vendor can access the transactional data in real-world scenarios. In contrast, software call stacks, typically stored as logs, are pervasively collected and analyzed by vendors. They are often considered less sensitive than transactional data, and thus their privacy implications tend to be overlooked. In reality, the permissions for these log data are often more permissive than those for transactional data, aiming to facilitate the debugging process for developers. This renders call stacks more susceptible, underscoring the imperative of securing CSA.
Other Privacy-Enhancing Techniques. To fully protect call stack information, other advanced techniques like secure multi-party computation (MPC) [Yao 1982] or homomorphic encryption (HE) [Gentry 2009] may be required. These techniques focus on completely hiding the data from the server while still allowing it to perform computations. However, they are often not practical for real-world scenarios due to their high computational overhead, and thus are not considered in this study. We leave exploring low-cost MPC or HE solutions in the context of program analysis as future work. Sec. 8 also reviews relevant works in this direction. Additionally, anonymous transmission services, such as the Tor network [Dingledine et al. 2004], can also be helpful in this context. These services specialize in anonymizing user identities and offer an additional layer of privacy protection. However, it is important to note that the privacy guarantees provided by anonymous transmission services are distinct from those of our work.
Attackers with background knowledge can still deduce sensitive information from anonymized data (e.g., de-anonymization attacks on Tor [Kwon et al. 2015]) and undermine privacy. In contrast, the DP-based approach in PP-CSA offers a principled and distinct way to limit the abilities of such attackers and safeguard against these types of adversarial inferences [Yang et al. 2012]. By leveraging DP, PP-CSA offers a distinct privacy guarantee that cannot be achieved through anonymization alone. Furthermore, due to their different primary security objectives, integrating PP-CSA with anonymous transmission services can yield stronger privacy protection. We discuss this potential integration in Sec. 7.

Limitations of Existing DP-Based Solutions
Prior works [Hao et al. 2021; Zhang et al. 2020a,b] have proposed using DP to protect the privacy of individual users in software analytics. In particular, the state-of-the-art work [Hao et al. 2021] encodes each call stack via a hash function (e.g., SHA-256) into a DP-friendly fixed-length binary vector, on which the randomized response (RR) mechanism is applied by randomly flipping some bits. Then, on the server side, the noisy reports are aggregated to yield an estimation of the frequency of each call stack, identifying the "hot traces." Also, in the NLP community, there are emerging attempts [Habernal 2021; Igamberdiev and Habernal 2023; Krishna et al. 2021] to apply DP to protect the privacy of texts. In essence, these works employ an encoder to map the text into a fixed-length numerical vector, to each dimension of which Laplace or Gaussian noise is added. Then, the noisy vectors are passed to a decoder to denoise and reconstruct the original text. Usually, pre-trained language models (e.g., BERT-family models [Kenton and Toutanova 2019; Lewis et al. 2020]) are used to initialize the encoder and decoder.
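The hash-and-flip scheme can be made concrete with a toy sketch (our own simplification; [Hao et al. 2021] additionally uses count sketches, which we omit here). Note that once any bit of the encoding flips, the noisy report will almost never equal the encoding of any legitimate call stack:

```python
import hashlib
import random

N_BITS = 32

def encode(stack):
    """Hash a call stack (tuple of function names) to a fixed-length bit
    vector via SHA-256, truncated to N_BITS for illustration."""
    digest = hashlib.sha256("|".join(stack).encode()).digest()
    value = int.from_bytes(digest[:4], "big")
    return [(value >> i) & 1 for i in range(N_BITS)]

def perturb(bits, flip_prob):
    """Randomized response over the encoding: flip each bit independently."""
    return [b ^ int(random.random() < flip_prob) for b in bits]
```

Because the hash is not invertible over the space of valid call stacks, a perturbed encoding typically has no preimage, which is the root of the accuracy issue discussed next.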
Despite the encouraging results of prior works, we clarify that the existing solutions are not practical for real-world scenarios for two reasons: ➊ adherence to the caller-callee relationship, and ➋ awareness of epistemic uncertainty.

Adherence to Caller-Callee Relations. It is worth noting that, unlike free-form texts, call stacks in our context consist of a sequence of function calls, which have to adhere to the legitimate caller-callee relationships of the program. The randomized response mechanism (as well as the count sketch technique used in [Hao et al. 2021]) can asymptotically recover the frequency only when the noisy report also represents a valid item. Given the prior knowledge that the frequency of invalid call stacks is strictly zero, it becomes questionable whether the technique used in [Hao et al. 2021] can provide an accurate estimation of call stack frequencies. To illustrate this issue, consider a call stack s that yields an encoding e via a hash function h. Then, the randomized response mechanism is applied to e to yield a noisy encoding e′. However, there may exist no call stack s′ that yields e′ via h. As a consequence, the occurrence of s vanishes in the aggregated report, leading to an underestimation of the frequency of s.

Awareness of Epistemic Uncertainty. In addition, we argue that, when applying DP to protect the privacy of call stacks, it is important to be aware of the epistemic uncertainty in the decoding process. In particular, the standard text privatization pipeline [Igamberdiev and Habernal 2023] usually reconstructs the text from the noisy vector by simply maximizing the likelihood of the text. However, this approach does not consider the epistemic uncertainty in the decoding process, which may lead to a significant deviation from the original text. In particular, the true call stack does not necessarily correspond to the MLE (maximum likelihood estimate) over the noisy report. In contrast, the noisy report represents a distribution of possible
call stacks, and the true call stack is a sample from the distribution.As shown in Figure 1, when merely taking a "point estimation" of the noisy report, we may end up with a call stack that is di erent from the true one.We present the design overview of PP-CSA in Figure 2. Overall, PP-CSA features both o line and online phases to achieve privacy-preserving CSA.During the o line phase (i.e., before clientside deployment), the developer constructs the call graph of the software of interest, instantiates rules to shrink the call stacks, and trains an encoder-decoder model to enforce LDP (details in Sec.4.1).Then, the software, together with the trained model and the rules, is deployed to the client side.During the online phase, when users execute the software, the client side collects the correspondingly logged call stacks.Note that common software ecosystems like Java and Android have already provided such logging functionality, which can be easily enabled by developers.Then, the rules obtained o ine are rst applied to shrink the call stacks (details in Sec.4.1).After that, the encoder model is applied to convert the shrunken call stacks into an LDP-protected representation.The client side then submits the LDP-protected representation to the server side, which is later decoded to recover call stacks.Finally, the server side performs CSA on the call stacks aggregated from multiple users to decide the "hot call stacks".We propose several optimizations to largely improve the e ciency of PP-CSA at this step without sacri cing the privacy guarantees (details in Sec.4.2).Our empirical evaluation, as will be reported in Sec.6, shows that PP-CSA can achieve high utility and privacy guarantees while maintaining high e ciency.Before elaborating on the technical details, we rst discuss the application scope and assumptions below.

DESIGN OF PP-CSA
Application Scope. PP-CSA is a general-purpose framework for privacy-preserving CSA. Its technical pipeline is orthogonal to particular software implementations or programming languages, and is thus applicable to a wide range of call stacks and software. Our observation shows that real-world software of varying purposes (e.g., medical or financial apps) may result in privacy concerns, and PP-CSA is applicable to all these scenarios. Aligned with previous works [Hao et al. 2021], we focus on CSA, particularly hot stack analysis, of Java and Android programs, given their real-world popularity. Also, using such tasks and languages eases an apples-to-apples comparison with existing works. However, our framework can be easily extended to other scenarios, including different programming languages and many CSA tasks (see our discussion on extensibility in Sec. 7 for more details). Having said that, we note that PP-CSA is not suitable for CSA tasks that require the analysis of individual properties. For example, if the goal is to identify call stacks associated with uncommonly-occurring bugs [Ko and Myers 2008], which requires precisely pinpointing those rare yet buggy call stacks, then PP-CSA is not applicable. As discussed in Sec. 2.1, the fundamental principle of differential privacy is to impede adversarial inferences regarding individual properties, while still providing accurate estimations of group statistics.

PP-CSA: Offline Analysis
PP-CSA starts by extracting the call graph of the software under analysis. The call graph is a directed graph that represents the calling relations between functions in the software. Each node in the call graph represents a function, and each edge represents a calling relation between two functions. Formally, the call graph is defined as follows.
Definition 4 (Call Graph). A call graph is a directed graph G = (V, E), where V is the set of nodes (representing all functions) and E is the set of edges (the calling relationships between pairs of functions). We denote the set of entry functions of the call graph as V_entry ⊆ V.
Given that the current prototype of PP-CSA is implemented for Java and Android programs, we use Soot [OSS 2023] with the Class Hierarchy Analysis (CHA) algorithm to soundly build the call graph.
As already noted in Sec. 2.2, our focused task employs call stacks to encode program runtime information and facilitate identifying hot call stacks. Formally, a call stack is defined as follows.
Definition 5 (Call Stack). A call stack is a stack data structure s = (f₁, f₂, . . ., fₙ), where each fᵢ is a function involved in the current execution, f₁ ∈ V_entry is an entry function of the program, acting as the starting point for the current execution, and each adjacent pair (fᵢ, fᵢ₊₁) denotes that function fᵢ calls fᵢ₊₁ during this execution.
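Definition 5 can be checked mechanically. The sketch below is illustrative only; the dictionary-based call-graph representation and all function names are our own assumptions, not PP-CSA's implementation. It validates that a candidate stack starts at an entry function and follows only call edges:

```python
def is_valid_stack(stack, callees, entries):
    """Definition 5: the stack starts at an entry function and every
    adjacent pair (f_i, f_{i+1}) is an edge of the call graph."""
    if not stack or stack[0] not in entries:
        return False
    return all(stack[i + 1] in callees.get(stack[i], set())
               for i in range(len(stack) - 1))
```

For example, with callees = {"main": {"parse"}, "parse": {"read"}} and entries = {"main"}, the stack ["main", "parse", "read"] is valid, while ["parse", "read"] (no entry) and ["main", "read"] (no edge) are not.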
Shrinking Optimizations. One challenge of CSA is that call stacks can be very long, which is particularly undesirable for privacy-preserving analysis. In general, the longer the call stack, the more challenging it becomes to safeguard its privacy while preserving utility; long stacks also impose more difficulty in training the encoder/decoder. To address this challenge, we propose two shrinking rules to reduce the size of call stacks as follows.
R 1 Pre-determinable Function Compression. Intuitively, there may exist redundancy in the call stacks. For example, if function g can only be called by function f, then the record (f, g) in a call stack is redundant, as it can be deduced from the single record (g) (f being the only legitimate caller of g). More generally, if there exists exactly one path between two functions in the call graph, then the intermediate functions are redundant on the trace. Therefore, we identify these pre-determinable function pairs and compress the call stacks by removing their intermediate functions.
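To make R 1 concrete, here is a minimal sketch of our own, covering only the unique-caller special case of the rule (the `callers` dictionary, mapping each function to the set of its callers in the call graph, is an assumed representation):

```python
def compress(stack, callers):
    """Drop f_i when it is the unique caller of f_{i+1}:
    f_i can later be re-derived from f_{i+1} and the call graph."""
    out = []
    for i, fn in enumerate(stack):
        nxt = stack[i + 1] if i + 1 < len(stack) else None
        if nxt is not None and callers.get(nxt) == {fn}:
            continue  # fn is deducible from nxt; omit it
        out.append(fn)
    return out

def decompress(stack, callers):
    """Re-insert the unique-caller chain in front of each retained function."""
    out = []
    for fn in stack:
        chain, cur = [], fn
        while True:
            cs = callers.get(cur, set())
            if len(cs) != 1:
                break           # ambiguous or entry function: stop
            (parent,) = cs
            if out and out[-1] == parent:
                break           # the caller is already on the stack
            chain.append(parent)
            cur = parent
        out.extend(reversed(chain))
        out.append(fn)
    return out
```

With callers = {"b": {"a"}, "c": {"b"}, "d": {"b", "c"}}, the stack (a, b, c, d) compresses to (c, d), and decompression recovers (a, b, c, d).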
R 2 Lengthy Stack Cutoff. Our preliminary study reveals a "long-tail" phenomenon in the call stacks of a program: often, when profiling a real-world Android program with its common workload, a small number of call stacks are exceptionally lengthy. For example, the average call stack length for Eclipse (one of the Java benchmarks used in evaluation) is 6.32, while the longest call stack contains 24 calls. As one can expect, lengthy call stacks are particularly challenging for privacy-preserving analysis given the exponential growth of the domain of all possible call stacks, making encoder/decoder training generally harder. On the other hand, we note that, when considering the specific task of identifying hot stacks, those long and rare call stacks are unlikely to be "hot." We propose to prune those long call stacks to improve the overall performance of PP-CSA. Formally, we define the pruning strategy as follows. Let h be the user-specified frequency of hot call stacks and let F(x) be the cumulative distribution function of the call stack length. Then, we define a pruning threshold τ as the inverse of the CDF: τ = F⁻¹(1 − h), where F⁻¹ is the inverse function of the cumulative distribution function F(x). This threshold retains the call stacks with lengths below τ, among which all call stacks of frequency at least h must reside, while the exceptionally lengthy and rare call stacks above τ are trimmed off.
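As a concrete illustration of R 2, the threshold τ = F⁻¹(1 − h) can be computed from the empirical length distribution (a minimal sketch under our own assumptions: lengths are given as a plain list, and the empirical (1 − h)-quantile stands in for the inverse CDF):

```python
import math

def pruning_threshold(lengths, h):
    """tau = F^{-1}(1 - h): the empirical (1 - h)-quantile of stack lengths."""
    s = sorted(lengths)
    k = math.ceil((1 - h) * len(s)) - 1  # index of the (1 - h)-quantile
    return s[max(k, 0)]
```

For instance, with 100 stacks of lengths 1 through 100 and h = 0.05, the threshold is 95; stacks longer than that are trimmed off.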
Clarification. It is evident that both R 1 and R 2 are used in the online phase (the "shrink" operator in Figure 2), when the client decides to send a call stack to the server. Nevertheless, we present them here as the information required to apply these rules is obtained in advance during the offline phase. Also, to ease presentation, we defer the encoder/decoder design and training to Sec. 4.2, although they are also performed in the offline phase.

Online Call Trace Protection and Analysis
Sec. 4.1 describes a one-off preprocessing step to establish the call graph and decide which call stacks to prune or directly discard. Those preparation steps are performed on the server side and can be done in a one-off manner. In line with Figure 2, this section presents the online activities of PP-CSA, which consist of call stack encoding and decoding, LDP perturbation, and call stack reconstruction and analysis.
4.2.1 Encoder/Decoder Design and LDP Enforcement. At this step, we seek to encode the call stacks into numerical representations to enable the enforcement of LDP. Often, the baseline approach is to use one-hot encoding or hash functions [Hao et al. 2021], thus converting the call stacks into fixed-length bit vectors. Despite its simplicity, this approach is undesirable given that call stacks are ordered by nature and are often lengthy and complex. The resulting encoding would be very sparse and would not achieve high utility.
At this step, we propose to protect call stacks using a sequence-to-sequence encoder-decoder scheme. Overall, an encoder model on the client side converts a call stack into a sequence of numerical vectors over which LDP is later enforced. When the server receives the encoded call stack, it employs a decoder model to decode the LDP-perturbed sequence of numerical vectors back into a call stack. Below, we describe the design of the encoder and decoder in detail.
Architecture Design. The encoder-decoder model plays a pivotal role in PP-CSA. Given a call stack s = (f₁, f₂, . . ., fₙ), after being shrunk by the shrinking rules (as shown in Figure 2), the encoder Enc on the client side computes a latent numerical vector representation z for the call stack, to which the LDP noise is later added (see details below). This noise-added representation ẑ is then passed to the server side. Subsequently, ẑ, together with auxiliary reachability features (see Sec. 4.3), is fed to the decoder Dec, which reconstructs the call stack ŝ on the server side.
In essence, we employ an LSTM-based encoder-decoder model [Hochreiter and Schmidhuber 1997], where both the encoder and decoder are uni-directional LSTM models. The encoder takes the call stack as input and outputs a context vector. Then, this numerical vector is clipped by norm, and noise is added to the clipped vector. To enhance the decoder's accuracy, we accompany the decoder with several optimizations, as will be discussed in Sec. 4.3.
Objective Function. The objective function of the encoder-decoder model is the negative log-likelihood of the call stack, defined as follows:

L = − ∑ᵢ₌₁ᴺ log pᵢ(fᵢ),

where N is the length of the call stack, fᵢ is the i-th function name in the call stack, and pᵢ(fᵢ) is the probability of the i-th function name predicted by the decoder. All call stacks are padded to the same length with a special token. Note that N, representing the maximum length of a call stack handled by PP-CSA, is decided by R 2 in Sec. 4.1.
Clipping by Norm. As stated in Sec. 2.1, the sensitivity of the function is a crucial parameter in determining the appropriate scale of noise to be added. However, the sensitivity of the encoder function Enc can be potentially infinite, because the encoder's output is unbounded, leading to unbounded changes in its output. As a common tactic to address this challenge, we clip the output of the encoder function Enc, namely a latent vector z, by its norm and a clipping constant C ∈ ℝ.
The clipping-by-norm operation is defined as follows:

z′ = z · min(1, C / ‖z‖₂).

This way, the sensitivity of the encoder function Enc is bounded via its norm and the clipping constant, and the Gaussian mechanism can be applied.
LDP Enforcement. With this clipping mechanism in place, we can now calculate its sensitivity in order to determine the scale of noise to add in the LDP setting. This is outlined in Theorem 2.
Theorem 2. Let f : ℝᵈ → ℝᵈ be the function defined in Eqn. 3. The ℓ₂ sensitivity Δ₂ of this function is 2C [Krishna et al. 2021], where C ∈ ℝ, C > 0, is the clipping constant and d ∈ ℕ is the dimensionality of the vector.
The noise is then added to this clipped vector, as in Eqn. 4:

ẑ = z′ + η,

where z′ is the clipped vector and η is a noise vector of the same dimensionality. Each noise element ηᵢ is drawn i.i.d. from the Gaussian distribution N(0, 2 ln(1.25/δ) · Δ₂²/ε²) according to Theorem 1. In sum, we present the whole process that occurs on the client side in Alg. 1. Following the standard procedure [Dwork et al. 2014], we note that the encoder part of PP-CSA satisfies (ε, δ)-differential privacy for the Gaussian mechanism. Subsequently, the encoder output is transmitted to the decoder for decoding, where the reconstruction of call stacks takes place. Notably, the entire PP-CSA pipeline ensures differential privacy due to the post-processing invariance of DP [Dwork 2006]. This essential property guarantees that even though the encoder output is used as input for further computation or analysis, the overall privacy guarantees remain intact, safeguarding the privacy of individuals in the dataset.
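The client-side clip-then-perturb step of Alg. 1 can be sketched as follows (an illustrative NumPy rendition rather than the paper's implementation; function names are ours, and Δ₂ = 2C follows Theorem 2):

```python
import numpy as np

def clip_by_norm(z, C):
    """Eqn. 3: scale z so that its l2 norm is at most C."""
    norm = np.linalg.norm(z)
    return z if norm <= C else z * (C / norm)

def privatize(z, C, eps, delta, rng):
    """Eqn. 4: z_hat = z' + eta, with eta ~ N(0, sigma^2 I) and
    sigma = Delta_2 * sqrt(2 ln(1.25/delta)) / eps, where Delta_2 = 2C."""
    z_clipped = clip_by_norm(np.asarray(z, dtype=float), C)
    sigma = (2.0 * C) * np.sqrt(2.0 * np.log(1.25 / delta)) / eps
    return z_clipped + rng.normal(0.0, sigma, size=z_clipped.shape)
```

Only the noisy vector returned by `privatize` leaves the client; the decoder's subsequent work on it is post-processing and thus preserves the (ε, δ) guarantee.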

Server-Side Optimizations
Our tentative study shows that the vanilla decoder design, although already achieving encouraging results, may still be inaccurate due to distinct characteristics between program call traces and other common sequence data, such as natural language (see the empirical comparison in Sec. 6.2). Enforcing differential privacy over program call traces, which have strict semantics, is more challenging. In this section, we describe three optimizations (O 1 , O 2 , and O 3 ) in the decoder phase which improve the overall performance without introducing much additional cost. In particular, O 1 and O 2 are designed to improve the adherence to caller-callee relationships, while O 3 aims at increasing the awareness of epistemic uncertainty, as discussed in Sec. 3.3. We clarify that these optimizations can be applied together or separately, depending on the specific scenario. Nevertheless, our evaluations show that the combination of these optimizations achieves a synergistic effect and yields the best performance (see Sec. 6).
O 1 Reachability-enhanced Training. Enlightened by the algebraic properties of the graph adjacency matrix [De la Cruz Cabrera et al. 2019], we propose to incorporate the reachability information of the call graph into the underlying encoder-decoder model. This optimization aims to enhance the model's ability to capture nuances of the program's reachability dynamics, which illustrate how an individual function call triggers subsequent ones during the execution flow. This ability allows the model to better adhere to the caller-callee relationship, as introduced in Sec. 3.3. To quantify reachability, we employ a reachability matrix R ∈ ℝᴺˣᴺ, derived from the matrix exponential of the input adjacency matrix A. The matrix exponential, denoted as exp(A), is a mathematical operation that extends the concept of exponentiation to square matrices. Precisely, for any given matrix A, its matrix exponential exp(A) can be computed via its Taylor-series expansion:

exp(A) = ∑ₖ₌₀^∞ Aᵏ / k!.

From an intuitive perspective, Rᵢⱼ signifies the connectivity from function i to function j within the call graph. Considering any particular function i, the associated reachability-informed feature rᵢ is the respective row vector of the reachability matrix, as defined below:

rᵢ = Rᵢ,: .

By integrating the row vector rᵢ as an auxiliary feature, the model can obtain insight about the reachability of the current function, thereby facilitating its understanding of likely subsequent function calls. The optimization effectively improves the accuracy of the decoding process, as shown in Sec. 6.
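The reachability features of O 1 can be sketched as follows (our own minimal rendition; a truncated Taylor series stands in for a full matrix-exponential routine):

```python
import numpy as np

def reachability_matrix(A, terms=10):
    """R = exp(A), approximated by the Taylor series sum_k A^k / k!."""
    R, term = np.eye(A.shape[0]), np.eye(A.shape[0])
    for k in range(1, terms + 1):
        term = term @ A / k   # A^k / k!, built incrementally
        R = R + term
    return R

def reachability_feature(R, i):
    """The auxiliary feature for function i is row i of R."""
    return R[i]
```

For the chain a → b → c (adjacency A with A[0][1] = A[1][2] = 1), A is nilpotent, so the truncated series is exact: row 0 of R has positive entries for both b and c, while row 2 reaches nothing.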
O 2 Call Graph-Guided Decoding. In the common usage of a decoder, the decoder outputs the sequence with the highest likelihood. However, a standard decoding scheme for natural language offers no guarantee that the decoded sequence is a valid call stack (i.e., one adhering to the caller-callee relationship, as noted in Sec. 3.3). As an illustration, given a call graph G = (V, E), if the decoder produces a call stack (f₁, f₂, . . ., fₙ) in which some (fᵢ, fᵢ₊₁) ∉ E, the decoder-yielded call stack is not legitimate. To tackle this hurdle, we propose to incorporate the call graph information into the decoding process. In particular, the optimization employs the call graph as an enforced constraint during the decoding phase. Specifically, we only allow the decoder to output a call stack that satisfies the call graph constraints. This way, we ensure that the decoded call stack is valid and realistic to a great extent.
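O 2 can be realized by masking the decoder's next-function scores with the call graph. The dictionary-based sketch below is our own illustration, independent of the LSTM decoder's actual tensor layout:

```python
def mask_invalid_successors(scores, current_fn, callees):
    """Keep scores only for functions callable from current_fn;
    everything else gets -inf so it can never be decoded next."""
    allowed = callees.get(current_fn, set())
    return {fn: (s if fn in allowed else float("-inf"))
            for fn, s in scores.items()}
```

With callees = {"a": {"b"}}, a score table {"b": 0.3, "c": 0.9} becomes {"b": 0.3, "c": -inf}: the illegal successor "c" is pruned even though its raw score is higher.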
O 3 Probabilistic Distributional Decoding. As discussed in Sec. 3.3, for conventional generative models, the decoder aims to produce a "point estimation" of the most likely result [Krishna et al. 2021]. However, in our specific scenario, where differential privacy mechanisms are enforced, an inherent epistemic uncertainty is introduced. As a result, the decoded call stack, as a maximum likelihood estimation, might not faithfully reflect the true call stack. To circumvent this issue, we adopt a "probabilistic distributional decoding" scheme to enable a more holistic estimation of the call stack distribution. Specifically, in the decoder, we sample a collection of outcomes along with their associated probabilities using beam search and reconstruct the empirical distribution accordingly. Subsequently, by aggregating these distribution estimations from a variety of clients, we can approximate the global call stack distribution, thereby catering to the needs of real-world applications. This aggregation process contributes to more robust and reliable insights into the comprehensive estimation of function call stacks.
In sum, we outline the decoding procedure that occurs on the server side in Alg. 2. The decoder receives the encoded call stack s_enc from the encoder as input and produces a set of decoded call stacks S_dec along with their associated probabilities. The decoding process is implemented using beam search, a widely used technique for sequence generation tasks. Specifically, we maintain a priority queue to store potential call stacks during the beam search (line 2). The priority of each call stack is determined by its probability score, guiding the exploration towards more promising sequences.
We initialize the priority queue with the start token, representing the initial state (line 1). The decoder then iteratively processes the current function and its corresponding hidden state, along with the reachability feature, to calculate the probability distribution of the next function (Optimization O 1 , line 9). Next, we expand the call stack by considering all possible next functions reachable from the current function, as dictated by the call graph (Optimization O 2 , lines 10-14). The search terminates once the number of valid finished call stacks reaches k, indicating that we have found the k most probable call stacks. Finally, we normalize the probability of each call stack using the softmax, providing a probabilistic distributional estimation of the input call stack (Optimization O 3 , lines 16-21). We omit the final computation of "hot traces" given that it is a straightforward process and orthogonal to Alg. 2.
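The omitted final step, aggregating the per-client distributions into the hot-trace report, can be sketched as follows (an illustration under our own assumptions: each client reports a dict mapping decoded stacks to probabilities, and hot stacks are those whose averaged mass reaches the threshold h):

```python
from collections import defaultdict

def aggregate_hot_stacks(client_dists, h):
    """Average the per-client stack distributions, then report
    the stacks whose estimated global frequency is at least h."""
    total = defaultdict(float)
    for dist in client_dists:
        for stack, p in dist.items():
            total[stack] += p
    n = len(client_dists)
    return {stack for stack, p in total.items() if p / n >= h}
```

For example, two clients reporting {("a","b"): 0.8, ("a","c"): 0.2} and {("a","b"): 0.6, ("a","c"): 0.4} yield the single hot stack ("a","b") at h = 0.5.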

IMPLEMENTATION AND EVALUATION SETUP
PP-CSA is implemented in Python and Java, with a total of 5,280 lines of code. The Python code implements the encoder and decoder, and the Java code employs the Soot framework to generate the Java datasets from the DaCapo benchmark (version 9.10.1). PyTorch (version 2.0.1) is used to implement the encoder/decoder. We have released a snapshot of our codebase at [artifact 2023] for results reproducibility. We will maintain it and provide further documentation to ease future research comparison and extension.

Hyper-parameters
To clarify, we do not heavily tune the hyper-parameters in our implementation, as we aim to show the general feasibility of PP-CSA in real-world scenarios. For the encoder/decoder, we use a single-layer LSTM with 128 hidden units. The embedding size is 300, the batch size is 64, and the learning rate is 0.01. The maximum number of epochs is 10. The sample size in the decoding phase is 3. For LDP, we use the Gaussian mechanism with a clipping norm of 5. As a rule of thumb [Ponomareva et al. 2023], we set δ as the inverse of the data size in our experiments. We use a threshold h = 0.01 to identify hot traces (see empirical impacts of this threshold in Sec. 7).
Overall, these hyper-parameters are common choices, and with these settings, our evaluations already report promising results.Users may further tune the hyper-parameters to achieve better results if needed.

Evaluation Setup
Baselines. We compare PP-CSA with the state-of-the-art tool [Hao et al. 2021], which is the only existing tool that can perform CSA with privacy guarantees. To ease presentation, we refer to this tool as SOTA in the rest of the paper. From another perspective, we also compare PP-CSA with DP-BART [Igamberdiev and Habernal 2023], the state-of-the-art DP-based text rewriting system.

Datasets. The current PP-CSA is implemented primarily for Java and Android programs. To deliver a fair comparison with SOTA, we reuse the Android apps used in SOTA for evaluation. This dataset, referred to as Android, contains 15 real-world Android apps of various sizes and functionalities, such as games, location tracking, news, and music. It has been shown in [Hao et al. 2021] that these real-world Android apps suffer from considerable privacy risks when collecting call stack traces. We also use a common Java dataset, the DaCapo Benchmark [Blackburn et al. 2006; University 2021], which contains a set of open-source, real-world Java programs. This dataset, referred to as DaCapo, mimics real-world Java programs of varying complexity and is widely used in the literature [Bond and McKinley 2005; Thiessen and Lhoták 2017]. We summarize the statistics of our evaluated datasets, including the statistics of the call stacks and call graphs, in Table 1. Note that the average number of entry points in DaCapo is not applicable (NA), since these programs are normal Java programs with the main method as the sole entry point. Therefore, we only report this statistic for the Android dataset. Below, we introduce the details of call stack collection.
Call Stack Collection.For the Java programs in DaCapo, we randomly simulate the execution over its call graph starting from the program entry point and collect call stacks for each program.
To do so, we initiate the simulation process from the entry point function on the call graph of a Java program in DaCapo. We then conduct a "random walk" on the call graph, where each time we randomly pick a callee function to proceed until there are no more functions that can be called from the current function (i.e., we arrive at a leaf node of the call graph). This forms one chain of function calls, and we randomly choose a prefix of this call chain to form a call stack and include it in our dataset. We then repeat the process with different random seeds, and de-duplicate the collected call stacks until the average number of de-duplicated call stacks per program in DaCapo approximates that of the Android dataset. On average, each program in the DaCapo dataset has 4,195 de-duplicated call stacks, as shown in Table 1. This number is close to the 4,342 de-duplicated call stacks collected on average for each program in the Android dataset.
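The collection procedure above can be sketched as follows (our own minimal rendition; the dictionary call-graph format and the use of `random.Random` are assumptions):

```python
import random

def sample_call_stack(callees, entry, rng):
    """Random walk from the entry to a leaf of the call graph,
    then keep a random prefix of the chain as one call stack."""
    chain = [entry]
    while callees.get(chain[-1]):                       # stop at a leaf
        chain.append(rng.choice(sorted(callees[chain[-1]])))
    return chain[:rng.randint(1, len(chain))]           # random prefix
```

Repeated calls with different seeds, followed by de-duplication, yield the simulated dataset; every sampled stack is by construction a valid prefix of a legitimate call chain.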
As for the Android apps, we directly reuse the call stacks shipped with this dataset. It is disclosed that these call stacks were collected using the Monkey tool [Google 2020] to mimic 1,000 users' interactions with the 15 apps in the Android dataset. We sampled 100K call stacks for each app from the Android dataset, and the statistics of these call stacks are reported in Table 1. Overall, a considerable number of call stacks are collected for each program, and call stacks of comparable sizes are collected and used for evaluation. Note that the SOTA work [Hao et al. 2021] only evaluates the Android dataset. We randomly picked 90% of the call stacks for training and the remaining 10% for testing. We do not particularly tune the training hyper-parameters, as noted above in this section.
Metrics. Aligned with [Habernal 2021], we use precision, recall, and F1 scores to assess the performance of PP-CSA and the baseline methods. To clarify, "precision" denotes what portion of the reported hot traces are actually hot, whereas "recall" denotes what portion of the true hot traces are discovered. The F1 score is the harmonic mean of precision and recall. We also report the running time of PP-CSA and the baseline methods. In RQ3, following the setup in [Hua et al. 2022], we measure the privacy benefit of PP-CSA through the lens of adversarial uncertainty. Intuitively, adversarial uncertainty indicates the accuracy of an adversary in inferring the original call stack from the released data (i.e., the noisy encoding). In Sec. 6.3, we provide the formal definition of adversarial uncertainty in our context.
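For concreteness, the three utility metrics over sets of reported versus true hot traces can be computed as follows (a straightforward sketch; the set-based representation is our assumption):

```python
def hot_trace_metrics(reported, true_hot):
    """Precision: fraction of reported hot traces that are truly hot.
    Recall: fraction of true hot traces that were reported.
    F1: harmonic mean of the two."""
    tp = len(reported & true_hot)
    prec = tp / len(reported) if reported else 0.0
    rec = tp / len(true_hot) if true_hot else 0.0
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return prec, rec, f1
```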

EVALUATION
We have introduced the experimental setup in Sec. 5.2. Our evaluation mainly focuses on three key research questions (RQs):
• RQ1 what are the performances of PP-CSA under different configurations?
• RQ2 what are the contributions of each optimization of PP-CSA?
• RQ3 what are the privacy benefits of PP-CSA?

RQ1: End-to-End Evaluation
We report the end-to-end evaluation results of PP-CSA on the Java and Android datasets in Table 2 and Table 3, respectively. As aforementioned, we measure the performance of PP-CSA in terms of precision, recall, and the F1 score. Note that the data in Table 2 are the average over all Java programs in the DaCapo dataset, as are the data in Table 3 over the Android apps. We also observe that PP-CSA achieves a (nearly) perfect F1 score when the privacy budget ε is set to 40 and 100. Note that while ε = 100 can be deemed a "lenient" privacy budget that may not be commonly adopted in practice, ε = 40 is realistic. In comparison, the F1 scores of SOTA and DP-BART under this setting manifest considerable space for improvement.
It is seen that with a gradually increasing privacy budget ε, PP-CSA achieves higher accuracy. This is expected, because a larger privacy budget allows PP-CSA to release more information from the input trace, thus enabling the server side to more accurately reconstruct the original trace. Note that in an edge case, when ε = 1, PP-CSA achieves a relatively low F1 score in comparison to SOTA. Nevertheless, we argue that ε = 1 is generally unrealistic in practice, and the low F1 scores are presumably because the noise added by the DP mechanism is too large and the encoding may vanish. Overall, this is a too-strict setting, and all three approaches achieve low F1 scores under it. Overall, we interpret the results reported in Table 2 and Table 3 as highly encouraging, showing the effectiveness of PP-CSA in terms of accuracy. We recommend setting the privacy budget to 20 or 40, which is realistic and achieves a sufficiently high F1 score. We leave further discussion on the security implications of ε to RQ3.

Effect of Input Stack Length on PP-CSA Performance. We also measure the performance of PP-CSA with respect to the length of the input stack, with results in Table 4. To present a comprehensive study, we use the Eclipse dataset from DaCapo, given that it contains traces of varying lengths. In particular, given that nearly all traces in this dataset are shorter than 20, we divide the traces into three categories: short (length ≤ 5), medium (5 < length ≤ 10), and long (length > 10). Under a reasonable budget ε = 30, we observe that shorter traces lead to the highest F1 score of 0.966, indicating the algorithm's high proficiency in protecting them. The F1 score decreases to 0.806 for medium-length traces, and experiences a further drop to 0.541 for longer traces. Overall, we interpret the results in Table 4 as reasonable, given that longer traces (similar to lengthy natural language text) contain more information and are inherently more difficult to protect under DP. On the other hand, we
note that traces longer than 10 are rare in practice, which means that PP-CSA is generally effective in protecting the traces encountered in practice. We leave it as future work to further shrink the length of the input trace (see discussions in Sec. 7) and improve the accuracy of PP-CSA.

Applicability of PP-CSA across Different Test Data. Furthermore, we evaluate the applicability of PP-CSA across different test data. To do so, we evaluate its effectiveness when confronted with call stacks that exhibit different distributions compared to the training data. Specifically, we focus on analyzing the top 5% longest call stacks. We ensure that all the test data fall within this category while maintaining the normality of the training data, thus creating an out-of-distribution scenario.
The results of this evaluation are presented in Table 5, where the red color highlights the percentage of performance change compared to the performance achieved under the same-distribution setting. For instance, when ε is 40, the precision on the Android dataset decreases from 1 to 0.722 ((0.722 − 1)/1 × 100% = −28%). Overall, we observe that PP-CSA exhibits a decrease in performance in the out-of-distribution setting, particularly in the low-epsilon regime. This suggests that the model is more likely to misidentify these abnormal call stacks as common call stacks, which is reflected in the low precision values and high recall values. Our current design does not specifically tailor PP-CSA to out-of-distribution scenarios; this indicates a potential future direction for improvement.

Study of the Influence of Entry Points on PP-CSA. We also study how certain characteristics of the benchmarks, specifically the number of entry points, can influence the effectiveness of PP-CSA.
We report the results in Figure 3. In short, we do not observe a significant correlation between the number of entry points and the effectiveness of PP-CSA. This observation may be attributed to the proper usage of a virtual entry point during the decoding phase when handling multiple entry points. This virtual entry point serves as the starting point for the decoding process and is connected to all real entry points in the call graph through edges. Consequently, a program with multiple entry points can be treated as having a single entry point. Thus, the number of entry points does not have a significant influence on the overall effectiveness of PP-CSA.

Processing Time. We measure and report the average running time of PP-CSA: processing one trace takes about 3.84 × 10⁻³ seconds. In comparison, the average running times of SOTA and DP-BART are 6.37 × 10⁻³ and 2.17 × 10⁻² seconds, respectively. We thus interpret the overall processing time of PP-CSA as moderate and practical.

Answer to RQ1: PP-CSA manifests highly encouraging performance in terms of accuracy and efficiency. Moreover, we recommend using PP-CSA with a privacy budget ε ∈ [20, 40], which shall achieve a highly competitive accuracy with moderate cost.
6.2 RQ2: Optimizations of PP-CSA
In Sec. 4.1, we propose two client-side shrinking rules to reduce the size of call stacks and, in Sec. 4.3, we propose three optimization strategies on the server side. This section evaluates the effectiveness of these rules and optimizations in improving the accuracy of PP-CSA. Note that those optimizations do not affect PP-CSA's privacy guarantee, which is decided by the privacy budget ε. Nevertheless, we clarify that, given the fundamental trade-off between accuracy and privacy, at a fixed accuracy, PP-CSA shall become safer with these optimizations. See our security evaluations in RQ3. At this step, we consider different client-side and server-side optimizations (and their combinations). For each setting, we evaluate the performance of PP-CSA under different privacy budgets, using the entire DaCapo and Android datasets, respectively.
Overall, we can observe that the performance of PP-CSA is generally improved when either client- or server-side optimizations are used. Moreover, we observe an encouraging "synergistic effect" such that the accuracy of PP-CSA is further improved when all optimizations are used together (the "All" row). PP-CSA achieves the best performance in five out of the six settings in Table 6. The remaining setting is the "extreme" case where the privacy budget is extremely large (ε = 100).
In practice, we do not expect to use such an extreme privacy budget. When comparing to the "Base" setting, we observe that the client-side shrinking rule R 1 , "Pre-determinable Function Compression," appears to be the most effective one. This is expected, as this rule "compresses" the call stacks by removing the function calls that are pre-determinable, making it easier for the encoder/decoder to identify common patterns in call stacks. This facilitates utility without compromising privacy. Moreover, the other four client/server optimizations also manifest very close and encouraging effectiveness. For instance, we observe that O 2 , call graph-guided decoding, is also highly useful. O 2 effectively "regulates" the decoding process by pruning illegal call stacks that are not in the call graph, contributing to the accuracy improvement.
Answer to RQ2: All optimizations are effective in improving the accuracy of PP-CSA. Moreover, we observe an encouraging "synergistic effect" such that the accuracy of PP-CSA is further improved when all server- and client-side optimizations are used together under common privacy budgets.
We recommend enabling all optimizations whenever possible in practice.

6.3 RQ3: Privacy Benefits of PP-CSA
Given the released (noisy) encoding s_enc of a call stack s, the goal of the adversary A is to infer the original call stack s as accurately as possible. In this regard, from a prediction perspective, the adversary A can be viewed as a classifier performing K-label classification, where K is the number of distinct call stacks in the training dataset. The adversarial uncertainty is thus defined as the advantage in accuracy achieved by A over a trivial approach. Formally, we define the adversarial uncertainty on accuracy, AU@Acc, as follows:

AU@Acc = (1 − Pr[A(s_enc) = s]) / (1 − Acc_base),

where Pr[A(s_enc) = s] is the probability that the adversary A correctly infers the original call stack s from the encoding, and Acc_base = max_s Pr[s] is the accuracy of the trivial approach that always predicts the most frequent call stack. Likewise, we can define the adversarial uncertainty on precision, recall, and F1-score, denoted as AU@Prec, AU@Rec, and AU@F1, respectively. A higher AU indicates lower confidence of the adversary A and thus better privacy protection. AU ≥ 1 indicates that the adversary A is no better than the trivial approach (i.e., always predicting the most frequent call stack). AU = 0 indicates that the adversary A can perfectly infer the original call stack, indicating no privacy protection. We report the evaluation results of PP-CSA in Table 8. Overall, we observe a decreasing trend of AU as the privacy budget ε increases. This is expected, as a larger ε implies less noise added to the encoding, making it easier for the adversary to infer the original call stack. In the regime of ε ∈ [20, 40], we observe that PP-CSA achieves a great trade-off between accuracy and privacy, as the AU is non-trivially high (e.g., AU@Acc = 0.239 when ε = 40) while PP-CSA attains a highly accurate estimation of hot call stacks with an F1 score of 0.970 in Table 2. This implies that the adversary cannot confidently infer a notable fraction of individual call stacks, while PP-CSA can still accurately estimate the global call stack distribution and thus decide the hot call stacks. This is highly desirable in practice, as we alleviate the privacy concern of individual data while still providing useful statistics to developers for debugging, without compromising the utility to a large extent.
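The adversarial uncertainty metric can be computed as below (a sketch; the exact normalization is reconstructed from the stated boundary conditions, AU = 1 at the trivial baseline and AU = 0 at perfect inference, and may differ from the paper's exact formula):

```python
def adversarial_uncertainty(adv_acc, base_acc):
    """AU@Acc = (1 - adv_acc) / (1 - base_acc): equals 1 when the adversary
    matches the trivial most-frequent-stack baseline, and 0 when the
    adversary infers every stack perfectly."""
    return (1.0 - adv_acc) / (1.0 - base_acc)
```

An adversary weaker than the baseline yields AU > 1, matching the interpretation that AU ≥ 1 means no advantage over trivial guessing.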
Answer to RQ3: PP-CSA achieves a highly encouraging trade-off between privacy and accuracy in the common privacy budget regime (ε ∈ [20, 40]).

DISCUSSION
We discuss this work from the following aspects.
Extensibility. We analyze the extensibility of PP-CSA from the following aspects. From the language aspect, PP-CSA is currently implemented to handle Java and Android programs. Note that PP-CSA does not rely on any language-specific features and can be easily extended to other programming languages (e.g., C/C++ or JavaScript). Also, while the current implementation relies on Soot to perform static analysis, PP-CSA can leverage other static analysis tools, such as WALA [Santos and Dolby 2022] for JavaScript or LLVM [Lattner and Adve 2004] for C/C++. Also, from the analyzed-software aspect, PP-CSA can be naturally used to protect the privacy of call stacks collected from different software, e.g., web applications, IoT devices, and smart contracts.
From the analysis aspect, PP-CSA is currently implemented to support hot call trace analysis, where each event on the call stack is a user-level function call. Looking ahead, we envision that PP-CSA can be extended to support the analysis of other call stack types, e.g., hot system call analysis, where each event on the trace is a system call. This facilitates various security applications such as intrusion detection [Lu and Teng 2021; Wunderlich et al. 2020]. Furthermore, recall that the core idea of privacy preservation in PP-CSA is to privatize individual users' call stacks while still providing accurate estimations of group statistics, such as the frequency of each call stack. Thus, PP-CSA can be naturally extended to support other group-statistics-based call stack analysis tasks in addition to hot call stack analysis, e.g., bottleneck identification [Tallent et al. 2009] and resource usage analysis [Decker et al. 2018]. However, PP-CSA may not be well-suited for individual-property-based call stack analysis tasks, such as detecting outlier behavior [Mirgorodskiy et al. 2006] (e.g., unusually long call stacks) or identifying call stacks associated with uncommonly occurring bugs [Ko and Myers 2008]. Due to the fundamental premise of DP, PP-CSA falls short of providing accurate estimations of individual call stacks.
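To make the group-statistics idea concrete, the following is a minimal sketch of local frequency estimation with k-ary randomized response, a standard LDP primitive — not PP-CSA's actual encoder-decoder mechanism. The domain of encoded call stacks, the ε value, and all names are illustrative.

```python
import math
import random
from collections import Counter

def krr_report(value, domain, eps, rng):
    """k-ary randomized response: keep the true (encoded) call stack
    with probability e^eps / (e^eps + k - 1); otherwise report a
    uniformly random *other* value from the domain."""
    k = len(domain)
    p_keep = math.exp(eps) / (math.exp(eps) + k - 1)
    if rng.random() < p_keep:
        return value
    return rng.choice([v for v in domain if v != value])

def estimate_frequencies(reports, domain, eps):
    """Debias the observed counts: f_hat = (c/n - q) / (p - q),
    where p is the keep probability and q = 1 / (e^eps + k - 1)."""
    k, n = len(domain), len(reports)
    p = math.exp(eps) / (math.exp(eps) + k - 1)
    q = 1.0 / (math.exp(eps) + k - 1)
    counts = Counter(reports)
    return {v: (counts[v] / n - q) / (p - q) for v in domain}

# Toy example: 1000 users, three hypothetical encoded call stacks.
rng = random.Random(0)
domain = ["stack_A", "stack_B", "stack_C"]
true_data = ["stack_A"] * 700 + ["stack_B"] * 200 + ["stack_C"] * 100
reports = [krr_report(v, domain, 4.0, rng) for v in true_data]
est = estimate_frequencies(reports, domain, 4.0)
# est["stack_A"] should land close to the true fraction 0.7
```

No individual report reveals a user's true call stack with confidence, yet the aggregate estimate tracks the global distribution — the property that group-statistics-based analyses rely on.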
Moreover, while PP-CSA supports Android programs, it does not leverage domain-specific features of Android programs, such as the Android activity lifecycle and the Android Intent mechanism. We leave it as future work to take those domain-specific features into account, which is likely useful for shortening call stacks and speeding up the analysis to a reasonable extent. It is important to note that, with shortened call stacks, we envision that the accuracy of PP-CSA can be further improved without compromising the privacy guarantees. The rationale has been clarified in Sec. 4.1.
Additional Computation Costs. In comparison to the direct transmission of call stacks to a central server, PP-CSA introduces additional computation costs in both the offline and online phases to ensure the privacy of individual users. During the offline phase, PP-CSA incurs additional computational expenses due to the training of the encoder-decoder model. As for the online phase, PP-CSA introduces additional computational demands stemming from two aspects: the client-side inference of the encoder model and the server-side decoding of the noisy call stacks. In the offline phase, the added computational cost is of minimal concern, as it occurs in a non-real-time context. For the online phase, we have conducted a thorough evaluation of the processing time of PP-CSA in Sec. 6.1, which shows that the average running time of PP-CSA to process one call stack is around 3.84 × 10^-3 seconds; this is acceptable in real-world scenarios and even faster than the current SOTA method. Overall, we observe that PP-CSA introduces a reasonable additional computation cost compared to plain CSA.
Threats to Validity. There exist potential threats that the proposed framework may not adapt to other types of programs. We mitigate this threat to external validity by designing an approach that is language and platform agnostic (as discussed under "Extensibility" above). Also, PP-CSA is evaluated on a set of real-world Java and Android programs. There exist threats that our evaluation results may not generalize to other test cases. We mitigate this threat by using Java and Android programs containing a diverse set of functionalities. Evaluating PP-CSA on the programs used by prior works also enables a fair comparison with SOTA and DP-BART.
Additionally, it is important to note that the unit of privacy in our framework primarily focuses on instance-level DP, where each user is expected to report only one call stack. We presume this scenario is a reasonable setup. For instance, when analyzing call stacks that trigger crashes, it is unlikely for a user to experience multiple crashes with different call stacks in a short period. If a user does experience multiple crashes, uploading multiple call stacks may compromise the privacy guarantee. However, it is worth highlighting that our PP-CSA framework can be adapted to support user-level privacy. In this scenario, the server can make repeated queries to the client for multiple call stacks [Wu et al. 2014]. To achieve this, we can divide the total privacy budget into several portions and add noise to each individual call stack using these allocated portions. This approach ensures that the privacy guarantee is preserved for each individual call stack, maintaining user-level privacy.

We further report how different values of the "hot call stack" threshold (Def. 3) may influence the utility and privacy of PP-CSA in Table 9, following the procedure in [Hao et al. 2021]. We report the average precision, recall, and F1 score under different threshold values on the DaCapo dataset. Overall, we observe that PP-CSA is robust to different threshold values, and its utility and privacy are not notably sensitive to the threshold. Nevertheless, when the threshold is overly large (e.g., 0.05 in our experiments), we can hardly find any "hot" call stack. On balance, we believe these findings are reasonable. In our evaluation (Sec. 5), we set the threshold to 0.01, and we encourage users to decide the threshold properly according to their own needs.

Performance in the Low-ε Regime. To comprehensively compare PP-CSA and SOTA, we evaluated the performance of both in the range ε ∈ [1, 10], as shown in Table 10. PP-CSA consistently outperforms SOTA, with the exception of the range ε ∈ [1, 3] for the Android dataset. We hypothesize that this is due to the Android dataset having a large number of functions and a skewed function call distribution. This characteristic might complicate the encoder-decoder model's ability to accurately learn the call stack distribution. In contrast, SOTA uses a count sketch to approximate the call stack distribution and is likely to be more resilient to skewed distributions at low ε values. Nevertheless, SOTA's efficacy in this regime remains below par, with F1 scores even falling beneath 0.4. In general, attaining meaningful utility in low-ε scenarios is challenging, and many relevant studies, like [Igamberdiev and Habernal 2023; Krishna et al. 2021], prioritize the high-ε range, focusing more on utility than on privacy. Following this trend, we also focus on the high-ε range in our main evaluation. As a general recommendation for practitioners, we discourage performing call stack analysis at low ε levels. More importantly, we clarify that a high ε value does not inherently imply severe privacy issues in practice. As evidenced in Sec. 6.3, even at ε = 40, the level of adversarial uncertainty remains considerably high. These insights lead us to omit the low-ε regime from our main evaluation in Sec. 6.1.
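The budget-splitting adaptation for user-level privacy described earlier in this discussion can be sketched as follows. An even split is one simple choice; by sequential composition of LDP mechanisms, the user's overall leakage stays bounded by the total budget. The helper below is hypothetical and not part of the released implementation.

```python
def split_budget(total_eps, num_stacks):
    """Evenly divide a user's total privacy budget across their
    call stacks; by sequential composition, perturbing each stack
    with total_eps / num_stacks keeps the user's overall leakage
    bounded by total_eps."""
    if num_stacks <= 0:
        raise ValueError("num_stacks must be positive")
    return [total_eps / num_stacks] * num_stacks

# A user reporting 4 call stacks under a total budget of 40
# perturbs each stack with eps = 10.
print(split_budget(40.0, 4))  # -> [10.0, 10.0, 10.0, 10.0]
```

Non-uniform splits (e.g., spending more budget on the most recent crash) are equally valid as long as the portions sum to the total budget.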
Potential Integration with Other Privacy-Enhancing Techniques. In addition to the privacy-preserving approach provided by PP-CSA, it is worth considering the potential integration with other privacy-enhancing techniques, such as anonymous transmission services like the Tor network [Dingledine et al. 2004]. These techniques can offer anonymization of user identities, which provides privacy guarantees orthogonal to those of PP-CSA. PP-CSA can be seamlessly built on top of Tor. For instance, one possible integration approach is to leverage Tor as the underlying network infrastructure for transmitting the encoded call stacks in PP-CSA. By utilizing Tor, the client in PP-CSA can securely transmit the LDP-protected call stack to the server without revealing the client's identity. This integration would strengthen the overall privacy protection provided by the system, offering both anonymized transmission of call stacks and differential privacy guarantees.

RELATED WORK
In this section, we review related work on analyzing software in a privacy-preserving manner, applications of LDP, and how SE and PL techniques are used to improve privacy-preserving software systems.
Privacy-Preserving Software Analysis. Recently, there has been a surge of efforts to secure software analysis using various privacy-enhancing techniques, such as zero-knowledge proofs (ZKP), trusted execution environments (TEE), and differential privacy (DP). Fang et al. [Fang et al. 2021] introduce a ZKP-based method for intra- and inter-procedural abstract interpretation. Cheesecloth offers a ZKP-based solution to verify real-world software vulnerabilities [Cuéllar et al. 2023]. Tramer et al. [Tramer et al. 2017] propose an SGX-based method to host a bug bounty program, wherein the proof-of-concept (PoC) of a software exploit is confirmed inside a secure enclave. In the section "Applications of LDP", we delve deeper into LDP's usage in software profiling.
Applications of LDP. LDP stands out as a key privacy protection method. For instance, RAPPOR, an LDP-based mechanism, is employed by Google to collect popular domain data from users in the Chrome browser [Erlingsson et al. 2014]. Tech giants like Apple, Microsoft, and Uber have employed LDP to enhance the privacy of their data collection services [Cormode et al. 2018; Near 2018]. Specifically in software profiling, LDP has been utilized for collecting software node coverage, mobile app event frequency, and call traces [Hao et al. 2021; Zhang et al. 2020a,b].
SE & PL for Privacy-Preserving Software. The software engineering and programming language (SE & PL) communities have delved into testing, verifying, and debugging techniques tailored for privacy-preserving software systems. Notably, recent efforts have focused on employing program verification and type-system-based approaches to verify the correctness of ZKP [Liu et al. 2023; Pailoor et al. 2023]. Most of the identified bugs, stemming from erroneous implementations of ZKP programs, can be exploited by adversaries. FedDebug [Gill et al. 2023] offers a debugging framework enabling users to pinpoint which federated client contributes to the central model's misbehavior. Similarly, MPCDiff [Pang et al. 2024] deploys differential testing to compare a machine learning model with its privacy-preserving counterpart shielded by multi-party computation (MPC). Also, Ding et al. [Ding et al. 2018] utilize differential testing to unearth bugs in differential privacy (DP) programs. Tools such as DP-Finder [Bichsel et al. 2018] and DP-Sniper [Bichsel et al. 2021] leverage either random testing or a neural network-based testing oracle to reveal violations of DP guarantees. Apart from the analysis aspect, the PL community has also explored the optimization of privacy-preserving software systems. For instance, Levy et al. [Levy et al. 2023] and Ishaq et al. [Ishaq et al. 2019] present toolchains for compiling and vectorizing MPC programs. Roy et al. [Roy et al. 2021] and Wang et al. [Wang et al. 2021] show the usage of program synthesis techniques in generating DP-protected programs from their unprotected counterparts. Some research focuses on programming language design, testing, and verification of oblivious RAM (ORAM) protocols [Darais et al. 2019; Liu et al. 2015; Ma et al. 2022].

CONCLUSION
We propose PP-CSA, a novel privacy-preserving CSA approach that can be deployed in real-world scenarios. PP-CSA is based on LDP and incorporates several key optimizations in its technical pipeline. These optimizations include an encoder-decoder scheme to enforce LDP and a call stack compressing and matching algorithm to mitigate the utility-privacy trade-off. Our evaluation demonstrates the efficacy of the privacy-preserving call stack analysis pipeline implemented by PP-CSA: it achieves high levels of utility and privacy guarantees while maintaining high efficiency. We further show that our proposed optimizations are effective and have a "synergistic effect": enabling all optimizations offers the highest accuracy improvement. We also mimic adversaries who aim to infer user privacy from PP-CSA-protected call stacks, and illustrate that attackers can only achieve a very low attack accuracy under our common settings. We conclude the paper by discussing various extensions, design considerations, limitations, and future work directions. Overall, our work aligns with the trajectory of privacy-preserving techniques in the programming language and software engineering communities, extending their realm to call stack analysis and introducing innovative enhancements to tackle the unique challenges in this domain. We envision that our work can inspire further investigations into privacy protection in software systems.

DATA-AVAILABILITY STATEMENT
We have released our artifacts at [artifact 2023]. The evaluation results reported in the paper can be fully reproduced using the released artifacts. Looking ahead, we will maintain them for future research comparison and usage. We will also enhance the artifacts by providing more detailed documents and use cases to support their usage and extension by the community.
Fig. 1. Illustration of the call stack variation issue under LDP.

Fig. 3. Performance of PP-CSA under different numbers of entry points.

Table 1. Statistics of our datasets, stacks, and call graphs.

PP-CSA, with its tailored optimizations, can outperform DP-BART in terms of both accuracy and efficiency. We use the official releases of SOTA and DP-BART for evaluation. All experiments are run on a machine with a GeForce RTX 3090 GPU, an Intel Core i7-8700 CPU, and 32GB RAM.

Table 2. RQ1: evaluating PP-CSA using DaCapo under different privacy budgets. We highlight the best F1 scores for each setting in bold.

Table 3. RQ1: evaluating PP-CSA using Android under different privacy budgets. We highlight the best F1 scores for each setting in bold.

Table 3 reports results for Android apps in the Drebin dataset; we discuss how PP-CSA performs on individual traces later in Table 4. It is seen that PP-CSA achieves a highly competitive F1 score across different privacy budgets. PP-CSA notably outperforms the two baselines, SOTA and DP-BART, in terms of F1 score by 26.3% and 77.4% on average for the Java dataset, and by 43.7% and 206.1% on average for the Android dataset.

Table 4. Precision, recall, and F1 score in terms of three kinds of trace lengths.

Table 5. Resilience of PP-CSA to different test data for DaCapo and Android.

Table 6. RQ2: evaluating the contribution of different optimizations using DaCapo. We highlight the best F1 scores for each setting in bold.

Table 7. RQ2: evaluating the contribution of different optimizations using Android. We highlight the best F1 scores for each setting in bold.

Table 8. RQ3: Adversarial uncertainty (lower is better).

RQ3: Privacy Benefit. Sec. 3.2 provides a conceptual discussion on the privacy guarantee of PP-CSA. In essence, the privacy protection offered by PP-CSA aims to impede any potential inferences made by attackers. In this section, we aim to quantitatively evaluate the privacy benefit of PP-CSA. To this end, we first formally instantiate the concept of adversarial uncertainty in the context of call stack analysis. Specifically, we consider the following adversarial inference problem:

Proc. ACM Program. Lang., Vol. 8, No. OOPSLA1, Article 139. Publication date: April 2024.
Effect of the "Hot Call Stack" Threshold. In line with typical DP-based systems, we have evaluated how different ε values affect the utility and privacy of PP-CSA. Besides ε, another important parameter in PP-CSA is the "hot call stack" threshold, as defined in Def. 3. Recall that the threshold is used to determine whether a call stack is "hot" by comparing the frequency of the call stack with the threshold multiplied by the total number of users.
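The threshold rule can be sketched as follows (a hypothetical helper, not from the PP-CSA artifact), where frequencies are raw counts and a stack is "hot" when its count exceeds the threshold times the user population:

```python
def hot_call_stacks(counts, num_users, threshold):
    """Return the call stacks whose (estimated) count exceeds
    threshold * num_users, i.e., the 'hot' call stacks."""
    return {s for s, c in counts.items() if c > threshold * num_users}

# Illustrative counts for three hypothetical call stacks among 1000 users.
counts = {"stack_A": 700, "stack_B": 200, "stack_C": 5}
print(sorted(hot_call_stacks(counts, num_users=1000, threshold=0.01)))
# -> ['stack_A', 'stack_B']  (stacks with count above 0.01 * 1000 = 10)
```

Raising the threshold shrinks the hot set; an overly large threshold can leave it empty, matching the behavior observed in our experiments.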

Table 10. Evaluating PP-CSA under the low-ε regime for DaCapo and Android. We report the F1 scores under each setting in this table.