Anonymizing Test Data in Android: Does It Hurt?

Failure data collected from the field (e.g., failure traces, bug reports, and memory dumps) represent an invaluable source of information for developers who need to reproduce and analyze failures. Unfortunately, field data may include sensitive information and thus cannot be collected indiscriminately. Privacy-preserving techniques can address this problem by anonymizing data and reducing the risk of disclosing personal information. However, collecting anonymized information may harm reproducibility, that is, the anonymized data may not allow the reproduction of a failure observed in the field. In this paper, we present an empirical investigation of the impact of privacy-preserving techniques on the reproducibility of failures. In particular, we study how five privacy-preserving techniques may impact reproducibility for 19 bugs in 17 Android applications. Results provide insights on how to select and configure privacy-preserving techniques.


INTRODUCTION
Collecting bug reports and information about the failures experienced by end-users while interacting with their applications is extremely important to reveal bugs [23,24] and to improve the quality and reliability of the applications. Indeed, several problems are detected only once the software has been released [9], and the extensive collection of failure data is a key factor in enabling the reproduction of the bugs, and later their correction.
Several approaches have been defined to reproduce failures from runtime data extracted from the field. For instance, failures have been reproduced starting from the flow of events executed in the app immediately before a crash [17], from execution traces with the operations performed before a crash [4,10,15], as well as from the content of the stack trace [16,19] and bug reports [21,22]. Despite the benefit of collecting data from the field to reproduce failures, user data can be fairly collected only by taking the sensitivity of the data into consideration. Indiscriminately collecting data may reveal sensitive information that should not be available outside the boundary of the app. For instance, failure traces may include sensitive information such as age, gender, financial data, and personal interests.
In specific cases, data can be partially anonymized. For instance, concerning the data stored in databases, kb-Anonymity can be used to mitigate the issue of sharing sensitive information through the databases used for testing [2]. When the execution path of the failure is available and symbolic execution is applicable to the target program, new synthetic executions that reproduce the failures might be derived to also mitigate issues with sensitive data [3,11,12].
The field of data mining has been investigating this challenge for several years, defining a number of privacy-preserving techniques that can be used to alleviate the problem of incidentally disclosing sensitive information [8,14,20]. These techniques work by applying generalization or suppression operations to the data, so that the original information is not immediately available anymore [13]. For instance, a string 123456 representing an account number could be automatically rewritten as a random string of the same length, such as xhfprt. These techniques can be readily applied to the data collected from the field to prevent disclosing sensitive data to third parties.
While privacy-preserving techniques can clearly eliminate, or reduce, privacy issues, their impact on the capability of revealing failures has not been studied so far. Indeed, using anonymized data to reproduce failures is harder than using clear text data. That is, protecting the privacy of the users and facilitating the reproduction of failures experienced in the field are two competing goals.
In this paper we present, to the best of our knowledge, the first empirical study of the impact of privacy-preserving techniques on failure reproduction. We focused our study on failures experienced by users of mobile apps due to the popularity of mobile applications and their exposure to privacy issues [7]. Our study considers 19 bugs in 17 open source Android applications, and provides insights about the trade-off between guaranteeing the privacy of the users and easing the reproduction of failures. In particular, we show that there is no single best choice of privacy-preserving technique. Different contexts may require different techniques depending on the aspect to privilege.
This paper is organized as follows. Section 2 introduces and rigorously defines privacy-preserving techniques. Section 3 describes the design of the experiment conducted to evaluate the impact of privacy-preserving techniques on failure reproduction. Section 4 reports the empirical results and answers our research questions. Section 5 discusses related work. Finally, Section 6 provides final remarks.

PRIVACY-PRESERVING TECHNIQUES
Privacy-preserving techniques can be used to effectively anonymize data. This section defines the techniques that we considered in our study, describing how we adapted them to the problem of failure reproduction, when necessary.
Privacy-preserving techniques are typically used in the context of data mining, especially with records of databases. In fact, data contained in tables do not usually satisfy privacy requirements, and thus they cannot be shared without applying anonymization operations [8]. These operations may target individual records or sets of records. The former class of operations is useful when a third party that accesses the data can access only individual records, and cannot compare the (anonymized) records with one another. The latter class of operations is useful when a third party can access the full set of records, and thus may infer facts by cross-analyzing the content of multiple (anonymized) records. In such a case, privacy-preserving techniques must consider the full set of records when anonymizing the individual records to prevent the incidental disclosure of sensitive information.
Software failures, in contrast, are usually experienced only once in a while in released apps, and only a few failures are normally collected from the same user. Strategies specifically designed to deal with large sets of repeated failures collected from the same users are thus likely not needed to guarantee that user data is fairly collected. For this reason, we focus on privacy-preserving techniques that can be applied to individual records.
In this paper, we consider failure traces consisting of streams of GUI events executed on the app when the failure occurred, as done in many failure reproduction techniques, such as CaRCrash [17] and ReCDroid [22]. That is, a failure trace is a sequence of events $(a_i, w_i, d_i)$ where $a_i$ is a GUI action (e.g., a click event) performed on a widget $w_i$ (e.g., a button) possibly using data $d_i$ (e.g., the text entered into an input field). The set of values $d_i$ that occur in a failure trace are the data values subject to the anonymization process. We do not consider in our experiment the anonymization of other elements, such as the action or the widget. That is, we study how to prevent the failure trace from disclosing information such as the age, the address, or the personal income through the data values $d_i$ entered into a form, while it is out of the scope of the study to hide the fact that a user has registered into an app by clicking on the Register button.
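The trace model described above can be sketched as a small data structure. This is only an illustrative sketch: the names GuiEvent and anonymize_trace, and the string identifiers used for actions and widgets, are hypothetical and are not taken from the paper's tool. The point it illustrates is that only the data component of each event is rewritten, while actions and widgets are left untouched.

```python
from dataclasses import dataclass, replace
from typing import Callable, List, Optional


@dataclass(frozen=True)
class GuiEvent:
    """One event of a failure trace: an action on a widget, possibly with data."""
    action: str                  # e.g., "click" or "type" (hypothetical labels)
    widget: str                  # e.g., an identifier of the target widget
    data: Optional[str] = None   # user-entered value, if the event carries one


def anonymize_trace(trace: List[GuiEvent],
                    anonymize: Callable[[str], str]) -> List[GuiEvent]:
    """Apply `anonymize` to the data values only; actions and widgets are kept."""
    return [
        replace(event, data=anonymize(event.data)) if event.data is not None else event
        for event in trace
    ]


if __name__ == "__main__":
    trace = [
        GuiEvent("type", "age_field", "42"),
        GuiEvent("click", "register_button"),
    ]
    # A trivial anonymizer that drops every value.
    print(anonymize_trace(trace, lambda value: ""))
```

Keeping the event immutable and copying it with a new data value makes it easy to derive many anonymized variants from the same original trace.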
The operations performed by privacy-preserving techniques to anonymize data vary based on the type of data. In particular, it is possible to distinguish three different classes of data to be analyzed: continuous values, categorical values, and string values. Continuous values are numeric values (e.g., someone's age or income) that can be used, for instance, as part of arithmetic operations. Categorical values are enumerated values that cannot be normally used as part of arithmetic operations [6]. Finally, string values are sequences of alphanumeric characters.
We now present the privacy-preserving techniques based on the type of anonymization strategy they implement: generalization, suppression, and perturbation. Table 1 shows the specific techniques that we considered (Column Technique), classified according to the strategy they implement (Column Strategy) and associated with the type of data that they can be applied to (Columns Continuous, Categorical, and String). The set of selected techniques reflects the taxonomy proposed by Mendes et al. [13]. We have excluded the anatomization strategy present in the taxonomy, since it is strictly related to databases and cannot be applied to our context, and both the Top/Bottom Coding and the Post-Randomization (PRAM) techniques, since our dataset does not include cases where they can be applied.

Generalization Techniques
Techniques that belong to this strategy replace values with more general ones [13]. These privacy-preserving techniques disclose some general information, while hiding the original value.
Definition: Global Recoding anonymizes a value by only disclosing information about the interval it belongs to. This technique can be applied to both categorical and continuous variables. In the former case, Global Recoding anonymizes a value by combining several categories into fewer ones. For example, if the categorical value represents different age groups (e.g., newborns, infants, toddlers, kids, and adults), Global Recoding may reduce them into two groups (e.g., 'baby' and 'kids or older'). In the latter case, Global Recoding replaces a variable with its interval. For example, the numerical age value can be replaced with its categorical age group [18].
Failure Reproduction: In the context of failure reproduction, replacing a categorical or continuous value with its interval (e.g., replacing the categorical input infant or the numerical input 1 with the category baby) would make the failure trace non-executable. In fact, the new value would not be processable by the application that expects either values from a specific enumeration of categories or numerical values. To obtain a processable input, and thus to attempt to reproduce the failure from the anonymized trace, the values anonymized with Global Recoding are then replaced with random concrete values within the anonymized interval.
Example: If a variable in the range [0, 10] is anonymized according to the sub-intervals [0, 5) and [5, 10], and the value to anonymize is 4.0, Global Recoding replaces the original value with the interval [0, 5). Failure reproduction shall generate random values within this interval to attempt to reproduce the failure. Similarly, if a categorical value newborns is anonymized with a more general category baby (that includes newborns, infants, and toddlers), test generation shall use values in the set newborns, infants, and toddlers to reproduce the failure.
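The two steps of this example (recoding a value to its interval, then drawing a concrete value for reproduction) can be sketched as follows. This is a minimal sketch under assumptions: the function names are hypothetical, the domain is split into equal-width partitions, and the category mapping only encodes the baby group from the example above.

```python
import random


def global_recode(value, low, high, partitions):
    """Return the (start, end) equal-width sub-interval of [low, high] containing value."""
    if not low <= value <= high:
        raise ValueError("value outside the domain")
    width = (high - low) / partitions
    # The upper bound of the domain is assigned to the last partition.
    index = min(int((value - low) / width), partitions - 1)
    return (low + index * width, low + (index + 1) * width)


def sample_from_interval(interval, rng=random):
    """Draw a concrete value for failure reproduction from the recoded interval.

    Note: random.uniform may, in rare cases, return the upper end point.
    """
    start, end = interval
    return rng.uniform(start, end)


# Assumed grouping, taken from the categorical example in the text.
CATEGORY_GROUPS = {"baby": ["newborns", "infants", "toddlers"]}


def recode_category(value, groups):
    """Replace a category with the more general group it belongs to."""
    for general, members in groups.items():
        if value in members:
            return general
    raise ValueError("no group defined for " + repr(value))


if __name__ == "__main__":
    interval = global_recode(4.0, 0, 10, 2)
    print(interval, sample_from_interval(interval))
    group = recode_category("newborns", CATEGORY_GROUPS)
    print(group, random.choice(CATEGORY_GROUPS[group]))
```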

Rounding.
Definition: This technique identifies several rounding points in the domain and maps the input value to be anonymized to the closest rounding point [6]. These rounding points could be identified by dividing the domain into multiple intervals, then selecting the middle point of each interval as rounding point.
Failure Reproduction: The anonymized value is an actual domain value and thus failure reproduction simply uses the value readily available in the trace.
Example: Given an input in the range (0, 10], the rounding points can be defined as the middle points of the intervals (0, 5) and [5, 10], that is, the values 2.5 and 7.5. Every value to be anonymized is mapped to one of these two values.
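A minimal sketch of Rounding, assuming equal-width intervals whose middle points serve as rounding points (the function names are hypothetical):

```python
def rounding_points(low, high, partitions):
    """Middle points of `partitions` equal-width intervals over the domain."""
    width = (high - low) / partitions
    return [low + (i + 0.5) * width for i in range(partitions)]


def round_to_point(value, points):
    """Map a value to the closest rounding point.

    A value exactly halfway between two points is mapped to the lower one here;
    this tie-breaking rule is an assumption of the sketch.
    """
    return min(points, key=lambda point: abs(point - value))


if __name__ == "__main__":
    points = rounding_points(0, 10, 2)
    print(points)                       # the two middle points
    print(round_to_point(4.0, points))  # closest point to 4.0
    print(round_to_point(9.0, points))  # closest point to 9.0
```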

Suppression Techniques
Techniques that belong to this strategy entirely drop the values to be anonymized, or retain minimal information, to protect privacy [13].

Local Suppression.
Definition: This technique can be trivially applied to any data type (continuous, categorical, and string), since it replaces the input value with a missing value, whose semantics depends on the context [18]. For example, considering a record in a database, the corresponding missing value is NULL. In the context of Android applications, Local Suppression simply logs the empty string for any input value.
Failure Reproduction: In this case, failure reproduction is left with no information about the original value and thus it can only generate a random value coherent with the domain of the original value. In particular, if the original value is continuous, the technique generates a random value within the allowed range. If the original value is categorical, the technique chooses a random element from the set of possible values. If the original value is a string, the technique generates a string that matches a specific regular expression (we consider both the case in which the new string has a length unrelated to the original string and the case in which its length matches that of the original string).
Example: In all the cases, the anonymized value is the empty value. The generation is driven by the full range of values allowed by the input field. For instance, a random number between 0 and 100 could be generated for an input field representing the age of a person.
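The behaviour described above can be sketched as follows. The function names are hypothetical, and drawing characters from a fixed alphabet is an assumed stand-in for generating a string that matches a regular expression built from a single character class.

```python
import random
import string

# Alphabets corresponding to simple character classes (assumed for illustration).
PRINTABLE = "".join(chr(code) for code in range(ord("!"), ord("~") + 1))  # [!-~]
ALPHANUMERIC = string.ascii_letters + string.digits                        # [A-Za-z0-9]


def suppress(value):
    """Local Suppression: the logged value is always the empty string."""
    return ""


def regenerate_continuous(low, high, rng=random):
    """Random value within the range allowed by the input field."""
    return rng.uniform(low, high)


def regenerate_categorical(options, rng=random):
    """Random element from the set of possible values."""
    return rng.choice(list(options))


def regenerate_string(alphabet, length, rng=random):
    """Random string of the given length over the given alphabet."""
    return "".join(rng.choice(alphabet) for _ in range(length))


if __name__ == "__main__":
    print(repr(suppress("42")))
    print(regenerate_continuous(0, 100))              # e.g., an age field
    print(regenerate_categorical(["kg", "lb"]))       # hypothetical options
    print(regenerate_string(ALPHANUMERIC, 6))
```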

Special Char Driven Local Suppression.
Definition: Since bugs are sometimes triggered by anomalous characters that cannot be parsed or processed correctly, we defined a version of Local Suppression that only preserves the special characters (defined as any non-alphanumeric character, such as *, !, and ?) contained in the value to anonymize. Special characters usually reveal virtually nothing about the original input, but they might be helpful to reproduce misbehaviors.
Failure Reproduction: The generation works the same as in Local Suppression, but the special characters in the input value are copied into random positions within the generated value.
Example: Given the value example! to be anonymized, the technique generates a new random string that includes the special character !, such as HQb!Ha.
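A minimal sketch of this variant, assuming an alphanumeric filler alphabet and a generated string whose total length defaults to the length of the original value (both are assumptions of the sketch, and the function names are hypothetical):

```python
import random
import string

ALPHANUMERIC = string.ascii_letters + string.digits


def special_chars(value):
    """Special characters, defined as any non-alphanumeric character."""
    return [char for char in value if not char.isalnum()]


def scd_regenerate(original, length=None, rng=random):
    """Generate a random string that keeps only the special characters of `original`.

    The special characters are inserted at random positions; everything else
    is replaced by random alphanumeric characters.
    """
    specials = special_chars(original)
    if length is None:
        length = len(original)
    filler_length = max(length - len(specials), 0)
    chars = [rng.choice(ALPHANUMERIC) for _ in range(filler_length)]
    for special in specials:
        # randint is inclusive on both ends, so a special char may also be appended.
        chars.insert(rng.randint(0, len(chars)), special)
    return "".join(chars)


if __name__ == "__main__":
    print(scd_regenerate("example!"))             # same length as the original
    print(scd_regenerate("example!", length=6))   # shorter, unrelated length
```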

Perturbation Techniques
Techniques that belong to this strategy replace the original values with synthetic values close to the original ones [8,13].

Noise Addition.
Definition: This technique is typically applied to continuous variables (i.e., to numbers). The general idea is to change the original value by adding or multiplying a stochastic or randomized number (i.e., the noise) to the original data [18]. Given a domain range

EXPERIMENT DESIGN

Goals and Research Questions
The goal of this study is to investigate the impact of privacy-preserving techniques on the capability to reproduce the failures experienced in the field. To this end, we framed the following research questions.
RQ1 - Effectiveness: What is the failure-reproduction rate for anonymized failure traces? This research question studies how failure-reproducing test cases derived from failure traces

Selection of the Subject Android Apps and Faults
To select the subject apps and faults, we performed an extensive manual analysis to look for failures that depend on user data, that is, failures that can be observed only if certain input values are entered. We restricted our selection to open-source F-Droid [5] apps with repositories present in either GitHub or GitLab, to make sure it is possible to inspect the app and actually identify the fault responsible for a given failure. We considered apps in the Money, Science & Education, Sports & Health, Time, Internet, and Writing categories, since these apps have a better chance of exploiting user inputs than apps in categories like Theming and Connectivity.
For every category, we manually checked at least 50 apps to identify the ones that have fillable fields, considering both the screenshots and the descriptions on their F-Droid pages. For every identified app, we checked all the issues labeled as "bug" (or simply all the issues when labels are not available) to select the issues that are reported to be caused by specific inputs. We identified a total of 29 potentially useful issues spanning 26 apps.
To verify the presence of these issues, we downloaded the APK file corresponding to the version with the issue and reproduced the failure as reported in the issue. When the APK was not available, we checked out the correct version from the GitLab or GitHub repository of the application and generated the compiled app ourselves with Android Studio. We also checked the code of the app to determine whether two identical failures of the same app originated from the same fault. We then classified failures as reproducible or non-reproducible, depending on the possibility to reproduce the failure with either an automatic Espresso [1] test case or a failure-reproducing routine. In particular, we consider a failure non-reproducible if we could neither reproduce it with Espresso nor establish a clear relationship between the inputs and the fault present in the app.
We found eight non-reproducible failures and two identical failures generated by faults that were already included in the selection. We ended up with 19 reproduced input-dependent failures caused by distinct bugs in 17 Android apps. Detailed information about all the considered cases is publicly available in our repository, along with the material needed to reproduce the experiments and the results that we obtained: https://gitlab.com/sal-unimib-anonymization/experimentation. Table 2 reports the apps, their domain and version, and a description of the bugs present in the apps.
For each reproduced failure, with the exception of two cases where the app was incompatible with Espresso, we recorded an automatic Espresso test case that exposes it. For the two cases of incompatibility, we inspected the faulty code in the apps and implemented a failure-reproducing routine that, given an anonymized input, determines whether the fault is exposed.

Configuration of the Privacy-Preserving Techniques
Depending on the nature of the data that must be anonymized, the techniques presented in Section 2 may need to be properly configured. In the following, we describe the configurations that we used.
String values can be anonymized with the Local Suppression and the Special Char Driven Local Suppression techniques. In both cases, the new value that must replace the original one is obtained according to a regular expression. We use the following four regular expressions that capture the cases we encountered in our subject apps: [!-~], when all possible string values including special characters are allowed; [A-Za-z0-9], when only alphanumeric values are allowed; [0-9.,], when only numbers with any decimal separator are allowed; and [0-9,] or [0-9.] or [0-9:], when only numbers with specific separators are allowed. All these cases are summarized in Table 3.

Table 3
Value type                                       Regex
All possible string values with special char.    [!-~]
Alphanumeric values only                         [A-Za-z0-9]
Numbers with any decimal separator               [0-9.,]
Numbers with specific separators                 [0-9,] or [0-9.] or [0-9:]

When the Special Char Driven Local Suppression technique is used and the original string includes one or more special characters, these special characters are inserted in random places within the new anonymized string. We configure the length of the generated string in two ways, experimenting with both in our evaluation. That is, the length of the generated string can be random or equal to the length of the original string. In the case of random length, we use one interval for short inputs (e.g., a note title or a loyalty card name) and another for long inputs (e.g., a description). In case the length of an input is bounded to a value lower than the maximum defined by these intervals, we set the maximum length to the maximum length accepted by the text field.
Numeric values can be anonymized with most of the privacy-preserving techniques. Local Suppression anonymizes values by generating new values within a specified interval. If the value to anonymize has boundaries defined by the application (e.g., a time value can only be in the interval [0-24]), we configure the technique with these limits. Otherwise, we set the interval depending on the nature of the value: when a small value is expected (e.g., an age), we use the interval [0-100]; otherwise, if bigger values can be used (e.g., a currency), we use the interval [0-1,000,000]. Global Recoding and Rounding require the definition of the number of partitions to be used to split the interval of definition.
Consistently with the previous definition of small and big values, we run the techniques with three configurations (using 2, 3, and 4 partitions) when a small value is expected by the app, and with three different configurations (using 50, 100, and 500 partitions) when a big value is expected. Finally, we experiment with three different noise values (30%, 40%, and 50%) for Noise Addition.
The configurations that we used for the techniques applicable to numeric values are summarized in Table 4.
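As an illustration of how a noise value such as 30% could be applied, the sketch below perturbs a number by a random relative amount. This is only one possible reading and is an assumption of the sketch: the noise is interpreted as a maximum relative deviation, drawn uniformly and applied multiplicatively, and the result is clamped to the domain of the input field when bounds are known. The function name add_noise is hypothetical.

```python
import random


def add_noise(value, noise, low=None, high=None, rng=random):
    """Perturb `value` by a uniformly random relative amount in [-noise, +noise].

    Assumption: `noise` is a fraction (0.3 for 30%) of the original value.
    Optional `low` and `high` clamp the result to the field's domain.
    """
    noisy = value * (1.0 + rng.uniform(-noise, noise))
    if low is not None:
        noisy = max(noisy, low)
    if high is not None:
        noisy = min(noisy, high)
    return noisy


if __name__ == "__main__":
    for noise in (0.30, 0.40, 0.50):
        print(noise, add_noise(100.0, noise))
    # A time-like value clamped to the interval [0, 24].
    print(add_noise(23.0, 0.50, low=0, high=24))
```

Under this reading, the interval of possible outputs is centred on the original value, which is the property that distinguishes Noise Addition from the partition-based techniques.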

Experimental Procedure
To answer RQ1-3, we follow the procedure visually illustrated in Figure 1. We start from the Espresso test case that reproduces the bug as reported from the field by the user of the application, including the data reported in the original online issue. In the few cases where a specific value was not available, we chose a value consistent with the description in the issue. To study the impact of privacy-preserving techniques, we anonymize the user data that the failure is dependent on with every applicable technique configured as discussed in Section 3.3. The anonymization of the data results in a new Espresso test case that uses the values derived from the anonymization process rather than the original values. We then execute the new test and check whether the same failure can be reproduced after the anonymization process. Since anonymization and failure reproduction imply randomness, we repeat the anonymization process 100 times for every configuration, for a total of more than 11K test executions. All tests were executed on a Huawei P9 Lite smartphone with Android 9, except for the few cases that required a specific Android version and were tested on virtual devices. In the two cases of apps incompatible with Espresso, we executed our failure-reproducing routine.
The implementation of the privacy-preserving techniques presented in this paper and the tool to run the failure reproduction process are publicly available in the following repository: https://gitlab.com/sal-unimib-anonymization/anonymization-android-tool.
The set of applications used in the study, the input that has been anonymized, the technique used for the anonymization, and the configurations used for the anonymization process are reported in detail in Table 5. We add the labels Lo, Me, Hi next to the configurations present in the table, to identify the configurations that retain less (Lo), medium (Me), or more (Hi) information from the original non-anonymized value.
To answer RQ1, we measure the bug reproduction frequency, that is, the ratio between the number of anonymized tests that reveal the same failure revealed by the original test and the number of anonymized tests. The more often a failure is revealed, the less impact a privacy-preserving technique has on the failure reproduction capability of the test cases. To answer RQ2, we compute the number of repetitions needed to reveal the original failure with a probability of 95%, that is, we estimate the number of tests that must be derived from failure traces and executed to be reasonably sure that a bug either has been reproduced or cannot feasibly be reproduced. To answer RQ3, we measure the replication frequency of the original input, that is, we measure the number of times the original non-anonymized value is generated during the failure reproduction process.
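The three measures can be sketched as follows. The function names are hypothetical. The formula used for the number of attempts is an assumption of this sketch: it treats attempts as independent trials that each succeed with the measured reproduction frequency p, and returns the smallest n such that 1 - (1 - p)^n is at least 0.95.

```python
import math


def reproduction_frequency(outcomes):
    """RQ1: fraction of anonymized tests that reveal the original failure."""
    outcomes = list(outcomes)
    return sum(1 for reproduced in outcomes if reproduced) / len(outcomes)


def attempts_for_confidence(p, confidence=0.95):
    """RQ2: smallest n with 1 - (1 - p)**n >= confidence (independent attempts assumed).

    Returns None when the failure was never reproduced (p == 0).
    """
    if p <= 0:
        return None
    if p >= 1:
        return 1
    return math.ceil(math.log(1 - confidence) / math.log(1 - p))


def replication_frequency(generated_values, original):
    """RQ3: fraction of generated values that equal the original input."""
    generated_values = list(generated_values)
    return sum(1 for value in generated_values if value == original) / len(generated_values)


if __name__ == "__main__":
    outcomes = [True] * 5 + [False] * 95      # 100 repetitions, 5 reproductions
    p = reproduction_frequency(outcomes)
    print(p, attempts_for_confidence(p))
    print(replication_frequency(["12", "40", "42"], "42"))
```

Under this assumed formula, a reproduction frequency of 1% (one success in 100 repetitions) corresponds to 299 attempts and a frequency of 5% to 59 attempts.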

RQ1 - Effectiveness
Table 6 reports the bug reproduction frequency for the privacy-preserving techniques applied to strings and numbers.
Concerning the anonymization of strings, Local Suppression severely compromises the capability to reproduce failures (the mean success rate varies between 11% and 19%), depending on the configuration. The low bug reproduction frequency for Local Suppression is expected, since almost no information is retained from the original string. Interestingly, retaining more information from the original input (configuration Hi) has, in the majority of the cases, a negligible or negative effect on the bug reproduction frequency. This happens because preserving the length of the original string is often not a relevant factor in failure reproduction, while using (longer) random strings may increase the chance of using the right combination of characters that may trigger the failure.
In line with this intuition, SCD Local Suppression performs significantly better than Local Suppression (mean success rate of 49%). This confirms our intuition that by disclosing just a piece of syntactic information that is largely irrelevant from the point of view of the user (i.e., the presence of a special character), failure reproduction can often be improved. Again, retaining the length of the string has no impact on failure reproduction. Clearly, the presence of special characters alone is not always enough to reproduce failures. In these cases Local Suppression and SCD Local Suppression have similar performance, as for the Catima Loyalty and the Binary Eye apps. In some other cases, they are helpful but not sufficient alone, since the special character(s) might have to occur at a specific position, as in the first bug of the Track & Graph app, or in the context of a specific string, as in the Task app. In some other cases the fault was dependent on the semantics of the value and the mere presence of the special character was not enough to reproduce the failure.

Table 6 (Mean row)
Strings: Local Suppression Lo 18%, Hi 11%; SCD Local Suppression Lo 49%, Hi 49%.
Numbers: Local Suppression 41%; Global Recoding Lo 56%, Me 58%, Hi 49%; Rounding Lo 63%, Me 50%, Hi 38%; Noise Addition Lo 52%, Me 54%, Hi 54%.
'Lo', 'Me' and 'Hi' in the header refer to configurations that retain less, medium, or more information from the input.
For numeric inputs, Local Suppression is the technique with the lowest success rate (41%), with only Rounding (Hi) performing worse (38%). Noise Addition and Global Recoding perform similarly: the effectiveness of Global Recoding ranged between 49% and 58%, and that of Noise Addition between 52% and 54%. Rounding performed best in some cases, but with higher variance, with an effectiveness between 38% and 63%. Global Recoding and Rounding are more sensitive than Noise Addition to the choice of the configuration. In fact, Noise Addition works with intervals that are defined around the original input value. On the contrary, the intervals used in Global Recoding and Rounding are independent from the original input value, which could fall very near or on the edge of the interval, affecting the reproduction probability in cases where the fault is caused by values near to the original one.
The characteristics of the failure to be reproduced also have an impact. In fact, there are some easy cases where most of the techniques were systematically successful, for instance due to values formatted according to the Android settings that systematically generate failures if incompatible with the expectation of the app. We also had some hard cases with a failure reproduction rate below 10%. This is due to the small set of domain values that trigger the failure (e.g., in Birday only February 29 of leap years leads to the malfunction, over all the possible dates). In cases where the bug was caused by values in a range close to the original input (e.g., Simple Calendar, Did I Take My Meds, Catima Loyalty), Local Suppression is affected by its inability to preserve any information, leading to an overall lower success rate compared to other techniques. Notably, although with low probability, it was always possible to replicate a bug using Global Recoding and Noise Addition, while Rounding tends to either quickly reveal the failure or miss it.
For three apps, the attempt to reproduce the original failure led to the discovery of new bugs. In Binary Eye some strings used to generate a QR code differ from the ones obtained when scanning the code generated by the app (e.g., }c:+8ha when coded and then decoded becomes ∼c:+8ha). In To Don't some task names when saved cause all the other task names to be cancelled and replaced by a null value (e.g., the random string S 0O}_(' was sufficient to reveal the bug). In Track & Graph - bug 1, the bug makes the app crash while creating a new multiple-value tracked habit, when the first option ends with '|', but we also discovered that option names with '||' cause the habit to not be saved.
Answer to RQ1: SCD Local Suppression should be preferred to Local Suppression when applied to strings, since it significantly increases the failure reproduction capability, while disclosing minimal information (the presence of a special character in the original string). Local Suppression applied to numbers had a significant, but not dramatic, impact on failure reproduction (41% success rate). Techniques approximating and perturbing the original value might increase the success rate (up to 63% in our experiments) at the cost of disclosing some information about the original value. Deciding how much information to preserve from the input values should be done carefully, since preserving information not correlated with the failure trigger may negatively influence reproduction.

RQ2 - Cost
To measure the cost of using anonymized data to reproduce failures instead of using actual values, we computed the number of attempts (i.e., generation of anonymized values and then generation of the concrete test cases) that must be completed to reproduce the original failure with a probability of 95%. Table 7 shows the Mean and Max number of attempts necessary across all faults for a given technique and configuration, the number of non-reproduced faults (row # NR), and the results per fault. Note that a higher average success rate does not imply fewer attempts on average, since there is a logarithmic relationship between reproduction probabilities and number of attempts.
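The non-linear relationship can be made concrete with a small worked example. The formula below is an assumption (independent attempts, each reproducing the failure with probability p), and the two pairs of probabilities are invented purely for illustration:

```latex
% Smallest number of attempts n such that at least one attempt succeeds
% with probability 0.95, assuming independent attempts with success probability p.
1 - (1 - p)^{n} \ge 0.95
\quad\Longrightarrow\quad
n(p) = \left\lceil \frac{\ln(1 - 0.95)}{\ln(1 - p)} \right\rceil

% Two faults with p = 0.50 each: mean success rate 0.50
n(0.50) = 5, \qquad \text{mean attempts} = \frac{5 + 5}{2} = 5

% Two faults with p = 0.99 and p = 0.05: mean success rate 0.52
n(0.99) = 1, \quad n(0.05) = 59, \qquad \text{mean attempts} = \frac{1 + 59}{2} = 30
```

The second pair has the higher mean success rate but requires six times as many attempts on average, because a single hard-to-reproduce fault dominates the mean.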
When anonymizing strings, Local Suppression introduces a cost hardly affordable in practice, with close to 60 test generation and execution attempts needed on average, and up to 299 attempts in the worst case. SCD Local Suppression is more practical since it requires between 15 and 21 attempts on average, with a maximum between 49 and 99.
When working with numbers, Noise Addition seems to be the most affordable solution, since it reproduced all failures with 95% confidence with a mean number of attempts between 17 and 21 and up to 59 in the worst case (Me). Global Recoding is a good choice too, as it performs similarly to Noise Addition except with Birday, which determines the higher mean and max values for this technique. This difference is due to the necessity of preserving day and month unchanged in the original input date (29-02-1996) to reproduce the failure, which is more likely to happen with Noise Addition since it creates intervals around the input value. Rounding can be useful to reduce the failure reproduction effort, but it also severely reduces the number of reproduced failures. Local Suppression is a valid alternative to Noise Addition when no information about the original value has to be disclosed, at the risk of failing to reproduce some failures.
Answer to RQ2: SCD Local Suppression is a cost-effective solution to anonymize strings. Numbers can be feasibly addressed with Noise Addition or Local Suppression, depending on the amount of information that can be disclosed.

RQ3 - Information Disclosure
We computed how often the failure reproduction process regenerated the original (pre-anonymization) value. Table 8 shows the percentage of cases in which this happened for the various combinations of apps and techniques. In the case of strings, obtaining the original value is unlikely to happen due to the size of the space of possibilities. In fact, it happened only once for one app, where the format of the string was particularly constrained by the regular expression. In the case of numbers, the replication of the original value happened slightly more frequently, with Noise Addition being responsible for the highest number of cases (which anyway consists of only three cases with a probability below 1.67%). This is due to the relatively smaller size of the numeric domain and the type of perturbations introduced by Noise Addition. Interestingly, it was not strictly necessary to reconstruct the original value in any of these cases, so the reproduction of the value was incidental and the user would not be really aware of this fact. The only exception is Birday where the failure requires the date 29-2-year where year is any leap year. Thus the user would discover the date and month of the birthday, and would restrict the birthday to leap years.
Answer to RQ3: All the anonymization techniques largely hide the values they are applied to. Sometimes the failure reproduction process may generate the original input in the attempt to reproduce the failure. This is unlikely to happen frequently, with only Noise Addition causing the reproduction of the original input in some rare cases.

Threats to Validity
One threat concerns the limited set of bugs considered in the experiments. To mitigate this threat, we systematically searched for real bugs in open source applications, and we selected Android applications from different categories to have multiple contexts in which to experiment with privacy-preserving techniques. The construction of the experimental dataset that we publicly released is already the result of significant manual effort, with hundreds of apps and reports manually inspected, as described in Section 3.2. Enlarging this dataset to address new domains is part of our future work.
Another concern is the way we configured the privacy-preserving techniques. To avoid introducing any bias, we defined a configuration policy that we described in the paper. All the configurations are reported in our online repository.
Finally, another concern is the correctness of the implementation of the privacy-preserving techniques that we used for the experiments. To mitigate this threat, we extensively tested our tools and made our artifacts publicly available.

RELATED WORK
The studies most related to our work concern approaches designed for the anonymization of the data collected during failures, and solutions for bug reproduction.
One of the first approaches designed for releasing private data in the context of testing and debugging activities, while ensuring people's privacy, is kb-Anonymity [2]. This approach exploits symbolic execution and k-anonymity to generate anonymized database tuples that do not alter the behavior of the program, that is, the same program path is executed when the program uses either the original or the anonymized values. The approach is limited to numbers and to programs whose code is accessible and analyzable with symbolic execution. Castro et al. [3] investigated a similar approach, applied to the data included in crash reports. MultiPathPrivacy [11] and RESPA [12] investigated how to weaken the requirement of preserving the same execution path when introducing anonymized values, by identifying alternative paths that still lead to the reproduction of the same bug.
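The path-preservation requirement mentioned above can be illustrated with a toy example. This is not kb-Anonymity itself, which derives path conditions with symbolic execution; the sketch only checks dynamically whether two concrete inputs follow the same branches of a hypothetical program. The functions `classify` and `same_path` and the thresholds are assumptions made for illustration.

```python
from typing import List


def classify(age: int, trace: List[str]) -> str:
    """Toy program under test; records which branch is taken."""
    if age < 18:
        trace.append("minor")
        return "minor"
    if age < 65:
        trace.append("adult")
        return "adult"
    trace.append("senior")
    return "senior"


def same_path(original: int, anonymized: int) -> bool:
    """True if both inputs exercise the same branches of classify."""
    original_trace: List[str] = []
    anonymized_trace: List[str] = []
    classify(original, original_trace)
    classify(anonymized, anonymized_trace)
    return original_trace == anonymized_trace


if __name__ == "__main__":
    print(same_path(42, 37))  # both inputs take the "adult" branch
    print(same_path(42, 70))  # the second input takes the "senior" branch
```

An anonymized value such as 37 would be acceptable under the path-preservation requirement for an original value of 42, whereas 70 would not, since it changes the executed path.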
Different from this body of work, we studied the effectiveness of privacy-preserving techniques that have been extensively applied to databases and that can be easily used to anonymize data collected during failures, without running any complicated analysis on the code of the application. The results reported in this paper provide useful insights about their effectiveness and cost, and about the specific configurations that best fit the problem of anonymizing failure data.
Our work also relates to failure reproduction. We target the case of reproducing failures from (anonymized) failure traces collected from Android applications. The reproduction of failures from similar non-anonymized traces has also been considered in other works, such as CaRCrash [17], which collects and dispatches failure traces every time a failure is detected. Similarly, CrashDroid [19] can reconstruct replayable scripts from stack traces collected during failures.
Some other techniques considered reproducing failures from bug reports using NLP techniques, such as S2RMiner [21] and ReCDroid [22]. In this study, we considered the impact of anonymization techniques on failure traces and the corresponding test cases. We leave to future work a more detailed investigation of the impact of privacy-preserving techniques on test cases derived from bug reports, although in principle the artefact used to derive the test cases should not significantly affect the conclusions of our study.
Finally, some techniques addressed the problem of reproducing failures in Java, such as BugRedux [10], JCHARMING [15], STAR [4], and EvoCrash [16]. We targeted Android since apps are often used to process personal information. Investigating other technical contexts is part of our future work.

CONCLUSIONS
Analyzing and reproducing failures from failure traces is important to fix faults in a timely manner and to develop reliable applications. However, failure traces may disclose sensitive information about the users of the applications, and must be properly anonymized before they can be used for failure reproduction. This paper studies how privacy-preserving techniques extensively exploited in the context of database systems can be adapted to the problem of failure reproduction, and presents an empirical evaluation that yields findings about their effectiveness and cost. In particular, our results show that the SCD Local Suppression technique introduced in this paper can be effective for the anonymization of strings, while numbers can be effectively anonymized with Local Suppression or Noise Addition, depending on whether some information about the original value can be disclosed. Our future work concerns experimenting with and studying privacy-preserving techniques applied to additional domains, such as web applications.

Figure 1: Overview of the experiment.

Table 1: Overview of privacy-preserving techniques.

Table 2: Reproducible Android application faults.
Some combinations of system time and edited time cause the edited time to be saved as P.M. even if it was A.M., and vice versa.
Entering a value with a dot in the from date or to date fields causes the app to crash.
A wrong number is saved if the decimal separator used does not match the one defined in the Android settings.

[...] are impacted by privacy-preserving techniques. That is, it investigates how hard reproducing failures is if the source traces are anonymized with the techniques presented in Section 2.
RQ2 - Cost: How many runs are necessary to reproduce failures with high confidence? This research question studies the number of test executions that must be performed to establish either that the failure has been reproduced or that the failure cannot be reproduced from the anonymized trace.
RQ3 - Information Disclosure: How often is the original value disclosed? This research question investigates how often the anonymized value is reconstructed while reproducing a failure, thus potentially revealing sensitive information that should remain hidden.

Table 3: Regular expressions for string values.

Table 4: Configurations for techniques applied to numbers.

Table 5: Configurations of the techniques for each application's bugs analyzed.

Table 6: Bug Reproduction Frequency

Table 7: Iterations for Reproducing Failures with 95% probability. The configuration labels in the header (including 'Me' and 'Hi') refer to configurations that retain less, medium, or more information from the input. *The mean value is calculated only over defined values ('-' cases are ignored in the computation of the mean).

Table 8: Frequency of replication of the original value.