Collection and Analysis of Sensitive Data with Privacy Protection by a Distributed Randomized Response Protocol

The data collected from personal devices is intrinsically private and should be collected through a privacy-guaranteed mechanism. Local differential privacy solves privacy problems by collecting randomized responses from each user, and it does not need to rely on a trusted data aggregator/curator. The proposed approach utilizes the randomized response technique in a novel manner: it guarantees privacy to users during data collection and simultaneously preserves the high utility of the analysis. It can be seen as a case of synthetic data generation, producing contingency tables (marginals) through a privacy-preserving mechanism. This article describes the proposed randomized response technique and discusses the motivating application domains. It shows, both theoretically and through experimental analysis, that the technique satisfies differential privacy while providing utility guarantees, with excellent results.


INTRODUCTION
Data collected from smart devices have become invaluable assets for product designers and application developers. Companies and research centers collect data from end-users and use them to update their knowledge and tailor their products and services.
The problem with massive data collection is that collecting sensitive personal data poses a significant risk to people's privacy rights. To get accurate information from individuals, the data collection process should enforce robust privacy-preservation mechanisms while preserving the utility of the collected data. We introduce a novel data collection protocol with randomized responses to achieve data collection with privacy guarantees. The protocol is client-server and runs in a network/cloud environment where the client represents an end-user or a partner with private data, and the server is the data aggregator/curator, which is honest-but-curious: it correctly performs the protocol for data exchange and aggregation but might want to know the clients' private data. The risk is one of privacy leakage, which could arise after an attack with adversarial knowledge or a differential attack. Our proposed method provides strong privacy guarantees combined with high data utility, as this work shows. We adopt a local differential privacy (LDP) approach rather than the weaker global differential privacy (GDP) approach, where aggregators store the actual data and are a single point of failure and a target for attacks. LDP is stronger because even if adversaries had access to the personal responses, they would still not be able to learn about individuals, since the responses are randomized.
Our privacy-preserving randomized response is built on the idea of the randomized response proposed by Warner in 1965 [26], a data collection technique for sensitive data, where the respondent hesitates to provide a true answer. In Section 1.1 we introduce the principles of the randomized protocol. As discussed in Section 2, surveys generated using randomized responses allow easy computation of correct population statistics while protecting the privacy of the individual and preventing reconstruction attacks. This technique can be used to inject random noise into the answers or the output of a function. Random noise injection protects from the differential attack and is one of the key components of the differential privacy model, the de facto standard for privacy-preserving query answering [8].
Unfortunately, the level of privacy provided by the randomized response [26] degrades if the same respondent repeats the survey, and the technique does not handle multivariate answers. So, to maintain a strong privacy guarantee with high utility, we need a better data collection mechanism, which we present in this work.
Finally, we show that the data collected at the aggregator provides high utility value. The proposed solution gives guarantees, at certain confidence levels, that the statistical dependencies observed in the reconstructed data correspond to the true ones. The proposed solution relies on a combination of sophisticated machine learning modeling and numerical optimization with hypothesis tests, as we show in Section 6.
In Section 2 we compare the randomized response protocol with related work. We present two improved versions of the randomized protocol and discuss their properties in Section 4; in Section 5 we discuss their convergence. In Section 6 we present the protocol's robustness in preserving statistical associations among variables.

The principle of the randomized response protocol
The survey respondent is asked to flip two fair coins in secret: if the first coin gives "Head", the respondent flips the second coin, whose outcome determines whether the answer is "Yes" or "No"; otherwise, the respondent answers truthfully.
Figure 1 shows the flow of the randomization protocol. It is simple to see that in a situation where both "Yes" and "No" answers can be denied (flipping two fair coins), the true proportion of "Yes" answers can be accurately estimated by 2(λ − 0.25), where λ is the proportion of "Yes" responses. The unknown probability of a successful event studied on a population, represented by a random variable, is correctly, and even more efficiently, inferred by the randomized protocol in Figure 1 if the parameter of the second randomization step is close to the true probability of the successful event. This observation led to our proposal that this parameter be adjusted toward its true value as the protocol evolves. The natural and more general setting is where each client has multiple attributes, and the server is interested in learning their joint distribution after observing only a sample of the population. Knowledge of the joint distribution opens the way to powerful descriptive and predictive analytical models, such as statistical inference models and Bayesian networks. In the adopted local differential privacy (LDP) approach, in the proposed protocol, a respondent's private data is generated (possibly modified by the randomized protocol itself) after selecting subsets of attributes. These values are communicated to one or more aggregators in a distributed environment. As a final step, the aggregator receiving the randomized data has the task of calculating contingency tables (CTs) with the frequencies of the observed values. Thanks to the protocol properties, we demonstrate that it is possible to reconstruct the true joint probability of the attributes from the possibly noisy values communicated by the individuals. The transmitted values do not need to correspond to the true ones for each individual, in virtue of the deniability property of the protocol. Moreover, the randomization is local to the individual users, and there is no need for a different, trusted organization to run the randomizer or add a verified amount of noise.
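As a minimal sketch of this estimator (the population, its true "Yes" proportion, and all helper names below are illustrative assumptions, not taken from the paper), the two-coin survey and the inversion 2(λ − 0.25) can be simulated as follows:

```python
import random

def randomized_response(true_answer: bool) -> bool:
    """One round of the two-coin protocol: on the first coin's 'Head' the
    answer is replaced by a second fair coin flip; otherwise the respondent
    answers truthfully."""
    if random.random() < 0.5:          # first coin: randomize the answer
        return random.random() < 0.5   # second coin decides "Yes"/"No"
    return true_answer                 # otherwise, report the true answer

def estimate_true_proportion(responses) -> float:
    """Invert E[lambda] = 0.5*pi + 0.25 to recover pi = 2*(lambda - 0.25)."""
    lam = sum(responses) / len(responses)
    return 2 * (lam - 0.25)

random.seed(0)
population = [random.random() < 0.3 for _ in range(100_000)]  # 30% true "Yes"
responses = [randomized_response(a) for a in population]
print(round(estimate_true_proportion(responses), 2))
```

Despite every individual answer being deniable, the population-level estimate lands very close to the true 30% proportion.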
Often, machine learning algorithms rely on low-order marginals as a building block and compute accurate approximations by the Maximum Likelihood principle and vine-copulas [4, 17]. Our proposed method generates low-dimensional tables on m attributes (m-way). We can generate even lower-dimensional CTs, with k < m, from these m-dimensional ones by further marginalizations. We call these further CTs higher-level. We propose to apply linear programming to the m-way CTs to make the marginals of the higher-level (k-way) ones consistent. This approach is also followed by [2].

RELATED WORK
Privacy-preserving data statistics are often considered in a centralized setting in which the data is perturbed by adding random noise from a Laplace distribution or by applying the Exponential mechanism. These perturbation techniques reduce the risk for an individual of being identified [9, 11]. However, in this classical approach, with true data in the database, individual privacy is still not guaranteed against external attacks or internal adversaries (e.g., eavesdropping). Our approach is based instead on the decentralized setting with local differential privacy. Each client randomizes its true values using a local randomization mechanism. The noisy values are then sent over the network to the aggregator, without the need to be protected, and then aggregated to produce the desired statistics.
A multitude of approaches exist: they combine randomized response techniques [26] to create sophisticated noise-addition mechanisms [10, 12, 18, 21, 24]. Google RAPPOR [10] collects users' data in a private setting, where the responses are mapped to a Bloom filter using a hash function. RAPPOR implements a two-step randomization technique: first, it maps the user string onto a Bloom filter using a hash function; second, it flips each bit in the Bloom filter with given probabilities.
Apple implements privacy in iOS to collect user statistics through sketching [3, 24]. Microsoft collects users' app statistics privately using rounding and memoization techniques [7]. Wang et al. [25] propose an optimization technique with asymmetric randomized response and hashing functions. Kairouz et al. [14] propose optimal generalizations of randomized response to estimate the frequency of a single categorical attribute.

PRELIMINARIES
We consider a setting where each client owns a set of attributes.The centralized server collects these attributes in a privacy-preserving manner and releases the joint distribution of their values.

Notations
We consider a dataset D with d attributes A = (A_1, ..., A_d). Table 1b shows a CT over a set of two attributes. Table 1c shows a marginalization.

Differential Privacy
The current de facto standard of privacy protection is differential privacy [8, 9]. It is interpreted as a statistical property that compares the output of a query on the database when the individual is included in the database with the alternative without the individual. To protect the individual's privacy, noise is added either to the data or in the query mechanism M that answers requests on the data. The privacy guarantee of the randomization mechanism is quantified by the privacy budget parameter ε, which controls how different the probabilities are that the query returns the same output on the two databases, differing in a single individual.

Definition 3.2. (Differential Privacy [9]) A randomization mechanism M is ε-differentially private if for any two neighbouring databases D_1, D_2 ∈ N^|A| that differ in a single entry, and any subset R of the outputs of M,

Pr[M(D_1) ∈ R] ≤ e^ε · Pr[M(D_2) ∈ R],

where the probability is taken over the randomness of M.

In our case, the mechanism (or query) M(D) is represented by a collection of CTs T_{A_i} returned by the randomized response protocol on D, with A_i one of the subsets of attributes in A. We consider three error measures to evaluate the performance of the proposed randomization method (the lower the better).
For the first, we calculate the χ² independence test between the true and the noisy CT.
The second is the ℓ2 distance between T'_{A_i} and T_{A_i}, in which the CTs are viewed as vectors of cell values. In the context of the randomization method, this error distance can be regarded as a random variable, due to its dependency on the noise introduced by the method itself. The Expected Squared Error (ESE) is the expected value of the square of the error distance, an aggregation of squared errors across individual cells; ESE is frequently employed to assess the utility of a given method. The third measure is the Jensen-Shannon divergence between T'_{A_i} and T_{A_i}, both normalized by dividing each cell value by the sum of the cells (so that the probability mass is 1). It would be natural to apply the Kullback-Leibler divergence KL(T'_{A_i} || T_{A_i}), but it can be undefined when T_{A_i}[v] = 0 for some cell v. Thus, we use the Jensen-Shannon divergence [19], a symmetrized and smoothed version, given as:

JS(P || Q) = (1/2) KL(P || M) + (1/2) KL(Q || M), where M = (1/2)(P + Q).
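The last two measures can be sketched in a few lines (assuming CTs stored as nested lists of counts; all function names are illustrative):

```python
import numpy as np

def l2_distance(t_noisy, t_true) -> float:
    """CTs flattened to vectors; Euclidean distance between cell values."""
    a = np.asarray(t_noisy, dtype=float).ravel()
    b = np.asarray(t_true, dtype=float).ravel()
    return float(np.linalg.norm(a - b))

def js_divergence(t_noisy, t_true) -> float:
    """Jensen-Shannon divergence between the two CTs normalized to
    probability mass 1; well defined even when one table has empty cells."""
    p = np.asarray(t_noisy, dtype=float).ravel(); p = p / p.sum()
    q = np.asarray(t_true, dtype=float).ravel(); q = q / q.sum()
    m = 0.5 * (p + q)
    def kl(a, b):
        mask = a > 0                      # 0*log(0) contributes nothing
        return float(np.sum(a[mask] * np.log2(a[mask] / b[mask])))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

true_ct = [[30, 10], [10, 50]]            # toy ground-truth CT
noisy_ct = [[28, 13], [11, 48]]           # toy noisy CT
print(l2_distance(noisy_ct, true_ct), js_divergence(noisy_ct, true_ct))
```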

RANDOMIZED RESPONSE BLOCK AGGREGATION
This section presents the proposed method Randomized Response Block Aggregation (RRBA).
Before querying the end-users, the aggregator generates disjoint subsets A_i of m attributes taken from the original set of d attributes to form a certain number of size-m CTs called views V. The subsets in V form separate views on the sample population. The union of the subsets in the views should be as large as possible. The aggregator arbitrarily selects a combination of views from the possible ones for querying each single client, whose attribute values may be randomized in his/her response. This arbitrary selection, which changes for each client, provides an extra layer of protection in the randomization protocol. These views privately publish a synopsis of the entire dataset. Subsequently, the server reconstructs any higher-order marginals from these views. To show how attributes are assigned to views, we use a running example with d = 6 attributes: {A, B, C, D, E, F}. With m = 2, we have three combinations of 2 distinct attributes per view. A list of alternative views is generated for each individual.
For the first view V_1 one attribute is left out, for the second view another one, and so on: just a single one each time, because it could not be paired with another attribute without repeating an attribute within the same view. If the first alternative is selected, the view is formed by two combinations of attributes, and both combinations are considered for the same individual. The attributes in any combination are randomized together, thus keeping intact possible statistical dependencies between them.
This step is necessary because the randomization protocol must not generate randomized values of the same attribute from the same individual multiple times. Indeed, if an eavesdropper observed multiple outcomes of the same attribute, even if combined with others, it would observe the true values with higher probability, thus distinguishing them from the randomized ones. An alternative solution would be to maintain the value generated for each attribute in the internal memory of the client's device. However, this solution is not always possible for all devices and would require a large memory size for datasets with many attributes. Observe that any pair of attributes is assigned to at least one view. Since independent noise is added through these views, marginalizing two different CTs from these views to obtain the same marginals would likely give different results. To make the marginalizations generated from these views consistent, we perform the constrained optimization technique discussed in Section 4.3.
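The exact view-construction procedure is only described informally here; as one illustrative reading, assuming each alternative view partitions the attributes of the running example into disjoint size-2 combinations, the alternatives can be enumerated as follows:

```python
def pairings(attrs):
    """Enumerate all ways to partition an even-sized attribute list into
    disjoint pairs; each partition is one candidate view of 2-way CTs."""
    if not attrs:
        yield []
        return
    first = attrs[0]
    for i in range(1, len(attrs)):
        pair = (first, attrs[i])               # pair the first attribute up
        rest = attrs[1:i] + attrs[i + 1:]      # remaining attributes
        for tail in pairings(rest):
            yield [pair] + tail

views = list(pairings(list("ABCDEF")))
print(len(views))   # 15 ways to split 6 attributes into 3 disjoint pairs
```

Because each attribute appears exactly once per alternative, no client is ever asked to randomize the same attribute twice, matching the requirement above.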
We have two different versions of our protocol. In the first version, the aggregator selects arbitrary combinations from a view V_i ∈ V. The aggregator sends such a combination as a question, such as: "What is your age and which region do you belong to?" Clients' responses pass through the randomized mechanism, which ensures that the aggregator collects either randomly generated responses or true ones. In the second version, we divide the clients into groups called blocks. We then perform randomized data aggregation in parallel within a block. Once all responses are collected, the aggregator moves to the next block. Before the next block is processed, the probability distribution used to generate random responses is updated to be closer to the true one; this is done by updating the probability distribution with the responses collected in the previous block.

Fundamentals of the Randomized Response Block Aggregation Method
Given a set of views V, the aggregator arbitrarily selects a view V_i ∈ V comprised of multiple combinations of attributes. On all these combinations of attributes, the responses are collected from the client in the ε-LDP setting. For each combination of attributes in V_i, the aggregator initializes the joint distribution by a CT whose cell values follow the uniform distribution. A first random variable, implemented by drawing a value uniformly distributed between 0 and 1, controls whether the user communicates the true values of the combination of queried attributes. If the random value is above p, "fake" values are communicated to the aggregator, according to a second random variable, also drawn uniformly in [0, 1]. The outcome of this second random variable selects one of the cells (denoted by v) of the CT in proportion to its probability; in turn, each cell corresponds to a combination of the categories of the attributes. Emitting a "fake" value is thus a type of Monte Carlo sampling from the discrete joint distribution stored in the CT. This "fake" response is emitted in such a way as to disclose a "controlled" amount of information about the client's true attribute values. Hence, while limiting the aggregator's ability to learn with confidence the true values of the client, Monte Carlo sampling improves the utility of our protocol by emitting combinations of values according to their probability as stored in the CT.
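The client-side step can be sketched as follows (a sketch with illustrative names; the inverse-CDF sampler stands in for the Monte Carlo draw from the CT distribution):

```python
import random

def client_respond(true_cell: int, ct_probs, p: float,
                   rng: random.Random) -> int:
    """Client-side randomizer: with probability p report the true cell;
    otherwise emit a 'fake' cell drawn by Monte Carlo sampling from the
    current CT distribution ct_probs (one probability per cell, summing to 1)."""
    if rng.random() < p:                 # first draw: answer truthfully
        return true_cell
    u, acc = rng.random(), 0.0           # second draw: inverse-CDF sampling
    for cell, prob in enumerate(ct_probs):
        acc += prob
        if u < acc:
            return cell
    return len(ct_probs) - 1             # guard against float round-off

rng = random.Random(1)
uniform = [0.25, 0.25, 0.25, 0.25]       # aggregator's initial (uniform) CT
print(client_respond(true_cell=2, ct_probs=uniform, p=0.2, rng=rng))
```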
Once the aggregator receives a response from the client, it reconstructs a noisy CT T'_{A_i} from the CT T_{A_i} used for the previous client. For the reconstruction it applies Equation 4:

T'_{A_i}[v] = (n_v / n − (1 − p) · T_{A_i}[v]) / p    (4)

where n_v is the observed number of clients who communicated the attribute values represented by v and n is the total number of clients. The equation is justified by the fact that n_v counts the observed responses corresponding to cell v in CT T_{A_i}, and the responses come from the execution of the randomization protocol: they are outcomes of the true probability distribution with probability p (the first coin gives "Head"), and random outcomes controlled by the probability distribution in T_{A_i} with probability (1 − p) (the first coin gives "Tail").
The aggregator then updates its table, T_{A_i} = T'_{A_i}, and sends the updated table to the next client for the next randomized response. That client now uses the updated probabilities T_{A_i} in the Monte Carlo sampling.
Observe that the aggregator has no access to the client's true values. Thus, the proposed mechanism ensures local differential privacy. Algorithm 1 outlines the complete working of our protocol, including both the client-side and aggregator procedures. In the block version, the aggregator collects the responses of a whole block before applying Equation 4, where now n is the block size. When all the responses are collected, the aggregator publishes the noisy CT T_{A_i} to the server. The overview of our proposed randomized response protocol and the communication between the aggregator and its end-users is shown in Figure 2. It shows that multiple combinations of attributes contained within a view V_i are sent to clients together with the corresponding noisy CTs T_{A_i} for the execution of the randomized protocol. The server receives the responses and aggregates them. The block size b is defined by the data aggregator/curator. With the algorithm of Section 5 and the experiments in Section 7.1, we demonstrate the selection of the optimal block size, which leads to the convergence of the estimated probabilities in the CTs to the true probabilities.
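Putting client and aggregator together, a toy simulation of the block version (hypothetical distribution, block size, and parameter values; the inversion in `reconstruct` is an assumed reading of the garbled Equation 4, n_v/n = p·π_v + (1−p)·T[v]) shows the CT probabilities drifting toward the true ones as blocks are processed:

```python
import random

def reconstruct(counts, ct_probs, p):
    """Assumed Equation 4: invert the observed frequency
    n_v/n = p*pi_v + (1-p)*T[v] to estimate the true cell probabilities,
    clipping negatives and renormalizing to probability mass 1."""
    n = sum(counts)
    est = [max((c / n - (1.0 - p) * t) / p, 0.0)
           for c, t in zip(counts, ct_probs)]
    total = sum(est)
    return [e / total for e in est]

def run_block(true_probs, ct_probs, p, block_size, rng):
    """Collect one block of randomized responses, then update the CT."""
    cells = range(len(true_probs))
    counts = [0] * len(true_probs)
    for _ in range(block_size):
        if rng.random() < p:                               # truthful answer
            v = rng.choices(cells, weights=true_probs)[0]
        else:                                              # fake: sample the CT
            v = rng.choices(cells, weights=ct_probs)[0]
        counts[v] += 1
    return reconstruct(counts, ct_probs, p)

rng = random.Random(7)
true_probs = [0.1, 0.2, 0.3, 0.4]   # unknown distribution to recover
ct = [0.25] * 4                     # uniform initialization
for _ in range(20):                 # 20 blocks of 5000 clients each
    ct = run_block(true_probs, ct, p=0.3, block_size=5000, rng=rng)
print([round(x, 2) for x in ct])
```

After a few blocks the table used for fake responses is already close to the true distribution, which is what makes the fake answers informative without exposing any individual.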

Differential Privacy of Randomized Response Block Aggregation
The proposed mechanism aims to minimize the risk of disclosure to ensure a strong privacy guarantee while satisfying the strict concept of ε-LDP. It promises strong privacy regardless of the amount of background knowledge of an adversary: even with a substantial amount of auxiliary information, an adversary could not confidently identify the true responses of the clients. Since a single report from the client contributes to the count of a single cell v in T_{A_i}, the privacy level ε is independent of the number of cells in T_{A_i}. Hence, we need to prove that ε-differential privacy holds for only a single CT cell.
Theorem 4.1. The proposed randomized response protocol satisfies ε-differential privacy, with ε ≥ ln(1/(1 − p)), where p is the probability that the first coin gives "Head".

Proof. Let us consider two CTs T¹_{A_i} and T²_{A_i}, realizations of the CT T_{A_i} on the attribute subset A_i, that come respectively from two databases D_1 and D_2 that differ in a single record. Let v be the reported combination of attribute values returned by the proposed randomization protocol from the record r that differs in the two databases; it corresponds to the cell T_{A_i}[v]. According to the definition of differential privacy [9], we need to consider when the proposed randomization protocol, working as a randomized mechanism, transforms the input databases D_1 and D_2 into the same CT T_{A_i}, regardless of having D_1 or D_2 in input. Let us assume that q is the probability that the combination of categorical attribute values corresponding to cell T_{A_i}[v] occurs in the database. According to the proposed randomized protocol, these attribute values are reported if the first coin draws "Head" and they are the true values: this occurs with probability pq. In addition, the first coin could instead give "Tail", with the emitted values drawn as a consequence of the second random event: this overall event occurs with probability (1 − p)q. On the other database, with a different record r', the only possibility that the randomized protocol returns the same value as above is that the first coin gives "Tail" and the second random event returns the values corresponding to cell T_{A_i}[v]: this occurs with probability (1 − p)q. Mathematically, we obtain:

e^ε ≥ (pq + (1 − p)q) / ((1 − p)q) = 1 / (1 − p)    (5)

From the opposite side, when D_1 does not contain r but D_2 does, we obtain e^ε ≥ (1 − p), which is always satisfied for 0 ≤ p ≤ 1. □

Equation 5 shows the relationship between the parameter ε (the privacy budget that controls the amount of privacy) and the parameter p of the randomized response protocol (the fraction of times clients respond truthfully). Notice that it does not depend on q, the probability of the emitted value; thus, it is valid regardless of the response.
Decreasing p makes ε arbitrarily low, the desired situation, since it allows the randomized protocol to provide stronger privacy preservation. As a drawback, with low p the convergence of the reconstruction of the true probability distribution from the observed responses becomes slower, as we will see from the experimental results. On the opposite side, as p grows, the risk increases that true values are emitted too frequently, and ε cannot be reduced to small values. The relationship between p and ε is shown in Figure 4. We derive the values of p using Equation 6:

p = 1 − e^(−ε)    (6)

and identify at what value of p we see convergence in the observed values from the randomization protocol, using a given block size. In the experiments of Section 7.2 we discuss the effect of different values of p on the convergence at different block sizes.
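The two directions of the ε–p relationship can be computed directly (the exact symbols are garbled in this copy, so the forms ε = ln(1/(1 − p)) and p = 1 − e^(−ε) used below are an assumed reconstruction of Theorem 4.1 and Equation 6; helper names are illustrative):

```python
import math

def epsilon_for(p: float) -> float:
    """Privacy budget implied by truth probability p (assumed Theorem 4.1)."""
    return math.log(1.0 / (1.0 - p))

def p_for(epsilon: float) -> float:
    """Truth probability to use for a target budget (assumed Equation 6)."""
    return 1.0 - math.exp(-epsilon)

# Tabulate the trade-off for some of the p values used in the experiments
for p in (0.009, 0.095, 0.221, 0.5):
    print(f"p = {p:5.3f}  ->  epsilon = {epsilon_for(p):.3f}")
```

Small p (clients rarely truthful) gives a tiny ε, i.e., strong privacy, at the cost of slower convergence.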

Consistency between Noisy Tables
Given a set of noisy views, the server wishes to release marginals of some attributes with a privacy guarantee. Since independent noise is added to each attribute combination within a view, aggregating marginals from the different views will create inconsistencies in the marginals of the common attributes. Suppose we have T_B, where B' ⊆ B ⊆ A are subsets of the attributes. We use the symbol T_{B'} ← [T_B] to denote the marginal over B' calculated from T_B by aggregating the corresponding entries.
Consistency between views. Given a set of views V and a set of attributes B, we can compute k-way marginals T_B. When at least one view V_i ∈ V includes all the attributes in B, i.e., B ⊆ V_i, we can reconstruct T_B by summing over the corresponding entries of B in T_{V_i}, that is, using T_B ← [T_{V_i}]. However, when we have multiple views V_i such that B ⊆ V_i, we need to perform a linear optimization technique to return consistent marginals from all the views V_i that cover the attributes in B. When B ∩ V_i contains s attributes, then T_{V_i} provides exactly 2^s constraints on the cells of T_B. We can extract all these linear constraints from all the views to generate an under-specified system of equations.
One can utilize the ℓ1-norm optimization technique discussed in [2] to reconstruct the marginals T_B. This technique does not yield a unique solution, and linear programming has no preference among the different solutions. So we employ another constrained optimization technique, the ℓ2-norm (least-squares) solution. We follow a quadratic programming approach similar to the work in [20] to solve the under-specified system of equations as a minimization problem. This is a quadratic optimization problem, and we solve it with convex optimization approaches [6].
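A minimal worked instance of the ℓ2 reconciliation (toy numbers; the explicit KKT solve below stands in for a general quadratic-programming solver): two views both cover the same attribute, and marginalizing their noisy CTs gives two inconsistent estimates of its 1-way marginal, which we reconcile in the least-squares sense subject to the cells summing to the sample size.

```python
import numpy as np

marg_from_view1 = np.array([52.0, 48.0])   # marginal of A from view 1
marg_from_view2 = np.array([47.0, 49.0])   # marginal of A from view 2 (inconsistent)
n = 100.0                                  # known total sample size

# min ||Ax - b||^2  s.t.  1^T x = n, solved via its KKT linear system
A = np.vstack([np.eye(2), np.eye(2)])      # each view constrains x directly
b = np.concatenate([marg_from_view1, marg_from_view2])
ATA, ATb = A.T @ A, A.T @ b
ones = np.ones((1, 2))
kkt = np.block([[2 * ATA, ones.T],
                [ones, np.zeros((1, 1))]])
rhs = np.concatenate([2 * ATb, [n]])
sol = np.linalg.solve(kkt, rhs)
x = sol[:2]                                # reconciled, consistent marginal
print(x)   # -> [50.5 49.5]
```

The solution averages the two observed marginals and then shifts both cells equally so the total matches n, which is exactly the ℓ2-closest consistent table.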

CONVERGENCE AND BLOCK SIZE ESTIMATION
We show that the probabilities generated from T_{A_i} converge to the true probabilities after the protocol has aggregated the observations sent by the individuals over a certain number of blocks of size b. The value T_{A_i}[v]^{B_t} is the estimated probability of cell v of the CT after running the randomized protocol on the users of block B_t, where the superscript B_t denotes the block number. The estimation of the probabilities by the protocol converges to the true probabilities by oscillating around the true value within a tolerance interval related to the error in observing a Bernoulli variable. The tolerance interval is given by the width of the confidence interval of the Bernoulli variable with success probability equal to the true but unknown value q; the interval width is estimated as follows.
If the approximation of the Bernoulli distribution with the Normal distribution holds (i.e., if nq > 5, with q the estimated cell probability T_{A_i}[v]), we can use a symmetric interval, whose size can be estimated by 2 · z_{1−α/2} · σ, with σ the standard deviation of the Bernoulli distribution. Otherwise, maximum-likelihood confidence intervals on the log odds must be used. We set the α confidence level to the standard values, e.g., 0.05 or 0.01. The latter means that the estimated probability value will remain within the confidence interval with probability equal to 1 − α.
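The normal-approximation interval width 2 · z_{1−α/2} · σ can be sketched as follows (an illustrative helper with the two standard quantiles hard-coded; it assumes the nq > 5 regime described above):

```python
import math

def normal_ci_width(q_hat: float, n: int, alpha: float = 0.05) -> float:
    """Width 2*z_{1-alpha/2}*sigma of the normal-approximation confidence
    interval for a Bernoulli proportion estimated from n observations.
    Only valid when n*q_hat > 5 (otherwise use log-odds MLE intervals)."""
    z = {0.05: 1.96, 0.01: 2.576}[alpha]   # standard normal quantiles
    sigma = math.sqrt(q_hat * (1.0 - q_hat) / n)
    return 2.0 * z * sigma

# Tolerance used as the convergence threshold for a cell with q ~ 0.3
print(round(normal_ci_width(0.3, 500), 4))
```

The width shrinks as the block count (and hence n) grows, which is why the oscillation of the estimates eventually stays inside the tolerance interval.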
The convergence algorithm proceeds as follows: (1) initialization with t = 0; (2) at each iteration t + 1, run the RRBA protocol on the next block and estimate the cell probability as the average of the two consecutive values observed in consecutive blocks; (3) stop when |T_{A_i}[v]^{B_t} − T_{A_i}[v]^{B_{t−1}}| < δ*, where δ* is the size of the confidence interval.

TESTING FOR ASSOCIATION
One of the first questions posed when dealing with categorical attributes is whether they are independent. The χ² test of independence [1] is one of the most common statistical tests for categorical attributes; it compares the observed frequencies of the combined attribute values with the frequencies estimated under the assumption that the attributes are independent. The latter estimation is obtained by Maximum Likelihood Estimation, denoted by μ_{ij} for cell (i, j) of the CT T_{A_i}. To perform a similar test of independence on a noisy version of the table, we need to determine an estimate of μ when we do not have access to the true cell counts in the CT. Suppose we only have access to the noisy cell values in T_{A_i}, where noise is added to each cell independently, for instance, using our randomization protocol. To find the best estimates of μ given the noisy cells, we perform a two-step MLE calculation similar to the work of [15, 16].
In the two-step MLE procedure, we first find the most likely CT T_{A_i} given the noisy one. This optimization has an optimal solution that may not be unique and can be sensitive to the initial guess. To overcome this problem, we add a strongly convex function to the objective, with a mixing parameter γ in the range [0, 1]. The resulting objective function has the form of the elastic-net regularizer [27] proposed by [16]; its solution converges to the solution provided by the ℓ1 norm when γ is sufficiently large. For the test of independence, in the two-step MLE calculation, if any cell value T_{A_i}[v] < 5, we follow the commonly chosen rule of thumb and accept H_0.
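As a noise-free baseline (an illustrative helper, not the paper's two-step MLE variant), the classical χ² statistic on a 2-way CT compares observed counts with the MLE expected counts under independence, E[i][j] = row_i · col_j / n:

```python
def chi2_independence(ct):
    """Pearson chi-squared statistic for a 2-way contingency table:
    sum over cells of (observed - expected)^2 / expected, with the
    expected counts computed from the row/column marginals."""
    rows = [sum(r) for r in ct]
    cols = [sum(c) for c in zip(*ct)]
    n = sum(rows)
    stat = 0.0
    for i, r in enumerate(rows):
        for j, c in enumerate(cols):
            expected = r * c / n            # MLE under independence
            stat += (ct[i][j] - expected) ** 2 / expected
    return stat

# Strongly associated table: statistic far above the chi2(1) critical
# value 3.841 at the 0.05 level, so H0 (independence) is rejected.
print(round(chi2_independence([[40, 10], [10, 40]]), 2))   # -> 36.0
```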

EXPERIMENTS
For experimental reproducibility, we use three publicly available datasets (for Bayesian networks): Survey [22], Alarm [5], and Child [23]. They vary in the number of instances and attributes, as described in the overview of Table 3. All attributes are discrete.

Monte Carlo simulation: Convergence of the randomization protocol
To test the convergence of the second version of the proposed randomized response protocol, we consider attribute values whose probability of occurrence is in {0.0285, 0.072, 0.116, 0.224, 0.356, 0.446, 0.524, 0.732} and vary the block size b ∈ {18, 50, 150, 250}. We perform 40 trials on 200 blocks for each probability value and block size. We average the number of tuples emitted when the condition |T_{A_i}[v]^{B_t} − T_{A_i}[v]^{B_{t−1}}| < δ* holds and remains valid throughout the blocks.

Convergence Results
We perform the test of convergence on the datasets (Survey, Alarm, and Child). We plot the results of the experiments in Figure 5, where the x-axis represents the block size and the y-axis shows the number of tuples emitted when convergence is reached. The convergence behavior of the proposed randomized method is similar in all three datasets. A smaller block size makes it easier to achieve early convergence at both low and high probability values. Hence, it is sufficient to have a block size equal to the dimension of the CT. We perform similar experiments on convergence with different values of p (the probability that the first coin is "Head"). Due to computational limitations, we focused on a few probability values to analyze convergence for varying values of p. The selected probabilities of the true attribute values are P(v) ∈ {0.072, 0.116, 0.356, 0.446}, the block sizes b ∈ {18, 50, 150, 250}, and the probabilities of the first coin "Head" p ∈ {0.009, 0.048, 0.095, 0.139, 0.221}.
At p ∈ {0.009, 0.048, 0.095}, none of the probability-reconstruction processes converges at the given block sizes. Instead, at p = 0.139, the reconstruction process of the higher probability values P(v) (set at 0.356 and 0.446) converges with the larger block sizes, i.e., 150 and 250. At p = 0.221, the reconstruction process of all the probability values converges with the larger block sizes, as shown in the graph of Figure 6. In the graph, there is no convergence for any of the probability values when the block size is 18 or 50. If we increase the block size, the reconstruction processes converge for all the probabilities P(v). A similar behavior is observed at p = 0.295. The results of Figure 6 show that for a smaller value of p we must select a larger block size so that the reconstruction of the probabilities converges; for a higher value of p we see convergence at smaller block sizes, as shown in Figure 5.

Monte Carlo Simulation: Test of Independence
We want to test whether the addition of noise destroys independence (null hypothesis rejected). We generate a k-way noisy CT T_{A_i} using the proposed randomization technique. We calculate the estimates μ of the cells using the two-step MLE procedure. Using these estimates, we sample N > 1/α CTs (where α is the significance level, 0.05). We then add noise to these sampled tables using the randomized response protocol. Using the same two-step MLE calculation, we obtain N different χ² values from these sampled noisy tables. We rank these statistics, choosing the ⌈(N + 1)(1 − α)⌉-th as the threshold t_α. If χ² > t_α we reject H_0; otherwise, we accept H_0. If at any point the two-step MLE calculation outputs a cell count < 5, we accept H_0.
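The ranking rule for the Monte Carlo critical value can be sketched as follows (illustrative helper and toy statistics; the simulated χ² values would come from the noisy sampled tables described above):

```python
import math

def mc_threshold(stats, alpha=0.05):
    """Monte Carlo critical value: given N > 1/alpha simulated statistics,
    take the ceil((N+1)*(1-alpha))-th smallest as the rejection threshold."""
    n = len(stats)
    assert n > 1.0 / alpha, "need N > 1/alpha simulated tables"
    rank = math.ceil((n + 1) * (1 - alpha))
    return sorted(stats)[min(rank, n) - 1]

# Toy usage: 99 simulated statistics; at alpha = 0.05 the threshold is the
# ceil(100 * 0.95) = 95th smallest value.
sims = list(range(1, 100))
t_alpha = mc_threshold(sims, alpha=0.05)
print(t_alpha)   # prints 95
```

An observed χ² above `t_alpha` leads to rejecting H_0, mirroring the decision rule in the text.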

Significance Results
We show how the tests of independence perform on real-world data when H_0 is either rejected or accepted. We perform the independence testing on 2-way, 3-way, and 4-way CTs with binary attributes. Note that the independence tests can also be performed on arbitrary r × c noisy CTs generated by the proposed method. Notice that as the number of values increases, the proposed protocol is more robust than the others and succeeds in the tests a higher number of times.
In the above experiments with the Laplace distribution, since it does not provide critical values, we used the true values of the attributes (known in advance) for the comparison with the noisy data. If this were not possible, one could also find the critical values from simulated data using the R package "CompQuadForm".
Table 4 compares the performance of the proposed method with state-of-the-art competitors (Laplace noise and MCIndep [13]) using a confusion matrix. We perform 100 trials with H₀ rejected and 100 trials with H₀ accepted, with CTs generated parametrically. The accuracy of the proposed method is excellent (96.5%, 94%, and 93.5%) on all k-way CTs. These results are better than those of both the Laplace and MCIndep methods.

Performance using ℓ 2 norm and Jensen-Shannon distance
We evaluate the performance of the proposed randomization protocol using the ℓ₂ norm. For evaluation purposes, we use the noisy 2-, 3-, and 4-way CTs, which are compared with the ground truth. The Laplace noise is drawn from Lap(0, b) with zero mean and a scale that depends on the privacy budget, b = 2|T_{r,c}|/ε. We performed 100 trials on the Survey and Alarm datasets and report the average performance in Table 5 and Table 6. Figure 7 shows the distribution of the performance metrics.
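The two evaluation metrics can be computed with a few lines of NumPy. The sketch below implements the ℓ₂ distance, the Jensen-Shannon distance, and a Laplace baseline with the scale b = 2·|CT|/ε described above; the clip-and-renormalize step for the Laplace baseline is an assumption on our part, since the paper does not state how negative noisy cells are handled.

```python
import numpy as np

rng = np.random.default_rng(2)

def l2_distance(p, q):
    # Euclidean distance between two (flattened) distributions.
    return float(np.linalg.norm(np.asarray(p, float) - np.asarray(q, float)))

def js_distance(p, q):
    # Jensen-Shannon distance: sqrt of the JS divergence (base 2),
    # so identical distributions give 0 and disjoint ones give 1.
    p, q = np.asarray(p, float), np.asarray(q, float)
    m = 0.5 * (p + q)
    def kl(a, b):
        mask = a > 0
        return float(np.sum(a[mask] * np.log2(a[mask] / b[mask])))
    return float(np.sqrt(0.5 * kl(p, m) + 0.5 * kl(q, m)))

def laplace_noisy_ct(ct, eps):
    # Laplace baseline: add Lap(0, b) noise with scale b = 2*|CT|/eps
    # to each cell, then clip to >= 0 and renormalize (assumption).
    ct = np.asarray(ct, dtype=float)
    scale = 2 * ct.size / eps
    noisy = np.clip(ct + rng.laplace(0.0, scale, ct.shape), 0, None)
    return noisy / noisy.sum()
```

Flatten a noisy CT and its ground truth into probability vectors before passing them to `l2_distance` or `js_distance`.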
From Table 5 and Table 6, the proposed randomization protocol has the lowest average ℓ₂ distance on the Survey and Alarm datasets. The proposed protocol has the lowest average distance on higher-dimensional tables when the noise variance is large (ε = 0.35 and p = 0.4). When ε = 0.5 and p = 0.5, our protocol wins on all CTs. These tables also show that, compared with Laplace noise, our proposed randomization model has a lower Jensen-Shannon distance (a lower value means the noisy distribution is closer to the ground truth). The results from the experiments (performance metrics using the independence test, ℓ₂ distance, and Jensen-Shannon divergence) show that the proposed randomization method wins over Laplace noise. The proposed privacy protocol maximizes utility in the released CTs while ensuring ε-differential privacy.

CONCLUSION
In this work, we systematically explore the problem of collecting and analyzing data from smart devices under ε-local differential privacy, in which the aggregator/server is honest-but-curious, has access to randomized responses from users, and reconstructs statistical models based on the perturbed data. The server computes accurate statistics from the released joint distributions. With the experiments, we showed that our protocol achieves high utility in reconstructing the probabilities of attribute values, with a low error bound. In future work, we will use hash functions to store CTs to reduce computation and communication overheads.

Figure 1: The flow of the randomized protocol and two flips of coins, with a binary attribute Att

3.2.1 Utility Goal of Our Randomization Method. The utility of our randomization protocol stems from the possibility of reconstructing k-way CTs whose values are close to the true ones T_{r,c}.

Figure 3: The intervals of the cumulative probability distribution function that make each probability interval correspond to a cell of the CT

4.1.1 The improved version of the protocol. The second, improved version of the randomized response data aggregation works similarly to the first version, except that now the clients are divided into groups called blocks. The aggregator executes the collection of responses from each client in parallel within the blocks. The aggregator then aggregates the responses from the blocks and updates the CT using the update equation.

Figure 4: Graph of the relationship between the protocol parameter p of the first coin "Head" and the privacy budget ε

(A_1, A_2, ..., A_k). We use V_j to denote the domain of the values of attribute A_j and v_ij to represent a possible value in V_j. A contingency table (CT) involving a subset of the attributes is called T_s. We use (r, c) to represent the attribute values (entry points) in a CT, with the values for a subset of attributes r as rows and another subset c as columns. We use T_{r,c}[i] to represent the cell value of that CT at those entry points. |T_{r,c}| denotes the cardinality of the CT. The probability of an attribute value v_ij is denoted by P(v_ij). Each row in D represents a single user or client u. The notations are summarized in Table 2. Example 3.1: Database D in Table 1a has six attributes: A = {adult, old}; R = {big, small}; E = {high, uni}; O = {emp, self}; S = {M, F}; and T = {car, train, other}. It is aggregated with the count function applied to subsets of their values.

Table 1: Example of a dataset, CT, and the marginals

Table 2: Summary of notations.
The responses depend on two random variables, drawn with the predefined probabilities p and q. Probability p is tunable to adjust the privacy and utility of the responses. Probability q is randomly drawn between 0 and 1: it represents the value of the cumulative joint probability function of the attribute values. It makes each combination of categorical attribute values represented in the multivariate CT correspond to a continuous probability value with which these categorical values are observed. Monte Carlo sampling exploits this correspondence to draw first the probability value and then return the corresponding combination of categorical attribute values.
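The two-coin mechanism with cumulative-probability sampling can be sketched as follows. The joint distribution, cell labels, and function names below are hypothetical illustrations; the sketch only shows the interplay of p (first coin) and q (uniform draw inverted through the cumulative distribution, cf. Figure 3), not the full protocol.

```python
import bisect
import random

random.seed(3)

def make_cdf(joint):
    # Cumulative distribution over the cells of a (flattened) joint CT,
    # so each cell corresponds to an interval of [0, 1].
    cells, cum, total = [], [], 0.0
    for cell, prob in joint.items():
        total += prob
        cells.append(cell)
        cum.append(total)
    return cells, cum

def client_response(true_cell, joint, p):
    # Two coin flips: with probability p report the true cell; otherwise
    # draw q uniformly in [0, 1] and return the cell whose
    # cumulative-probability interval contains q (Monte Carlo sampling).
    if random.random() < p:            # first coin: "Head"
        return true_cell
    cells, cum = make_cdf(joint)
    q = random.random()                # second coin: uniform q
    return cells[bisect.bisect_left(cum, q)]

# Hypothetical joint distribution over two binary attributes.
joint = {("M", "car"): 0.3, ("M", "train"): 0.2,
         ("F", "car"): 0.1, ("F", "train"): 0.4}
```

With p = 1 the client always answers truthfully; with p = 0 every answer is a fresh sample from the joint distribution, which is what gives plausible deniability.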
Figure 2: Overview of communication between the data aggregator and mobile clients to generate a noisy CT on views V_i

Algorithm 1: Randomized response on a single client
Input: Set of attributes A, probability (first coin is head) p
Output: Noisy table T'_{r,c}
 1 Function Aggregator(A):
 2   make views V = V_1, V_2, ..., V_m;
 3   randomly generate the views and check that the combinations of attributes are not repeated in the views;
 4   generate a uniform distribution (each cell set to 1/|T_{r,c}|) in the T_{r,c} of all views;
 5   while there exists a client u that has not yet communicated do
 6     select an arbitrary view V_i ∈ V;
 7     o ← Client(T, p, query(u, V_i));  /* call client procedure */
 8     reconstruct T'_{r,c} from o and T_{r,c} using Equation 4;
 9     update: T_{r,c} ← T'_{r,c}
10   end
11 Function Client(T, p, query(u, V_i)):
12   Sample a Bernoulli variable B;
13   if B = "Head" then
14     Respond true value t ∈ T_{r,c}

Two marginal CTs T_1 and T_2 with a common attribute coming from two noisy views V_i and V_j are consistent if and only if the marginal table over the common attributes in V_i ∩ V_j reconstructed from view V_i is the same as the one reconstructed from view V_j.
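This consistency condition can be checked mechanically. The sketch below assumes each view is stored as a NumPy array with one axis per attribute and attribute names supplied separately; these representation choices are ours, not the paper's.

```python
import numpy as np

def marginal(ct, attrs, keep):
    # Marginalize a CT (one array axis per attribute in `attrs`)
    # down to the attributes in `keep` by summing out the rest.
    drop = tuple(i for i, a in enumerate(attrs) if a not in keep)
    return ct.sum(axis=drop)

def consistent(ct1, attrs1, ct2, attrs2, atol=1e-8):
    # Two views are consistent iff they induce the same marginal
    # over their common attributes.
    common = [a for a in attrs1 if a in attrs2]
    m1 = marginal(ct1, attrs1, common)
    m2 = marginal(ct2, attrs2, common)
    # Align m2's axis order (common attrs in attrs2 order) with m1's.
    order2 = [a for a in attrs2 if a in common]
    m2 = np.moveaxis(m2, list(range(len(common))),
                     [common.index(a) for a in order2])
    return bool(np.allclose(m1, m2, atol=atol))
```

For example, views over attributes (A, B) and (B, C) are consistent exactly when both induce the same counts for B.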
table T_{r,c}, and in the second step, we calculate the MLE given a table of counts T_{r,c}. For the first step, we need to minimize ||T̂_{r,c} − T'_{r,c}||_1 subject to Σ_i T̂_{r,c}[i] = n and T̂_{r,c}[i] ⩾ 0. Note that if we add independent noise to each cell of a table T_{r,c}, the above optimization problem gives multiple solutions. The ℓ₁ norm in our objective function in Equation 7 is not strongly convex.

Figures 5 and 6: Convergence in probabilities P(v_ij) = {0.072, 0.116, 0.224, 0.356, 0.446, 0.524, 0.732} and block sizes b = {18, 50, 150, 250} on Survey, Alarm and Child
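The first step of the two-step procedure, minimizing an ℓ₁ distance over the simplex of non-negative tables with a fixed total count, is a linear program. The sketch below uses the standard split x − t = u − v with u, v ≥ 0 and solves it with SciPy; the function name is ours, and this covers only the projection step, not the subsequent MLE.

```python
import numpy as np
from scipy.optimize import linprog

def l1_project(t, n):
    # Minimize ||x - t||_1  subject to  sum(x) = n, x >= 0.
    t = np.asarray(t, dtype=float)
    d = len(t)
    c = np.ones(2 * d)                                  # minimize sum(u) + sum(v)
    A_eq = np.concatenate([np.ones(d), -np.ones(d)])[None, :]
    b_eq = [n - t.sum()]                                # sum(t + u - v) = n
    A_ub = np.hstack([-np.eye(d), np.eye(d)])           # v - u <= t, i.e. x >= 0
    b_ub = t
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
                  bounds=[(0, None)] * (2 * d), method="highs")
    u, v = res.x[:d], res.x[d:]
    return t + u - v
```

As the surrounding text notes, the ℓ₁ objective is not strongly convex, so the LP may admit multiple optima; any optimal vertex returned by the solver is a valid projection.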

Table 5: Comparison of ℓ₂ and Jensen-Shannon distance between noisy and original CTs (Survey and Alarm); noise is added using the randomization protocol and Laplace noise with parameters p = 0.4, ε = 0.35, n = 8000, and block size b = 250

Table 6: Comparison of ℓ₂ and Jensen-Shannon distance between noisy and original CTs (Survey and Alarm); noise is added using the randomization protocol and Laplace noise with parameters p = 0.5, ε = 0.5, n = 8000, and block size b = 250.