Turbo: Effective Caching in Differentially-Private Databases

Differentially-private (DP) databases allow for privacy-preserving analytics over sensitive datasets or data streams. In these systems, user privacy is a limited resource that must be conserved with each query. We propose Turbo, a novel, state-of-the-art caching layer for linear query workloads over DP databases. Turbo builds upon private multiplicative weights (PMW), a DP mechanism that is powerful in theory but ineffective in practice, and transforms it into a highly effective caching mechanism, PMW-Bypass, that uses prior query results obtained through an external DP mechanism to train a PMW to answer arbitrary future linear queries accurately and "for free" from a privacy perspective. Our experiments on public Covid and CitiBike datasets show that Turbo with PMW-Bypass conserves 1.7 to 15.9× more budget than vanilla PMW and simpler cache designs. Moreover, Turbo provides support for range query workloads, such as timeseries or streams, where opportunities exist to further conserve privacy budget through DP parallel composition and warm-starting of PMW state. Our work provides a theoretical foundation and general system design for effective caching in DP databases.


Introduction
ABC collects lots of user data from its digital products to analyze trends, improve existing products, and develop new ones. To protect user privacy, the company uses a restricted interface that removes personally identifiable information and only allows queries over aggregated data from multiple users. Internal analysts use interactive tools like Tableau to examine static datasets and run jobs to calculate aggregate metrics over data streams. Some of these metrics are shared with external partners for product integrations. However, due to data reconstruction attacks on similar "anonymized" and "aggregated" data from other sources, including the US Census Bureau [29] and Aircloak [17], the CEO has decided to pause external aggregate releases and severely limit the number of analysts with access to user data statistics until the company can find a more rigorous privacy solution.
The preceding scenario, while fictitious, is representative of what often occurs in industry and government, leading to obstacles to data analysis or incomplete privacy solutions. In 2007, Netflix withdrew "anonymized" movie rating data and canceled a competition due to de-anonymization attacks [52]. In 2008, genotyping aggregate information from a clinical study led to the revelation of participants' membership in the diagnosed group, prompting the National Institutes of Health to advise against the public release of statistics from clinical studies [4]. In 2021, New York City excluded demographic information from datasets released from their CitiBike bike rental service, which could reveal sensitive user data [3]. The city's new, more restrained data release not only remains susceptible to privacy attacks but also prevents analyses of how demographic groups use the service.
Differential privacy (DP) provides a rigorous solution to the problem of protecting user privacy while analyzing and sharing statistical aggregates over a database. DP guarantees that analysts cannot confidently learn anything about any individual in the database that they could not learn if the individual were not in the database. Industry and government have started to deploy DP for various use cases [22], including publishing trends in Google searches related to Covid [10], sharing LinkedIn user engagement statistics with outside marketers [53], enabling analyst access to Uber mobility data while protecting against insider attacks [36], and releasing the US Census' 2020 redistricting data [5]. To facilitate the application of DP, industry has developed a suite of systems, ranging from specialized designs like the US Census TopDown [5] and LinkedIn Audience Engagements [53] to more general DP SQL systems, like GoogleDP [62], Uber Chorus [36], and Tumult Analytics [12].
DP systems face a significant challenge that hinders their wider adoption: they struggle to handle large workloads of queries while maintaining a reasonable privacy guarantee. This is known as the "running out of privacy budget" problem and affects any system, whether DP or not, that aims to release multiple statistics from a sensitive dataset. A seminal paper by Dinur and Nissim [23] proved that releasing too many accurate linear statistics from a dataset fundamentally enables its reconstruction, setting a lower bound on the necessary error in queries to prevent such reconstruction. Successful reconstructions of the US Census 2010 data [29] and Aircloak's data [17] from the aggregate statistics released by these entities exemplify this fundamental limitation. DP, while not immune to this limitation, provides a means of bounding the reconstruction risk. DP randomizes the output of a query to limit the influence of individual entries in the dataset on the result. Each new DP query increases this limit, consuming part of a global privacy budget that must not be exceeded, lest individual entries become vulnerable to reconstruction.
Recent work proposed treating the global privacy budget as a system resource that must be managed and conserved, similar to traditional resources like CPU [45]. When computation is expensive, caching is a go-to solution: it uses past results to save CPU on future computations. Caches are ubiquitous in all computing systems, from the processor to operating systems and databases, enabling scaling to much larger workloads than would otherwise be afforded with fixed resources. In this paper, we thus ask: How should caching work in DP systems to significantly increase the number of queries they can support under a privacy guarantee? While DP theory has explored algorithms to reuse past query results to save privacy budget in future queries, there is no general DP caching system that is effective in common practical settings.
We propose Turbo, the first general and effective caching layer for DP SQL databases that boosts the number of linear queries (such as sums, averages, counts) that can be answered accurately under a fixed, global DP guarantee. In addition to incorporating a traditional exact-match cache that saves past DP query results and reuses them if the same query reappears, Turbo builds upon a powerful theoretical construct, known as private multiplicative weights (PMW) [31], that leverages past DP query results to learn a histogram representation of the dataset that can go on to answer arbitrary future linear queries for free once it has converged. While PMW has compelling convergence guarantees in theory, we find it ineffective in practice: it is outperformed even by an exact-match cache.
We make three main contributions to PMW design to boost its effectiveness and applicability. First, we develop PMW-Bypass, a variant of PMW that bypasses it during the privacy-expensive learning phase of its histogram, and switches to it once it has converged to reap its free-query benefits. This change requires a new mechanism for updating the histogram despite bypassing the PMW, plus new theory to justify its convergence. The PMW-Bypass technique is highly effective, significantly outperforming both the exact-match cache and vanilla PMW in the number of queries it can support. Second, we optimize our mechanisms for workloads of range queries that do not access the entire database. These types of queries are typical in timeseries databases and data streams. For such workloads, we organize the cache as a tree of multiple PMW-Bypass objects and demonstrate that this approach outperforms alternative designs. Third, for streaming workloads, we develop warm-starting procedures for tree-structured PMW-Bypass histograms, resulting in faster convergence.
We formally analyze each of our techniques, focusing on privacy, per-query accuracy, and convergence speed. Each technique represents a contribution on its own and can be used separately, or, as we do in Turbo, as part of the first general, effective, and accurate DP-SQL caching design. We prototype Turbo on TimescaleDB, a timeseries database, and use Redis to store caching state. We evaluate Turbo on workloads based on Covid and CitiBike datasets. We show that Turbo significantly improves the number of linear queries that can be answered with less than 5% error (w.h.p.) under a global (10, 0)-DP guarantee, compared to not having a cache and alternative cache designs. Our approach outperforms the best-performing baseline in each workload by 1.7 to 15.9 times, and even more significantly compared to vanilla PMW and systems with no cache at all (such as most existing DP systems). These results demonstrate that our Turbo cache design is both general and effective in boosting workloads in DP SQL databases and streams, making it a promising solution for companies like ABC that seek an effective DP SQL system to address their user data analysis and sharing concerns. We make Turbo available open-source at https://github.com/columbia/turbo, part of a broader set of infrastructure systems we are developing for DP, all described here: https://systems.cs.columbia.edu/dp-infrastructure/.
Background
Threat model. We consider a threat model known as centralized differential privacy: one or more untrusted analysts query a dataset or stream through a restricted, aggregate-only interface implemented by a trusted database engine of which Turbo is a trusted component. The goal of the database and Turbo is to provide accurate answers to the analysts' queries without compromising the privacy of individual users in the database. The two main adversarial goals that an analyst may have are membership inference and data reconstruction. Membership inference is when the adversary wants to determine whether a known data point is present in the dataset. Data reconstruction involves reconstructing unknown data points from a known subset of the dataset. To achieve their goals, the adversary can use composition attacks to single out contributions from individuals, collude with other analysts to coordinate their queries, link anonymized records to public datasets, and access arbitrary auxiliary information except for timing side-channel information. Previous research demonstrated attacks under this threat model [17,21,28,29,33,52].
Differential privacy (DP). DP [25] randomizes aggregate queries over a dataset to prevent membership inference and data reconstruction [24,61]. DP randomization (a.k.a. noise) ensures that the probability of observing a specific result is stable to a change in one datapoint (e.g., if a user is removed or replaced in the dataset, the distribution over results remains similar). More formally, a query Q is (ε, δ)-DP if, for any two datasets D and D′ that differ by one datapoint, and for any result subset S, we have: P(Q(D) ∈ S) ≤ e^ε P(Q(D′) ∈ S) + δ. Here ε quantifies the privacy loss due to releasing the DP query's result (higher means less privacy), while δ can be interpreted as a failure probability and is set to a small value.
Two common mechanisms to enforce DP are the Laplace and Gaussian mechanisms. They add noise from an appropriately scaled Laplace/Gaussian distribution to the true query result, and return the noisy result. As an example, for counting queries and a database of size n, adding noise from Laplace(0, 1/(nε)) ensures (ε, 0)-DP (a.k.a. pure DP); adding noise from Gaussian(0, √(2 ln(1.25/δ))/(nε)) ensures (ε, δ)-DP. The accuracy of such queries can be controlled probabilistically by converting an accuracy target into the (ε, δ) parameters.
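To make the accuracy-to-privacy conversion concrete, the following is a minimal sketch for the Laplace mechanism only. It assumes a query that returns a fraction of rows (so one row changes the result by at most 1/n) and an accuracy target of the form "error at most alpha with probability at least 1 − beta". The function names are ours, chosen for illustration, and are not taken from the Turbo code base.

```python
import math
import random


def epsilon_for_accuracy(alpha: float, beta: float, n: int) -> float:
    """Smallest epsilon for which Laplace noise of scale 1/(n*epsilon)
    exceeds alpha in absolute value with probability at most beta."""
    # For Laplace(0, b): P(|noise| > alpha) = exp(-alpha / b), with b = 1/(n*epsilon).
    return math.log(1.0 / beta) / (n * alpha)


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Draw from Laplace(0, scale) as the difference of two exponentials."""
    # expovariate takes a rate, so rate = 1/scale gives mean `scale`.
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def dp_fraction(true_fraction: float, n: int, epsilon: float,
                rng: random.Random) -> float:
    """Release a fraction-of-rows query result under (epsilon, 0)-DP."""
    return true_fraction + laplace_noise(1.0 / (n * epsilon), rng)


if __name__ == "__main__":
    rng = random.Random(0)
    eps = epsilon_for_accuracy(alpha=0.05, beta=0.001, n=100)
    print("epsilon needed:", eps)
    print("noisy answer:", dp_fraction(0.30, 100, eps, rng))
```

Tighter accuracy (smaller alpha or beta) or a smaller database raises the required epsilon, which is why each accurate query on a small dataset is expensive.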
Answering multiple queries on the same data fundamentally degrades privacy [23]. DP quantifies this over a sequence of DP queries using the composition property, which in its basic form states that releasing two (ε_1, δ_1)-DP and (ε_2, δ_2)-DP queries is (ε_1 + ε_2, δ_1 + δ_2)-DP. When queries access disjoint data subsets, their composition is (max(ε_1, ε_2), max(δ_1, δ_2))-DP and is called parallel composition. Using composition, one can enforce a global (ε_G, δ_G)-DP guarantee over a workload, with each DP query "consuming" part of a global privacy budget that is defined upfront as a system parameter [54].
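The two composition rules can be sketched as a small per-partition budget accountant. This is an illustrative sketch for pure DP only (δ is ignored); the class and method names are hypothetical and do not describe the accountant of any particular DP SQL engine.

```python
from dataclasses import dataclass, field


@dataclass
class BudgetAccountant:
    """Tracks consumed epsilon per data partition (pure DP, delta ignored)."""
    epsilon_global: float
    spent: dict = field(default_factory=dict)  # partition id -> epsilon consumed

    def can_run(self, partitions, epsilon: float) -> bool:
        # Basic composition: costs add up within every partition a query touches.
        return all(self.spent.get(p, 0.0) + epsilon <= self.epsilon_global + 1e-12
                   for p in partitions)

    def charge(self, partitions, epsilon: float) -> None:
        if not self.can_run(partitions, epsilon):
            raise RuntimeError("global privacy budget would be exceeded")
        for p in partitions:
            self.spent[p] = self.spent.get(p, 0.0) + epsilon

    def worst_case(self) -> float:
        # Parallel composition: disjoint partitions compose by max, not by sum.
        return max(self.spent.values(), default=0.0)


if __name__ == "__main__":
    acc = BudgetAccountant(epsilon_global=1.0)
    acc.charge(["week1"], 0.4)   # two queries on disjoint weeks...
    acc.charge(["week2"], 0.4)
    print(acc.worst_case())      # ...cost 0.4 overall rather than 0.8
```

Queries that touch disjoint partitions leave the overall guarantee at the maximum per-partition spend, which is the effect partitioned workloads exploit.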
Good values of the global privacy budget in interactive DP SQL systems remain subject for debate [34], but generally, an ideal value for strong theoretical guarantees is ε_G = 0.1, while ε_G = 1 is considered acceptable. Larger values are often considered vacuous semantically, since individuals' privacy risk grows with e^(ε_G). In this paper, we aim to achieve values of ε_G = 1 or smaller over a query workload.
Private multiplicative weights (PMW). PMW is a DP mechanism to answer online linear queries with bounded error [31]. We defer detailed description of PMW, plus an example illustrating its functioning, to §4 and only give here an overview. PMW maintains an approximation of the dataset in the form of a histogram: estimated counts of how many times any possible data point appears in the dataset. When a query arrives, PMW estimates an answer using the histogram and computes the error of this estimate against the real data in a DP way, using a DP mechanism called sparse vector (SV) [26] (described shortly). If the estimate's error is low, it is returned to the analyst, consuming no privacy budget (i.e., the query is answered "for free"). If the estimate's error is large, then PMW executes the DP query on the data with the Laplace/Gaussian mechanism, consuming privacy budget as needed. It returns the DP result and also uses it to update the histogram for more accurate estimates to future queries.
An additional cost in using PMW comes from the SV, a well-known DP mechanism that can be used to test the error of a sequence of query estimates against the ground truth with DP guarantees and limited privacy budget consumption [26]. We refer the reader to textbook descriptions of SV for detailed functioning [26] and provide here only an overview of its semantics. SV is a stateful mechanism that receives queries and estimates for their results one by one, and assesses the error between these estimates and the ground-truth query results. While the estimates have error below a preset threshold with high probability, SV returns success and consumes zero privacy. However, as soon as SV detects a large-error estimate, it requires a reset, which is a privacy-expensive operation that re-initializes state within the SV to continue the assessments. In common SV implementations, a reset costs as much as 3× the privacy budget of executing one DP query on the data.
The theoretical vision of PMW is as follows. Under a stream of queries, PMW first goes through a "training" phase, where its histogram is inaccurate, requiring frequent SV resets and consuming budget. Failed estimation attempts update the histogram with low-error results obtained by running the DP query. Once the histogram becomes sufficiently accurate, the SV tests consistently pass, thereby ameliorating the initial training cost. Theoretical analyses provide a compelling worst-case convergence guarantee for the histogram, determining a worst-case number of updates required to train a histogram that can answer any future linear query with low error [32]. However, no one has examined whether this worst-case bound is practical and if PMW outperforms natural baselines, such as an exact-match cache.

Turbo Overview
Turbo is a caching layer that can be integrated into a DP SQL engine, significantly increasing the number of linear queries that can be executed under a fixed, global (ε_G, δ_G)-DP guarantee. We focus on linear queries like sums, averages, and counts (defined in §4), which are widely used in interactive analytics and constitute the class of queries supported by approximate databases such as BlinkDB [6]. These queries enable powerful forms of caching like PMW, and also allow for accuracy guarantees, which are important when doing approximate analytics, as one does on a DP database.

Design Goals
In designing Turbo, we were guided by several goals: (G1) and (G2) are strict requirements. (G3) and (G4) are driven by our belief that DP systems should not only possess meaningful theoretical properties but also be optimized for practice. (G5) is our main objective. (G6) requires further attention, given shortly. (G7) is driven by the limited guidance from PMW literature on parameter tuning. PMW meets goals (G1-G3) but falls significantly short for (G4-G7). Turbo achieves all goals; we provide theoretical analyses for (G1-G3) in §4 and empirical evaluations for (G4-G7) in §6.

Use Cases
The DP literature is fragmented, with different algorithms developed for different use cases. We seek to create a general system that supports multiple settings, highlighting three here:
(1) Non-partitioned databases are the most common use case in DP. A group of untrusted analysts issue queries over time against a static database, and the database owner wishes to enforce a global DP guarantee. Turbo should allow a larger workload of queries compared to existing approaches.
(2) and (3) Partitioned databases are less frequently investigated in DP theory literature, but important to distinguish in practice [50,57]. When queries tend to access different data ranges, it is worth partitioning the data and accounting for consumed privacy budget in each partition separately through DP's parallel composition. This lowers privacy budget consumption in each partition and permits more non- or partially-overlapping queries against the database. This kind of workload is inherent in timeseries and streaming databases, where analysts typically query the data by windows of time, such as how many new Covid cases occurred in the week after a certain event, or what is the average age of positive people over the past week. We distinguish two cases:
(2) Partitioned static database, where the database is static and partitioned by an attribute that tends to be accessed in ranges, such as time, age, or geo-location. All partitions are available at the beginning. Queries arrive over time and most are assumed to run on some range of interest, which can involve one or more partitions. Turbo should provide significant benefit not only compared to the baseline caching techniques, but also compared to not having partitioning.
(3) Partitioned streaming database, where the database is partitioned by time and partitions arrive over time. In such workloads, queries tend to run continuously as new data becomes available. Hence, new partitions see a similar query workload as preceding partitions. Turbo should take advantage of this similarity to further conserve privacy.
For all three use cases, we aim to support online workloads of queries that are not all known upfront. As §8 reviews, most works on optimizing global privacy budget consumption operate in the offline setting, where all queries are known upfront. For that setting, algorithms are known to answer all queries simultaneously with optimal use of privacy budget. However, this setting is unrealistic for real use cases, where analysts adapt their queries based on previous results, or issue new queries for different analyses. In such cases, which correspond to the online setting, we require adaptive algorithms that accurately answer queries on-the-fly. Turbo does this by making effective use of PMW, as we next describe.

Turbo Architecture
Fig. 1 shows the Turbo architecture. It is a caching layer that can be added to a DP SQL engine, like GoogleDP [62], Uber Chorus [36], or Tumult Analytics [12], to boost the number of linear queries that can be answered accurately under a fixed global DP guarantee. The filled components indicate our additions to the DP SQL engine, while the transparent components are standard in DP SQL engines. Here is how a typical DP SQL engine works without Turbo. Analysts issue queries against the engine, which is trusted to enforce a global (ε_G, δ_G)-DP guarantee. The engine executes the queries using a DP query executor, which adds noise to query results with the Laplace/Gaussian mechanism and consumes a part of the global privacy budget. A budget accountant tracks the consumed budget; when it runs out, the DP SQL engine either stops responding to new queries (as do Chorus and Tumult Analytics) or sacrifices privacy by "resetting" the budget (as does LinkedIn Audience Engagements). We assume the former.
Turbo intercepts the queries before they go into the DP query executor and performs a very proactive form of caching for them, reusing prior results as much as possible to avoid consuming privacy budget for new queries. Turbo's architecture is organized in two types of components: caching objects (denoted in light-orange background in Fig. 1) and functional components that act upon them (denoted in grey background).
Caching objects. Turbo maintains several types of caching objects. First, the Exact-Cache stores previous queries and their DP results, allowing for direct retrieval of the result without consuming any privacy budget when the same query is seen again on the same database version. Second, the PMW-Bypass is an improved version of PMW that reduces privacy budget consumption during the training phase of its histogram (§4.3). Given a query, PMW-Bypass uses an effective heuristic to judge whether the histogram is sufficiently trained to answer the query accurately; if so, it uses it, thereby spending no budget. Critically, PMW-Bypass includes a mechanism to externally update the histogram even when bypassing it, to continue training it for future, free-budget queries. Turbo aims to enable parallel composition for workloads that benefit from it, such as timeseries or streaming workloads, by supporting database partitioning. In theory, partitions could be defined by attributes with public values that are typically queried by range, such as time, age, or geo-location. In this paper, we will focus on partitioning by time. Turbo uses a tree-structured PMW-Bypass caching object, consisting of multiple histograms organized in a binary tree, to support linear range queries over these partitions effectively (§4.4). This approach conserves more privacy budget and enables larger workloads to be run when queries access only subsets of the partitions, compared to alternative methods.
Functional components. When Turbo receives a linear query through the DP SQL engine's query parser, it applies its caching objects to the query. If the database is partitioned, Turbo splits the query into multiple sub-queries based on the available tree-structured caching objects. Each sub-query is first passed through an Exact-Cache, and if the result is not found, it is forwarded to a PMW-Bypass, which selects whether to execute it on the histogram or through direct Laplace/Gaussian. For sub-queries that can leverage histograms, the answer is supplied directly without execution or budget consumption. For sub-queries that require execution with Laplace/Gaussian, the (ε, δ) parameters for the mechanism are computed based on the (α, β) accuracy parameters, using the "calibrate (ε, δ) for (α, β)" functional component in Fig. 1. Then, each sub-query and its privacy parameters are passed to the DP query executor for execution.
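One plausible way to split a time-range query over a binary tree of caching objects is the classic dyadic decomposition, sketched below. This is an assumption made for illustration: it supposes that each tree node covers an aligned, power-of-two run of consecutive time partitions, and the function name `dyadic_cover` is ours. The splitting that Turbo actually performs is described in its §4.4 and may differ.

```python
def dyadic_cover(lo: int, hi: int) -> list[tuple[int, int]]:
    """Split the half-open partition range [lo, hi) into the fewest aligned
    power-of-two blocks, each of which would map to one tree node."""
    blocks = []
    while lo < hi:
        # Largest power of two that divides lo (any size is aligned at lo == 0).
        size = lo & -lo if lo > 0 else 1 << (hi - lo).bit_length()
        # Shrink until the block fits inside the remaining range.
        while size > hi - lo:
            size //= 2
        blocks.append((lo, lo + size))
        lo += size
    return blocks


if __name__ == "__main__":
    # A query over partitions 3..7 maps to two nodes: [3, 4) and [4, 8).
    print(dyadic_cover(3, 8))
```

Under this assumption, a query over a long window touches only a logarithmic number of nodes, so each sub-query can be looked up in (or answered by) the caching object attached to its node.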
Turbo combines all sub-query results obtained from the caching objects to form the final result, ensuring that it is within α of the true result with probability 1 − β (functional component "combine results"). New results computed with fresh noise are used to update the caching objects (functional component "update histograms and Exact-Caches"). Additionally, Turbo includes cache management functionality, such as "warm-start of histograms," which reuses trained histograms from previous partitions to warm-start new histograms when a new partition is created (§4.5). This mechanism is effective in streams where the data's distribution and query workload are stable across neighboring partitions. Theoretical and experimental analyses show that external histogram updates and warm-starting give convergence properties similar to, but slightly slower than, vanilla PMW.

Detailed Design
We next detail the novel caching objects and mechanisms in Turbo, using different use cases from §3.2 to illustrate each concept. We describe PMW-Bypass in the static, non-partitioned database, then introduce partitioning for the tree-structured PMW-Bypass, followed by the addition of streaming to discuss warm-start procedures. We focus on the Laplace mechanism and basic composition, thus only discussing pure (ε, 0)-DP and ignoring δ. We also assume β is small enough for Turbo results to count as α-accurate w.h.p. Appendix A.6 extends all our theoretical results to (ε, δ)-DP, non-zero β, the Gaussian mechanism, and Rényi composition; in theory, all these should help to further conserve privacy budget, so we speculate they will be important for practice, but we leave their implementation and evaluation for future work.

Notation
Our algorithms require some notation. Given a data domain X, a database D with n rows can be represented as a histogram h ∈ N^X as follows: for any data point v ∈ X, h(v) denotes the number of rows in D whose value is v. h(v) is the bin corresponding to value v in the histogram. We denote N = |X| the size of the data domain and n the size of the database. When X has the form {0, 1}^d, we call d the data domain dimension. Example: a database with 3 binary attributes has domain X = {0, 1}^3 of dimension d = 3 and size N = 8; h(0, 0, 1) is the number of rows that are equal to (0, 0, 1). §4.2 exemplifies a database, its dimensions, and its histogram. We define linear queries as SQL queries that can be transformed or broken into the following form: SELECT AVG(*) FROM (SELECT q(A, B, C, ...) FROM Table), where q takes one argument per attribute of Table (denoted A, B, C, ...) and outputs a value in [0, 1]. When q has values in {0, 1}, a query returns the fraction of rows satisfying predicate q. To get raw counts, we multiply by n, which we assume is public information. PMW (and hence Turbo) is designed to support only linear queries. Examples of non-linear queries are: maximum, minimum, percentiles, top-k.
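The histogram representation and the evaluation of a linear query on it can be sketched in a few lines. The sketch below is non-private and purely illustrative: the helper names `build_histogram` and `linear_query` are ours, and the toy rows are made up.

```python
from itertools import product


def build_histogram(rows, domain):
    """Count how many rows equal each possible data point of the domain."""
    h = {v: 0 for v in domain}
    for row in rows:
        h[row] += 1
    return h


def linear_query(h, q):
    """Average of q over all rows, computed from bin counts only.
    q maps one data point (one argument per attribute) to a value in [0, 1]."""
    n = sum(h.values())
    return sum(q(*v) * count for v, count in h.items()) / n


if __name__ == "__main__":
    domain = list(product((0, 1), repeat=3))          # three binary attributes, 8 bins
    rows = [(0, 0, 1), (0, 0, 1), (1, 0, 0), (1, 1, 1)]
    h = build_histogram(rows, domain)
    # Fraction of rows whose third attribute equals 1.
    print(linear_query(h, lambda a, b, c: float(c == 1)))
```

Because a linear query depends on the data only through the bin counts, any sufficiently accurate estimate of the histogram can answer it without touching the raw rows, which is the property PMW relies on.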

Running Example
Fig. 2 gives a running example inspired by our evaluation Covid dataset. Analysts run queries against a database consisting of Covid test results over time. Fig. 2(a) shows a simplified version of the database, with only three attributes: the test's date, T; the outcome, P, which can be 0 or 1 for negative/positive; and subject's age bracket, A, with one of four values as in the figure. The database could be either static or actively streaming in new test data. Initially, we assume it static and ignore the T attribute. Our example database has n = 100 rows and data domain size N = 8 for P and A.
Fig. 2(b) shows two queries that were previously executed. While queries in Turbo return the fraction of entries satisfying a predicate, for simplicity we show raw counts. Q1 requests the positivity rate and Q2 the fraction of tested minors. Fig. 2(c) illustrates the histogram representation corresponding to the dataset, as estimated by the PMW algorithm, whose execution we discuss shortly. Fig. 2(d) shows the next query that will be executed, Q3, requesting the fraction of positive minors. Q3 is not identical to either Q1 or Q2, but it is correlated with both, as it accesses data that overlaps with both queries. Thus, while neither Q1's nor Q2's DP results can be used to directly answer Q3, intuitively, they both should help. That is the insight that PMW (and PMW-Bypass) exploits through its query-by-query build-up of a DP histogram representation of the database that becomes increasingly accurate in bins that are accessed by more queries.
Fig. 2(c) shows the state of the histogram after executing Q1 and Q2 but before executing Q3. Each bin in the histogram stores an estimation of the number of rows equal to (P, A). This is the h(P, A) field in the figure, for which we show the sequence of values it has taken following updates due to Q1 and Q2. Initially, h(P, A) in all bins is set assuming a uniform distribution over P × A; in this case the initial value was n/N = 12.5. The figure also shows the real (non-private) count for each bin (denoted real), which is not part of the histogram, but we include it as a reference. As queries are executed, h(P, A) values are updated with DP results, depending on which bins are accessed. Q1 and Q2 have already been executed, and both are assumed to have resorted to the Laplace mechanism, so they both contributed DP results to specific bins (we specify the update algorithm later when discussing Alg. 1). Q1 accessed, and hence updated, data in the P = 1 bins (the bottom row of the histogram). Q2 did so in the A = 0 bins (the left column of the histogram). Through a renormalization step, these queries have also changed the other bins, though not necessarily in a query-informed way. The update counter in each bin shows the number of queries that have purposely updated that bin. We can see that estimates in the bins updated at least once are a bit more accurate compared to those in the bins never updated. The only bin that has been updated twice is (P = 1, A = 0), as it lies at the intersection of both queries; that bin has diverged from its neighboring, singly-updated bins and is getting closer to its true value. (Bin (P = 1, A = 2), updated only once, is even more accurate purely by chance.)
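The effect of an update followed by renormalization can be illustrated on the running example's 2 × 4 domain. The sketch below uses a generic multiplicative-weights step, which is an assumption on our part and not necessarily the update rule of Alg. 1; the learning rate `eta` and the noisy answer 0.3 are made-up values chosen only to show the mechanics.

```python
import math
from itertools import product

N_ROWS = 100
DOMAIN = list(product((0, 1), range(4)))  # (P, A) pairs: 2 outcomes x 4 age brackets


def uniform_histogram():
    """Initial histogram: every bin holds n / N = 12.5 estimated rows."""
    return {v: N_ROWS / len(DOMAIN) for v in DOMAIN}


def estimate(h, q):
    """Answer a linear query from the histogram, as a fraction of rows."""
    return sum(q(*v) * c for v, c in h.items()) / N_ROWS


def mw_update(h, q, noisy_answer, eta=0.5):
    """Generic multiplicative-weights step (hypothetical parameters)."""
    # Push the bins touched by q up or down, toward the noisy answer.
    sign = 1.0 if noisy_answer > estimate(h, q) else -1.0
    raw = {v: c * math.exp(sign * eta * q(*v)) for v, c in h.items()}
    # Renormalize so that the bins still sum to n; this also moves untouched bins.
    total = sum(raw.values())
    return {v: c * N_ROWS / total for v, c in raw.items()}


if __name__ == "__main__":
    q1 = lambda p, a: float(p == 1)           # positivity rate
    h0 = uniform_histogram()
    h1 = mw_update(h0, q1, noisy_answer=0.3)  # made-up DP result for Q1
    print(estimate(h0, q1), estimate(h1, q1))
    print(h1[(1, 0)], h1[(0, 0)])
```

In this sketch the P = 1 bins shrink because the query touched them, while the P = 0 bins grow only as a side effect of renormalization, matching the remark that untouched bins change in a way that is not query-informed.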
Our last query, Q3, which accesses (P = 1, A = 0), may be able to leverage its estimation "for free," assuming the estimation's error is within α w.h.p. Assessing that the error is within α (privately, and without consuming privacy budget if it is) is the purview of the SV mechanism incorporated in a PMW. The catch is that the SV consumes privacy budget, in copious amounts, if this test fails. This is what makes vanilla PMW impractical, a problem that we address next.

PMW-Bypass
PMW-Bypass addresses practical inefficiencies of PMW, which we illustrate with a simple demonstration.
Demo experiment. Using a four-attribute Covid dataset with domain size 128 (so a bit larger than in our running example), we generate a query pool of over 34K unique queries by taking all possible combinations of values over the four attributes. From this pool, we sample 35K queries uniformly with replacement to form a workload; there is therefore some identical repetition of queries but not much. This workload is not necessarily realistic, but it should be an ideal showcase for PMW: there are many unique queries relative to the small data domain size (giving the PMW ample chance to train), and while most queries are unique, they tend to overlap in the data they touch (giving the PMW ample chance to reuse information from previous queries). We evaluate the cumulative privacy budget spent as queries are executed, comparing the case where we execute them through PMW vs. directly with Laplace, with and without an exact-match cache. Fig. 3 shows the results. As expected for this workload, the PMW works, as it converges after roughly the first 10K queries and consumes very little budget afterwards. However, before converging, the PMW consumes enormous budget. In contrast, the budget consumed by direct execution through Laplace grows linearly, but more slowly than PMW's at the beginning. The PMW eventually becomes better than Laplace, but only after ≈ 27K queries.
Moreover, if instead of always executing with Laplace, we trivially cached the results in an exact-match cache for future reuse if the same query reappeared (a rare event in this workload), then the PMW would never become notably better than this simple baseline! This happens for a workload that should be ideal for PMW. §6 shows that for other workloads, less favorable for PMW but more realistic, the outcome persists: PMWs underperform even the simplest baselines in practice.
We propose PMW-Bypass, a re-design for PMWs that releases their power and makes them very effective. We make multiple changes to PMWs, but the main one involves bypassing the PMW while it is training (and hence expensive) and instead executing directly with Laplace (which is less expensive). Importantly, we do this while still updating the histogram with the Laplace results so that eventually the PMW becomes good enough to switch to it and reap its zero-privacy query benefits. The PMW-Bypass line in Fig. 3 shows just how effective this design is in our demo experiment: PMW-Bypass follows the low, direct-Laplace curve instead of the PMW's up until the histogram converges, after which it follows the flat shape of PMW's convergence line. In this experiment, as well as in others in §6, the outcome is the same: our changes make PMWs very effective. We thus believe that PMW-Bypass should replace PMW in most settings where the latter is studied, not just in our system's design.
PMW-Bypass. Fig. 4 shows the functionality of PMW-Bypass, with the main changes shown in blue and bold. Without our changes, a vanilla PMW works as follows. Given a query Q, PMW first estimates its result using the histogram (R1) and then uses the SV protocol to test whether it is α-accurate w.h.p. The test involves comparing R1 to the exact result of the query executed on the database. If a noisy version of the absolute error between the two is within a threshold comfortably far from α, then R1 is considered accurate w.h.p. and outputted directly. This is the good case, because the query need not consume any privacy. The bad case is when the SV test fails. First, the query must be executed directly through Laplace, giving a result R2, whose release costs privacy. But beyond that, the SV must be reset, which consumes privacy. In total, if the Laplace execution costs ε, then releasing R2 costs 4ε! This is what causes the extreme privacy consumption during the training phase for vanilla PMW, when the SV test mostly fails. Still, in theory, after paying handsomely for this histogram "miss," R2 can be used to update the histogram (the arrow denoted "update (R2)" in Fig. 4), in hopes that future correlated queries "hit" in the histogram.
PMW-Bypass adds three components to PMW: (1) a heuristic that assesses whether the histogram is likely ready to answer Q with the desired accuracy; (2) a bypass branch, taken if the histogram is deemed not ready, which executes the query directly with Laplace instead of going through (and likely failing) the SV test; and (3) an external update procedure that updates the histogram with the bypass branch result. Given Q, PMW-Bypass first consults the heuristic, which only inspects the histogram, so its use is free. Two cases arise:
Case 1: If the heuristic says the histogram is ready to answer Q with α-accuracy w.h.p., then the PMW is used, R1 is generated, and the SV is invoked to test R1's actual accuracy.
If the heuristic's assessment was correct, then this test will succeed, and hence the free, 1 output branch will be taken.Of course, no heuristic that lacks access to the raw data can guarantee that 1 will be accurate enough, so if the heuristic was actually wrong, then the SV test will fail and the expensive 2 path is taken.Thus, a key design question is whether there exist heuristics good enough to make PMW-Bypass effective.We discuss heuristic designs below, but the gist is that simple and easily tunable heuristics work well, enabling the significant privacy budget savings in Fig. 3.
Case 2: If the heuristic says the histogram is not ready to answer Q with α-accuracy w.h.p., then the bypass branch is taken and Laplace is invoked directly, giving result R3. Now, PMW-Bypass must pay for Laplace, but because it bypassed the PMW, it does not risk an expensive SV reset. A key design question here is whether we can still reuse R3 to update the histogram, even though we did not, in fact, consult the SV to ensure that the histogram is truly insufficiently trained for Q. We prove that performing the same kind of update as the PMW would do, from outside the protocol, would break its theoretical convergence guarantee. Thus, for PMW-Bypass, we design an external update procedure that can be used to update the histogram with R3 while preserving the PMW's worst-case convergence, albeit at slower speed.
Heuristic ISHISTOGRAMREADY. One option to assess if a histogram is ready to answer a query accurately is to check if it has received at least C updates, for some global threshold C. However, this approach is often imprecise, as it fails to detect histogram regions that might still be untrained. Thus, we use a separate threshold value per bin, raising the question of how to configure all these thresholds. To keep configuration easy (goal (G7)), we use an adaptive per-bin threshold. For each bin, we initialize its threshold C with a value C_0 and increment C by an additive step S_0 every time the heuristic errs (i.e., predicts the histogram is ready when it is in fact not ready for that query). While the threshold is too small, the heuristic gets penalized until it reaches a threshold high enough to avoid mistakes. For queries that span multiple bins, we only penalize the least-updated bins, to prevent a single inaccurate bin from setting back the histogram for queries that use only accurate bins.
With these thresholds, we only configure the initial parameters C_0 and S_0, which we find experimentally easy to do (§6.2).
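The adaptive per-bin heuristic described above can be sketched as follows. This is a simplified illustration rather than the authors' code: the class and method names are hypothetical, a query is represented only by the list of bin indices it touches, and "ready" is assumed to mean that every touched bin has reached its own threshold.

```python
class AdaptiveBinHeuristic:
    """Per-bin update counters with thresholds that grow when the heuristic errs."""

    def __init__(self, num_bins, initial_threshold, step):
        self.thresholds = [initial_threshold] * num_bins  # one threshold per bin
        self.updates = [0] * num_bins                     # updates seen per bin
        self.step = step                                  # additive penalty

    def is_histogram_ready(self, bins):
        # Only inspects bookkeeping state, never the raw data, so it is free.
        return all(self.updates[b] >= self.thresholds[b] for b in bins)

    def record_update(self, bins):
        # Called whenever a histogram update touches these bins.
        for b in bins:
            self.updates[b] += 1

    def penalize(self, bins):
        # Called when the heuristic said "ready" but the SV test failed.
        # Only the least-updated bins of the query are penalized.
        fewest = min(self.updates[b] for b in bins)
        for b in bins:
            if self.updates[b] == fewest:
                self.thresholds[b] += self.step


heuristic = AdaptiveBinHeuristic(num_bins=4, initial_threshold=2, step=1)
query_bins = [0, 1]
print(heuristic.is_histogram_ready(query_bins))  # no updates yet
heuristic.record_update(query_bins)
heuristic.record_update(query_bins)
print(heuristic.is_histogram_ready(query_bins))  # both bins reached the threshold
```

Penalizing only the least-updated bins keeps a single poorly trained bin from raising the bar for well-trained bins that other queries rely on.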
External updates. While we want to bypass the PMW when the histogram is not "ready" for a query, we still want to update the histogram with the result from the Laplace execution (R3); otherwise, the histogram will never get trained. That is the purpose of our external updates (lines 33-34 in Alg. 1). They follow a similar structure as a regular PMW update (lines 24-25 in Alg. 1), with a key difference. In vanilla PMW, the histogram is updated with the result R2 from Laplace only when the SV test fails. In that case, PMW updates the relevant bins in one direction or another, depending on the sign of the error R2 − Q(h). For example, if the histogram is underestimating the true answer, then R2 will likely be higher than the histogram-based result, so we should increase the value of the bins (case R2 > Q(h) of line 24 in Alg. 1).
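The direction rule above can be sketched with a textbook multiplicative-weights step. This is a hedged illustration, not a transcription of Alg. 1: it assumes the histogram is a normalized list of bin weights, the query is a 0/1 vector over bins, and the update multiplies touched bins by exp(±lr) before renormalizing. The optional `margin` argument is a hypothetical stand-in for the safety margin that external updates add to the comparison, as discussed next; with `margin=0` the function follows the vanilla rule.

```python
import math


def estimate(hist, query):
    """Histogram-based answer: total weight of the bins the query selects."""
    return sum(h * q for h, q in zip(hist, query))


def mw_update(hist, query, noisy_result, lr, margin=0.0):
    """Move the selected bins up or down depending on the sign of the error."""
    est = estimate(hist, query)
    if noisy_result > est + margin:
        sign = 1.0    # histogram underestimates: increase the selected bins
    elif noisy_result < est - margin:
        sign = -1.0   # histogram overestimates: decrease the selected bins
    else:
        return list(hist)  # error too small to justify an update
    raw = [h * math.exp(sign * lr * q) for h, q in zip(hist, query)]
    total = sum(raw)
    return [r / total for r in raw]  # renormalize to a distribution


uniform = [0.25, 0.25, 0.25, 0.25]
query = [1, 1, 0, 0]
updated = mw_update(uniform, query, noisy_result=0.8, lr=0.5)
print(estimate(uniform, query), estimate(updated, query))
```

A larger `lr` moves the estimate further per update, which is the trade-off the learning-rate discussion below examines.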
In PMW-Bypass, external updates are performed not just when the authoritative SV test finds the histogram estimation inaccurate, but also when our heuristic predicts it to be inaccurate even though it may actually be accurate. In the latter case, performing external updates in the same way as PMW updates would add bias into the histogram and forfeit its convergence guarantee. To prevent this, in PMW-Bypass, external updates are executed only when we are quite confident, based on the direct-Laplace result R3, that the histogram overestimates or underestimates the true result. Line 33 shows the change: the term τ is a safety margin that we add to the comparison between the histogram's estimation and R3, to be confident that the estimation is wrong and the update warranted. This lets us prove worst-case convergence akin to PMW. Finally, like regular PMW updates, external updates reuse the already DP result R3, hence they do not consume any additional privacy budget beyond what was already consumed to generate R3.
Learning rate. In addition to the bypass option, we make another key change to PMW design for practicality. When updating a bin, we increase or decrease the bin's value based on a learning rate parameter, lr, which determines the size of the update step taken (line 3 in Alg. 1). Prior PMW works fix learning rates that minimize theoretical convergence time, typically α/8 [58]. However, our experiments show that larger values of lr can lead to much faster convergence, as dozens of updates may be needed to move a bin from its uniform prior to an accurate estimation. Still, increasing lr beyond a certain point can impede convergence, as the updates become too coarse. Taking a cue from deep learning, PMW-Bypass uses

Tree-Structured PMW-Bypass
We now switch to the partitioned-database use cases, focusing on time-based partitions, as in timeseries databases, whether static or dynamic. Rather than accessing the entire database, analysts tend to query specific time windows, such as requesting the Covid positivity rate over the past week, or the fraction of minors diagnosed with Covid in the two weeks following school reopening. This creates an opportunity to leverage DP's parallel composition: the database is partitioned by time (say a week's data goes in one partition), and privacy budget is consumed at the partition level. Queries can run at finer or coarser granularity, but they will consume privacy against the partition(s) containing the requested data. With this approach, a system can answer more queries under a fixed global (ε_G, δ_G)-DP guarantee compared to not partitioning [40,45,50,56]. We implement support for partitioning and parallel composition in Turbo through a new caching object called a tree-structured PMW-Bypass.
Example. Fig. 5 shows an extension of the running example in §4.2, with the database partitioned by week. Denote by n_i the size of partition i. A new query, Q, asks for the positivity rate over the past three weeks. How should we structure the histograms we maintain to best answer this query? One option would be to maintain one histogram per partition (i.e., just the leaves in the figure). To resolve Q, we query the histograms for weeks 2, 3, and 4. Assume the query results in an update. Then, we need to update the histograms, computing the answer with DP within our α error tolerance. Updating the histograms for weeks 2, 3, and 4 requires querying the result for each of them with parallel composition. Given that Laplace (
Instead, our approach is to maintain a binary-tree-structured set of histograms, as shown in Fig. 5. For each partition, but also for a binary tree growing from the partitions, we maintain a separate histogram. To resolve Q, we split the query into two sub-queries, one running on the histogram for week 2 ([2,2]) and the other running on the histogram for the range week 3 to week 4 ([3,4]). That last sub-query would then run on a larger dataset of size n_3 + n_4, requiring a smaller budget consumption to reach the target accuracy.
Design. Fig. 6 shows our design. Given a query Q, we split it into sub-queries based on the histogram tree, applying the min-cuts algorithm to find the smallest set of nodes in the tree that covers the requested partitions. In our example, this gives two sub-queries, Q′ and Q′′, running on histograms [2,2] and [3,4], respectively. For each sub-query, we use our heuristic to decide whether to use the histogram or invoke Laplace directly. If both histograms are "ready," we compute their estimations and combine them into one result, which we test with an SV against an accuracy goal. In our example, there are only two sub-queries, but in general there can be more, some of which will use Laplace while others use histograms. We adjust the SV's accuracy target so that it is calibrated to the aggregation that we will need to do among the results of these different mechanisms. We pay for any Laplace executions and SV resets against the queried data partitions and finally combine Laplace results with histogram-based results. Each sub-query updates the corresponding histograms of the tree (details in Alg. 2) and increments  for updated nodes.
Guarantees. (G1) Privacy and (G2) accuracy are unchanged (Thm. A.5, A.6). (G3) Worst-case convergence: For  partitions, if lr/ <  ≤ 1/2, then w.h.p. we perform at most updates (Thm. A.8).
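The splitting of a window query over the histogram tree can be illustrated with a standard recursive range decomposition on a binary tree of partitions. This is a sketch under stated assumptions, not the algorithm of Alg. 2: partitions are numbered from 1, the tree is a balanced binary tree over them, and `cover` is a hypothetical helper. The second function uses the usual Laplace tail bound for a fraction-valued query with sensitivity 1/n (noise scale 1/(n·ε) exceeds α with probability exp(−α·n·ε)), which is one way to see why a node covering more data needs less budget for the same accuracy target; the partition sizes are made up.

```python
import math


def cover(lo, hi, node_lo, node_hi):
    """Smallest set of tree nodes covering partitions lo..hi (inclusive)."""
    if hi < node_lo or node_hi < lo:
        return []                      # node lies outside the requested window
    if lo <= node_lo and node_hi <= hi:
        return [(node_lo, node_hi)]    # node lies fully inside: use it whole
    mid = (node_lo + node_hi) // 2     # otherwise recurse into both children
    return cover(lo, hi, node_lo, mid) + cover(lo, hi, mid + 1, node_hi)


def laplace_epsilon(n, alpha, beta):
    """Epsilon for which noise of scale 1/(n*eps) exceeds alpha w.p. at most beta."""
    return math.log(1.0 / beta) / (n * alpha)


# Running example: four weekly partitions, query over weeks 2 to 4.
nodes = cover(2, 4, 1, 4)
sizes = {1: 1000, 2: 1000, 3: 1000, 4: 1000}  # hypothetical partition sizes
for lo, hi in nodes:
    n = sum(sizes[w] for w in range(lo, hi + 1))
    print((lo, hi), n, laplace_epsilon(n, alpha=0.05, beta=0.001))
```

In this sketch the node [3,4] covers twice as many records as a single leaf, so the epsilon it needs is half that of a leaf.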

Histogram Warm-Start
An opportunity exists in streams to warm-start histograms from previously trained ones to converge faster. Prior work on PMW initialization [44] only justifies using a public dataset close to the private dataset to learn a more informed initial value for histogram bins than a uniform prior. We prove that warm-starting a histogram by copying an entire, trained histogram preserves the worst-case convergence. In Turbo, we use two procedures: for new leaf histograms, we copy the previous partition's leaf node; for non-leaf histograms, we take the average of the children's histograms. We also initialize the per-bin thresholds and update counters of each node.
Guarantees. (G1) Privacy and (G2) accuracy guarantees are unchanged. (G3) Worst-case convergence: If there exists λ ≥ 1 such that the initial histogram h_0 in Alg. 1 satisfies ∀x ∈ X, h_0(x) ≥ 1/(λ|X|), then we show that each PMW-Bypass converges, albeit at a slower pace (Thm. A.9). The same properties hold for the tree.
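The two warm-start procedures can be sketched in a few lines. This is a minimal illustration with hypothetical function names: histograms are plain lists of normalized bin weights, and the parent average is unweighted, since the text does not say whether children are weighted by partition size.

```python
def warm_start_leaf(previous_leaf):
    """New leaf histogram: an independent copy of the previous partition's leaf."""
    return list(previous_leaf)


def warm_start_parent(left_child, right_child):
    """Non-leaf histogram: bin-wise average of the two children's histograms."""
    avg = [(l + r) / 2.0 for l, r in zip(left_child, right_child)]
    total = sum(avg)
    return [a / total for a in avg]  # guard against rounding drift


week3 = [0.10, 0.20, 0.30, 0.40]          # trained leaf (made-up values)
week4 = warm_start_leaf(week3)            # new leaf starts from the previous one
week4[0], week4[3] = 0.20, 0.30           # later training changes week 4 only
parent = warm_start_parent(week3, week4)  # node [3,4] starts from the average
print(week3, week4, parent)
```

Because the leaf is copied rather than shared, later updates to the new partition's histogram do not alter the histogram it was initialized from.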

Prototype Implementations
We prototype Turbo in three components that we release open-source: (1) turbo-lib, a library that contains Turbo-specific functionality, notably the caching objects and functional components in the Turbo architecture (Fig. 1); (2) turbo-tumult, a library that connects turbo-lib with Tumult Analytics, to add caching functionality into that existing DP system; and (3) turbo-sql, a basic standalone library to run a select subset of DP SQL queries through turbo-lib directly against a traditional, non-DP database, such as TimescaleDB or PostgreSQL. The reason for both (2) and (3) is that Tumult provides a more complete database query engine, supporting a wide variety of Spark-SQL-like queries, while having significant limitations with respect to parallel composition on partitioned databases. Our integration with Tumult (2) shows that Turbo can be integrated with a real, existing DP system, while our standalone querying library (3) lets us experiment with both non-partitioned and partitioned databases, in both static- and streaming-DB settings. We use a version of (3) (released through the SOSP'23 artifact) throughout our evaluation, but describe here predominantly our integration with Tumult, which can serve as a blueprint for integration with other existing DP systems in the future. Finally, we separately release the artifact that we used in our evaluation and which was evaluated by the SOSP'23 artifact evaluation committee. All are available from the repository: https://github.com/columbia/turbo.
Fig. 7(a) shows the architecture of our Turbo-Tumult integration. The grey boxes are Turbo-specific while the clear boxes are unchanged Tumult components.
Tumult overview. Without Turbo, Tumult functions as follows. It consists of two main components: (1) Tumult Core, a library that implements primitive DP mechanisms and privacy accounting; and (2) Tumult Analytics, a layer on top of Tumult Core that exports a higher-level, Spark-SQL-like query interface on top of one or more static datasets or databases. Tumult Core exports a low-level API consisting of a privacy accountant and a measurement abstraction, which is the Tumult terminology for a DP computation. It implements the necessary methods to "evaluate" a measurement on top of a dataset and deduct its privacy budget against the accountant. Tumult Analytics implements two main abstractions: (1) a session, which represents the context against which Tumult will enforce a global privacy budget across all queries issued against this session; and (2) a Spark-SQL-like interface for analysts to construct queries that consist of multiple transformations chained one after another (such as filters, projections, joins, etc.) against one or more datasets, followed by a single aggregate function (such as an average, count, sum, median, percentiles, stdev, etc.), with potential for splitting and grouping the results by one or more attributes. Compared to other DP SQL databases, it is our impression that Tumult supports a fairly wide range of SQL that can be handled with DP.
For the purposes of this paper, we will assume that an administrator creates a session upfront, specifying a global privacy budget to be enforced, and hosts this session as a service to guard analysts' accesses to a sensitive dataset (or datasets) underneath. Analysts, who can be many and are untrusted, send their query expressions for execution against the session. The session is then responsible for executing each query by first compiling it into a measurement and then evaluating it through the Tumult Core, which will deduct the necessary privacy budget. While the measurement abstraction is a quite general representation of a DP computation, Tumult Core and Analytics assume a Spark DataFrame-based API for interacting with the dataset(s) underneath. Thus, measurements compiled through Tumult Analytics will be Spark DataFrame queries (to be executed through Spark) in which Tumult Analytics transparently includes an additional operation that adds an appropriately scaled amount of noise to the result of the aggregation. A Tumult measurement encapsulates this compiled Spark DataFrame query, along with information regarding the privacy budget it is programmed to expend upon its execution. Tumult Analytics hands over this measurement to Tumult Core, which executes the DataFrame query through Spark and deducts the measurement's reported privacy consumption through its privacy accountant.
Turbo-Tumult. The preceding describes Tumult and its main abstractions (relevant for this paper) without Turbo. Tumult itself has no caching capabilities, so our integration aims to add Turbo as a caching layer in Tumult. The integration consists of two components, denoted in grey in Fig. 7(a). First is turbo-lib, which contains the core Turbo functionality we described in this paper. Turbo-lib exposes an API, the Turbo API, consisting of the functions Turbo exports to and requires from any user of Turbo, such as turbo-tumult and turbo-sql. Second is turbo-tumult, a small library that incorporates Turbo into Tumult by invoking and implementing different parts of the Turbo API. Turbo-tumult takes a light-touch approach to incorporating Turbo into Tumult, which ensures that our system is easily adoptable. This approach manifests in two ways. First, we only extend, but do not modify, certain classes within Tumult Analytics and implement new types of measurements to extend, but not change, Tumult Core functionality. Specifically, turbo-tumult provides one type of externally visible abstraction: a new type of session, called TurboSession, which overrides the original's query evaluation function to: (1) incorporate a set of hooks into the query compiler such that certain information necessary for Turbo is extracted from the query, such as the dataset ID, the type of aggregate function, and the filtering conditions; and (2) if the query can be handled by Turbo, pass it through turbo-lib instead of executing it directly on Tumult Core. Turbo-lib then checks its own caching objects for an answer, but resorts to Tumult Core (which it accesses back through the Turbo API, discussed shortly) for execution of the query and for privacy budget deduction in the Tumult Core accountant.
Second, we take a fail-to-Tumult approach for all queries. Turbo supports a small subset of all queries supported by Tumult: e.g., we do not support joins, medians, percentiles, and a number of transformation functions allowed in Tumult. Moreover, Turbo aims to control the accuracy of the queries, and presently that accuracy must be specified upfront, when the cache service is created (e.g., through TurboSession). Yet, analysts may wish to vary their accuracy targets per query, and in some cases may wish to specify privacy budgets rather than accuracies for a query. Finally, we support only certain types of DP mechanisms and definitions in our prototype, specifically those relying on Laplace, whereas Tumult supports more. Our approach to address these limitations without restricting analysts' interaction with Tumult is to consult the Turbo caches only when the queries exhibit properties we can handle, while resorting to Tumult-based execution when they do not. As a result, an analyst interacting with a TurboSession will not be restricted in terms of their queries or accuracy demands compared to interacting with a vanilla Tumult session, but Turbo will only conserve privacy budget for those queries that it can handle.
The preceding two approaches for light-touch integration of Turbo into Tumult ensure that our system can be easily adopted.
Turbo API. The Turbo API is the central component for integrating Turbo into real DP systems. Shown in Fig. 7(b), it consists of two components: (1) functionality that turbo-lib implements and users invoke to take advantage of its caches (specifically, the run function); and (2) several classes that users implement to provide Turbo with services it needs from the DP system with which it is integrated. Turbo needs three types of services from the DP system. First, it needs the ability to extract certain information about the query, such as: the type of aggregation and filter chain; a unique ID for the dataset (or partition or view over the dataset or partition) on which the query is run, as Turbo's state is tied to a dataset/partition/view; and the number of records in that dataset (recall that our design assumes that the dataset size is public information). This information is supplied by implementing the TurboQuery interface, which wraps the original, DP-system-specific query structure into one that supplies the necessary information. For example, our turbo-tumult library wraps query expressions into a TurboQuery with this enhanced functionality.
Second, Turbo is not a query engine, so it needs the ability to execute a query through the original DP system. This is provided by implementing the QueryExecutor interface. A peculiarity of Turbo in this context, which was easy to implement in Tumult but which we anticipate may be non-trivial to implement in other DP systems, is that Turbo needs not only the ability to execute the query in a DP way, but also the ability to execute it without DP. Recall that Turbo's SV check compares the histogram-based result to the true result of invoking the same query on the data without DP. Turbo thus needs access to this true result, a piece of functionality that DP databases typically do not offer publicly, for good, safety-related reasons. Still, in Tumult, due to its highly modular structure, we find that this functionality can be implemented without having to modify its code base. Specifically, turbo-tumult implements QueryExecutor.executeNPQuery(.) by defining a special type of measurement that does not, in fact, incorporate randomness into its aggregate and which reports as zero the privacy budget being used. This measurement is executed against the Tumult Core and returns the true result of the query. In turbo-lib, we take care to only leverage this sensitive result internally during the SV check in a DP way. Moreover, to optimize query execution in the case that the SV fails, turbo-tumult implements QueryExecutor.executeDPQuery(.) with the optional ability to reuse a non-private, true result previously obtained for the SV check. This is achieved by implementing another type of measurement, which, when executed by Tumult Core, will only apply the randomness operation to the given true_result and report the appropriate amount of privacy budget to be deducted by the Tumult Core's accountant.
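As a rough illustration of the executor contract described above, the sketch below pairs a non-private execution with a DP execution that can reuse a previously computed true result. Only the names QueryExecutor, executeNPQuery, executeDPQuery, and true_result come from the text; the signatures, the `InMemoryExecutor` class, the predicate-based query representation, the fraction-valued count, and the Laplace scale of 1/(n·ε) are all assumptions made for this example and do not describe the turbo-tumult code.

```python
import random
from abc import ABC, abstractmethod


class QueryExecutor(ABC):
    """Hypothetical shape of the interface a host DP system would implement."""

    @abstractmethod
    def executeNPQuery(self, query):
        """Return the true (non-private) result; for internal use only."""

    @abstractmethod
    def executeDPQuery(self, query, epsilon, true_result=None):
        """Return a DP result, optionally reusing an earlier true result."""


class InMemoryExecutor(QueryExecutor):
    """Toy executor over a list of rows; a query is a row predicate."""

    def __init__(self, rows, rng):
        self.rows = rows
        self.rng = rng
        self.np_calls = 0        # how often the raw data was scanned
        self.budget_spent = 0.0  # stands in for the host system's accountant

    def executeNPQuery(self, query):
        self.np_calls += 1
        return sum(1 for row in self.rows if query(row)) / len(self.rows)

    def executeDPQuery(self, query, epsilon, true_result=None):
        if true_result is None:          # no cached true result: scan the data
            true_result = self.executeNPQuery(query)
        scale = 1.0 / (len(self.rows) * epsilon)
        rate = 1.0 / scale
        # Difference of two exponentials with mean `scale` is Laplace noise.
        noise = self.rng.expovariate(rate) - self.rng.expovariate(rate)
        self.budget_spent += epsilon
        return true_result + noise


rows = [{"positive": i % 4 == 0} for i in range(1000)]
executor = InMemoryExecutor(rows, random.Random(7))
is_positive = lambda row: row["positive"]
truth = executor.executeNPQuery(is_positive)              # used by the SV check
noisy = executor.executeDPQuery(is_positive, 0.5, truth)  # reuses the scan
print(truth, noisy, executor.np_calls, executor.budget_spent)
```

Passing the true result back in avoids a second scan of the data when the SV test fails and the query must be released through Laplace anyway.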
Third, Turbo needs the ability to deduct the privacy budget consumed by the SV reset. This is supplied by implementing the Turbo PrivacyAccountant interface, with one function: consume. In turbo-tumult, we implement this interface by defining a third type of measurement, which does not perform any computation but just consumes privacy. We believe that DP systems should export this kind of functionality to more naturally support extensions.
Turbo-lib. Turbo-lib implements the Turbo design described in this paper, with some notable restrictions. First, we do not yet support partitioning in the turbo-lib implementation, though that support exists in our SOSP artifact release, as used in our evaluation. Second, our implementation presently supports only count queries, although our histograms and exact-match caches can be extended to support other types of linear aggregations, such as sums, averages, and standard deviations. Third, we use Redis to store all state in Turbo, including the exact-match caches, PMW histograms, and SV state. Redis can be replaced with a persistent, consistent, and durable storage service for production use.
Turbo-sql. In addition to incorporating Turbo into the Tumult Analytics engine, we are also creating a basic, standalone, SQL DP database ourselves, which only supports the types of queries that Turbo supports, but which can support streaming and partitioning. At the time of this writing, the most mature version of this library can be found in our SOSP artifact release, but we are working on a more modular version of this library that presently lacks support for streaming and partitioning. The full-featured version of this library, which is what we use in our evaluation, receives simple linear SQL queries as strings, parses them, implements the Turbo API to first check for answers to them in the Turbo cache, and executes the queries (DP through Laplace, or non-DP as needed by Turbo) using TimescaleDB, a streaming version of PostgreSQL.

Evaluation
We evaluate Turbo using the SOSP artifact version of our own, dedicated DP SQL database with Turbo support incorporated in it. We use two public timeseries datasets, Covid and CitiBike, to evaluate Turbo in the three use cases from §3.2. Each use case lets us do system-wide evaluation, answering the critical question: Does Turbo significantly improve privacy budget consumption compared to reasonable baselines for each use case? This corresponds to evaluating our §3.1 design goals (G5) and (G6). In addition, each setting lets us evaluate a different set of caching objects and mechanisms:
(1) Non-partitioned database: We configure Turbo with a single PMW-Bypass and Exact-Cache, letting us evaluate the PMW-Bypass object, including its empirical convergence and the impact of its heuristic and learning rate parameters.
(2) Partitioned static database: We partition the datasets by time (one partition per week) and configure Turbo with the tree-structured PMW-Bypass and Exact-Cache. This lets us evaluate the tree-structured cache.
(3) Partitioned streaming database: We configure Turbo with the tree-structured PMW-Bypass, Exact-Cache, and histogram warm-up, letting us evaluate warm-up.
As highlights, our results show that PMW-Bypass unleashes the power of PMW, improving privacy budget consumption for linear queries well beyond the conventional approach of using an exact-match cache (goal (G5)). Moreover, Turbo as a whole seamlessly applies to multiple settings, with its novel tree-structured PMW-Bypass structure scoring significant benefit for timeseries workloads where the database can be partitioned to leverage parallel composition (goal (G6)). Configuration of our objects and mechanisms is straightforward (goal (G7)), and we tune them based on empirical convergence rather than theoretical convergence, boosting their practical effectiveness (goal (G4)). Finally, we provide a basic runtime and memory evaluation, which shows that while Turbo performs reasonably for our datasets, further research is needed for larger-domain data.

Methodology
For each dataset, we create query workloads by (1) generating a pool of linear queries and (2) sampling queries from this pool based on a Zipfian distribution. Covid uses a completely synthetic query pool. CitiBike uses a pool based on real-user queries from prior CitiBike analyses. We use the former as a microbenchmark, the latter as a macrobenchmark.
Covid. Dataset: We take a California dataset of Covid-19 tests from 2020 that provides daily aggregate information of the number of Covid tests and their positivity rates for various demographic groups defined by age × gender × ethnicity. We combine this data with US Census data to generate a synthetic dataset that contains n = 50,426,600 per-person test records, each with the date and four attributes: positivity, age, gender, and ethnicity. These attributes have domain sizes of 2, 4, 2, and 8, respectively, so the dataset domain size is N = 128. The dataset spans 50 weeks, so in partitioned use cases we have up to 50 partitions. Query pool: We create a synthetic and rich pool of correlated queries comprising all possible count queries that can be posed on Covid. This gives 34,425 unique queries, plenty for us to microbenchmark Turbo.
CitiBike. Dataset: We take a dataset of NYC bike rentals from 2018-2019, which includes information about individual rides, such as start/end date, start/end geo-location, and renter's gender and age. The original data is too granular, with 4,000 geo-locations and 100 ages, making it impractical for PMWs. Since all the real-user analyses we found consider the data at coarser granularity (e.g., broader locations and age brackets), we group geo-locations into ten neighborhoods and ages into four brackets. This yields a dataset with n = 21,096,261 records, domain size N = 604,800, and spanning 50 weeks. Query pool: We collect a set of pre-existing CitiBike analyses created by various individuals and made available on Public Tableau [2]. An example is here [1]. We extract 30 distinct queries, most containing 'GROUP BY' statements that we decompose into multiple primitive queries that can interact with Turbo histograms. This gives us a pool of 2,485 queries, which is smaller than Covid's but more realistic and suitable as a macrobenchmark.
Workload generation. As is customary in caching literature [8,18,63], we use a Zipfian distribution to control the skewness of the query distribution, which affects hit rates in the exact-match cache. From a pool of m queries, a query of type x ∈ [1, m] is sampled with probability ∝ x^(−k_zipf), where k_zipf ≥ 0 is the parameter that controls skewness. We evaluate with several k_zipf values but report only results for k_zipf = 0 (uniform) and k_zipf = 1 (skewed) for Covid. For CitiBike, we evaluate only for k_zipf = 0 to avoid reducing the small query pool further with skewed sampling. For streaming, queries arrive online with arrival times following a Poisson process; they request a window of a certain size over recent timestamps.
Metrics.
• Average cumulative budget: the average budget consumed across all partitions.
• Systems metrics: traditional runtime, process RAM.
• Empirical convergence: We periodically evaluate the quality of Turbo's histogram by running a validation workload sampled from the same query pool. We measure the accuracy of the histogram as the fraction of queries that are answered with error ≤ α/2 by the histogram. We define empirical convergence as the number of histogram updates necessary to reach 90% validation accuracy.
Default parameters. Unless stated otherwise, we use the following parameter values: privacy (ε_G = 10, δ_G = 0); accuracy (α = 0.05, β = 0.001); for Covid: {learning rate lr starts from 0.25 and decays to 0.025, heuristic (C_0 = 100, S_0 = 5), external updates τ = 0.05}; for CitiBike: {learning rate lr = 0.5, heuristic (C_0 = 5, S_0 = 1), external updates τ = 0.01}.
Turbo is instantiated with tree-structured PMW-Bypass and Exact-Cache. Turbo significantly improves budget consumption compared to both a single Exact-Cache and a tree-structured set of Exact-Caches.
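The workload construction in this section can be sketched as follows. This is an illustration under stated assumptions rather than the evaluation code: the function names are hypothetical, and the pool-size formula reflects one reading of "all possible count queries" (each attribute is restricted to a non-empty subset of its values), which happens to reproduce the reported pool size for attribute domains of sizes 2, 4, 2, and 8. Queries are identified only by their rank in the pool.

```python
import random


def pool_size(domain_sizes):
    """Count queries that pick a non-empty subset of values per attribute."""
    size = 1
    for d in domain_sizes:
        size *= 2 ** d - 1  # number of non-empty subsets of d values
    return size


def zipf_weights(num_queries, k):
    """Probability of each query rank under a Zipf law with exponent k >= 0."""
    raw = [float(rank) ** (-k) for rank in range(1, num_queries + 1)]
    total = sum(raw)
    return [w / total for w in raw]


def sample_workload(num_queries, k, length, rng):
    """Draw `length` query ranks (1-based) with replacement."""
    weights = zipf_weights(num_queries, k)
    return rng.choices(range(1, num_queries + 1), weights=weights, k=length)


covid_pool = pool_size([2, 4, 2, 8])
rng = random.Random(0)
uniform = sample_workload(covid_pool, k=0.0, length=1000, rng=rng)
skewed = sample_workload(covid_pool, k=1.0, length=1000, rng=rng)
print(covid_pool, len(set(uniform)), len(set(skewed)))
```

With an exponent of 0 every query is equally likely, so repeats are rare in a large pool; a larger exponent concentrates the workload on low-rank queries, which raises exact-match cache hit rates.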
6.2 Use Case (1): Non-partitioned Database
System-wide evaluation. Question 1: In a non-partitioned database, does Turbo significantly improve privacy budget consumption compared to vanilla PMW and a simple Exact-Cache? Fig. 8(a)-8(c) show the cumulative privacy budget used by three workloads as they progress to 70K queries. Two workloads correspond to Covid, one uniform (k_zipf = 0) and one skewed (k_zipf = 1), and one uniform workload corresponds to CitiBike. Turbo surpasses both baselines across all three workloads. The improvement is enormous when compared to vanilla PMW: 15.9× to 37.4×! PMW's convergence is rapid but consumes lots of privacy; Turbo uses little privacy during training and then executes queries for free. Compared to just an Exact-Cache, the improvement is less dramatic but still significant. The greatest improvement over Exact-Cache is seen in the uniform Covid workload: 16.7× (Fig. 8(a)). Here, queries are relatively unique, resulting in a low hit rate for the Exact-Cache. That hit rate is higher for the skewed workload (Fig. 8(b)), leaving less room for improvement for Turbo: 9.7× better than Exact-Cache. For CitiBike (Fig. 8(c)), the query pool is much smaller (< 2.5K queries), resulting in many exact repetitions in a large workload, even if uniform. Nevertheless, Turbo gives a 1.7× improvement over Exact-Cache. And in this workload, Turbo outperforms PMW by 37.4× (omitted from the figure for visualization reasons). Overall, then, Turbo significantly reduces privacy budget consumption in non-partitioned databases, achieving a 1.7× to 15.9× improvement over the best baseline for each workload (goal (G5)).
PMW-Bypass evaluation. Using Covid k_zipf = 1, we microbenchmark PMW-Bypass to understand the behavior of this key Turbo component.
Question 2: Does PMW-Bypass converge similarly to PMW in practice? Through theoretical analysis, we have shown that PMW-Bypass achieves similar worst-case convergence to PMW, albeit at slower speed (§4.3). Fig. 8(d) compares the empirical convergence (defined in §6.1) of PMW-Bypass vs. PMW, as a function of the learning rate lr. We make three observations, two of which agree with theory, while the last differs. First, the results confirm the theory that (1) PMW-Bypass and PMW converge similarly, but (2) for "good" values of lr, vanilla PMW converges slightly faster: e.g., for lr = 0.025, PMW-Bypass converges after 1853 updates, while PMW converges after 944. Second, as theory suggests, very large values of lr (e.g., lr ≥ 0.4) impede convergence in practice. Third, although theoretically lr = α/8 = 0.00625 is optimal for worst-case convergence, and it is commonly hard-coded in PMW protocols [58], we find that empirically, larger values of lr (e.g., lr = 0.05, which is 8× larger) give much faster convergence. This is true for both PMW and PMW-Bypass, and across all our workloads. This justifies the need to adapt and tune mechanisms based on not only theoretical but also empirical behavior (goal (G4)).
Question 3: How do PMW-Bypass heuristic, learning rate, and external update parameters impact consumed budget?
We experimented with all parameters and found that the two most impactful are (a)  0 , the initial threshold for the number of updates each bin involved in a query must have received before the histogram is used, and (b) the learning rate. Fig. 9 shows their effects.

Heuristic  0 (Fig. 9(a)): Higher  0 results in a more pessimistic assessment of histogram readiness. If it is too pessimistic ( 0 = 1), PMW is never used, so every query is answered with a direct Laplace. If it is too optimistic ( 0 = 1), errors occur too often, and the histogram's training overpays.  0 = 100 is a good value for this workload.

Learning rate lr (Fig. 9(b)): Higher lr leads to more aggressive learning from each update. Both too aggressive (lr = 0.125) and too timid (lr = 0.00625) learning slow down convergence. Good values hover around lr = 0.025. Overall, only a few parameters affect performance, and even for those, performance is relatively stable around good values, making them easy to tune (goal (G7)).
Question 4: How does Turbo's adaptive, per-bin heuristic compare to alternatives? We experimented with three alternative ISHISTOGRAMREADY designs that forgo either (1) the per-bin granular thresholds, or (2) the adaptivity property, or (3) both. We make two observations.

First, the coarse-grained heuristics consume more privacy budget than the fine-grained heuristics, especially on more skewed workloads, such as zipf = 1.5, which have less diversity and therefore tend to train histogram bins less uniformly. For example, a coarse-grained heuristic that uses a histogram-level count of the number of updates, with a threshold  0 to determine when the histogram is ready to receive any query, consumes at best 0.7 global privacy budget on a Covid workload with zipf = 1.5; this is achieved when  0 is optimally configured to a value of 2070 updates. In contrast, a fine-grained heuristic, which uses a per-bin update count with a threshold  0 for each bin, consumes at best 0.44 global privacy budget, achieved when  0 is set to 160 updates.

Second, the adaptive heuristics consume a similar budget to the optimally-configured, non-adaptive ones, but the former are much easier to configure, as they offer stable performance across wide ranges of the  0 parameter. For example, when  0 varies in range [20, 200], the non-adaptive per-bin heuristic's budget consumption varies in range [0.44, 0.81] for the zipf = 1.5 workload, and in range [0.31, 0.76] for the zipf = 1 workload. In contrast, the budget consumption of Turbo's adaptive, per-bin heuristic varies in much tighter ranges under the same circumstances: [0.44, 0.52] and [0.28, 0.48] for the zipf = 1.5 and zipf = 1 workloads, respectively. Thus, Turbo's heuristic is the best of these options.
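The difference between a per-bin readiness check and its adaptive variant can be sketched as follows. This is an illustration only, not the paper's ISHISTOGRAMREADY: the class and method names are hypothetical, and the adaptive rule used here (after an SV failure, require `step` more updates on the bins involved) is an assumed stand-in for the actual threshold adjustment.

```python
from collections import defaultdict

class PerBinHeuristic:
    """Hypothetical per-bin readiness heuristic (all names are assumptions)."""

    def __init__(self, initial_threshold, step=0):
        self.step = step  # step = 0 gives a non-adaptive heuristic
        self.updates = defaultdict(int)  # updates received per bin
        self.thresholds = defaultdict(lambda: initial_threshold)

    def is_histogram_ready(self, bins):
        # Ready only if every bin touched by the query is trained enough.
        return all(self.updates[b] >= self.thresholds[b] for b in bins)

    def record_update(self, bins):
        # Called after an update that touches the given bins.
        for b in bins:
            self.updates[b] += 1

    def record_sv_failure(self, bins):
        # Assumed adaptive rule: the check was too optimistic for these
        # bins, so demand `step` more updates before trusting them again.
        for b in bins:
            self.thresholds[b] = self.updates[b] + self.step

heuristic = PerBinHeuristic(initial_threshold=2, step=1)
heuristic.record_update([0, 1])
heuristic.record_update([0, 1])
ready_trained = heuristic.is_histogram_ready([0, 1])    # both bins trained
ready_untrained = heuristic.is_histogram_ready([0, 2])  # bin 2 never updated
```

A coarse-grained variant would keep a single counter for the whole histogram, so a query touching only untrained bins could be declared ready as soon as other bins had absorbed enough updates.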

Use Case (2): Partitioned Static Database
System-wide evaluation. Question 5: In a partitioned static database, does Turbo significantly improve privacy budget consumption, compared to a single Exact-Cache and a tree-structured set of Exact-Caches? We divide each database into 50 partitions and select a random contiguous window of 1 to 50 partitions for each query. We adjust the ( 0 ,  0 ) heuristic parameters to (50, 1) for Covid and (1, 1) for CitiBike. Fig. 10(a)-10(c) show the average budget consumed per partition up to 300K queries. Compared to the non-partitioned case, Turbo can now support more queries under a global privacy budget of 10 thanks to parallel composition: each query only consumes privacy from the accessed partitions. Turbo further divides privacy budget consumption by 1.9 − 4.7× compared to the best-performing baseline for each workload, demonstrating its effectiveness as a caching strategy for the static partitioned use case.

Tree structure evaluation. Question 6: When does the tree structure for histograms outperform a flat structure that maintains one histogram per partition? We vary the average size of the windows requested by queries from 1 to 50 partitions based on a Gaussian distribution with std-dev 5. We find the tree structure for histograms is beneficial when queries tend to request more partitions (25 partitions or more). Because the tree structure maintains more histograms than the flat structure, it fragments the query workload more, resulting in fewer updates per histogram and more use of direct-Laplace. The tree's advantage in combining fewer results makes up for this privacy overhead when queries tend to request larger windows of partitions, while the flat structure is more justified when queries tend to request smaller windows of partitions.
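The tree's advantage in combining fewer results can be illustrated with the classic decomposition of a contiguous window into binary-tree nodes. The sketch below assumes a complete binary tree over partitions 0 to 2^k − 1 and represents each node by the inclusive range it covers; the function name and this representation are assumptions for illustration, not the paper's SPLITQUERY.

```python
def split_window(lo, hi, node_lo, node_hi):
    """Cover partitions [lo, hi] with the fewest nodes of a binary tree.

    The tree is rooted at the range [node_lo, node_hi], whose length is
    assumed to be a power of two. Returns a list of inclusive ranges.
    """
    if lo > node_hi or hi < node_lo:
        return []  # this node does not intersect the window
    if lo <= node_lo and node_hi <= hi:
        return [(node_lo, node_hi)]  # node entirely inside the window
    mid = (node_lo + node_hi) // 2
    return (split_window(lo, hi, node_lo, mid)
            + split_window(lo, hi, mid + 1, node_hi))

# A window of 6 partitions out of 8 is served by 4 tree nodes,
# whereas a flat structure would combine 6 per-partition results.
nodes = split_window(1, 6, 0, 7)
```

For a window covering a single partition, both structures touch exactly one histogram, which is consistent with the flat structure being preferable when windows are small.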

Use Case (3): Partitioned Streaming Database
System-wide evaluation. Question 7: In streaming databases partitioned by time, does Turbo significantly improve privacy budget consumption compared to baselines? Does warm-start help? Fig. 11(a)-11(c) show Turbo's budget consumption compared to the baselines. The experiments simulate a streaming database, where partitions arrive over time and each query requests a window of the most recent partitions, with the window length chosen uniformly at random between 1 and the number of available partitions. Turbo outperforms both baselines significantly for all workloads, particularly when warm-start is enabled. Without warm-start, Turbo improves performance by 1.5 − 3.5× at the end of the workload. With warm-start, Turbo gives 1.9 − 5.4× improvement over the best baseline for each workload, showing its effectiveness for the streaming use case. When there is a large variety of unique queries, the tree-structured Exact-Cache has a significantly better hit rate than the Exact-Cache baseline and performs better (Fig. 11(a)). In Fig. 11(b) and 11(c), the query pool is considerably smaller. Both baselines then have a good enough hit rate, but the tree-structured Exact-Cache needs to consume more privacy budget to compensate for the aggregation error, which makes it perform worse. This concludes our evaluation across use cases (goal (G6)).

Runtime and Memory Evaluation
Question 8: What are Turbo's runtime and memory bottlenecks? We evaluate Turbo's runtime and memory consumption to identify areas of improvement. Fig. 11(d) shows the average runtime of Turbo's main execution paths in a non-partitioned database. The Exact-Cache hit path is the cheapest and the other paths are more expensive. Histogram operations are the bottleneck in CitiBike due to the larger domain size, while query execution in TimescaleDB is the bottleneck in Covid due to the larger database size. Output path 1 takes similar time on the two datasets because their distinct bottlenecks balance out. Failing the SV check (output path 2) is the costliest path for both datasets due to the extra operations needed to update the heuristic's per-bin thresholds. We also conduct an experiment in the partitioned streaming case and find the same bottlenecks: TimescaleDB for Covid, histogram operations for CitiBike.

Finally, we report Turbo's memory consumption in the streaming case with 50 partitions: 5.21MB for Covid and 1.43GB for CitiBike. For context, the raw datasets occupy 600MB and 795MB on disk, respectively. Thus, Turbo's memory overhead can be significant, as it is for CitiBike, and it is caused by the PMWs. The next section discusses this limitation and proposes potential directions to address it.

Discussion
We discuss several of Turbo's strengths and weaknesses. Turbo provides benefits when queries overlap in the data they access, i.e., new queries access histogram bins that have been accessed by past queries. The functions computed atop these bins can differ among queries (e.g., the new query can compute an average while all the past ones computed count fractions). If there is no data overlap in the queries, then Turbo does not give any benefit and comes with memory/computational costs. This is typical for caching systems: they only help if the workload has some level of locality.
A key strength of Turbo is its support for dynamic workloads, both new queries and new data arriving in the system. First, Turbo adapts seamlessly to changing queries. In the worst case, the new queries will access completely "untrained" regions within a histogram. Our heuristic will detect this and trigger a new cycle of external updates. In more moderate cases, the workload will touch a mix of "trained" and "untrained" regions. This will yield a mix of hits and misses in the heuristic, and Turbo will use just the right amount of privacy budget to adapt to these slower workload changes. Second, thanks to histogram warm-start, Turbo adapts to new data partitions arriving into the system with minimal privacy budget consumption: as new partitions arrive, their histograms are initialized from past ones and then fine-tuned for the new data by a few external updates. This way, the new histograms will quickly start serving query answers for free, conserving privacy budget. Still, there is a limitation: while we support new data arriving in the system, we do not support updates on past data; such updates would result in our heuristics predicting less accurately when the histogram can answer a query, and thus in more expensive SV failures.
By far, Turbo's biggest limitation is the memory consumed to maintain the PMW histograms. Each histogram is a RedisAI vector whose size grows with data domain size  , i.e., exponentially in data domain dimension  ( and  are defined in Section 4.1). With  partitions and  queries, Turbo maintains a binary tree of such histograms, which means it stores ≈ 2  scalar values. By comparison, the Tree Exact-Cache baseline stores at most log( ) scalar values, a much lower memory consumption. This impacts not only the scale of the datasets that can be handled with Turbo, but also the runtime performance of Turbo-mediated queries. Indeed, as shown in the preceding section, histogram operations for CitiBike are the runtime bottleneck due to the relatively large domain size. Some techniques have previously been proposed to address this rather fundamental challenge for PMW [30]. However, for even larger-scale deployments, we believe that it will be worth considering PMW alternatives that may not offer convergence guarantees as compelling as PMW's but that are much more lightweight. One example may be the relaxed adaptive projection (RAP) [9], which builds a lightweight representation of the dataset by learning a small subset of representative data points using gradient descent. One would have to be willing to forfeit the theoretical convergence guarantees to use this mechanism, and to develop an adaptive version of RAP to support realistic systems settings involving dynamic workloads and data. Even so, some of the core concepts we have proposed in this paper may transfer to this new design, including passing RAP-based estimations through an SV to ensure result accuracy while incorporating a heuristic-based bypass to avoid expensive failures in the SV.
We also touch on several potential vulnerabilities. First, an adversary may craft queries that consume budget by generating cache misses. The convergence proofs in §A.5 provide a bound on how much such queries can affect budget consumption when a straightforward cutoff parameter is configured upfront. Second, response time can be a side channel, which we leave out of scope but which should be addressed in the future. Third, the number of elements in the database (or in each partition) is considered public knowledge. This can leak information and should be addressed by consuming some of the budget to compute this count privately, as done in [40].
Regarding integration of Turbo with a real system, Tumult, we find that it can be done with ease, thanks to Tumult Core's extensible measurement API. We anticipate that such integration will not be as easy or "light touch" in other DP systems we have seen, and in general we see a gap in the core primitives that DP systems (SQL or not) should implement to support extensions such as Turbo; these might include providing direct access to the privacy accountant, decoupling the accountant from the query executor, and others. We encourage the community to work to articulate this set of key primitives, which we suspect will be useful in other extensions beyond Turbo.

Related Work
This paper presents the first design, implementation, and evaluation for a general, effective, and accurate DP-caching system for interactive DP-SQL systems. In computer systems, caching is a heavily-explored topic, with numerous algorithms and implementations [11,16,64], some pervasively used in processors, operating systems, databases, and more. However, traditional forms of caching differ significantly from DP caching, justifying the need for a specialized approach for DP. The primary purposes of traditional caching are to conserve CPU and to improve throughput and latency; for these purposes, existing caches can be readily reused in DP systems. However, DP caching aims to conserve privacy budget, which requires a new design to be truly effective. For example, layering Redis on a DP database to cache query results would save CPU, but for privacy it would be equivalent to the "Exact-Cache" baseline that our evaluation shows is less effective than Turbo. This paper thus builds upon general traditional caching concepts (such as the two-layer design and the principle of generality in supporting multiple workloads) but develops a cache specialized in conserving DP budget.
To our knowledge, no existing DP system incorporates such a specialized caching system. Most DP systems do not incorporate caching capabilities at all [7,12,37,50,53,55]; [62] explicitly leaves the design of an effective DP cache for future work. Some DP systems incorporate what amounts to an Exact-Cache by deterministically generating the same noise upon the arrival of the same query. Three systems consider more sophisticated mechanisms for DP result reuse: PrivateSQL [38], Chorus [36], and CacheDP [48]. But the result reuse components in these systems suffer from such significant limitations that they cannot be considered general and effective caching designs.

PrivateSQL [38] takes a batch of "representative" offline queries and precomputes a private synopsis that answers them all. If new queries arrive (online), PrivateSQL uses the synopsis to answer them in a best-effort way, without accuracy guarantees. It does not learn on-the-fly from them, so it is unsuited for online workloads and does not support data streams.

Chorus [36] provides a trivialized implementation of MWEM, a variant of PMW; however, the implementation only works for databases with a single attribute. The paper neither evaluates the MWEM-based implementation nor integrates it as a caching layer.

CacheDP [48] is an interactive DP query engine and has a built-in DP cache that answers queries using the Matrix Mechanism [43]. Our experience with the CacheDP code suggests that it is not a general, effective, or accurate caching layer for DP databases. First, CacheDP's implementation only scales to a few attributes and does not support parallel composition on data partitions; this suggests that it is not general enough to support a variety of workloads. Second, the "Tree Exact-Cache" baseline with which we compare in evaluation matches, to our understanding, the CacheDP design while scaling to the higher-dimension datasets and streaming workloads we evaluate against. Our evaluation shows that Turbo is more effective than Tree Exact-Cache.
While DP caching is under-explored in systems, the topic of optimizing global privacy budget for a query workload is heavily explored in theory. Approaches include generating synthetic datasets or histograms that can answer certain classes of queries, such as linear queries, with accuracy guarantees and no further privacy consumption [9,13,30,31,44,60]; and optimizing privacy consumption over a batch of queries by adapting the noise distribution to properties of the queries [42,43,49]. Apart from PMW [31], all these methods operate in the offline setting, where queries are known upfront. This setting is unrealistic, as discussed in §3.2.
All of the theory works cited above, including PMW, suffer from another limitation: they operate on static datasets and do not support new data arriving into the system. PMWG [20] is an extension of PMW for dynamic "growing" databases, but operates in a setting where all queries request the entire database. This precludes the use of parallel composition for queries that access less than the entire database, such as queries over windows of time. Other algorithms focus on continuously releasing specific statistics over a stream, such as the streaming counter [35] that inspired our tree structure, and extensions to top-k and histogram queries [14]. These works do not support arbitrary linear queries, and they answer all predefined queries at every time step, while we only pay budget for queries that are actually posed by analysts.

Conclusion
Turbo is a caching layer for differentially-private databases that increases the number of linear queries that can be answered accurately with a fixed privacy guarantee. It employs a PMW, which learns a histogram representation of the dataset from prior query results and can answer future linear queries at no additional privacy cost once it has converged. To enhance the practical effectiveness of PMWs, we bypass them during the privacy-expensive training phase and only switch to them once they are ready. This transforms PMWs from ineffective to very effective compared to simpler cache designs. Moreover, Turbo includes a tree-structured set of histograms that supports timeseries and streaming use cases, taking advantage of fine-grained privacy budget accounting and warm-starting opportunities to further increase the number of answered queries.
Note: This appendix has not been peer-reviewed.

A Theorems and proofs

A.1 Notation
In this section we introduce the following notation, in addition to the notation from Section 4: • For two distributions , ℎ over X, we note  ( ||ℎ) their relative entropy: • For a linear query , () ∈ [0, 1] is the result of  on a datapoint  ∈ X.For a histogram or distribution ℎ we note  • ℎ the dot product: This is also the result of the query  on a normalized database histogram ℎ, so with a slight abuse of notation we alternatively write (ℎ) for  • ℎ. • Given a true distribution , an estimate ℎ and a query , we generally note  * :=  •  the true value of the query, (ℎ) the estimate returned by the histogram, and q a random variable denoting the answer to  returned by a randomized algorithm such as PMW.• We use  as a shorthand for lr and   for variable learning rates, where  is the index of an update.• For streaming databases, we use the standard definition of DP on streams [15,20]: two streams of rows X N are adjacent if they differ exactly at one time (index)  ∈ N.
Note that since time is a public attribute we can group indices by timestamp.• In Alg. 1, we note  the noise added to the threshold at Line 12,  the noise added to the error at Line 22,  the noise potentially added to the true result in the PMW branch at Line 22 and  ′ the noise added to the true result in the Bypass branch at Line 31.We have  ∼  ∼  ∼  ′ ∼  () with  = 1/.A.2 PMW-Bypass Theorem A.1.PMW-Bypass preserves   -DP, for a global privacy budget set upfront in the PRIVACYACCOUNTANT.
Proof. First, we clarify that each call to PRIVACYACCOUNTANT.PAY in Algorithm 1 is a call to a privacy filter [54] operating with pure differential privacy and basic composition. More precisely, we use a type of privacy filter that is suitable for interactive mechanisms such as the SV protocol. We provide a definition in §B and a privacy proof in Thm. B.2. The filter takes a sequence of adaptively chosen DP mechanisms. At each turn, it receives a new mechanism with a requested budget and uses a stopping rule to decide whether it should run the mechanism. In our case, the filter stops answering once the sum of the budgets requested so far exceeds the predetermined global privacy budget.
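The stopping rule described above can be sketched as a small budget tracker under pure DP and basic composition. This is a simplified illustration, not the filter defined in §B: the class name, the `pay` method, and the choice to return a boolean rather than halt are assumptions, and the sketch ignores the interactive-mechanism subtleties handled by the formal definition.

```python
class PureDPFilter:
    """Hypothetical pure-DP privacy filter using basic composition."""

    def __init__(self, global_budget):
        self.global_budget = global_budget
        self.spent = 0.0

    def pay(self, epsilon):
        """Reserve `epsilon` if it fits within the global budget.

        Returns True if the mechanism may run, False if running it would
        push the cumulative budget above the global budget.
        """
        if epsilon < 0:
            raise ValueError("budget must be non-negative")
        if self.spent + epsilon > self.global_budget:
            return False  # stopping rule: refuse to run the mechanism
        self.spent += epsilon  # basic composition: budgets add up
        return True

# Toy usage with binary-exact values to avoid floating-point surprises.
privacy_filter = PureDPFilter(global_budget=1.0)
first = privacy_filter.pay(0.5)    # accepted
second = privacy_filter.pay(0.25)  # accepted, 0.75 spent so far
third = privacy_filter.pay(0.5)    # rejected, would exceed the budget
```

A rejected request leaves the spent budget unchanged, so a later, cheaper mechanism can still run if it fits in the remaining budget.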
Second, we show that each of the three calls to PAY spends the right budget before running each DP mechanism.
1. We use the sparse vector mechanism from [46] on queries with sensitivity Δ = 1/, with cut-off parameter  = 1 and with  1 = ,  2 = 2,  3 = 0.In this case, each SV mechanism is 3-DP.2. After a hard query, we initialize a new SV mechanism (for 3) followed by a Laplace mechanism [27] scaled to the right sensitivity (costing ).The composition of these two steps is 4-DP, thanks to the basic composition theorem for pure DP [27].i.e.  = 1 if the sparse vector thinks the query is accurate, otherwise  = 0. Then the error of this PMW answer q is q −  * = ((ℎ) −  * ) +  (1 − ), and: Fix  > 0. Note  the pdf of Lap().The pdf of the convolution of  and  ′ is: Hence: In both cases we have: , PMW-Bypass is (, )-accurate for each query it answers.
We can also achieve (, )-accuracy with the following slightly smaller , computable with a binary search: Proof.Consider an incoming query .We show that the output q of each branch of Alg. 1 respects the accuracy guarantees.First, we simplify the expression for  from Lemma A.2. Since 1 +  ≤ exp() for  > 0, for any  > 0 we have exp(−) + We first upper bound ln(  ) depending on the sign   of the update.
• First, suppose   = −1.For  > 0, we have exp(−) ≤ 1 −  +  2 /2.We also have () 2 ≤ 1 so: After multipliying by exp (−) we can reuse the same reasoning as above:  Finally, we take a union bound over all the random variables in the system, i.e. at most  queries, not all of them giving updates.After  updates, with probability 1 −  we have: A.3 Tree-structured PMW-Bypass Algorithm 2 formalizes our tree-structured PMW-Bypass caching object and its use.At a high level, we build a binary tree over our data stream, in which each leaf is a partition, and each node maintains a histogram responsible for all the data it spans (which is a superset of nodes lower in the hierarchy).A query is split into sub-queries to minimize the number of nodes, and hence histograms, to query (l.4).We first find the largest consecutive subset of nodes with histograms ready for guesses (that we will not bypass), ll.5-9.This subset will be handled with a PMW-Bypass query, at a DP cost that depends on its size (l.10, see §4.4 for details), and "using" half the failure probability for accuracy (/2 failure probability).This is done ll.11-25.The remaining queries are computed as a bypass query for each node independently (including an histogram update if warranted), combining to  accuracy with the remaining /2 failure probability, ll.26-33.Finally all results are aggregated to answer the original query with  accuracy with probability at least 1 − .Notation.We define functions and data structures used in Alg.2: • For  timestamps, we note I the tree structure introduced in Section 4.4: Proof.The binary search for   gives that with probability 1 −   ( ) over the Monte Carlo randomness we have: Since  ∼ Lap() =⇒  ∼ Lap() for ,  > 0 we have, with the triangle inequality: A union bound gives that with probability 1 − (  ( ) + /2 −   ( )) over the simulation and the Laplace randomness we have: However, we can't reuse Equation 2 here to show that the potential  (  ∥ ℎ   ) of every single node 
decreases.Indeed, a single SV is used to update all the nodes in   in the same direction, so We • For  ∈ N we note I  the binary tree of depth  + 1 whose leftmost leaf is  (thus covering [ , ( + 2) − 1]):  A.5 Bounding the total privacy budget Thm.A.4's convergence bound is expressed as a maximum number of updates, i.e. number of queries that alter the state of the histogram.However, this does not directly bound the total privacy budget used to answer a workload, because some queries can cost budget without updating the histogram.Indeed, apart from the first SV initialization, there are two ways a query  can cost budget: either  goes through the histogram branch in Alg. 1 and triggers an SV reset, or  goes through the Bypass branch and pays for a Laplace query.While an SV reset always yields an update, some Laplace queries cost budget without triggering an external update (Alg. 1, l.33).Thus, it is theoretically possible to craft a workload that consumes an unbounded amount of budget (up to the privacy filter enforced maximum), by issuing queries that go through the Bypass branch and cost budget without yielding external updates.We do not observe this phenomenon in our evaluation, but it is straightforward to prevent this problem by simply deactivating the Bypass branch after a predetermined number of queries.More precisely, in Alg. 1, l.16, we can modify HEURISTIC.ISHISTOGRAMREADY to always return TRUE after  queries, for some parameter .Therefore, at most  queries can cost budget without yielding an update, and the result from Thm. A.4 directly bounds the number of times we spend budget, up to a constant factor.The same reasoning applies to tree-structured PMW-Bypass.A.6 Gaussian mechanism, Rényi DP and Approximate DP The Gaussian mechanism [26] has desirable properties to answer workloads of DP queries.We show how to modify Alg. 
1 to use it as an alternative to the Laplace mechanism. Since the Gaussian mechanism does not satisfy (, 0)-DP, we first need to introduce a more general privacy accountant.

RDP accounting. Alg. 1 can be modified to use Rényi DP (RDP) [51], a form of accounting that offers better composition than basic composition under pure (, 0)-DP guarantees (even for workloads of pure DP mechanisms such as the Laplace or the SV mechanisms). RDP accounting works particularly well for the Gaussian mechanism. The modifications are as follows:
• PRIVACYACCOUNTANT is an RDP filter (Thm. B.2) instead of a pure DP filter.
• For a Laplace mechanism Lap(1/) on a query with ℓ 1 sensitivity 1/ (like our linear queries), instead of paying , the budget for each RDP order  > 1 is: This comes directly from the RDP curve of the Laplace mechanism [51].
• For an SV initialization where each internal Laplace uses Lap(1/), the budget for each RDP order  > 1 is: This comes from the RDP analysis of the SV mechanism [65].More precisely, we use Algorithm 2 of [65] with  = 1, Δ = 1/,  1 =  and  2 = 2.  = Lap(Δ/ on a query with ℓ 2 sensitivity 1/, the budget for each RDP order  > 1 is  2 2 .The default version of Alg. 1 does not use any Gaussian mechanism, but we will add some below.Finally, we can use the RDP-to-DP conversion formula [51] to obtain (, )-DP guarantees for PMW-Bypass for ,  > 0. Gaussian mechanism.Alg. 1 can now be modified to use the Gaussian mechanism to answer queries directly.We keep the internal Laplace random variables used by the SV protocol, although there exists SV protocols with purely Gaussian noise [65].The modifications are as follows: • Line 9 becomes  ← CALIBRATEBUDGET(, );  ← CALIBRATEBUDGETGAUSSIAN(, , , ) for a function defined next.• CALIBRATEBUDGETGAUSSIAN(, , , ) returns: 18 ln 2 + 3 • In Lines 22 and 31, we replace Lap(1/) by N (0,  2 / 2 ) At a high level, we need to calibrate the noise of DP queries answered with the Gaussian mechanism so that they are compatible with the failure probabilities of the Laplace queries.That is, Gaussian mechanism queries need to have the same (or lower) error bound  with the same (or lower) failure probability .The following result shows the Gaussian noise variance  2 / 2 to use for this calibration: Proof.First, we recall that for any  > 0 the upper deviation inequality gives: filter for concurrent composition of interactive mechanisms.That is, for all adversaries A,  ∈ N,  ∈ {0, 1},  > 0,   > 0, Alg. 3 defines an (,   )-RDP mechanism, i.e. if we note   := PrivacyFilterConComp(A, , , ,   ) we have   (  ∥ 1− ) ≤   .
We recall that in practice  can be taken arbitrarily large, as usual in privacy filter proofs [54]. If the adversary wants to run fewer than  mechanisms, they can pass mechanisms that return an empty answer and use  () = 0.
Proof.Take ,   > 0 and  ∈ N. Let's note Ψ the function that parses a view  of the adversary and returns the  mechanisms and their requested budget, ordered by start time: Ψ() = (M 1 ,  1 , . . ., M  ,   ).Note that  2 can depend on the result of the first interactions with M 1 and so on, but once a view is fixed we can extract the underlying privacy parameters.We have: * := exp(( − 1)  ( 0 ∥ 1 )) Pr[ 0 = ] Pr[ 1 = ] where the first equality comes from the definition of the Rényi divergence and the second from the law of iterated expectations.
1. The proof is the same as [39].

2. The Rényi divergence can be extended by continuity to  = +∞, which corresponds to pure differential privacy [51]. In that case, the additive composition rule for RDP becomes the basic composition theorem for pure DP.

C Laplace Histogram baseline
We can consider another simple baseline, Laplace Histogram, that works as follows. First, we compute a noisy estimate for every single bin in X. Then, we can answer any linear query by taking the combination of the noisy estimates. Consider a static, non-partitioned dataset with  datapoints for domain X. Suppose that we want to answer linear queries with absolute error  with probability 1 − . We use pure DP for simplicity.
• Direct Laplace. If we answer each query separately like in the Laplace baseline of Fig. 3 That means that after 146 queries it is more advantageous to use the Laplace Histogram rather than Direct Laplace. However, for a larger domain such as CitiBike (|X| = 604,800), the same calculation shows that we need more than 10,069 queries for Laplace Histogram to outperform Direct Laplace. For this number of queries, Turbo is already close to convergence using much less budget than Exact-Cache (Fig. 8(c)), itself an improvement over Direct Laplace. Partitioning the dataset (e.g., across 50 timestamps) has the same effect as increasing |X|.
Finally, these sketches used basic composition, which is suboptimal for Direct Laplace: using advanced composition would make Direct Laplace more competitive, as the privacy budget grows only in the square root of the number of queries (instead of linearly).
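The gap between linear and square-root growth can be checked numerically. The sketch below assumes the standard Laplace tail bound to calibrate a single query and the textbook advanced composition theorem for the total budget; the function names and the example values are assumptions for illustration. Advanced composition yields an approximate-DP guarantee, with an extra failure probability `delta_prime`, rather than pure DP.

```python
import math

def laplace_epsilon(n, alpha, beta):
    """Budget for one linear query (sensitivity 1/n) answered with Laplace
    noise so that the error exceeds alpha with probability at most beta.
    Uses the tail bound Pr[|Lap(b)| > t] = exp(-t / b) with b = 1 / (n * eps).
    """
    return math.log(1 / beta) / (n * alpha)

def basic_composition(eps, k):
    """Total budget of k eps-DP mechanisms under basic composition."""
    return k * eps

def advanced_composition(eps, k, delta_prime):
    """Total epsilon of k eps-DP mechanisms under advanced composition.
    The composed guarantee is (total, delta_prime)-DP, not pure DP.
    """
    return (math.sqrt(2 * k * math.log(1 / delta_prime)) * eps
            + k * eps * (math.exp(eps) - 1))

# Example values chosen for illustration only.
many_basic = basic_composition(0.01, 10_000)
many_advanced = advanced_composition(0.01, 10_000, 1e-6)
single_advanced = advanced_composition(0.01, 1, 1e-6)
```

With these example values, advanced composition is far cheaper for ten thousand queries but more expensive than basic composition for a single query, so the benefit only appears for large workloads of small-budget queries.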

Fig. 9. Impact of parameters (Question 3). Uses Covid zipf = 1. Being too optimistic or pessimistic about the histogram's state (a), or too aggressive or timid in learning from each update (b), gives poor performance.

Fig. 10. Partitioned static database: system-wide evaluation (Question 5). Turbo is instantiated with tree-structured PMW-Bypass and Exact-Cache. Turbo significantly improves budget consumption compared to both a single Exact-Cache and a tree-structured set of Exact-Caches.

Fig. 11. (a-c) Partitioned streaming database: system-wide consumed budget (Question 7); (d) PMW-Bypass runtime in non-partitioned setting (Question 8). (a-c) Turbo is instantiated with tree-structured PMW-Bypass and Exact-Cache, with and without warm-start. (d) Uses Covid zipf = 1 with one Exact-Cache and PMW-Bypass. Shows execution runtime for different execution paths. The most expensive path is the one where the SV test fails.
1/) has for week 4 for instance, we need noise scaled to 1/ 4 .Thus, we consume a fairly large  for an accurate query to compensate for the smaller  4 .Another option would be to use one histogram per range (i.e.set of contiguous partitions), but that involves maintaining a large state that grows quadratically in the number of partitions.
7 (Convergence of Tree-structured PMW-Bypass on a bounded number of partitions).Consider a partitioned dataset with  = 2  timestamps, with  ∈ N * , containing  datapoints.We note  min the size of the smallest partition.At each update , each node  = (, ) ∈ I of the Tree-structured PMW-Bypass contains a histogram ℎ   approximating the true distribution   of the data on timestamp range [, ].For  ∈ I we note   :=   First, we have  ∈ I   =    (+1) = 1 because each of the  + 1 layers of the binary tree contains nodes covering the  datapoints of the range [0, − 1].Consider an update .Any node  belongs to one of three sets: 1.If  hasn't been updated, ℎ   +1 = ℎ   and  (  ∥ ℎ   +1 ) =  (  ∥ ℎ   ). 2. If  has been updated by a Bypass branch with sign    , Equation 3 shows that with probability 1 −  we have:    is a lower bound on   because Pr[  ∈ I  | Lap(    ln(2/ ) )| >   ] = /2.3. Otherwise,  has been updated by the SV branch.We detail this case below.
the sign of the global update. For each  ∈   , Equation 1 gives: −  • ℎ   ) can have either sign. In other words, if one node  ∈   is close to the answer but others are far from it,  might witness an increase in potential because the SV check considers only the aggregated state of the nodes. Instead, we show that the combined potential of   decreases.

We note  :=  ∈    , ∀ ∈   ,  ′  :=   /,   :=  ∈   ′    and ℎ :=  ∈   ′  ℎ   . Thanks to Equation 6 we have: * and  is a linear query. Since  •   is the true result of the query on   ,  • ℎ   is the combined histogram estimate and  < 1/2, we can apply Equation 2. With probability 1 −  we have:  ∈   ′   (  ∥ ∈   ′   (  ∥ ℎ   ) −  ( − ln(1/)/     ) +  2 /2 i.e.: have 1/4 ≤ 2 and |  | ≤ 2, because 2 is the maximum number of nodes required to cover a contiguous range of [0, 2  − 1]. Since we have  − 2 where the last inequality comes from   +   ≥  min .

Now, let's use this per-update potential drop to obtain a global bound on the number of updates. By convexity and positivity of the relative entropy [19], we have:     (  ∥ ℎ  0 ) ≤ ln |X|. □

Theorem A.8 (Convergence of Tree-structured PMW-Bypass on an unbounded number of partitions). Suppose that we have only a bound  = 2  on the number of contiguous partitions a query can request (but not necessarily any bound on the total number of partitions in the database). Take  min the smallest fraction of datapoints in one partition from any contiguous set of 2 partitions ( min = 1/2 if all the partitions hold the same number of points). We change SPLITQUERY and the histogram structure in Alg. 2 as follows:

• Note that these trees overlap: for (, ) ∈ [ ,  + − 1] 2 with  ≥ 1 we have (, , ) ∈ I but also ( − 1, , ) ∈ I  −1.
• Consider a query requesting a range (, ) ∈ [ ,  +  − 1] × [ ,  + 2 − 1] with  −  + 1 ≤  . As noted above, (, ) might be covered by 2 trees (some windows are covered by a single tree, namely if  < ( + 1) ≤ ), so by convention we pick the rightmost tree. SPLITQUERY returns  ⊆ I  the smallest set of nodes covering [, ].

Then, for all  ∈ N, for  an upper bound on the number of queries allocated to I  , with probability 1 − 2 ( + 1) the number of updates we perform on I  is at most:

Proof. Consider  ∈ N. Consider the set of queries allocated to I  . They all request data from a partitioned dataset with 2 timestamps. Moreover, any update to a histogram in I  must come from that set of queries. Hence we can apply Thm. A.7 on a tree of size 2 . □

A.4 Warm-start

Theorem A.9 (Warm-start). Consider Alg. 1 using for histogram initialization a distribution ℎ 0 instead of uniform. Suppose there exists  ≥ 1 such that ∀ ∈ X, ℎ 0 () ≥ 1  | X | . Then if / <  ≤ 1/2 the number of updates we perform is at most: ( + 2) ln |X|  min ( − 2( + 1) ln(1/ ) ln(2/ ) − /2)

1) is   () =   1 ()-RDP for queries with sensitivity Δ, with  () the Laplace RDP curve with 1 = 1/ 1 . = Lap(2Δ/ 2 ) is   () =   2 ()-RDP for queries with sensitivity 2Δ, with  2 = 1/ 2 . Since   (∞) =  2 < ∞ we can use Point 3 of Theorem 8, so SV is   () +  2 -RDP.

• For a Gaussian mechanism adding noise from N (0,  2 / 2 ) when the mechanisms are non-interactive. □

Theorem B.2 (Pure DP and filter over the RDP curve). 1. If we modify Alg. 3 to take   ∈ R * + A for a set of RDP orders  ⊆ R * + instead of (,   ) ∈ R * + 2 , and if we replace the condition in Line 13 by ∀ ∈ ,  =1   ()+ ′  () >