GUPT: Privacy Preserving Data Analysis Made Easy

Prashanth Mohan (UC Berkeley) prmohan@cs.berkeley.edu
Abhradeep Thakurta* (Pennsylvania State University) azg161@cse.psu.edu
Dawn Song (UC Berkeley) dawnsong@cs.berkeley.edu
David E. Culler (UC Berkeley) culler@cs.berkeley.edu
Elaine Shi (UC Berkeley) elaines@cs.berkeley.edu

* Part of this work was done while visiting UC Berkeley.

ABSTRACT

It is often highly valuable for organizations to have their data analyzed by external agents. However, any program that computes on potentially sensitive data risks leaking information through its output. Differential privacy provides a theoretical framework for processing data while protecting the privacy of individual records in a dataset. Unfortunately, it has seen limited adoption because of the loss in output accuracy, the difficulty in making programs differentially private, the lack of mechanisms to describe the privacy budget in a programmer's utilitarian terms, and the challenging requirement that data owners and data analysts manually distribute the limited privacy budget between queries.

This paper presents the design and evaluation of a new system, GUPT, that overcomes these challenges. Unlike existing differentially private systems such as PINQ and Airavat, it guarantees differential privacy to programs not developed with privacy in mind, makes no trust assumptions about the analysis program, and is secure against all known classes of side-channel attacks.

GUPT uses a new model of data sensitivity that degrades the privacy of data over time. This enables efficient allocation of different levels of privacy for different user applications while guaranteeing an overall constant level of privacy and maximizing the utility of each application. GUPT also introduces techniques that improve the accuracy of output while achieving the same level of privacy. These approaches enable GUPT to easily execute a wide variety of data analysis programs while providing both utility and privacy.

Categories and Subject Descriptors

D.4.6 [Operating Systems]: Security and Protection; H.3.5 [Information Storage and Retrieval]: Online Information Services

Keywords

Algorithms, Differential Privacy, Data Mining, Security

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. SIGMOD '12, May 20-24, 2012, Scottsdale, Arizona, USA. Copyright 2012 ACM 978-1-4503-1247-9/12/05 ...$10.00.

1. INTRODUCTION

Organizations frequently allow third parties to perform business analytics and provide services using aggregated data. For instance, social networks have enabled many applications to be built on social data, providing services such as gaming, car-pooling and online deals. Retailers often share data about their customers' purchasing behavior with product merchants for more effective and targeted advertising, but do not want to reveal individual customers. While sharing information can be highly valuable, companies provide limited access to this data because of the risk that an individual's privacy would be violated.
Laws such as the Health Insurance Portability and Accountability Act (HIPAA) have considered the dangers of privacy loss and stipulate that patients' personally identifiable information should not be shared. Unfortunately, even in datasets with "anonymized" users, there have been cases where user privacy was breached. Examples include the deanonymization of AOL search logs [3], the identification of patients in a Massachusetts hospital by combining the public voters list with the hospital's anonymized discharge list [23], and the identification of users in an anonymized Netflix prize dataset using an IMDB movie rating dataset [18].

Often the value in sharing data can be obtained by allowing analysts to run aggregate queries spanning a large number of entities in the dataset while disallowing analysts from being able to record data that pertains to individual entities. For instance, a merchant performing market research to identify their next product would want to analyze customer behavior in retailers' databases. The retailers might be willing to monetize the dataset and share aggregate analytics with the merchant, but would be unwilling to allow the merchant to extract information specific to individual customers in the database. Researchers have invented techniques ranging from ad-hoc obfuscation of data entries (such as the removal of Personally Identifiable Information) to more sophisticated anonymization mechanisms satisfying privacy definitions like k-anonymity [25] and l-diversity [15]. However, Ganta et al. [8] and Kifer [13] showed that practical attacks can be mounted against all these techniques. A more recent definition called differential privacy [5] formalizes the notion of privacy of an individual in a dataset. Unlike earlier techniques, differentially private mechanisms use statistical bounds to limit the privacy loss an individual incurs from statistical queries run on a dataset. Differential privacy perturbs the result of a computation in a manner that has little effect on aggregates, yet obscures the data of individual constituents.


While differential privacy has strong theoretical properties, the shortcomings of existing differentially private data analysis systems have limited its adoption. For instance, existing programs cannot be leveraged for private data analysis without modification. The magnitude of the perturbation introduced in the final output is another cause of concern for data analysts. Differential privacy systems operate using an abstract notion of privacy, called the 'privacy budget'. Intuitively, a lower privacy budget implies better privacy. However, this unit of privacy does not easily translate into the utility of the program and is thus difficult to interpret for data analysts who are not experts in privacy. Further, analysts are also required to efficiently distribute this limited privacy budget between the multiple queries operating on a dataset. An inefficient distribution of the privacy budget results in inaccurate data analysis and reduces the number of queries that can be safely performed on the dataset.

We introduce GUPT (a Sanskrit word meaning 'secret'), a platform that lets organizations open their datasets to external aggregate analysis while ensuring that the analysis is performed in a differentially private manner. It allows the execution of existing programs with no modifications, eliminating the expensive and demanding task of rewriting programs to be differentially private. GUPT enables data analysts to specify a desired output accuracy rather than work with an abstract privacy budget. Finally, GUPT automatically parallelizes the task across a cluster, ensuring scalability for concurrent analytics. We show through experiments on real datasets that GUPT overcomes many shortcomings of existing differential privacy systems without sacrificing accuracy.

1.1 Contributions

We design and develop GUPT, a platform for privacy-preserving data analytics. We introduce a new model for data sensitivity which applies to a large class of datasets where the privacy requirement of data decreases over time. As we will explain in Section 3.3, using this model is appropriate and allows us to overcome significant challenges that are fundamental to differential privacy. This approach enables us to analyze less sensitive data to get reasonable approximations of privacy parameters that can be used for data queries running on the newer data.

GUPT makes the following technical contributions that make differential privacy usable in practice:

1. Describing the privacy budget in terms of accuracy: Data analysts are accustomed to working with inaccurate output (as is the case with data sampling over large datasets, and many machine learning algorithms have probabilistic outputs). GUPT uses the aging model of data sensitivity to allow analysts to describe the abstract 'privacy budget' in terms of the expected accuracy of the final output.

2. Privacy budget distribution: GUPT automatically allocates a privacy budget to each query in order to match the data analysts' accuracy requirements. Further, the analyst also does not have to distribute the privacy budget between the individual data operations in the program.

3. Accuracy of output: GUPT extends a theoretical differential privacy framework called "sample and aggregate" (described in Section 2.1) for practical applicability. This includes a novel data resampling technique that reduces the error introduced by the framework's data partitioning scheme. Further, the aging model of data sensitivity allows GUPT to select an optimal partition size that reduces the perturbation added for differential privacy.
4. Preventing side channel attacks: GUPT defends against side channel attacks such as the privacy budget attacks, state attacks and timing attacks described in [10].

2. BACKGROUND

Differential privacy places privacy research on a firm theoretical foundation. It guarantees that the presence or absence of a particular record in a dataset will not significantly change the output of any computation on a statistical dataset. An adversary thus learns approximately the same information about any individual record, irrespective of its presence or absence in the original dataset.

Definition 1 (ε-differential privacy [5]). A randomized algorithm A is ε-differentially private if for all datasets T, T′ ∈ D^n differing in at most one data record and for any set of possible outputs O ⊆ Range(A), Pr[A(T) ∈ O] ≤ e^ε · Pr[A(T′) ∈ O]. Here D is the domain from which the data records are drawn.

The privacy parameter ε, also called the privacy budget [16], is fundamental to differential privacy. Intuitively, a lower value of ε implies a stronger privacy guarantee, and a higher value implies a weaker privacy guarantee while possibly achieving higher accuracy.

2.1 Sample and Aggregate

Algorithm 1: Sample and Aggregate Algorithm [24]
Input: Dataset T ∈ R^n, length of the dataset n, privacy parameter ε, output range (min, max).
1: Let l = n^0.4
2: Randomly partition T into l disjoint blocks T_1, ..., T_l.
3: for i ∈ {1, ..., l} do
4:   O_i ← Output of user application on dataset T_i.
5:   If O_i > max, then O_i ← max.
6:   If O_i < min, then O_i ← min.
7: end for
8: A ← (1/l) Σ_{i=1}^{l} O_i + Lap(|max − min| / (l·ε))

GUPT leverages and extends the "sample and aggregate" framework [24, 19] (SAF) to design a practical and usable system which guarantees differential privacy for arbitrary applications. Given a statistical estimator P(T), where T is the input dataset, SAF constructs a differentially private statistical estimator P̂(T) using P as a black box. Moreover, theoretical analysis guarantees that the output of P̂(T) converges to that of P(T) as the size of the dataset T increases.

As the name "sample and aggregate" suggests, the algorithm first partitions the dataset into smaller subsets, i.e., l = n^0.4 blocks (call them T_1, ..., T_l); see Figure 1. The analytics program P is applied on each of these datasets T_i and the outputs O_i are recorded. The O_i's are then clamped to within an output range that is either provided by the analyst or inferred using a range estimator function (refer to Section 4.1 for more details). Finally, a differentially private average of the O_i's is calculated by adding Laplace noise scaled according to the output range. This noisy final output is differentially private. The complete algorithm is provided in Algorithm 1.
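To make the control flow of Algorithm 1 concrete, the following is a minimal Python sketch of the sample and aggregate procedure. It treats the analysis program as a black-box callable, as GUPT does; the function name sample_and_aggregate, the use of NumPy, and the single real-valued output are illustrative assumptions rather than details of GUPT's implementation.

import numpy as np

def sample_and_aggregate(T, program, epsilon, out_min, out_max, rng=None):
    """Differentially private estimate of program(T) via sample and aggregate.

    T        : 1-D numpy array holding the dataset (n records)
    program  : black-box function mapping a data block to a real number
    epsilon  : privacy budget for this query
    out_min, out_max : output range used for clamping and noise calibration
    """
    rng = np.random.default_rng() if rng is None else rng
    n = len(T)
    l = int(n ** 0.4)                      # number of blocks (default choice from [24])

    # Randomly partition T into l disjoint blocks T_1, ..., T_l.
    perm = rng.permutation(n)
    blocks = np.array_split(T[perm], l)

    # Run the untrusted program on each block and clamp its output.
    outputs = [min(max(program(block), out_min), out_max) for block in blocks]

    # Average the clamped outputs and add Laplace noise scaled to the
    # aggregation's sensitivity, |max - min| / l.
    noise = rng.laplace(loc=0.0, scale=(out_max - out_min) / (l * epsilon))
    return float(np.mean(outputs)) + noise

# Example: a private mean of ages clamped to [0, 150].
if __name__ == "__main__":
    ages = np.random.default_rng(0).integers(0, 100, size=10_000)
    print(sample_and_aggregate(ages, np.mean, epsilon=1.0, out_min=0, out_max=150))

Because the program is only ever invoked on individual blocks, any estimator that can run on a subset of the data can be plugged in unchanged.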


Figure 1: An instantiation of the Sample and Aggregate Framework [24]. The dataset T is split into blocks T_1, T_2, ..., T_l; the program f is run on each block; the block outputs f(T_1), ..., f(T_l) are averaged and Laplace noise is added to produce the private output.

Note that the choice of the number of blocks l = n^0.4 is from [24] and is used here for completeness. For improved choices of l, see Section 4.3.

GUPT extends the conventional SAF described above in the following ways: i) Resampling: GUPT introduces the use of data resampling to improve the experimental accuracy of SAF without degrading the privacy guarantee; ii) Optimal block allocation: GUPT further improves experimental accuracy by finding better block sizes (as compared to the default choice of n^0.6) using the aging of sensitivity model explained in Section 3.3.

2.2 Related Work

A number of advances in differential privacy have sought to improve the accuracy of very specific types of data queries such as linear counting queries [14], graph queries [12] and histogram analysis [11]. A recent system called PASTE [21] allows queries on time series data where the data is stored on distributed nodes and no trust is placed in the central aggregator. In contrast to PASTE, GUPT trusts the aggregator with storing all of the data but provides a flexible system that supports many different types of data analysis programs.

While systems tailored for specific tasks could potentially achieve better output accuracy, GUPT trades this for the generality of the platform. We show through experimental results that GUPT achieves reasonable accuracy for problems like clustering and regression, and can even perform better than existing customized systems.

Other differential privacy systems such as PINQ [16] and Airavat [22] have also attempted to operate on a wide variety of data queries. PINQ (Privacy INtegrated Queries) proposed programming constructs which enable application developers to write differentially private programs using basic functional building blocks of differential privacy (e.g., the exponential mechanism [17], noisy counts [5], etc.). It does not consider the application developer to be an adversary. It further requires developers to rewrite the application to make use of the PINQ primitives. On the other hand, Airavat was the first system that attempted to run unmodified programs in a differentially private manner. It however required the programs to be written for the Map-Reduce programming paradigm [4]. Further, Airavat only considers the map program to be an "untrusted" computation, while the reduce program is "trusted" to be implemented in a differentially private manner. In comparison, GUPT allows for the private analysis of a wider range of unmodified programs. GUPT also introduces techniques that allow data analysts to specify their privacy budget in units of output accuracy. Section 7.3 presents a detailed comparison of GUPT with PINQ, Airavat and the sample and aggregate framework.

Similar to iReduct [28], GUPT introduces techniques that reduce the relative error (in contrast to the absolute error).
Both systems use a smaller privacy budget for programs that produce larger outputs, since for the same absolute error the relative error is smaller than for programs that generate smaller values. While iReduct optimizes the distribution of the privacy budget across multiple queries, GUPT matches the relative error to the privacy budget of individual queries.

3. PROBLEM SETUP

There are three logical parties:

1. The analyst/programmer, who wishes to perform aggregate data analytics over sensitive datasets. Our goal is to make GUPT easy to use for an average programmer who is not a privacy expert.
2. The data owner, who owns one or more datasets and would like to allow analysts to perform data analytics over the datasets without compromising the privacy of users in the dataset.
3. The service provider, who hosts the GUPT service.

The separation between these parties is logical; in reality, either the data owner or a third-party cloud service provider could host GUPT.

Trust assumptions: We assume that the data owner and the service provider are trusted, and that the analyst is untrusted. In particular, the programs supplied by the analyst may act maliciously and try to leak information. GUPT defends against such attacks using the security mechanisms proposed in Section 6.

3.1 GUPT Overview

Figure 2: Overview of GUPT's Architecture. The data owner supplies a dataset and a total privacy budget (ε); the data analyst supplies a computation, an accuracy goal and an output range, and receives a differentially private answer. A web frontend, the dataset manager, the computation manager and an XML-RPC layer mediate between the analyst and the untrusted computation, which runs inside isolated execution chambers.

The building blocks of GUPT are shown in Figure 2:

• The dataset manager is a database that registers instances of the available datasets and maintains the available privacy budget.


• The computation manager instantiates computations and seamlessly pipes data from the dataset to the appropriate instances.

• The isolated execution chambers isolate and prevent any malicious behavior by the computation instances.

These building blocks allow the principals of the system to interact easily, with many parameters being automatically optimized or left as optional arguments for experts.

Interface with the data owner: The data owner provides (a) a multi-dimensional dataset (such as a database table) that, for the purpose of our discussion, we assume is a collection of real-valued vectors, (b) a total privacy budget that can be allocated for computations on the dataset, and (c) [optional] input attribute ranges, i.e., the lower and upper bound for each dimension of the data. A detailed discussion of these bounds is presented in Section 4.1.

For privacy reasons, the input ranges provided should not contain any sensitive information. For example, it is reasonable to expect that a majority of households' annual incomes fall within the range [0; 500,000] dollars, so this range is not considered sensitive. On the other hand, if the richest household's income is used as the upper bound, private information could be leaked. In this case, a public information source such as the national GDP could be used.

Interface with the analyst: The analyst provides (a) the data analytics program, (b) a reference to the dataset in the dataset manager, (c) either a privacy budget or the desired accuracy of the final answer, and finally (d) an output range or a helper function for estimating the output range.

A key requirement for the analytics program is that it should be able to run on any subset of the original dataset. As GUPT executes the application in a black-box fashion, the program may also be provided as a binary executable.

Privacy budget distribution: In order to guarantee ε-differential privacy for a dataset T, the privacy budget should be distributed among the various applications (call them A_1, ..., A_k) that operate on T. A composition lemma [5] states that if A_1, ..., A_k guarantee ε_1, ..., ε_k-differential privacy respectively, then running all of them on T is ε-differentially private, where ε = Σ_{i=1}^{k} ε_i. Thus judicious allocation of the privacy budget is extremely important. Unlike existing differential privacy solutions, GUPT relieves the analyst and the data owner from the task of distributing this limited privacy budget between multiple data analytics programs.

3.2 Privacy and Utility Guarantees

GUPT guarantees ε-differential privacy for the final output. It provides utility guarantees similar to the original sample and aggregate algorithm from [24]. This guarantee applies to queries satisfying the "approximate normality" condition defined by Smith [24] (by an approximately normal statistic we refer to the generic asymptotically normal statistic in Definition 2 of [24]), who also observed that a wide class of queries satisfy this normality condition, examples being various maximum-likelihood estimators and estimators for regression problems. The formal utility guarantees are deferred to Appendix A.

GUPT provides the same level of privacy for queries that are not approximately normal.
Reasonable utility can be expected even from such queries, although no theoretical guarantees are provided.

3.3 Aging of Sensitivity

In real-life datasets, the potential privacy threat for each record is different. A privacy mechanism that considers this can obtain good utility while satisfying the privacy constraints. We introduce a new model called aging of sensitivity of data, in which older data records are considered to have lower privacy requirements. GUPT uses this model to optimize some parameters of the sample and aggregate framework, such as the block size (Section 4.3) and the privacy budget allocation (Section 5.1). Consider the following motivating example:

Example 1. Let T_70yrs and T_now be two datasets containing the ages of citizens in a particular region 70 years earlier and at present, respectively. It is conceivable that the privacy threat to T_70yrs is much lower, as many members of the participating population may be deceased. Although T_70yrs might not be as useful as T_now for learning specific trends about the current population, it can be used to learn some general concepts about T_now. For example, a crude estimate of the maximum age present in T_now can be obtained from T_70yrs.

GUPT estimates such general trends in the data distribution and uses them to optimize the performance of the system. The optimization results in a significant reduction in error. More precisely, the aged data is used for the following: i) to estimate an optimal block size for use in the sample and aggregate framework, ii) to identify the minimum privacy budget needed to estimate the final result within a given accuracy bound, and iii) to appropriately distribute the privacy budget ε across various tasks and queries.

For simplicity of exposition, the particular aging model in our analysis is that a constant fraction of the dataset has completely aged out, i.e., the privacy of the entries in this constant fraction is no longer a concern. In reality, if the aged data is still weakly privacy sensitive, then it is possible to privately estimate these parameters by introducing an appropriate magnitude of noise into the calculations. The weak privacy of aged data allows us to keep this noise low enough that the estimated parameters are still useful. Existing techniques [1, 26] have attempted to use progressive aggregation of old data in order to reduce its sensitivity. The use of differentially private operations for aggregation could potentially be exploited to generate our training datasets. These complementary approaches offer exciting opportunities that have not been explored in this paper.

It is important to mention that GUPT does not require the aging model for its default functionality. The default parameter choices allow it to work well in a generic setting. However, our experiments show that the aging model provides an additional improvement in performance.

4. ACCURACY IMPROVEMENTS

The sample and aggregate framework (Algorithm 1) introduces two sources of error:

• Estimation error: This arises because the query is evaluated on a smaller data block, rather than the entire dataset. Typically, the larger the block size, the smaller the estimation error.

• Noise: The other source of error is the Laplace noise introduced to guarantee differential privacy.


Intuitively, the larger the number of blocks, the lower the sensitivity of the aggregation function, since the aggregation function has sensitivity s/l, where s denotes the output range of each block and l denotes the number of blocks. As a result, given a fixed privacy parameter ε, a larger number of blocks lowers the magnitude of the Laplace noise.

GUPT uses two strategies (resampling and selecting the optimal block size) to reduce these types of errors. Before delving into the details of the techniques, the following section explains how the output range for a given analysis program is computed. This range is used to decide the amount of noise to be added to the final output.

4.1 Output Range Estimation

The sample and aggregate framework described in Algorithm 1 does not describe a mechanism to obtain the range within which the output can lie. This is needed to estimate the noise that should be added for differential privacy. GUPT implements this requirement by providing the following mechanisms:

1. GUPT-tight: The analyst specifies a tight range for the output.

2. GUPT-loose: The analyst only provides a loose range for the output. In this case, the computation is run on each data block and the outputs are recorded. A differentially private percentile estimation algorithm [24] is then applied to the set of outputs to privately compute the 25th and the 75th percentile values. These values are used as the range of the output and are supplied to Algorithm 1.

3. GUPT-helper: The analyst could also provide a range translation function. If either (a) the input range is not present or (b) only a very loose range for the input is available (e.g., using the national GDP as an upper bound on annual household income), then GUPT runs the same differentially private percentile estimation algorithm on the inputs to privately compute the 25th and the 75th percentiles (a.k.a. the lower and upper quartiles) of the inputs. This is used as a tight approximation of the input range. The analyst-supplied range translation function is then invoked to convert the "tight" input range into an estimate of the output range.

Our experiments demonstrate that one can get good results for a large class of problems using the noisy lower and upper quartiles as approximations of the output range. If the input dataset is multi-dimensional, the range estimation algorithm is run independently for each dimension. Note that the choice of the 25th and 75th percentiles above is somewhat arbitrary. In fact, one can choose a larger inter-percentile range (e.g., the 10th and 90th percentiles) if there are more data samples. However, this does not affect the asymptotic behavior of the algorithm.
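GUPT relies on the differentially private percentile estimation algorithm of [24] for this step. The sketch below is a simplified stand-in based on the exponential mechanism and is not the exact algorithm used by GUPT; the function name dp_quantile, the use of NumPy, and the guard against zero-width intervals are assumptions made purely for illustration.

import numpy as np

def dp_quantile(values, q, epsilon, lo, hi, rng=None):
    """Simplified differentially private quantile via the exponential mechanism.

    values  : 1-D array of real values (one data dimension, or block outputs)
    q       : target quantile in (0, 1), e.g. 0.25 or 0.75
    epsilon : privacy budget spent on this single quantile estimate
    lo, hi  : publicly known loose range used to clip the values
    """
    rng = np.random.default_rng() if rng is None else rng
    x = np.clip(np.sort(np.asarray(values, dtype=float)), lo, hi)
    x = np.concatenate(([lo], x, [hi]))          # m + 2 boundary points
    m = len(values)

    # Utility of reporting a point inside interval (x[i], x[i+1]) is -|i - q*m|;
    # changing one record changes this rank-based utility by at most 1.
    idx = np.arange(len(x) - 1)
    utility = -np.abs(idx - q * m)
    widths = np.diff(x)

    # Exponential mechanism: sample an interval with probability proportional
    # to its width times exp(epsilon * utility / 2), then a uniform point in it.
    log_w = np.log(np.maximum(widths, 1e-12)) + epsilon * utility / 2.0
    probs = np.exp(log_w - log_w.max())
    probs /= probs.sum()
    i = rng.choice(len(probs), p=probs)
    return rng.uniform(x[i], x[i + 1])

# Noisy inter-quartile range used as an output-range estimate (GUPT-loose style).
if __name__ == "__main__":
    block_outputs = np.random.default_rng(1).normal(40, 5, size=500)
    lo_q = dp_quantile(block_outputs, 0.25, epsilon=0.5, lo=0, hi=150)
    hi_q = dp_quantile(block_outputs, 0.75, epsilon=0.5, lo=0, hi=150)
    print(lo_q, hi_q)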
4.2 Resampling

The variance in the final output is due to two sources of randomness: i) partitioning the dataset into blocks and ii) the Laplace noise added. The following resampling technique (a variant of which was suggested by Adam Smith) can be used to reduce the variance due to partitioning the dataset into blocks. Instead of requiring each data entry to reside in exactly one block (as described in the original sample and aggregate framework [24]), each data entry can now reside in multiple blocks. The resampling factor γ denotes the number of blocks in which each data entry resides. If the number of records in the dataset is n, the block size is β and each data entry resides in γ blocks, it is easy to see that the number of blocks is l = γn/β.

To incorporate resampling, we make the following modifications to Algorithm 1. Lines 1 and 2 are modified as follows: consider l = γn/β bins of size β each; the i-th entry from the dataset T is picked and randomly placed into γ bins that are not full; this process is performed for all the entries in the dataset T. In Line 8, the Laplace noise is changed to Lap(β|max − min| / (n·ε)). The rest of Algorithm 1 is left intact.

The main benefit of resampling is that it reduces the variance due to partitioning the dataset into blocks without increasing the noise needed for the same level of privacy.

Claim 1. With the same privacy level ε, resampling with any γ ∈ Z+ does not increase the Laplace noise being added (for a fixed block size β).

Proof. Since each record appears in γ blocks, Laplace noise of magnitude O(γs/(ε·l)) = O(sβ/(ε·n)) should be added to preserve ε-differential privacy. This means that once the block size is fixed, the noise is independent of the factor γ.

Intuitively, with γ > 1 the variance due to partitioning the dataset into blocks is reduced without increasing the Laplace noise (added in Step 8 of Algorithm 1) needed for the same level of privacy. Consider the following example to get a better understanding of the intuition.

Example 2. Let T be a dataset (with n records) of the ages of a population and max be the maximum age in the dataset. The objective is to find the average age (Av) in this dataset. Let Âv be the average age of a dataset formed with n^0.6 uniformly random samples drawn from T with replacement. The expectation of Âv equals the true average Av. However, the variance of Âv will not be zero (unless all the entries in T are the same). Let O = (1/ψ) Σ_{i=1}^{ψ} Âv(i), where Âv(i) is the i-th independent computation of Âv mentioned above and ψ is some constant. Notice that O has the same expected value as Âv but the variance is reduced by a factor of ψ. Hence, resampling reduces the variance in the final output O without introducing bias.

The above example is a simplified version of the actual resampling process: in the actual process, each data block of size n^0.6 is allowed to have only one copy of each data entry of T. However, even with this inaccurate representation, the example captures the essence of the underlying phenomenon.

In practice, the resampling factor γ is picked such that it is reasonably large without increasing the computation overhead significantly. Notice that the increase in accuracy with the increase of γ becomes insignificant beyond a threshold.
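A sketch of the modified partitioning step is shown below. The text above describes placing each record into γ bins that are not yet full; this sketch instead realizes the same end state (every record in exactly γ blocks of size β, with no block containing a record twice, when β divides n) by drawing γ independent random partitions, which is an assumption of this illustration. The function names are hypothetical.

import numpy as np

def resampled_blocks(T, beta, gamma, rng=None):
    """Blocks for sample-and-aggregate with resampling factor gamma (Section 4.2).

    Realized here as gamma independent random partitions of T into blocks of
    size beta, yielding l = gamma * n / beta blocks in which each record
    appears gamma times (assuming beta divides n) and never twice per block.
    """
    rng = np.random.default_rng() if rng is None else rng
    T = np.asarray(T)
    n = len(T)
    blocks = []
    for _ in range(gamma):
        perm = rng.permutation(n)
        for start in range(0, (n // beta) * beta, beta):
            blocks.append(T[perm[start:start + beta]])
    return blocks

def private_estimate_with_resampling(T, program, epsilon, out_min, out_max,
                                     beta, gamma, rng=None):
    """Run the black-box program on every block and add Laplace noise of scale
    beta * |max - min| / (n * epsilon), as in the modified Line 8 above."""
    rng = np.random.default_rng() if rng is None else rng
    blocks = resampled_blocks(T, beta, gamma, rng)
    outputs = [min(max(program(b), out_min), out_max) for b in blocks]
    noise = rng.laplace(scale=beta * (out_max - out_min) / (len(T) * epsilon))
    return float(np.mean(outputs)) + noise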


4.3 Selecting the Optimal Block Size

In this section, we address the following question: given a fixed privacy budget, how do we pick the optimal block size to maximize the accuracy of the private output? Observe that increasing the block size β increases the noise magnitude, but reduces the estimation error. Therefore, the question boils down to: how do we select a block size that balances the estimation error and the noise? The following example elucidates why answering this question is important.

Example 3. Consider the same age dataset T used in Example 2. If our goal is to find the average of the entries in T while preserving privacy, then it can be observed that (ignoring resampling) the optimal size of each block is one, which attains the optimal balance between the estimation error and the noise. If the block size is one, then the expected error is O(1/n), where n is the size of the dataset. However, if we use the default block size (i.e., n^0.6), the expected error is O(1/n^0.4), which is much higher.

As a result, choosing the block size based on the specific task helps to reduce the final error to a large extent. The optimal block size varies from problem to problem. For example, in k-means clustering or logistic regression the optimal block size has to be much larger than one.

Let l = n^α be the optimal number of blocks, where α is a parameter to be ascertained. Hence, n^(1−α) is the block size. (For simplicity of exposition we do not consider resampling.) Let f : R^(k×n) → R be the query which is to be computed on the dataset T. Let the data blocks be represented as T_1, ..., T_l. Let s be the sensitivity of the query, i.e., the absolute value of the maximum change that any f(T_i) can undergo if any one entry of T is changed. With the above parameters in place, the ε-differentially private output from the sample and aggregate framework is

    f̂(T) = (1/n^α) Σ_{i=1}^{n^α} f(T_i) + Lap(s / (ε·n^α))    (1)

Assume that the entries of the dataset T are drawn i.i.d., and that there exists a dataset T^np (with entries drawn i.i.d. from the same distribution as T) whose privacy we do not care about under the aging of sensitivity model. Let n_np be the number of entries in T^np. We will use the aged dataset T^np to estimate the optimal block size. Specifically, we partition T^np into blocks of size β = n^(1−α). The number of blocks l_np in T^np is therefore l_np = n_np / n^(1−α). Notice that l_np is a function of α, whose optimal value has to be found. Also note that the minimum value of α must satisfy the inequality n_np ≥ n^(1−α).

One possible approach for finding a good value of α is to minimize the empirical error in the final output. The empirical error in the output of f̂ is defined as

    A + B,  where  A = | (1/l_np) Σ_{i=1}^{l_np} f(T_i^np) − f(T^np) |  and  B = √2 · s / (ε·n^α)    (2)

Here A characterizes the estimation error, and B is due to the Laplace noise added. We can minimize Equation 2 w.r.t. α for α ∈ [1 − log n_np / log n, 1]. Conventional techniques like hill climbing can be used to obtain a local minimum.

The α computed above is used to obtain the optimal number of blocks. Since the computation involves only the non-private database T^np, there is no effect on the overall privacy.
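A sketch of this block size search on the aged dataset might look as follows. The helper names (empirical_error, optimal_alpha) are hypothetical, the black-box query is passed in as a Python callable, and a coarse grid search stands in for the hill-climbing step mentioned above.

import numpy as np

def empirical_error(alpha, T_np, program, n, s, epsilon):
    """Equation 2: estimation error on the aged data plus the Laplace noise term."""
    T_np = np.asarray(T_np)
    beta = max(1, int(round(n ** (1.0 - alpha))))        # block size n^(1 - alpha)
    l_np = len(T_np) // beta
    if l_np < 1:
        return np.inf                                    # aged data too small for this alpha
    blocks = np.array_split(T_np[: l_np * beta], l_np)
    estimation = abs(np.mean([program(b) for b in blocks]) - program(T_np))
    noise = np.sqrt(2) * s / (epsilon * n ** alpha)
    return estimation + noise

def optimal_alpha(T_np, program, n, s, epsilon, grid=50):
    """Pick alpha (number of blocks l = n^alpha) minimizing the empirical error.

    A coarse grid search over the feasible range [1 - log(n_np)/log(n), 1]
    stands in for the hill-climbing procedure of Section 4.3.
    """
    lo = max(0.0, 1.0 - np.log(len(T_np)) / np.log(n))
    alphas = np.linspace(lo, 1.0, grid)
    errors = [empirical_error(a, T_np, program, n, s, epsilon) for a in alphas]
    return float(alphas[int(np.argmin(errors))])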
5. PRIVACY BUDGET MANAGEMENT

In differential privacy, the analyst is expected to specify the privacy goals in terms of an abstract privacy budget ε. The analyst performs the data analysis task optimizing it for accuracy goals and the availability of computational resources. These metrics do not directly map onto the abstract privacy budget. It should be noted that even a privacy expert might be unable to map the privacy budget into accuracy goals for arbitrary problems. In this section we describe the mechanisms that GUPT uses to convert accuracy goals into a privacy budget and to efficiently distribute a given privacy budget across different analysis tasks.

5.1 Estimating the Privacy Budget for Accuracy Goals

In this section we seek to answer the question: how can GUPT pick an appropriate ε, given a fixed accuracy goal? Specifically, we wish to minimize the ε parameter to maximally preserve the privacy budget. It is often more intuitive to specify an accuracy goal rather than a privacy parameter ε, since accuracy relates to the problem at hand.

Similar to the previous section, we assume the existence of an aged dataset T^np (drawn from the same distribution as the original dataset T) whose privacy is not a concern. Consider an analyst who wishes to guarantee an accuracy ρ with probability 1 − δ, i.e., the output should be within a factor ρ of the true value. We wish to estimate an appropriate ε from an aged dataset T^np of size n_np. Let β denote the desired block size. To estimate ε, first the permissible standard deviation σ of the output is calculated for the specified accuracy goal ρ, and then the following optimization problem is solved: solve for ε under the constraints that 1) the expression in Equation 3 equals σ², and 2) α = max{0, log(n/β)}.

    C + D,  where  C = (1/n^α) · (1/l_np) Σ_{i=1}^{l_np} ( f(T_i^np) − (1/l_np) Σ_{j=1}^{l_np} f(T_j^np) )²  and  D = 2s² / (ε²·n^(2α))    (3)

In Equation 3, C denotes the variance in the estimation error and D denotes the variance in the output due to noise.

To calculate σ from the accuracy goal, we can rely on Chebyshev's inequality: Pr[ |f̂(T) − E(f(T_i))| > φσ ] < 1/φ². Furthermore, assuming that the query f is an approximately normal statistic, we have |E(f(T_i)) − Truth| = O(1/β). Therefore Pr[ |f̂(T) − Truth| > φσ + O(1/β) ] < 1/φ². To meet the output accuracy goal of ρ with probability 1 − δ, we set σ ≃ √δ · |1 − ρ| · f(T^np). Here, we have assumed that the true answer is f(T^np) and that 1/β ≪ σ/√δ.

Since the above calculations assume that the true answer is f(T^np), an obvious question is: why not simply output f(T^np) as the answer? It can be shown that in many cases the private output will be much better than f(T^np). If the assumption that 1/β ≪ σ/√δ does not hold, then the above technique for selecting the privacy budget would produce suboptimal results. This however does not compromise the privacy properties that GUPT maintains, as it explicitly limits the total privacy budget allocated to queries accessing a particular dataset.
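The following sketch shows one way to carry out this calculation on the aged dataset. It reads the constraint α = max{0, log(n/β)} as a base-n logarithm, so that n^α equals the number of blocks n/β, and solves C + D = σ² for ε in closed form; the function name and the feasibility checks are illustrative assumptions, not GUPT's implementation.

import numpy as np

def epsilon_for_accuracy(T_np, program, n, beta, s, rho, delta):
    """Solve Equation 3 for the smallest epsilon meeting an accuracy goal.

    The analyst asks for outputs within a factor rho of the true value with
    probability 1 - delta; T_np is the aged (non-private) dataset and s the
    query's sensitivity on a single block.
    """
    T_np = np.asarray(T_np)
    alpha = max(0.0, np.log(n / beta) / np.log(n))       # number of blocks l = n^alpha
    l_np = len(T_np) // beta
    if l_np < 1:
        raise ValueError("aged dataset is smaller than one block")
    blocks = np.array_split(T_np[: l_np * beta], l_np)
    outs = np.array([program(b) for b in blocks])

    # Permissible output standard deviation implied by the accuracy goal
    # (Chebyshev bound): sigma ~ sqrt(delta) * |1 - rho| * f(T_np).
    sigma = np.sqrt(delta) * abs(1.0 - rho) * program(T_np)

    # C: variance contributed by the estimation error (first term of Equation 3).
    C = outs.var() / (n ** alpha)
    if sigma ** 2 <= C:
        raise ValueError("accuracy goal unattainable at this block size")

    # D = 2 s^2 / (epsilon^2 n^(2 alpha)) must account for the remaining variance.
    return np.sqrt(2.0 * s ** 2 / ((sigma ** 2 - C) * n ** (2 * alpha)))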


5.2 Automatic Privacy Budget Distribution

Differential privacy is an alien concept for most analysts. Further, the proper distribution of the limited privacy budget across multiple computations requires significant mathematical expertise. GUPT eliminates the need to manually distribute the privacy budget between tasks. The following example highlights the need for an efficient privacy budget distribution rather than distributing the budget equally among the various tasks.

Example 4. Consider the same age census dataset T from Example 2. Suppose we want to find the average age and the variance present in the dataset while preserving differential privacy. Assume that the maximum possible human age is max and the minimum age is zero. Assume that the non-private variance is computed as (1/n) Σ_{i=1}^{n} (T(i) − Av_priv)², where Av_priv is the private estimate of the average and n is the size of T. If an entry of T is modified, the average Av changes by at most max/n; however, the variance can change by at most max²/n. Let ε_1 and ε_2 be the privacy levels expected for the average and the variance respectively, with the total privacy budget being ε = ε_1 + ε_2. Now, if ε_1 = ε_2, then the error in the computation of the variance will be on the order of max times larger than that in the computation of the average. Whereas if the privacy budget were distributed as ε_1 : ε_2 = 1 : max, then the noise in both the average and the variance would be roughly the same.

Given a privacy budget ε, we need to use it to compute various queries f_1, ..., f_m privately. If the private estimation of query f_i requires privacy budget ε_i, then the total privacy budget spent is Σ_{i=1}^{m} ε_i (by the composition property of differential privacy [5]). The privacy budget is distributed as follows. Let ζ_i/ε_i be the standard deviation of the Laplace noise added by GUPT to ensure privacy level ε_i. Allocate the privacy budget by setting ε_i = (ζ_i / Σ_{j=1}^{m} ζ_j) · ε. The rationale behind this approach is that the variance in the computation by GUPT is usually dominated by the variance of the Laplace noise added. Hence, distributing ε across the various tasks using the technique above ensures that the variance due to Laplace noise in the private output of each f_i is the same.

6. SYSTEM SECURITY

GUPT is designed as a hosted platform where the analyst is not trusted. It is thus important to ensure that the untrusted computation is not able to access the datasets directly. Additionally, it is important to prevent the computation from exhausting resources or compromising the service provider. To this end, the "computation manager" is split into a server component that interacts with the user and a client component that runs on each node in the cluster. The trusted client is responsible for instantiating the computation in an isolated execution environment. The isolated environment ensures that the computation can only communicate with a trusted forwarding agent which sends the messages to the computation manager.

6.1 Access Control

GUPT uses a mandatory access control (MAC) framework to ensure that (a) communication between different instances of the computation is disallowed and (b) each instance of the computation can only store state (or modify data) within its own scratch space. This is the only component of GUPT that depends upon a platform-dependent implementation. On Linux, the LSM framework [27] has enabled many MAC frameworks such as SELinux and AppArmor to be built. GUPT defines a simple AppArmor policy for each instance of the computation, setting its working directory to a temporary scratch space that is emptied upon program termination. AppArmor does not yet allow fine-grained control to limit network activity to individual hosts and ports. Thus the "computation manager" is split into a server and a client component.
The client component of the computation manager allows GUPT to disable all network activity for the untrusted computation and restrict IPC to the client.

We determined an empirical estimate of the overhead introduced by the AppArmor sandbox by executing an implementation of k-means clustering on GUPT 6,000 times. We found that the sandboxed version of GUPT was only 0.012 times slower than the non-sandboxed version (an overhead of 1.26%).

6.2 Protection against side-channel attacks

Haeberlen et al. [10] identified three possible side-channel attacks against differentially private systems: i) state attacks, ii) privacy budget attacks, and iii) timing attacks. GUPT is not vulnerable to any of these attacks.

State attacks: An adversarial program can modify some internal state (e.g., change the value of a static variable) when it encounters a specific data record. An adversary can then look at the state to figure out whether the record was present in the dataset. Both PINQ (in its current implementation) and Airavat are vulnerable to state attacks, though it is conceivable that data computations in PINQ could be isolated using .NET AppDomains. Since GUPT executes the complete analysis program (which may be adversarial) in isolated execution chambers and allows the analyst to access only the final differentially private output, state attacks are automatically protected against.

Privacy budget attack: In this attack, on encountering a particular record, the adversarial program issues additional queries that exhaust the remaining privacy budget. [10] noted that PINQ is vulnerable to this attack. GUPT protects against privacy budget attacks by managing the privacy budget itself, instead of letting the untrusted program perform the budget management.

Timing attacks: In a timing attack, the adversarial program could consume an unreasonably long amount of time to execute (perhaps by entering an infinite loop) when it encounters a specific data record. GUPT protects against this attack by setting a predefined bound on the number of cycles for which the data analyst's program runs on each data block. If the computation on a particular data block completes before the predefined number of cycles, then GUPT waits for the remaining cycles before producing an output from that block. If the computation exceeds the predefined number of cycles, the computation is killed and a constant value within the expected output range is produced as the output of the program running on the data block under consideration.

Note that with the scheme above, the runtime of GUPT is independent of the data. Hence, the number of execution cycles does not reveal any information about the dataset. The proof that the final output is still differentially private under this scheme follows directly from the privacy guarantee of the sample and aggregate framework and the fact that a change in one data entry can affect only one data block (ignoring resampling). Thus GUPT is not vulnerable to timing attacks. Neither PINQ nor Airavat protects against timing attacks [10].
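GUPT enforces this bound in terms of execution cycles. The sketch below approximates the same idea with a wall-clock deadline and a separate worker process, which is an assumption of this illustration rather than GUPT's actual mechanism; the helper names are hypothetical, and the untrusted program must be a picklable top-level function for multiprocessing to run it.

import time
import multiprocessing as mp

def _worker(program, block, queue):
    # Runs in a separate process so a runaway computation can be killed.
    queue.put(program(block))

def run_block_fixed_time(program, block, deadline_s, fallback):
    """Run the untrusted program on one block with a fixed wall-clock budget.

    If the computation finishes early, we sleep until the deadline; if it
    overruns, it is terminated and a constant within the expected output
    range (fallback) is returned. Either way the observable runtime is
    roughly independent of the data in the block.
    """
    start = time.monotonic()
    queue = mp.Queue()
    proc = mp.Process(target=_worker, args=(program, block, queue))
    proc.start()
    proc.join(timeout=deadline_s)

    if proc.is_alive():                 # overran its budget: kill it, use the fallback
        proc.terminate()
        proc.join()
        result = fallback
    else:
        try:
            result = queue.get(timeout=1.0)   # small grace period for the pipe flush
        except Exception:                     # crashed worker or empty queue
            result = fallback

    remaining = deadline_s - (time.monotonic() - start)
    if remaining > 0:                   # pad fast computations up to the deadline
        time.sleep(remaining)
    return result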


7. EVALUATION

For each data analysis program, the program binary and its interfaces with the GUPT "computation manager" should be provided. For arbitrary binaries, a lean wrapper program can be used for marshaling data to and from the format of the computation manager.

In this section, we show, using results from running common machine learning algorithms (such as k-means clustering and logistic regression on a life sciences dataset), that GUPT does not significantly affect the accuracy of data analysis. Further, we show that GUPT not only relieves analysts from the burden of distributing a privacy budget between data transformation operations, it also manages to provide superior output accuracy. Finally, we show through benchmarks the scalability of the GUPT architecture and the benefits of using aged data to estimate optimal values of the privacy budget and block sizes.

7.1 Case Study: Life Sciences dataset

We evaluate the efficacy of GUPT using the ds1.10 life sciences dataset taken from http://komarix.org/ac/ds as a motivating example for data analysis. This dataset contains the top 10 principal components of chemical/biological compounds, with each of the 26,733 rows representing a different compound. Additionally, the reactivity of the compound is available as an additional component. A k-means clustering experiment enables us to cluster compounds with similar features together, and logistic regression builds a linear classifier for the experiment (e.g., predicting carcinogens). It should be noted that these experiments only provide estimates as the final answer, e.g., the cluster centroids in the case of k-means. We show in this section that the perturbation introduced by GUPT only affects the final result marginally.

7.1.1 Output Accuracy

Figure 3: Effect of the privacy budget on the accuracy of prediction using logistic regression on the life sciences dataset (accuracy of GUPT-tight versus the non-private baseline for privacy budgets ε from 2 to 10).

As mentioned in Section 4, any analysis performed using GUPT has two sources of error: (a) an estimation error, introduced because each instance of the computation works on a smaller subset of the data, and (b) the Laplace noise that is added in order to guarantee differential privacy. In this section, we show the effect of these errors when running logistic regression and k-means on the life sciences dataset.

GUPT can be used to run existing programs with no modifications, thus drastically reducing the overhead of writing privacy preserving programs. Analysts using GUPT are free to use their favorite software packages written in any language. To demonstrate this property, we evaluate black-box implementations of logistic regression and k-means clustering on the life sciences dataset.

Figure 4: Intra-cluster variance for k-means clustering on the life sciences dataset (baseline ICV, GUPT-loose and GUPT-tight for privacy budgets ε from 0.4 to 4.0).

Logistic Regression: The logistic regression software package from Microsoft Research (the Orthant-Wise Limited-memory Quasi-Newton Optimizer for L1-regularized Objectives) was used to classify the compounds in the dataset as carcinogens and non-carcinogens.
Figure 3 shows the accuracy of GUPT for different privacy budgets. When the package was run on the dataset directly, a baseline accuracy of 94% was obtained. The same package, when run using the GUPT framework, classified carcinogens with an accuracy between 75% and 80%. To understand the source of the error, the non-private algorithm was executed on a data block of size n/n^0.4 records, and the accuracy dropped to 82%. It was thus determined that much of the error stems from the loss of accuracy when the algorithm is run on smaller blocks of the entire dataset. For datasets of increasingly large size, this error is expected to diminish.

k-means Clustering: Figure 4 shows the cluster variance computed from a k-means implementation run on the life sciences dataset. The x-axis is the choice of privacy budget ε, and the y-axis is the normalized intra-cluster variance (ICV), defined as (1/n) Σ_{i=1}^{K} Σ_{x⃗ ∈ C_i} ‖x⃗ − c⃗_i‖²₂, where K denotes the number of clusters, C_i denotes the set of points within the i-th cluster, and c⃗_i denotes the center of the i-th cluster. A standard k-means implementation from the scipy Python package is used for the experiment.

The k-means implementation was run using GUPT with different configurations for calculating the output range (Section 4.1). For GUPT-tight, a tight range for the output is taken to be the exact minimum and maximum of each attribute (for all 10 attributes). For GUPT-loose, a loose output range is fixed as [min × 2, max × 2], where min and max are the actual minimum and maximum for that attribute. Figure 4 shows that with an increasing privacy budget ε, the amount of Laplace noise added to guarantee differential privacy decreases, thereby reducing the intra-cluster variance, i.e., making the answer more accurate. It can also be seen that when GUPT is provided with reasonably tight bounds on the output range (GUPT-tight), the output of the k-means experiment is very close to a non-private run of the experiment even for small values of the privacy budget. If only loose bounds are available (GUPT-loose), then a larger privacy budget is required for the same output accuracy.

7.1.2 Budget Distribution between Operations

In GUPT, the program is treated as a black box and noise is only added to the output of the entire program.


Figure 5: The total perturbation introduced by GUPT does not change with the number of operations in the utility function (normalized intra-cluster variance for PINQ-tight with ε = 2 and ε = 4 versus GUPT-tight with ε = 1 and ε = 2, at 20, 80 and 200 k-means iterations).

Thus the number of operations performed in the program itself is irrelevant. A problem with writing specialized differentially private algorithms, as in the case of PINQ, is that given a privacy budget ε for the task, it is difficult to decide how much ε to spend on each query, since it is difficult to determine the number of iterations needed ahead of time. PINQ requires the analyst to pre-specify the number of iterations in order to allocate the privacy budget between iterations. This is often hard to do, since many data analysis algorithms such as PageRank [20] and recursive relation queries [2] require iterative computation until the algorithm reaches convergence. The performance of PINQ thus depends on the ability to accurately predict the number of iterations. If the specified number of iterations is too small, then the algorithm may not converge. On the other hand, if the specified number of iterations is too large, then much more noise than required will be added, which both slows down the convergence of the algorithm and harms its accuracy. Figure 5 shows the effect of PINQ on accuracy when performing k-means clustering on the dataset. In this example, the program output for the dataset converges within a small number of iterations, e.g., n = 20. If a larger number of iterations (e.g., n = 200) is conservatively chosen, then PINQ's performance degrades significantly. On the other hand, GUPT produces the same amount of perturbation irrespective of the number of iterations in k-means. Further, it should be noted that PINQ was subjected to a weaker privacy constraint (ε = 2 and 4) than GUPT (ε = 1 and 2).

7.1.3 Scalability

Figure 6: Change in computation time for an increased number of iterations in k-means (non-private, GUPT-helper and GUPT-loose at 20, 80, 100 and 200 iterations).

Using a server with two Intel Xeon 5550 quad-core CPUs and the entire dataset loaded in memory, we compare the execution time of an unmodified (non-private) instance and a GUPT instance of the k-means experiment.

Figure 7: CDF of query accuracy for the privacy budget allocation mechanisms (GUPT-helper with constant ε = 1, constant ε = 0.3 and variable ε, against the expected accuracy).

If a tight output range (i.e., GUPT-tight) is not available, the output range estimation phase of the sample and aggregate framework typically takes up most of the CPU cycles. When only a loose range for the input is available (i.e., GUPT-helper), a differentially private percentile estimation is performed on all of the input data. This is an O(n ln n) operation, n being the number of data records in the original dataset. On the other hand, if even a loose range for the output is available (i.e., GUPT-loose), then the percentile estimation is performed only on the outputs of the blocks in the sample and aggregate framework, of which there are typically around n^0.4. This results in significantly reduced run-time overhead.
The overhead introduced by GUPT is independent of the actual computation time. Thus, as the computation time increases, the overhead introduced by GUPT diminishes in comparison. Further, there is an additional speed-up since each of the computation instances works on a smaller subset of the entire dataset. It should be noted that the reduction in computation time thus achieved could also potentially be achieved by the computational task running without GUPT. Figure 6 shows that the overall completion time of the private versions of the program increases slowly compared to the non-private version as we increase the number of iterations of k-means clustering.

7.2 Using Aged Data

GUPT uses an aged dataset (that is no longer considered privacy sensitive) drawn from a distribution similar to that of the real dataset. Section 4.3 describes the use of aged data to estimate an optimal block size that reduces the error introduced by data sampling. Section 5.1 describes how data analysts who are not privacy experts can continue to describe only their accuracy goals yet achieve differentially private outputs. Finally, Section 5.2 uses aged data to automatically distribute a privacy budget between different queries on the same dataset. In this section, we show experimental results that support the claims made in Sections 4.3 and 5.1.

7.2.1 Privacy Budget Estimation

To illustrate the ease with which GUPT can be used by data analysts, we evaluate the efficiency of GUPT by executing queries that are not provided with a privacy budget. We use a census income dataset from the UCI machine learning repository [7] which consists of 32,561 entries. The age data from this dataset is used to calculate the average age. A reasonably loose range of [0, 150] was enforced on the output; the true average age is 38.5816.


Figure 8: Increased lifetime of the total privacy budget using the privacy budget allocation mechanism (normalized privacy budget lifetime for GUPT-helper with constant ε = 1, constant ε = 0.3 and variable ε).

Initially, the experiment was run with constant privacy budgets of ε = 1 and ε = 0.3. GUPT allows the analyst to provide looser constraints such as "90% result accuracy for 90% of the results" and allocates only as much privacy budget as is required to meet these properties. In this experiment, 10% of the dataset was assumed to be completely privacy insensitive and was used to estimate ε given a pre-determined block size. Figure 7 shows the CDF of the output accuracy both for the constant privacy budget values and for the accuracy requirement. Interestingly, not only does the figure show that the accuracy guarantees are met by GUPT, but it also shows that if the analyst were to define the privacy budget manually (as in the case of ε = 1 or ε = 0.3), then either too much or too little privacy budget is used. The privacy budget estimation technique thus has the additional advantage that the lifetime of the total privacy budget for a dataset is extended. Figure 8 shows that if we were to run the average age query with the above constraints over and over again, GUPT would be able to run 2.3 times more queries than with a constant privacy budget of ε = 1.

7.2.2 Optimal Block Size Estimation

Figure 9: Change in error for different block sizes (normalized RMSE of the mean and median queries for ε = 2 and ε = 6, block sizes β from 1 to 70).

Section 4.3 shows that the estimation error decreases with an increase in the data block size, whereas the noise decreases with an increased number of blocks. The optimal trade-off point between the block size and the number of data blocks differs between queries executed on the dataset. To illustrate the trade-off, we show results from queries executed on an Internet advertisement dataset, also from the UCI machine learning repository [7]. Figure 9 shows the normalized root mean square error (from the true value) in estimating the mean and median aspect ratio of advertisements shown on Internet pages, with privacy budgets ε of 2 and 6. In the case of the "mean" query, since the averaging operation is already performed by the sample and aggregate framework, smaller data blocks reduce the noise added to the output and thus provide more accurate results. As expected, we see that the ideal block size is one. For the "median" query, it is expected that increasing the block size would generate more accurate inputs to the averaging function. Figure 9 shows that when the "median" query is executed with ε = 2, the error is minimal for a block size of 10. With further increases in block size, the noise added to compensate for the reduced number of blocks has a dominating effect.
7.3 Qualitative Analysis

In this section, GUPT is contrasted with both PINQ and Airavat on various fronts (see Table 1 for a summary). We also list the significant changes introduced by GUPT in order to mold the sample and aggregate framework (SAF) [24] into a practically useful one.

                                            GUPT   PINQ   Airavat
Works with unmodified programs              Yes    No     No
Allows expressive programs                  Yes    Yes    No
Automated privacy budget allocation         Yes    No     No
Protection against privacy budget attack    Yes    No     Yes
Protection against state attack             Yes    No     No
Protection against timing attack            Yes    No     No

Table 1: Comparison of GUPT, PINQ and Airavat

Unmodified programs: Because PINQ [16] is an API that provides a set of low-level data manipulation primitives, applications need to be re-written to perform all operations using these primitives. Airavat [22], on the other hand, implements the Map-Reduce programming paradigm [4] and requires that the analyst split the data analysis program into an "untrusted" map program and a reduce aggregator that is "trusted" to be differentially private. In contrast, GUPT treats the complete application program as a black box, and as a result the entire application program is deemed untrusted.

Expressiveness of the program: PINQ provides a limited set of primitives for data operations. If the required primitives are not already available, a privacy-unaware analyst would be unable to ensure privacy for the output. Airavat also severely restricts the expressiveness of the programs that can run in its framework: a) the "untrusted" map program is completely isolated for each data element and cannot save any global state, and b) the number of key-value pairs generated by the mapper is restricted. Many machine learning algorithms (such as clustering and classification) require global state and complex aggregation functions; implementing them in Airavat would be infeasible without placing much of the logic in the "trusted" reducer program. GUPT places no restriction on the application program, and thus does not degrade its expressiveness.


Privacy budget distribution: As was shown in Section 7.1.2, PINQ requires the analyst to allocate a privacy budget for each operation on the data. An inefficient distribution of the budget either adds too much noise or uses up too much of the budget. Like GUPT, Airavat spends a constant privacy budget on the entire program. However, neither Airavat nor PINQ provides any support for distributing an aggregate privacy budget across multiple data analysis programs. Using the aging of sensitivity model, GUPT provides a mechanism for efficiently distributing an aggregate privacy budget across multiple data analysis programs.

Side channel attacks: The current implementation of PINQ places the onus of protecting against side channel attacks on the program developer. As was noted in [10], although Airavat protects against privacy budget attacks, it remains vulnerable to state attacks. GUPT defends against both state attacks and privacy budget attacks (see Section 6.2).

Other differences: GUPT extends SAF with a novel data resampling mechanism that reduces the variance in the output induced by data sub-sampling. Using the aging of sensitivity model, GUPT overcomes a fundamental issue in differential privacy not considered previously (to the best of our knowledge): for an arbitrary data analysis application, how do we describe an abstract privacy budget in terms of utility? The model also allows us to further reduce the error in SAF by estimating a reasonable block size.

8. DISCUSSION

Approximately normal computation: GUPT does not provide any theoretical guarantees on the accuracy of the outputs if the applied computation is not approximately normal, or if the data entries are not independently and identically distributed (i.i.d.). However, GUPT still guarantees that the differential privacy of individual data records is always maintained. In practice, GUPT provides reasonably accurate results for many queries that do not satisfy approximate normality. For some real-world datasets, such as sensing and streaming data that have temporal correlation and do not satisfy the i.i.d. property, the current implementation of GUPT fails to provide reasonable output. In future work we intend to extend GUPT to cater to these kinds of datasets.

Ordering of multiple outputs: When the data analysis program yields multi-dimensional outputs, it may not be able to guarantee the same ordering of outputs for different data blocks. In this case, the outputs have to be sorted into some canonical form before the averaging operation is applied. For example, in the case of k-means clustering, each block can yield a different ordering of the k cluster centers; in our experiments, the cluster centers were sorted according to their first coordinate.

8.1 Limitations

Foreknowledge of output dimension: GUPT assumes that the output dimensions are known in advance. This may not always be true in practice. For example, Support Vector Machines (SVMs) output an indefinite number of support vectors. Unless the dimension is fixed in advance, private information may leak through the dimensionality itself. Another problem is that if the output dimensions for different blocks differ, averaging the outputs is not meaningful. A partial solution is to clamp or pad the output dimension to a predefined number. We note that this limitation is not unique to GUPT; for example, Airavat [22] suffers from the same problem in the sense that its mappers have to output a fixed number of (key, value) pairs.
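As an illustration of both points, the canonical ordering used for k-means (sorting centers by their first coordinate) and the clamp-or-pad workaround can be combined in a small helper. The function below is our own sketch, not code from GUPT; the zero-padding default is an arbitrary choice for illustration.

```python
import numpy as np

def canonicalize_block_output(centers, k, dim):
    """Sort and clamp/pad a block's k-means output before averaging.

    Cluster centers are sorted by their first coordinate so that all blocks
    agree on an ordering, and the output is padded or truncated to exactly
    k centers of `dim` coordinates so every block contributes the same
    output dimension.
    """
    centers = sorted(centers, key=lambda c: c[0])   # canonical ordering
    fixed = np.zeros((k, dim))                      # pad missing centers with zeros
    for i, c in enumerate(centers[:k]):             # drop any extra centers
        c = np.asarray(c)[:dim]                     # clamp to `dim` coordinates
        fixed[i, :len(c)] = c
    return fixed.ravel()                            # flatten for coordinate-wise averaging
```

Per-block outputs canonicalized this way can then be averaged coordinate by coordinate in the aggregation step, which adds Laplace noise to each coordinate.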
Inherits limitations of differential privacy: GUPT inherits some of the issues that are common to many differential privacy mechanisms. For high-dimensional outputs, the privacy budget needs to be split across the multiple output dimensions. The privacy budget also needs to be split across multiple data mining tasks and users. Given a fixed total privacy budget, the more tasks and queries the budget is divided over, the more noise is added, and at some point the magnitude of the noise may render the data analysis unusable. In Section 5.2, we outlined a potential method for managing privacy budgets; alternative approaches such as auctioning of the privacy budget [9] can also be used. Further, while differential privacy guarantees the privacy of individual records in the dataset, many real-world applications will want to enforce higher-level privacy concepts such as user-level privacy [6]. In cases where multiple records may contain information about the same user, user-level privacy needs to be accounted for accordingly.

9. CONCLUSION AND FUTURE WORK

GUPT makes privacy-preserving data analytics easy for privacy non-experts. The analyst can upload arbitrary data mining programs, and GUPT guarantees the privacy of the outputs. We propose novel improvements to the sample and aggregate framework that enhance the usability and accuracy of the data analysis. Through a new model that reduces the sensitivity of the data over time, GUPT is able to represent privacy budgets as accuracy guarantees on the final output. Although this model is not strictly required for the default functioning of GUPT, it improves both usability and accuracy. Through the efficient distribution of the limited privacy budget between data queries, GUPT ensures that more queries meeting both their accuracy and privacy goals can be executed on the dataset. Through experiments on real-world datasets, we demonstrate that GUPT can achieve reasonable accuracy in private data analysis.

Acknowledgements

We would like to thank Adam Smith, Daniel Kifer, Frank McSherry, Ganesh Ananthanarayanan, David Zats, Piyush Srivastava and the anonymous reviewers of SIGMOD for their insightful comments that have helped improve this paper. This project was funded by Nokia, Siemens, Intel through the ISTC for Secure Computing, NSF awards (CPS-0932209, CPS-0931843, BCS-0941553 and CCF-0747294) and the Air Force Office of Scientific Research under MURI Grants (22178970-4170 and FA9550-08-1-0352).

10. REFERENCES

[1] N. Anciaux, L. Bouganim, H. van Heerde, P. Pucheral, and P. M. Apers. Data degradation: Making private data less sensitive over time. In CIKM, 2008.

[2] F. Bancilhon and R. Ramakrishnan. An amateur's introduction to recursive query processing strategies. In SIGMOD, 1986.


[3] M. Barbaro and T. Zeller. A face is exposed for AOL searcher no. 4417749. The New York Times, Aug. 2006.

[4] J. Dean and S. Ghemawat. MapReduce: Simplified data processing on large clusters. In OSDI, 2004.

[5] C. Dwork, F. McSherry, K. Nissim, and A. Smith. Calibrating noise to sensitivity in private data analysis. In TCC, 2006.

[6] C. Dwork, M. Naor, T. Pitassi, and G. N. Rothblum. Differential privacy under continual observation. In STOC, 2010.

[7] A. Frank and A. Asuncion. UCI machine learning repository, 2010.

[8] S. R. Ganta, S. P. Kasiviswanathan, and A. Smith. Composition attacks and auxiliary information in data privacy. In KDD, 2008.

[9] A. Ghosh and A. Roth. Selling privacy at auction. http://arxiv.org/abs/1011.1375.

[10] A. Haeberlen, B. C. Pierce, and A. Narayan. Differential privacy under fire. In USENIX Security, 2011.

[11] M. Hay, V. Rastogi, G. Miklau, and D. Suciu. Boosting the accuracy of differentially private histograms through consistency. Proc. VLDB Endow., 3:1021–1032, September 2010.

[12] V. Karwa, S. Raskhodnikova, A. Smith, and G. Yaroslavtsev. Private analysis of graph structure. In VLDB, 2011.

[13] D. Kifer. Attacks on privacy and de Finetti's theorem. In SIGMOD, 2009.

[14] C. Li, M. Hay, V. Rastogi, G. Miklau, and A. McGregor. Optimizing linear counting queries under differential privacy. In PODS, 2010.

[15] A. Machanavajjhala, J. Gehrke, D. Kifer, and M. Venkitasubramaniam. l-diversity: Privacy beyond k-anonymity. In ICDE, 2006.

[16] F. McSherry. Privacy integrated queries: An extensible platform for privacy-preserving data analysis. In SIGMOD, 2009.

[17] F. McSherry and K. Talwar. Mechanism design via differential privacy. In FOCS, 2007.

[18] A. Narayanan and V. Shmatikov. Robust de-anonymization of large sparse datasets. In IEEE Symposium on Security and Privacy, 2008.

[19] K. Nissim, S. Raskhodnikova, and A. Smith. Smooth sensitivity and sampling in private data analysis. In STOC, 2007.

[20] L. Page, S. Brin, R. Motwani, and T. Winograd. The PageRank citation ranking: Bringing order to the web. Technical Report 1999-66, Stanford InfoLab, 1999.

[21] V. Rastogi and S. Nath. Differentially private aggregation of distributed time-series with transformation and encryption. In SIGMOD, 2010.

[22] I. Roy, S. T. V. Setty, A. Kilzer, V. Shmatikov, and E. Witchel. Airavat: Security and privacy for MapReduce. In NSDI, 2010.

[23] A. Serjantov and G. Danezis. Towards an information theoretic metric for anonymity. In PET, 2002.

[24] A. Smith. Privacy-preserving statistical estimation with optimal convergence rates. In STOC, 2011.

[25] L. Sweeney. k-anonymity: A model for protecting privacy. International Journal on Uncertainty, Fuzziness and Knowledge-based Systems, 2002.

[26] H. van Heerde, M. Fokkinga, and N. Anciaux. A framework to balance privacy and data usability using data degradation. In CSE, 2009.

[27] C. Wright, C. Cowan, S. Smalley, J. Morris, and G. Kroah-Hartman. Linux security modules: General security support for the Linux kernel. In USENIX Security, 2002.

[28] X. Xiao, G. Bender, M. Hay, and J. Gehrke. iReduct: Differential privacy with reduced relative errors. In SIGMOD, 2011.

APPENDIX

A. THEORETICAL GUARANTEES

We combine the privacy guarantees of the different pieces used in the system to present a final privacy guarantee. We provide three different privacy guarantees based on how the output range is estimated.

Theorem 1 (Privacy Guarantee for GUPT).
Let T ∈ R^{k×n} be a dataset and let f : R^{k×n} → R^p be the query. GUPT is ɛ-differentially private if the following holds:

1. GUPT uses GUPT-helper with a loose range for the input: Execute the percentile estimation algorithm defined in [24] for each of the k input dimensions with privacy parameter ɛ/(2k), and then run the sample and aggregate framework (SAF) with privacy parameter ɛ/(2p) for each output dimension.

2. GUPT uses GUPT-tight: Run SAF with privacy parameter ɛ/p for each output dimension.

3. GUPT uses GUPT-loose: For each output dimension, run the percentile estimation algorithm defined in [24] with privacy parameter ɛ/(2p), and then run SAF with privacy parameter ɛ/(2p).

The proof of this theorem follows directly from the privacy guarantees of each module of GUPT and the composition theorem of [5]. In terms of utility, we claim the following about GUPT.

Theorem 2 (Utility Guarantee for GUPT). Let f : R^{k×n} → R^p be a generic asymptotically normal statistic (see Definition 2 in [24]). Let T be a dataset of size n drawn i.i.d. from some underlying distribution F, and let f̂(T) be the statistic computed by GUPT.

1. If GUPT uses GUPT-tight, then f̂(T) converges to f(T) in distribution.

2. Let f be a statistic that differs by at most γ (under some distance metric d) on two datasets T and T̃, where T̃ is obtained by clamping the dataset T on each dimension to the 25th and 75th percentiles of that dimension. If GUPT uses GUPT-helper with a loose input range, then d(f̂(T̃), f(T)) ≤ γ as n → ∞.

3. If GUPT uses GUPT-loose, then f̂(T) converges in distribution to f(T) as n → ∞, as long as k, 1/ɛ, and log(|max − min|) are bounded from above by sufficiently small polynomials in n, where |max − min| is the loose output range provided.

The proof follows from an analysis similar to that used in [24].
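For instance, in case 1 of Theorem 1 the sequential composition theorem of [5] accounts for the total budget as follows; this is a worked check using only the allocations stated in the theorem.

```latex
% Case 1 of Theorem 1: total privacy cost under sequential composition [5].
% k input dimensions each spend eps/(2k) on percentile (range) estimation,
% and p output dimensions each spend eps/(2p) inside SAF.
\underbrace{k \cdot \frac{\epsilon}{2k}}_{\text{input range estimation}}
\;+\;
\underbrace{p \cdot \frac{\epsilon}{2p}}_{\text{SAF, per output dimension}}
\;=\; \frac{\epsilon}{2} + \frac{\epsilon}{2} \;=\; \epsilon .
```

Cases 2 and 3 compose in the same way, spending ɛ/p per output dimension, and ɛ/(2p) twice per output dimension, respectively.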
