GUPT: Privacy Preserving Data Analysis Made Easy - Computer ...

size that will allow us to balance the estimation error and the noise? The following example elucidates why answering the above question is important.

Example 3. Consider the same age dataset $T$ used in Example 2. If our goal is to find the average of the entries in $T$ while preserving privacy, then it can be observed that (ignoring resampling) the optimal size of each block is one, which attains the optimal balance between the estimation error and the noise. With a block size of one, the expected error is $O(1/n)$, where $n$ is the size of the dataset. However, with the default block size (i.e., $n^{0.6}$), the expected error is $O(1/n^{0.4})$, which is much higher.

As a result, choosing the block size based on the specific task helps to reduce the final error to a large extent. The optimal block size varies from problem to problem. For example, in k-means clustering or logistic regression the optimal block size has to be much larger than one.

Let $l = n^{\alpha}$ be the optimal number of blocks, where $\alpha$ is a parameter to be ascertained. Hence, $n^{1-\alpha}$ is the block size. (For simplicity of exposition we do not consider resampling.) Let $f : \mathbb{R}^{k \times n} \to \mathbb{R}$ be the query which is to be computed on the dataset $T$. Let the data blocks be represented as $T_1, \cdots, T_l$. Let $s$ be the sensitivity of the query, i.e., the absolute value of the maximum change that any $f(T_i)$ can have if any one entry of $T$ is changed. With the above parameters in place, the $\epsilon$-differentially private output from the sample and aggregate framework is

\[
\hat{f}(T) = \frac{1}{n^{\alpha}} \sum_{i=1}^{n^{\alpha}} f(T_i) + \mathrm{Lap}\!\left(\frac{s}{\epsilon n^{\alpha}}\right) \qquad (1)
\]

Assume that the entries of the dataset $T$ are drawn i.i.d., and that there exists a dataset $T^{np}$ (with entries drawn i.i.d. from the same distribution as $T$) whose privacy is not a concern under the aging of sensitivity model. Let $n_{np}$ be the number of entries in $T^{np}$. We will use the aged dataset $T^{np}$ to estimate the optimal block size. Specifically, we partition $T^{np}$ into blocks of size $\beta = n^{1-\alpha}$.
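The estimator of Equation 1 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not GUPT's implementation: the function names `laplace` and `sample_and_aggregate` are hypothetical, the sensitivity `s` is passed in by the caller, resampling is ignored as in the text, and records that do not fill a whole block are simply dropped.

```python
import random


def laplace(scale, rng):
    """Draw Laplace(0, scale) noise as the difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def sample_and_aggregate(data, f, alpha, sensitivity, epsilon, rng):
    """Equation 1 without resampling (hypothetical helper).

    Splits `data` into about n^alpha equal-sized blocks, averages f over the
    blocks, and adds Laplace noise with scale s / (epsilon * number_of_blocks).
    """
    n = len(data)
    num_blocks = max(1, round(n ** alpha))  # l = n^alpha, rounded to an integer
    block_size = n // num_blocks            # leftover records are dropped (assumption)
    blocks = [data[i * block_size:(i + 1) * block_size] for i in range(num_blocks)]
    estimate = sum(f(block) for block in blocks) / num_blocks
    return estimate + laplace(sensitivity / (epsilon * num_blocks), rng)


def mean(block):
    return sum(block) / len(block)


if __name__ == "__main__":
    rng = random.Random(0)
    ages = [20 + (i % 50) for i in range(1000)]  # synthetic ages, not real data
    # Assumed sensitivity: width of the output range [0, 150].
    print(sample_and_aggregate(ages, mean, 0.4, 150.0, 1.0, rng))
```

Increasing `alpha` produces more, smaller blocks: the Laplace scale shrinks, while each per-block estimate of `f` rests on fewer records. That is the trade-off the block-size discussion is about.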
The number of blocks $l_{np}$ in $T^{np}$ is therefore $l_{np} = n_{np} / n^{1-\alpha}$. Notice that $l_{np}$ is a function of $\alpha$, whose optimal value has to be found. Also note that the minimum value of $\alpha$ must satisfy the following inequality: $n_{np} \ge n^{1-\alpha}$.

One possible approach for achieving a good value of $\alpha$ is to minimize the empirical error in the final output. The empirical error in the output of $\hat{f}$ is defined as

\[
\underbrace{\left| \frac{1}{l_{np}} \sum_{i=1}^{l_{np}} f(T^{np}_i) - f(T^{np}) \right|}_{A} + \underbrace{\frac{\sqrt{2}\, s}{\epsilon n^{\alpha}}}_{B} \qquad (2)
\]

Here $A$ characterizes the estimation error, and $B$ is due to the Laplace noise added. We can minimize Equation 2 w.r.t. $\alpha$ over $\alpha \in [1 - \log n_{np} / \log n, 1]$. Conventional techniques like hill climbing can be used to obtain a local minimum. The $\alpha$ computed above is used to obtain the optimal number of blocks. Since the computation involves only the non-private database $T^{np}$, there is no effect on overall privacy.

5. PRIVACY BUDGET MANAGEMENT

In differential privacy, the analyst is expected to specify the privacy goals in terms of an abstract privacy budget $\epsilon$. The analyst performs the data analysis task, optimizing it for accuracy goals and the availability of computational resources. These metrics do not directly map onto the abstract privacy budget. It should be noted that even a privacy expert might be unable to map the privacy budget into accuracy goals for arbitrary problems. In this section we describe mechanisms that GUPT uses to convert the accuracy goals into a privacy budget and to efficiently distribute a given privacy budget across different analysis tasks.

5.1 Estimating Privacy Budget for Accuracy Goals

In this section we seek to answer the question: How can GUPT pick an appropriate $\epsilon$, given a fixed accuracy goal? Specifically, we wish to minimize the $\epsilon$ parameter to maximally preserve the privacy budget.
It is often more intuitive to specify an accuracy goal rather than a privacy parameter $\epsilon$, since accuracy relates to the problem at hand.

Similar to the previous section, we assume the existence of an aged dataset $T^{np}$ (drawn from the same distribution as the original dataset $T$) whose privacy is not a concern. Consider an analyst who wishes to guarantee an accuracy $\rho$ with probability $1 - \delta$, i.e., the output should be within a factor $\rho$ of the true value. We wish to estimate an appropriate $\epsilon$ from an aged dataset $T^{np}$ of size $n_{np}$. Let $\beta$ denote the desired block size. To estimate $\epsilon$, first the permissible standard deviation $\sigma$ in the output is calculated for the specified accuracy goal $\rho$, and then the following optimization problem is solved. Solve for $\epsilon$ under the following constraints: 1) the expression in Equation 3 equals $\sigma^2$, and 2) $\alpha = \max\{0, \log(n/\beta) / \log n\}$, so that $\beta = n^{1-\alpha}$.

\[
\underbrace{\frac{1}{n^{\alpha}} \left( \frac{1}{l_{np}} \sum_{i=1}^{l_{np}} \left( f(T^{np}_i) - \frac{1}{l_{np}} \sum_{j=1}^{l_{np}} f(T^{np}_j) \right)^{2} \right)}_{C} + \underbrace{\frac{2 s^2}{\epsilon^2 n^{2\alpha}}}_{D} \qquad (3)
\]

In Equation 3, $C$ denotes the variance in the estimation error and $D$ denotes the variance in the output due to noise.

To calculate $\sigma$ from the accuracy goal, we can rely on Chebyshev's inequality: $\Pr[|\hat{f}(T) - E(f(T_i))| > \phi\sigma] < 1/\phi^2$. Furthermore, assuming that the query $f$ is an approximately normal statistic, we have $|E(f(T_i)) - \text{Truth}| = O(1/\beta)$. Therefore: $\Pr[|\hat{f}(T) - \text{Truth}| > \phi\sigma + O(1/\beta)] < 1/\phi^2$. To meet the output accuracy goal of $\rho$ with probability $1 - \delta$, we set $\sigma \simeq \sqrt{\delta}\, |1 - \rho|\, f(T^{np})$. Here, we have assumed that the true answer is $f(T^{np})$ and that $1/\beta \ll \sigma/\sqrt{\delta}$.

Since the above calculations assume that the true answer is $f(T^{np})$, an obvious question is "why not output $f(T^{np})$ as the answer?". It can be shown that in many cases the private output will be much better than $f(T^{np})$. If the assumption that $1/\beta \ll \sigma/\sqrt{\delta}$ does not hold, then the above technique for selecting the privacy budget would produce suboptimal results.
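Setting the expression in Equation 3 equal to $\sigma^2$ and solving for $\epsilon$ gives $\epsilon = \sqrt{2}\, s / (n^{\alpha} \sqrt{\sigma^2 - C})$, which is defined only when $\sigma^2 > C$. The sketch below works through that calculation in Python. It is illustrative only: the function names are hypothetical, `block_outputs` stands for the values $f(T^{np}_i)$ computed on the aged dataset, $\alpha$ is taken as $\log(n/\beta)/\log n$ so that $\beta = n^{1-\alpha}$, and the absolute value around $f(T^{np})$ is an added safeguard to keep $\sigma$ non-negative.

```python
import math


def sigma_from_goal(rho, delta, f_np):
    """Permissible standard deviation: sqrt(delta) * |1 - rho| * f(T_np).

    abs(f_np) is an assumption added here so that sigma is never negative.
    """
    return math.sqrt(delta) * abs(1.0 - rho) * abs(f_np)


def epsilon_for_accuracy(block_outputs, n, beta, sensitivity, sigma):
    """Solve C + 2 s^2 / (eps^2 n^(2 alpha)) = sigma^2 for eps (hypothetical helper)."""
    alpha = max(0.0, math.log(n / beta) / math.log(n))
    l_np = len(block_outputs)
    block_mean = sum(block_outputs) / l_np
    # C: variance of the per-block outputs on the aged data, scaled by 1 / n^alpha.
    c = sum((x - block_mean) ** 2 for x in block_outputs) / l_np / n ** alpha
    slack = sigma ** 2 - c
    if slack <= 0:
        raise ValueError("estimation variance alone already exceeds sigma^2")
    return math.sqrt(2.0) * sensitivity / (n ** alpha * math.sqrt(slack))


if __name__ == "__main__":
    outputs = [10.0, 12.0, 14.0, 16.0]  # made-up per-block values of f
    eps = epsilon_for_accuracy(outputs, n=10000, beta=100, sensitivity=10.0, sigma=0.5)
    print(eps)
```

When the estimation variance $C$ alone exceeds $\sigma^2$, no amount of privacy budget meets the goal at that block size, which is why the sketch raises an error instead of returning a value.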
A suboptimal choice of $\epsilon$ does not, however, compromise the privacy properties that GUPT wants to maintain, as it explicitly limits the total privacy budget allocated for queries accessing a particular dataset.

5.2 Automatic Privacy Budget Distribution

Differential privacy is an alien concept for most analysts. Further, the proper distribution of the limited privacy budget across multiple computations requires significant mathematical expertise. GUPT eliminates the need to manually distribute the privacy budget between tasks. The following example highlights the need for an efficient privacy budget distribution rather than distributing the budget equally among the various tasks.

Example 4. Consider the same age census dataset $T$ from Example 2. Suppose we want to find the average age and the variance present in the dataset while preserving differential privacy. Assume that the maximum possible human age is $max$ and the minimum age is zero. Assume that the non-private variance is computed as $\frac{1}{n} \sum_{i=1}^{n} (T(i) - Av_{priv})^2$, where $Av_{priv}$ is the private estimate of the average and $n$ is the size of $T$. If an entry of $T$ is modified, the average $Av$ changes by at most $max/n$, whereas the variance can change by at most $max^2/n$.

Let $\epsilon_1$ and $\epsilon_2$ be the privacy levels expected for the average and the variance respectively, with the total privacy budget being $\epsilon = \epsilon_1 + \epsilon_2$. If $\epsilon_1 = \epsilon_2$, then the error in the computation of the variance will be larger than the error in the computation of the average by a factor on the order of $max$. If instead the privacy budget is distributed as $\epsilon_1 : \epsilon_2 = 1 : max$, then the noise in the average and in the variance will be roughly the same.

Suppose we are given a privacy budget $\epsilon$ and need to use it to compute various queries $f_1, \cdots, f_m$ privately. If the private estimation of query $f_i$ requires privacy budget $\epsilon_i$, then the total privacy budget spent will be $\sum_{i=1}^{m} \epsilon_i$ (by the composition property of differential privacy [5]). The privacy budget is distributed as follows. Let $\zeta_i / \epsilon_i$ be the standard deviation of the Laplace noise added by GUPT to ensure privacy level $\epsilon_i$. Allocate the privacy budget by setting $\epsilon_i = \frac{\zeta_i}{\sum_{j=1}^{m} \zeta_j}\, \epsilon$. The rationale behind this approach is that the variance in the computation by GUPT is usually dominated by the variance of the Laplace noise added. Hence, distributing $\epsilon$ across the tasks in this way ensures that the variance due to Laplace noise in the private output for each $f_i$ is the same.

6. SYSTEM SECURITY

GUPT is designed as a hosted platform where the analyst is not trusted.
It is thus important to ensure that the untrusted computation cannot access the datasets directly. Additionally, it is important to prevent the computation from exhausting resources or compromising the service provider. To this end, the "computation manager" is split into a server component that interacts with the user and a client component that runs on each node in the cluster. The trusted client is responsible for instantiating the computation in an isolated execution environment. The isolated environment ensures that the computation can only communicate with a trusted forwarding agent, which sends the messages to the computation manager.

6.1 Access Control

GUPT uses a mandatory access control (MAC) framework to ensure that (a) communication between different instances of the computation is disallowed and (b) each instance of the computation can only store state (or modify data) within its own scratch space. This is the only component of GUPT that depends upon a platform-dependent implementation.

On Linux, the LSM framework [27] has enabled many MAC frameworks, such as SELinux and AppArmor, to be built. GUPT defines a simple AppArmor policy for each instance of the computation, setting its working directory to a temporary scratch space that is emptied upon program termination. AppArmor does not yet allow fine-grained control to limit network activity to individual hosts and ports. Thus the "computation manager" is split into a server and a client component. The client component of the computation manager allows GUPT to disable all network activity for the untrusted computation and restrict IPC to the client.

We determined an empirical estimate of the overhead introduced by the AppArmor sandbox by executing an implementation of k-means clustering on GUPT 6,000 times. We found that the sandboxed version of GUPT was only 0.012 times slower than the non-sandboxed version (overhead of 1.26%).

6.2 Protection against side-channel attacks

Haeberlen et al. [10] identified three possible side-channel attacks against differentially private systems: i) state attack, ii) privacy budget attack, and iii) timing attack. GUPT is not vulnerable to any of these attacks.

State attacks: If the adversarial program can modify some internal state (e.g., change the value of a static variable) when it encounters a specific data record, an adversary can then look at the state to figure out whether the record was present in the dataset. Both PINQ (in its current implementation) and Airavat are vulnerable to state attacks. However, it is conceivable that .NET AppDomains could be used in PINQ to isolate data computations. Since GUPT executes the complete analysis program (which may be adversarial) in isolated execution chambers and allows the analyst to access only the final differentially private output, state attacks are automatically protected against.

Privacy budget attack: In this attack, on encountering a particular record, the adversarial program issues additional queries that exhaust the remaining privacy budget. Haeberlen et al. [10] noted that PINQ is vulnerable to this attack. GUPT protects against privacy budget attacks by managing the privacy budget itself, instead of letting the untrusted program perform the budget management.

Timing attacks: In a timing attack, the adversarial program could consume an unreasonably long amount of time to execute (perhaps entering an infinite loop) when it encounters a specific data record. GUPT protects against this attack by setting a predefined bound on the number of cycles for which the data analyst's program runs on each data block. If the computation on a particular data block completes before the predefined number of cycles, then GUPT waits for the remaining cycles before producing an output from that block.
If the computation exceeds the predefined number of cycles, it is killed and a constant value within the expected output range is produced as the output of the program running on the data block under consideration.

Note that with the scheme above, the runtime of GUPT is independent of the data. Hence, the number of execution cycles does not reveal any information about the dataset. The proof that the final output is still differentially private under this scheme follows directly from the privacy guarantee of the sample and aggregate framework and the fact that a change in one data entry can affect only one data block (ignoring resampling). Thus GUPT is not vulnerable to timing attacks. Neither PINQ nor Airavat protects against timing attacks [10].
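The fixed-budget rule for timing attacks can be illustrated with a small simulation. This is a sketch under loud assumptions, not GUPT's enforcement mechanism: a "cycle" is modeled as one step of a Python generator, and the names `run_with_fixed_budget`, `mean_age`, and `adversarial` are hypothetical. A real deployment would need OS-level enforcement to bound and kill an untrusted process.

```python
def run_with_fixed_budget(computation, block, budget, default):
    """Run a generator-based computation for exactly `budget` simulated steps.

    Returns (output, steps_consumed). steps_consumed always equals `budget`,
    so the observable running time does not depend on the data in `block`.
    """
    gen = computation(block)
    output = default
    steps = 0
    finished = False
    while steps < budget:
        steps += 1
        if finished:
            continue  # pad: idle for the remaining steps after an early finish
        try:
            next(gen)
        except StopIteration as stop:
            output = stop.value
            finished = True
    if not finished:
        gen.close()       # "kill" the over-budget computation
        output = default  # constant value inside the expected output range
    return output, steps


def mean_age(block):
    """Honest computation: one simulated step per record."""
    total = 0
    for age in block:
        total += age
        yield
    return total / len(block)


def adversarial(block):
    """Stalls forever when it sees a targeted record (117 is a made-up value)."""
    total = 0
    for age in block:
        if age == 117:
            while True:
                yield
        total += age
        yield
    return total / len(block)


if __name__ == "__main__":
    print(run_with_fixed_budget(mean_age, [30, 40, 50], 10, 75.0))
    print(run_with_fixed_budget(adversarial, [30, 117, 50], 10, 75.0))
```

In both runs the number of consumed steps is the same, so an observer who can only measure running time learns nothing about whether the targeted record is present; the stalled block simply contributes the default value to the aggregate.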