09.08.2015 Views

GUPT: Privacy Preserving Data Analysis Made Easy - Computer ...


size that will allow us to balance the estimation error and the noise? The following example elucidates why answering the above question is important.

Example 3. Consider the same age dataset $T$ used in Example 2. If our goal is to find the average of the entries in $T$ while preserving privacy, then it can be observed that (ignoring resampling) the optimal size of each block is one, which attains the optimal balance between the estimation error and the noise. If the block size were one, the expected error would be $O(1/n)$, where $n$ is the size of the dataset. However, if we use the default block size (i.e., $n^{0.6}$), there are only $n^{0.4}$ blocks and the expected error would be $O(1/n^{0.4})$, which is much higher.

As a result, choosing the block size based on the specific task helps to reduce the final error to a large extent. The optimal block size varies from problem to problem. For example, in k-means clustering or logistic regression the optimal block size has to be much larger than one.

Let $l = n^{\alpha}$ be the optimal number of blocks, where $\alpha$ is a parameter to be ascertained. Hence, $n^{1-\alpha}$ is the block size. (For simplicity of exposition we do not consider resampling.) Let $f : \mathbb{R}^{k \times n} \to \mathbb{R}$ be the query which is to be computed on the dataset $T$. Let the data blocks be represented as $T_1, \cdots, T_l$. Let $s$ be the sensitivity of the query, i.e., the absolute value of the maximum change that any $f(T_i)$ can have if any one entry of $T$ is changed. With the above parameters in place, the $\epsilon$-differentially private output from the sample and aggregate framework is

$$\hat{f}(T) = \frac{1}{n^{\alpha}} \sum_{i=1}^{n^{\alpha}} f(T_i) + \mathrm{Lap}\left(\frac{s}{\epsilon n^{\alpha}}\right) \tag{1}$$

Assume that the entries of the dataset $T$ are drawn i.i.d., and that there exists a dataset $T^{np}$ (with entries drawn i.i.d. from the same distribution as $T$) whose privacy we do not care for under the aging of sensitivity model. Let $n_{np}$ be the number of entries in $T^{np}$. We will use the aged dataset $T^{np}$ to estimate the optimal block size. Specifically, we partition $T^{np}$ into blocks of size $\beta = n^{1-\alpha}$.
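The effect of the block count on the noise in Equation 1 can be illustrated with a minimal Python sketch. This is not GUPT's implementation: the helper names `laplace_noise` and `sample_and_aggregate` are hypothetical, the data is synthetic, the sensitivity of 100 is an assumed output range for ages, and blocks are formed by simple striding rather than by any resampling scheme.

```python
import random
from statistics import fmean


def laplace_noise(scale, rng):
    # The difference of two i.i.d. exponentials with mean `scale`
    # is distributed as Laplace(0, scale).
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def sample_and_aggregate(data, f, alpha, epsilon, sensitivity, rng):
    """Hypothetical helper mirroring Equation 1 (no resampling).

    Returns the noisy estimate and the Laplace scale that was used.
    """
    n = len(data)
    # l = n^alpha blocks, clamped so that every block is non-empty.
    num_blocks = max(1, min(n, round(n ** alpha)))
    # Strided split into disjoint blocks of near-equal size.
    blocks = [data[i::num_blocks] for i in range(num_blocks)]
    block_average = fmean(f(block) for block in blocks)
    # Laplace scale s / (epsilon * n^alpha), as in Equation 1.
    noise_scale = sensitivity / (epsilon * num_blocks)
    return block_average + laplace_noise(noise_scale, rng), noise_scale


rng = random.Random(0)
ages = [rng.uniform(0, 100) for _ in range(10_000)]  # synthetic stand-in for T
for alpha in (1.0, 0.4):
    estimate, scale = sample_and_aggregate(ages, fmean, alpha, 1.0, 100.0, rng)
    print(f"alpha={alpha}: noise scale={scale:.4f}, estimate={estimate:.2f}")
```

With 10,000 entries, $\alpha = 1$ (block size one) gives a Laplace scale of 0.01, while $\alpha = 0.4$ (the default block size $n^{0.6}$) gives 40 blocks and a scale of 2.5, which matches the gap between $O(1/n)$ and $O(1/n^{0.4})$ described in Example 3.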
For blocks of size $\beta = n^{1-\alpha}$, the number of blocks $l_{np}$ in $T^{np}$ is $l_{np} = n_{np} / n^{1-\alpha}$. Notice that $l_{np}$ is a function of $\alpha$, whose optimal value has to be found. Also note that the minimum value of $\alpha$ must satisfy the following inequality: $n_{np} \geq n^{1-\alpha}$.

One possible approach for achieving a good value of $\alpha$ is to minimize the empirical error in the final output. The empirical error in the output of $\hat{f}$ is defined as

$$\underbrace{\left| \frac{1}{l_{np}} \sum_{i=1}^{l_{np}} f(T_i^{np}) - f(T^{np}) \right|}_{A} + \underbrace{\frac{\sqrt{2}\, s}{\epsilon n^{\alpha}}}_{B} \tag{2}$$

Here $A$ characterizes the estimation error, and $B$ is due to the Laplace noise added (it is the standard deviation of the noise term in Equation 1). We can minimize Equation 2 w.r.t. $\alpha$ when $\alpha \in [1 - \log n_{np} / \log n, 1]$. Conventional techniques like hill climbing can be used to obtain a local minimum.

The $\alpha$ computed above is used to obtain the optimal number of blocks. Since the computation involves only the non-private database $T^{np}$, there is no effect on overall privacy.

5. PRIVACY BUDGET MANAGEMENT

In differential privacy, the analyst is expected to specify the privacy goals in terms of an abstract privacy budget $\epsilon$. The analyst performs the data analysis task, optimizing it for accuracy goals and the availability of computational resources. These metrics do not directly map onto the abstract privacy budget. It should be noted that even a privacy expert might be unable to map the privacy budget into accuracy goals for arbitrary problems. In this section we describe mechanisms that GUPT uses to convert the accuracy goals into a privacy budget and to efficiently distribute a given privacy budget across different analysis tasks.

5.1 Estimating Privacy Budget for Accuracy Goals

In this section we seek to answer the question: How can GUPT pick an appropriate $\epsilon$, given a fixed accuracy goal? Specifically, we wish to minimize the $\epsilon$ parameter to maximally preserve the privacy budget.
It is often more intuitive to specify an accuracy goal rather than a privacy parameter $\epsilon$, since accuracy relates to the problem at hand.

Similar to the previous section, we assume the existence of an aged dataset $T^{np}$ (drawn from the same distribution as the original dataset $T$) whose privacy is not a concern. Consider an analyst who wishes to guarantee an accuracy $\rho$ with probability $1 - \delta$, i.e., the output should be within a factor $\rho$ of the true value. We wish to estimate an appropriate $\epsilon$ from an aged dataset $T^{np}$ of size $n_{np}$. Let $\beta$ denote the desired block size. To estimate $\epsilon$, first the permissible standard deviation $\sigma$ in the output is calculated for the specified accuracy goal $\rho$, and then the following optimization problem is solved. Solve for $\epsilon$ under the following constraints: 1) the expression in Equation 3 equals $\sigma^2$, 2) $\alpha = \max\{0, \log(n/\beta)\}$ (for this to agree with $\beta = n^{1-\alpha}$, the logarithm is presumably taken to base $n$).

$$\underbrace{\frac{1}{n^{\alpha}} \left( \frac{1}{l_{np}} \sum_{i=1}^{l_{np}} \left( f(T_i^{np}) - \frac{1}{l_{np}} \sum_{j=1}^{l_{np}} f(T_j^{np}) \right)^{2} \right)}_{C} + \underbrace{\frac{2 s^{2}}{\epsilon^{2} n^{2\alpha}}}_{D} \tag{3}$$

In Equation 3, $C$ denotes the variance in the estimation error and $D$ denotes the variance in the output due to noise.

To calculate $\sigma$ from the accuracy goal, we can rely on Chebyshev's inequality: $\Pr[|\hat{f}(T) - E(f(T_i))| > \phi\sigma] < 1/\phi^2$. Furthermore, assuming that the query $f$ is an approximately normal statistic, we have $|E(f(T_i)) - \text{Truth}| = O(1/\beta)$. Therefore: $\Pr[|\hat{f}(T) - \text{Truth}| > \phi\sigma + O(1/\beta)] < 1/\phi^2$. To meet the output accuracy goal of $\rho$ with probability $1 - \delta$, we set $\sigma \simeq \sqrt{\delta}\,|1 - \rho|\,f(T^{np})$. Here, we have assumed that the true answer is $f(T^{np})$ and $1/\beta \ll \sigma/\sqrt{\delta}$.

Since in the above calculations we assumed that the true answer is $f(T^{np})$, an obvious question is "why not output $f(T^{np})$ as the answer?" It can be shown that in many cases the private output will be much better than $f(T^{np})$.

If the assumption that $1/\beta \ll \sigma/\sqrt{\delta}$ does not hold, then the above technique for selecting the privacy budget would produce suboptimal results.
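Setting Equation 3 equal to $\sigma^2$ and rearranging gives $\epsilon = \sqrt{2}\,s / (n^{\alpha}\sqrt{\sigma^2 - C})$, which is defined only when $\sigma^2 > C$. The following Python sketch works through that calculation under stated assumptions: the function name `estimate_epsilon` is hypothetical, the logarithm in constraint 2 is taken to base $n$, $|f(T^{np})|$ is used so that $\sigma$ is non-negative, and `None` is returned when the goal cannot be met at the given block size. It is an illustration of the algebra, not GUPT's actual procedure.

```python
import math
from statistics import fmean, pvariance


def estimate_epsilon(aged_data, f, n, beta, sensitivity, rho, delta):
    """Hypothetical helper: solve (Equation 3) = sigma^2 for epsilon.

    aged_data   -- the non-private dataset T^np
    n           -- size of the private dataset T
    beta        -- desired block size (integer)
    """
    # Constraint 2, assuming a base-n logarithm so that beta = n^(1 - alpha).
    alpha = max(0.0, math.log(n / beta, n))
    n_alpha = n ** alpha  # number of blocks used on the private data

    # Partition T^np into full blocks of size beta (a partial tail is dropped).
    blocks = [aged_data[i:i + beta]
              for i in range(0, len(aged_data) - beta + 1, beta)]
    block_outputs = [f(block) for block in blocks]

    # Permissible standard deviation: sigma ~ sqrt(delta) * |1 - rho| * f(T^np).
    sigma = math.sqrt(delta) * abs(1.0 - rho) * abs(f(aged_data))

    # Term C: population variance of the block outputs, divided by n^alpha.
    c_term = pvariance(block_outputs) / n_alpha

    slack = sigma ** 2 - c_term  # variance left over for the Laplace noise (term D)
    if slack <= 0:
        return None  # the estimation error alone already exceeds the goal
    return math.sqrt(2) * sensitivity / (n_alpha * math.sqrt(slack))


# Synthetic aged data in which every block of 10 has the same mean, so C = 0.
aged = [40.0, 60.0] * 50
print(estimate_epsilon(aged, fmean, n=1000, beta=10,
                       sensitivity=100.0, rho=0.9, delta=0.1))
```

A smaller returned value means less privacy budget is spent. When the block outputs vary more, $C$ grows, less variance is left for the noise, and the required $\epsilon$ increases.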
Such a suboptimal choice, however, does not compromise the privacy properties that GUPT wants to maintain, as GUPT explicitly limits the total privacy budget allocated for queries accessing a particular dataset.

5.2 Automatic Privacy Budget Distribution

Differential privacy is an alien concept for most analysts. Further, the proper distribution of the limited privacy budget across multiple computations requires significant mathematical expertise. GUPT eliminates the need to manually distribute privacy budget between tasks. The following example will highlight the requirement of an efficient privacy
