09.08.2015 Views

GUPT: Privacy Preserving Data Analysis Made Easy - Computer ...

GUPT: Privacy Preserving Data Analysis Made Easy - Computer ...

GUPT: Privacy Preserving Data Analysis Made Easy - Computer ...

SHOW MORE
SHOW LESS

You also want an ePaper? Increase the reach of your titles

YUMPU automatically turns print PDFs into web optimized ePapers that Google loves.

Normalized Intra Cluster Variance120100806040200PINQ-tight ɛ=2PINQ-tight ɛ=4<strong>GUPT</strong>-tight ɛ=1<strong>GUPT</strong>-tight ɛ=220 80 200k-means iteration countFigure 5: Total perturbation introduced by <strong>GUPT</strong>does not change with number of operations in theutility functionThus the number of operations performed in the programitself is irrelevant. A problem with writing specialized differentiallyprivate algorithms such as in the case of PINQ isthat given a privacy budget ɛ for the task, it is difficult todecide how much ɛ to spend on each query, since it is difficultto determine the number of iterations needed ahead oftime. PINQ requires the analyst to pre-specify the numberof iterations in order to allocate the privacy budget betweeniterations. This is often hard to do, since many data analysisalgorithms such as PageRank [20] and recursive relationqueries [2] require iterative computation until the algorithmreaches convergence. The performance of PINQ thus dependson the ability to accurately predict the number ofiterations. If the specified number of iterations is too small,then the algorithm may not converge. On the other hand,if the specified number of iterations is too large, then muchmore noise than is required will be added which will bothslow down the convergence of the algorithm as well as harmits accuracy. Figure 5 shows the effect of PINQ on accuracywhen performing k-means clustering on the dataset.In this example, the program output for the dataset convergeswithin a small number of iterations, e.g., n = 20.Whereas if a larger number of iterations (e.g., n = 200) wasconservatively chosen, then PINQ’s performance degradessignificantly. On the other hand, <strong>GUPT</strong> produces the sameamount of perturbation irrespective of the number of iterationsin k-means. Further, it should be noted that PINQwas subjected to a weaker privacy constraint (ɛ = 2 and 4)as compared to <strong>GUPT</strong> (ɛ = 1 and 2).7.1.3 ScalabilityTime (seconds)2.52.01.51.00.50.0Non Private<strong>GUPT</strong>-helper<strong>GUPT</strong>-loose20 80 100 200IterationsFigure 6: Change in computation time for increasednumber of iterations in k-meansUsing a server with two Intel Xeon 5550 quad-core CPUsand the entire dataset loaded in memory, we compare theResult accuracy (%)10095908580<strong>GUPT</strong>-helper constant ɛ=1<strong>GUPT</strong>-helper constant ɛ=0.3<strong>GUPT</strong>-helper variable ɛExpected Accuracy0 20 40 60 80 100Portion of queries (%)Figure 7: CDF of query accuracy for privacy budgetallocation mechanismsexecution time of an unmodified (non-private) instance anda <strong>GUPT</strong> instance of the k-means experiment.If tight output range (i.e. , <strong>GUPT</strong>-tight) is not available,typically, the output range estimation phase of the sampleand aggregate framework takes up most of the CPU cycles.When only loose range for the input is available (i.e. ,<strong>GUPT</strong>-helper), a differentially private percentile estimationis performed on all of the input data. This is a O(n ln n)operation, n being the number of data records in the originaldataset. On the other hand, if even loose range for theoutput is available (i.e. , <strong>GUPT</strong>-loose), then the percentileestimation is performed only on the output of each of theblocks in sample and aggregate framework, which is typicallyaround n 0.4 . This results in significantly reduced run-timeoverhead. The overhead introduced by <strong>GUPT</strong> is irrespectiveof the actual computation time itself. Thus as the computationtime increases, the overhead introduced by <strong>GUPT</strong>diminishes in comparison. Further, there is an additionalspeed up since each of the computation instances work on asmaller subset of the entire dataset. It should be noted thatthe reduction in computation time thus achieved could alsopotentially be achieved by the computational task runningwithout <strong>GUPT</strong>. Figure 6 shows that the overall completiontime of the private versions of the program increases slowlycompared to the non-private version as we increase the numberof iterations of k-means clustering.7.2 Using Aged <strong>Data</strong><strong>GUPT</strong> uses an aged dataset (that is no longer consideredprivacy sensitive) drawn from a similar distribution as thereal dataset. Section 4.3 describes the use of aged data toestimate an optimal block size that reduces the error introducedby data sampling. Section 5.1 describes how dataanalysts who are not privacy experts can continue to only describetheir accuracy goals yet achieve differentially privateoutputs. Finally, Section 5.2 uses aged data to automaticallydistribute a privacy budget between different querieson the same data set. In this section, we show experimentalresults that support the claims made in Sections 4.3 and 5.1.7.2.1 <strong>Privacy</strong> Budget EstimationTo illustrate the ease with which <strong>GUPT</strong> can be used bydata analysts, we evaluate the efficiency of <strong>GUPT</strong> by executingqueries that are not provided with a privacy budget.We use a census income dataset from the UCI machinelearning repository [7] which consists of 32561 entries. Theage data from this dataset is used to calculate the averageage. A reasonably loose range of [0, 150] was enforced

Hooray! Your file is uploaded and ready to be published.

Saved successfully!

Ooh no, something went wrong!