GUPT: Privacy Preserving Data Analysis Made Easy - Computer ...


7. EVALUATION

For each data analysis program, the program binary and interfaces with the GUPT “computation manager” should be provided. For arbitrary binaries, a lean wrapper program can be used for marshaling data to/from the format of the computation manager.

In this section, we show using results from running common machine learning algorithms (such as k-means clustering and logistic regression on a life sciences dataset) that GUPT does not significantly affect the accuracy of data analysis. Further, we show that GUPT not only relieves the analysts from the burden of distributing a privacy budget between data transformation operations, it also manages to provide superior output accuracy. Finally, we show through benchmarks the scalability of the GUPT architecture and the benefits of using aged data to estimate optimal values of privacy budget and block sizes.

7.1 Case Study: Life Sciences dataset

We evaluate the efficacy of GUPT using the ds1.10 life sciences dataset taken from http://komarix.org/ac/ds as a motivating example for data analysis. This dataset contains the top 10 principal components of chemical/biological compounds, with each of the 26,733 rows representing a different compound. Additionally, the reactivity of the compound is available as an additional component. A k-means clustering experiment enables us to cluster compounds with similar features together, and logistic regression builds a linear classifier for the experiment (e.g., predicting carcinogens). It should be noted that these experiments only provide estimates as the final answer, e.g., the cluster centroids in the case of k-means.
We show in this section that the perturbation introduced by GUPT only affects the final result marginally.

7.1.1 Output Accuracy

Figure 3: Effect of privacy budget on the accuracy of prediction using Logistic Regression on the life sciences dataset. (x-axis: privacy budget ε; y-axis: accuracy; series: GUPT-tight, non-private baseline.)

As mentioned in Section 4, any analysis performed using GUPT has two sources of error: (a) an estimation error, introduced because each instance of the computation works on a smaller subset of the data, and (b) Laplace noise that is added in order to guarantee differential privacy. In this section, we show the effect of these errors when running logistic regression and k-means on the life sciences dataset.

GUPT can be used to run existing programs with no modifications, thus drastically reducing the overhead of writing privacy preserving programs. Analysts using GUPT are free to use their favorite software packages written in any language. To demonstrate this property, we evaluate black box implementations of logistic regression and k-means clustering on the life sciences dataset.

Figure 4: Intra-cluster variance for k-means clustering on the life sciences dataset. (x-axis: privacy budget ε; y-axis: normalized intra-cluster variance; series: baseline ICV, GUPT-loose, GUPT-tight.)

Logistic Regression: The logistic regression software package from Microsoft Research (Orthant-Wise Limited-memory Quasi-Newton Optimizer for L1-regularized Objectives) was used to classify the compounds in the dataset as carcinogens and non-carcinogens. Figure 3 shows the accuracy of GUPT for different privacy budgets.

When the package was run on the dataset directly, a baseline accuracy of 94% was obtained. The same package, when run using the GUPT framework, classified carcinogens with an accuracy between 75% and 80%. To understand the source of the error, the non-private algorithm was executed on a data block of $n/n^{0.4}$ records, and the accuracy reduced to 82%. It was thus determined that much of the error stems from the loss of accuracy when the algorithm is run on smaller blocks of the entire dataset. For datasets of increasingly large size, this error is expected to diminish.

k-means Clustering: Figure 4 shows the cluster variance computed from a k-means implementation run on the life sciences dataset. The x-axis shows various choices of the privacy budget ε, and the y-axis is the normalized Intra-Cluster Variance (ICV), defined as

$$\frac{1}{n} \sum_{i=1}^{K} \sum_{\vec{x} \in C_i} \lVert \vec{x} - \vec{c}_i \rVert_2^2,$$

where $n$ denotes the number of data records, $K$ denotes the number of clusters, $C_i$ denotes the set of points within the $i$-th cluster, and $\vec{c}_i$ denotes the center of the $i$-th cluster. A standard k-means implementation from the scipy Python package is used for the experiment.

The k-means implementation was run using GUPT with different configurations for calculating the output range (Section 4.1). For GUPT-tight, a tight range for the output is taken to be the exact minimum and maximum of each attribute (for all 10 attributes). For GUPT-loose, a loose output range is fixed as [min × 2, max × 2], where min and max are the actual minimum and maximum for that attribute. Figure 4 shows that with increasing privacy budget ε, the amount of Laplace noise added to guarantee differential privacy decreases, thereby reducing the intra-cluster variance, i.e., making the answer more accurate. It can also be seen that when GUPT is provided with reasonably tight bounds on the output range (GUPT-tight), the output of the k-means experiment is very close to a non-private run of the experiment even for small values of the privacy budget. If only loose bounds are available (GUPT-loose), then a larger privacy budget is required for the same output accuracy.

7.1.2 Budget Distribution between Operations

In GUPT, the program is treated as a black box and noise is only added to the output of the entire program.
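The black-box treatment can be illustrated with a minimal sketch of sample-and-aggregate averaging. This is not GUPT's implementation: the function names, the random partitioning, and the noise scale `(hi - lo) / (num_blocks * epsilon)` are assumptions chosen to match the standard averaging argument (changing one record changes one block's clamped output, so the average moves by at most `(hi - lo) / num_blocks`). The analyst's program is passed in as an opaque callable and is never inspected.

```python
import random
from typing import Callable, List, Sequence, Tuple


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Draw Laplace(0, scale) noise as the difference of two exponentials."""
    if scale <= 0:
        return 0.0
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_black_box(
    records: Sequence[float],
    program: Callable[[List[float]], float],  # opaque analyst program
    num_blocks: int,
    out_range: Tuple[float, float],           # assumed known output bounds
    epsilon: float,
    rng: random.Random,
) -> float:
    """Hypothetical helper: run `program` per block, average, add noise once."""
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    if not 1 <= num_blocks <= len(records):
        raise ValueError("num_blocks must be between 1 and len(records)")
    lo, hi = out_range

    # Randomly partition the records into disjoint blocks.
    shuffled = list(records)
    rng.shuffle(shuffled)
    blocks = [shuffled[i::num_blocks] for i in range(num_blocks)]

    # Run the unmodified program on every block and clamp to the range.
    outputs = [min(max(program(block), lo), hi) for block in blocks]
    average = sum(outputs) / num_blocks

    # Noise is added a single time, to the aggregated output only.
    scale = (hi - lo) / (num_blocks * epsilon)
    return average + laplace_noise(scale, rng)


if __name__ == "__main__":
    rng = random.Random(0)
    ages = [rng.uniform(18, 90) for _ in range(1000)]  # synthetic data
    mean = lambda block: sum(block) / len(block)
    print(private_black_box(ages, mean, 16, (0.0, 150.0), 1.0, rng))
```

Because the only privacy-relevant step is the final noise draw, the sketch would behave the same whether `program` performs one pass over its block or many.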
Thus the number of operations performed in the program itself is irrelevant. A problem with writing specialized differentially private algorithms, such as in the case of PINQ, is that given a privacy budget ε for the task, it is difficult to decide how much ε to spend on each query, since it is difficult to determine the number of iterations needed ahead of time. PINQ requires the analyst to pre-specify the number of iterations in order to allocate the privacy budget between iterations. This is often hard to do, since many data analysis algorithms such as PageRank [20] and recursive relation queries [2] require iterative computation until the algorithm reaches convergence. The performance of PINQ thus depends on the ability to accurately predict the number of iterations. If the specified number of iterations is too small, then the algorithm may not converge. On the other hand, if the specified number of iterations is too large, then much more noise than is required will be added, which will both slow down the convergence of the algorithm and harm its accuracy. Figure 5 shows the effect of PINQ on accuracy when performing k-means clustering on the dataset.

Figure 5: Total perturbation introduced by GUPT does not change with number of operations in the utility function. (x-axis: k-means iteration count, 20 / 80 / 200; y-axis: normalized intra-cluster variance; series: PINQ-tight ε=2, PINQ-tight ε=4, GUPT-tight ε=1, GUPT-tight ε=2.)

In this example, the program output for the dataset converges within a small number of iterations, e.g., n = 20. If a larger number of iterations (e.g., n = 200) is conservatively chosen, PINQ’s performance degrades significantly. On the other hand, GUPT produces the same amount of perturbation irrespective of the number of iterations in k-means.
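The dependence on the iteration count can be made concrete with a small arithmetic sketch. It assumes an even split of the budget across iterations under sequential composition and a unit-sensitivity query answered with Laplace noise; both are illustrative assumptions, and the function names are hypothetical rather than part of PINQ or GUPT.

```python
def per_iteration_laplace_scale(
    total_epsilon: float, iterations: int, sensitivity: float = 1.0
) -> float:
    """Laplace scale per query when the budget is split evenly.

    Each iteration receives total_epsilon / iterations, so the scale is
    sensitivity / (total_epsilon / iterations).
    """
    return sensitivity * iterations / total_epsilon


def output_only_laplace_scale(
    total_epsilon: float, sensitivity: float = 1.0
) -> float:
    """Laplace scale when noise is added once, to the final output."""
    return sensitivity / total_epsilon


if __name__ == "__main__":
    for iterations in (20, 80, 200):
        split = per_iteration_laplace_scale(2.0, iterations)
        once = output_only_laplace_scale(2.0)
        print(f"iterations={iterations}: per-query scale={split}, "
              f"output-only scale={once}")
```

Under these assumptions the per-query noise scale grows linearly with the declared iteration count, while the output-only scale does not depend on it at all.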
Further, it should be noted that PINQ was subjected to a weaker privacy constraint (ε = 2 and 4) as compared to GUPT (ε = 1 and 2).

7.1.3 Scalability

Figure 6: Change in computation time for increased number of iterations in k-means. (x-axis: iterations, 20 / 80 / 100 / 200; y-axis: time in seconds; series: non-private, GUPT-helper, GUPT-loose.)

Figure 7: CDF of query accuracy for privacy budget allocation mechanisms. (x-axis: portion of queries (%); y-axis: result accuracy (%); series: GUPT-helper constant ε=1, GUPT-helper constant ε=0.3, GUPT-helper variable ε, expected accuracy.)

Using a server with two Intel Xeon 5550 quad-core CPUs and the entire dataset loaded in memory, we compare the execution time of an unmodified (non-private) instance and a GUPT instance of the k-means experiment.

If a tight output range (i.e., GUPT-tight) is not available, the output range estimation phase of the sample and aggregate framework typically takes up most of the CPU cycles. When only a loose range for the input is available (i.e., GUPT-helper), a differentially private percentile estimation is performed on all of the input data. This is an O(n ln n) operation, n being the number of data records in the original dataset. On the other hand, if even a loose range for the output is available (i.e., GUPT-loose), then the percentile estimation is performed only on the output of each of the blocks in the sample and aggregate framework, of which there are typically around $n^{0.4}$. This results in significantly reduced run-time overhead. The overhead introduced by GUPT is independent of the actual computation time itself. Thus as the computation time increases, the overhead introduced by GUPT diminishes in comparison. Further, there is an additional speed-up since each of the computation instances works on a smaller subset of the entire dataset. It should be noted that the reduction in computation time thus achieved could also potentially be achieved by the computational task running without GUPT.
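To give a rough sense of the sizes involved, the following sketch computes the block count and block size for the life sciences dataset and compares an m ln m cost proxy for percentile estimation over all inputs versus over block outputs only. The rounding, the use of one scalar output per block, and the cost proxy are simplifying assumptions for illustration; they are not measurements from the paper.

```python
import math
from typing import Tuple


def sample_and_aggregate_sizes(n: int, exponent: float = 0.4) -> Tuple[int, int]:
    """Return (number of blocks, records per block) for n records.

    Assumes about n**exponent blocks of roughly equal size.
    """
    num_blocks = max(1, round(n ** exponent))
    return num_blocks, n // num_blocks


def percentile_cost_proxy(m: int) -> float:
    """m * ln(m), a unitless stand-in for an O(m ln m) operation."""
    return m * math.log(m) if m > 1 else 0.0


if __name__ == "__main__":
    n = 26733  # rows in the life sciences dataset
    num_blocks, block_size = sample_and_aggregate_sizes(n)
    print("blocks:", num_blocks, "records per block:", block_size)
    print("cost proxy on all inputs:", percentile_cost_proxy(n))
    print("cost proxy on block outputs:", percentile_cost_proxy(num_blocks))
```

The proxy ignores constant factors and the dimensionality of each block's output, so it only indicates the order of magnitude of the gap between the two cases.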
Figure 6 shows that the overall completion time of the private versions of the program increases slowly compared to the non-private version as we increase the number of iterations of k-means clustering.

7.2 Using Aged Data

GUPT uses an aged dataset (that is no longer considered privacy sensitive) drawn from a similar distribution as the real dataset. Section 4.3 describes the use of aged data to estimate an optimal block size that reduces the error introduced by data sampling. Section 5.1 describes how data analysts who are not privacy experts can continue to only describe their accuracy goals yet achieve differentially private outputs. Finally, Section 5.2 uses aged data to automatically distribute a privacy budget between different queries on the same data set. In this section, we show experimental results that support the claims made in Sections 4.3 and 5.1.

7.2.1 Privacy Budget Estimation

To illustrate the ease with which GUPT can be used by data analysts, we evaluate the efficiency of GUPT by executing queries that are not provided with a privacy budget. We use a census income dataset from the UCI machine learning repository [7] which consists of 32,561 entries. The age data from this dataset is used to calculate the average age. A reasonably loose range of [0, 150] was enforced