GUPT: Privacy Preserving Data Analysis Made Easy


While differential privacy has strong theoretical properties, the shortcomings of existing differentially private data analysis systems have limited its adoption. For instance, existing programs cannot be leveraged for private data analysis without modification. The magnitude of the perturbation introduced in the final output is another cause of concern for data analysts. Differential privacy systems operate using an abstract notion of privacy, called the 'privacy budget'. Intuitively, a lower privacy budget implies better privacy. However, this unit of privacy does not easily translate into the utility of the program and is thus difficult to interpret for data analysts who are not experts in privacy. Further, analysts would also be required to efficiently distribute this limited privacy budget between multiple queries operating on a dataset. An inefficient distribution of the privacy budget would result in inaccurate data analysis and reduce the number of queries that can be safely performed on the dataset.

We introduce <strong>GUPT</strong>¹, a platform that lets organizations open their datasets to external aggregate analysis while ensuring that the analysis is performed in a differentially private manner. It allows the execution of existing programs with no modifications, eliminating the expensive and demanding task of rewriting programs to be differentially private. <strong>GUPT</strong> enables data analysts to specify a desired output accuracy rather than work with an abstract privacy budget. Finally, <strong>GUPT</strong> automatically parallelizes the task across a cluster, ensuring scalability for concurrent analytics. We show through experiments on real datasets that <strong>GUPT</strong> overcomes many shortcomings of existing differential privacy systems without sacrificing accuracy.

¹ <strong>GUPT</strong> is a Sanskrit word meaning 'Secret'.

1.1 Contributions

We design and develop <strong>GUPT</strong>, a platform for privacy-preserving data analytics. We introduce a new model for data sensitivity which applies to a large class of datasets where the privacy requirement of data decreases over time. As we will explain in Section 3.3, using this model is appropriate and allows us to overcome significant challenges that are fundamental to differential privacy. This approach enables us to analyze less sensitive data to obtain reasonable approximations of privacy parameters that can then be used for queries running on the newer data.

<strong>GUPT</strong> makes the following technical contributions that make differential privacy usable in practice:

1. Describing privacy budget in terms of accuracy: Data analysts are accustomed to working with inaccurate output (as is the case with data sampling in large datasets, and many machine learning algorithms produce probabilistic output). <strong>GUPT</strong> uses the aging model of data sensitivity to allow analysts to describe the abstract 'privacy budget' in terms of the expected accuracy of the final output.

2. Privacy budget distribution: <strong>GUPT</strong> automatically allocates a privacy budget to each query in order to match the data analysts' accuracy requirements. Further, the analyst does not have to distribute the privacy budget between the individual data operations in the program.

3. Accuracy of output: <strong>GUPT</strong> extends a theoretical differential privacy framework called "sample and aggregate" (described in Section 2.1) for practical applicability. This includes a novel data resampling technique that reduces the error introduced by the framework's data partitioning scheme. Further, the aging model of data sensitivity allows <strong>GUPT</strong> to select an optimal partition size that reduces the perturbation added for differential privacy.

4.
Prevent side-channel attacks: <strong>GUPT</strong> defends against side-channel attacks such as the privacy budget attacks, state attacks and timing attacks described in [10].

2. BACKGROUND

Differential privacy places privacy research on a firm theoretical foundation. It guarantees that the presence or absence of a particular record in a dataset will not significantly change the output of any computation on a statistical dataset. An adversary thus learns approximately the same information about any individual record, irrespective of its presence or absence in the original dataset.

Definition 1 (ɛ-differential privacy [5]). A randomized algorithm A is ɛ-differentially private if for all datasets T, T′ ∈ D^n differing in at most one data record and for any set of possible outputs O ⊆ Range(A),

Pr[A(T) ∈ O] ≤ e^ɛ · Pr[A(T′) ∈ O].

Here D is the domain from which the data records are drawn.

The privacy parameter ɛ, also called the privacy budget [16], is fundamental to differential privacy. Intuitively, a lower value of ɛ implies a stronger privacy guarantee, and a higher value implies a weaker privacy guarantee while possibly achieving higher accuracy.

2.1 Sample and Aggregate

Algorithm 1 Sample and Aggregate Algorithm [24]
Input: Dataset T ∈ R^n, length of the dataset n, privacy parameter ɛ, output range (min, max).
1: Let l = n^0.4
2: Randomly partition T into l disjoint blocks T_1, ..., T_l.
3: for i ∈ {1, ..., l} do
4:   O_i ← Output of user application on dataset T_i.
5:   If O_i > max, then O_i ← max.
6:   If O_i < min, then O_i ← min.
7: end for
8: A ← (1/l) Σ_{i=1}^{l} O_i + Lap(|max − min| / (l · ɛ))

<strong>GUPT</strong> leverages and extends the "sample and aggregate" framework (SAF) [24, 19] to design a practical and usable system that guarantees differential privacy for arbitrary applications. Given a statistical estimator P(T), where T is the input dataset, SAF constructs a differentially private statistical estimator P̂(T) using P as a black box. Moreover, theoretical analysis guarantees that the output of P̂(T) converges to that of P(T) as the size of the dataset T increases.

As the name "sample and aggregate" suggests, the algorithm first partitions the dataset into l = n^0.4 smaller blocks (call them T_1, ..., T_l; see Figure 1). The analytics program P is applied to each of these datasets T_i and the outputs O_i are recorded. The O_i's are then clamped to within an output range that is either provided by the analyst or inferred using a range estimator function. (Refer to
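The steps of Algorithm 1 translate almost directly into code. The sketch below is a minimal, illustrative Python implementation, not GUPT's actual code: the function name, the `estimator` callback, and the use of standard-library sampling are our own choices, and the Laplace draw is generated as the difference of two i.i.d. exponential draws.

```python
import random

def sample_and_aggregate(T, estimator, eps, out_min, out_max):
    """Differentially private estimate following Algorithm 1 (sketch).

    T                  : list of records
    estimator          : the black-box analytics program P (list -> float)
    eps                : privacy budget epsilon
    (out_min, out_max) : output range, used for clamping and the noise scale
    """
    n = len(T)
    l = max(1, int(n ** 0.4))               # number of blocks, l = n^0.4
    shuffled = random.sample(T, n)          # random disjoint partition T_1..T_l
    blocks = [shuffled[i::l] for i in range(l)]
    outputs = []
    for block in blocks:
        o = estimator(block)                # O_i <- P(T_i)
        o = min(max(o, out_min), out_max)   # clamp O_i into [min, max]
        outputs.append(o)
    # Lap(|max - min| / (l * eps)); a Laplace sample is the difference of
    # two i.i.d. exponential samples with the same scale.
    scale = abs(out_max - out_min) / (l * eps)
    noise = scale * (random.expovariate(1.0) - random.expovariate(1.0))
    return sum(outputs) / l + noise

# Usage: a differentially private mean of values known to lie in [0, 1].
data = [random.random() for _ in range(10000)]
private_mean = sample_and_aggregate(data, lambda b: sum(b) / len(b),
                                    eps=1.0, out_min=0.0, out_max=1.0)
```

Note that the clamping step is what bounds the influence of any one block's output, which is why noise with scale (max − min)/(l·ɛ) on the average of l block outputs suffices for ɛ-differential privacy.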
