
GUPT: Privacy Preserving Data Analysis Made Easy - Computer ...


stances of the available datasets and maintains the available privacy budget.
• The computation manager instantiates computations and seamlessly pipes data from the dataset to the appropriate instances.
• The isolated execution chambers isolate and prevent any malicious behavior by the computation instances.

These building blocks allow the principals of the system to interact easily, with many parameters either optimized automatically or left as optional arguments for experts.

Interface with the data owner: (a) a multi-dimensional dataset (such as a database table) that, for the purpose of our discussion, we assume is a collection of real-valued vectors, (b) a total privacy budget that can be allocated for computations on the dataset, and (c) [optional] input attribute ranges, i.e. the lower and upper bound for each dimension of the data. A detailed discussion of these bounds is presented in Section 4.1.

For privacy reasons, the input ranges provided should not contain any sensitive information. For example, it is reasonable to expect that a majority of household annual incomes fall within the range [0, 500,000] dollars, so this range is not considered sensitive. On the other hand, if the richest household's income is used as the upper bound, private information could be leaked.
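Concretely, the data owner's three inputs might be assembled as in the following sketch. The dictionary layout and the `validate` helper are purely illustrative, not GUPT's actual registration interface; the point is that the declared ranges encode public knowledge, never data-derived values:

```python
# Hypothetical data-owner registration bundle; GUPT's real interface may differ.
dataset = {
    "records": [[34, 72_000.0], [29, 51_500.0]],  # [age, income] vectors
    "total_privacy_budget": 2.0,                  # total epsilon for all queries
    "input_ranges": [(0, 120), (0, 500_000)],     # public per-dimension bounds
}

def validate(dataset):
    """Check that every record fits within the declared public ranges."""
    for record in dataset["records"]:
        for value, (lo, hi) in zip(record, dataset["input_ranges"]):
            assert lo <= value <= hi, "record outside declared range"

validate(dataset)
```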
In this case, a public information source such as the national GDP could be used.

Interface with the analyst: (a) a data analytics program, (b) a reference to the dataset in the dataset manager, (c) either a privacy budget or the desired accuracy for the final answer, and finally (d) an output range or a helper function for estimating the output range.

A key requirement for the analytics program is that it should be able to run on any subset of the original dataset. As GUPT executes the application in a black-box fashion, the program may also be provided as a binary executable.

Privacy budget distribution: In order to guarantee ɛ-differential privacy for a dataset T, the privacy budget should be distributed among the various applications (call them A_1, · · · , A_k) that operate on T. A composition lemma [5] states that if A_1, · · · , A_k guarantee ɛ_1, · · · , ɛ_k-differential privacy respectively, then T is ɛ-differentially private, where ɛ = ɛ_1 + · · · + ɛ_k. Thus judicious allocation of the privacy budget is extremely important. Unlike existing differential privacy solutions, GUPT relieves the analyst and the data owner from the task of distributing this limited privacy budget between multiple data analytics programs.

3.2 Privacy and Utility Guarantees

GUPT guarantees ɛ-differential privacy for the final output. It provides utility guarantees similar to those of the original sample and aggregate algorithm from [24]. This guarantee applies to queries satisfying the "approximate normality"² condition defined by Smith [24], who also observed that a wide class of queries satisfies this normality condition; examples include various maximum-likelihood estimators and estimators for regression problems. The formal utility guarantees are deferred to Appendix A.

GUPT provides the same level of privacy for queries that are not approximately normal.
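The budget accounting implied by the composition lemma can be sketched as a small tracker that refuses further queries once the total ɛ is spent; `BudgetManager` and its method names are illustrative, not GUPT's actual API:

```python
class BudgetManager:
    """Tracks a dataset's total privacy budget. By sequential composition,
    the sum of per-application epsilons must not exceed the total."""

    def __init__(self, total_epsilon):
        self.total_epsilon = total_epsilon
        self.spent = 0.0

    def charge(self, epsilon_i):
        """Reserve epsilon_i for one analytics program, or refuse."""
        if self.spent + epsilon_i > self.total_epsilon:
            raise ValueError("privacy budget exhausted")
        self.spent += epsilon_i

    def remaining(self):
        return self.total_epsilon - self.spent

manager = BudgetManager(total_epsilon=1.0)
manager.charge(0.3)  # application A_1
manager.charge(0.5)  # application A_2
```

A third application requesting more than the remaining 0.2 would be refused, which is exactly what preserves the overall ɛ-differential privacy guarantee for T.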
Reasonable utility can be expected even for such queries, although no theoretical guarantees are provided.

² By an approximately normal statistic we refer to the generic asymptotically normal statistic in Definition 2 of [24].

3.3 Aging of Sensitivity

In real-life datasets, the potential privacy threat differs from record to record. A privacy mechanism that takes this into account can obtain good utility while still satisfying the privacy constraints. We introduce a new model, called aging of sensitivity of data, in which older data records are considered to have lower privacy requirements. GUPT uses this model to optimize some parameters of the sample and aggregate framework, such as the block size (Section 4.3) and the privacy budget allocation (Section 5.1). Consider the following motivating example:

Example 1. Let T_70yrs and T_now be two datasets containing the ages of citizens in a particular region 70 years earlier and at present, respectively. It is conceivable that the privacy threat to T_70yrs is much lower, as many members of the participating population may be deceased. Although T_70yrs might not be as useful as T_now for learning specific trends about the current population, it can be used to learn some general concepts about T_now.
For example, a crude estimate of the maximum age present in T_now can be obtained from T_70yrs.

GUPT estimates such general trends in the data distribution and uses them to optimize the performance of the system. This optimization results in a significant reduction in error. More precisely, the aged data is used: i) to estimate an optimal block size for use in the sample and aggregate framework, ii) to identify the minimum privacy budget needed to estimate the final result within a given accuracy bound, and iii) to appropriately distribute the privacy budget ɛ across various tasks and queries.

For simplicity of exposition, the particular aging model in our analysis is that a constant fraction of the dataset has completely aged out, i.e. the privacy of the entries in this constant fraction is no longer a concern. In reality, if the aged data is still weakly privacy sensitive, it is possible to privately estimate these parameters by introducing an appropriate magnitude of noise into the calculations. The weak privacy requirements of aged data allow us to keep the noise low enough that the estimated parameters remain useful. Existing techniques [1, 26] have attempted to use progressive aggregation of old data in order to reduce its sensitivity; the use of differentially private operations for aggregation could potentially be exploited to generate our training datasets. These complementary approaches offer exciting opportunities that have not been explored in this paper.

It is important to mention that GUPT does not require the aging model for its default functionality. The default parameter choices allow it to work well in a generic setting. However, our experiments show that the aging model provides an additional improvement in performance.

4. ACCURACY IMPROVEMENTS

The sample and aggregate framework (Algorithm 1) introduces two sources of error:

• Estimation error: This arises because the query is evaluated on a smaller data block rather than on the entire dataset. Typically, the larger the block size, the smaller the estimation error.
• Noise: The other source of error is the Laplace noise introduced to guarantee differential privacy.
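Example 1's idea of deriving crude query parameters from fully aged-out records can be sketched as follows. The helper name and the slack factor are illustrative assumptions (aged data may underestimate present-day values, so the bound is widened); under our aging model the aged records cost no privacy budget:

```python
def range_from_aged_data(aged_values, slack=1.25):
    """Estimate a public output range (e.g. for a maximum-age query on
    T_now) from fully aged-out records, widening the upper end by a
    slack factor since old data tends to understate current values."""
    return min(aged_values), max(aged_values) * slack

t_70yrs = [61, 45, 78, 52, 70]        # ages recorded 70 years ago
print(range_from_aged_data(t_70yrs))  # (45, 97.5)
```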
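Both error sources are visible in a minimal sample-and-aggregate sketch (a simplification for illustration, not GUPT's implementation): each per-block answer carries estimation error because it sees only a block, and the averaged result receives Laplace noise whose scale shrinks as the number of blocks or ɛ grows:

```python
import random

def sample_and_aggregate(data, query, epsilon, out_lo, out_hi, num_blocks):
    """Average per-block query answers and add Laplace noise so the
    result is epsilon-differentially private."""
    random.shuffle(data)
    blocks = [data[i::num_blocks] for i in range(num_blocks)]
    # Estimation error: each query sees only one block, not the full dataset.
    answers = [min(max(query(b), out_lo), out_hi) for b in blocks]
    mean = sum(answers) / num_blocks
    # Changing one record perturbs one clamped block answer by at most
    # (out_hi - out_lo), so the mean's sensitivity is that range / num_blocks.
    scale = (out_hi - out_lo) / (num_blocks * epsilon)
    sign = 1 if random.random() < 0.5 else -1
    noise = sign * random.expovariate(1.0 / scale)  # one Laplace(0, scale) draw
    return mean + noise

ages = [random.gauss(50, 10) for _ in range(10_000)]
avg_age = sample_and_aggregate(ages, lambda b: sum(b) / len(b),
                               epsilon=0.5, out_lo=0, out_hi=100,
                               num_blocks=25)
```

With a larger `num_blocks`, each block shrinks (more estimation error) while the noise scale drops, which is precisely the trade-off the rest of this section optimizes.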
