
Privacy with Prior Information

Final Report

Jiangwei Pan

1 Introduction

In [2], Kifer and Machanavajjhala proposed a new privacy framework called Pufferfish. In Pufferfish, a set of pairs of facts is given and the goal is to prevent each pair from being distinguished. One example pair of facts for a medical database is ("patient i has cancer", "patient i does not have cancer"). A data publishing/query-answering mechanism guarantees Pufferfish privacy if the adversary's ability to distinguish each pair does not change much after seeing the output of the mechanism. Note that the adversary may already distinguish some pair very well before seeing the output; the Pufferfish framework only guarantees that releasing the output of the mechanism does not increase this ability by much.

It turns out that differential privacy is a special case of Pufferfish privacy in which all individual data records are generated independently according to some distributions. However, individual records in a database are not always independent. For example, in a social network, the presence of the edges (Alice, Bob) and (Alice, Eve) makes the edge (Bob, Eve) more likely to exist. When records are correlated, differential privacy does not always guarantee privacy. In [1], a new definition of neighbors, called "induced neighbors", was introduced to deal with correlations between data records, and a generic differential privacy definition was proposed based on this more general neighbor definition.

Pufferfish privacy is a good definition in terms of its semantic guarantee. However, it is not easy to design privacy-preserving mechanisms directly from Pufferfish. Instead, one would first make some assumption about the data generating process (e.g., all records are generated independently) and then instantiate Pufferfish as a simpler privacy definition. For example, it would be nice if the generic differential privacy definition fit into the Pufferfish framework when records are correlated. However, the authors of [2] gave a counterexample in which this generic differential privacy does not guarantee Pufferfish privacy.

In this project, we first propose a modified version of the induced neighbor definition under which the counterexample no longer works. Under this modified definition of induced neighbors, generic differential privacy can be defined analogously.

Our next objective is to show that this modified generic differential privacy is equivalent to the Pufferfish framework under certain assumptions on the data generating process and the constraints. We were able to show the equivalence under a single-attribute count constraint and an i.i.d. generating process.

Finally, we present a sampling-based mechanism that guarantees an approximate notion of Pufferfish privacy.

2 Induced Neighbors

We consider the case where the correlation between records is due to the existence of some constraints. For example, if the database is a contingency table, the constraints may be known row/column sums; if the database is a graph, the constraints may be previously released exact statistics about the graph, such as its degree distribution. Under a set of constraints Q, it may happen that the traditional neighbors of a database D satisfying Q, that is, databases that differ from D by one record, do not satisfy Q. Differential privacy only guarantees that these "invalid" neighbors of D are indistinguishable from D; for the "valid" neighbors, which may differ from D by more than one record, no privacy is guaranteed.

To deal with constraints, [1] proposed the notion of Induced Neighbors.

Definition 1. (Induced Neighbors [1]) Given a set Q of constraints, let D_Q be the set of databases satisfying those constraints. Let D_a and D_b be two databases. Let n_ab be the smallest number of moves necessary to transform D_a into D_b, and let m_1, ..., m_{n_ab} be one such sequence of moves. We say that D_a and D_b are neighbors induced by Q, denoted N_Q(D_a, D_b) = true, if the following holds.

• D_a ∈ D_Q and D_b ∈ D_Q.

• No proper, nonempty subset of {m_1, ..., m_{n_ab}} can transform D_a into some database D_c ∈ D_Q.

Here a move is a process that adds or deletes a tuple from the database.

Based on this definition of neighbors, generic differential privacy can be defined as follows.

Definition 2. (Generic Differential Privacy [1]) A randomized algorithm A satisfies ɛ-differential privacy if for any set S of outputs, P(A(D_i) ∈ S) ≤ e^ɛ P(A(D_j) ∈ S) whenever databases D_i and D_j are induced neighbors.

In [2], the authors gave a counterexample showing that this induced neighbor definition fails to guarantee Pufferfish privacy. Suppose each database contains n records, each of which is 0 or 1. Let the constraint Q be: ∃k such that r_i = 1 for all i ≤ k and r_i = 0 for all i > k. The databases satisfying Q are shown below.

record    D_0  D_1  D_2  ...  D_{n−2}  D_{n−1}  D_n
r_1        0    1    1   ...     1        1       1
r_2        0    0    1   ...     1        1       1
...       ...  ...  ...  ...    ...      ...     ...
r_{n−1}    0    0    0   ...     0        1       1
r_n        0    0    0   ...     0        0       1

According to Def. 1, each D_i has only two induced neighbors, namely D_{i−1} and D_{i+1} (D_0 and D_n have one neighbor each). Generic differential privacy (see Def. 2) thus only guarantees that each pair of adjacent databases, D_i and D_{i+1}, is indistinguishable.

In the Pufferfish framework, facts about each individual record are considered secrets and thus need to be kept private. In the above example, the most important fact about the first record of a database is its value, which can be either 1 or 0. Therefore, Pufferfish requires that the output distribution be similar whether the first record has value 1 or 0. Notice that under constraint Q, only D_0 has 0 in the first record, while all other D_i (i > 0) have 1 there. This essentially requires making D_0 indistinguishable from all other D_i. Therefore, the induced neighbor definition does not guarantee Pufferfish privacy.

We now slightly modify the original definition of induced neighbors so that it guarantees Pufferfish privacy for the above example.

Definition 3. (Modified Induced Neighbors) Given a set Q of constraints, let D_Q be the set of databases satisfying those constraints. Let D_a and D_b be two databases from D_Q whose j-th records are a and b respectively (a ≠ b). Let m_1, m_2, ..., m_k be a minimum sequence of moves that transforms D_a into D_b. We say that D_a and D_b are neighbors induced by Q (with respect to record j), denoted N_Q(D_a, D_b) = true, if the following holds.

• D_a ∈ D_Q and D_b ∈ D_Q.

• No proper, nonempty subset of {m_1, ..., m_k} can transform D_a into some D'_b ∈ D_Q whose j-th record equals b.

Here a move is a process that adds or deletes a tuple from the database.



In this modified definition, each neighboring pair is associated with some record j whose change produced the neighboring relation. Under this definition, every two databases D_i and D_j in the above example are induced neighbors, as the brute-force sketch below illustrates. Thus, the corresponding generic differential privacy definition guarantees Pufferfish privacy.
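The sketch below enumerates the chain databases for a small n and tests both definitions by brute force. For simplicity it models a move as flipping one record's value, which is a stand-in for the add/delete moves of the formal definitions; it is only an illustration of this example, not a general implementation.

```python
# Brute-force comparison of Definitions 1 and 3 on the chain example above.
from itertools import chain, combinations

n = 5
valid = [tuple(1 if i < k else 0 for i in range(n)) for k in range(n + 1)]  # D_0, ..., D_n

def flip(D, positions):
    return tuple(1 - v if i in positions else v for i, v in enumerate(D))

def proper_subsets(diff):
    return chain.from_iterable(combinations(diff, r) for r in range(1, len(diff)))

def neighbors_def1(Da, Db):
    diff = [i for i in range(n) if Da[i] != Db[i]]
    return not any(flip(Da, set(S)) in valid for S in proper_subsets(diff))

def neighbors_def3(Da, Db):
    diff = [i for i in range(n) if Da[i] != Db[i]]
    # the pair is a modified induced neighbor if some differing record j witnesses it
    return any(not any(flip(Da, set(S)) in valid and flip(Da, set(S))[j] == Db[j]
                       for S in proper_subsets(diff))
               for j in diff)

pairs = [(a, b) for a in valid for b in valid if a != b]
print(sum(neighbors_def1(a, b) for a, b in pairs))   # 2n: only adjacent D_i, D_{i+1}
print(sum(neighbors_def3(a, b) for a, b in pairs))   # (n+1)*n: every ordered pair
```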

Under certain types of constraints, two seemingly very different databases can become induced neighbors. For example, suppose the databases are graphs (represented by adjacency matrices) and the constraint is a fixed degree distribution of the graph (released through some prior exact queries). Then in the worst case, two graphs that are induced neighbors can differ in O(n) edges, where n is the number of nodes in the graph. If the degree of each node is 1, then each graph satisfying this constraint is a set of node-disjoint edges, i.e., a perfect matching. One such graph G_1 may contain the edges {(v_1, v_2), (v_3, v_4), ..., (v_{n−1}, v_n)} and another such graph G_2 may contain the edges {(v_2, v_3), (v_4, v_5), ..., (v_n, v_1)}. The sequence of moves transforming G_1 into G_2 removes all edges of G_1 and adds all edges of G_2, and no proper subset of those moves turns G_1 into another graph in which every node has degree 1. Therefore, G_1 and G_2 are induced neighbors, even though each contains n/2 edges that the other does not.
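A small sketch of this matching example follows; it assumes an even number of nodes labeled 1..n and only constructs the two edge sets and counts the moves needed to transform one into the other.

```python
# With every node required to have degree 1, the two perfect matchings below
# are induced neighbors although they share no edge.
n = 8
G1 = {(i, i + 1) for i in range(1, n, 2)}           # (v1,v2), (v3,v4), ..., (v_{n-1},v_n)
G2 = {(i, i % n + 1) for i in range(2, n + 1, 2)}   # (v2,v3), (v4,v5), ..., (v_n,v1)
moves = len(G1 - G2) + len(G2 - G1)                 # delete all of G1, add all of G2
print(moves)                                        # prints n
```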

3 Pufferfish With Constraints

Suppose the set of constraints is Q, and let D_Q denote the set of all databases satisfying Q. In [2], it is assumed that each database is generated according to one of the distributions θ = {π_1, ..., π_N, f_1, ..., f_N, Q}. We have

P(D | θ) = 0,  if D does not satisfy Q,
P(D | θ) = (1 / Z_Q) · ∏_{r_i ∈ D} f_i(r_i) π_i · ∏_{r_j ∉ D} (1 − π_j),  otherwise.

In other words, a database is first generated by drawing each record independently according to that individual's distribution (π_i, f_i) and is then checked against Q. If it satisfies Q, it is kept; otherwise it is discarded.
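A minimal rejection-sampling sketch of this generating process is given below. The per-record parameters pi and f and the constraint test satisfies_Q are placeholders standing in for a concrete θ and Q, not part of the framework itself.

```python
import random

def sample_database(pi, f, satisfies_Q, max_tries=100_000):
    # pi[i]: probability that record i appears; f[i](): draws record i's value
    n = len(pi)
    for _ in range(max_tries):
        D = [f[i]() if random.random() < pi[i] else None for i in range(n)]
        if satisfies_Q(D):          # accepted samples are distributed as P(D | theta)
            return D
    raise RuntimeError("constraint Q rejected every sampled database")
```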

3.1 Pufferfish Privacy

The definition of Pufferfish privacy is as follows.

Definition 4. (Pufferfish Privacy [2]) Given a set of potential secrets S, a set of discriminative pairs S_pairs, a set of data generating distributions Θ, and a privacy parameter ɛ > 0, a (potentially randomized) algorithm M satisfies ɛ-Pufferfish(S, S_pairs, Θ) privacy if

• for all possible outputs w,
• for all pairs (s_i, s_j) ∈ S_pairs of potential secrets,
• for all distributions θ ∈ Θ for which P(s_i | θ) ≠ 0 and P(s_j | θ) ≠ 0,

the following holds:

P(M(Data) = w | s_i, θ) ≤ e^ɛ P(M(Data) = w | s_j, θ)   (1)
P(M(Data) = w | s_j, θ) ≤ e^ɛ P(M(Data) = w | s_i, θ)   (2)
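For a finite output range, Definition 4 can be checked directly for one discriminative pair given the two conditional output distributions. The sketch below is only an illustration; p_si and p_sj are assumed dictionary inputs holding P(M(Data) = w | s_i, θ) and P(M(Data) = w | s_j, θ).

```python
import math

def pair_is_eps_pufferfish(p_si, p_sj, eps):
    bound = math.exp(eps)
    for w in set(p_si) | set(p_sj):
        a, b = p_si.get(w, 0.0), p_sj.get(w, 0.0)
        if a > bound * b or b > bound * a:   # Eq. (1) or Eq. (2) fails at output w
            return False
    return True
```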

It has been shown that if there is no correlation between records, then differentially private mechanisms also satisfy Pufferfish privacy. Since we have defined a general notion of neighbors, and hence of differential privacy, for the case where correlations exist, a natural question is: does generic differential privacy under our modified induced neighbor definition satisfy Pufferfish privacy?



3.2 Maximum Flow

We model our question as a maximum flow problem. Suppose we want a discriminative pair (s_i, s_j) to be indistinguishable, the set of constraints is Q, and the data generating distribution is θ. Define D_i as the set of all databases that satisfy the constraints Q as well as fact s_i; D_j is defined similarly. Define a directed graph G = ({s, t} ∪ V_i ∪ V_j, E) as follows.

• For each database D ∈ D_i, create a node v ∈ V_i; similarly, for each database D ∈ D_j, create a node v ∈ V_j.

• There is an edge (s, v) for every v ∈ V_i and an edge (v, t) for every v ∈ V_j. An edge (v, v') for v ∈ V_i and v' ∈ V_j is created if the corresponding databases D and D' are induced neighbors.

• The capacity of edge (s, v) is Pr(D_v | s_i, Q) and the capacity of edge (v', t) is Pr(D_{v'} | s_j, Q), where D_v and D_{v'} are the databases in D_i and D_j corresponding to v and v' respectively.

Notice that ∑_{D ∈ D_i} Pr(D | s_i, Q) = 1 and similarly ∑_{D ∈ D_j} Pr(D | s_j, Q) = 1. We have the following observation.

Observation 1. If the maximum flow from s to t in the graph G is equal to 1 for every data generating distribution θ, then induced neighbor privacy guarantees Pufferfish privacy.

Proof. According to the definition of Pufferfish privacy, we need to show that Eq. (1) holds, i.e.,

Pr(M(Data) = w | s_i, θ) ≤ e^ɛ Pr(M(Data) = w | s_j, θ).

The left-hand side can be decomposed as

Pr(M(Data) = w | s_i, θ) = ∑_{D ∈ D_i} Pr(M(Data) = w, Data = D | s_i, θ)
                         = ∑_{D ∈ D_i} Pr(Data = D | s_i, θ) · Pr(M(Data) = w | Data = D).

A similar decomposition applies to the right-hand side. For any D ∈ D_i and D' ∈ D_j that are induced neighbors, induced neighbor privacy guarantees that Pr(M(Data) = w | Data = D) ≤ e^ɛ Pr(M(Data) = w | Data = D').

Suppose the maximum flow of G is 1, and let f(D, D') ≥ 0 denote the flow on the edge between D ∈ D_i and D' ∈ D_j (such an edge exists only if D and D' are induced neighbors). Since a flow of value 1 saturates every source edge and every sink edge, ∑_{D' ∈ D_j} f(D, D') = Pr(Data = D | s_i, θ) for each D ∈ D_i, and ∑_{D ∈ D_i} f(D, D') = Pr(Data = D' | s_j, θ) for each D' ∈ D_j. Therefore,

Pr(M(Data) = w | s_i, θ) = ∑_{D ∈ D_i} Pr(Data = D | s_i, θ) · Pr(M(Data) = w | Data = D)
                         = ∑_{D ∈ D_i} ∑_{D' ∈ D_j} f(D, D') · Pr(M(Data) = w | Data = D)
                         ≤ e^ɛ ∑_{D ∈ D_i} ∑_{D' ∈ D_j} f(D, D') · Pr(M(Data) = w | Data = D')
                         = e^ɛ ∑_{D' ∈ D_j} Pr(Data = D' | s_j, θ) · Pr(M(Data) = w | Data = D')
                         = e^ɛ Pr(M(Data) = w | s_j, θ).

Notice that the induced neighbor relation does not depend on the generating distribution θ, which makes it easy to apply in practice.
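For small instances, the condition of Observation 1 can be tested numerically. The sketch below assumes the sets D_i and D_j, their conditional probabilities, and an induced-neighbor predicate have already been enumerated (they are hypothetical inputs), and it uses networkx, if available, to compute the maximum flow; the middle edges get capacity 1, which is effectively unbounded since the total flow is at most 1.

```python
import networkx as nx

def max_flow_is_one(db_i, db_j, prob_i, prob_j, is_induced_neighbor, tol=1e-9):
    G = nx.DiGraph()
    for a, D in enumerate(db_i):
        G.add_edge("s", ("i", a), capacity=prob_i[D])      # Pr(D | s_i, Q)
    for b, Dp in enumerate(db_j):
        G.add_edge(("j", b), "t", capacity=prob_j[Dp])     # Pr(D' | s_j, Q)
    for a, D in enumerate(db_i):
        for b, Dp in enumerate(db_j):
            if is_induced_neighbor(D, Dp):
                G.add_edge(("i", a), ("j", b), capacity=1.0)
    return abs(nx.maximum_flow_value(G, "s", "t") - 1.0) < tol
```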



3.3 A Special Case

We now consider the following assumption on the data generating process. Suppose each record of the database is generated according to the same distribution f, subject to single-attribute count constraints Q. For example, f may be the disease distribution of the whole population and the constraints may have the form C(x_1) = n_1, C(x_2) = n_2, ..., C(x_k) = n_k, where x_i is some disease and C(x_i) is the number of records with value x_i in the database. We have the following result.

Claim 1. If all records of the database are identically distributed under the single-attribute count constraints Q, then a mechanism M that satisfies induced neighbor privacy also satisfies Pufferfish privacy.

Proof. According to Observation 1, we only need to show that the maximum flow from s to t is 1 in the graph G corresponding to this data generating process. Suppose the count constraints are C(x_1) = n_1, ..., C(x_k) = n_k with ∑_{h=1}^{k} n_h ≤ n, where n is the total number of records. Consider a pair of secrets (s_i, s_j), where s_i means record r has value x_i and s_j means r has value x_j. Define D_i and D_j as before. The number of databases in D_i is

N_i = C(n−1, n_i − 1) · C(n − n_i, n_j) · A,

where A is a function of n, n_1, ..., n_k counting the arrangements of the remaining values. Similarly,

N_j = C(n−1, n_i) · C(n − n_i − 1, n_j − 1) · A.

The incoming flow to each node D ∈ D_i is 1/N_i and the outgoing flow from each node D' ∈ D_j is 1/N_j. For each D ∈ D_i, by changing r from x_i to x_j and changing one of the n_j records with value x_j to value x_i, we get an induced neighbor in D_j; therefore each D ∈ D_i has n_j induced neighbors in D_j. Similarly, each D' ∈ D_j has n_i induced neighbors in D_i. Split the incoming flow of each D evenly among its induced neighbors, so that each edge from D_i to D_j carries flow 1/(N_i n_j). Then each D' receives a total flow of n_i/(N_i n_j) = 1/N_j, which saturates its outgoing edge. Therefore, the maximum flow is 1.

Note that if ∑_{h=1}^{k} n_h < n, then each D and D' above represents a group of databases with the same configuration of the values x_1, ..., x_k.
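The identity n_i/(N_i n_j) = 1/N_j used above can be checked numerically with the common factor A dropped; the parameter values below are arbitrary.

```python
from math import comb

def N_i(n, ni, nj):   # |D_i| up to the factor A
    return comb(n - 1, ni - 1) * comb(n - ni, nj)

def N_j(n, ni, nj):   # |D_j| up to the factor A
    return comb(n - 1, ni) * comb(n - ni - 1, nj - 1)

for n, ni, nj in [(10, 3, 4), (20, 5, 5), (12, 2, 7)]:
    # n_i * N_j == n_j * N_i, i.e. flow 1/(N_i * n_j) per edge gives 1/N_j per sink node
    assert ni * N_j(n, ni, nj) == nj * N_i(n, ni, nj)
```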

3.4 A Sampling Based Mechanism

Although we have shown the equivalence of induced neighbor privacy and Pufferfish privacy in a special case, there is no instantiation of Pufferfish privacy under more general constraints, such as multi-attribute count constraints or node degree constraints for graph data. In this section, we give a sampling-based guess-and-test mechanism that guarantees an approximate notion of Pufferfish privacy.

The mechanism simply adds Laplace noise to the output. The problem is that we do not know how much noise is enough to guarantee Pufferfish privacy. Our strategy is to first guess a noise level, and then use sampling to test whether adding that much noise satisfies Pufferfish.

We first give a more general definition of Pufferfish by allowing additive error.

Definition 5. ((ɛ, δ)-Pufferfish Privacy) Given a set of potential secrets S, a set of discriminative pairs S_pairs, a set of data generating distributions Θ, and privacy parameters ɛ > 0 and δ ≥ 0, a (potentially randomized) algorithm M satisfies (ɛ, δ)-Pufferfish(S, S_pairs, Θ) privacy if

• for all possible outputs w,
• for all pairs (s_i, s_j) ∈ S_pairs of potential secrets,
• for all distributions θ ∈ Θ for which P(s_i | θ) ≠ 0 and P(s_j | θ) ≠ 0,

the following holds:

P(M(Data) = w | s_i, θ) ≤ e^ɛ P(M(Data) = w | s_j, θ) + δ   (3)
P(M(Data) = w | s_j, θ) ≤ e^ɛ P(M(Data) = w | s_i, θ) + δ   (4)

Before describing the algorithm, we state the assumptions we make:

• The prior distribution θ is given, and a database can be sampled from it efficiently.

• Every secret pair is of the form (s, ¬s), and Pr(s | θ)/Pr(¬s | θ) ∈ [1/c, c] for some constant c. This is reasonable because a secret s is only worth protecting if the adversary does not already know much about it.

• ɛ and δ are given parameters.

The algorithm works as follows:

1. Set α to some initial noise level α_0.
2. Sample O((c + 1)k) = O(k) databases independently from θ.
3. Compute the query output for each sample and add noise Lap(α).
4. For every pair of secrets (s, ¬s):
5.     Let O_s and O_¬s be the noisy outputs of the samples in which s (resp. ¬s) holds.
6.     For every possible output w:
7.         Let p_ws = Pr(w | s, θ) and p_w¬s = Pr(w | ¬s, θ).
8.         Estimate p̄_ws and p̄_w¬s as the fraction of w in O_s and O_¬s respectively.
9.         If p̄_ws > e^ɛ p̄_w¬s + δ/2:
10.            Set α = 2α and go to step 3.
11.     End for.
12. End for.
13. Use α as the noise level for all future queries.
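A minimal Python sketch of this guess-and-test procedure appears below, under the stated assumptions. The sampler sample_theta, the query function, the list of secret predicates, and the discretization width bin_width are placeholders rather than a fixed API; real numeric outputs would need sensible binning so that empirical output distributions can be compared.

```python
import math, random
from collections import Counter

def laplace(scale):
    # difference of two independent exponentials with mean `scale` is Lap(scale)
    return random.expovariate(1 / scale) - random.expovariate(1 / scale)

def calibrate_noise(sample_theta, query, secrets, eps, delta, k, alpha0=1.0, bin_width=1.0):
    alpha = alpha0
    dbs = [sample_theta() for _ in range(2 * k)]            # step 2 (O((c+1)k) in the report)
    while True:
        outs = [round((query(D) + laplace(alpha)) / bin_width) for D in dbs]   # step 3
        violated = False
        for s in secrets:                                    # steps 4-12; each pair is (s, not s)
            O_s = Counter(o for D, o in zip(dbs, outs) if s(D))
            O_ns = Counter(o for D, o in zip(dbs, outs) if not s(D))
            n_s, n_ns = max(sum(O_s.values()), 1), max(sum(O_ns.values()), 1)
            for w in set(O_s) | set(O_ns):
                p_ws, p_wns = O_s[w] / n_s, O_ns[w] / n_ns   # step 8: empirical estimates
                if p_ws > math.exp(eps) * p_wns + delta / 2 or \
                   p_wns > math.exp(eps) * p_ws + delta / 2:
                    violated = True
        if not violated:
            return alpha                                     # step 13
        alpha *= 2                                           # steps 9-10: double the noise
```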

Before discussing the running time of the above procedure, we first show the accuracy of the estimation.

Claim 2. Set the sample size k = O(((1 + e^ɛ)^2 / δ^2) log U), where U is an upper bound on both the size of the output range and the number of secret pairs. Let α_ɛ and α_ɛ,δ be the minimum noise levels that guarantee ɛ-Pufferfish and (ɛ, δ)-Pufferfish privacy respectively. With high probability, our algorithm outputs a noise level α such that α_ɛ,δ ≤ α ≤ α_ɛ.

Proof. Let S denote the set of databases in the support of θ that satisfy secret s, and let S̄ denote the remaining databases in the support. First, given 2(c + 1)k independent samples from θ, with probability at least 1 − 1/U^{O(1)}, at least k samples fall in each of S and S̄.

Fix some output w and some secret s. Let X_i be the indicator random variable of the event that the i-th sample from S produces output w; note that E[X_i] = p_ws. Let X = ∑_{i=1}^{k} X_i. Setting ∆ = δ / (2(1 + e^ɛ)), Hoeffding's inequality gives

Pr(|p̄_ws − p_ws| ≥ ∆) = Pr(|X/k − p_ws| ≥ ∆) ≤ 2e^{−2k∆^2} ≤ 1/U^{O(1)}.

By a union bound over all outputs w and all secrets s, the probability that some estimate p̄_ws is more than ∆ away from its true value is at most 1/U. In other words, with high probability, all estimates are within ∆ of their true values.
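For completeness, the choice of k in Claim 2 follows from solving the Hoeffding bound for k with this value of ∆; here c denotes the constant hidden in U^{O(1)}.

```latex
% With \Delta = \delta / (2(1+e^{\epsilon})), require 2 e^{-2k\Delta^{2}} \le U^{-c}:
k \;\ge\; \frac{c\ln U + \ln 2}{2\Delta^{2}}
  \;=\; \frac{2\,(c\ln U + \ln 2)\,(1+e^{\epsilon})^{2}}{\delta^{2}}
  \;=\; O\!\left(\frac{(1+e^{\epsilon})^{2}}{\delta^{2}}\,\log U\right).
```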



In our algorithm, if α = α_ɛ, then with high probability, for every w and every pair (s, ¬s),

p̄_ws ≤ p_ws + ∆ ≤ e^ɛ p_w¬s + ∆ ≤ e^ɛ (p̄_w¬s + ∆) + ∆ = e^ɛ p̄_w¬s + (1 + e^ɛ)∆ = e^ɛ p̄_w¬s + δ/2,

so the test in step 9 is not triggered and the algorithm stops. This means that, with high probability, the noise level output by our algorithm does not exceed α_ɛ.

On the other hand, if our algorithm outputs some α, then with high probability, for every w and every pair (s, ¬s),

p_ws ≤ p̄_ws + ∆ ≤ e^ɛ p̄_w¬s + δ/2 + ∆ ≤ e^ɛ (p_w¬s + ∆) + δ/2 + ∆ = e^ɛ p_w¬s + (1 + e^ɛ)∆ + δ/2 = e^ɛ p_w¬s + δ.

Therefore, with high probability, the noise level output by our algorithm is at least α_ɛ,δ.

The running time depends on:

• how fast a sample from θ that satisfies s can be obtained,

• how many different outputs are possible, and

• how many different secret pairs there are.

Claim 3. Let t_θ be the time needed to obtain one sample from distribution θ, and let |O| denote the number of possible outputs. The running time of the sampling algorithm is O((k(t_θ + t_output) + |O|) · log(α/α_0)), where k = O(((1 + e^ɛ)^2 / δ^2) log U) and t_output is the time to compute the query output of a database and perturb it with Laplace noise.

4 Future Work

The following questions are worth exploring in the future:

1. Is Induced Neighbor Privacy equivalent to Pufferfish under more general assumptions on the priors?

2. How can a database be sampled from some natural distribution θ?

References

[1] Daniel Kifer and Ashwin Machanavajjhala. No free lunch in data privacy. In Proceedings of the 2011 ACM SIGMOD International Conference on Management of Data, SIGMOD '11, pages 193-204, New York, NY, USA, 2011. ACM.

[2] Daniel Kifer and Ashwin Machanavajjhala. A rigorous and customizable framework for privacy. In Proceedings of the 31st Symposium on Principles of Database Systems, PODS '12, pages 77-88, New York, NY, USA, 2012. ACM.
