
3.3 A Special Case

We now consider the following assumption on the data-generating process. Suppose each record of the database is generated according to the same distribution f, subject to the single-attribute count constraints Q. For example, f may be the disease distribution of the whole population, and the constraints may be of the following form: C(x_1) = n_1, C(x_2) = n_2, ..., C(x_k) = n_k, where x_i is some disease and C(x_i) is the count of x_i in the database. We have the following result.
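For concreteness, the following minimal Python sketch shows what a database satisfying such single-attribute count constraints looks like; the disease labels and counts are hypothetical, not taken from the paper.

```python
from collections import Counter

# Toy database: one disease value per record (hypothetical labels).
database = ["flu", "flu", "cold", "measles", "cold", "flu"]

# Single-attribute count constraints Q: C(x_i) = n_i for each disease x_i.
Q = {"flu": 3, "cold": 2, "measles": 1}

def satisfies_constraints(db, q):
    """Check that the database meets every count constraint C(x_i) = n_i."""
    counts = Counter(db)
    return all(counts[x] == n for x, n in q.items())

print(satisfies_constraints(database, Q))  # True
```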

Claim 1. If all records of the database are identically distributed under the single-attribute count constraints Q, then a mechanism M that satisfies induced neighbor privacy also satisfies Pufferfish privacy.

Proof. By Observation 1, we only need to show that the maximum flow from s to t is 1 in the graph G corresponding to our data-generating process. Suppose the count constraints are C(x_1) = n_1, ..., C(x_k) = n_k with ∑_{i=1}^{k} n_i ≤ n, where n is the total number of records. Also assume there is a pair of secrets (s_i, s_j), where s_i means record r has value x_i and s_j means r has value x_j. Define D_i and D_j similarly. The number of distinct databases in D_i is

N_i = \binom{n-1}{n_i - 1} \binom{n - n_i}{n_j} A,

where A is a function of n, n_1, ..., n_k. Similarly,

N_j = \binom{n-1}{n_i} \binom{n - n_i - 1}{n_j - 1} A.

The incoming flow to each node D ∈ D_i is 1/N_i, and the outgoing flow from each node D′ ∈ D_j is 1/N_j. For each D ∈ D_i, by changing r from x_i to x_j and changing one of the n_j records with value x_j to value x_i, we get an induced neighbor in D_j. Therefore, each D ∈ D_i has n_j induced neighbors in D_j. Similarly, each D′ ∈ D_j has n_i induced neighbors in D_i. We split the incoming flow of D evenly among its induced neighbors; that is, each edge in G from D_i to D_j carries flow 1/(N_i n_j). Since N_i/N_j = n_i/n_j, each D′ then receives a total flow of n_i/(N_i n_j) = 1/N_j. Therefore, the maximum flow is 1.

Note that if ∑_{h=1}^{k} n_h < n, then each D and D′ above represents a group of databases with the same configuration for x_1, ..., x_k.
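As a numerical sanity check on the counting argument (not part of the original proof), the sketch below verifies the flow identity n_i/(N_i n_j) = 1/N_j, equivalently N_i n_j = N_j n_i, for a few arbitrary values of n, n_i, n_j; the common factor A cancels on both sides, so it is set to 1.

```python
from math import comb

# N_i: record r is fixed to x_i, so among the remaining n-1 records we
# choose n_i - 1 with value x_i, then n_j of the remaining n - n_i with x_j.
def N_i(n, ni, nj):
    return comb(n - 1, ni - 1) * comb(n - ni, nj)

# N_j: record r is fixed to x_j, so we choose n_i records with value x_i,
# then n_j - 1 of the remaining n - n_i - 1 with value x_j.
def N_j(n, ni, nj):
    return comb(n - 1, ni) * comb(n - ni - 1, nj - 1)

# Flow identity from the proof: n_i / (N_i * n_j) == 1 / N_j.
for n, ni, nj in [(10, 3, 4), (20, 5, 5), (50, 1, 10)]:
    assert N_i(n, ni, nj) * nj == N_j(n, ni, nj) * ni
print("flow identity holds on all test cases")
```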

3.4 A Sampling-Based Mechanism

Although we have shown the equivalence of induced neighbor privacy and Pufferfish privacy in a special case, there is no instantiation of Pufferfish privacy under more general constraints, such as multi-attribute count constraints or node-degree constraints for graph data. In this section, we give a sampling-based guess-and-test mechanism that guarantees an approximate notion of Pufferfish privacy.

The mechanism simply adds Laplace noise to the output. The problem is that we do not know how much noise is enough to guarantee Pufferfish privacy. Our strategy is to first guess a noise level, and then use sampling to test whether Pufferfish privacy is satisfied at that noise level.
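As a minimal illustration of the noise-adding step (a sketch, not the paper's exact mechanism), the snippet below releases a query answer plus Laplace noise; the query answer and scale are placeholders, and choosing the scale is exactly what the guess-and-test procedure must decide.

```python
import random

def laplace(scale):
    """Sample Laplace(0, scale) as the difference of two iid exponentials."""
    return random.expovariate(1.0 / scale) - random.expovariate(1.0 / scale)

def mechanism(query_answer, scale):
    """M(Data): release the true query answer plus Laplace noise."""
    return query_answer + laplace(scale)

print(mechanism(42, scale=2.0))  # e.g. 41.37 -- a noisy count
```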

We first give a more general definition of Pufferfish by allowing additive error.

Definition 5 ((ε, δ)-Pufferfish Privacy). Given a set of potential secrets S, a set of discriminative pairs S_pairs, a set of data-generating distributions Θ, and privacy parameters ε > 0 and δ ≥ 0, a (potentially randomized) algorithm M satisfies (ε, δ)-Pufferfish(S, S_pairs, Θ) privacy if

• for all possible outputs w,
• for all pairs (s_i, s_j) ∈ S_pairs of potential secrets,
• for all distributions θ ∈ Θ for which P(s_i | θ) ≠ 0 and P(s_j | θ) ≠ 0,

the following holds:

P(M(Data) = w | s_i, θ) ≤ e^ε P(M(Data) = w | s_j, θ) + δ    (3)
P(M(Data) = w | s_j, θ) ≤ e^ε P(M(Data) = w | s_i, θ) + δ    (4)
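To make the guess-and-test idea concrete, here is a hedged Python sketch of one way inequalities (3) and (4) could be checked by sampling; it is not the paper's exact procedure. The helpers sample_db(secret), which draws a database from θ conditioned on the given secret, and query(db) are assumed to be supplied by the caller, and the continuous noisy output is discretized into bins of width bin_width so that empirical output probabilities can be compared.

```python
import random
from math import exp

def laplace(scale):
    # Laplace(0, scale) as the difference of two iid exponentials.
    return random.expovariate(1.0 / scale) - random.expovariate(1.0 / scale)

def empirical_hist(sample_db, query, secret, scale, trials, bin_width):
    """Empirical distribution of M(Data) = query(Data) + Laplace(scale),
    conditioned on one secret of a discriminative pair."""
    hist = {}
    for _ in range(trials):
        w = query(sample_db(secret)) + laplace(scale)
        b = round(w / bin_width)                   # discretized output bin
        hist[b] = hist.get(b, 0.0) + 1.0 / trials
    return hist

def passes_test(h_i, h_j, eps, delta):
    """Check inequalities (3) and (4) on every observed output bin."""
    return all(
        h_i.get(b, 0.0) <= exp(eps) * h_j.get(b, 0.0) + delta
        and h_j.get(b, 0.0) <= exp(eps) * h_i.get(b, 0.0) + delta
        for b in set(h_i) | set(h_j)
    )

def guess_and_test(sample_db, query, s_i, s_j, eps, delta,
                   scale=1.0, trials=100_000, bin_width=0.5):
    """Guess a noise level and double it until the sampled test passes."""
    while True:
        h_i = empirical_hist(sample_db, query, s_i, scale, trials, bin_width)
        h_j = empirical_hist(sample_db, query, s_j, scale, trials, bin_width)
        if passes_test(h_i, h_j, eps, delta):
            return scale
        scale *= 2
```

The doubling schedule is just one simple search strategy; since the test is statistical, the resulting guarantee is itself approximate, which is the point of the (ε, δ) relaxation.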

Before describing the algorithm, we state the assumptions we make:

• the prior distribution θ is given, and a database can be sampled from it efficiently

