
Privacy with Prior Information

Final Report

Jiangwei Pan

1 Introduction

In [2], Kifer and Machanavajjhala proposed a new privacy framework called Pufferfish. In Pufferfish, a set of pairs of facts is given and the goal is to prevent each pair from being distinguished. One example pair of facts for a medical database is ("patient i has cancer", "patient i does not have cancer"). A data publishing/query-answering mechanism guarantees Pufferfish privacy if the adversary's ability to distinguish each pair does not change much after seeing the output of the mechanism. Note that the adversary may already distinguish some pair very well before seeing the output; the Pufferfish framework only guarantees that releasing the output of the mechanism does not increase this ability by much.

It turns out that differential privacy is a special case of Pufferfish privacy in which all individual data records are generated independently according to some distributions. However, individual records in a database are not always independent. For example, in a social network, the presence of the edges (Alice, Bob) and (Alice, Eve) makes the edge (Bob, Eve) more likely to exist. When records are correlated, differential privacy does not always guarantee privacy. In [1], a new definition of neighbors, called "induced neighbors", was introduced to deal with correlations between data records, and a generic differential privacy definition was proposed based on this more general neighbor definition.

Pufferfish privacy is a good definition in terms of its semantic guarantee. However, it is not easy to design privacy-preserving mechanisms directly from Pufferfish. Instead, one would first make some assumption about the data generating process (e.g., all records are generated independently) and then instantiate Pufferfish as a simpler privacy definition. For example, it would be nice if the generic differential privacy definition fit into the Pufferfish framework when records are correlated. However, the authors of [2] gave a counterexample in which this generic differential privacy does not guarantee Pufferfish privacy.

In this project, we first propose a modified version of the induced neighbor definition under which the counterexample no longer works. Under this modified definition of induced neighbors, generic differential privacy can be defined analogously.

Our next objective is to show that this modified generic differential privacy is equivalent to the Pufferfish framework under certain assumptions on the data generating process and the constraints. We were able to show the equivalence under a single-attribute count constraint and an i.i.d. generating process.

Finally, we present a sampling-based mechanism that guarantees an approximate notion of Pufferfish privacy.

2 Induced Neighbors

We consider the case where the correlation between records is due to the existence of some constraints. For example, if the database is a contingency table, the constraints may be known row/column sums; if the database is a graph, the constraints may be previously released exact statistics about the graph, such as its degree distribution. Under a set of constraints Q, it may happen that the traditional neighbors of a database D satisfying Q, that is, databases that differ from D by one record, do not satisfy Q. Differential privacy only guarantees that these "invalid" neighbors of D are indistinguishable from D; for the "valid" neighbors, which may differ from D by more than one record, no privacy is guaranteed.

To deal with constraints, [1] proposed the notion of Induced Neighbors.

Definition 1. (Induced Neighbors [1]) Given a set Q of constraints, let D_Q be the set of databases satisfying those constraints. Let D_a and D_b be two databases. Let n_ab be the smallest number of moves necessary to transform D_a into D_b, and let m_1, ..., m_{n_ab} be one such sequence of moves. We say that D_a and D_b are neighbors induced by Q, denoted N_Q(D_a, D_b) = true, if the following holds.

• D_a ∈ D_Q and D_b ∈ D_Q.

• No proper, nonempty subset of {m_1, ..., m_{n_ab}} can transform D_a into some database D_c ∈ D_Q.

Here a move is a process that adds or deletes a tuple from the database.

Based on this definition of neighbors, generic differential privacy can be defined as follows.

Definition 2. (Generic Differential Privacy [1]) A randomized algorithm A satisfies ɛ-differential privacy if for any set S of outputs, P(A(D_i) ∈ S) ≤ e^ɛ P(A(D_j) ∈ S) whenever databases D_i and D_j are induced neighbors.

In [2], the authors gave a counterexample showing that this induced neighbor definition fails to guarantee Pufferfish privacy. Suppose each database contains n records, each of which is 0 or 1. Let the constraint Q be: ∃k such that r_i = 1 for all i ≤ k and r_i = 0 for all i > k. The databases satisfying Q are shown below.

record    D_0  D_1  D_2  ...  D_{n−2}  D_{n−1}  D_n
r_1        0    1    1   ...     1        1       1
r_2        0    0    1   ...     1        1       1
...       ...  ...  ...  ...    ...      ...     ...
r_{n−1}    0    0    0   ...     0        1       1
r_n        0    0    0   ...     0        0       1

According to Def. 1, each D_i has only two induced neighbors, namely D_{i−1} and D_{i+1} (D_0 and D_n have one neighbor each). Generic differential privacy (see Def. 2) thus only guarantees that each pair of adjacent databases, D_i and D_{i+1}, is indistinguishable.

In the Pufferfish framework, facts about each individual record are considered secrets and thus need to be kept private. In the above example, the most important fact about the first record of a database is its value, which can be either 1 or 0. Therefore, Pufferfish requires that the output distribution be similar whether the first record has value 1 or 0. Notice that under constraint Q, only D_0 has 0 in the first record, while all other D_i (i > 0) have 1 there. This essentially requires making D_0 indistinguishable from all other D_i. Therefore, the induced neighbor definition does not guarantee Pufferfish privacy.

We now slightly modify the original definition of induced neighbors so that it guarantees Pufferfish privacy for the above example.

Definition 3. (Modified Induced Neighbors) Given a set Q of constraints, let D_Q be the set of databases satisfying those constraints. Let D_a and D_b be two databases from D_Q whose j-th records are a and b respectively (a ≠ b). Let m_1, m_2, ..., m_k be a minimum sequence of moves that transforms D_a into D_b. We say that D_a and D_b are neighbors induced by Q (with respect to record j), denoted N_Q(D_a, D_b) = true, if the following holds.

• D_a ∈ D_Q and D_b ∈ D_Q.

• No proper, nonempty subset of {m_1, ..., m_k} can transform D_a into some D'_b ∈ D_Q whose j-th record equals b.

Here a move is a process that adds or deletes a tuple from the database.



In this modified definition, each neighboring pair is associated with some record j whose change produced the neighboring relation. Under this definition, every two databases D_i and D_j in the above example are induced neighbors, as the brute-force sketch below illustrates. Thus, the corresponding generic differential privacy definition guarantees Pufferfish privacy.
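The sketch below enumerates the chain databases for a small n and tests both definitions by brute force. For simplicity it models a move as flipping one record's value, which is a stand-in for the add/delete moves of the formal definitions; it is only an illustration of this example, not a general implementation.

```python
# Brute-force comparison of Definitions 1 and 3 on the chain example above.
from itertools import chain, combinations

n = 5
valid = [tuple(1 if i < k else 0 for i in range(n)) for k in range(n + 1)]  # D_0, ..., D_n

def flip(D, positions):
    return tuple(1 - v if i in positions else v for i, v in enumerate(D))

def proper_subsets(diff):
    return chain.from_iterable(combinations(diff, r) for r in range(1, len(diff)))

def neighbors_def1(Da, Db):
    diff = [i for i in range(n) if Da[i] != Db[i]]
    return not any(flip(Da, set(S)) in valid for S in proper_subsets(diff))

def neighbors_def3(Da, Db):
    diff = [i for i in range(n) if Da[i] != Db[i]]
    # the pair is a modified induced neighbor if some differing record j witnesses it
    return any(not any(flip(Da, set(S)) in valid and flip(Da, set(S))[j] == Db[j]
                       for S in proper_subsets(diff))
               for j in diff)

pairs = [(a, b) for a in valid for b in valid if a != b]
print(sum(neighbors_def1(a, b) for a, b in pairs))   # 2n: only adjacent D_i, D_{i+1}
print(sum(neighbors_def3(a, b) for a, b in pairs))   # (n+1)*n: every ordered pair
```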

Under certain types of constraints, two seemingly very different databases can become induced neighbors. For example, suppose the databases are graphs (represented by adjacency matrices) and the constraint is a fixed degree distribution of the graph (released through some prior exact queries). Then in the worst case, two graphs that are induced neighbors can differ in O(n) edges, where n is the number of nodes in the graph. If the degree of each node is 1, then each graph satisfying this constraint is a set of node-disjoint edges, i.e., a perfect matching. One such graph G_1 may contain the edges {(v_1, v_2), (v_3, v_4), ..., (v_{n−1}, v_n)} and another such graph G_2 may contain the edges {(v_2, v_3), (v_4, v_5), ..., (v_n, v_1)}. The sequence of moves transforming G_1 into G_2 removes all edges of G_1 and adds all edges of G_2, and no proper subset of those moves turns G_1 into another graph in which every node has degree 1. Therefore, G_1 and G_2 are induced neighbors, even though each contains n/2 edges that the other does not.
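A small sketch of this matching example follows; it assumes an even number of nodes labeled 1..n and only constructs the two edge sets and counts the moves needed to transform one into the other.

```python
# With every node required to have degree 1, the two perfect matchings below
# are induced neighbors although they share no edge.
n = 8
G1 = {(i, i + 1) for i in range(1, n, 2)}           # (v1,v2), (v3,v4), ..., (v_{n-1},v_n)
G2 = {(i, i % n + 1) for i in range(2, n + 1, 2)}   # (v2,v3), (v4,v5), ..., (v_n,v1)
moves = len(G1 - G2) + len(G2 - G1)                 # delete all of G1, add all of G2
print(moves)                                        # prints n
```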

3 Pufferfish With Constraints

Suppose the set of constraints is Q, and let D_Q denote the set of all databases satisfying Q. In [2], it is assumed that each database is generated according to one of the distributions θ = {π_1, ..., π_N, f_1, ..., f_N, Q}. We have

P(D | θ) = 0,  if D does not satisfy Q,
P(D | θ) = (1 / Z_Q) · ∏_{r_i ∈ D} f_i(r_i) π_i · ∏_{r_j ∉ D} (1 − π_j),  otherwise.

In other words, a database is first generated by drawing each record independently according to that individual's distribution (π_i, f_i) and is then checked against Q. If it satisfies Q, it is kept; otherwise it is discarded.
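A minimal rejection-sampling sketch of this generating process is given below. The per-record parameters pi and f and the constraint test satisfies_Q are placeholders standing in for a concrete θ and Q, not part of the framework itself.

```python
import random

def sample_database(pi, f, satisfies_Q, max_tries=100_000):
    # pi[i]: probability that record i appears; f[i](): draws record i's value
    n = len(pi)
    for _ in range(max_tries):
        D = [f[i]() if random.random() < pi[i] else None for i in range(n)]
        if satisfies_Q(D):          # accepted samples are distributed as P(D | theta)
            return D
    raise RuntimeError("constraint Q rejected every sampled database")
```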

3.1 Pufferfish Privacy

The definition of Pufferfish privacy is as follows.

Definition 4. (Pufferfish Privacy [2]) Given a set of potential secrets S, a set of discriminative pairs S_pairs, a set of data generating distributions Θ, and a privacy parameter ɛ > 0, a (potentially randomized) algorithm M satisfies ɛ-Pufferfish(S, S_pairs, Θ) privacy if

• for all possible outputs w,
• for all pairs (s_i, s_j) ∈ S_pairs of potential secrets,
• for all distributions θ ∈ Θ for which P(s_i | θ) ≠ 0 and P(s_j | θ) ≠ 0,

the following holds:

P(M(Data) = w | s_i, θ) ≤ e^ɛ P(M(Data) = w | s_j, θ)   (1)
P(M(Data) = w | s_j, θ) ≤ e^ɛ P(M(Data) = w | s_i, θ)   (2)
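For a finite output range, Definition 4 can be checked directly for one discriminative pair given the two conditional output distributions. The sketch below is only an illustration; p_si and p_sj are assumed dictionary inputs holding P(M(Data) = w | s_i, θ) and P(M(Data) = w | s_j, θ).

```python
import math

def pair_is_eps_pufferfish(p_si, p_sj, eps):
    bound = math.exp(eps)
    for w in set(p_si) | set(p_sj):
        a, b = p_si.get(w, 0.0), p_sj.get(w, 0.0)
        if a > bound * b or b > bound * a:   # Eq. (1) or Eq. (2) fails at output w
            return False
    return True
```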

It has been shown that if there is no correlation between records, then differentially private mechanisms also satisfy Pufferfish privacy. Since we have defined a general notion of neighbors, and hence of differential privacy, for the case where correlations exist, a natural question is: does generic differential privacy under our modified induced neighbor definition satisfy Pufferfish privacy?



3.2 Maximum Flow

We model our question as a maximum flow problem. Suppose we want a discriminative pair (s_i, s_j) to be indistinguishable, the set of constraints is Q, and the data generating distribution is θ. Define D_i as the set of all databases that satisfy the constraints Q as well as fact s_i; D_j is defined similarly. Define a directed graph G = ({s, t} ∪ V_i ∪ V_j, E) as follows.

• For each database D ∈ D_i, create a node v ∈ V_i; similarly, for each database D ∈ D_j, create a node v ∈ V_j.

• There is an edge (s, v) for every v ∈ V_i and an edge (v, t) for every v ∈ V_j. An edge (v, v') for v ∈ V_i and v' ∈ V_j is created if the corresponding databases D and D' are induced neighbors.

• The capacity of edge (s, v) is Pr(D_v | s_i, Q) and the capacity of edge (v', t) is Pr(D_{v'} | s_j, Q), where D_v and D_{v'} are the databases in D_i and D_j corresponding to v and v' respectively.

Notice that ∑_{D ∈ D_i} Pr(D | s_i, Q) = 1 and similarly ∑_{D ∈ D_j} Pr(D | s_j, Q) = 1. We have the following observation.

Observation 1. If the maximum flow from s to t in the graph G is equal to 1 for every data generating distribution θ, then induced neighbor privacy guarantees Pufferfish privacy.

Proof. According to the definition of Pufferfish privacy, we need to show that Eq. (1) holds, i.e.,

Pr(M(Data) = w | s_i, θ) ≤ e^ɛ Pr(M(Data) = w | s_j, θ).

The left-hand side can be decomposed as

Pr(M(Data) = w | s_i, θ) = ∑_{D ∈ D_i} Pr(M(Data) = w, Data = D | s_i, θ)
                         = ∑_{D ∈ D_i} Pr(Data = D | s_i, θ) · Pr(M(Data) = w | Data = D).

A similar decomposition applies to the right-hand side. For any D ∈ D_i and D' ∈ D_j that are induced neighbors, induced neighbor privacy guarantees that Pr(M(Data) = w | Data = D) ≤ e^ɛ Pr(M(Data) = w | Data = D').

Suppose the maximum flow of G is 1, and let f(D, D') ≥ 0 denote the flow on the edge between D ∈ D_i and D' ∈ D_j (such an edge exists only if D and D' are induced neighbors). Since a flow of value 1 saturates every source edge and every sink edge, ∑_{D' ∈ D_j} f(D, D') = Pr(Data = D | s_i, θ) for each D ∈ D_i, and ∑_{D ∈ D_i} f(D, D') = Pr(Data = D' | s_j, θ) for each D' ∈ D_j. Therefore,

Pr(M(Data) = w | s_i, θ) = ∑_{D ∈ D_i} Pr(Data = D | s_i, θ) · Pr(M(Data) = w | Data = D)
                         = ∑_{D ∈ D_i} ∑_{D' ∈ D_j} f(D, D') · Pr(M(Data) = w | Data = D)
                         ≤ e^ɛ ∑_{D ∈ D_i} ∑_{D' ∈ D_j} f(D, D') · Pr(M(Data) = w | Data = D')
                         = e^ɛ ∑_{D' ∈ D_j} Pr(Data = D' | s_j, θ) · Pr(M(Data) = w | Data = D')
                         = e^ɛ Pr(M(Data) = w | s_j, θ).

Notice that the induced neighbor relation does not depend on the generating distribution θ, which makes it easy to apply in practice.
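For small instances, the condition of Observation 1 can be tested numerically. The sketch below assumes the sets D_i and D_j, their conditional probabilities, and an induced-neighbor predicate have already been enumerated (they are hypothetical inputs), and it uses networkx, if available, to compute the maximum flow; the middle edges get capacity 1, which is effectively unbounded since the total flow is at most 1.

```python
import networkx as nx

def max_flow_is_one(db_i, db_j, prob_i, prob_j, is_induced_neighbor, tol=1e-9):
    G = nx.DiGraph()
    for a, D in enumerate(db_i):
        G.add_edge("s", ("i", a), capacity=prob_i[D])      # Pr(D | s_i, Q)
    for b, Dp in enumerate(db_j):
        G.add_edge(("j", b), "t", capacity=prob_j[Dp])     # Pr(D' | s_j, Q)
    for a, D in enumerate(db_i):
        for b, Dp in enumerate(db_j):
            if is_induced_neighbor(D, Dp):
                G.add_edge(("i", a), ("j", b), capacity=1.0)
    return abs(nx.maximum_flow_value(G, "s", "t") - 1.0) < tol
```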



3.3 A Special Case

We now consider the following assumption on the data generating process. Suppose each record of the database is generated according to the same distribution f, subject to single-attribute count constraints Q. For example, f may be the disease distribution of the whole population and the constraints may have the form C(x_1) = n_1, C(x_2) = n_2, ..., C(x_k) = n_k, where x_i is some disease and C(x_i) is the number of records with value x_i in the database. We have the following result.

Claim 1. If all records of the database are identically distributed under the single-attribute count constraints Q, then a mechanism M that satisfies induced neighbor privacy also satisfies Pufferfish privacy.

Proof. According to Observation 1, we only need to show that the maximum flow from s to t is 1 in the graph G corresponding to this data generating process. Suppose the count constraints are C(x_1) = n_1, ..., C(x_k) = n_k with ∑_{h=1}^{k} n_h ≤ n, where n is the total number of records. Consider a pair of secrets (s_i, s_j), where s_i means record r has value x_i and s_j means r has value x_j. Define D_i and D_j as before. The number of databases in D_i is

N_i = C(n−1, n_i − 1) · C(n − n_i, n_j) · A,

where A is a function of n, n_1, ..., n_k counting the arrangements of the remaining values. Similarly,

N_j = C(n−1, n_i) · C(n − n_i − 1, n_j − 1) · A.

The incoming flow to each node D ∈ D_i is 1/N_i and the outgoing flow from each node D' ∈ D_j is 1/N_j. For each D ∈ D_i, by changing r from x_i to x_j and changing one of the n_j records with value x_j to value x_i, we get an induced neighbor in D_j; therefore each D ∈ D_i has n_j induced neighbors in D_j. Similarly, each D' ∈ D_j has n_i induced neighbors in D_i. Split the incoming flow of each D evenly among its induced neighbors, so that each edge from D_i to D_j carries flow 1/(N_i n_j). Then each D' receives a total flow of n_i/(N_i n_j) = 1/N_j, which saturates its outgoing edge. Therefore, the maximum flow is 1.

Note that if ∑_{h=1}^{k} n_h < n, then each D and D' above represents a group of databases with the same configuration of the values x_1, ..., x_k.
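The identity n_i/(N_i n_j) = 1/N_j used above can be checked numerically with the common factor A dropped; the parameter values below are arbitrary.

```python
from math import comb

def N_i(n, ni, nj):   # |D_i| up to the factor A
    return comb(n - 1, ni - 1) * comb(n - ni, nj)

def N_j(n, ni, nj):   # |D_j| up to the factor A
    return comb(n - 1, ni) * comb(n - ni - 1, nj - 1)

for n, ni, nj in [(10, 3, 4), (20, 5, 5), (12, 2, 7)]:
    # n_i * N_j == n_j * N_i, i.e. flow 1/(N_i * n_j) per edge gives 1/N_j per sink node
    assert ni * N_j(n, ni, nj) == nj * N_i(n, ni, nj)
```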

3.4 A Sampling Based Mechanism

Although we have shown the equivalence of induced neighbor privacy and Pufferfish privacy in a special case, there is no instantiation of Pufferfish privacy under more general constraints, such as multi-attribute count constraints or node degree constraints for graph data. In this section, we give a sampling-based guess-and-test mechanism that guarantees an approximate notion of Pufferfish privacy.

The mechanism simply adds Laplace noise to the output. The problem is that we do not know how much noise is enough to guarantee Pufferfish privacy. Our strategy is to first guess a noise level, and then use sampling to test whether adding that much noise satisfies Pufferfish.

We first give a more general definition of Pufferfish by allowing additive error.

Definition 5. ((ɛ, δ)-Pufferfish Privacy) Given a set of potential secrets S, a set of discriminative pairs S_pairs, a set of data generating distributions Θ, and privacy parameters ɛ > 0 and δ ≥ 0, a (potentially randomized) algorithm M satisfies (ɛ, δ)-Pufferfish(S, S_pairs, Θ) privacy if

• for all possible outputs w,
• for all pairs (s_i, s_j) ∈ S_pairs of potential secrets,
• for all distributions θ ∈ Θ for which P(s_i | θ) ≠ 0 and P(s_j | θ) ≠ 0,

the following holds:

P(M(Data) = w | s_i, θ) ≤ e^ɛ P(M(Data) = w | s_j, θ) + δ   (3)
P(M(Data) = w | s_j, θ) ≤ e^ɛ P(M(Data) = w | s_i, θ) + δ   (4)

Before describing the algorithm, we state the assumptions we make:

• The prior distribution θ is given, and a database can be sampled from it efficiently.

• Every secret pair is of the form (s, ¬s), and Pr(s | θ)/Pr(¬s | θ) ∈ [1/c, c] for some constant c. This is reasonable because a secret s is only worth protecting if the adversary does not already know much about it.

• ɛ and δ are given parameters.

The algorithm works as follows:

1. Set α to some initial noise level α_0.
2. Sample O((c + 1)k) = O(k) databases independently from θ.
3. Compute the query output for each sample and add noise Lap(α).
4. For every pair of secrets (s, ¬s):
5.     Let O_s and O_¬s be the noisy outputs of the samples in which s (resp. ¬s) holds.
6.     For every possible output w:
7.         Let p_ws = Pr(w | s, θ) and p_w¬s = Pr(w | ¬s, θ).
8.         Estimate p̄_ws and p̄_w¬s as the fraction of w in O_s and O_¬s respectively.
9.         If p̄_ws > e^ɛ p̄_w¬s + δ/2:
10.            Set α = 2α and go to step 3.
11.     End for.
12. End for.
13. Use α as the noise level for all future queries.
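A minimal Python sketch of this guess-and-test procedure appears below, under the stated assumptions. The sampler sample_theta, the query function, the list of secret predicates, and the discretization width bin_width are placeholders rather than a fixed API; real numeric outputs would need sensible binning so that empirical output distributions can be compared.

```python
import math, random
from collections import Counter

def laplace(scale):
    # difference of two independent exponentials with mean `scale` is Lap(scale)
    return random.expovariate(1 / scale) - random.expovariate(1 / scale)

def calibrate_noise(sample_theta, query, secrets, eps, delta, k, alpha0=1.0, bin_width=1.0):
    alpha = alpha0
    dbs = [sample_theta() for _ in range(2 * k)]            # step 2 (O((c+1)k) in the report)
    while True:
        outs = [round((query(D) + laplace(alpha)) / bin_width) for D in dbs]   # step 3
        violated = False
        for s in secrets:                                    # steps 4-12; each pair is (s, not s)
            O_s = Counter(o for D, o in zip(dbs, outs) if s(D))
            O_ns = Counter(o for D, o in zip(dbs, outs) if not s(D))
            n_s, n_ns = max(sum(O_s.values()), 1), max(sum(O_ns.values()), 1)
            for w in set(O_s) | set(O_ns):
                p_ws, p_wns = O_s[w] / n_s, O_ns[w] / n_ns   # step 8: empirical estimates
                if p_ws > math.exp(eps) * p_wns + delta / 2 or \
                   p_wns > math.exp(eps) * p_ws + delta / 2:
                    violated = True
        if not violated:
            return alpha                                     # step 13
        alpha *= 2                                           # steps 9-10: double the noise
```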

Before discussing the running time of the above procedure, we first show the accuracy of the estimation.

Claim 2. Set the sample size k = O(((1 + e^ɛ)^2 / δ^2) log U), where U is an upper bound on both the size of the output range and the number of secret pairs. Let α_ɛ and α_ɛ,δ be the minimum noise levels that guarantee ɛ-Pufferfish and (ɛ, δ)-Pufferfish privacy respectively. With high probability, our algorithm outputs a noise level α such that α_ɛ,δ ≤ α ≤ α_ɛ.

Proof. Let S denote the set of databases in the support of θ that satisfy secret s, and let S̄ denote the remaining databases in the support. First, given 2(c + 1)k independent samples from θ, with probability at least 1 − 1/U^{O(1)}, at least k samples fall in each of S and S̄.

Fix some output w and some secret s. Let X_i be the indicator random variable of the event that the i-th sample from S produces output w; note that E[X_i] = p_ws. Let X = ∑_{i=1}^{k} X_i. Setting ∆ = δ / (2(1 + e^ɛ)), Hoeffding's inequality gives

Pr(|p̄_ws − p_ws| ≥ ∆) = Pr(|X/k − p_ws| ≥ ∆) ≤ 2e^{−2k∆^2} ≤ 1/U^{O(1)}.

By a union bound over all outputs w and all secrets s, the probability that some estimate p̄_ws is more than ∆ away from its true value is at most 1/U. In other words, with high probability, all estimates are within ∆ of their true values.
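For completeness, the choice of k in Claim 2 follows from solving the Hoeffding bound for k with this value of ∆; here c denotes the constant hidden in U^{O(1)}.

```latex
% With \Delta = \delta / (2(1+e^{\epsilon})), require 2 e^{-2k\Delta^{2}} \le U^{-c}:
k \;\ge\; \frac{c\ln U + \ln 2}{2\Delta^{2}}
  \;=\; \frac{2\,(c\ln U + \ln 2)\,(1+e^{\epsilon})^{2}}{\delta^{2}}
  \;=\; O\!\left(\frac{(1+e^{\epsilon})^{2}}{\delta^{2}}\,\log U\right).
```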



In our algorithm, if α = α_ɛ, then with high probability, for every w and every pair (s, ¬s),

p̄_ws ≤ p_ws + ∆ ≤ e^ɛ p_w¬s + ∆ ≤ e^ɛ (p̄_w¬s + ∆) + ∆ = e^ɛ p̄_w¬s + (1 + e^ɛ)∆ = e^ɛ p̄_w¬s + δ/2,

so the test in step 9 is not triggered and the algorithm stops. This means that, with high probability, the noise level output by our algorithm does not exceed α_ɛ.

On the other hand, if our algorithm outputs some α, then with high probability, for every w and every pair (s, ¬s),

p_ws ≤ p̄_ws + ∆ ≤ e^ɛ p̄_w¬s + δ/2 + ∆ ≤ e^ɛ (p_w¬s + ∆) + δ/2 + ∆ = e^ɛ p_w¬s + (1 + e^ɛ)∆ + δ/2 = e^ɛ p_w¬s + δ.

Therefore, with high probability, the noise level output by our algorithm is at least α_ɛ,δ.

The running time depends on:

• how fast a sample from θ that satisfies s can be obtained,

• how many different outputs are possible, and

• how many different secret pairs there are.

Claim 3. Let t_θ be the time needed to obtain one sample from distribution θ, and let |O| denote the number of possible outputs. The running time of the sampling algorithm is O((k(t_θ + t_output) + |O|) · log(α/α_0)), where k = O(((1 + e^ɛ)^2 / δ^2) log U) and t_output is the time to compute the query output of a database and perturb it with Laplace noise.

4 Future Work

The following questions are worth exploring in the future:

1. Is Induced Neighbor Privacy equivalent to Pufferfish under more general assumptions on the priors?

2. How can a database be sampled from some natural distribution θ?

References

[1] Daniel Kifer and Ashwin Machanavajjhala. No free lunch in data privacy. In Proceedings of the 2011 ACM SIGMOD International Conference on Management of Data, SIGMOD '11, pages 193-204, New York, NY, USA, 2011. ACM.

[2] Daniel Kifer and Ashwin Machanavajjhala. A rigorous and customizable framework for privacy. In Proceedings of the 31st Symposium on Principles of Database Systems, PODS '12, pages 77-88, New York, NY, USA, 2012. ACM.
