Output Perturbation with Query Relaxation By: XIAO Xiaokui and TAO Yufei Presenter: CUI Yingjie.

1 Output Perturbation with Query Relaxation By: XIAO Xiaokui and TAO Yufei Presenter: CUI Yingjie

2 Outline: Introduction & Motivation; The Histogram Approach; Query Relaxation; Experiments; Conclusion

3 Problem Settings Microdata: sensitive personal data held by an organization, e.g. medical records, transaction history. Often open to public access for reasons such as research.

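4 Risk to Privacy An attacker knows that Alice's age is 20 and her zipcode is 15000. To infer Alice's income, s/he issues two queries:
$q_0$: SELECT COUNT(*) FROM T WHERE Age ∈ [20, 20] AND Zipcode ∈ [15k, 15k] AND Income ∈ [80k, +∞)
$q_0'$: SELECT COUNT(*) FROM T WHERE Age ∈ [20, 20] AND Zipcode ∈ [15k, 15k] AND Income ∈ (-∞, 80k)
If Alice is the only person with this age and zipcode, the exact answers to the two queries reveal which income range she falls in.
Table 1: (microdata table not reproduced in this transcript)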

5 Solutions Output Perturbation: inject small random noise into each query result. ε-Differential Privacy: Let Q be the set of previously answered queries. Given a new query q, the database determines whether {q} ∪ Q violates ε-differential privacy.

6 Output Perturbation Count Queries: SELECT COUNT(*) FROM T WHERE pred($A_1$) AND ... AND pred($A_d$), where each pred($A_i$) has the form $A_i = *$ or $A_i \in [x_i, y_i]$. Perturbed Answer: given a query $q$, the database $D$ returns $q(T) + \delta$, where $\delta$ is a random variable following the Laplace distribution $f(\delta) = \frac{1}{2\lambda} e^{-|\delta| / \lambda}$, and $\lambda$ is the noise magnitude.
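A minimal sketch of the perturbation step, using only the Python standard library. The function names are illustrative and not from the slides; the sampler relies on the fact that the difference of two independent exponential variables with mean λ follows a Laplace distribution with magnitude λ.

```python
import math
import random


def laplace_pdf(delta: float, lam: float) -> float:
    """Density f(delta) = 1/(2*lam) * exp(-|delta| / lam)."""
    return math.exp(-abs(delta) / lam) / (2.0 * lam)


def laplace_noise(lam: float, rng: random.Random) -> float:
    """Draw one Laplace(0, lam) sample as the difference of two
    independent exponential variables, each with mean lam."""
    return rng.expovariate(1.0 / lam) - rng.expovariate(1.0 / lam)


def perturbed_count(true_count: int, lam: float, rng: random.Random) -> float:
    """Return q(T) + delta for a count query whose exact answer is true_count."""
    return true_count + laplace_noise(lam, rng)


rng = random.Random(0)  # fixed seed, only to make the sketch reproducible
print(perturbed_count(42, lam=10.0, rng=rng))
```

A larger λ spreads the noise more widely, which hides individual tuples better but makes each answer less accurate.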

7 ε-Differential Privacy Sibling Tables: two microdata tables $T_1$ and $T_2$ that have the same schema and cardinality and differ in only one tuple, e.g. $T_2$ is obtained by changing Alice's income from 85k to 30k. ε-Differential Privacy: Let $Q = \{q_1, ..., q_m\}$ be any subset of the queries that have been answered by $D$, and $R = \{r_1, ..., r_m\}$ be a set of arbitrary real numbers. $D$ ensures ε-differential privacy if the following inequality holds for any $R$ and any pair of sibling tables $T_1$ and $T_2$: $\Pr[\forall i,\ q_i(D) = r_i \mid \Delta_1] \le e^{\varepsilon} \cdot \Pr[\forall i,\ q_i(D) = r_i \mid \Delta_2]$, where $\Delta_i$ denotes the event that $T_i$ is the table on which $D$ is constructed.

8 ε-Differential Privacy: An Example A statistical database $D$ is built on $T_1$. $Q$ is the set of queries issued by an attacker, and $S_{rst}$ is the set of results returned by $D$. Suppose $D$ were instead constructed on another table $T_2$ in which Alice's income is arbitrarily modified; it could still return $S_{rst}$. ε-differential privacy requires Pr[ D returns $S_{rst}$ | Alice's income is NOT modified ] ≤ $e^{\varepsilon}$ · Pr[ D returns $S_{rst}$ | Alice's income is modified ]. For small ε, $e^{\varepsilon} \approx 1 + \varepsilon$, which is close to 1, so the results are almost equally likely in both cases and reveal little about Alice's income. A smaller ε leads to better privacy.
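The inequality can be checked numerically for the simplest case of a single count query, where changing one tuple changes the exact answer by at most 1. The sketch below is an illustration under that assumption and is not taken from the slides: it compares the Laplace densities of the same output under two sibling tables whose exact counts are 1 and 0, and confirms that the ratio never exceeds $e^{1/\lambda}$.

```python
import math


def laplace_pdf(delta: float, lam: float) -> float:
    """Density of Laplace noise with magnitude lam."""
    return math.exp(-abs(delta) / lam) / (2.0 * lam)


def worst_ratio(count1: float, count2: float, lam: float, outputs) -> float:
    """Largest ratio, over the candidate outputs r, between the density of
    returning r when the exact count is count1 and when it is count2."""
    return max(
        laplace_pdf(r - count1, lam) / laplace_pdf(r - count2, lam)
        for r in outputs
    )


lam = 10.0
outputs = [x / 10.0 for x in range(-500, 501)]  # candidate answers in [-50, 50]
ratio = worst_ratio(1, 0, lam, outputs)          # sibling tables: counts 1 and 0
bound = math.exp(1.0 / lam)                      # e^epsilon with epsilon = 1/lam
print(ratio, bound)
```

For this one-query case the bound corresponds to ε = 1/λ; with several queries the relevant quantity is the sensitivity of the whole query set rather than of a single query.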

9 Computation of ε-Differential Privacy $L_1$ Sensitivity: given a set $Q$ of queries, its $L_1$ sensitivity is $S_{L1}(Q) = \max_{T_1, T_2} \sum_{q \in Q} |q(T_1) - q(T_2)|$, where $T_1$ and $T_2$ are any two sibling tables. An example: $Q = \{q_0, q_0'\}$, $T_1$ is Table 1, and $T_2$ changes Alice's income to 30k. We show that $S_{L1}(Q) = 2$. Upper bound: for any pair of sibling tables, $|q_0(T_1) - q_0(T_2)| \le 1$ and $|q_0'(T_1) - q_0'(T_2)| \le 1$, so $S_{L1}(Q) \le 2$. Lower bound: for this particular pair, $q_0(T_1) = 1$, $q_0(T_2) = 0$, $q_0'(T_1) = 0$, $q_0'(T_2) = 1$, so $S_{L1}(Q) \ge |1 - 0| + |0 - 1| = 2$. Hence $S_{L1}(Q) = 2$.
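The lower-bound half of this example can be reproduced with a few lines of Python. Table 1 is not available in the transcript, so the table below is a hypothetical stand-in: only Alice's row (age 20, zipcode 15000, income 85k) follows the slides, and the other rows are invented for illustration.

```python
from typing import Callable, List, Tuple

Row = Tuple[int, int, int]  # (age, zipcode, income)


def count(table: List[Row], pred: Callable[[Row], bool]) -> int:
    """Exact answer of a COUNT(*) query with the given predicate."""
    return sum(1 for row in table if pred(row))


def q0(row: Row) -> bool:
    age, zipcode, income = row
    return age == 20 and zipcode == 15000 and income >= 80000


def q0_prime(row: Row) -> bool:
    age, zipcode, income = row
    return age == 20 and zipcode == 15000 and income < 80000


# First row is Alice; the remaining rows are hypothetical filler.
T1: List[Row] = [(20, 15000, 85000), (30, 25000, 40000), (45, 32000, 60000)]
# Sibling table: same cardinality, only Alice's income changes to 30k.
T2: List[Row] = [(20, 15000, 30000)] + T1[1:]

l1_difference = sum(abs(count(T1, q) - count(T2, q)) for q in (q0, q0_prime))
print(l1_difference)
```

This only evaluates one pair of sibling tables, so it yields a lower bound on the sensitivity; the matching upper bound comes from the argument on the slide.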

10 Computation of ε-Differential Privacy Theorem 1: A statistical database $D$ ensures ε-differential privacy if and only if $S_{L1}(Q) \le \varepsilon\lambda$. Lemma 1: Deciding whether $S_{L1}(Q)$ is larger than a threshold is NP-hard. Proof: a reduction from the maximum 2-satisfiability (MAX-2-SAT) problem ... So the verification of ε-differential privacy is NP-hard.

12 Some Definitions Data Space/Query Region: We regard the data space Ω of a table $T$ as a $d$-dimensional space, where the $i$-th dimension is $A_i$. The region of a query $q$ is a rectangle $r$ in Ω such that if $q$ has a predicate "$A_i \in [x_i, y_i]$", the projection of $r$ on $A_i$ equals $[x_i, y_i]$. Popularity/Convergence: For any point $p$ in the data space Ω, its popularity $p(Q)$ is the number of query regions that cover $p$. The convergence of $Q$, denoted $C(Q)$, is the largest $p(Q)$ over all points $p \in Ω$.
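A small sketch of these two definitions, assuming every query region is a closed axis-aligned rectangle with finite bounds (a predicate of the form $A_i = *$ would be written as the full domain of $A_i$). The brute-force search is an illustrative choice, not the method from the slides: it relies on the fact that a point of maximum popularity can always be moved so that each coordinate sits on the lower bound of some region.

```python
from itertools import product
from typing import Sequence, Tuple

Region = Tuple[Tuple[float, float], ...]  # one (low, high) pair per dimension


def popularity(point: Sequence[float], regions: Sequence[Region]) -> int:
    """Number of query regions that cover the point."""
    return sum(
        all(lo <= x <= hi for x, (lo, hi) in zip(point, region))
        for region in regions
    )


def convergence(regions: Sequence[Region]) -> int:
    """Largest popularity over all points, found by testing every
    combination of per-dimension lower bounds."""
    if not regions:
        return 0
    dims = len(regions[0])
    candidates = [sorted({region[i][0] for region in regions}) for i in range(dims)]
    return max(popularity(p, regions) for p in product(*candidates))


# Hypothetical regions over (Age, Income).
regions = [
    ((20, 50), (40000, 70000)),
    ((30, 60), (50000, 90000)),
    ((55, 70), (10000, 30000)),
]
print(popularity((30, 50000), regions), convergence(regions))
```

The candidate set grows exponentially with the number of dimensions, so this is only practical for tiny examples.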

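13 The Upper Bound of $S_{L1}(Q)$ Lemma 2: For any set $Q$ of queries, $S_{L1}(Q) \le 2C(Q)$. Proof: ... This bound motivates a simple approach to ensure ε-differential privacy: if $C(Q) \le \lambda\varepsilon/2$, then $S_{L1}(Q) \le \lambda\varepsilon$, and ε-differential privacy follows from Theorem 1.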

14 A Histogram Approach Tracking the popularity of every point in Ω is not practical. Instead, we can maintain a histogram $H$ that partitions the data space Ω into rectangular buckets. Each bucket $B$ has a counter $B.c$ that records the number of queries intersecting it. As long as $B.c \le \lambda\varepsilon/2$ for every bucket, ε-differential privacy is preserved. If a new query intersects a bucket whose counter is greater than or equal to $\lambda\varepsilon/2$, the query is rejected.
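The admission rule can be sketched as follows. This is a deliberately simplified illustration: the buckets are fixed up front and never split, and the class and method names are invented here. The check `counter + 1 > limit` keeps every counter at or below λε/2, which coincides with the rule above whenever λε/2 is an integer.

```python
from typing import List, Sequence, Tuple

Region = Tuple[Tuple[int, int], ...]  # one closed (low, high) pair per dimension


def intersects(a: Region, b: Region) -> bool:
    """True if two closed rectangles share at least one point."""
    return all(max(lo1, lo2) <= min(hi1, hi2)
               for (lo1, hi1), (lo2, hi2) in zip(a, b))


class FixedHistogram:
    """Histogram with a fixed set of buckets (no splitting)."""

    def __init__(self, buckets: Sequence[Region], lam: float, eps: float) -> None:
        self.buckets: List[Region] = list(buckets)
        self.counters: List[int] = [0] * len(self.buckets)
        self.limit = lam * eps / 2.0  # maximum permissible popularity

    def try_accept(self, region: Region) -> bool:
        """Accept the query and update counters, or reject it unchanged."""
        hit = [i for i, b in enumerate(self.buckets) if intersects(b, region)]
        if any(self.counters[i] + 1 > self.limit for i in hit):
            return False
        for i in hit:
            self.counters[i] += 1
        return True


# Two hypothetical buckets over (Age, Income); lam * eps / 2 = 3.
hist = FixedHistogram(
    buckets=[((0, 49), (0, 99999)), ((50, 99), (0, 99999))],
    lam=12.0,
    eps=0.5,
)
q_young: Region = ((20, 30), (40000, 70000))
results = [hist.try_accept(q_young) for _ in range(4)]
print(results, hist.counters)
```

Because a bucket counter counts every query that touches the bucket, coarse buckets overestimate popularity and reject queries early, which is why splitting buckets matters.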

15 A Histogram Approach: Simple Split Initially there is a single bucket. When needed, a bucket $B$ is split into two buckets $B'$ and $B''$ in a way that minimizes $B'.c + B''.c$. The maximum number of buckets θ is a system parameter. An example where the maximum permissible popularity $\lambda\varepsilon/2$ is 3:

16 A Histogram Approach: the Split Algorithm
Algorithm Split(B) /* B is a bucket to be decomposed */
    U = the set of regions of the queries in Q that partially intersect B
    if U ≠ ∅:
        remove B from H
        r∩ = the intersection of all the regions in U
        if r∩ = ∅:
            split B into buckets B' and B'' with the minimum B'.c + B''.c, using the cutting lines passing the boundaries of the regions in U
        else:
            repetitively split B by the cutting lines passing the boundaries of r∩ until a bucket has extent r∩
        insert the new buckets into H with counters set to B.c

17 A Histogram Approach: A Complex Split Query $q_4$: SELECT COUNT(*) FROM T WHERE Age = * AND Income ∈ [40000, 99999]

18 Limitation of Output Perturbation Volume of a query: the percentage of points in Ω that satisfy the query. Consider a solution that 1) ensures ε-differential privacy and 2) perturbs each answer with Laplace noise of magnitude λ, and let θ be the maximum number of queries that such a solution can process. If each query has a volume of at least $s'$ and at most $1 - s'$ ($0 < s' \le 1/2$), then $\theta < \lambda\varepsilon / s'$. For queries with volume in (0, 1), the solution can process at most $n \cdot \lambda\varepsilon$ queries.
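As a rough worked instance of the first bound, take the hypothetical values λ = 100, ε = 0.5 and s' = 0.01 (chosen here for illustration, not taken from the slides):

```latex
\theta < \frac{\lambda\varepsilon}{s'} = \frac{100 \times 0.5}{0.01} = 5000
```

Under these values no such solution can process 5000 or more queries, however the queries are placed, and halving ε halves the bound.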

20 Query Relaxation Once the maximum number of supported queries is reached, all new queries are rejected. Instead of simply refusing a query, we may return a useful synthetic answer that is derived only from previously answered queries, so privacy is not violated. This process is called relaxation. An example:
$q_1'$: SELECT COUNT(*) FROM T WHERE Age ∈ [20, 51] AND Income ∈ [40K, 70K]
$q_1$: SELECT COUNT(*) FROM T WHERE Age ∈ [20, 50] AND Income ∈ [40K, 70K]

21 Query Relaxation: Compound Two disjoint sets $P^+$ and $P^-$ of queries constitute a compound $P$ if 1) for each point $p$ in Ω, $p(P^+) - p(P^-)$ equals 0 or 1, and 2) all points $p$ satisfying $p(P^+) - p(P^-) = 1$ form a rectangle $r_{diff}$, which is the difference region of $P$. A synthetic answer of $P$ is calculated as $\sum_{q \in P^+} q(D) - \sum_{q \in P^-} q(D)$.
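The two compound conditions can be checked by brute force on a small discrete data space. The sketch below assumes an integer grid (Age in years, Income in units of 10K) and invented function names; returning `None` when no point has difference 1 is also a choice made here for simplicity.

```python
from itertools import product
from typing import Optional, Sequence, Tuple

Region = Tuple[Tuple[int, int], ...]  # one closed (low, high) pair per dimension


def covers(region: Region, point: Sequence[int]) -> bool:
    return all(lo <= x <= hi for x, (lo, hi) in zip(point, region))


def difference_region(plus: Sequence[Region], minus: Sequence[Region],
                      domain: Region) -> Optional[Region]:
    """Return r_diff if (plus, minus) is a compound on the grid, else None."""
    ones = []
    for p in product(*(range(lo, hi + 1) for lo, hi in domain)):
        diff = sum(covers(r, p) for r in plus) - sum(covers(r, p) for r in minus)
        if diff not in (0, 1):
            return None            # condition 1 violated
        if diff == 1:
            ones.append(p)
    if not ones:
        return None
    box = tuple((min(p[i] for p in ones), max(p[i] for p in ones))
                for i in range(len(domain)))
    size = 1
    for lo, hi in box:
        size *= hi - lo + 1
    # The points form a rectangle exactly when they fill their bounding box.
    return box if size == len(ones) else None


def synthetic_answer(plus_answers: Sequence[float],
                     minus_answers: Sequence[float]) -> float:
    """Sum of perturbed answers in P+ minus sum of perturbed answers in P-."""
    return sum(plus_answers) - sum(minus_answers)


domain: Region = ((0, 99), (0, 9))        # Age 0..99, Income 0..9 (x 10K)
p_plus = [((20, 60), (4, 7))]
p_minus = [((51, 60), (4, 7))]
print(difference_region(p_plus, p_minus, domain))
print(synthetic_answer([120.5], [20.0]))  # hypothetical perturbed answers
```

Here the compound covers Age ∈ [20, 50] and Income ∈ [40K, 70K], so its synthetic answer estimates the count for that rectangle without issuing any new query.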

22 Relaxation Error Relaxation Error $E(P, q)$ can be calculated using the formula below: Let $Q$ be a set of accepted queries and $P$ a compound. A query $q \in Q$ that is not in $P$ is a positive (negative) patch if, after including it in $P^+$ ($P^-$), 1) $P$ remains a compound and 2) $E(P, q^*)$ decreases.

23 Artificial Patches We can dynamically generate a query, force the database to process it normally, and use its perturbed answer to obtain a better synthetic answer for the denied query. $2d$ artificial queries are generated, where $d$ is the dimensionality of Ω, each aligned with one boundary of $r_{diff}$. Each artificial query is then checked for whether it is a patch and whether it would violate ε-differential privacy.

24 Probabilistic Accuracy We return a synthetic answer $\sum_{q \in P^+} q(D) - \sum_{q \in P^-} q(D)$ together with a relaxed query $q^{*\prime}$. The synthetic answer has the expected value $q^{*\prime}(T)$, and its variance is $2\lambda^2 \cdot |P^+ \cup P^-|$, where λ is the noise magnitude. A tradeoff: more queries in $P$ lower the relaxation error but increase the noise in the result. So the user may specify an upper bound ξ on the size of a compound.
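The variance follows from a short calculation, assuming each answered query $q$ carries its own independent Laplace noise $\delta_q$ of magnitude λ, so that $q(D) = q(T) + \delta_q$ and $\mathrm{Var}[\delta_q] = 2\lambda^2$:

```latex
\mathrm{Var}\Big[\sum_{q \in P^+} q(D) - \sum_{q \in P^-} q(D)\Big]
  = \sum_{q \in P^+ \cup P^-} \mathrm{Var}[\delta_q]
  = 2\lambda^2 \cdot |P^+ \cup P^-|
```

For instance, with a hypothetical λ = 10 and a compound of 3 queries, the variance is 600, i.e. a standard deviation of about 24.5.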

25 An Illustration of Relaxation

27 Experiment Settings Dataset: CENSUS. Computer: 3G Pentium IV, 1G RAM. Parameters: Queries: SELECT COUNT(*) FROM CENSUS WHERE $A_1 \in [x_1, y_1]$ AND $A_2 \in [x_2, y_2]$. The center $z_i$ of the range $[x_i, y_i]$ is chosen in 2 different ways: 1) Data: $z_i = t[A_i]$, where $t$ is a random tuple. 2) Uniform: $z_i$ is a random value in the domain of $A_i$. The workload consists of 20K queries.
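The two ways of choosing query centers can be sketched as below. The slides do not state how the range width is set, so the `width_frac` parameter (range length as a fraction of each attribute's domain), the clipping to the domain, and the tiny table are all assumptions made for illustration.

```python
import random
from typing import List, Sequence, Tuple

Range = Tuple[float, float]


def make_query(table: Sequence[Sequence[float]], domains: Sequence[Range],
               width_frac: float, mode: str, rng: random.Random) -> List[Range]:
    """Build one range per attribute around a center chosen by `mode`."""
    if mode == "data":
        # Data: the center is the attribute vector of a random tuple.
        centers = [float(v) for v in rng.choice(table)]
    elif mode == "uniform":
        # Uniform: each center is a random value in the attribute's domain.
        centers = [rng.uniform(lo, hi) for lo, hi in domains]
    else:
        raise ValueError("mode must be 'data' or 'uniform'")
    query = []
    for z, (lo, hi) in zip(centers, domains):
        half = width_frac * (hi - lo) / 2.0
        query.append((max(lo, z - half), min(hi, z + half)))  # clip to domain
    return query


# Hypothetical (A1, A2) tuples and attribute domains.
table = [(25, 30000), (40, 52000), (63, 18000)]
domains = [(0, 99), (0, 100000)]
rng = random.Random(7)  # fixed seed, only to make the sketch reproducible

data_queries = [make_query(table, domains, 0.1, "data", rng) for _ in range(5)]
uniform_queries = [make_query(table, domains, 0.1, "uniform", rng) for _ in range(5)]
print(data_queries[0], uniform_queries[0])
```

Data-centered queries concentrate where tuples are dense, while uniform centers spread the query regions evenly over the data space.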

28 Experiment: Processing Capability Without Relaxation Two approaches: Disjoint: reject a query if its region intersects the region of any previously answered query. Histogram.

29 Experiment: Processing Capability Without Relaxation Effects of ε and s: The upper bound of capacity: n * λε. Queries with larger regions cause faster growth of C(Q).

30 Experiment: Quality of Relaxation Effects of compound size: A larger compound raises the chance of finding a good compound. The compound size can be well below the bound ξ because of early termination.

31 Experiment: Quality of Relaxation Effects of ε: A greater ε allows more queries, thus a larger query set Q for relaxation, which enhances the relaxation quality.

32 Experiment: Quality of Relaxation Effects of s: Queries with larger regions cause faster growth of C(Q), which results in a smaller query set Q and a higher relaxation error.

33 Experiment: Computation Overhead A greater ε (s) results in a higher (lower) query processing capacity and a larger (smaller) query set Q. A greater ξ (θ) results in larger compounds (more buckets).

35 Conclusion & Future Works Propose an applicable solution (the histogram) to ensure ε-differential privacy. Use query relaxation to overcome the limitation of query processing capacity. Future works: 1) apply the approach to other kinds of queries (SUM, MIN, MAX, etc.); 2) consider updates to the database; 3) handle other types of microdata besides relational tables.

36 THANKS

