
1 Privacy-MaxEnt: Integrating Background Knowledge in Privacy Quantification. Wenliang (Kevin) Du, Zhouxuan Teng, and Zutao Zhu. Department of Electrical Engineering & Computer Science, Syracuse University, Syracuse, New York.

2 Introduction
- Privacy-Preserving Data Publishing.
- The impact of background knowledge: How does it affect privacy? How to measure its impact on privacy?
- Integrate background knowledge in privacy quantification: Privacy-MaxEnt, a systematic approach based on well-established theories.
- Evaluation.

3 Privacy-Preserving Data Publishing
- Data disguise methods: Randomization, Generalization (e.g., Mondrian), Bucketization (e.g., Anatomy).
- Our Privacy-MaxEnt method can be applied to Generalization and Bucketization; we use Bucketization in this presentation.

4 Data Sets
(Table omitted; its columns are Identifier, Quasi-Identifier (QI), and Sensitive Attribute (SA).)

5 Bucketized Data
P(Breast cancer | {female, college}, bucket=1) = 1/4
P(Breast cancer | {female, junior}, bucket=2) = 1/3
(Table omitted; its columns are Quasi-Identifier (QI) and Sensitive Attribute (SA).)
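These two numbers follow from simple counting in the published tables: with no background knowledge, every record in a bucket is equally likely to hold each of the bucket's sensitive values. A minimal sketch with a made-up bucketized table (the data values are hypothetical, chosen only to reproduce 1/4 and 1/3):

```python
# Hypothetical Anatomy-style publication: the QI table and the
# sensitive-attribute table are linked only through the bucket ID.
qi_table = [  # (QI tuple, bucket)
    (("female", "college"), 1), (("male", "college"), 1),
    (("female", "college"), 1), (("male", "junior"), 1),
    (("female", "junior"), 2), (("female", "junior"), 2),
    (("male", "junior"), 2),
]
sa_table = [  # (bucket, sensitive value)
    (1, "breast cancer"), (1, "flu"), (1, "flu"), (1, "diabetes"),
    (2, "breast cancer"), (2, "flu"), (2, "diabetes"),
]

def p_s_given_q(qi, bucket, s):
    """Baseline P(s | qi, bucket) with no background knowledge: the QI/SA
    linkage inside the bucket is broken, so the estimate depends only on
    how often s occurs in the bucket."""
    in_bucket = [v for b, v in sa_table if b == bucket]
    return in_bucket.count(s) / len(in_bucket)

print(p_s_given_q(("female", "college"), 1, "breast cancer"))  # 0.25
print(p_s_given_q(("female", "junior"), 2, "breast cancer"))   # 0.333...
```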

6 Impact of Background Knowledge
- Background knowledge: it is rare for a male to have breast cancer.
- This analysis is hard for large data sets.

7 Previous Studies
- Martin et al., ICDE'07: the first formal study on background knowledge.
- Chen, LeFevre, and Ramakrishnan, VLDB'07: improves the previous work.
- Both deal with rule-based knowledge, i.e., deterministic knowledge.
- Background knowledge can be much more complicated: it may be uncertain knowledge.

8 Complicated Background Knowledge
- Rule-based knowledge: P(s | q) = 1; P(s | q) = 0.
- Probability-based knowledge: P(s | q) = 0.2; P(s | Alice) = 0.2.
- Vague background knowledge: 0.3 ≤ P(s | q) ≤ 0.5.
- Miscellaneous types: P(s | q1) + P(s | q2) = 0.7; one of Alice and Bob has "Lung Cancer".

9 Challenges
- How can we analyze privacy in a systematic way for large data sets and complicated background knowledge? Directly computing P(S | Q) is hard.
- What do we want to compute? P(S | Q), given the background knowledge and the published data set; P(S | Q) is the primitive underlying most privacy metrics.

10 Our Approach
Consider P(S | Q) as a variable x (a vector). Background knowledge and the published data (the public information) each translate into constraints on x; solving for x under these constraints yields the most unbiased solution.

11 Maximum Entropy Principle
- "Information theory provides a constructive criterion for setting up probability distributions on the basis of partial knowledge, and leads to a type of statistical inference which is called the maximum entropy estimate. It is the least biased estimate possible on the given information." — E. T. Jaynes, 1957.

12 The MaxEnt Approach
Background knowledge and the published data (the public information) each become constraints on P(S | Q); the Maximum Entropy estimate of P(S | Q) is then computed under those constraints.

13 Entropy
Because H(S | Q, B) = H(Q, S, B) − H(Q, B), where B denotes the bucket, the constraints should use P(Q, S, B) as their variables.
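Written out, the quantities referred to above are the standard joint and conditional entropies (a textbook restatement, not notation copied from the slide's figure):

```latex
H(Q,S,B) = -\sum_{q,s,b} P(q,s,b)\,\log P(q,s,b),
\qquad
H(Q,B) = -\sum_{q,b} P(q,b)\,\log P(q,b),
\qquad
H(S \mid Q,B) = H(Q,S,B) - H(Q,B).
```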

14 Maximum Entropy Estimate
- Let the vector x = P(Q, S, B).
- Find the value of x that maximizes its entropy H(Q, S, B) while satisfying the equality constraints h1(x) = c1, …, hu(x) = cu and the inequality constraints g1(x) ≤ d1, …, gv(x) ≤ dv.
- This is a special case of non-linear programming.
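A minimal sketch of this non-linear program in Python, using scipy's SLSQP solver (which accepts both equality and inequality constraints). The matrices A_eq, c, A_ineq, d are assumed to be built elsewhere from the published data and the background knowledge; this illustrates the optimization step, not the authors' implementation:

```python
import numpy as np
from scipy.optimize import minimize

def max_ent_estimate(A_eq, c, A_ineq=None, d=None):
    """Find x = P(Q,S,B) that maximizes the entropy H(Q,S,B) subject to
    A_eq @ x = c and, optionally, A_ineq @ x <= d.
    A_eq is assumed to already contain the normalization row (sum(x) == 1)."""
    n = A_eq.shape[1]

    def neg_entropy(x):
        x = np.clip(x, 1e-12, None)            # avoid log(0)
        return float(np.sum(x * np.log(x)))    # minimizing -H(x) maximizes H(x)

    cons = [{"type": "eq", "fun": lambda x: A_eq @ x - c}]
    if A_ineq is not None:
        # SLSQP's "ineq" means fun(x) >= 0, so A_ineq @ x <= d becomes d - A_ineq @ x >= 0.
        cons.append({"type": "ineq", "fun": lambda x: d - A_ineq @ x})

    x0 = np.full(n, 1.0 / n)                   # start from the uniform distribution
    res = minimize(neg_entropy, x0, method="SLSQP",
                   bounds=[(0.0, 1.0)] * n, constraints=cons)
    return res.x
```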

15 Constraints from Knowledge
- The linear model is quite generic.
- Conditional probability: P(S | Q) = P(Q, S) / P(Q).
- Background knowledge has nothing to do with the bucket B, so B is marginalized out: P(Q, S) = P(Q, S, B=1) + … + P(Q, S, B=m).
- In this way, background knowledge becomes constraints on P(Q, S, B), as sketched below.
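For example, a statement such as P(s | q) = 0.2 becomes linear after multiplying through by P(q) and expanding both P(q, s) and P(q) over the buckets. A sketch of building one such constraint row (the indexing scheme idx is my own bookkeeping, not the paper's notation):

```python
import numpy as np

def knowledge_row(idx, q, s, alpha, S_vals, B_vals):
    """Return a row r such that r @ x = 0 encodes P(s | q) = alpha, i.e.
        sum_b P(q,s,b) - alpha * sum_{s',b} P(q,s',b) = 0,
    where idx[(q, s, b)] maps each joint value to its position in x."""
    row = np.zeros(len(idx))
    for b in B_vals:
        if (q, s, b) in idx:
            row[idx[(q, s, b)]] += 1.0         # the P(q, s, b) terms
        for s2 in S_vals:
            if (q, s2, b) in idx:
                row[idx[(q, s2, b)]] -= alpha  # minus alpha * P(q, s', b)
    return row
```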

16 Constraints from Published Data
- The constraints must be the truth and only the truth: absolutely correct for the original data set, with no inference.
- In this way, the published data set D' also becomes constraints on P(Q, S, B).

17 Assignment and Constraints
- Observation: the original data set is one of the possible assignments of sensitive values to QI tuples within each bucket.
- Constraint: a constraint derived from the published data must hold for all possible assignments.

18 QI Constraint
- Constraint: for each QI value q and bucket b, Σ_s P(q, s, b) equals the fraction of records in the published data that have QI value q in bucket b.
- Example: if q = {female, college} appears twice in bucket 1 and the table has N records in total, then Σ_s P({female, college}, s, 1) = 2/N.

19 SA Constraint
- Constraint: for each sensitive value s and bucket b, Σ_q P(q, s, b) equals the fraction of records in the published data that have sensitive value s in bucket b.
- Example: if "Breast cancer" appears once in bucket 1 and the table has N records in total, then Σ_q P(q, Breast cancer, 1) = 1/N.

20 Zero Constraint
- P(q, s, b) = 0 if q or s does not appear in bucket b.
- These constraints let us reduce the number of variables (see the sketch below).
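Taken together, the QI, SA, and zero constraints can be read directly off an Anatomy-style publication. A rough sketch (the table formats and names are assumptions; rows and rhs satisfy rows @ x = rhs):

```python
import numpy as np
from collections import Counter

def data_constraints(qi_table, sa_table, idx, N):
    """Build QI, SA, and zero constraints on x = P(Q,S,B).
    qi_table: list of (q, b); sa_table: list of (b, s);
    idx[(q, s, b)] gives each variable's position; N = total record count."""
    rows, rhs = [], []

    # QI constraint: for each (q, b), sum_s P(q,s,b) = count(q in bucket b) / N.
    for (q, b), cnt in Counter(qi_table).items():
        row = np.zeros(len(idx))
        for (q2, s2, b2), j in idx.items():
            if q2 == q and b2 == b:
                row[j] = 1.0
        rows.append(row); rhs.append(cnt / N)

    # SA constraint: for each (s, b), sum_q P(q,s,b) = count(s in bucket b) / N.
    for (b, s), cnt in Counter(sa_table).items():
        row = np.zeros(len(idx))
        for (q2, s2, b2), j in idx.items():
            if s2 == s and b2 == b:
                row[j] = 1.0
        rows.append(row); rhs.append(cnt / N)

    # Zero constraint: handled by construction here -- idx only lists (q, s, b)
    # combinations where both q and s actually appear in bucket b, so the
    # zero-valued variables simply never exist.
    return np.array(rows), np.array(rhs)
```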

21 Theoretical Properties
- Soundness: are the constraints correct? Easy to prove.
- Completeness: have we missed any constraint? See our theorems and proofs.
- Conciseness: are there redundant constraints? Only one redundant constraint in each bucket.
- Consistency: is our approach consistent with the existing methods (i.e., when the background knowledge is Ø)?

22 Completeness w.r.t. Equations
- Have we missed any equality constraint? Yes: if F1 = C1 and F2 = C2 are constraints, then F1 + F2 = C1 + C2 is also a constraint. However, it is redundant.
- Completeness Theorem: let U be our constraint set; every linear equality constraint can be written as a linear combination of the constraints in U.

23 Completeness w.r.t. Inequalities
- Have we missed any inequality constraint? Yes: if F = C holds, then F ≤ C + 0.2 is also valid (but redundant).
- Completeness Theorem: our constraint set is also complete in the inequality sense.

24 Putting Them Together
Background knowledge and the published data (the public information) are turned into constraints on P(S | Q), and the Maximum Entropy estimate of P(S | Q) is computed under those constraints.
Tools: LBFGS, TOMLAB, KNITRO, etc.

25 Inevitable Questions
- Where do we get background knowledge? Do we have to be very, very knowledgeable?
- For P(s | q)-type knowledge, all useful knowledge is in the original data set, in the form of association rules: positive rules Q → S, and negative rules Q → ¬S, ¬Q → S, ¬Q → ¬S.
- We bound the knowledge in our study to the top-K strongest association rules (a sketch follows below).
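One way to realize a "top-K strongest association rules" bound in experiments is to rank rules q → s by their confidence P(s | q) in the original table; a rough sketch covering only positive rules (the exact strength measure used by the authors is not stated here, so confidence is an assumption):

```python
from collections import Counter

def top_k_rules(original, k):
    """original: list of (q, s) pairs from the un-disguised table.
    Return the k rules q -> s with the highest confidence P(s | q)."""
    q_counts = Counter(q for q, _ in original)
    qs_counts = Counter(original)
    rules = [(q, s, cnt / q_counts[q]) for (q, s), cnt in qs_counts.items()]
    rules.sort(key=lambda r: r[2], reverse=True)
    return rules[:k]
```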

26 Knowledge about Individuals
- Alice: (i1, q1); Bob: (i4, q2); Charlie: (i9, q5).
- Knowledge 1: Alice has either s1 or s4. Constraint: P(s1 | Alice) + P(s4 | Alice) = 1.
- Knowledge 2: two people among Alice, Bob, and Charlie have s4. Constraint: P(s4 | Alice) + P(s4 | Bob) + P(s4 | Charlie) = 2.

27 Evaluation
- Implementation: Lagrange multipliers turn the constrained optimization into an unconstrained optimization; LBFGS then solves the unconstrained optimization problem (see the sketch below).
- Hardware: Pentium 3 GHz CPU with 4 GB of memory.
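The reduction mentioned here is the standard Lagrange-multiplier trick for maximum entropy under linear equality constraints: the optimal x has an exponential form in the multipliers, and the resulting unconstrained (convex) dual can be minimized with an L-BFGS solver. A sketch using scipy's L-BFGS-B in place of the authors' LBFGS code, covering only equality constraints A @ x = c (normalization is handled by the softmax form):

```python
import numpy as np
from scipy.optimize import minimize

def max_ent_dual(A, c):
    """Maximize H(x) subject to A @ x = c and sum(x) = 1.
    The optimum has the form x_j = exp((A.T @ lam)_j) / Z(lam);
    lam is found by minimizing the dual  log Z(lam) - lam @ c."""
    def dual(lam):
        logits = A.T @ lam
        m = logits.max()
        log_z = m + np.log(np.exp(logits - m).sum())   # stable log-sum-exp
        return log_z - lam @ c

    res = minimize(dual, np.zeros(A.shape[0]), method="L-BFGS-B")
    logits = A.T @ res.x
    x = np.exp(logits - logits.max())
    return x / x.sum()
```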

28 Privacy versus Knowledge
Estimation accuracy: the KL distance between P_MaxEnt(S | Q) and P_Original(S | Q).
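A small helper for this accuracy measure, assuming the two conditionals are supplied as aligned probability vectors over the sensitive values for a given q (the direction of the divergence is not stated on this slide, so it is a guess here):

```python
import numpy as np

def kl_distance(p_original, p_maxent, eps=1e-12):
    """KL(p_original || p_maxent) for two conditional distributions P(S | q)
    over the same sensitive values; swap the arguments for the other direction."""
    p_original = np.asarray(p_original, dtype=float)
    p_maxent = np.asarray(p_maxent, dtype=float)
    return float(np.sum(p_original * np.log((p_original + eps) / (p_maxent + eps))))
```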

29 Privacy versus # of QI attributes

30 Performance vs. Knowledge

31 Running Time vs. Data Size

32 Iterations vs. Data Size

33 Conclusion
- Privacy-MaxEnt is a systematic method: it models various types of background knowledge, models the information from the published data, and is based on well-established theory.
- Future work: reducing the number of constraints; vague background knowledge; background knowledge about individuals.

