Probabilistic Inference Protection on Anonymized Data

Raymond Chi-Wing Wong (The Hong Kong University of Science and Technology)
Ada Wai-Chee Fu (The Chinese University of Hong Kong)
Ke Wang (Simon Fraser University)
Yabo Xu (Sun Yat-sen University)
Jian Pei (Simon Fraser University)
Philip S. Yu (University of Illinois at Chicago)
Prepared and presented by Raymond Chi-Wing Wong

Outline
1. Introduction (l-diversity)
2. Background Knowledge
3. Proposed Model
4. Conclusion

1. l-diversity

Simplified 2-diversity: to generate a data set such that each individual is linked to a sensitive value (e.g., Lung Cancer) with probability at most 1/2.

Original data:

Patient    Gender  Age  Disease
Alan       Male    41   Lung Cancer
Betty      Female  42   Hypertension
Catherine  Female  63   Flu
Diana      Female  64   HIV

After bucketization, the data set is released to the public as two tables (Knowledge 1).

QI Table:

Gender  Age  GID
Male    41   L1
Female  42   L1
Female  63   L2
Female  64   L2

Sensitive Table:

GID  Disease
L1   Lung Cancer
L1   Hypertension
L2   Flu
L2   HIV

This data set satisfies 2-diversity.

Knowledge 2: I also know Alan with (Male, 41).

Combining Knowledge 1 and Knowledge 2, we can deduce that Alan is linked to Lung Cancer with probability 1/2. In other words, P(Alan is linked to Lung Cancer) is at most 1/2.

1. l-diversity

Besides the released QI Table and Sensitive Table (Knowledge 1) and the fact that Alan has (Male, 41) (Knowledge 2), the adversary may also have Knowledge 3.

Knowledge 3: QI-based distribution. This can be obtained from statistical reports from the US Department of Health and Human Services and from other statistical data sources discussed in previous studies.

p()     Lung Cancer  Not Lung Cancer
Male    0.1          0.9
Female  0.003        0.997

The released data set satisfies 2-diversity.

1. l-diversity

The released data set satisfies 2-diversity. However, combining Knowledge 1, 2 and 3, we can deduce that Alan is linked to Lung Cancer with very high probability (much greater than 1/2).

Why? According to the QI-based distribution (p(Lung Cancer) = 0.1 for Male and 0.003 for Female), it is more likely that a male patient is linked to Lung Cancer than a female patient.

1. l-diversity

Objective: to make sure that the probability that an individual is linked to a sensitive value is bounded by a threshold (e.g., 1/2).

We need to formulate how to calculate this probability (e.g., P(Alan is linked to Lung Cancer)) according to Knowledge 1, 2 and 3.
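To see why the probability can be much greater than 1/2, here is a minimal sketch for Alan's A-group (L1). It assumes, as a simplification not stated on the slides, that the QI-based probabilities act as independent priors per tuple and that the adversary conditions on the released fact that exactly one of the two tuples in L1 has Lung Cancer. The function name `posterior_two_tuple` is hypothetical.

```python
def posterior_two_tuple(f_target, f_other):
    """P(target has the sensitive value | exactly one of two tuples has it).

    Assumption: the two tuples have independent priors f_target and f_other.
    """
    target_only = f_target * (1 - f_other)   # target has it, the other does not
    other_only = (1 - f_target) * f_other    # the other has it, target does not
    return target_only / (target_only + other_only)


# Alan is Male (prior 0.1); Betty is Female (prior 0.003).
p_alan = posterior_two_tuple(0.1, 0.003)
print(round(p_alan, 4))  # 0.9736
```

Under these assumptions the probability is about 0.97, far above the 1/2 threshold, even though the released tables satisfy 2-diversity.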

1. l-diversity

Objective: to make sure that P(Alan is linked to Lung Cancer) ≤ 1/2.

Challenge 1: Calculating the probability (e.g., P(Alan is linked to Lung Cancer)) is computationally expensive.

Challenge 2: The formula for this probability is not monotonic with respect to the A-group size. Most existing privacy studies involve formulae that are monotonic, and most existing algorithms (e.g., Incognito and Mondrian) rely on this monotonic property.

1. l-diversity

Related Work: There is a closely related work [LLZ09] for this problem. [LLZ09] approximates the formula for this probability. Thus, there is no solid guarantee on the privacy protection.

[LLZ09] T. Li, N. Li and J. Zhang, "Modeling and Integrating Background Knowledge in Data Anonymization", ICDE 2009.

1. l-diversity

Contributions: We propose a condition. If this condition is satisfied, we can guarantee the privacy requirement (i.e., P(Alan is linked to Lung Cancer) ≤ 1/2). Besides, this condition overcomes Challenge 1 and Challenge 2. Specifically,
(1) computing the condition is computationally cheap, and
(2) the condition involves a monotonic function of the A-group size.

The major idea of the condition is a set of simple calculations based on the statistics of an A-group:
- the size of the A-group (N)
- the privacy requirement (r)
- the global probabilities of each tuple in the A-group being linked to a sensitive value

Condition Check
Input: N, r, global probabilities
Output: Satisfied / Not Satisfied

If the condition is satisfied, we deduce that the privacy requirement is satisfied (e.g., P(Alan is linked to Lung Cancer) ≤ 1/2).

4. Conclusion

Background Knowledge: QI-based probability distribution
Two Challenges
  Challenge 1: The formula for the probability is computationally expensive.
  Challenge 2: The formula is not monotonic.
Proposed Condition: overcomes Challenge 1 and Challenge 2

Bucketization

Bucketization is a way to prevent this linkage. There is another way to prevent this linkage called Generalization. The principle to be discussed can also be applied to Generalization.

Original data:

Patient    Gender  Age  Disease
Alan       Male    41   Lung Cancer
Betty      Female  42   Hypertension
Catherine  Female  63   Flu
Diana      Female  64   HIV

The tuples are partitioned into anonymized groups (A-groups). The first two tuples form an A-group (GID = L1) and the last two tuples form another A-group (GID = L2):

Gender  Age  Disease       GID
Male    41   Lung Cancer   L1
Female  42   Hypertension  L1
Female  63   Flu           L2
Female  64   HIV           L2

The data set is then released to the public as two tables.

QI Table:

Gender  Age  GID
Male    41   L1
Female  42   L1
Female  63   L2
Female  64   L2

Sensitive Table:

GID  Disease
L1   Lung Cancer
L1   Hypertension
L2   Flu
L2   HIV

Original data:

Patient    Gender  Age  Disease
Alan       Male    41   Lung Cancer
Betty      Female  42   Hypertension
Catherine  Female  63   Flu
Diana      Female  64   HIV

Suppose the data set is released to the public with only the patient names removed (Knowledge 1):

Gender  Age  Disease
Male    41   Lung Cancer
Female  42   Hypertension
Female  63   Flu
Female  64   HIV

Knowledge 2: I also know Alan with (Male, 41).

Combining Knowledge 1 and Knowledge 2, we can deduce that Alan is linked to Lung Cancer.

1. l-diversity: Monotonicity

Objective: to make sure that the probability is bounded by a threshold (e.g., 1/2).

Consider two A-groups in the released tables (GID = L1 and GID = L2). In one of them, P(an individual is linked to a sensitive value) = 0.5. In the other, P(an individual is linked to a sensitive value) = 0.4.

Merging: in the A-group "merged" from these two A-groups, P(an individual is linked to a sensitive value) ≤ 0.5.

The probability is monotonically decreasing when the size of the A-group increases.

1. l-diversity: Non-Monotonicity

Objective: to make sure that the probability is bounded by a threshold (e.g., 1/2).

Consider two A-groups in the released tables (GID = L1 and GID = L2). In one of them, P(an individual is linked to a sensitive value) = 0.5. In the other, P(an individual is linked to a sensitive value) = 0.4.

Merging: in the A-group "merged" from these two A-groups, it is possible that P(an individual is linked to a sensitive value) > 0.5.

The probability is not monotonically decreasing when the size of the A-group increases.
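The non-monotonic behaviour can be reproduced with a small possible-worlds enumeration. This is only a sketch under stated assumptions: tuples have independent priors taken from a QI-based distribution, and the adversary conditions on how many tuples in the A-group hold the sensitive value. The groups used here (two males, two females) are hypothetical and are not the groups from the slides. The function name `posterior` is also hypothetical.

```python
from itertools import combinations
from math import prod


def posterior(priors, count, target):
    """P(tuple `target` holds the sensitive value | exactly `count` tuples hold it).

    Assumption: each tuple i holds the value independently with prior priors[i].
    """
    def weight(holders):
        # Probability of one "world": tuples in `holders` hold the value, others do not.
        return prod(p if i in holders else 1 - p for i, p in enumerate(priors))

    worlds = [set(c) for c in combinations(range(len(priors)), count)]
    total = sum(weight(w) for w in worlds)
    return sum(weight(w) for w in worlds if target in w) / total


males = [0.1, 0.1]        # hypothetical A-group with one occurrence of the value
females = [0.003, 0.003]  # hypothetical A-group with one occurrence of the value

before = posterior(males, count=1, target=0)           # 0.5 inside the male group
after = posterior(males + females, count=2, target=0)  # roughly 0.95 after merging
print(before, after)
```

In this example each group on its own gives probability 0.5, but after merging the probability for a male tuple rises well above 0.5, so a larger A-group is not automatically safer once QI-based knowledge is taken into account.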

1. l-diversity

Objective: to make sure that P(Alan is linked to Lung Cancer) ≤ 1/2.

Knowledge 1: the released QI Table and Sensitive Table.

Gender  Age  GID
Male    41   L1
Female  42   L1
Female  63   L2
Female  64   L2

GID  Disease
L1   Lung Cancer
L1   Hypertension
L2   Flu
L2   HIV

Knowledge 2: I also know Alan with (Male, 41).

Knowledge 3: QI-based distribution. For the sake of illustration, we focus on attribute Gender only.

p()     Lung Cancer  Not Lung Cancer
Male    0.1          0.9
Female  0.003        0.997

Suppose we are interested in knowing whether P(Alan is linked to Lung Cancer) ≤ 1/2.

Condition Check
Input: N = 2, r = 2, global probabilities 0.1 and 0.003
Output: Satisfied / Not Satisfied

If the condition is satisfied, we deduce that the privacy requirement is satisfied (e.g., P(Alan is linked to Lung Cancer) ≤ 1/2).

there is an expression ceil in terms of N, r and
What is the condition check? In the condition check, there is an expression ceil in terms of N, r and global probabilities to compute. 2 Condition Check N r Global probabilities 2 Satisfied/ Not Satisfied 0.1 0.003

What is the condition check?
In the condition check, there is an expression ceil in terms of N, r and global probabilities to compute. Theorem 1: If the condition is satisfied, then the privacy requirement is satisfied.

Theorem 2: Computing ceil can be done in O(1) time.
This means that we overcome Challenge 1. Challenge 1: Calculating the probability is computationally expensive. Theorem 3: ceil is a monotonically increasing function on N where N is the A-group size. This means that we overcome Challenge 2. Challenge 2: The formula for the original probability is not monotonic with respect to the A-group size.

Example inputs: N = 2, r = 2, f1 = 0.1, f2 = 0.003.

The greatest global probability:
fmax = max{f1, f2} = max{0.1, 0.003} = 0.1

The difference between the greatest global probability and the "current" global probability:
Δ1 = fmax - f1 = 0.1 - 0.1 = 0
Δ2 = fmax - f2 = 0.1 - 0.003 = 0.097

The condition is whether this difference Δ1 (and Δ2) is at most an expression Δceil in terms of N, r and fmax:

\Delta_{ceil} = \frac{(N - r) \, f_{max}}{\frac{f_{max}(r - 1)}{1 - f_{max}} + (N - 1)}

Theorem 1: If Δi ≤ Δceil is satisfied for each tuple i in the A-group, then the privacy requirement is satisfied.
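The check can be sketched in a few lines. The sketch assumes the expression on the slide reads Δceil = (N - r) · fmax / (fmax (r - 1) / (1 - fmax) + (N - 1)), which is a reconstruction of the flattened formula, and it assumes r ≥ 2 and 0 < fmax < 1. The function names `delta_ceil` and `condition_satisfied` are hypothetical.

```python
def delta_ceil(n, r, f_max):
    """Ceiling on the allowed difference f_max - f_i (reconstructed expression).

    Assumes r >= 2 and 0 < f_max < 1 so that the denominator is positive.
    """
    return (n - r) * f_max / (f_max * (r - 1) / (1 - f_max) + (n - 1))


def condition_satisfied(global_probs, r):
    """True if every difference f_max - f_i is at most delta_ceil for this A-group."""
    f_max = max(global_probs)
    ceiling = delta_ceil(len(global_probs), r, f_max)
    return all(f_max - f <= ceiling for f in global_probs)


# A-group L1 from the slides: N = 2, r = 2, f1 = 0.1 (Male), f2 = 0.003 (Female).
print(delta_ceil(2, 2, 0.1))                   # 0.0
print(condition_satisfied([0.1, 0.003], r=2))  # False, since 0.097 > 0

# The ceiling grows with the A-group size N (cf. Theorem 3).
print([round(delta_ceil(n, 2, 0.1), 4) for n in (2, 3, 4)])
```

Each call is a constant number of arithmetic operations once N, r and fmax are known, which is in line with Theorem 2. For group L1 the check fails, so the sufficient condition does not certify that group.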

Anonymization

The condition check gives hints for anonymization:
1. Initially, each tuple forms an A-group.
2. Repeat the following until each A-group satisfies the condition: if there is an A-group violating the condition, merge this A-group with some other A-group such that the "merged" A-group satisfies the condition.
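The merging loop above can be sketched as follows. This is an illustrative greedy version, not the authors' algorithm: the choice of merge partner, the fallback when no partner makes the merged group satisfy the condition, and the input data are all assumptions. The expression used for Δceil is a reconstruction of the formula on the slides, and the names `delta_ceil`, `satisfies` and `anonymize` are hypothetical.

```python
def delta_ceil(n, r, f_max):
    # Reconstructed expression (an assumption); requires r >= 2 and 0 < f_max < 1.
    return (n - r) * f_max / (f_max * (r - 1) / (1 - f_max) + (n - 1))


def satisfies(group, r):
    """Condition check for one A-group given as a list of (name, global_prob) pairs."""
    probs = [f for _, f in group]
    f_max = max(probs)
    return f_max - min(probs) <= delta_ceil(len(probs), r, f_max)


def anonymize(tuples, r):
    """Greedy merging: start with singleton A-groups and merge violating ones."""
    groups = [[t] for t in tuples]
    while len(groups) > 1:
        bad = next((g for g in groups if not satisfies(g, r)), None)
        if bad is None:
            break  # every A-group satisfies the condition
        others = [g for g in groups if g is not bad]
        # Prefer a partner whose merged group satisfies the condition.
        partner = next((g for g in others if satisfies(bad + g, r)), None)
        if partner is None:
            # Fallback heuristic (an assumption): closest greatest global probability.
            bad_max = max(f for _, f in bad)
            partner = min(others, key=lambda g: abs(max(f for _, f in g) - bad_max))
        groups = [g for g in others if g is not partner] + [bad + partner]
    return groups


# Hypothetical tuples: (name, global probability of the sensitive value).
people = [("A", 0.1), ("B", 0.1), ("C", 0.003), ("D", 0.003)]
groups = anonymize(people, r=2)
print([[name for name, _ in g] for g in groups])  # [['A', 'B'], ['C', 'D']]
```

The loop stops when every A-group passes the check or when only one group is left. In the second case the remaining group may still violate the condition, which a real implementation would need to handle separately.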

B.1.2 K-Anonymity

Problem: to generate a data set such that each possible value appears at least TWO times.

Customer Gender District Birthday Cancer
Raymond Male Shatin 29 Jan None
Peter Fanling 16 July Yes
Kitty Female 21 Oct
Mary 8 Feb

Two kinds of generalisations:
1. Shatin → NT
2. 16 July → *

Release the data set to public:

Gender District Birthday Cancer
Male NT * None Yes
Female Shatin

This data set is 2-anonymous.

"Shatin → NT" causes LESS distortion than "16 July → *".

Question: how can we measure the distortion?

B.1.2 K-Anonymity

Generalisation hierarchies and distortion measurements:

Gender: Male, Female → *
Measurement = 1/1 = 1.0

District: Shatin, Fanling, Mongkok, Jordan → NT, KLN, HKG
Measurement = 1/2 = 0.5

Birthday: 29 Jan, 16 July, 21 Oct, 8 Feb → Jan, July, Oct, Feb → *
Measurement = 2/2 = 1.0

Conclusion: We propose a measurement of distortion of the modified/anonymized data.

Can we modify the measurement? E.g., different weightings for each level.
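The measurements on this slide are consistent with reading the distortion as the number of hierarchy levels a value is raised divided by the height of its hierarchy. The sketch below assumes that reading, and also assumes hierarchy heights of 1 for Gender and 2 for District and Birthday. The weighted variant is only one possible answer to the question about per-level weightings, with made-up weights. The names `distortion` and `weighted_distortion` are hypothetical.

```python
# Assumed hierarchy heights (number of generalisation steps from a leaf to the top).
HEIGHT = {"Gender": 1, "District": 2, "Birthday": 2}


def distortion(attribute, levels_up):
    """Levels generalised divided by the hierarchy height (assumed interpretation)."""
    return levels_up / HEIGHT[attribute]


def weighted_distortion(level_weights, levels_up):
    """Variant with a weight per level; level_weights[0] is the lowest step."""
    return sum(level_weights[:levels_up]) / sum(level_weights)


print(distortion("District", 1))       # Shatin -> NT: 0.5
print(distortion("Birthday", 2))       # 16 July -> *: 1.0
print(distortion("Gender", 1))         # Male -> *: 1.0
print(weighted_distortion([1, 3], 1))  # hypothetical weights: 0.25
```

With unequal weights, the first generalisation step of a hierarchy can be made cheaper or more expensive than the later ones.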
