
1 Privacy preserving data mining. Li Xiong, CS573 Data Privacy and Anonymity.

2 What Is Data Mining? Data mining (knowledge discovery from data) is the extraction of interesting (non-trivial, implicit, previously unknown, and potentially useful) patterns or knowledge from huge amounts of data. Also known as knowledge discovery in databases (KDD), knowledge extraction, data/pattern analysis, information harvesting, or business intelligence.

3 Privacy preserving data mining: support data mining while preserving privacy, both of the sensitive raw data and of sensitive mining results.

4 Seminal work
"Privacy preserving data mining", Agrawal and Srikant, 2000: centralized data, data randomization (additive noise), decision tree classifier.
"Privacy preserving data mining", Lindell and Pinkas, 2000: distributed data mining, secure multi-party computation, decision tree classifier.

5 Input Perturbation
Reveal the entire database, but randomize the entries: the database holds x_1, ..., x_n, and the user sees x_1 + δ_1, ..., x_n + δ_n. Add random noise δ_i to each database entry x_i. For example, if the distribution of the noise has mean 0, the user can still compute the average of the x_i.
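
A minimal Python sketch of this idea (the values and noise range are illustrative, not from the slides): each entry is released with independent zero-mean noise added, yet the average of the released column still estimates the average of the true column.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sensitive values x_1 ... x_n (e.g., ages).
x = rng.integers(20, 80, size=10_000).astype(float)

# Independent zero-mean noise delta_i, one per database entry.
delta = rng.uniform(-20, 20, size=x.shape)

# Only the perturbed entries x_i + delta_i are revealed to the user.
w = x + delta

# Because E[delta] = 0, the mean of the revealed column still estimates
# the mean of the hidden column, even though no single x_i is exposed.
print(f"true mean      = {x.mean():.2f}")
print(f"estimated mean = {w.mean():.2f}")
```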

6 Taxonomy of PPDM algorithms
Data distribution: centralized; distributed (privacy preserving distributed data mining).
Approaches: input perturbation (additive noise/randomization, multiplicative noise, generalization, swapping, sampling); output perturbation (rule hiding); crypto techniques (secure multiparty computation).
Data mining algorithms: classification, association rule mining, clustering.

7 Randomization techniques
"Privacy preserving data mining", Agrawal and Srikant, 2000: seminal work on the decision tree classifier.
"Limiting Privacy Breaches in Privacy-Preserving Data Mining", Evfimievski and Gehrke, 2003: refined privacy definition; association rule mining.

8 Randomization Based Decision Tree Learning (Agrawal and Srikant '00)
Basic idea: perturb data with value distortion. The user provides x_i + r instead of x_i, where r is a random value drawn either from a uniform distribution over [-α, +α] or from a Gaussian (normal) distribution with mean μ = 0 and standard deviation σ.
Hypothesis: the miner doesn't see the real data and cannot reconstruct the real values, yet can reconstruct enough information to build a decision tree for classification.

9 Randomization Approach
[Diagram] Original records such as 30 | 70K | ... and 50 | 40K | ... pass through a Randomizer: a random number is added to Age, so Alice's age 30 becomes 65 (30 + 35). The classification algorithm sees only the randomized records (65 | 20K | ..., 25 | 60K | ...) and must still build a model from them. How?

10 Classification
Classification predicts categorical class labels (discrete or nominal); prediction (regression) models continuous-valued functions, i.e., predicts unknown or missing values. Typical applications: credit approval, target marketing, medical diagnosis, fraud detection.

11 Motivating Example for Classification – Fruit Identification

Skin   | Color | Size  | Flesh | Conclusion
Smooth |       | Small | Hard  | Dangerous
Hairy  | Green | Large | Soft  | Safe
Smooth | Red   |       | Soft  | Dangerous
Hairy  | Green | Large | Hard  | Safe
Hairy  | Brown | Large | Hard  | Safe

New fruit: Color = Red, Size = Large → Conclusion = ?

12 Another Example – Credit Approval
Classification rule: If age = "31...40" and income = high then credit_rating = excellent.

Name   | Age | Income | ... | Credit
Clark  | 35  | High   | ... | Excellent
Milton | 38  | High   | ... | Excellent
Neo    | 25  | Medium | ... | Fair
...    | ... | ...    | ... | ...

Future customers:
Paul: age = 35, income = high → excellent credit rating
John: age = 20, income = medium → fair credit rating

13 Classification—A Two-Step Process
Model construction: describing a set of predetermined classes. Each tuple/sample is assumed to belong to a predefined class, as determined by the class label attribute; the set of tuples used for model construction is the training set; the model is represented as classification rules, decision trees, or mathematical formulae.
Model usage: classifying future or unknown objects.

14 Training Dataset

15 Output: A Decision Tree for "buys_computer"
Root node: age?
age <= 30 → student? (no → no, yes → yes)
age 31..40 → yes
age > 40 → credit rating? (excellent → no, fair → yes)

16 Algorithm for Decision Tree Induction
Examples: ID3 (Iterative Dichotomiser), C4.5, CART (Classification and Regression Trees).
Basic algorithm (a greedy algorithm): the tree is constructed in a top-down, recursive, divide-and-conquer manner. At the start, all training examples are at the root. A test attribute is selected that "best" separates the data into partitions, using a heuristic or statistical measure. Samples are then partitioned recursively based on the selected attributes.
Conditions for stopping the partitioning: all samples for a given node belong to the same class; there are no remaining attributes for further partitioning (majority voting is used to classify the leaf); or there are no samples left.
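
A condensed Python sketch of this greedy procedure (the function names and row format are illustrative assumptions; the attribute-selection measure is passed in as a function so it can be information gain, gain ratio, or the Gini index from the next slides):

```python
from collections import Counter

def majority_class(rows):
    """Most common class label among the rows (used for leaf voting)."""
    return Counter(row[-1] for row in rows).most_common(1)[0][0]

def build_tree(rows, attributes, select_attribute):
    """Top-down recursive divide-and-conquer induction.
    Each row is a tuple of attribute values with the class label last."""
    labels = {row[-1] for row in rows}
    if len(labels) == 1:                # all samples belong to one class
        return labels.pop()
    if not attributes:                  # no attributes left -> majority vote
        return majority_class(rows)
    attr = select_attribute(rows, attributes)     # "best" separating attribute
    remaining = [a for a in attributes if a != attr]
    tree = {attr: {}}
    for value in {row[attr] for row in rows}:     # partition on chosen attribute
        subset = [row for row in rows if row[attr] == value]
        tree[attr][value] = build_tree(subset, remaining, select_attribute)
    return tree
```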

17 Attribute Selection Measures
Idea: select the attribute that partitions the samples into the most homogeneous groups.
Measures: information gain (ID3), gain ratio (C4.5), Gini index (CART).

18 Attribute Selection Measure: Information Gain (ID3)
Select the attribute with the highest information gain. Let p_i be the probability that an arbitrary tuple in D belongs to class C_i, estimated by |C_{i,D}|/|D|.
Expected information (entropy) needed to classify a tuple in D:
  Info(D) = -\sum_{i=1}^{m} p_i \log_2(p_i)
Information needed (after using A to split D into v partitions) to classify D:
  Info_A(D) = \sum_{j=1}^{v} \frac{|D_j|}{|D|} Info(D_j)
Information gain is the difference between the original information requirement and the new requirement obtained by branching on attribute A:
  Gain(A) = Info(D) - Info_A(D)
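
A hedged sketch of these two formulas in Python (rows are assumed to be tuples with the class label last and `attr` a column index; these conventions are illustrative, not from the slides):

```python
import math
from collections import Counter

def entropy(labels):
    """Info(D) = -sum_i p_i * log2(p_i) over the class labels in D."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(rows, attr):
    """Gain(A) = Info(D) - Info_A(D) for a categorical attribute index."""
    labels = [row[-1] for row in rows]
    n = len(rows)
    info_a = 0.0
    for value in {row[attr] for row in rows}:
        subset = [row[-1] for row in rows if row[attr] == value]
        info_a += len(subset) / n * entropy(subset)
    return entropy(labels) - info_a
```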

19 Attribute Selection Measure: Gini Index (CART)
If a data set D contains examples from n classes, the Gini index gini(D) is defined as
  gini(D) = 1 - \sum_{j=1}^{n} p_j^2
where p_j is the relative frequency of class j in D.
If D is split on A into two subsets D_1 and D_2, the Gini index of the split is
  gini_A(D) = \frac{|D_1|}{|D|} gini(D_1) + \frac{|D_2|}{|D|} gini(D_2)
Reduction in impurity:
  \Delta gini(A) = gini(D) - gini_A(D)
The attribute that provides the smallest gini_A(D) (i.e., the largest reduction in impurity) is chosen to split the node.
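
And the corresponding sketch for the Gini measures (same assumed row format; the split shown is the binary "equals value vs. not" form, one of several possible binary splits):

```python
from collections import Counter

def gini(labels):
    """gini(D) = 1 - sum_j p_j^2."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def gini_split(rows, attr, value):
    """gini_A(D) for the binary split D1 = {A == value}, D2 = {A != value}."""
    d1 = [row[-1] for row in rows if row[attr] == value]
    d2 = [row[-1] for row in rows if row[attr] != value]
    n = len(rows)
    return len(d1) / n * gini(d1) + len(d2) / n * gini(d2)

def gini_reduction(rows, attr, value):
    """Delta gini(A) = gini(D) - gini_A(D); larger is better."""
    return gini([row[-1] for row in rows]) - gini_split(rows, attr, value)
```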

20 Information Gain for Continuous-Valued Attributes
Let A be a continuous-valued attribute. We must determine the best split point for A: sort the values of A in increasing order; typically, the midpoint between each pair of adjacent values is considered as a possible split point, i.e., (a_i + a_{i+1})/2 is the midpoint between the values a_i and a_{i+1}. The point with the minimum expected information requirement for A is selected as the split point.
Split: D_1 is the set of tuples in D satisfying A <= split-point, and D_2 is the set of tuples in D satisfying A > split-point.
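
A short sketch of that split-point search, reusing the `entropy` helper from the information-gain sketch above:

```python
def best_numeric_split(rows, attr):
    """Evaluate the midpoint between each pair of adjacent sorted values of A
    and return the split point with the minimum expected information."""
    values = sorted({row[attr] for row in rows})
    n = len(rows)
    best_point, best_info = None, float("inf")
    for a_i, a_next in zip(values, values[1:]):
        point = (a_i + a_next) / 2
        d1 = [row[-1] for row in rows if row[attr] <= point]   # A <= split-point
        d2 = [row[-1] for row in rows if row[attr] > point]    # A >  split-point
        info_a = len(d1) / n * entropy(d1) + len(d2) / n * entropy(d2)
        if info_a < best_info:
            best_point, best_info = point, info_a
    return best_point, best_info
```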

21 Randomization Approach
[Diagram, as on slide 9] Original records (30 | 70K | ..., 50 | 40K | ...) are randomized (a random number is added to Age, so Alice's age 30 becomes 65 = 30 + 35) before reaching the classification algorithm, which must build its model from the randomized records (65 | 20K | ..., 25 | 60K | ...). How?

22 Attribute Selection Measure: Gini Index (CART)
(As defined on slide 19.) gini(D) = 1 - \sum_{j=1}^{n} p_j^2, where p_j is the relative frequency of class j in D. For a split of D on A into two subsets D_1 and D_2, gini_A(D) = \frac{|D_1|}{|D|} gini(D_1) + \frac{|D_2|}{|D|} gini(D_2); the attribute giving the smallest gini_A(D) (the largest reduction in impurity) is chosen to split the node.

23 Randomization Approach Overview
[Diagram] Original records (30 | 70K | ..., 50 | 40K | ...) are randomized (a random number is added to Age, so Alice's age 30 becomes 65 = 30 + 35). From the randomized records (65 | 20K | ..., 25 | 60K | ...), the distributions of Age and Salary are reconstructed, and the classification algorithm builds its model from the reconstructed distributions.

24 Original Distribution Reconstruction
x_1, x_2, ..., x_n are the n original data values, drawn from n i.i.d. random variables X_1, X_2, ..., X_n, each with the same distribution as a random variable X.
Using value distortion, the given values are w_1 = x_1 + y_1, w_2 = x_2 + y_2, ..., w_n = x_n + y_n, where the y_i are drawn from n i.i.d. random variables Y_1, Y_2, ..., Y_n, each with the same distribution as a random variable Y.
Reconstruction problem: given F_Y and the w_i, estimate F_X.

25 Original Distribution Reconstruction: Method
Apply Bayes' theorem for continuous distributions. The estimated density function is refined iteratively, starting from a uniform distribution as the initial estimate of f_X at j = 0:
  f_X^{j+1}(a) = \frac{1}{n} \sum_{i=1}^{n} \frac{f_Y(w_i - a)\, f_X^{j}(a)}{\int f_Y(w_i - z)\, f_X^{j}(z)\, dz}
Stopping criterion: a χ² test between successive iterations shows no significant change.
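
A discretized Python sketch of this iteration (the continuous integral is approximated on a grid of bins, and a simple tolerance-based stopping rule stands in for the χ² test described in the paper):

```python
import numpy as np

def reconstruct_distribution(w, noise_pdf, bins, max_iters=200, tol=1e-4):
    """Estimate f_X on a grid, given perturbed values w_i = x_i + y_i and
    the known noise density f_Y (iterative Bayesian reconstruction)."""
    centers = (bins[:-1] + bins[1:]) / 2.0
    width = bins[1] - bins[0]
    fx = np.full(len(centers), 1.0 / (len(centers) * width))  # uniform start
    for _ in range(max_iters):
        new_fx = np.zeros_like(fx)
        for w_i in w:
            # Posterior of x given w_i: proportional to f_Y(w_i - a) * f_X(a).
            post = noise_pdf(w_i - centers) * fx
            norm = post.sum() * width            # approximates the integral
            if norm > 0:
                new_fx += post / norm
        new_fx /= len(w)
        if np.abs(new_fx - fx).sum() * width < tol:
            return centers, new_fx
        fx = new_fx
    return centers, fx

# Example use with uniform noise on [-20, 20]:
# noise_pdf = lambda d: (np.abs(d) <= 20) / 40.0
# centers, fx_hat = reconstruct_distribution(w, noise_pdf, np.linspace(0, 100, 101))
```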

26 Reconstruction of Distribution

27 Original Distribution Reconstruction

28 Original Distribution Reconstruction for Decision Tree
When to reconstruct the distributions?
Global: reconstruct for each attribute once at the beginning, then build the decision tree using the reconstructed data.
ByClass: first split the training data by class, reconstruct for each class separately, then build the decision tree using the reconstructed data.
Local: first split the training data by class, reconstruct for each class separately, and reconstruct again at each node while building the tree.

29 Accuracy vs. Randomization Level

30 More Results
Global performs worse than ByClass and Local. ByClass and Local achieve accuracy within 5% to 15% (absolute error) of the accuracy on the original data. Overall, all three are much better than the Randomized baseline (building the tree directly on the randomized data without reconstruction).

31 Privacy level Is the privacy level sufficiently measured?

32 How to Measure a Privacy Breach
Weak: no single database entry has been revealed.
Stronger: no single piece of information is revealed (what's the difference from the "weak" version?).
Strongest: the adversary's beliefs about the data have not changed.

33 Kullback-Leibler Distance
Measures the "difference" between two probability distributions:
  KL(P \| Q) = \sum_x P(x) \log \frac{P(x)}{Q(x)}
It is zero exactly when the two distributions are identical.
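
A small sketch of the discrete version of this distance (assuming P and Q are given as aligned probability vectors):

```python
import math

def kl_divergence(p, q):
    """KL(P || Q) = sum_x P(x) * log(P(x) / Q(x)).
    Terms with P(x) = 0 contribute 0; Q must be nonzero wherever P is."""
    return sum(p_x * math.log(p_x / q_x) for p_x, q_x in zip(p, q) if p_x > 0)

print(kl_divergence([0.5, 0.5], [0.5, 0.5]))   # 0.0: identical distributions
print(kl_divergence([0.9, 0.1], [0.5, 0.5]))   # > 0: grows as they differ more
```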

34 Privacy of Input Perturbation
X is a random variable, R is the randomization operator, and Y = R(X) is the perturbed database. Measure the mutual information between the original and randomized databases: the average KL distance between (1) the distribution of X and (2) the distribution of X conditioned on Y = y, i.e., E_y[ KL( P_{X|Y=y} \| P_X ) ].
Intuition: if this distance is small, then Y leaks little information about the actual values of X. Why is this definition problematic?

35 Is the randomization sufficient?
Name: Age database (original): Gladys: 85, Doris: 90, Beryl: 82.
Randomized entries (random integers between -20 and 20 added): Gladys: 72, Doris: 110, Beryl: 85.
Age is an integer between 0 and 90, and the randomization operator has to be public (why?). So from the randomized value 110, anyone can conclude that Doris's age is 90!

36 Privacy Definitions
Mutual information can be small on average, but an individual randomized value can still leak a lot of information about the original value.
Better: consider some property Q(x). The adversary has an a priori probability P_i that Q(x_i) is true. A privacy breach occurs if revealing y_i = R(x_i) significantly changes the adversary's probability that Q(x_i) is true.
Intuition: the adversary learned something about entry x_i (namely, the likelihood of property Q holding for this entry).

37 Example
Data: 0 ≤ x ≤ 1000, with p(x = 0) = 0.01 and p(x = k) = 0.00099 for each k ≠ 0. We reveal y = R(x).
Three possible randomization operators R:
R1(x) = x with prob. 20%; a uniformly random number with prob. 80%.
R2(x) = x + δ mod 1001, δ uniform in [-100, 100].
R3(x) = R2(x) with prob. 50%; a uniformly random number with prob. 50%.
Which randomization operator is better?

38 Some Properties
Q1(x): x = 0; Q2(x): x ∉ {200, ..., 800}.
What are the a priori probabilities, for a given x, that these properties hold? Q1(x): 1%, Q2(x): 40.5%.
Now suppose the adversary learns that y = R(x) = 0. What are the probabilities of Q1(x) and Q2(x)?
If R = R1 then Q1(x): 71.6%, Q2(x): 83%.
If R = R2 then Q1(x): 4.8%, Q2(x): 100%.
If R = R3 then Q1(x): 2.9%, Q2(x): 70.8%.

39 Privacy Breaches
R1(x) leaks information about property Q1(x): before seeing R1(x), the adversary thinks that the probability of x = 0 is only 1%, but after noticing that R1(x) = 0, the probability that x = 0 is about 72%.
R2(x) leaks information about property Q2(x): before seeing R2(x), the adversary thinks that the probability of x ∉ {200, ..., 800} is 41%, but after noticing that R2(x) = 0, the probability that x ∉ {200, ..., 800} is 100%.
The randomization operator should be such that the posterior distribution is close to the prior distribution for any property.
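
These posterior probabilities follow from Bayes' rule; a quick Python sketch checking the Q1-under-R1 number from the previous slide (the prior and operator are exactly those of the example on slide 37):

```python
# Prior from slide 37: p(x=0) = 0.01, p(x=k) = 0.00099 for k = 1..1000.
prior = {x: (0.01 if x == 0 else 0.00099) for x in range(1001)}

def r1_prob(x, y):
    """P[R1(x) = y]: keep x with prob. 0.2, else a uniform value in 0..1000."""
    return (0.2 if x == y else 0.0) + 0.8 / 1001

y = 0
evidence = sum(prior[x] * r1_prob(x, y) for x in range(1001))
posterior = prior[0] * r1_prob(0, y) / evidence   # P(x = 0 | R1(x) = 0)
print(f"prior     P(x=0)            = {prior[0]:.3f}")   # 0.010
print(f"posterior P(x=0 | R1(x)=0)  = {posterior:.3f}")  # about 0.72
```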

40 Privacy Breach: Definitions [Evfimievski et al.]
Q(x) is some property; ρ1 and ρ2 are probabilities, with ρ1 meaning "very unlikely" and ρ2 meaning "very likely".
Straight privacy breach: P(Q(x)) ≤ ρ1, but P(Q(x) | R(x) = y) ≥ ρ2. That is, Q(x) is unlikely a priori, but likely after seeing the randomized value of x.
Inverse privacy breach: P(Q(x)) ≥ ρ2, but P(Q(x) | R(x) = y) ≤ ρ1. That is, Q(x) is likely a priori, but unlikely after seeing the randomized value of x.

41 How to check for a privacy breach
How do we ensure that the randomization operator hides every property? There are 2^|X| properties, and often the randomization operator has to be selected even before the distribution P_X is known (why?).
Idea: look at the operator's transition probabilities. How likely is x_i to be mapped to a given y? Intuition: if all possible values of x_i are equally likely to be randomized to a given y, then revealing y = R(x_i) will not reveal much about the actual value of x_i.

42 Amplification [Evfimievski et al.]
A randomization operator is γ-amplifying for y if, for any two inputs x_1 and x_2,
  \frac{p[x_1 \to y]}{p[x_2 \to y]} \le \gamma
For given ρ1, ρ2, no straight or inverse privacy breaches occur if
  \frac{\rho_2 (1 - \rho_1)}{\rho_1 (1 - \rho_2)} > \gamma

43 Amplification: Example
R1(x) = x with prob. 20%; a uniformly random number with prob. 80%.
R2(x) = x + δ mod 1001, δ uniform in [-100, 100].
R3(x) = R2(x) with prob. 50%; a uniformly random number with prob. 50%.
For R3: p(x → y) = ½ (1/201 + 1/1001) if y ∈ [x - 100, x + 100] (mod 1001), and ½ (1/1001) otherwise. The ratio between the two cases is 1 + 1001/201 < 6 (= γ).
Therefore, no straight or inverse privacy breaches will occur with ρ1 = 14%, ρ2 = 50%.
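
A quick numeric check of this example (the breach-free bound is the one reconstructed on the previous slide, so read this as a sketch of that condition rather than a verbatim quote of the paper):

```python
# Transition probabilities of R3 for a fixed x.
p_near = 0.5 * (1 / 201 + 1 / 1001)   # y within 100 of x (mod 1001)
p_far  = 0.5 * (1 / 1001)             # any other y
gamma = p_near / p_far                # = 1 + 1001/201, just under 6
print(f"gamma = {gamma:.3f}")

rho1, rho2 = 0.14, 0.50
bound = rho2 * (1 - rho1) / (rho1 * (1 - rho2))
print(f"rho2(1-rho1)/(rho1(1-rho2)) = {bound:.3f}, exceeds gamma: {bound > gamma}")
```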

44 Coming up: multiplicative noise; output perturbation.

45 Example: Information Gain
Class P: buys_computer = "yes"; Class N: buys_computer = "no".

