Randomization in Privacy Preserving Data Mining Agrawal, R., and Srikant, R. Privacy-Preserving Data Mining, ACM SIGMOD’00 the following slides include.


1 Randomization in Privacy Preserving Data Mining Agrawal, R., and Srikant, R. Privacy-Preserving Data Mining, ACM SIGMOD’00. The following slides include materials from this paper.

2 Privacy-Preserving Data Mining Problem: How do we publish data without compromising individual privacy? Solution: randomization, anonymization

3 Randomization Adding random noise to the original dataset Challenge – Is the data still useful for further analysis?

4 Randomization Model: data is distorted by adding random noise
Original data $X = \{x_1, \dots, x_N\}$; for each record $x_i \in X$, a random value $y_i$ from $Y = \{y_1, \dots, y_N\}$ is added, so the new data is $Z = \{z_1, \dots, z_N\}$ with $z_i = x_i + y_i$.
$y_i$ is a random value
– Uniform on $[-\alpha, +\alpha]$
– Gaussian, $N(0, \sigma^2)$
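The additive model above can be sketched in a few lines of Python. This is a minimal illustration only: the function name `randomize`, its parameters, and the sample ages are assumptions made for the example and do not come from the slides.

```python
import random

def randomize(values, noise="uniform", alpha=1.0, sigma=1.0, seed=None):
    """Return z_i = x_i + y_i, with each y_i drawn independently.

    noise="uniform"  -> y_i ~ Uniform[-alpha, +alpha]
    noise="gaussian" -> y_i ~ N(0, sigma^2)
    """
    rng = random.Random(seed)
    if noise == "uniform":
        draw = lambda: rng.uniform(-alpha, alpha)
    elif noise == "gaussian":
        draw = lambda: rng.gauss(0.0, sigma)
    else:
        raise ValueError("noise must be 'uniform' or 'gaussian'")
    return [x + draw() for x in values]

# Hypothetical example data: five ages.
ages = [23, 35, 41, 58, 62]
z_uniform = randomize(ages, noise="uniform", alpha=10.0, seed=0)
z_gauss = randomize(ages, noise="gaussian", sigma=5.0, seed=0)
```

Only the perturbed values (`z_uniform` or `z_gauss`) would be published; the noise distribution and its parameter are public, the individual noise values are not.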

5 Reconstruction Perturbed data hides the data distribution, which needs to be reconstructed before data mining
Given
– $x_1 + y_1, x_2 + y_2, \dots, x_n + y_n$
– the probability distribution of Y
Estimate the probability distribution of X
Clifton AusDM ’11

6 Reconstruction algorithm: Bayes' rule is used to estimate the cumulative distribution function
1. $f_X^0$ = uniform distribution
2. Repeat the update until the stopping criterion is met
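The update formula itself is not preserved in this transcript. The sketch below shows one way the two steps could look, assuming a discretized Bayes update over equal-width intervals in the spirit of the Agrawal and Srikant paper. The function names, the bin count, the stopping rule (total change below `tol`), and the demo data are all assumptions made for illustration.

```python
import math
import random

def gaussian_pdf(y, sigma):
    """Density of N(0, sigma^2) at y."""
    return math.exp(-0.5 * (y / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def reconstruct(z, noise_pdf, lo, hi, bins=10, max_iter=100, tol=1e-6):
    """Estimate the probability mass of X on `bins` equal-width intervals of [lo, hi]."""
    width = (hi - lo) / bins
    mids = [lo + (p + 0.5) * width for p in range(bins)]
    # f_Y(z_i - m_p) does not change between iterations, so compute it once.
    like = [[noise_pdf(zi - m) for m in mids] for zi in z]
    f = [1.0 / bins] * bins                  # step 1: start from the uniform distribution
    for _ in range(max_iter):                # step 2: repeat the Bayes update
        new_f = [0.0] * bins
        used = 0
        for row in like:
            denom = sum(l * fp for l, fp in zip(row, f))
            if denom == 0.0:                 # observation incompatible with every interval
                continue
            used += 1
            for p in range(bins):
                new_f[p] += row[p] * f[p] / denom   # posterior P(interval p | z_i)
        if used == 0:
            break
        new_f = [v / used for v in new_f]    # average the posteriors over observations
        change = sum(abs(a - b) for a, b in zip(new_f, f))
        f = new_f
        if change < tol:                     # assumed stopping criterion
            break
    return mids, f

# Demo with made-up data: X uniform on [0, 0.5), Gaussian noise with sigma = 0.1.
rng = random.Random(1)
x = [rng.uniform(0.0, 0.5) for _ in range(1000)]
z = [xi + rng.gauss(0.0, 0.1) for xi in x]
mids, f = reconstruct(z, lambda y: gaussian_pdf(y, 0.1), 0.0, 1.0)
```

In this demo most of the estimated mass should fall in the lower half of the domain, where the original values lie, even though only the perturbed values `z` and the noise distribution were used.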

7 [Figure: original, randomized, and reconstructed distributions; panel labels N(0, 0.25) and (-0.5, 0.5)]

8 Privacy Metric If a value x can be estimated to lie in the interval [α, β] with c% confidence, then the interval width (β − α) defines the amount of privacy at c% confidence.
Example: Age 20-40, 95% confidence, 50% privacy with uniform noise: 2α = 20 × 0.5 / 0.95 = 10.5
Confidence   50%         95%          99.9%
Uniform      0.5 × 2α    0.95 × 2α    0.999 × 2α
Gaussian     1.34 × σ    3.92 × σ     6.8 × σ
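The table entries and the worked example can be reproduced with a short calculation. The sketch below assumes the privacy interval is the central interval of the noise distribution that holds the stated confidence mass; the function names are hypothetical.

```python
from statistics import NormalDist

def uniform_privacy(alpha, confidence):
    """Interval width for uniform noise on [-alpha, +alpha]."""
    return confidence * 2 * alpha

def gaussian_privacy(sigma, confidence):
    """Width of the central interval of N(0, sigma^2) holding `confidence` of the mass."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return 2 * z * sigma

def uniform_range_for(privacy_width, confidence):
    """Noise range 2*alpha needed to reach the requested interval width."""
    return privacy_width / confidence

# Worked example from the slide: age range 20-40 (width 20),
# 50% privacy at 95% confidence with uniform noise.
two_alpha = uniform_range_for(0.5 * 20, 0.95)   # about 10.5
g50 = gaussian_privacy(1.0, 0.50)               # about 1.35 (table lists 1.34)
g95 = gaussian_privacy(1.0, 0.95)               # about 3.92
```

Under the same assumption, the 99.9% Gaussian entry computes to roughly 6.58 × σ, slightly below the 6.8 × σ listed in the table.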

9 Decision Tree

10 Training Decision Tree
Split point – interval boundaries
Reconstruction algorithm – Global – ByClass – Local
Dataset – synthetic; training set of 100,000 records and test set of 5,000 records, equally split into two classes
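Restricting candidate split points to interval boundaries means the tree builder only needs per-interval class counts, which is what a reconstructed distribution provides. The sketch below assumes Gini impurity as the split criterion (the slide does not name one) and uses made-up counts; `gini` and `best_split` are hypothetical helper names.

```python
def gini(counts):
    """Gini impurity of a node with the given per-class counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return 1.0 - sum((c / total) ** 2 for c in counts)

def best_split(boundaries, class_counts):
    """Pick the interior interval boundary with the lowest weighted Gini impurity.

    boundaries:   k+1 edges of k intervals
    class_counts: class_counts[c][p] = (estimated) number of class-c records in interval p
    Returns (split_value, weighted_gini).
    """
    k = len(boundaries) - 1
    n = sum(sum(row) for row in class_counts)
    best = None
    for cut in range(1, k):                      # interior boundaries only
        left = [sum(row[:cut]) for row in class_counts]
        right = [sum(row[cut:]) for row in class_counts]
        score = (sum(left) * gini(left) + sum(right) * gini(right)) / n
        if best is None or score < best[1]:
            best = (boundaries[cut], score)
    return best

# Made-up per-interval counts for an "age" attribute and two classes.
edges = [20, 30, 40, 50, 60]
counts = [
    [40, 35, 5, 0],    # class A
    [2, 6, 30, 42],    # class B
]
split_value, score = best_split(edges, counts)
```

With these counts the boundary at 40 separates the classes best. The three reconstruction variants on the slide differ in where the counts come from, not in how a boundary is scored.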

11

12 [Figure: curves labeled original, randomized, global, ByClass, and local; panel notes "original", "global and randomized", "ByClass and local"]

13

14 Extended Work
’02 proposed a method to quantify information loss
– Mutual information
’07 evaluated randomization combined with public information
– Gaussian noise is better than uniform
– Datasets with an inherent cluster pattern improve randomization performance
– Varying density and outliers decrease performance

15 Multiplicative Randomization
Rotation randomization – data is distorted by an orthogonal matrix
Projection randomization – projects a high-dimensional dataset into a low-dimensional space
Preserves Euclidean distance, so it can be applied with distance-based classification (KNN, SVM) and clustering (K-means)
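The distance-preservation property of rotation randomization can be checked directly. The sketch below builds a random orthogonal matrix with Gram-Schmidt orthogonalization, which is one possible construction chosen for illustration and not necessarily the one used in the cited work; all names and the sample records are hypothetical.

```python
import math
import random

def random_orthogonal(d, rng):
    """Gram-Schmidt on random Gaussian vectors; rows form a d x d orthogonal matrix."""
    basis = []
    while len(basis) < d:
        v = [rng.gauss(0.0, 1.0) for _ in range(d)]
        for b in basis:
            dot = sum(vi * bi for vi, bi in zip(v, b))
            v = [vi - dot * bi for vi, bi in zip(v, b)]
        norm = math.sqrt(sum(vi * vi for vi in v))
        if norm > 1e-9:                      # skip (very unlikely) degenerate draws
            basis.append([vi / norm for vi in v])
    return basis

def rotate(record, q):
    """Multiply a record by the orthogonal matrix q."""
    return [sum(qij * xj for qij, xj in zip(row, record)) for row in q]

def distance(a, b):
    """Euclidean distance between two records."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

rng = random.Random(0)
q = random_orthogonal(3, rng)
a, b = [1.0, 2.0, 3.0], [4.0, 0.0, -1.0]
ra, rb = rotate(a, q), rotate(b, q)
# distance(a, b) and distance(ra, rb) agree up to floating-point error.
```

Because pairwise distances are unchanged, a distance-based learner such as KNN or K-means sees the same neighborhood structure on the rotated records as on the originals.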

16 Summary
Pros: data and noise are independent; can be applied at data collection time; useful for stream data
Cons: information loss, curse of dimensionality

17 Questions?

