Privacy Preserving Market Basket Data Analysis Ling Guo, Songtao Guo, Xintao Wu University of North Carolina at Charlotte.


2 Privacy Preserving Market Basket Data Analysis Ling Guo, Songtao Guo, Xintao Wu University of North Carolina at Charlotte

3 Market Basket Data
TID   milk   sugar   bread   …   cereals
1     1      0       1       …   1
2     0      1       1       …   1
3     1      0       0       …   1
4     1      1       1       …   0
…     …      …       …       …   …
N     0      1       1       …   0
1: presence, 0: absence
- Association rule $X \Rightarrow Y$ (R. Agrawal, SIGMOD 1993) with support $s = P(XY)$ and confidence $c = P(XY)/P(X)$
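As a minimal sketch of the two rule measures named on this slide, the snippet below computes support and confidence on the four visible rows of the table only (the elided rows are not invented). The function names `support` and `confidence` are illustrative and do not come from the presentation.

```python
# Toy data: the four visible rows of the slide's table
# (columns: milk, sugar, bread, cereals). Elided rows are left out.
transactions = [
    {"milk": 1, "sugar": 0, "bread": 1, "cereals": 1},
    {"milk": 0, "sugar": 1, "bread": 1, "cereals": 1},
    {"milk": 1, "sugar": 0, "bread": 0, "cereals": 1},
    {"milk": 1, "sugar": 1, "bread": 1, "cereals": 0},
]


def support(rows, items):
    """Fraction of transactions that contain every item in `items`."""
    hits = sum(1 for r in rows if all(r[i] == 1 for i in items))
    return hits / len(rows)


def confidence(rows, antecedent, consequent):
    """support(antecedent and consequent) / support(antecedent)."""
    return support(rows, antecedent + consequent) / support(rows, antecedent)


# Rule milk => cereals on the toy rows.
s = support(transactions, ["milk", "cereals"])
c = confidence(transactions, ["milk"], ["cereals"])
```

On these four rows, milk and cereals occur together in two transactions and milk occurs in three, so the rule milk => cereals has support 2/4 and confidence 2/3.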

4 Other measures
- 2 x 2 contingency table
- Objective measures for A => B

5 Related Work
Privacy preserving association rule mining:
- Data swapping
- Frequent itemset or rule hiding
- Inverse frequent itemset mining
- Item randomization

6 Item Randomization
Original Data
TID   milk   sugar   bread   …   cereals
1     1      0       1       …   1
2     0      1       1       …   1
3     1      0       0       …   1
4     1      1       1       …   0
…     …      …       …       …   …
N     0      1       1       …   0
Randomized Data
TID   milk   sugar   bread   …   cereals
1     0      1       1       …   1
2     1      1       1       …   0
3     1      1       1       …   1
4     0      0       1       …   1
…     …      …       …       …   …
N     1      1       0       …   1
- To what extent does randomization affect mining results? (Focus)
- To what extent does it protect privacy?

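7 Randomized Response (Stanley Warner, JASA 1965)
Purpose: get the proportion $\pi_A$ of population members that cheated in the exam.
$A$: cheated in the exam; $\bar{A}$: didn't cheat in the exam.
Procedure: a randomization device asks each respondent "Do you belong to $A$?" with probability $p$, or "Do you belong to $\bar{A}$?" with probability $1-p$. The respondent gives a "Yes" or "No" answer.
As $P(\text{yes}) = \lambda = p\,\pi_A + (1-p)(1-\pi_A)$, the unbiased estimate of $\pi_A$ is $\hat{\pi}_A = \dfrac{\hat{\lambda} - (1-p)}{2p-1}$, where $\hat{\lambda}$ is the observed proportion of "Yes" answers.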
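The randomized response procedure can be sketched as a small simulation. This is an illustration under assumed values (true proportion 0.3, device probability p = 0.8, 100,000 respondents), none of which come from the slides; `warner_estimate` and `simulate_survey` are hypothetical helper names.

```python
import random


def warner_estimate(yes_fraction, p):
    """Estimate pi_A from the observed fraction of "Yes" answers.

    Inverts lambda = p*pi_A + (1 - p)*(1 - pi_A); undefined at p = 0.5.
    """
    if abs(2 * p - 1) < 1e-12:
        raise ValueError("p = 0.5 makes the answers uninformative")
    return (yes_fraction - (1 - p)) / (2 * p - 1)


def simulate_survey(true_pi, p, n, seed=0):
    """Simulate n respondents; return the fraction answering "Yes"."""
    rng = random.Random(seed)
    yes = 0
    for _ in range(n):
        in_a = rng.random() < true_pi   # respondent's true status
        asked_a = rng.random() < p      # which question the device picked
        # "Yes" when asked about A and in A, or asked about not-A and not in A.
        yes += (in_a == asked_a)
    return yes / n


# Assumed illustration values, not taken from the presentation.
lam_hat = simulate_survey(true_pi=0.3, p=0.8, n=100_000)
pi_hat = warner_estimate(lam_hat, p=0.8)
```

The interviewer only ever sees the "Yes"/"No" answers, yet the estimate recovers the population proportion; the closer p is to 0.5, the stronger the privacy and the noisier the estimate.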

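8 Application of RR in MBD
- RR can be expressed in matrix form as $\lambda = P\pi$ (0: No, 1: Yes):
  $\begin{pmatrix}\lambda_0\\ \lambda_1\end{pmatrix} = \begin{pmatrix} p & 1-p \\ 1-p & p\end{pmatrix}\begin{pmatrix}\pi_0\\ \pi_1\end{pmatrix}$
- Extension to multiple variables, e.g., for 2 variables: $\lambda = (P_1 \otimes P_2)\,\pi$
- Unbiased estimate of $\pi$ is $\hat{\pi} = P^{-1}\hat{\lambda} = (P_1^{-1} \otimes P_2^{-1})\,\hat{\lambda}$, with estimated dispersion matrix $\hat{\Sigma} = (n-1)^{-1} P^{-1}(\hat{\lambda}^{\delta} - \hat{\lambda}\hat{\lambda}')P'^{-1}$
$\otimes$ stands for the Kronecker product; $\hat{\lambda}^{\delta}$ is the diagonal matrix with elements $\hat{\lambda}$.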

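9 Analysis
$\mathrm{cov}(\hat{\pi}) = \Sigma_1 + \Sigma_2$
- $\Sigma_1 = n^{-1}(\pi^{\delta} - \pi\pi')$: the dispersion matrix of the regular survey estimation
- $\Sigma_2 = n^{-1} P^{-1}(\lambda^{\delta} - P\pi^{\delta}P')P'^{-1}$: nonnegative definite, represents the components of dispersion associated with the RR experiment
- $\pi^{\delta}$: diagonal matrix with elements $\pi$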

10 Kronecker Product Example
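The matrices of this example did not survive in the transcript, so the following is a stand-in sketch rather than the slide's own example. It shows a plain-Python Kronecker product and applies it to two assumed 2 x 2 distortion matrices (`P1`, `P2` and their values are hypothetical).

```python
def kron(a, b):
    """Kronecker product of two matrices given as lists of rows.

    Block (i, j) of the result is a[i][j] * b.
    """
    return [
        [x * y for x in row_a for y in row_b]
        for row_a in a
        for row_b in b
    ]


# Hypothetical distortion matrices for two items; each column sums to 1.
P1 = [[0.9, 0.1],
      [0.1, 0.9]]
P2 = [[0.8, 0.2],
      [0.2, 0.8]]

# 4 x 4 joint distortion matrix over the cells (00, 01, 10, 11).
P = kron(P1, P2)
column_sums = [sum(P[i][j] for i in range(4)) for j in range(4)]
```

Because each factor has columns summing to 1, so does the product, which is what makes it a valid distortion matrix for the joint cell probabilities.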

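11 Randomization example
Original Data (data owners)
TID   milk   sugar   bread   …   cereals
1     1      0       1       …   1
2     0      1       1       …   1
3     1      0       0       …   1
4     1      1       1       …   0
…     …      …       …       …   …
N     0      1       1       …   0
Randomized Data after RR (data miners)
TID   milk   sugar   bread   …   cereals
1     0      1       1       …   1
2     1      1       1       …   0
3     1      1       1       …   1
4     1      0       1       …   1
…     …      …       …       …   …
N     0      1       0       …   1
A: Milk, B: Cereals
Original contingency table
          not B    B       total
not A     0.415    0.043   0.458
A         0.183    0.359   0.542
total     0.598    0.402
Randomized contingency table
          not B    B       total
not A     0.368    0.097   0.465
A         0.218    0.317   0.537
total     0.586    0.414
$\pi = (0.415, 0.043, 0.183, 0.359)'$
$\hat{\lambda} = (0.368, 0.097, 0.218, 0.316)'$
$\hat{\pi} = (0.427, 0.031, 0.181, 0.362)'$
Confidence of A => B: 0.662 from $\pi$, 0.671 from $\hat{\pi}$
We can get the estimate, but how accurate is it?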
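The estimate on this slide can be approximately reproduced with a few lines of Python. The distortion parameter is not given in the transcript; the sketch assumes the same symmetric matrix with p = 0.9 for both items, a value chosen only because it brings the result close to the slide's estimate. The helpers `inv2`, `kron` and `matvec` are illustrative names.

```python
def inv2(m):
    """Inverse of a 2 x 2 matrix."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]


def kron(a, b):
    """Kronecker product of two matrices (lists of rows)."""
    return [[x * y for x in ra for y in rb] for ra in a for rb in b]


def matvec(m, v):
    """Matrix-vector product."""
    return [sum(x * y for x, y in zip(row, v)) for row in m]


p = 0.9                                  # ASSUMPTION: not stated on the slide
P_item = [[p, 1 - p], [1 - p, p]]        # per-item distortion matrix
# (P1 kron P2)^-1 equals P1^-1 kron P2^-1, so only 2 x 2 inverses are needed.
P_inv = kron(inv2(P_item), inv2(P_item))

# Observed cell proportions from the randomized data (from the slide),
# ordered as (not A not B, not A B, A not B, A B).
lam_hat = [0.368, 0.097, 0.218, 0.316]

pi_hat = matvec(P_inv, lam_hat)          # estimated original proportions
support_hat = pi_hat[3]                  # support of {milk, cereals}
confidence_hat = pi_hat[3] / (pi_hat[2] + pi_hat[3])  # milk => cereals
```

Under this assumption the estimated cell proportions agree with the slide's (0.427, 0.031, 0.181, 0.362) to within about 0.001; the small differences are consistent with the slide's inputs having been rounded to three decimals.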

12 Motivation
[Figure: original values and estimated values of several rules, marked as frequent set or not frequent set; values shown: 31.5, 35.9, 36.3, 22.1, 12.3, 23.8]
- Rule 6 is falsely recognized from the estimated value!
- Lower & upper bound: frequent set with high confidence; frequent set without confidence; both are frequent set

13 Accuracy on Support S
- Estimate of support
- Variance of support
- Interquantile range (normal dist.): estimate 0.362, range (0.346, 0.378)
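One way the support estimate, its variance and a normal-theory range could be computed is sketched below. The formula used is the standard randomized-response dispersion estimate, var = (sum_j q_j^2 * lambda_j - s^2) / (n - 1), where q is the row of the inverse distortion matrix for the cell of interest. The values p = 0.9, n = 5822 and z = 1.96 are assumptions that do not appear in the transcript; they are used because they give a range close to the one shown on the slide.

```python
import math


def inv2(m):
    """Inverse of a 2 x 2 matrix."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]


def kron(a, b):
    """Kronecker product of two matrices (lists of rows)."""
    return [[x * y for x in ra for y in rb] for ra in a for rb in b]


def support_interval(lam_hat, P_inv, n, z=1.96):
    """Estimate, variance and normal-theory range for the last cell (A and B)."""
    q = P_inv[-1]                                   # row of P^-1 for cell (A, B)
    s_hat = sum(qj * lj for qj, lj in zip(q, lam_hat))
    second_moment = sum(qj * qj * lj for qj, lj in zip(q, lam_hat))
    var = (second_moment - s_hat ** 2) / (n - 1)
    half = z * math.sqrt(var)
    return s_hat, var, (s_hat - half, s_hat + half)


p = 0.9          # ASSUMPTION: distortion parameter, not given in the slides
n = 5822         # ASSUMPTION: number of transactions, not given in the slides
P_item = [[p, 1 - p], [1 - p, p]]
P_inv = kron(inv2(P_item), inv2(P_item))
lam_hat = [0.368, 0.097, 0.218, 0.316]              # from the example slide

s_hat, s_var, (s_low, s_high) = support_interval(lam_hat, P_inv, n)
```

The range shrinks as n grows and widens as p approaches 0.5, since the entries of the inverse distortion matrix grow when the randomization gets stronger.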

14 Accuracy on Confidence C
- Estimate of confidence A => B
- Variance of confidence
- Interquantile range (ratio dist. is F(w))
- Loose range derived from Chebyshev's theorem: let $X$ be a random variable with expected value $\mu$ and finite variance $\sigma^2$. Then for any real $k > 0$, $\Pr(|X - \mu| \ge k\sigma) \le 1/k^2$.
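Chebyshev's theorem turns any variance estimate into a distribution-free range: choosing k = 1/sqrt(alpha) makes the bound 1/k^2 equal to alpha. The sketch below is a generic illustration; the function name and the numbers passed to it are hypothetical.

```python
import math


def chebyshev_interval(estimate, variance, alpha=0.05):
    """Range that holds with probability >= 1 - alpha for any distribution.

    Uses Pr(|X - mu| >= k*sigma) <= 1/k^2 with k = 1/sqrt(alpha).
    """
    k = 1.0 / math.sqrt(alpha)
    half = k * math.sqrt(variance)
    return estimate - half, estimate + half


# Hypothetical estimate and variance, for illustration only.
low, high = chebyshev_interval(estimate=0.66, variance=1e-4, alpha=0.05)

# For comparison, the half-width under a normal assumption at the same level.
normal_half = 1.96 * math.sqrt(1e-4)
chebyshev_half = (high - low) / 2
```

At alpha = 0.05 the Chebyshev multiplier is about 4.47 against 1.96 for the normal distribution, which is why the resulting range is described as loose.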

15 Bounds of other measures
Accuracy Bounds

16 General Framework
- Step 1: Estimation
  - Express the measure as a function derived from the observed variables (or their marginal totals).
  - Compute the estimated measure value.
- Step 2: Variance of the estimated measure
  - Get the variance of the estimated measure (a function of multiple known variables) through Taylor approximation.
- Step 3: Derive the interquantile range through Chebyshev's theorem.
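The three steps can be sketched generically in Python. This is one possible reading of the framework, not the authors' implementation: the covariance of the estimated cell proportions uses the standard randomized-response form (n-1)^-1 P^-1 (diag(lambda) - lambda lambda') P'^-1, the Taylor step is a first-order approximation with a numerical gradient, and p = 0.9 and n = 5822 are assumed values not given in the transcript.

```python
import math


def inv2(m):
    """Inverse of a 2 x 2 matrix."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]


def kron(a, b):
    """Kronecker product of two matrices (lists of rows)."""
    return [[x * y for x in ra for y in rb] for ra in a for rb in b]


def estimate_with_bounds(measure, lam_hat, P_inv, n, alpha=0.05, h=1e-6):
    m = len(lam_hat)
    # Step 1: estimate the original proportions and plug them into the measure.
    pi_hat = [sum(P_inv[i][j] * lam_hat[j] for j in range(m)) for i in range(m)]
    value = measure(pi_hat)
    # Step 2: covariance of pi_hat, then first-order Taylor approximation.
    mid = [[(lam_hat[i] if i == j else 0.0) - lam_hat[i] * lam_hat[j]
            for j in range(m)] for i in range(m)]
    cov = [[sum(P_inv[i][a] * mid[a][b] * P_inv[j][b]
                for a in range(m) for b in range(m)) / (n - 1)
            for j in range(m)] for i in range(m)]
    grad = []
    for i in range(m):                      # central-difference gradient
        up = list(pi_hat)
        up[i] += h
        dn = list(pi_hat)
        dn[i] -= h
        grad.append((measure(up) - measure(dn)) / (2 * h))
    var = sum(grad[i] * cov[i][j] * grad[j] for i in range(m) for j in range(m))
    # Step 3: Chebyshev range with k = 1/sqrt(alpha).
    half = math.sqrt(var / alpha)
    return value, var, (value - half, value + half)


p, n = 0.9, 5822                            # ASSUMPTIONS, not from the slides
P_item = [[p, 1 - p], [1 - p, p]]
P_inv = kron(inv2(P_item), inv2(P_item))
lam_hat = [0.368, 0.097, 0.218, 0.316]      # randomized proportions, slide 11


def confidence(pi):
    """Confidence of A => B from cells (00, 01, 10, 11)."""
    return pi[3] / (pi[2] + pi[3])


c_hat, c_var, c_range = estimate_with_bounds(confidence, lam_hat, P_inv, n)
s_hat, s_var, s_range = estimate_with_bounds(lambda pi: pi[3], lam_hat, P_inv, n)
```

Any measure that can be written as a function of the cell proportions can be passed in as `measure`; an analytic gradient would replace the numerical one where the derivative is known in closed form.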

17 Example for a measure with two variables
- Step 1: Get the estimate of the measure
- Step 2: Get the variance of the estimated measure
- Step 3: Derive the interquantile range through Chebyshev's theorem

18 Accuracy Bounds
- With an unknown distribution, Chebyshev's theorem only gives loose bounds.
- Bounds of the support vs. varying p

19 Distortion
- All the above discussions assume the distortion matrices P are known to data miners
- P could be exploited by attackers to improve the posterior probability of their prediction on sensitive items
- How about not releasing P?
  - Disclosure risk is decreased
  - Data mining result?

20 Unknown distortion P
Measures:
- Correlation ($\phi$)
- Mutual Information (M)
- Likelihood ratio ($G^2$)
- Pearson Statistics ($\chi^2$)
Some measures have monotonic properties; other measures don't have such properties.
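The expressions for these measures are missing from the transcript, so the sketch below uses their common textbook forms for a 2 x 2 table; the slide's exact definitions may differ (in particular, mutual information is sometimes normalized, which is not done here). The sample size n = 1000 is a hypothetical value; the cell proportions are the original ones from the randomization example.

```python
import math


def measures_2x2(pi, n):
    """Common association measures for cells ordered (00, 01, 10, 11)."""
    p00, p01, p10, p11 = pi
    cells = [[p00, p01], [p10, p11]]
    row = [p00 + p01, p10 + p11]            # marginals of A
    col = [p00 + p10, p01 + p11]            # marginals of B
    phi = (p11 * p00 - p01 * p10) / math.sqrt(row[0] * row[1] * col[0] * col[1])
    # Unnormalized mutual information (natural log); empty cells contribute 0.
    mi = sum(cells[i][j] * math.log(cells[i][j] / (row[i] * col[j]))
             for i in range(2) for j in range(2) if cells[i][j] > 0)
    g2 = 2 * n * mi                         # likelihood ratio statistic
    chi2 = n * sum((cells[i][j] - row[i] * col[j]) ** 2 / (row[i] * col[j])
                   for i in range(2) for j in range(2))
    return {"phi": phi, "mi": mi, "g2": g2, "chi2": chi2}


pi = [0.415, 0.043, 0.183, 0.359]           # original proportions, slide 11
n = 1000                                    # ASSUMPTION: illustrative size
result = measures_2x2(pi, n)
```

For a 2 x 2 table the Pearson statistic equals n times the squared correlation, which gives a quick consistency check on the two values.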

21 Applications: hypothesis test
- From the randomized data, if we discover an itemset which satisfies $\chi^2_{ran} \ge \chi^2_{\alpha}$, we can guarantee dependence exists among the original itemset since $\chi^2_{ori} \ge \chi^2_{ran}$.
- We are still able to derive the strongly dependent itemsets from the randomized data
- No false positives
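The monotonic behaviour behind this test can be checked numerically for the symmetric two-item case. In the sketch below the original proportions are taken from the randomization example, while n = 1000 and the list of p values are assumptions; 3.841 is the chi-square critical value for one degree of freedom at the 0.05 level.

```python
CHI2_CRITICAL_DF1_ALPHA05 = 3.841


def kron(a, b):
    """Kronecker product of two matrices (lists of rows)."""
    return [[x * y for x in ra for y in rb] for ra in a for rb in b]


def randomize(pi, p1, p2):
    """Expected randomized proportions lambda = (P1 kron P2) pi."""
    P1 = [[p1, 1 - p1], [1 - p1, p1]]
    P2 = [[p2, 1 - p2], [1 - p2, p2]]
    P = kron(P1, P2)
    return [sum(x * y for x, y in zip(row, pi)) for row in P]


def chi2_2x2(cells, n):
    """Pearson statistic for cell proportions ordered (00, 01, 10, 11)."""
    p00, p01, p10, p11 = cells
    table = [[p00, p01], [p10, p11]]
    row = [p00 + p01, p10 + p11]
    col = [p00 + p10, p01 + p11]
    return n * sum((table[i][j] - row[i] * col[j]) ** 2 / (row[i] * col[j])
                   for i in range(2) for j in range(2))


pi = [0.415, 0.043, 0.183, 0.359]           # original proportions, slide 11
n = 1000                                    # ASSUMPTION: illustrative size
chi2_ori = chi2_2x2(pi, n)

# Statistic computed on randomized data for several assumed p values.
chi2_ran = {p: chi2_2x2(randomize(pi, p, p), n) for p in (0.6, 0.7, 0.8, 0.9, 1.0)}

# A miner who sees only the randomized data (here p = 0.9) can still decide:
dependent = chi2_ran[0.9] >= CHI2_CRITICAL_DF1_ALPHA05
```

In every case the randomized statistic stays at or below the original one, so exceeding the critical value on randomized data implies the original data exceeds it too, without the miner knowing p.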

22 Conclusion
- Proposed a general approach to deriving accuracy bounds of various measures adopted in MBD analysis
- Proved that some measures have a monotonic property, so some data mining tasks can be conducted directly on randomized data (without knowing the distortion). No false positive pattern exists in the mining result.

23 Future Work
- Which measures are more sensitive to randomization?
- The tradeoff between the privacy of individual data and the accuracy of data mining results
- Accuracy vs. disclosure analysis for general categorical data

24 Acknowledgement
- NSF IIS-0546027
- Ph.D. students: Ling Guo, Songtao Guo

25 Q & A

