1
1 Maintaining Data Privacy in Association Rule Mining
Speaker: Minghua ZHANG, Oct. 11, 2002
Authors: Shariq J. Rizvi, Jayant R. Haritsa (VLDB 2002)
2
2 Content Background Problem framework MASK -- distortion part MASK -- mining part Performance Conclusion
3
3 Background
In data mining, the accuracy of the input data is very important for obtaining valuable mining results. In real life, however, data can be inaccurate for many reasons. One example is that users deliberately provide wrong information to protect their privacy.
– age, income, illness, etc.
Problem: how to protect user privacy while still getting accurate mining results?
4
4 Background (cont’d)
Privacy and accuracy are contradictory in nature. A compromise is more feasible.
– satisfactory (not 100%) privacy and satisfactory (not 100%) accuracy
This paper studies the problem in the context of mining association rules.
5
5 Overview of the Paper
The authors propose a scheme called MASK (Mining Associations with Secrecy Konstraints).
Major idea of MASK
– Apply a simple probabilistic distortion to the original data. The distortion can be done at the user's machine.
– The miner tries to find accurate mining results, given the following inputs: the distorted data, and a description of the distortion procedure.
6
6 Problem Framework
Database model
– Each customer transaction is a record in the database.
– A record is a fixed-length sequence of 1’s and 0’s. E.g., for market-basket data:
  length of the record: the total number of items sold by the market
  1: the corresponding item was bought in the transaction
  0: the corresponding item was not bought
– The database can be regarded as a two-dimensional boolean matrix.
7
7 Problem Framework (cont’d)
The matrix is very sparse. Why not use item-lists?
– The data will be distorted.
– After the distortion, it will not be as sparse as the original (true) data.
Mining objective: find frequent itemsets
– Itemsets whose support (frequency of appearance) in the database is larger than a threshold.
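The database model and the support definition can be illustrated with a minimal Python sketch. The function names, the toy baskets and the 0.5 threshold are hypothetical, chosen only for illustration; they do not come from the paper.

```python
def to_boolean_record(basket, num_items):
    """Encode a basket (a set of item indices) as a fixed-length 0/1 list."""
    return [1 if i in basket else 0 for i in range(num_items)]


def support(matrix, itemset):
    """Fraction of records that contain every item in `itemset`."""
    hits = sum(1 for row in matrix if all(row[i] == 1 for i in itemset))
    return hits / len(matrix)


# Hypothetical toy data: four transactions over four items.
baskets = [{0, 2}, {0, 1, 2}, {1}, {0, 2, 3}]
matrix = [to_boolean_record(b, num_items=4) for b in baskets]

# Itemset {0, 2} appears in three of the four records.
is_frequent = support(matrix, [0, 2]) > 0.5
```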
8
8 Background Problem framework MASK --- distortion part MASK --- mining part Performance Conclusion
9
9 MASK --- Distortion Part
Distortion procedure
– Represent a customer record by a random vector.
– Original record: $X = \{X_i\}$, where $X_i = 0$ or $1$.
– Distorted record: $Y = \{Y_i\}$, where $Y_i = 0$ or $1$.
  $Y_i = X_i$ with probability $p$
  $Y_i = 1 - X_i$ with probability $1 - p$
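A minimal sketch of this distortion step in Python, using only the standard library. The function names and the `seed` parameter are assumptions made for reproducibility of the example, not part of MASK itself.

```python
import random


def distort_record(record, p, rng):
    """Keep each bit with probability p, flip it with probability 1 - p."""
    return [x if rng.random() < p else 1 - x for x in record]


def distort_database(matrix, p, seed=0):
    """Distort every record independently (seed is for repeatability only)."""
    rng = random.Random(seed)
    return [distort_record(row, p, rng) for row in matrix]


true_db = [[1, 0, 1], [0, 0, 1]]
distorted_db = distort_database(true_db, p=0.9)
```

With p = 1 the records are returned unchanged, and with p = 0 every bit is flipped, which is why both extremes offer no privacy.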
10
10 Quantifying Privacy
Privacy metric
– The probability of reconstructing the true data
– Consider each individual item: with what probability can a given 1 or 0 in the true matrix be reconstructed?
Calculating the reconstruction probability
– Let $s_i$ = Pr(a random customer C bought the $i$-th item) = the true support of item $i$.
– The probability of correctly reconstructing a ‘1’ in a random item $i$ is:
  $R_1(p, s_i) = \frac{s_i p^2}{s_i p + (1 - s_i)(1 - p)} + \frac{s_i (1 - p)^2}{s_i (1 - p) + (1 - s_i) p}$
11
11 Reconstruction Probability
Reconstruction probability of a ‘1’ across all items:
  $R_1(p) = \frac{\sum_i s_i R_1(p, s_i)}{\sum_i s_i}$
Let $s_0$ be the average support of an item. Replacing $s_i$ by $s_0$, we get
  $R_1(p) = \frac{s_0 p^2}{s_0 p + (1 - s_0)(1 - p)} + \frac{s_0 (1 - p)^2}{s_0 (1 - p) + (1 - s_0) p}$
12
12 Reconstruction Probability (cont’d)
[Figure: $R_1(p)$ as a function of $p$ for several values of $s_0$]
Observations:
– $R_1(p)$ is high when $p$ is near 0 or 1, and it is lowest when $p = 0.5$.
– The curves become flatter as $s_0$ decreases.
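As a quick check of the first observation, the averaged formula for $R_1(p)$ can be evaluated at $p = 0.5$ and at the two extremes. This is plain arithmetic on that formula, assuming $0 < s_0 < 1$:

$$
\begin{aligned}
R_1(0.5) &= \frac{0.25\,s_0}{0.5\,s_0 + 0.5\,(1-s_0)} + \frac{0.25\,s_0}{0.5\,s_0 + 0.5\,(1-s_0)} = 0.5\,s_0 + 0.5\,s_0 = s_0,\\
R_1(1) &= \frac{s_0}{s_0} + 0 = 1,\\
R_1(0) &= 0 + \frac{s_0}{s_0} = 1.
\end{aligned}
$$

At $p = 0.5$ the distorted bit carries no information, so the reconstruction probability falls back to the prior $s_0$. At $p = 0$ or $p = 1$ the distortion is deterministic and every ‘1’ can be recovered.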
13
13 Privacy Measure
The reconstruction probability of a ‘0’
– $R_0(p)$ is, similarly, a function of $p$ and $s_0$.
The total reconstruction probability
– $R(p) = a R_1(p) + (1 - a) R_0(p)$
– $a$ is the weight parameter.
Privacy
– $P(p) = (1 - R(p)) \times 100$
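The privacy measure can be sketched in Python as follows. The slide does not spell out $R_0$; the code assumes it follows from the same derivation as $R_1$ with the roles of 0 and 1 swapped, that is, with $s_0$ replaced by $1 - s_0$. That symmetry is an assumption of this sketch, and the function names are hypothetical. The formulas require $0 < s_0 < 1$.

```python
def r1(p, s0):
    """Reconstruction probability of a '1' (averaged formula)."""
    return (s0 * p ** 2 / (s0 * p + (1 - s0) * (1 - p))
            + s0 * (1 - p) ** 2 / (s0 * (1 - p) + (1 - s0) * p))


def r0(p, s0):
    """Reconstruction probability of a '0'.

    Assumption: same derivation as r1 with 0 and 1 swapped,
    so the prior of a '0' is 1 - s0.
    """
    return r1(p, 1 - s0)


def privacy(p, s0, a):
    """P(p) = (1 - R(p)) * 100 with R(p) = a*R1(p) + (1 - a)*R0(p)."""
    r = a * r1(p, s0) + (1 - a) * r0(p, s0)
    return (1 - r) * 100


# With a = 1 only the privacy of 1's is counted.
privacy_of_ones = privacy(p=0.5, s0=0.01, a=1)
```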
14
14 Privacy Measure (cont’d)
[Figure: privacy $P(p)$ vs. $p$ for $s_0 = 0.01$]
Observations:
– For a given value of $s_0$, the curve shape is fixed. The value of $a$ determines the absolute value of privacy.
– The privacy is nearly constant over a large range of $p$. This provides flexibility in choosing a $p$ that minimizes the error in the later mining part.
15
15 Background Problem framework MASK --- distortion part MASK --- mining part Performance Conclusion
16
16 MASK --- Mining Part How to estimate the accurate supports of itemsets from a distorted database? – Remember that the miner knows the value of p. Estimating 1-itemset supports Estimating n-itemset supports The whole mining process
17
17 Estimating 1-itemset Supports
Symbols:
– $T$: the original true matrix; $D$: the distorted matrix;
– $i$: a random item;
– $C_1^T$ and $C_0^T$: the number of 1’s and 0’s in column $i$ of $T$;
– $C_1^D$ and $C_0^D$: the number of 1’s and 0’s in column $i$ of $D$.
From the distortion method, we have approximately
– $C_1^D = C_1^T p + C_0^T (1 - p)$
– $C_0^D = C_0^T p + C_1^T (1 - p)$
Let $M = \begin{bmatrix} p & 1-p \\ 1-p & p \end{bmatrix}$, $C^D = \begin{bmatrix} C_1^D \\ C_0^D \end{bmatrix}$, $C^T = \begin{bmatrix} C_1^T \\ C_0^T \end{bmatrix}$; then $C^D = M C^T$. So $C^T = M^{-1} C^D$.
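For a single item the system is 2x2 and can be inverted by hand. A minimal sketch, assuming $M$ has rows $(p, 1-p)$ and $(1-p, p)$ as implied by the two count equations; the function name is hypothetical. Note that $M$ is singular at $p = 0.5$, so no estimate exists there.

```python
def estimate_true_counts(c1_d, c0_d, p):
    """Solve C^D = M C^T for C^T, with M = [[p, 1-p], [1-p, p]]."""
    if p == 0.5:
        raise ValueError("M is singular at p = 0.5")
    det = 2 * p - 1  # determinant of M: p^2 - (1 - p)^2
    c1_t = (p * c1_d - (1 - p) * c0_d) / det
    c0_t = (p * c0_d - (1 - p) * c1_d) / det
    return c1_t, c0_t


# 100 true 1's among 1000 records, distorted with p = 0.9, give in
# expectation 100*0.9 + 900*0.1 = 180 ones and 820 zeros.
c1_t, c0_t = estimate_true_counts(180, 820, p=0.9)  # approximately (100, 900)
```

The estimates are real numbers rather than integers, because the observed counts only match their expectations approximately.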
18
18 Estimating n-itemset Supports
Still use $C^T = M^{-1} C^D$ to estimate support.
Define
– $C_k^T$ is the number of records in $T$ that have the binary form of $k$ for the given itemset.
E.g., for a 3-itemset that contains the first 3 items
– $C^T$ has $2^3 = 8$ rows
– $C_3^T$ is the number of records in $T$ of the form {0,1,1,…}
$M_{i,j} = \Pr(C_j^T \to C_i^D)$, the probability that a record of form $j$ in $T$ is distorted into form $i$ in $D$.
– $M_{7,3} = p^2 (1 - p)$ ($C_3^T \to C_7^D$, i.e. $C_{011}^T \to C_{111}^D$)
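Because each bit is distorted independently, an entry of $M$ depends only on how many bits differ between the two patterns. The sketch below builds $M$ on that basis; the function name is hypothetical and the construction is one straightforward reading of the definition above, not necessarily how the authors implement it.

```python
def build_m(n, p):
    """M[i][j] = probability that true pattern j is distorted into pattern i."""
    size = 2 ** n
    m = []
    for i in range(size):
        row = []
        for j in range(size):
            flipped = bin(i ^ j).count("1")  # bits that must flip
            row.append(p ** (n - flipped) * (1 - p) ** flipped)
        m.append(row)
    return m


m3 = build_m(3, 0.9)
# Patterns 3 (011) and 7 (111) differ in one bit, so m3[7][3] is p^2 * (1 - p).
entry = m3[7][3]
```

Each column of the matrix sums to 1, since every true pattern must be distorted into exactly one of the $2^n$ patterns.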
19
19 Mining Process
Similar to the Apriori algorithm
Difference:
– E.g., when counting supports of 2-itemsets, Apriori only needs to count the number of records that have value ‘1’ for both items, i.e. of form “11”. MASK has to keep track of all 4 combinations (00, 01, 10 and 11) for the corresponding items.
– $C_{2^n-1}^T$ is estimated from $C_0^D, C_1^D, \ldots, C_{2^n-1}^D$.
MASK requires more time and space than Apriori.
– Some optimizations (omitted)
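The counting difference can be sketched as follows. The first function counts all $2^n$ combinations for an itemset in the distorted records. The second estimates the true count of the all-ones pattern. It relies on an assumption of this sketch: since bits are distorted independently, $M$ is a Kronecker product of the 2x2 single-item matrices, so the last row of $M^{-1}$ has a closed form. The function names are hypothetical, and $p \ne 0.5$ is required.

```python
def count_patterns(records, itemset):
    """Count every 0/1 combination of `itemset` in the distorted records.

    counts[k] is the number of records whose bits, read in itemset order,
    form the binary number k (so counts[-1] is the all-ones pattern).
    """
    counts = [0] * (2 ** len(itemset))
    for row in records:
        k = 0
        for item in itemset:
            k = (k << 1) | row[item]
        counts[k] += 1
    return counts


def estimate_all_ones_count(counts, p):
    """Estimate the true count of the all-ones pattern from distorted counts.

    Uses the last row of M^{-1}, assuming M is the Kronecker product of
    the single-item matrices [[p, 1-p], [1-p, p]].
    """
    n = len(counts).bit_length() - 1  # len(counts) == 2 ** n
    total = 0.0
    for k, c in enumerate(counts):
        ones = bin(k).count("1")
        total += c * p ** ones * (-(1 - p)) ** (n - ones)
    return total / (2 * p - 1) ** n


distorted = [[1, 1, 0], [1, 0, 0], [0, 1, 1], [1, 1, 1]]
counts = count_patterns(distorted, [0, 1])  # counts for 00, 01, 10, 11
estimate = estimate_all_ones_count(counts, p=0.9)
```

Apriori would only need `counts[-1]`; here all entries of `counts` feed into the estimate, which is where the extra time and space go.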
20
20 Background Problem framework MASK --- distortion part MASK --- mining part Performance Conclusion
21
21 Performance
Data sets
– Synthetic database: 1,000,000 records; 1000 items; $s_0 = 0.01$
– Real dataset: click-stream data of a retailer web site; 600,000 records; about 500 items; $s_0 = 0.005$
22
22 Performance (cont’d)
Error metrics
– Right class, wrong support
  Infrequent itemsets: the error doesn’t matter
  Frequent itemsets: Support Error ($\rho$)
– Wrong class: Identity Error ($\sigma$)
  false positives: $\sigma^+$
  false negatives: $\sigma^-$
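The slide names these metrics without giving formulas. The sketch below assumes natural definitions: support error is the mean relative support error, in percent, over itemsets that are frequent in both the true and the mined result; false positives and false negatives are percentages of the number of truly frequent itemsets $|F|$. These definitions and the function name are assumptions of this example.

```python
def error_metrics(true_sup, mined_sup):
    """Compare mined frequent itemsets against the true ones.

    true_sup / mined_sup: dicts mapping each frequent itemset to its support.
    Returns (support_error, false_positives, false_negatives) in percent.
    Assumes true_sup is non-empty.
    """
    f, r = set(true_sup), set(mined_sup)
    common = f & r
    if common:
        support_error = 100 * sum(
            abs(mined_sup[x] - true_sup[x]) / true_sup[x] for x in common
        ) / len(common)
    else:
        support_error = None  # undefined when nothing was identified correctly
    false_pos = 100 * len(r - f) / len(f)
    false_neg = 100 * len(f - r) / len(f)
    return support_error, false_pos, false_neg


# Hypothetical toy result: C is missed, D is wrongly reported.
true_sup = {"A": 0.10, "B": 0.20, "C": 0.30}
mined_sup = {"A": 0.11, "B": 0.20, "D": 0.05}
rho, sigma_plus, sigma_minus = error_metrics(true_sup, mined_sup)
```

Because false positives are measured relative to $|F|$, that value can exceed 100% when the mined result contains many more itemsets than the true one.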
23
23 Performance (cont’d)
Parameters
– sup = 0.25%, 0.5%
– $p$ = 0.9, 0.7
– $a$ = 1: only the privacy of 1’s is of concern
– $r$ = 0%, 10%
  Coverage may be more important than precision, so a smaller support threshold is used to mine the distorted database.
  Support used to mine $D$ = sup × (1 − $r$)
24
24 Performance (cont’d)
Synthetic dataset
– Experiment 1: p=0.9 (privacy 85%), sup=0.25%

r=0%
Level  |F|    Support error  False neg. (−)  False pos. (+)
1      689    3.31           1.16
2      2648   3.58           4.49            5.14
3      1990   1.71           4.57            2.16
4      1418   1.28           3.67            0.22
5      730    1.27           5.89            0
6      212    1.36           4.25            5.19
7      35     1.40           0               0
8      3      0.99           0               0

r=10%
Level  |F|    Support error  False neg. (−)  False pos. (+)
1      689    3.37           0.73            3.19
2      2648   3.73           0.19            19.68
3      1990   1.76           0               28.09
4      1418   1.29           0               25.81
5      730    1.32           0               16.44
6      212    1.37           0               25.47
7      35     1.40           0               51.43
8      3      0.99           0               66.67
25
25 Performance (cont’d)
Synthetic dataset
– Experiment 2: p=0.9 (privacy 85%), sup=0.5%

r=0%
Level  |F|    Support error  False neg. (−)  False pos. (+)
1      560    2.60           1.25            0.89
2      470    2.13           5.53            4.89
3      326    1.22           3.07            0.31
4      208    1.34           1.44            0.48
5      125    1.81           0               0
6      43     2.62           0               0
7      10     3.44           10              0
8      1      4.50           0               0

r=10%
Level  |F|    Support error  False neg. (−)  False pos. (+)
1      560    2.66           0.18            4.29
2      470    2.21           0               44.89
3      326    1.26           0               42.64
4      208    1.35           0               51.44
5      125    1.81           0               22.4
6      43     2.62           0               18.60
7      10     3.47           0               10
8      1      4.50           0               0
26
26 Performance (cont’d)
Synthetic dataset
– Experiment 3: p=0.7 (privacy 96%), sup=0.25%, r=10%

Level  |F|    Support error  False neg. (−)  False pos. (+)
1      689    10.16          2.61            7.84
2      2648   25.23          19.52           630.93
3      1990   26.93          42.86           172.71
4      1418   29.14          65.94           0.35
5      730    28.47          79.32           0
6      212    36.25          84.91           0
7      35     51.37          85.71           0
8      3      -              100             0
27
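27 Performance (cont’d)
Real database
– Experiment 1: p=0.9 (privacy 89%), sup=0.25%

r=0%
Level  |F|    Support error  False neg. (−)  False pos. (+)
1      249    5.89           4.02            2.81
2      239    3.87           6.69            7.11
3      73     2.60           10.96           9.59
4      4      1.41           0               25.0

r=10%
Level  |F|    Support error  False neg. (−)  False pos. (+)
1      249    6.12           1.2             0.40
2      239    4.04           1.26            23.43
3      73     2.93           0               45.21
4      4      1.41           0               75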
28
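28 Performance (cont’d)
Real database
– Experiment 2: p=0.9 (privacy 89%), sup=0.5%

r=0%
Level  |F|    Support error  False neg. (−)  False pos. (+)
1      150    4.23           0.67            4.67
2      45     2.42           2.22            4.44
3      6      1.07           0               16.66

r=10%
Level  |F|    Support error  False neg. (−)  False pos. (+)
1      150    4.27           0               8
2      45     2.56           0               37.77
3      6      1.07           0               66.66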
29
29 Performance (cont’d)
Real database
– Experiment 3: p=0.7 (privacy 97%), sup=0.25%, r=10%

Level  |F|    Support error  False neg. (−)  False pos. (+)
1      249    18.96          7.23            15.66
2      239    33.59          20.08           1907.53
3      73     32.87          30.14           2308.22
4      4      7.55           50              400
30
30 Performance (cont’d)
Summary
– Good privacy and good accuracy can be achieved at the same time by careful selection of p.
– In the experiments, p around 0.9 is the best choice.
– A smaller p leads to large errors in the mining results.
– A larger p reduces the privacy greatly.
31
31 Conclusion
This paper studies the problem of achieving satisfactory privacy and accuracy simultaneously for association rule mining.
A probabilistic distortion of the true data is proposed.
Privacy is measured by a formula that is a function of $p$ and $s_0$.
32
32 Conclusion (cont’d)
A mining process is put forward to estimate the real support from the distorted database.
Experimental results show that there is a small window of p (near 0.9) that can achieve good accuracy (90%+) and privacy (80%+) at the same time.
33
33 Related Works
On preventing sensitive rules from being inferred by the miner (output privacy)
– Y. Saygin, V. Verykios and C. Clifton, “Using Unknowns to Prevent Discovery of Association Rules”, ACM SIGMOD Record, vol. 30, no. 4, 2001
– M. Atallah, E. Bertino, A. Elmagarmid, M. Ibrahim and V. Verykios, “Disclosure Limitation of Sensitive Rules”, Proc. of IEEE Knowledge and Data Engineering Exchange Workshop, Nov. 1999
34
34 Related Works
On input data privacy in distributed databases
– J. Vaidya and C. Clifton, “Privacy Preserving Association Rule Mining in Vertically Partitioned Data”, KDD 2002
– M. Kantarcioglu and C. Clifton, “Privacy-preserving Distributed Mining of Association Rules on Horizontally Partitioned Data”, Proc. of ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery, 2002
35
35 Related Works
Privacy-preserving mining in the context of classification rules
– D. Agrawal and C. Aggarwal, “On the Design and Quantification of Privacy Preserving Data Mining Algorithms”, PODS, 2001
A related paper that also appeared in 2002
– A. Evfimievski, R. Srikant, R. Agrawal and J. Gehrke, “Privacy Preserving Mining of Association Rules”, KDD 2002
36
36 ?
37
37 More Information
Distortion procedure
– $Y_i = X_i \oplus r_i'$, where $r_i'$ is the complement of $r_i$, and $r_i$ is a random variable with density function $f(r) = \mathrm{bernoulli}(p)$ $(0 \le p \le 1)$
38
38 More Information
Reconstruction error bounds (1-itemsets)
– With probability $P_E(m, p, (2p-1)\epsilon/2) \times P_E(n, p, (2p-1)\epsilon/2)$, the error is less than $\epsilon$.
  $n$: the real support count of the item; $m$: dbsize $- n$;
  $P_E(n, p, \delta) = \sum_{r = np - \delta}^{np + \delta} {}^{n}C_r \, p^r (1-p)^{n-r}$
39
39 Reconstruction probability of a ‘1’ in a random item i
– $s_i$ = the true support of item $i$ = Pr(a random customer C bought the $i$-th item); $X_i$ = the original entry for item $i$; $Y_i$ = the distorted entry for item $i$
– The probability of correct reconstruction of a ‘1’ in a random item $i$ is:
  $R_1(p, s_i) = \Pr\{Y_i = 1 \mid X_i = 1\} \times \Pr\{X_i = 1 \mid Y_i = 1\} + \Pr\{Y_i = 0 \mid X_i = 1\} \times \Pr\{X_i = 1 \mid Y_i = 0\}$
  $= \frac{s_i p^2}{s_i p + (1 - s_i)(1 - p)} + \frac{s_i (1 - p)^2}{s_i (1 - p) + (1 - s_i) p}$