Efficiency concerns in Privacy Preserving methods Optimization of MASK Shipra Agrawal.


1 Efficiency Concerns in Privacy Preserving Methods: Optimization of MASK
Shipra Agrawal

2 Privacy Preserving Data Mining of Association Rules
2/5/2003, E0 361 TIDS - Project Presentation
Aim
– To develop an accurate model of aggregated data without access to individual information
– Reconstruct support
– Identify frequent itemsets
Conflicting goals: Privacy and Accuracy

3 The third dimension - Efficiency
A lot of work has been done on the efficiency of association rule mining
Mining the distorted database is observed to be significantly more expensive
The 3 conflicting dimensions: Accuracy, Efficiency, Privacy

4 Our Goal
Supporting the conflicting goals of Privacy, Accuracy and Efficiency in a single privacy preserving scheme
– Using MASK as the base algorithm for improvement

5 MASK (Maintaining Accuracy with Secrecy Konstraints)
Miner receives a probabilistic function of the true customer database
– Bernoulli distribution with probability (1-p)
Matrix of order 2^n
– Estimated reconstructed support
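The distortion step can be sketched in a few lines of Python. This is only an illustration, assuming each transaction is a 0/1 vector and that p is the probability that a bit is kept (so a bit is flipped with probability 1-p); the function name and the sample transaction are hypothetical.

```python
import random

def distort_transaction(bits, p, rng):
    """Keep each bit with probability p, flip it with probability 1 - p.

    Assumption: p is the retention probability, 1 - p the flip probability.
    """
    return [b if rng.random() < p else 1 - b for b in bits]

rng = random.Random(0)            # fixed seed so the sketch is repeatable
original = [1, 0, 0, 1, 0]        # hypothetical transaction over 5 items
distorted = distort_transaction(original, 0.9, rng)
```

The miner only ever sees `distorted`; the supports of the true database have to be estimated from counts taken over such rows.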

6 Analysis of critical features
Simultaneously provides
– High degree of privacy: above 80%
– High level of accuracy: above 90%
Needs to keep track of 2^n components for each n-itemset!
– After distorting the true database, the original ‘11’ can now potentially be any of the four combinations 11, 10, 01, 00
– 2^n counters required (n+1 after optimization)
Costly in terms of mining time
– Around 300 times the running time of Apriori on the true database!

7 Optimization 1
Shifting the computation of the components of an n-itemset to the end of the pass
Motivated by a simple concept from set theory
In general

8 Optimization 1 (cont.)
The count of any component of an n-itemset can be calculated from the counts of k-itemsets, where k <= n
Only a single count per n-itemset needs to be tracked during the pass
– Components' counts can be calculated at the end of the pass
Can be critical for performance improvement
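One way to read this optimization is as inclusion-exclusion over ordinary itemset counts. The sketch below assumes that reading; the helper names and the toy transactions are hypothetical and not taken from the presentation. For a 2-itemset {A, B} it gives count(10) = count(A) - count(AB) and count(00) = N - count(A) - count(B) + count(AB).

```python
from itertools import combinations

def itemset_counts(transactions, items):
    """Count every subset of `items`; the empty set counts all transactions."""
    counts = {}
    for k in range(len(items) + 1):
        for subset in combinations(items, k):
            s = frozenset(subset)
            counts[s] = sum(1 for t in transactions if s <= t)
    return counts

def component_count(ones, zeros, counts):
    """Transactions containing every item in `ones` and no item in `zeros`,
    derived by inclusion-exclusion from plain itemset counts."""
    total = 0
    for k in range(len(zeros) + 1):
        for extra in combinations(zeros, k):
            total += (-1) ** k * counts[frozenset(ones) | frozenset(extra)]
    return total

# Hypothetical toy database over items A and B.
transactions = [frozenset("AB"), frozenset("A"), frozenset("B"),
                frozenset(), frozenset("AB")]
counts = itemset_counts(transactions, ["A", "B"])

c11 = component_count(["A", "B"], [], counts)
c10 = component_count(["A"], ["B"], counts)
c01 = component_count(["B"], ["A"], counts)
c00 = component_count([], ["A", "B"], counts)
```

Only the plain itemset counts are gathered during the pass; the four component counts are derived afterwards and always sum to the number of transactions.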

9 Optimization 2
Exploiting the symmetric properties of the distortion matrix to eliminate matrix computations.
Observations
– Only the first row of the inverted matrix is required for reconstruction
– Highly symmetric matrix: the partition method can be applied

10 Optimization 2 (cont.)
By applying the partition method, the first row of the inverse matrix in the nth pass can be calculated by
– multiplying p / (p^2 - q^2) with the first row of the (n-1)th inverse to get the first half of the elements
– multiplying -q / (p^2 - q^2) with the first row of the (n-1)th inverse to get the other half of the elements
Only a single row needs to be maintained
Large matrix inversion is eliminated
– Makes the implementation simple and elegant
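The recurrence can be checked numerically. The sketch below makes three assumptions: q = 1 - p on this slide, the distortion matrix for n items has the block form [[p*M, q*M], [q*M, p*M]] built from the (n-1)-item matrix M, and the first component is the all-ones state. The function names are hypothetical. The explicit matrix is built only to verify the row; the optimization itself never needs it.

```python
def inverse_first_row(n, p):
    """First row of the inverse distortion matrix, built pass by pass."""
    q = 1.0 - p                      # assumption: q is the flip probability
    d = p * p - q * q
    row = [1.0]                      # 0 items: the 1x1 identity
    for _ in range(n):
        first_half = [p / d * x for x in row]
        other_half = [-q / d * x for x in row]
        row = first_half + other_half
    return row

def distortion_matrix(n, p):
    """Full 2^n x 2^n matrix, used here only to verify the recurrence."""
    q = 1.0 - p
    m = [[1.0]]
    for _ in range(n):
        top = [[p * x for x in r] + [q * x for x in r] for r in m]
        bottom = [[q * x for x in r] + [p * x for x in r] for r in m]
        m = top + bottom
    return m

n, p = 3, 0.9
row = inverse_first_row(n, p)
matrix = distortion_matrix(n, p)

# Multiplying the row by the matrix should give the first row of the identity.
product = [sum(row[i] * matrix[i][j] for i in range(len(row)))
           for j in range(len(row))]
```

Under these assumptions, the reconstructed support of an n-itemset is the dot product of `row` with the vector of distorted component counts, so no 2^n x 2^n inversion is needed.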

11 Performance Study
Data Set
– Synthetically generated by the IBM Almaden generator
– Support = 0.25% unless otherwise mentioned
– Parameter table
Symbol  Parameter Meaning             Value
N       No. of items                  1000
T       Mean transaction length       10
I       Mean frequent itemset length  4
D       No. of transactions           1M

12 Comparison of improved MASK performance with Apriori and MASK
[Figure: running times shown on a linear scale and on a logarithmic scale]
Improvement by a factor of approximately 13
Running time of EMASK is still 25-30 times that of Apriori!

13 Explanation for Performance Gap
Time taken by reconstruction from distorted counts?
[Figure: contribution of reconstruction time to total mining time]

14 Who is the Culprit?
Distortion of 0s to 1s
– MASK distorts 0s and 1s with equal probability, 0.9 (ideal value observed)
– Sparse database
→ Large number of 0s
→ More 0s are distorted to 1s than 1s are distorted to 0s
→ Number of 1s in the distorted database increases significantly
→ Transaction length increases
→ Counting passes of Apriori over the distorted database take significantly more time than those over the original database
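A back-of-the-envelope calculation shows how large the effect is. The sketch assumes that 0.9 is the probability that a bit is kept (so each bit flips with probability 0.1) and uses N = 1000 items and a mean transaction length T = 10 from the parameter table; the function name is hypothetical.

```python
def expected_distorted_length(n_items, mean_length, p_keep):
    """Expected number of 1s in a transaction after symmetric distortion."""
    ones_kept = mean_length * p_keep                         # 1s that stay 1
    zeros_flipped = (n_items - mean_length) * (1.0 - p_keep)  # 0s that become 1
    return ones_kept + zeros_flipped

# Assumption: 0.9 is the retention probability, 0.1 the flip probability.
length = expected_distorted_length(n_items=1000, mean_length=10, p_keep=0.9)
```

Under these assumptions roughly 9 of the original 10 ones survive while about 99 of the 990 zeros turn into ones, so the expected length grows from 10 to about 108.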

15 Experimental illustration
[Figures: variation of average transaction length with p; variation of mining time with average transaction length]
Average transaction length, and hence mining time, decreases with decreasing distortion probability

16 Inference
Decreasing the distortion probability decreases transaction length
→ Distortion of 0 to 1 has more impact on transaction length than the distortion of 1 to 0
We know
– Decreasing the distortion probability decreases privacy
– Preference for privacy of 1s over 0s
→ Distortion of 1s should have more impact on privacy than that of 0s

17 Inference (cont.)
Proposed solution for performance improvement
– Distort 1s and 0s with different probabilities
– Distort more 1s and fewer 0s, to preserve privacy and efficiency
– More distortion of 1s to 0s will decrease accuracy, while less distortion of 0s will support it
We may be able to get reasonable accuracy for some values of p and q that support all three goals, where
p = probability of a 1 remaining unflipped
q = probability of a 0 remaining unflipped
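A minimal sketch of the two-parameter scheme follows, using p and q as defined on this slide. The function names are hypothetical, and the single-item estimate is a standard inversion of the expected distorted count, not necessarily the exact procedure used in the project. N = 1000 and T = 10 are taken from the parameter table.

```python
import random

def distort_asymmetric(bits, p, q, rng):
    """Keep a 1 with probability p and a 0 with probability q."""
    out = []
    for b in bits:
        keep = p if b == 1 else q
        out.append(b if rng.random() < keep else 1 - b)
    return out

def expected_length(n_items, mean_length, p, q):
    """Expected number of 1s per transaction after distortion."""
    return mean_length * p + (n_items - mean_length) * (1.0 - q)

def estimate_true_count(observed_ones, n_rows, p, q):
    """Invert E[observed] = p*c + (1-q)*(n_rows - c) for one item's count c."""
    return (observed_ones - (1.0 - q) * n_rows) / (p + q - 1.0)

rng = random.Random(0)
sample = distort_asymmetric([1, 0, 0, 1, 0], 0.40, 0.97, rng)

len_a = expected_length(1000, 10, 0.40, 0.97)   # about 33.7
len_b = expected_length(1000, 10, 0.40, 0.99)   # about 13.9
```

The two expected lengths agree with the transaction lengths of 33.7 and 13.9 reported in the Observations slides for these (p, q) pairs, which suggests the reading of p and q used here matches the experiments.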

18 Modifications to MASK
Distortion procedure
Reconstruction probability

19 Search Space
Apply privacy constraints
[Figure: search space for reconstruction probability <= 0.21]

20 Search Space (cont.)
Apply accuracy constraints
[Figure: search space with added accuracy constraint <= 25; axes p and q]

21 Most interesting points
Nearest neighbor search
– 5-dimensional space [transaction length,,,,R(p,q)]
– Query point (0,0)

22 Observations
p = 0.40, q = 0.97, minsupp = 0.0025
– Reconstruction probability 0.142
– Time (sec) 338.9 [Transaction length = 33.7]
Level 1: 4.87 [3.852]    2.36 [1.96]    1.69 [1.302]
Level 2: 7.29 [4.190]    5 [21.875]     10.76 [8.837]
Level 3: 6.8117 [1.751]  0 [0]
Level 4: 13.9 [1.638]    0 [0]
Time taken is 5 times Apriori
85% privacy

23 Observations
p = 0.40, q = 0.99, minsupp = 0.0025
– Reconstruction probability 0.2
– Time (sec) 151.1 [Transaction length = 13.9]
Level 1: 2.73 [3.852]   0.66 [1.96]    0.79 [1.302]
Level 2: 4.82 [4.190]   19.79 [21.88]  9.52 [8.837]
Level 3: 5.78 [1.751]   0 [0]
Level 4: 12.89 [1.638]  0 [0]
Time taken is 2 times Apriori
80% privacy

24 Conclusion
A large number of points (p,q) exist where the three metrics of privacy, accuracy and efficiency can be achieved with reasonable success.

