Efficiency concerns in Privacy Preserving methods Optimization of MASK Shipra Agrawal.

Presentation transcript:

Efficiency concerns in Privacy Preserving methods Optimization of MASK Shipra Agrawal

2/5/2003 E0 361 TIDS - Project Presentation

Privacy Preserving Data Mining of Association Rules
Aim
– To develop an accurate model of aggregated data without access to individual information
Reconstruct support
Identify frequent itemsets
Conflicting goals: Privacy, Accuracy

The third dimension - Efficiency
A lot of work has been done on the efficiency of association rule mining
Mining the distorted database is observed to be significantly more expensive
The 3 conflicting dimensions: Accuracy, Efficiency, Privacy

Our Goal
Supporting the conflicting goals of Privacy, Accuracy and Efficiency in a single privacy-preserving scheme
– Using MASK as the base algorithm for improvement

MASK (Maintaining Accuracy with Secrecy Konstraints)
Miner receives a probabilistic function of the true customer database
– Bernoulli distribution with probability (1-p)
Matrix of order 2^n
– Estimated reconstructed support
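
A minimal sketch of the distortion step, assuming each item bit is kept with probability p and flipped with probability 1-p, and showing how the true count of a single item can be estimated from its distorted count. The function names and the toy column are illustrative assumptions, not taken from the slides.

```python
import random

def distort_column(bits, p, rng):
    """Keep each bit with probability p, flip it with probability 1 - p (assumed reading)."""
    return [b if rng.random() < p else 1 - b for b in bits]

def estimate_true_count(distorted_ones, n_rows, p):
    """Invert E[distorted_ones] = p * true + (1 - p) * (n_rows - true)."""
    return (distorted_ones - (1 - p) * n_rows) / (2 * p - 1)

rng = random.Random(0)                                   # fixed seed for repeatability
n_rows, true_ones, p = 20000, 1000, 0.9
column = [1] * true_ones + [0] * (n_rows - true_ones)    # toy single-item column
distorted = distort_column(column, p, rng)
estimate = estimate_true_count(sum(distorted), n_rows, p)
print(f"true={true_ones} distorted={sum(distorted)} estimate={estimate:.1f}")
```

The estimate is only correct in expectation, so it fluctuates around the true count; the fluctuation grows as p approaches 0.5, where the denominator 2p - 1 vanishes.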

Analysis of critical features
Simultaneously provides
– High degree of privacy: above 80%
– High level of accuracy: above 90%
Needs to keep track of 2^n components for each n-itemset!
– After distorting the true database, the original '11' can now potentially be any of the four combinations 11, 10, 01, 00
– 2^n counters required (n+1 after optimization)
Costly in terms of mining time
– Around 300 times the cost of Apriori running on the true database!

Optimization 1
Shifting the computation of the components of an n-itemset to the end of the pass
Motivated by a simple concept from set theory
In general: (formula not preserved in the transcript)

Optimization 1 (cont.)
The count of any component of an n-itemset can be calculated from the counts of k-itemsets where k <= n
Only a single count per n-itemset needs to be tracked during the pass
– Components' counts can be calculated at the end of the pass
Can be critical for performance improvement
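
One way to read this optimization is as inclusion-exclusion over itemset counts. The sketch below is an assumption about the intended set-theoretic identity (the slide's own formula is not available); the toy database and all names are hypothetical. It derives the count of a component such as '10' (first item present, second absent) from plain itemset counts only.

```python
from itertools import combinations

def support_count(db, itemset):
    """Number of transactions containing every item in `itemset`."""
    return sum(1 for t in db if itemset <= t)

def component_count(counts, present, absent):
    """Rows where all `present` items are 1 and all `absent` items are 0.

    Inclusion-exclusion over subsets U of `absent`:
        sum over U of (-1)^|U| * count(present | U)
    """
    total = 0
    absent = sorted(absent)
    for k in range(len(absent) + 1):
        for extra in combinations(absent, k):
            total += (-1) ** k * counts[frozenset(present) | frozenset(extra)]
    return total

# Toy database (hypothetical): each transaction is a set of items.
items = ["a", "b", "c"]
db = [frozenset(t) for t in (["a", "b"], ["a"], ["b", "c"], ["a", "b", "c"], [], ["a", "c"])]

# Itemset counts for every subset, including the empty set (= number of rows).
counts = {
    frozenset(s): support_count(db, frozenset(s))
    for k in range(len(items) + 1)
    for s in combinations(items, k)
}

# Component '10' of itemset {a, b}: a present, b absent.
print(component_count(counts, {"a"}, {"b"}))
```

Only the itemset counts are touched during the pass; every component count is derived afterwards from the `counts` table.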

Optimization 2
Exploiting the symmetric properties of the distortion matrix to eliminate matrix computations
Observations
– Only the first row of the inverse matrix is required for reconstruction
– Highly symmetric matrix: the partition method can be applied

Optimization 2 (cont.)
By applying the partition method, the first row of the inverse matrix in the nth pass can be calculated by
– multiplying p / (p^2 - q^2) with the first row of the (n-1)th inverse to get the first half of the elements
– multiplying -q / (p^2 - q^2) with the first row of the (n-1)th inverse to get the other half of the elements
Only a single row needs to be maintained
Large matrix inversion eliminated
– Makes implementation simple and elegant
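
The recursion can be sketched as follows. This assumes that q = 1 - p on this slide, that the 1-itemset distortion matrix is [[p, q], [q, p]], that the n-itemset matrix is its n-fold Kronecker power, and that components are ordered with the all-ones pattern first. Under those assumptions the code builds the first row of the inverse without inverting anything, then checks it against the explicitly built matrix. All function names are illustrative.

```python
def first_row_of_inverse(n, p):
    """First row of the inverse distortion matrix for n-itemsets, built pass by pass."""
    q = 1.0 - p
    d = p * p - q * q
    row = [1.0]                       # 0-itemset: the 1x1 identity
    for _ in range(n):
        # first half scaled by p/d, second half by -q/d
        row = [p / d * x for x in row] + [-q / d * x for x in row]
    return row

def kron(a, b):
    """Kronecker product of two matrices given as lists of rows."""
    return [[x * y for x in ra for y in rb] for ra in a for rb in b]

def distortion_matrix(n, p):
    """Explicit 2^n x 2^n distortion matrix, used here only to verify the recursion."""
    q = 1.0 - p
    base = [[p, q], [q, p]]
    m = [[1.0]]
    for _ in range(n):
        m = kron(base, m)
    return m

n, p = 3, 0.9
row = first_row_of_inverse(n, p)
m = distortion_matrix(n, p)
# row times M should be the first row of the identity: [1, 0, 0, ...]
check = [sum(row[j] * m[j][k] for j in range(len(row))) for k in range(len(row))]
print([round(v, 9) for v in check])
```

The reconstructed count of the all-ones component is then the dot product of this row with the vector of distorted component counts.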

Performance Study
Data set
– Synthetically generated by the IBM Almaden generator
– Support = 0.25% unless otherwise mentioned
– Parameter table

Symbol  Parameter Meaning              Value
N       No. of items                   1000
T       Mean transaction length        10
I       Mean frequent itemset length   4
D       No. of transactions            1M

Comparison of improved MASK performance with Apriori and MASK
(Plots: on linear scale; on logarithmic scale)
Improvement by a factor of approximately 13
Running time of EMASK is still [factor missing in transcript] times that of Apriori!

Explanation for Performance Gap
Time taken by reconstruction from distorted counts?
(Plot: contribution of reconstruction time to total mining time)

Who is the Culprit?
Distortion of 0s to 1s
– MASK distorts 0s and 1s with equal probability 0.9 (ideal value observed)
– Sparse database => large number of 0s
=> More 0s distort to 1s than 1s distort to 0s
=> Number of 1s in the distorted database significantly increases
=> Transaction length increases
=> Counting passes of Apriori over the distorted database take significantly more time than those over the original database
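
A small simulation can illustrate the effect. It assumes that 0.9 is the probability of a bit being kept (so each bit flips with probability 0.1) and uses N = 1000 items and T = 10 items per transaction from the parameter table; the fixed-length toy rows are a simplification, not the generator's actual output.

```python
import random

rng = random.Random(1)                 # fixed seed for repeatability
n_items, mean_len, p = 1000, 10, 0.9   # N and T from the parameter table; p as on the slide
n_rows = 100

zeros_to_ones = ones_to_zeros = 0
for _ in range(n_rows):
    row = [1] * mean_len + [0] * (n_items - mean_len)   # every toy row has exactly T ones
    for bit in row:
        if rng.random() >= p:          # flip with probability 1 - p
            if bit == 0:
                zeros_to_ones += 1
            else:
                ones_to_zeros += 1

avg_len = mean_len + (zeros_to_ones - ones_to_zeros) / n_rows
print(zeros_to_ones, ones_to_zeros, round(avg_len, 1))
```

Under these assumptions the expected distorted length is 10 x 0.9 + 990 x 0.1 = 108 items, roughly ten times the original, almost entirely due to 0s turning into 1s.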

Experimental illustration
(Plots: variation of average transaction length with p; variation of mining time with average transaction length)
Average transaction length, and hence mining time, decreases with decreasing distortion probability

Inference
Decreasing the distortion probability decreases transaction length
=> Distortion of 0s to 1s has more impact on transaction length than the distortion of 1s to 0s
We know that decreasing the distortion probability decreases privacy
Preference for privacy of 1s over 0s
=> Distortion of 1s should have more impact on privacy than that of 0s

Inference (cont.)
Proposed solution for performance improvement
– Distort 1s and 0s with different probabilities
– Distort more 1s and fewer 0s, to preserve privacy and efficiency
– More distortion of 1s to 0s will decrease accuracy, while less distortion of 0s will support it
We may be able to get reasonable accuracy for some values of p and q that support all three goals, where
p = probability of a 1 remaining unflipped
q = probability of a 0 remaining unflipped
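
A hedged sketch of what distortion with separate p and q could look like for a single item, together with the matching estimate of the true count. The estimator follows directly from the definitions of p and q above; the function names, the toy column and the choice of p = 0.40, q = 0.97 as demo values are assumptions for illustration.

```python
import random

def distort_bit(bit, p, q, rng):
    """A 1 stays 1 with probability p; a 0 stays 0 with probability q."""
    keep = p if bit == 1 else q
    return bit if rng.random() < keep else 1 - bit

def estimate_true_ones(distorted_ones, n_rows, p, q):
    """Invert E[distorted_ones] = p * true + (1 - q) * (n_rows - true); needs p + q != 1."""
    return (distorted_ones - (1 - q) * n_rows) / (p + q - 1)

rng = random.Random(2)                                   # fixed seed for repeatability
n_rows, true_ones, p, q = 20000, 1000, 0.40, 0.97
column = [1] * true_ones + [0] * (n_rows - true_ones)    # toy single-item column
distorted_ones = sum(distort_bit(b, p, q, rng) for b in column)
estimate = estimate_true_ones(distorted_ones, n_rows, p, q)
print(f"distorted={distorted_ones} estimate={estimate:.1f}")
```

With p = q this reduces to the symmetric case; the closer p + q gets to 1, the more the division amplifies sampling noise, which is one way to see the accuracy trade-off.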

Modifications to MASK
Distortion procedure
Reconstruction probability

Searching Space
Apply privacy constraints
Search space for reconstruction probability <= 0.21

Searching Space (cont.)
Apply accuracy constraints
Added accuracy constraint <= 25
(Plot axes: p, q)

Most interesting points
Nearest neighbor search
– 5-dimensional space [transaction length,,,,R(p,q)]
– Query point (0,0)

Observations
p = 0.40, q = 0.97, minsupp =
– Reconstruction probability
– Time (sec) [Transaction length = 33.7]
Level [3.852]2.36 [1.96]1.69 [1.302] [4.190]5 [21.875]10.76 [8.837] [1.751]0 [0] [1.638]0 [0]
Time taken is 5 times that of Apriori
85% privacy

Observations
p = 0.40, q = 0.99, minsupp =
– Reconstruction probability 0.2
– Time (sec) [Transaction length = 13.9]
Level [3.852]0.66 [1.96]0.79 [1.302] [4.190] [21.88]9.52 [8.837] [1.751]0 [0] [1.638]0 [0]
Time taken is 2 times that of Apriori
80% privacy
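
The transaction lengths quoted on the two Observations slides can be reproduced with a short calculation, assuming N = 1000 items and T = 10 items per transaction from the parameter table, with p and q defined as the probabilities of a 1 and a 0 remaining unflipped. The function name is illustrative.

```python
def expected_length(n_items, mean_len, p, q):
    """Expected 1s per distorted transaction: surviving 1s plus 0s flipped to 1."""
    surviving_ones = mean_len * p
    flipped_zeros = (n_items - mean_len) * (1 - q)
    return surviving_ones + flipped_zeros

n_items, mean_len = 1000, 10           # N and T from the parameter table
for p, q in ((0.40, 0.97), (0.40, 0.99)):
    print(p, q, round(expected_length(n_items, mean_len, p, q), 1))
```

For (p, q) = (0.40, 0.97) this gives 4 + 29.7 = 33.7, and for (0.40, 0.99) it gives 4 + 9.9 = 13.9, matching the lengths reported on the slides.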

Conclusion
A large number of points (p,q) exist where the three metrics of privacy, accuracy and efficiency can be achieved with reasonable success.