1 Mining Approximate Frequent Itemsets in the Presence of Noise. By J. Liu, S. Paulsen, X. Sun, W. Wang, A. Nobel, and J. Prins. Presentation by Apurv Awasthi

2 Title Statement This paper introduces an approach to noise-tolerant frequent itemset mining over the binary matrix representation of a database

3 Index
 Introduction to Frequent Itemset Mining
   o Frequent Itemset Mining
   o Binary Matrix Representation Model
   o Problems
 Motivation
 Proposed Model
 Proposed Algorithm
 AFI Mining vs. Exact Frequent Itemset Mining
 Related Works
 Experimental Results
 Discussion
 Conclusion

4 Introduction to Frequent Itemset Mining
 Frequent pattern: a pattern (a set of items, subsequences, substructures, etc.) that occurs frequently in a data set
 Originally developed to discover association rules
 Applications
   o Bio-molecular applications: DNA sequence analysis, protein structure analysis
   o Business applications: market basket analysis, sale campaign analysis

5 The Binary Matrix Representation Model
 Model for representing relational databases
 Rows correspond to objects (transactions)
 Columns correspond to attributes of the objects (items)
   o '1' indicates presence, '0' indicates absence
 Frequent itemset mining is a key technique for analyzing such data
   o Apply the Apriori algorithm

   Item -->      I1  I2  I3  I4  I5
   Transaction
   T1             1   0   1   1   0
   T2             0   1   1   0   1
   T3             1   1   1   0   1
   T4             0   1   0   0   1
   T5             1   0   0   0   0
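As a small sketch (my own, not from the paper) of how this representation can be built, the snippet below converts the slide's example transactions into the 0/1 matrix shown above:

```python
# Minimal sketch: build the 0/1 (transaction x item) matrix used throughout the slides.
# The transactions reproduce the example table on this slide.
transactions = {
    "T1": {"I1", "I3", "I4"},
    "T2": {"I2", "I3", "I5"},
    "T3": {"I1", "I2", "I3", "I5"},
    "T4": {"I2", "I5"},
    "T5": {"I1"},
}
items = ["I1", "I2", "I3", "I4", "I5"]

# Row t, column i is 1 if item i appears in transaction t, else 0.
matrix = {t: [1 if i in present else 0 for i in items]
          for t, present in transactions.items()}

for t, row in matrix.items():
    print(t, row)
```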

6 Problem with Frequent Itemset Mining
 The traditional model for mining frequent itemsets requires that every item occur in each supporting transaction
 Real data is typically subject to noise
 Reasons for noise
   o Human error
   o Measurement error
   o Vagaries of human behavior
   o Stochastic nature of the biological processes being studied
 The traditional requirement is therefore NOT a practical assumption!

7 Effect of Noise
 Fragmentation of patterns by noise
   o Mining discovers multiple small fragments of the true itemset
   o and misses the true itemset itself!
 Example (see figure): an exact frequent itemset mining algorithm will miss the main itemset 'A' and instead observe three fragmented itemsets (Itemsets 1, 2 and 3)
 The fragmented itemsets may not satisfy the minimum support criterion and will therefore be discarded

8 Mathematical Proof of Fragmentation
 With probability 1, M(Y) <= 2*log_a(n) - 2*log_a(log_a(n)) when n is sufficiently large, i.e. in the presence of noise only a fraction of the initial block of 1s can be recovered, where
   o Matrix X contains the actual values recorded in the absence of any noise
   o Matrix Z is a binary noise matrix whose entries are independent Bernoulli random variables, Z_ij ~ Bern(p) for 0 <= p <= 0.5
   o Y = X xor Z, and a = (1 - p)^(-1)
   o M(Y) is the largest k such that Y contains k transactions having k common items
 From: "Significance and Recovery of Block Structures in Binary Matrices with Noise" by X. Sun & A.B. Nobel
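To make the fragmentation effect concrete, here is a small, hedged simulation (my own illustration, not from the paper or the cited proof): plant a block of 1s, flip each entry independently with probability p, and count how many planted rows still contain every item of the block exactly.

```python
import random

# Hedged illustration: plant a k x k block of 1s, add Bernoulli(p) noise,
# and count how many planted rows still contain all k items.
random.seed(0)

n, k, p = 200, 20, 0.1   # n transactions over k items; the first k rows form the planted block
X = [[1 if r < k else 0 for _ in range(k)] for r in range(n)]

# Y = X xor Z, where Z has i.i.d. Bernoulli(p) entries.
Y = [[x ^ (1 if random.random() < p else 0) for x in row] for row in X]

survivors = sum(all(v == 1 for v in row) for row in Y[:k])
print(f"planted rows: {k}, rows still containing all {k} items exactly: {survivors}")
# Each planted row survives intact with probability (1-p)^k = 0.9**20 ~ 0.12, so exact
# frequent itemset mining recovers only a fragment of the planted pattern.
```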

9 Motivation
 The failure of classical frequent itemset mining to detect simple patterns in the presence of random errors (i.e. noise) compromises the ability of these algorithms to detect associations, cluster items, or build classifiers when such errors are present

10 Possible Solutions
 Let the itemset sub-matrix contain a small fraction of 0s
   o DRAWBACK: admits "free riders" such as column h (in matrix C) and row 6 (in matrix B) of the example figure
 SOLUTION: limit the number of 0s in each row and each column

11 Proposed Model
1. Use Approximate Frequent Itemsets (AFI)
 AFI characteristics
   o The itemset sub-matrix contains a large fraction of 1s
   o A supporting transaction should contain most of the items, i.e. the fraction of 0s in every row must fall below a user-defined threshold (ε_r)
   o A supporting item should occur in most of the transactions, i.e. the fraction of 0s in every column must fall below a user-defined threshold (ε_c)
   o Number of supporting rows >= minimum support

12 AFI
 Mathematical definition
   For a given binary matrix D with item set I_0 and transaction set T_0, an itemset I ⊆ I_0 is an approximate frequent itemset AFI(ε_r, ε_c) if there exists a set of transactions T ⊆ T_0 with |T| >= |T_0| * minsup such that
   o for every transaction t in T: Σ_{i in I} D(t, i) >= (1 - ε_r) * |I|
   o for every item i in I: Σ_{t in T} D(t, i) >= (1 - ε_c) * |T|
 A weak AFI(ε) is defined similarly, with a single constraint that the overall fraction of 1s in the T x I sub-matrix be at least 1 - ε
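As an illustration of this definition, here is a small, hedged Python sketch (my own, not the authors' code) that checks whether a candidate (transactions, items) pair satisfies the AFI(ε_r, ε_c) conditions against a 0/1 matrix:

```python
from typing import Dict, List

def is_afi(matrix: Dict[str, Dict[str, int]],
           transactions: List[str], items: List[str],
           eps_r: float, eps_c: float, minsup: float) -> bool:
    """Check the AFI(eps_r, eps_c) conditions for a candidate sub-matrix.

    matrix[t][i] is 1 if transaction t contains item i, else 0.
    """
    n_total = len(matrix)                       # |T_0|
    if len(transactions) < n_total * minsup:    # support requirement
        return False
    # Every supporting transaction may miss at most an eps_r fraction of the items.
    for t in transactions:
        ones = sum(matrix[t][i] for i in items)
        if ones < (1 - eps_r) * len(items):
            return False
    # Every item may be missing from at most an eps_c fraction of the transactions.
    for i in items:
        ones = sum(matrix[t][i] for t in transactions)
        if ones < (1 - eps_c) * len(transactions):
            return False
    return True
```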

13 AFI Example (see figure)
 A, B and C are weak AFI(0.25)
 A: a valid AFI(0.25, 0.25)
 B: weak AFI(*, 0.25)
 C: weak AFI(0.25, *)

14 Drawback of AFI
 Apriori property: all sub-itemsets of a frequent itemset must themselves be frequent
 But a sub-itemset of an AFI need not be an AFI (a small worked example follows below)
   o e.g. in the example figure, A is a valid AFI for minSupport = 4, but {b,c,e}, {b,c,d}, etc. are not valid AFIs
 The AFI criteria therefore violate the Apriori property!
 PROBLEM: minimum support can no longer be used as a pruning technique
 SOLUTION: a generalization of the Apriori property for noisy conditions (called Noise Tolerant Support Pruning)
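To make the violation concrete, here is a hedged, hypothetical example (my own construction, not the matrix from the slides): with ε_r = ε_c = 0.25 and a support requirement of 4 transactions, the full itemset {a,b,c,d} below is an AFI, yet its sub-itemset {a,b,c} is supported by only one transaction.

```python
# Hypothetical 4 x 4 example (not the figure from the slides).
# Each row has exactly one 0, and each column has exactly one 0.
rows = {
    "t1": {"a": 1, "b": 1, "c": 1, "d": 0},
    "t2": {"a": 1, "b": 1, "c": 0, "d": 1},
    "t3": {"a": 1, "b": 0, "c": 1, "d": 1},
    "t4": {"a": 0, "b": 1, "c": 1, "d": 1},
}
eps = 0.25

def all_rows_tolerate(items):
    """Every row misses at most an eps fraction of the given items."""
    return all(sum(1 - r[i] for i in items) <= eps * len(items) for r in rows.values())

# {a,b,c,d}: every row misses 1 of 4 items (25%), so all 4 rows support it.
print(all_rows_tolerate(["a", "b", "c", "d"]))   # True
# {a,b,c}: 25% of 3 items allows 0 missing items, so only t1 supports it -> not frequent.
print(all_rows_tolerate(["a", "b", "c"]))        # False
```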

15 Proposed Model
1. Use Approximate Frequent Itemsets (AFI)
2. Noise Tolerant Support Pruning - to prune and generate candidate itemsets
3. 0/1 Extension - to count the support of a noise-tolerant itemset based on the support sets of its sub-itemsets

16 Noise Tolerant Support Pruning
 For given ε_r, ε_c and minsup, a noise-tolerant pruning support threshold is derived for each length-k itemset (see the proof in the extra slides)

17 0/1 Extensions
 Starting from singleton itemsets, generate (k+1)-itemsets from k-itemsets in a level-wise manner
 The number of 0s allowed per supporting transaction grows with the length of the itemset in a discrete manner: ⌊k * ε_r⌋ zeros for a length-k itemset
 1 Extension
   o If ⌊(k+1) * ε_r⌋ = ⌊k * ε_r⌋ (no additional 0 is allowed), then the transaction set of a (k+1)-itemset I is the intersection of the transaction sets of its length-k subsets
 0 Extension
   o If ⌊(k+1) * ε_r⌋ = ⌊k * ε_r⌋ + 1 (one additional 0 is allowed), then the transaction set of a (k+1)-itemset I is the union of the transaction sets of its length-k subsets (see the code sketch below)
 Proof: see the extra slides
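A minimal sketch of the extension rule, under the assumption stated above that ⌊k·ε_r⌋ zeros are allowed per supporting transaction of a length-k itemset; this is my own illustration, not the authors' implementation:

```python
import math
from typing import Dict, FrozenSet, Set

def extend_transaction_set(itemset: FrozenSet[str],
                           sub_supports: Dict[FrozenSet[str], Set[str]],
                           eps_r: float) -> Set[str]:
    """Transaction set of a (k+1)-itemset derived from the transaction sets of
    its length-k subsets via the 0/1 extension rule.

    sub_supports maps each length-k subset of `itemset` to its transaction set.
    """
    k_plus_1 = len(itemset)
    k = k_plus_1 - 1
    subsets = [itemset - {i} for i in itemset]           # all length-k subsets
    transaction_sets = [sub_supports[frozenset(s)] for s in subsets]

    if math.floor(k_plus_1 * eps_r) == math.floor(k * eps_r):
        # 1-extension: no extra 0 allowed, so a supporting transaction of I must
        # already support every length-k subset -> intersection.
        return set.intersection(*transaction_sets)
    # 0-extension: one extra 0 is allowed, so a transaction may miss one item of I
    # and still support it -> union.
    return set.union(*transaction_sets)
```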

18 Proposed Algorithm
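The algorithm itself is given as pseudo-code on the slide. The following is only a hedged, high-level reconstruction (my own) of the level-wise flow: generate candidates, compute their transaction sets with the 0/1 extension, then filter. It handles only the row constraint and uses a plain minsup check in place of the paper's noise-tolerant pruning threshold, whose exact formula is not reproduced here.

```python
import math
from itertools import combinations

def afi_sketch(matrix, eps_r, minsup):
    """Simplified level-wise AFI candidate generation.

    matrix: dict mapping transaction id -> set of items it contains.
    Returns {itemset: noise-tolerant transaction set} for surviving candidates.
    Column filtering (eps_c) and the paper's noise-tolerant threshold are omitted.
    """
    items = sorted(set().union(*matrix.values()))
    min_count = math.ceil(minsup * len(matrix))

    # Level 1: singleton itemsets and their exact transaction sets.
    supports = {frozenset([i]): {t for t, row in matrix.items() if i in row}
                for i in items}
    level = {s for s in supports if len(supports[s]) >= min_count}
    result = {s: supports[s] for s in level}

    k = 1
    while level:
        k += 1
        next_level = set()
        for a, b in combinations(sorted(level, key=sorted), 2):
            cand = a | b
            if len(cand) != k or any(cand - {i} not in level for i in cand):
                continue                                  # join + subset pruning
            subsets = [supports[cand - {i}] for i in cand]
            if math.floor(k * eps_r) == math.floor((k - 1) * eps_r):
                trans = set.intersection(*subsets)        # 1-extension
            else:
                trans = set.union(*subsets)               # 0-extension
            supports[cand] = trans
            if len(trans) >= min_count:                   # simplified pruning
                next_level.add(cand)
                result[cand] = trans
        level = next_level
    return result
```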

19 AFI vs. Exact Frequent Itemset
AFI Mining: ε_r, ε_c = 1/3; n = 8; minsup = 1

20 AFI vs. Exact Frequent Itemset
Exact Frequent Itemset Mining (MinSup = 0.5, i.e. 4 of n = 8 transactions)

   Transaction  Items
   T1           a, b, c
   T2           a, b
   T3           a, c
   T4           b, c
   T5           a, b, c, d
   T6           d
   T7           b, d
   T8           a

   1-candidates          Freq 1-itemsets       2-candidates          Freq 2-itemsets
   Itemset  Support      Itemset  Support      Itemset  Support      Itemset
   a        5            a        5            ab       3            (none)
   b        5            b        5            ac       3
   c        4            c        4            bc       3
   d        3
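A quick, hedged check of the support counts on this slide (my own snippet, using only the transactions listed above):

```python
from itertools import combinations

# Transactions from the slide.
db = {
    "T1": {"a", "b", "c"}, "T2": {"a", "b"}, "T3": {"a", "c"}, "T4": {"b", "c"},
    "T5": {"a", "b", "c", "d"}, "T6": {"d"}, "T7": {"b", "d"}, "T8": {"a"},
}
min_count = 4   # MinSup = 0.5 over 8 transactions

def support(itemset):
    return sum(itemset <= t for t in db.values())

singles = {i: support({i}) for i in "abcd"}
print(singles)                                           # {'a': 5, 'b': 5, 'c': 4, 'd': 3}
frequent_1 = [i for i, s in singles.items() if s >= min_count]
pairs = {frozenset(p): support(set(p)) for p in combinations(frequent_1, 2)}
print(pairs)                                             # ab, ac, bc each have support 3
print([p for p, s in pairs.items() if s >= min_count])   # [] -> no frequent 2-itemsets
```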

21 AFI vs. Exact Frequent Itemset - Result
 Approximate Frequent Itemset mining: generates the frequent itemset {a,b,c}
 Exact Frequent Itemset mining: cannot generate any frequent itemset in the presence of noise for the given minimum support value

22 Related Works
 Yang et al. (2001) proposed two error-tolerant models, termed weak error-tolerant itemsets (weak ETI, equivalent to weak AFI) and strong ETI (equivalent to AFI(ε_r, *))
   DRAWBACKS
   o No efficient pruning technique - relies on heuristics and sampling techniques
   o Does not preclude columns of 0s
 Steinbach et al. (2004) proposed the "support envelope", a tool for exploration and visualization of the high-level structures of association patterns. A symmetric ETI model is proposed in which the same fraction of errors is allowed in both rows and columns.
   DRAWBACKS
   o Uses the same error coefficient for rows and columns, i.e. ε_r = ε_c
   o Admits only a fixed number of 0s in the itemset; the tolerated fraction of noise does not vary with the size of the itemset sub-matrix

23 Related Works
 Seppänen and Mannila (2004) proposed mining dense itemsets in the presence of noise, where a dense itemset is an itemset with a sufficiently large sub-matrix that exceeds a given density threshold of attributes present
   DRAWBACKS
   o Enforces the constraint that all sub-itemsets of a dense itemset must be frequent - will fail to identify larger itemsets that have sufficient support when some of their sub-itemsets do not
   o Requires repeated scans of the database

24 Experimental Results - Scalability
 Database of 10,000 transactions and 100 items
 Run time increases as noise tolerance increases
 Reducing the item-wise error constraint leads to a greater reduction in run time than reducing the transaction-wise error constraint

25 Experimental Results - Synthetic Data
 Quality testing for a single cluster
   o Create data with an embedded pattern
   o Add noise by flipping each entry with probability p, where 0.01 <= p <= 0.2

26 Experimental Results - Synthetic Data
 Quality testing for multiple clusters
   o Create data with multiple embedded patterns
   o Add noise by flipping each entry with probability p, where 0.01 <= p <= 0.2

27 Experimental Results - Real World Data
 Zoo Data Set
   o Database contains 101 instances and 18 attributes
   o All instances are classified into 7 classes, e.g. mammals, fish, etc.
 Comparison of Exact, ETI(ε_r) and AFI(ε_r, ε_c) mining
   o Generated the subset of animals in each class, then found the subsets of their common features
   o Identified "fins" and "domestic" as common features - NOT necessarily true
   o Only AFI was able to recover 3 classes with 100% accuracy

28 Discussion
 Advantages
   o Flexibility of placing constraints independently along rows and columns
   o Generalized Apriori technique for pruning
   o Avoids repeated scans of the database by using the 0/1 extension

29 Summary
 The paper presents an algorithm for mining approximate frequent itemsets from noisy data
 It introduces the AFI model and a generalized Apriori property for pruning
 The proposed algorithm generates more useful itemsets than existing algorithms and is also computationally more efficient

30 Thank You!

31 Extra Slides for Questions

32 Applying Apriori Algorithm (Min_sup = 2)

   Database D                 Binary matrix representation
   TID  Items                 Item -->  a  b  c  d  e
   T1   a, c, d               T1        1  0  1  1  0
   T2   b, c, e               T2        0  1  1  0  1
   T3   a, b, c, e            T3        1  1  1  0  1
   T4   b, e                  T4        0  1  0  0  1
                              T5        0  0  0  0  0

   1-candidates (Scan D)      Freq 1-itemsets
   Itemset  Sup               Itemset  Sup
   a        2                 a        2
   b        3                 b        3
   c        3                 c        3
   d        1                 e        3
   e        3

   2-candidates: ab, ac, ae, bc, be, ce

   Counting (Scan D)          Freq 2-itemsets
   Itemset  Sup               Itemset  Sup
   ab       1                 ac       2
   ac       2                 bc       2
   ae       1                 be       3
   bc       2                 ce       2
   be       3
   ce       2

   3-candidates: bce

   Freq 3-itemsets (Scan D)
   Itemset  Sup
   bce      2
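For reference, a compact, hedged Apriori sketch (my own, not the slide's implementation) that reproduces the walkthrough above on database D:

```python
from itertools import combinations

def apriori(db, min_sup):
    """Level-wise Apriori: returns {itemset: support} for all frequent itemsets."""
    items = sorted(set().union(*db.values()))
    frequent = {}
    level = {frozenset([i]) for i in items}   # level-1 candidates
    k = 1
    while level:
        counts = {c: sum(c <= t for t in db.values()) for c in level}   # scan D
        survivors = {c: s for c, s in counts.items() if s >= min_sup}
        frequent.update(survivors)
        # Join frequent k-itemsets into (k+1)-candidates and prune any candidate
        # with an infrequent k-subset (Apriori property).
        k += 1
        level = {a | b for a in survivors for b in survivors
                 if len(a | b) == k
                 and all(frozenset(s) in survivors for s in combinations(a | b, k - 1))}
    return frequent

db = {"T1": {"a", "c", "d"}, "T2": {"b", "c", "e"},
      "T3": {"a", "b", "c", "e"}, "T4": {"b", "e"}}
result = apriori(db, min_sup=2)
print(sorted(("".join(sorted(s)), sup) for s, sup in result.items()))
# Matches the slide: a2, b3, c3, e3, ac2, bc2, be3, ce2, bce2
```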

33 Noise Tolerant Support Pruning - Proof

34 0/1 Extensions Proof  Number of zeroes allowed in an itemset grows with the length of the itemset

