
1 Quantitative Evaluation of Approximate Frequent Pattern Mining Algorithms
Rohit Gupta, Gang Fang, Blayne Field, Michael Steinbach and Vipin Kumar
Email: rohit@cs.umn.edu
Department of Computer Science and Engineering, University of Minnesota, Twin Cities
KDD 2008, August 26, 2008

2 Outline of the Talk
- Introduction
- Background of Algorithms
- Motivation
- Contributions
- Evaluation Methodology
- Experimental Results and Discussion
  - Synthetic Datasets
  - Effects of higher random noise
- Efficiency and Scalability
- Conclusions

3 Introduction
- Traditional association mining algorithms use a strict definition of support: for an itemset I', a transaction must contain all of its items to be counted as a supporting transaction.
- Real-life datasets are noisy.
- True patterns cannot be found at the desired level of support because noise fragments them.
- Too many spurious patterns appear at low support levels.
[Figure: example 4x4 binary transaction matrices illustrating a true pattern fragmented by random noise]
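To make the strict support definition concrete, here is a minimal illustrative sketch (my own, not from the slides), assuming the data is a 0/1 NumPy transaction matrix; the function and variable names are hypothetical. It counts supporting transactions for an itemset and shows how a single flipped bit per row hides the true pattern, loosely mirroring the matrices sketched on the slide.

```python
import numpy as np

def strict_support(data, itemset):
    """Count transactions (rows) containing every item (column index) in `itemset`."""
    return int(np.all(data[:, itemset] == 1, axis=1).sum())

# Four transactions over four items: the true all-ones pattern with one bit
# flipped to 0 in each row (simulated noise).
noisy = np.array([[0, 1, 1, 1],
                  [1, 0, 1, 1],
                  [1, 1, 0, 1],
                  [1, 1, 1, 0]])

print(strict_support(noisy, [0, 1, 2, 3]))  # 0: the full 4-item pattern is never supported
print(strict_support(noisy, [2, 3]))        # 2: only fragments of the pattern survive
```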

4 Background
- Weak ETI [Yang et al 2001]: the pattern sub-matrix as a whole should be no more than ε sparse (at most a fraction ε of zeros overall).
- Strong ETI ('GETI') [Yang et al 2001]: each row of the pattern sub-matrix should be no more than ε sparse.
- Recursive Weak ETI ('RW') [Seppanen et al 2004]: the itemset and all of its subsets should also be weak ETIs.
- AFI ('AFI') [Liu et al 2006]: adds a column constraint to the strong ETI concept to reduce spurious items.
- AC-CLOSE ('AC-CLOSE') [Cheng et al 2006]: uses Apriori to find a core pattern at low support and then builds on it to find AFI-type patterns.
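To contrast these definitions, the following sketch (my own illustration, not the authors' code) checks the weak ETI, strong ETI, and AFI-style conditions on the sub-matrix (block) induced by a candidate set of transactions and items; eps_r and eps_c follow the row/column tolerance notation used later in the talk, and all names are hypothetical.

```python
import numpy as np

def is_weak_eti(block, eps):
    """Weak ETI: the block as a whole has at most a fraction eps of zeros."""
    return bool((block == 0).mean() <= eps)

def is_strong_eti(block, eps_r):
    """Strong ETI: every row (transaction) has at most a fraction eps_r of zeros."""
    return bool(((block == 0).mean(axis=1) <= eps_r).all())

def is_afi_block(block, eps_r, eps_c):
    """AFI-style block: the row tolerance eps_r and column tolerance eps_c both hold."""
    row_ok = ((block == 0).mean(axis=1) <= eps_r).all()
    col_ok = ((block == 0).mean(axis=0) <= eps_c).all()
    return bool(row_ok and col_ok)

block = np.array([[1, 1, 1, 0],
                  [1, 1, 1, 1],
                  [1, 0, 1, 1]])
print(is_weak_eti(block, 0.25))         # True: 2/12 zeros overall
print(is_strong_eti(block, 0.25))       # True: each row is at most 1/4 sparse
print(is_afi_block(block, 0.25, 0.25))  # False: two columns are 1/3 sparse, exceeding eps_c
```

The last call shows why the column constraint can prune items that a strong ETI alone would keep.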

5 Motivation
- A comprehensive and systematic evaluation of these algorithms is lacking.
  - Each proposed algorithm is shown to be far superior to the previously proposed ones.
  - However, this may not hold in general, since the performance of each algorithm is sensitive to the input parameters and to the noise in the data.
  - Different algorithms may require different parameter settings (minsup and error tolerances) to discover the true patterns.
  - Most existing evaluations have been confined to a small set of parameters, often those best suited to the newly introduced algorithm.

6 Contributions
- Performed a comprehensive evaluation of various approximate pattern mining algorithms.
  - Compared the algorithms based on the quality of the discovered patterns and robustness to the input parameters.
  - Highlighted the importance of choosing optimal parameters, which may differ for each algorithm.
- Proposed variations of the existing algorithms to generate higher-quality patterns.

7 Proposed Variations of Algorithms
- 'GETI-PP' uses an additional parameter (ε_c) to ensure that every item in each strong ETI generated by the greedy algorithm 'GETI' also satisfies the column constraint.
- 'RW-PP' uses an additional parameter (ε_c) to ensure that every item in each recursive weak ETI generated by the algorithm 'RW' also satisfies the column constraint.
- An itemset I' is said to be a recursive strong ETI ('RS') if I' and all subsets of I' are strong ETIs.
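A rough sketch of the '-PP' idea as I read it from this slide (not the authors' implementation, and the greedy pruning order is my assumption): items whose column is too sparse over the pattern's supporting transactions are removed until every remaining item satisfies the column tolerance eps_c. All names are hypothetical.

```python
import numpy as np

def column_postprocess(data, rows, items, eps_c):
    """Remove items whose column, restricted to the pattern's supporting rows,
    is more than eps_c sparse; repeat until all remaining items satisfy eps_c."""
    items = list(items)
    while items:
        block = data[np.ix_(rows, items)]
        sparsity = (block == 0).mean(axis=0)   # per-item fraction of zeros
        if (sparsity <= eps_c).all():
            return items                       # every item passes the column constraint
        del items[int(np.argmax(sparsity))]    # greedily drop the sparsest item (assumption)
    return items

data = np.array([[1, 1, 0, 1],
                 [1, 1, 1, 0],
                 [1, 1, 1, 0],
                 [1, 1, 1, 0]])
print(column_postprocess(data, rows=[0, 1, 2, 3], items=[0, 1, 2, 3], eps_c=0.25))
# [0, 1, 2]: item 3 is 3/4 sparse over these rows and is pruned
```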

8 Evaluation Methodology
- Recoverability: quantifies how well an approximate pattern mining algorithm recovers the base patterns [Kum et al 2002].
- Spuriousness: measures the quality of the found patterns [Kum et al 2002].

Example (Base patterns B, Found patterns F; the *# markers flag the best matches used below):

F \ B          B1={a,b,c,d}    B2={a,e,f}
F1={a,b,d}     {a,b,d} *#      {a}
F2={a,e,f,g}   {a}             {a,e,f} *#
F3={a,e,g,h}   {a}             {a,e}

Recoverability = (3+3)/(|B1|+|B2|) = 6/7
Spuriousness = (0+1+2)/(|F1|+|F2|+|F3|) = 3/11
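The following sketch reproduces the 6/7 and 3/11 values of the worked example; it encodes my reading of the two measures (each base or found pattern is matched to its best-overlapping counterpart), not the authors' code.

```python
def recoverability(base, found):
    """Fraction of base-pattern items recovered by each base pattern's best-matching found pattern."""
    recovered = sum(max(len(b & f) for f in found) for b in base)
    return recovered / sum(len(b) for b in base)

def spuriousness(base, found):
    """Fraction of found-pattern items not covered by each found pattern's best-matching base pattern."""
    spurious = sum(len(f) - max(len(f & b) for b in base) for f in found)
    return spurious / sum(len(f) for f in found)

base  = [{'a', 'b', 'c', 'd'}, {'a', 'e', 'f'}]                        # B1, B2
found = [{'a', 'b', 'd'}, {'a', 'e', 'f', 'g'}, {'a', 'e', 'g', 'h'}]  # F1, F2, F3
print(recoverability(base, found))  # 6/7 ~ 0.857
print(spuriousness(base, found))    # 3/11 ~ 0.273
```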

9 Evaluation Methodology (Cont.)
- Significance: balances the trade-off between recoverability and spuriousness.
- Redundancy: quantifies the similarity among found patterns, and hence the actual number of useful patterns.
- Robustness: since the ground truth for a real dataset is not known, algorithms that generate acceptable-quality patterns over a relatively wide range of input parameters are favored.

Example pairwise overlaps among the found patterns:

F \ F          F1={a,b,d}   F2={a,e,f,g}   F3={a,e,g,h}
F1={a,b,d}     -            1              1
F2={a,e,f,g}   1            -              3
F3={a,e,g,h}   1            3              -

Redundancy = (1+1+3) = 5
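A minimal sketch of the redundancy computation as suggested by the example (sum of pairwise item overlaps among the found patterns); this is my reading of the slide, not the authors' code.

```python
from itertools import combinations

def redundancy(found):
    """Sum of pairwise overlaps (shared items) over all pairs of found patterns."""
    return sum(len(f1 & f2) for f1, f2 in combinations(found, 2))

found = [{'a', 'b', 'd'}, {'a', 'e', 'f', 'g'}, {'a', 'e', 'g', 'h'}]
print(redundancy(found))  # 5 = |F1 & F2| + |F1 & F3| + |F2 & F3| = 1 + 1 + 3
```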

10 Experimental Set-Up
- Implemented AFI, GETI, RW, GETI-PP and RW-PP.
- AC-CLOSE was simulated using AFI: patterns generated by AFI that also contain a core block are essentially the same as those generated by AC-CLOSE.
- Datasets used for the experiments:
  - 8 different synthetic datasets, which are easy to evaluate on because the ground truth is known
  - the real zoo dataset
- Only maximal patterns were used for the comparison.
- For each noisy dataset, each algorithm is run over a wide range of parameter values (minsup, ε_r and ε_c).
- As the noise is random, we repeat the process of adding noise and running the algorithm 5 times and report average results.
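As an illustration of this set-up, here is a hedged sketch of injecting random noise and sweeping a (minsup, eps_r, eps_c) grid with 5 repetitions per setting. The independent bit-flip noise model, the `mine` callable (a stand-in for any of the algorithms, returning a quality score), and all other names are assumptions, not the authors' code.

```python
import numpy as np
from itertools import product

def add_noise(data, p, rng):
    """Flip each 0/1 entry of the matrix independently with probability p."""
    flips = rng.random(data.shape) < p
    return np.where(flips, 1 - data, data)

def sweep(data_clean, mine, minsups, eps_rs, eps_cs, p, repeats=5, seed=0):
    """Average the score returned by `mine` over `repeats` noise injections
    for every (minsup, eps_r, eps_c) combination."""
    rng = np.random.default_rng(seed)
    results = {}
    for minsup, eps_r, eps_c in product(minsups, eps_rs, eps_cs):
        scores = [mine(add_noise(data_clean, p, rng), minsup, eps_r, eps_c)
                  for _ in range(repeats)]
        results[(minsup, eps_r, eps_c)] = float(np.mean(scores))
    return results
```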

11 Synthetic Datasets (No Noise Version)

12 Results on Synthetic Dataset #8
- The performance of all the algorithms deteriorates as the noise increases.
- GETI and RW (which use a single error-tolerance parameter ε_r) tend to pick up more spurious items.
- GETI-PP and RW-PP (the proposed variations of GETI and RW) generate patterns comparable in quality to AFI.
- AC-CLOSE seems to be unstable.
- GETI-PP and RW-PP have more stable performance than GETI and RW.

13 Results on Synthetic Dataset #6

14 Optimal Parameters for Different Algorithms for Synthetic Dataset #6

Noise levels (%): 0, 4, 8, 12, 16
- APRIORI (sup): (100), (150), (100)
- GETI (sup, ε_r): (100,0), (150,0), (100,0), (125,0.05), (100,0.05), (125,0.1), (100,0.1)
- GETI-PP (sup, ε_r, ε_c): (100,0,0), (150,0,0), (100,0,0), (150,0.2,0.05), (100,0.05,0.05), (150,0.2,0.1), (125,0.2,0.1)
- RW (sup, ε_r): (100,0), (200,0.25), (100,0), (200,0.25), (100,0.05), (175,0.2), (200,0.2)
- RW-PP (sup, ε_r, ε_c): (100,0,0), (150,0,0), (100,0,0), (125,0.2,0.05), (100,0.2,0.05), (125,0.2,0.1), (125,0.25,0.1)
- AFI (sup, ε_r, ε_c): (100,0,0), (150,0,0), (100,0,0), (125,0.2,0.1), (100,0.2,0.1), (125,0.2,0.1), (125,0.2,0.2)
- AC-CLOSE (sup, ε_r, ε_c, a): (100,0,0,1), (150,0,0,1), (100,0,0,1), (100,0.2,0.1,0.9), (125,0.2,0.2,0.5), (125,0.2,0.2,0.4)

15 Effects of More Random Noise
- In the previous experiments, the amount of random noise was constrained by the smallest implanted true pattern in the data.
- In some real-life applications, small but true patterns can be hidden in even larger amounts of random noise.
- The trade-off between recoverability and spuriousness becomes even more challenging.
- High-noise experiments are computationally expensive:
  - Scheme 1: for algorithms that could finish within the time limit (= 1 hour), the complete parameter space is searched.
  - Scheme 2: for algorithms that could not finish within the time limit (= 1 hour), the parameter space is searched in order of complexity.
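The slides do not say how the one-hour limit was enforced; one simple way (a sketch under my own assumptions, with hypothetical names) is to run each parameter combination in a child process and abandon it when the budget expires.

```python
import multiprocessing as mp

def run_with_timeout(func, args, limit_seconds=3600):
    """Run func(*args) in a child process; return its result, or None if it
    does not finish within limit_seconds (the child is then terminated)."""
    with mp.Pool(processes=1) as pool:
        pending = pool.apply_async(func, args)
        try:
            return pending.get(timeout=limit_seconds)
        except mp.TimeoutError:
            return None  # treated as "did not finish", like the dashes in the run-time table
```

Note that `func` must be a top-level function so it can be pickled for the child process.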

16 Results on Dataset #8 for More Random Noise

17 Efficiency and Scalability
- All algorithms were run on a Linux machine with 8 Intel(R) Xeon(R) E5310 CPUs @ 1.60GHz (using 10 processes).
- The total run-time over 144 parameter combinations (4 minsup values and 6 different row and column tolerance values) is reported in seconds.

Results on synthetic dataset 8, run-time in seconds for noise levels 0% to 20% in steps of 2% (a dash marks runs that did not finish):
- AFI: 11 121112211453 ----
- GETI-PP: 20, 26, 87, 135, 292, 296, 406, 411, 420, 429, 427
- RW-PP: 11 1214162981267700 -

- Though AFI is computationally more efficient than GETI-PP on less noisy datasets, it becomes very expensive as the noise increases.
- GETI-PP and RW-PP (the simple proposed variations) are computationally efficient and generate patterns of quality comparable to AFI.
- RW-PP shows only a marginal increase in run-time as the noise increases from 0% to 12%, after which its run-time increases rapidly.
- GETI-PP seems to scale well as the noise increases.

18 Conclusions
- Different algorithms have different sweet spots in the parameter space.
- If optimal parameters are provided, the difference in their performance is not as large as was believed in the past.
- Enforcing the column tolerance ε_c improves the performance of the single-parameter algorithms GETI and RW.
- AFI, GETI-PP and RW-PP all use the column constraint ε_c and generate patterns of similar quality. However, GETI-PP and RW-PP are computationally more efficient than AFI, especially for very noisy datasets.

For source code and datasets, visit: http://www.cs.umn.edu/~kumar/ETI

19 References
- C. Blake and C. Merz. UCI repository of machine learning databases. University of California, Irvine, 2007.
- H. Cheng, P. S. Yu, and J. Han. AC-Close: Efficiently mining approximate closed itemsets by core pattern recovery. In ICDM, pages 839–844, 2006.
- H. Cheng, P. S. Yu, and J. Han. Approximate frequent itemset mining in the presence of random noise. In Soft Computing for Knowledge Discovery and Data Mining, O. Maimon and L. Rokach (eds.), pages 363–389. Springer, 2008.
- H.-C. Kum, S. Paulsen, and W. Wang. Comparative study of sequential pattern mining frameworks: support framework vs. multiple alignment framework. In ICDM Workshop, 2002.
- J. Liu, S. Paulsen, X. Sun, W. Wang, A. Nobel, and J. Prins. Mining approximate frequent itemsets in the presence of noise: algorithm and analysis. In SDM, pages 405–416, 2006.
- J. Liu, S. Paulsen, W. Wang, A. Nobel, and J. Prins. Mining approximate frequent itemsets from noisy data. In ICDM, pages 721–724, 2005.
- J. Pei, A. K. H. Tung, and J. Han. Fault-tolerant frequent pattern mining: Problems and challenges. In DMKD, 2001.
- J. Seppanen and H. Mannila. Dense itemsets. In KDD, pages 683–688, 2004.
- M. Steinbach, P.-N. Tan, and V. Kumar. Support envelopes: A technique for exploring the structure of association patterns. In KDD, pages 296–305, 2004.
- P.-N. Tan, M. Steinbach, and V. Kumar. Introduction to Data Mining. Pearson Addison-Wesley, May 2005.
- C. Yang, U. Fayyad, and P. Bradley. Efficient discovery of error-tolerant frequent itemsets in high dimensions. In KDD, pages 194–203, 2001.

20 Thank You
Questions?
All the source code, datasets and results are publicly available at: http://www.cs.umn.edu/~kumar/ETI

