Frequent Closed Pattern Search By Row and Feature Enumeration

Outline
Problem Definition
Related Work: Feature Enumeration Algorithms
CARPENTER: Row Enumeration Algorithm
COBBLER: Combined Enumeration Algorithm

Problem Definition
Frequent closed pattern:
1) Frequent pattern: a pattern whose support is no lower than a user-specified threshold.
2) Closed pattern: a pattern that has no proper superset with the same support.
Problem definition: given a dataset D whose records consist of features, discover all frequent closed patterns with respect to a user-defined support threshold.
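
To make the definition concrete, here is a brute-force sketch that enumerates all feature subsets; it is exponential and purely illustrative, not CARPENTER itself. The toy table is the four-row example used in the CARPENTER slides below:

```python
# Brute-force closed frequent pattern mining: enumerate all feature subsets,
# keep those meeting minsup, discard patterns with an equal-support superset.
from itertools import combinations

def closed_frequent_patterns(dataset, minsup):
    """dataset: dict mapping row id -> set of features; minsup: int."""
    features = sorted(set.union(*dataset.values()))
    support = {}
    for k in range(1, len(features) + 1):
        for pattern in combinations(features, k):
            s = sum(1 for row in dataset.values() if set(pattern) <= row)
            if s >= minsup:
                support[frozenset(pattern)] = s
    # closed: no proper superset has the same support
    return {p: s for p, s in support.items()
            if not any(p < q and s == support[q] for q in support)}

table = {'r1': {'a', 'b', 'c'}, 'r2': {'b', 'c', 'd'},
         'r3': {'b', 'c', 'd'}, 'r4': {'d'}}
print(closed_frequent_patterns(table, minsup=2))
# -> {b,c}: 3, {d}: 3, {b,c,d}: 2
```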

Related Work
Searching strategy: breadth-first & depth-first search.
Data format: horizontal format & vertical format.
Data compression method: diffset, FP-tree, etc.
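
As a hedged illustration of the diffset idea (from CHARM/dEclat by Zaki and Gouda): instead of storing the full tidset of an extended pattern PX, store only the difference d(PX) = t(P) − t(X), from which the support follows:

```python
# Tidsets for single features, taken from the example table used later.
t = {'b': {'r1', 'r2', 'r3'}, 'c': {'r1', 'r2', 'r3'}, 'd': {'r2', 'r3', 'r4'}}

# Diffset of pattern bd w.r.t. prefix b: rows containing b but not d.
d_bd = t['b'] - t['d']            # {'r1'}
sup_bd = len(t['b']) - len(d_bd)  # support(bd) = support(b) - |d(bd)| = 3 - 1 = 2
print(d_bd, sup_bd)
```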

Typical Algorithms
CLOSET: feature enumeration, horizontal format, depth-first search, FP-tree technique.
APRIORI: feature enumeration, horizontal format, breadth-first search.
CHARM: feature enumeration, vertical format, depth-first search, diffset technique.

CARPENTER
CARPENTER stands for Closed Pattern Discovery by Transposing Tables that are Extremely Long.
Outline: Motivation, Algorithm, Prune Method, Experiment.

Motivation
Bioinformatic datasets typically contain a large number of features but a small number of rows. The running time of most previous algorithms increases exponentially with the average transaction length. CARPENTER's search space is much smaller than that of the previous algorithms on this kind of dataset, and it therefore performs better.

Algorithm
The main idea of CARPENTER is to mine the dataset row-wise, in two steps:
First, transpose the dataset.
Second, search the row enumeration tree.

Transpose Table
Features: a, b, c, d. Rows: r1, r2, r3, r4.
Original table: r1 {a, b, c}; r2 {b, c, d}; r3 {b, c, d}; r4 {d}.
Transposed table: a {r1}; b {r1, r2, r3}; c {r1, r2, r3}; d {r2, r3, r4}.
Projected table: projecting on a row set, e.g. (r2 r3), keeps, for each feature containing all of those rows, only the row ids that come after the last one; here d {r4}.
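
A minimal sketch of the transposition step, assuming the table is stored as a dict from row id to feature set (the helper name is ours):

```python
from collections import defaultdict

def transpose(table):
    """Map each feature to the ordered list of row ids that contain it."""
    transposed = defaultdict(list)
    for row_id in sorted(table):          # keep the pre-defined row order
        for feature in table[row_id]:
            transposed[feature].append(row_id)
    return dict(transposed)

table = {'r1': {'a', 'b', 'c'}, 'r2': {'b', 'c', 'd'},
         'r3': {'b', 'c', 'd'}, 'r4': {'d'}}
print(transpose(table))
# {'a': ['r1'], 'b': ['r1','r2','r3'], 'c': ['r1','r2','r3'],
#  'd': ['r2','r3','r4']} (key order may vary)
```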

Row Enumeration Tree
According to the transposed table, we build the row enumeration tree, which enumerates row ids in a pre-defined order. We do a depth-first search in the row enumeration tree without any pruning strategy.
[Figure: row enumeration tree for the example with minsup = 2. Each node is a row set labelled with the features shared by its rows, e.g. r1 {abc}, r1r2 {bc}, r1r2r3 {bc}, r2 {bcd}, r2r3 {bcd}, r2r3r4 {d}, r3 {bcd}, r3r4 {d}, r4 {d}. The frequent closed patterns found are bc: r1r2r3, bcd: r2r3, d: r2r3r4.]
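
The unpruned search can be sketched as follows: every row set whose rows all share a non-empty feature set yields a closed pattern, and its support is the number of rows containing that pattern. This is a runnable illustration of the enumeration, not the actual CARPENTER implementation, which works on projected transposed tables instead:

```python
def mine_by_row_enumeration(table, minsup):
    """Depth-first row enumeration; reports closed patterns with support."""
    rows = sorted(table)
    results = {}

    def shared_features(row_set):
        # features common to every row in the set (the node's label)
        return frozenset(set.intersection(*(table[r] for r in row_set)))

    def dfs(current, start):
        if len(current) >= minsup:
            label = shared_features(current)
            if label:
                # support = number of rows containing the label
                results[label] = sum(1 for r in rows if label <= table[r])
        for i in range(start, len(rows)):
            dfs(current + [rows[i]], i + 1)

    dfs([], 0)
    return results

table = {'r1': {'a', 'b', 'c'}, 'r2': {'b', 'c', 'd'},
         'r3': {'b', 'c', 'd'}, 'r4': {'d'}}
print(mine_by_row_enumeration(table, minsup=2))
# -> {b,c}: 3, {b,c,d}: 2, {d}: 3, matching the tree above
```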

Prune Method 1
In the enumeration tree, the depth of a node equals the support of its corresponding itemset. Prune a branch if it cannot go deep enough, i.e. if the node's depth plus the number of rows that can still be added below it is less than the minimum support, since no pattern found in that branch can reach minsup.
Example (minsup = 4): node r2 {bcd} has depth 1 (support 1) and only 2 sub-nodes (r3, r4) below it, so the maximum support reachable in branch r2 is 3; the branch is pruned.
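
As a sketch, the check at a node reduces to comparing the node's depth plus the rows still available against minsup (the function name is ours):

```python
def can_reach_minsup(depth, remaining, minsup):
    """Largest support any descendant can reach is depth + remaining."""
    return depth + remaining >= minsup

# Example from the slide: at node r2 (depth 1) only r3 and r4 remain,
# so the best possible support is 1 + 2 = 3 < minsup = 4 -> prune.
print(can_reach_minsup(1, 2, 4))  # False
```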

Prune Method 2
If a row rj has 100% support in the projected table of ri (i.e. rj appears in every feature row-list of that projected table), prune the branch of rj.
Example: r3 has 100% support in the projected table of r2 (b {r3}, c {r3}, d {r3, r4}), therefore branch r2 r3 is pruned and the whole branch is reconstructed with r3 absorbed into r2's node.
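
A hedged sketch of this check: a row has 100% support in the projected table exactly when it occurs in every feature row-list, which we can find by intersecting the row-lists (the projected table of r2 below is taken from the slide):

```python
def fully_supported_rows(projected):
    """projected: dict feature -> list of remaining row ids.
    Returns rows present in every feature row-list (100% support)."""
    row_lists = [set(rows) for rows in projected.values()]
    return set.intersection(*row_lists) if row_lists else set()

# Projected transposed table of r2: b -> [r3], c -> [r3], d -> [r3, r4]
projected_r2 = {'b': ['r3'], 'c': ['r3'], 'd': ['r3', 'r4']}
print(fully_supported_rows(projected_r2))  # {'r3'}: absorb r3, skip branch r2r3
```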

Prune Method 3
At any node in the enumeration tree, if the corresponding itemset of the node has been found before, we prune the branch rooted at this node.
Example: since itemset {bcd} has already been found under r2, the branch rooted at r3 {bcd} is pruned.
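
A minimal sketch of this check using a set of already-reported itemsets (the bookkeeping structure is ours; CARPENTER's actual implementation may differ):

```python
found = set()  # itemsets reported so far during the depth-first search

def should_prune(itemset):
    """Prune a node whose itemset has already been discovered."""
    key = frozenset(itemset)
    if key in found:
        return True   # subtree would only rediscover known patterns
    found.add(key)
    return False

print(should_prune({'b', 'c', 'd'}))  # False: first time (node r2), explore
print(should_prune({'b', 'c', 'd'}))  # True: seen before, prune branch r3
```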

Performance
We compare 3 algorithms: CARPENTER, CHARM and CLOSET. The dataset (Lung Cancer) has 181 rows and 12533 features. We vary 3 parameters: minsup, length ratio and row ratio.

minsup
Lung Cancer, 181 rows, length ratio 0.6, row ratio 1. The running time of CARPENTER changes from 3 to 14 seconds.

Length Ratio
Lung Cancer, 181 rows, sup 7 (4%), row ratio 1. The running time of CARPENTER changes from 3 to 33 seconds.

Row Ratio
Lung Cancer, 181 rows, length ratio 0.6, sup 7 (4%). The running time of CARPENTER changes from 9 to 178 seconds.

Conclusion
We propose an algorithm called CARPENTER for finding closed patterns in long biological datasets. CARPENTER performs row enumeration instead of column enumeration, since the number of rows in such datasets is significantly smaller than the number of features. Performance studies show that CARPENTER is much more efficient at finding closed patterns than existing feature enumeration algorithms.

COBBLER
Outline: Motivation, Algorithm, Performance.

Motivation
With the development of CARPENTER, existing algorithms can be separated into two classes:
Feature enumeration: CHARM, CLOSET, etc.
Row enumeration: CARPENTER.
We have two motivations for combining these two enumeration methods.

Motivation
1. The two enumeration methods have their own advantages on different types of dataset, and as a dataset is projected during the search, the characteristics of its sub-datasets may change, e.g. from more rows than features to more features than rows.
2. Given a dataset with both a large number of rows and a large number of features, a single row enumeration algorithm or a single feature enumeration algorithm cannot handle it well.

Algorithm
There are two main points in the COBBLER algorithm:
How to build an enumeration tree for COBBLER.
How to decide when the algorithm should switch from one enumeration method to the other.
Therefore, we introduce the dynamic enumeration tree and the switching condition.

Dynamic Enumeration Tree
We call the new kind of enumeration tree used in COBBLER the dynamic enumeration tree. In a dynamic enumeration tree, different sub-trees may use different enumeration methods.
Example table (used in the later discussion):
Original table: r1 {a, b, c}; r2 {a, c, d}; r3 {b, c}; r4 {d}.
Transposed table: a {r1, r2}; b {r1, r3}; c {r1, r2, r3}; d {r2, r4}.

Single Enumeration Tree
[Figure: the two single enumeration trees for the example table. Feature enumeration expands itemsets, e.g. a {r1r2}, ab {r1}, abc {r1}, ac {r1r2}, acd {r2}, b {r1r3}, bc {r1r3}, c {r1r2r3}, cd {r2}, d {r2r4}. Row enumeration expands row sets, e.g. r1 {abc}, r1r2 {ac}, r1r3 {bc}, r2 {acd}, r2r3 {c}, r2r4 {d}, r3 {bc}, r4 {d}.]

Dynamic Enumeration Tree
[Figure: switching from feature enumeration to row enumeration. The search starts by enumerating features, e.g. a {r1r2}, ab {r1}, abc {r1}, ac {r1r2}, acd {r2}; below some nodes it switches to enumerating the rows of the conditional transposed table, e.g. under a (b {r1}, c {r1, r2}, d {r2}): r1 {bc}, r2 {cd}, r1r2 {c}. Closed patterns found include abc: {r1}, ac: {r1r2}, acd: {r2}.]

Dynamic Enumeration Tree
[Figure: switching from row enumeration to feature enumeration. The search starts by enumerating rows, e.g. r1 {abc}, r1r2 {ac}, r1r3 {bc}, r2 {acd}, r3 {bc}, r4 {d}; below some nodes it switches to enumerating features of the projected table, e.g. under r1: a {r2}, b {r3}, c {r2r3}, ac {r2}, bc {r3}. Closed patterns found include ac: {r1r2}, bc: {r1r3}, c: {r1r2r3}.]

Dynamic Enumeration Tree
When we use different conditions to decide the switching, the structure of the dynamic enumeration tree changes. No matter how it switches, however, the result set of closed patterns is the same as the result of a single enumeration.

Switching Condition
The main idea of the switching condition is to estimate the processing time of an enumeration sub-tree, i.e. a row enumeration sub-tree or a feature enumeration sub-tree, and choose the cheaper method. We first define some notation: let r be the number of rows in the current table, and S(f) the fraction of those rows that contain feature f.

Switching Condition
With features sorted by support ratio, the estimated support of a node f1 f2 ... fk in a feature enumeration sub-tree is r × S(f1) × S(f2) × ... × S(fk); the estimated deepest node under f1 is the longest such prefix whose estimate still meets minsup.

Switching Condition
Suppose r = 10, S(f1) = 0.8, S(f2) = 0.5, S(f3) = 0.5, S(f4) = 0.3 and minsup = 2. Then the estimated deepest node under f1 is f1 f2 f3, since
S(f1) × S(f2) × S(f3) × r = 2 ≥ minsup, while
S(f1) × S(f2) × S(f3) × S(f4) × r = 0.6 < minsup.
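
The numbers above can be reproduced with a short sketch of the depth estimate (the function name is ours; COBBLER's full switching condition estimates the processing time of whole sub-trees, which this does not attempt):

```python
def estimated_deepest_prefix(support_ratios, r, minsup):
    """Longest prefix f1..fk with r * S(f1) * ... * S(fk) >= minsup,
    treating feature occurrences as independent."""
    prefix, estimate = [], float(r)
    for f, s in support_ratios:
        if estimate * s < minsup:
            break
        estimate *= s
        prefix.append(f)
    return prefix, estimate

ratios = [('f1', 0.8), ('f2', 0.5), ('f3', 0.5), ('f4', 0.3)]
print(estimated_deepest_prefix(ratios, r=10, minsup=2))
# (['f1', 'f2', 'f3'], 2.0): adding f4 would drop the estimate to 0.6 < minsup
```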

Experiments
We compare 3 algorithms: COBBLER, CHARM and CLOSET+, on one real-life dataset and one synthetic dataset. We vary 3 parameters: minsup, length ratio and row ratio.

minsup
[Figures: running time vs. minsup on the synthetic dataset and on the real-life dataset (thrombin).]

Length and Row Ratio
[Figures: running time vs. length ratio and row ratio on the synthetic dataset.]

Discussion
The combination of row and feature enumeration also brings some disadvantages:
The cost of evaluating the switching condition, and the cost of bad switching decisions.
Increased pruning cost, since two sets of pruning strategies must be maintained.

Discussion
We may use other, more sophisticated data structures in our algorithm to improve performance, e.g. the vertical data format and the diffset technique. A more efficient switching condition may improve the algorithm further.

Conclusion
The COBBLER algorithm performs better on datasets where the advantage of switching can show itself, e.g. complex datasets or datasets with both a large number of rows and a large number of features. For datasets with simple characteristics, a single enumeration algorithm may be better.

Future Work
Use other data structures and techniques in the algorithm. Extend COBBLER to handle datasets that cannot fit into memory.

Thanks