Finding Frequent Itemsets by Transaction Mapping Mingjun Song, Sanguthevar Rajasekaran Proceedings of the 2005 ACM Symposium on Applied Computing Presenter: 林靜怡 2006/01/13
Introduction The Apriori algorithm needs many database scans. In each scan, frequent itemsets are found by pattern matching, which is time-consuming when the frequent itemsets are large and the patterns are long.
TM Algorithm Uses a vertical database representation. Transaction mapping: the transaction ids of each itemset are mapped and compressed into continuous transaction intervals in a different space, which reduces the number of intersections.
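As a rough illustration of the vertical representation mentioned above, the sketch below turns a list of transactions into per-item lists of transaction ids. The function name `to_vertical` and the eight-transaction database `db` are hypothetical; they are not taken from the paper or the slides.

```python
from collections import defaultdict

def to_vertical(transactions):
    """Map each item to the sorted list of ids of the transactions containing it."""
    tidlists = defaultdict(list)
    # Transaction ids are assigned 1, 2, 3, ... in input order (an assumption).
    for tid, items in enumerate(transactions, start=1):
        for item in set(items):
            tidlists[item].append(tid)
    return dict(tidlists)

# Hypothetical toy database.
db = [{1, 2, 3}, {1, 2, 3}, {1, 3}, {1}, {1}, {2, 3}, {2, 4}, {2, 4}]
vertical = to_vertical(db)
for item in sorted(vertical):
    print(item, vertical[item], "support =", len(vertical[item]))
```

In this layout the support of an itemset is the size of the intersection of its items' id lists; TM replaces the id lists with intervals so that fewer elements have to be intersected.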
Lexicographic Prefix Tree
Lexicographic Prefix Tree (cont.) The tree is used to generate candidate itemsets and test their frequency. Each node in the tree stores a collection of frequent itemsets.
Lexicographic Prefix Tree (cont.) Depth-first search: if the expansion of a node cannot possibly lead to itemsets that have minimum support, the node is not expanded and the search backtracks. When a frequent itemset that meets the minimum support requirement is found, it is output.
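The depth-first search with backtracking can be sketched as follows. For brevity this sketch intersects plain Python sets of transaction ids instead of TM's interval lists, and the names `dfs_frequent_itemsets` and `tidsets` are hypothetical, as is the toy data.

```python
def dfs_frequent_itemsets(tidsets, min_sup):
    """Depth-first enumeration; tidsets maps item -> set of transaction ids."""
    items = sorted(i for i, t in tidsets.items() if len(t) >= min_sup)
    found = []

    def expand(prefix, prefix_tids, candidates):
        for pos, item in enumerate(candidates):
            tids = prefix_tids & tidsets[item]
            if len(tids) < min_sup:
                continue  # prune: no extension of prefix + item can be frequent
            itemset = prefix + (item,)
            found.append((itemset, len(tids)))  # output the frequent itemset
            # Extend only with items that come later in lexicographic order.
            expand(itemset, tids, candidates[pos + 1:])

    all_tids = set().union(*tidsets.values())
    expand((), all_tids, items)
    return found

# Hypothetical tid sets for four items over eight transactions.
tidsets = {1: {1, 2, 3, 4, 5}, 2: {1, 2, 6, 7, 8}, 3: {1, 2, 3, 6}, 4: {7, 8}}
result = dfs_frequent_itemsets(tidsets, min_sup=2)
for itemset, sup in result:
    print(itemset, sup)
```

Because support can only shrink as items are added, skipping an infrequent node also skips its whole subtree.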
Transaction Mapping Scan through the database once, identify all frequent 1-itemsets, and sort them in descending order of frequency.
Transaction Mapping min_sup = 2. Item supports: sup{1} = 5, sup{2} = 5, sup{3} = 4, sup{4} = 2, sup{5} = 1, sup{6} = 1, ..., sup{20} = 1. Identify all frequent 1-itemsets. Frequent 1-itemsets: 1, 2, 3, 4
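The first scan can be sketched as below. The database `db` is hypothetical: it was chosen only so that items 1 to 4 get the supports listed on the slide, with items 5 and 6 standing in for the infrequent ones. Breaking ties by item id is also an assumption; the slides do not say how ties are ordered.

```python
from collections import Counter

def frequent_items(transactions, min_sup):
    """Return (item, support) pairs with support >= min_sup, most frequent first."""
    counts = Counter(item for t in transactions for item in set(t))
    frequent = [(item, c) for item, c in counts.items() if c >= min_sup]
    # Descending support; ties broken by ascending item id (assumption).
    frequent.sort(key=lambda pair: (-pair[1], pair[0]))
    return frequent

# Hypothetical database of eight transactions.
db = [{1, 2, 3, 5}, {1, 2, 3}, {1, 3, 6}, {1}, {1}, {2, 3}, {2, 4}, {2, 4}]
f1 = frequent_items(db, min_sup=2)
print(f1)
```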
Transaction Mapping (cont.) Scan through the database again. For each transaction, select the items that are frequent 1-itemsets, sort them according to the order of the frequent 1-itemsets, and insert them into the transaction tree.
Transaction Tree At the beginning, the root is the current node. For each item of the sorted transaction: if the current node has a child node whose id is equal to this item, increment the count of this child by 1; otherwise create a new child node and set its counter to 1. The child then becomes the current node.
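The insertion rule can be sketched as follows. The `Node` class, the function `build_transaction_tree`, and the toy database are hypothetical; the sketch assumes that children are kept in the order in which they were first created.

```python
class Node:
    def __init__(self, item, count):
        self.item = item
        self.count = count
        self.children = {}  # item -> Node, kept in creation order

def build_transaction_tree(transactions, order):
    """order lists the frequent items, most frequent first."""
    rank = {item: r for r, item in enumerate(order)}
    root = Node(None, 0)
    for t in transactions:
        current = root  # every transaction starts at the root
        # Keep only frequent items, sorted by the frequent 1-itemset order.
        for item in sorted((i for i in set(t) if i in rank), key=rank.get):
            child = current.children.get(item)
            if child is None:
                child = Node(item, 1)  # new child, counter set to 1
                current.children[item] = child
            else:
                child.count += 1       # existing child, increment its count
            current = child
    return root

# Hypothetical database; items 5 and 6 are infrequent and get filtered out.
db = [{1, 2, 3, 5}, {1, 2, 3}, {1, 3, 6}, {1}, {1}, {2, 3}, {2, 4}, {2, 4}]
root = build_transaction_tree(db, order=[1, 2, 3, 4])
print({item: node.count for item, node in root.children.items()})
```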
Transaction Tree (figure; node labels are item:count) root, 1:1, 2:1, 2:1, 3:1, 3:1, 4:1, 3:1
Node Interval Each node u has an associated interval [s, e], where s is the relabeled start id and e is the relabeled end id. If u is the first child of its parent, s = start id of u's parent; otherwise s = end id of u's previous sibling + 1. In both cases e = start id of u + counter of u - 1.
Node Interval (example) Intervals in the tree: [1,5] [6,8] [1,2] [3,3] [6,6] [7,8] [1,2]. First child: s = 1, e = 1 + 5 - 1 = 5. First child: s = 1, e = 1 + 2 - 1 = 2. First child: s = 1, e = 1 + 2 - 1 = 2. Not first child: s = 2 + 1 = 3, e = 3 + 1 - 1 = 3.
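The interval rule can be checked with a short sketch. The tree below is a hypothetical reading of the example, written as nested `(item, count, children)` tuples, and the root is assumed to start at id 1; `assign_intervals` is an illustrative name.

```python
def assign_intervals(node, start, out):
    """node = (item, count, children); appends (item, (s, e)) for each non-root node."""
    _, _, children = node
    next_start = start  # the first child starts where its parent starts
    for child in children:
        c_item, c_count, _ = child
        s = next_start
        e = s + c_count - 1
        out.append((c_item, (s, e)))
        assign_intervals(child, s, out)
        next_start = e + 1  # the next sibling starts after this child's end
    return out

# Hypothetical tree: root -> 1:5 (-> 2:2 (-> 3:2), 3:1) and 2:3 (-> 3:1, 4:2).
tree = (None, 8, [
    (1, 5, [(2, 2, [(3, 2, [])]), (3, 1, [])]),
    (2, 3, [(3, 1, []), (4, 2, [])]),
])
intervals = assign_intervals(tree, 1, [])
for item, interval in intervals:
    print(item, interval)
```

Collecting the intervals per item gives each item a short interval list, for example [1,2] and [6,8] for item 2 in this tree.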
output min_sup = 2 (figure: depth-first search over the prefix tree of items 1, 2, 3, 4). Annotations recoverable from the figure: {1,2}: intersection [1,2], support >= 2. {1,2,3}: intersection [1,2], support >= 2. {1,2,4}: support < 2. {1,2,3,4}: support < 2. Other itemsets shown: {1,3}, {2,3}, {2,4}.
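The intersection step can be sketched as a merge over two sorted interval lists. The per-item lists in `lists` are an assumption: they are one reading of the node-interval example, not data stated on this slide, and the function names are illustrative.

```python
def intersect(a, b):
    """Intersect two sorted lists of disjoint closed intervals (s, e)."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        s = max(a[i][0], b[j][0])
        e = min(a[i][1], b[j][1])
        if s <= e:
            out.append((s, e))
        # Advance whichever interval ends first.
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return out

def support(intervals):
    """Number of relabeled transaction ids covered by the interval list."""
    return sum(e - s + 1 for s, e in intervals)

# Assumed interval lists per item.
lists = {1: [(1, 5)], 2: [(1, 2), (6, 8)], 3: [(1, 2), (3, 3), (6, 6)], 4: [(7, 8)]}
i12 = intersect(lists[1], lists[2])
i123 = intersect(i12, lists[3])
i124 = intersect(i12, lists[4])
print(i12, support(i12))
print(i123, support(i123))
print(i124, support(i124))
```

With min_sup = 2, {1,2} and {1,2,3} pass while {1,2,4} does not, which agrees with the annotations on the slide.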
Experiments OS: Windows 2000. CPU: Dell 2.4 GHz Pentium PC. RAM: 1 GB. Compiler: Visual C++.
Experiments synthetic data real data