
1 Mining Frequent Patterns without Candidate Generation: A Frequent-Pattern Tree Approach. Advisor: Dr. 廖述賢. Presenter: 朱佩慧. Class: Graduate Institute of Management Sciences, first-year doctoral student.

2 Agenda: Introduction (Review & Overview); Design and Construction (FP-tree); Mining Frequent Patterns (FP-growth); DB Projection; Experiments; Discussion.

3 Impact factor: 2.42 (2007). Subject category "Computer Science, Artificial Intelligence": rank 11 of 93. Subject category "Computer Science, Information Systems": rank 7 of 92.

4 Frequent Pattern Mining. Problem: given a transaction database DB and a minimum support threshold ξ, find all frequent patterns (itemsets) whose support is no less than ξ. How can all frequent patterns be found efficiently?

Input (minimum support ξ = 3):
TID   Items bought
100   {f, a, c, d, g, i, m, p}
200   {a, b, c, f, l, m, o}
300   {b, f, h, j, o}
400   {b, c, k, s, p}
500   {a, f, c, e, l, p, m, n}

Output: all frequent patterns, namely f:4, c:4, a:3, b:3, m:3, p:3, fc:3, fa:3, fm:3, ca:3, cm:3, cp:3, am:3, fca:3, fcm:3, fam:3, cam:3, fcam:3.
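As a sanity check on the problem statement (not part of the paper or the slides), the toy database above can be mined by brute force; the helper `support` and the enumeration loop below are our own illustrative code.

```python
from itertools import combinations

# The example transaction database and threshold from the slide.
DB = {
    100: {"f", "a", "c", "d", "g", "i", "m", "p"},
    200: {"a", "b", "c", "f", "l", "m", "o"},
    300: {"b", "f", "h", "j", "o"},
    400: {"b", "c", "k", "s", "p"},
    500: {"a", "f", "c", "e", "l", "p", "m", "n"},
}
MIN_SUPPORT = 3  # the minimum support threshold xi

def support(itemset):
    """Number of transactions containing every item of `itemset`."""
    return sum(1 for items in DB.values() if itemset <= items)

# Brute-force enumeration of all itemsets over the items that occur in DB.
items = sorted({i for t in DB.values() for i in t})
frequent = {}
for k in range(1, len(items) + 1):
    hits = {}
    for combo in combinations(items, k):
        s = support(frozenset(combo))
        if s >= MIN_SUPPORT:
            hits["".join(combo)] = s
    if not hits:   # no frequent k-itemset implies no frequent (k+1)-itemset
        break
    frequent.update(hits)

print(frequent)  # the patterns listed above, e.g. {'f': 4, ..., 'acfm': 3}
```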

5 Content breakdown. Total: 35 pages. Abstract: 15 lines. Introduction: 2 pages (6%). FP-tree (39%): construction 5 pages (14%), use 8.5 pages (24%). FP-growth: 3.5 pages (10%). Experiments: 8 pages (23%). Discussion: 3 pages (9%). Conclusion: 1 page (3%).

6 Paper contents (35 pages in total); the FP-tree material accounts for 38.5%.

7 Introduction. Review: Apriori-like methods. Overview: the FP-tree-based mining method. Additional progress. Han, J., Pei, J., and Yin, Y. 2000. Mining frequent patterns without candidate generation. In Proc. 2000 ACM SIGMOD Int. Conf. Management of Data (SIGMOD'00), Dallas, TX, pp. 1–12.

8 Review: the Apriori algorithm (Agrawal, R., Imielinski, T., and Swami, A. 1993. Mining association rules between sets of items in large databases.). The core of the algorithm: use the frequent (k-1)-itemsets L_(k-1) to generate the candidate frequent k-itemsets C_k, then scan the database and count each pattern in C_k to obtain the frequent k-itemsets L_k. Example on the transaction database of slide 4 (ξ = 3):
C1 = {f, a, c, d, g, i, m, p, l, o, h, j, k, s, b, e, n}
L1 = {f, a, c, m, b, p}
C2 = {fa, fc, fm, fp, ac, am, …, bp}
L2 = {fa, fc, fm, …}
How many candidates may have to be generated and tested from n frequent items?
n = 10: 2^10 - 1 ≈ 1.02 × 10^3
n = 20: 2^20 - 1 ≈ 1.05 × 10^6
n = 100: 2^100 - 1 ≈ 1.27 × 10^30
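To make the level-wise candidate-generate-and-test loop concrete, here is a minimal illustrative Python sketch (not code from the paper or any particular library); the function name `apriori` and the simplified join/prune steps are our own assumptions.

```python
from itertools import combinations

def apriori(db, min_support):
    """Level-wise Apriori: generate C_k from L_(k-1), scan DB to get L_k.

    `db` is an iterable of item sets; returns {frozenset: support}.
    Candidate generation is a straightforward self-join of L_(k-1) followed
    by subset pruning; clarity is preferred over efficiency here.
    """
    transactions = [frozenset(t) for t in db]

    def count(candidates):
        return {c: sum(1 for t in transactions if c <= t) for c in candidates}

    # L1: frequent 1-itemsets (first DB scan).
    singles = count({frozenset([i]) for t in transactions for i in t})
    L = {c: s for c, s in singles.items() if s >= min_support}
    all_frequent = dict(L)

    k = 2
    while L:
        prev = set(L)
        # Join step: candidates are unions of two frequent (k-1)-itemsets.
        C = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune step: drop candidates with an infrequent (k-1)-subset.
        C = {c for c in C
             if all(frozenset(s) in prev for s in combinations(c, k - 1))}
        # One more DB scan to count the surviving candidates.
        counts = count(C)
        L = {c: s for c, s in counts.items() if s >= min_support}
        all_frequent.update(L)
        k += 1
    return all_frequent

# The slide's database with xi = 3:
db = [{"f", "a", "c", "d", "g", "i", "m", "p"}, {"a", "b", "c", "f", "l", "m", "o"},
      {"b", "f", "h", "j", "o"}, {"b", "c", "k", "s", "p"},
      {"a", "f", "c", "e", "l", "p", "m", "n"}]
print(sorted(("".join(sorted(c)), s) for c, s in apriori(db, 3).items()))
```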

9 Ideas (overview). Compress a large database into a compact Frequent-Pattern tree (FP-tree) structure: highly condensed but complete for frequent pattern mining, avoiding costly repeated database scans. Develop an efficient FP-tree-based frequent pattern mining method (FP-growth): a divide-and-conquer methodology that decomposes mining tasks into smaller ones and avoids candidate generation (conditional sub-database tests only).

10 FP-tree: Design and Construction

11 Definition 1 (FP-tree). [Figure: an FP-tree has a root labelled null, item-prefix subtrees as children of the root, and a frequent-item-header table; each tree node carries an item-name, a count, and a node-link, and each header-table entry carries an item-name and the head of that item's node-link list.]

12 Constructing the FP-tree takes two scans of the database. 1. First scan: collect the set of frequent items. 2. Second scan: construct the FP-tree.

13 First scan: collect the set of frequent items. Find the frequent items (L1), build the frequent-item-header table, and sort the items of each transaction in decreasing frequency order, discarding infrequent items. Item frequencies: f:4, c:4, a:3, b:3, m:3, p:3.

14 First scan (continued): building the FP-tree from transactions ordered by decreasing item frequency yields a more compact tree.
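The first scan and the frequency-ordering step can be sketched as follows; this is our own illustrative code, and the names `first_scan` and `reorder` are hypothetical, not from the paper.

```python
from collections import Counter

def first_scan(db, min_support):
    """Scan 1: count item supports, keep frequent items, fix the order L."""
    counts = Counter(item for t in db for item in t)
    frequent = {i: c for i, c in counts.items() if c >= min_support}
    # L: frequent items in decreasing support order (ties broken arbitrarily).
    order = sorted(frequent, key=lambda i: -frequent[i])
    return frequent, order

def reorder(transaction, order):
    """Keep only the frequent items of a transaction, sorted by the order L."""
    rank = {item: r for r, item in enumerate(order)}
    return sorted((i for i in transaction if i in rank), key=rank.get)

db = [{"f", "a", "c", "d", "g", "i", "m", "p"}, {"a", "b", "c", "f", "l", "m", "o"},
      {"b", "f", "h", "j", "o"}, {"b", "c", "k", "s", "p"},
      {"a", "f", "c", "e", "l", "p", "m", "n"}]
frequent, order = first_scan(db, 3)
print(frequent)               # supports: f 4, c 4, a 3, b 3, m 3, p 3 (key order may vary)
print(reorder(db[0], order))  # frequent items of TID 100 in L order, e.g. ['f', 'c', 'a', 'm', 'p']
```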

15 Second scan: construct the FP-tree. (a) Create the root node, labelled null. (b) For each transaction, order its frequent items according to the order in L; e.g. TID 100: {f, a, c, d, g, i, m, p} becomes {f, c, a, m, p}. (c) Construct the FP-tree by inserting each frequency-ordered transaction into it. [Figure: after inserting TID 100 the tree is the single path null → f:1 → c:1 → a:1 → m:1 → p:1, and the header table lists f, c, a, b, m, p with node-links into the tree.]

16 Second scan (continued): inserting TID 200, {a, b, c, f, l, m, o} → {f, c, a, b, m}. The transaction shares the prefix f, c, a with the existing path, so those counts increase to f:2, c:2, a:2 and a new branch b:1 → m:1 is added under a. [Figure: null → f:2 → c:2 → a:2, with child paths m:1 → p:1 and b:1 → m:1.]

17 Second scan (continued): inserting TID 300, {b, f, h, j, o} → {f, b}. Only f is shared, so its count becomes f:3 and a new child b:1 is added directly under f. [Figure: null → f:3, with subtrees c:2 → a:2 → {m:1 → p:1, b:1 → m:1} and b:1.]

18 Second scan (continued): inserting TID 400, {b, c, k, s, p} → {c, b, p}. No prefix is shared with the f-subtree, so a second child of the root is created: c:1 → b:1 → p:1. [Figure: the root now has two subtrees, one rooted at f:3 and one at c:1.]

19 Second scan (continued): inserting TID 500, {a, f, c, e, l, p, m, n} → {f, c, a, m, p}. The whole transaction follows the existing f-c-a-m-p path, so its counts are incremented to f:4, c:3, a:3, m:2, p:2. [Figure: the completed FP-tree: null → f:4 → c:3 → a:3 → {m:2 → p:2, b:1 → m:1}, plus f:4 → b:1 and null → c:1 → b:1 → p:1, together with the header table for f, c, a, b, m, p.]
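Putting the two scans together, the following is a compact sketch of an FP-tree node and the insertion of frequency-ordered transactions; the class `FPNode`, its attributes, and `build_fptree` are our own illustrative choices, and node-links are prepended rather than appended, which does not affect the mining result.

```python
class FPNode:
    """One FP-tree node: item name, count, parent, children, and node-link."""
    def __init__(self, item, parent=None):
        self.item = item        # None for the root ("null")
        self.count = 0
        self.parent = parent
        self.children = {}      # item -> FPNode
        self.link = None        # next node in the tree carrying the same item

def build_fptree(db, min_support):
    """Two scans: (1) find the frequent items and their order L,
    (2) insert each frequency-ordered transaction into the tree."""
    # Scan 1.
    counts = {}
    for t in db:
        for item in t:
            counts[item] = counts.get(item, 0) + 1
    order = sorted((i for i, c in counts.items() if c >= min_support),
                   key=lambda i: -counts[i])
    rank = {item: r for r, item in enumerate(order)}

    root = FPNode(None)
    header = {item: None for item in order}  # head of each node-link list

    # Scan 2.
    for t in db:
        node = root
        for item in sorted((i for i in t if i in rank), key=rank.get):
            child = node.children.get(item)
            if child is None:                # open a new branch
                child = FPNode(item, parent=node)
                node.children[item] = child
                # Prepend the new node to the item's node-link list.
                child.link, header[item] = header[item], child
            child.count += 1
            node = child
    return root, header, order

db = [{"f", "a", "c", "d", "g", "i", "m", "p"}, {"a", "b", "c", "f", "l", "m", "o"},
      {"b", "f", "h", "j", "o"}, {"b", "c", "k", "s", "p"},
      {"a", "f", "c", "e", "l", "p", "m", "n"}]
root, header, order = build_fptree(db, 3)
print({c.item: c.count for c in root.children.values()})  # {'f': 4, 'c': 1}, as in the figure
```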

20 Efficiency analysis. The FP-tree is much smaller than the DB; a conditional pattern base is smaller than the original FP-tree; a conditional FP-tree is smaller than its pattern base. The mining process therefore works on a set of usually much smaller pattern bases and conditional FP-trees: divide-and-conquer with a dramatic scale of shrinking.

21 Lemma 2.1 (the FP-tree contains the complete information of DB). Given a transaction database DB and a support threshold ξ, the complete set of frequent-item projections of the transactions in DB can be derived from DB's FP-tree. [Figure: the 8 transactions BACD, BAC, BAC, BAD, BA, B, B, B yield a tree with nodes B:8, A:5, C:3, D:1 under the null root.] For a_k = A the count registered at the node is C_ak = 5, while the counts of A's child nodes sum to C'_ak = 3 + 1 = 4, so exactly 5 - 4 = 1 frequent-item projection (BA) ends at A; support(root → A) = 5.

22 Lemma 2.2. Given a transaction database DB and a support threshold ξ, the size of the FP-tree is bounded by Σ_{T∈DB} |freq(T)| and the height of the tree is bounded by max_{T∈DB} {|freq(T)|} (not counting the null root). In the example on the right, the 8 transactions BACD, BAC, BAC, BAD, BA, B, B, B need only 5 nodes in the FP-tree, and the longest path in the tree corresponds to the longest transaction, BACD.

23 3. Mining frequent patterns using the FP-tree

24 FP-growth. Start processing from the end of the list L. Step 1: construct the conditional pattern base for each suffix item. Step 2: construct the conditional FP-tree from each conditional pattern base. Step 3: recursively mine the conditional FP-trees and grow the frequent patterns obtained so far.
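The three steps above can be sketched as a recursion over conditional pattern bases; this is our own simplified code building on the `FPNode`/`build_fptree` sketch after slide 19, it rebuilds each conditional FP-tree with the same construction routine, and it expands each prefix path by its count, which is simple but less efficient than the paper's algorithm.

```python
def fp_growth(header, min_support, suffix=()):
    """Recursively mine an FP-tree given its header table; yields
    (pattern, support) pairs. Relies on FPNode/build_fptree above."""
    for item, node in header.items():
        # Support of `item` in this (conditional) tree: sum over its node-links.
        item_support, n = 0, node
        while n is not None:
            item_support += n.count
            n = n.link
        if item_support < min_support:
            continue
        pattern = suffix + (item,)
        yield pattern, item_support

        # Step 1: conditional pattern base = the prefix path of every node
        # carrying `item`, repeated `count` times (simple, not space-efficient).
        cond_db = []
        n = node
        while n is not None:
            path, p = [], n.parent
            while p is not None and p.item is not None:
                path.append(p.item)
                p = p.parent
            cond_db.extend([set(path)] * n.count)
            n = n.link
        # Step 2: conditional FP-tree; Step 3: recurse on it.
        _, cond_header, _ = build_fptree(cond_db, min_support)
        yield from fp_growth(cond_header, min_support, pattern)

db = [{"f", "a", "c", "d", "g", "i", "m", "p"}, {"a", "b", "c", "f", "l", "m", "o"},
      {"b", "f", "h", "j", "o"}, {"b", "c", "k", "s", "p"},
      {"a", "f", "c", "e", "l", "p", "m", "n"}]
root, header, order = build_fptree(db, 3)
patterns = dict(fp_growth(header, 3))
print(sorted(("".join(sorted(p)), s) for p, s in patterns.items()))
# All frequent patterns of slide 31, e.g. ('acfm', 3) for fcam.
```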

25 Conditional FP-tree for suffix p (p:3). Following p's node-links gives the conditional pattern base {(fcam:2), (cb:1)}. The item counts within this base are f:2, c:3, a:2, b:1, m:2, so only c reaches ξ = 3; the conditional FP-tree is {(c:3)}|p and the resulting frequent pattern is (cp:3).

26 Conditional FP-tree for suffix m (m:3). Conditional pattern base: {(fca:2), (fcab:1)}, with item counts f:3, c:3, a:3, b:1, so the conditional FP-tree is {(f:3, c:3, a:3)}|m. Recursive mining produces the conditional FP-trees {(f:3, c:3)}|am, {(f:3)}|cam and {(f:3)}|cm, and the frequent patterns {(m:3), (am:3), (cm:3), (fm:3), (cam:3), (fam:3), (fcam:3), (fcm:3)}.

27 Conditional FP-tree for suffix b (b:3). Conditional pattern base: {(fca:1), (f:1), (c:1)}, with item counts f:2, c:2, a:1, all below ξ = 3, so the conditional FP-tree is empty ({}|b) and no longer pattern containing b is generated.

28 Conditional FP-tree for suffix a (a:3). Conditional pattern base: {(fc:3)}; the conditional FP-tree is {(f:3, c:3)}|a, recursive mining yields {(f:3)}|ca, and the frequent patterns are {(ca:3), (fca:3), (fa:3)}.

29 Conditional FP-tree for suffix c (c:4). Conditional pattern base: {(f:3)}; the conditional FP-tree is {(f:3)}|c and the frequent pattern is (fc:3).

30 Conditional FP-tree for suffix f (f:4). The conditional pattern base is empty ({}), since f is the first item on every path, so mining stops here.

31 Frequent patterns found. Input: the transaction database of slide 4 with minimum support ξ = 3. Output: all frequent patterns, namely f:4, c:4, a:3, b:3, m:3, p:3, fc:3, fa:3, fm:3, ca:3, cm:3, cp:3, am:3, fca:3, fcm:3, fam:3, cam:3, fcam:3.

32 Single prefix-path FP-tree. When an FP-tree consists of a single prefix path followed by a multipath part, the two parts can be mined separately and their pattern sets combined; a combined pattern takes the support of its part from the multipath side. Example: freq pattern set(P) = {(a:10), (b:8), (c:7), (ab:8), (ac:7), (bc:7), (abc:7)}, freq pattern set(Q) = {(d:4), (e:3), (f:3), (df:3)}, and freq pattern set(P) × freq pattern set(Q) = {(ad:4), (ae:3), (af:3), (adf:3), (bd:4), (be:3), …, (abcd:4), (abce:3), (abcf:3), (abcdf:3)}.
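A small sketch of this combination step, assuming (as the slide's numbers indicate) that a combined pattern keeps the support of its part from Q; the function `combine` and the dictionary representation are our own.

```python
def combine(P, Q):
    """Combine the pattern sets mined from the single prefix path (P) and
    from the multipath part (Q) of an FP-tree. A combined pattern keeps the
    support of its Q-part, because every transaction that reaches Q passes
    through the whole prefix path."""
    combined = {tuple(sorted(p + q)): sq for p in P for q, sq in Q.items()}
    return {**P, **Q, **combined}

P = {("a",): 10, ("b",): 8, ("c",): 7, ("a", "b"): 8,
     ("a", "c"): 7, ("b", "c"): 7, ("a", "b", "c"): 7}
Q = {("d",): 4, ("e",): 3, ("f",): 3, ("d", "f"): 3}
result = combine(P, Q)
print(result[("a", "d")])                  # 4, i.e. (ad:4)
print(result[("a", "b", "c", "d", "f")])   # 3, i.e. (abcdf:3)
```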

33 4. Database projection: from main-memory-based FP mining to disk-based FP mining.

34 Parallel projection: in one scan of DB, each transaction is projected into the projected database of every frequent item it contains, so the projected databases can then be processed independently and in parallel. [Figure: the frequent items and their counts, f:4, c:4, a:3, b:3, m:3, p:3.]

35 Partition projection: each transaction is projected only into the projected database of its last frequent item (in the order L); when that item's projected database is processed, each of its transactions is projected onward to the projected database of the next item, avoiding the space blow-up of parallel projection. [Figure: the frequent items and their counts, f:4, c:4, a:3, b:3, m:3, p:3.]
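The two projection schemes, as we read them from the paper, can be sketched in memory as follows; the function names and the list-of-prefixes representation are our own, and a real implementation would stream the projected databases to disk.

```python
def parallel_projection(db, order):
    """Project each transaction, in one scan, into the projected database of
    every frequent item it contains (total size can grow roughly
    quadratically in the transaction length)."""
    rank = {item: r for r, item in enumerate(order)}
    projected = {item: [] for item in order}
    for t in db:
        items = sorted((i for i in t if i in rank), key=rank.get)
        for k, item in enumerate(items):
            projected[item].append(items[:k])   # the prefix preceding `item`
    return projected

def partition_projection(db, order):
    """Project each transaction only into the projected database of its last
    frequent item; while an item's projected database is processed, its
    transactions are projected onward (the cascade is done eagerly here)."""
    rank = {item: r for r, item in enumerate(order)}
    projected = {item: [] for item in order}
    for t in db:
        items = sorted((i for i in t if i in rank), key=rank.get)
        if items:
            projected[items[-1]].append(items[:-1])
    for item in reversed(order):            # least frequent item first
        for trans in projected[item]:
            if trans:
                projected[trans[-1]].append(trans[:-1])
    return projected

db = [{"f", "a", "c", "d", "g", "i", "m", "p"}, {"a", "b", "c", "f", "l", "m", "o"},
      {"b", "f", "h", "j", "o"}, {"b", "c", "k", "s", "p"},
      {"a", "f", "c", "e", "l", "p", "m", "n"}]
order = ["f", "c", "a", "b", "m", "p"]
print(parallel_projection(db, order)["p"])   # [['f','c','a','m'], ['c','b'], ['f','c','a','m']]
print(partition_projection(db, order)["m"])  # same multiset of prefixes for m as parallel projection
```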

36 5. Performance and compactness

37 Performance and compactness [experimental charts, not reproduced in the transcript].

38 Performance and compactness [experimental charts, not reproduced in the transcript].

39 6. Discussion: improving the performance and scalability of FP-growth.

40 Performance improvements. Projected DBs: partition the DB into a set of projected DBs, then construct an FP-tree and mine it in each projected DB. Disk-resident FP-tree: store the FP-tree on hard disk using a B+-tree structure to reduce I/O cost. FP-tree materialization: an FP-tree constructed with a suitably low ξ can usually satisfy most of the later mining queries. FP-tree incremental update: how should an FP-tree be updated when new data arrive? Either re-construct the FP-tree, or do not update it.

41 Thank you for listening; comments and suggestions are welcome.

42 Conditional FP-trees for a second example database (header table: B:8, A:7, C:7, D:5, E:3). [Figure: the full FP-tree and each conditional FP-tree are drawn on the slide.]

Suffix   Conditional pattern base           Conditional FP-tree       Frequent patterns
B        {}                                 {}|B                      {}
A        {B:5}                              {B:5}|A                   {BA:5}
C        {BA:3, B:3, A:1}                   {B:6, A:4}|C              {BC:6, AC:4}
D        {BAC:1, BA:1, BC:1, AC:1, A:1}     {B:3, A:4, C:3}|D         {BD:3, AD:4, CD:3}
E        {BC:1, ACD:1, AD:1}                {B:1, A:2, C:2, D:2}|E    {AE:2, CE:2, DE:2}

43 6. Discussion. Materialization and maintenance of the FP-tree: moving from main-memory to disk-resident FP-trees; selecting a good minimum support threshold ξ; incremental update of an FP-tree. Extensions of the frequent-pattern growth method in data mining: FP-tree mining with constraints.

