Algorithms for Mining Maximal Frequent Itemsets -- A Survey Chaojun Lu.

1 Algorithms for Mining Maximal Frequent Itemsets -- A Survey Chaojun Lu

2 Introduction; Frequent Itemset Extension Tree; Common Techniques; Some MFI-Mining Algorithms; Concluding Remarks

3 Introduction: Terminology and Notations; Problem; Solution

4 Terminology and Notations
- Set of items: I = {i_1, i_2, …, i_n}
- Set of transactions: DB = {T_1, T_2, …, T_m}, T_i ⊆ I
- (k-)itemset: N ⊆ I (|N| = k)
- Support of itemset N: supp(N)
- Frequent itemset (fi)
- Maximal frequent itemset (mfi)
- Set of all frequent (k-)itemsets: FI, FI_k
- Set of all mfi: MFI
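To make the notation concrete, here is a minimal Python sketch. The toy database `DB`, the threshold `MINSUP`, and the helper names `supp` and `is_frequent` are assumptions for illustration; the slides give neither a dataset nor a threshold. The sketch reads supp(N) as the number of transactions that contain N.

```python
# Toy transaction database (hypothetical data, not from the slides).
DB = [{1, 2, 4}, {1, 2, 4}, {2, 3}, {2, 3}, {3, 4}, {3, 4}, {5}]
MINSUP = 2  # assumed absolute support threshold


def supp(itemset, db):
    """Number of transactions T in db with itemset ⊆ T."""
    return sum(1 for t in db if itemset <= t)


def is_frequent(itemset, db, minsup):
    """An itemset is frequent when its support reaches minsup."""
    return supp(itemset, db) >= minsup


print(supp({1, 2}, DB), is_frequent({1, 3}, DB, MINSUP))
```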

5 Problem: Discover all maximal frequent itemsets in a given transaction database.
Solution: Traverse the search space (the subset lattice of I) and count the support of itemsets in DB.

6 Solution (cont.)
Traversing the search space by:
- Brute force: 2^|I| itemsets
- Clever use of the Basic Property of itemsets: A ⊆ B ⇒ supp(A) ≥ supp(B)
  - BP1: All subsets of a known frequent itemset are also frequent.
  - BP2: All supersets of a known infrequent itemset are also infrequent.
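As a baseline for the pruning ideas that follow, the brute-force traversal can be sketched in a few lines of Python. The toy database, the threshold, and the function name `brute_force_mfi` are assumptions for illustration. The sketch enumerates every non-empty subset of I, keeps the frequent ones, and then keeps those with no frequent proper superset.

```python
from itertools import combinations

# Toy transaction database (hypothetical data, not from the slides).
DB = [{1, 2, 4}, {1, 2, 4}, {2, 3}, {2, 3}, {3, 4}, {3, 4}, {5}]
MINSUP = 2  # assumed absolute support threshold


def supp(itemset, db):
    """Number of transactions that contain itemset."""
    return sum(1 for t in db if itemset <= t)


def brute_force_mfi(db, minsup):
    """Visit all 2^|I| - 1 non-empty itemsets; return the maximal frequent ones."""
    items = sorted(set().union(*db))
    frequent = [
        frozenset(c)
        for k in range(1, len(items) + 1)
        for c in combinations(items, k)
        if supp(frozenset(c), db) >= minsup
    ]
    # An itemset is maximal if no frequent itemset is a proper superset of it.
    return {n for n in frequent if not any(n < m for m in frequent)}


MFI = brute_force_mfi(DB, MINSUP)
print(sorted(sorted(m) for m in MFI))
```

The exponential enumeration is only workable for a handful of items, which is why BP1 and BP2 matter.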

7 Introduction; Frequent Itemset Extension Tree; Common Techniques; Some MFI-Mining Algorithms; Concluding Remarks

8 Frequent Itemset eXtension Tree (FIXTree): Purpose; Idea; Description; Problem Re-formulated

9 Purpose: To provide a general framework for analyzing and comparing existing MFI-mining algorithms.
Idea: Larger frequent itemsets are generated by extending known smaller frequent itemsets with suitable items. The FIXTree captures and illustrates this extension process.

10 Description of FIXTree
- Root: ∅
- Nodes: frequent itemsets
- Each node N is associated with its candidate extensions CX(N) and frequent extensions FX(N), defined as:
  CX(N) = {x | x ∈ I and N ∪ {x} may be frequent}
  FX(N) = {x | x ∈ CX(N) and N ∪ {x} is frequent}
- Parent-Child P → C: C is a frequent extension of P, i.e. C = P ∪ {x} for some x ∈ FX(P).

11 Example (each node is written N (CX(N)/FX(N))):
∅ ({1,2,3,4,5}/{1,2,3,4})
  1 ({2,3,4}/{2,4})
    12 ({4}/{4})
      124 (∅/∅)
    14 (∅/∅)
  2 ({3,4}/{3,4})
    23 ({4}/∅)
    24 (∅/∅)
  3 …
  4 …
Problem Re-formulated: Generate as small a FIXTree containing MFI as possible while searching the subset lattice of I.
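The step from CX(N) to FX(N) is a support count over the candidate extensions. Below is a minimal Python sketch; the toy database and threshold are assumptions, chosen so that the computed extension sets agree with the nodes shown in the slide 11 example, and `frequent_extensions` is a hypothetical helper name.

```python
# Toy transaction database (hypothetical; picked to match the example tree).
DB = [{1, 2, 4}, {1, 2, 4}, {2, 3}, {2, 3}, {3, 4}, {3, 4}, {5}]
MINSUP = 2  # assumed absolute support threshold


def supp(itemset, db):
    """Number of transactions that contain itemset."""
    return sum(1 for t in db if itemset <= t)


def frequent_extensions(n, cx, db, minsup):
    """FX(N): the candidates x in CX(N) for which N ∪ {x} is frequent."""
    return [x for x in cx if supp(n | {x}, db) >= minsup]


print(frequent_extensions(set(), [1, 2, 3, 4, 5], DB, MINSUP))
print(frequent_extensions({1}, [2, 3, 4], DB, MINSUP))
```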

12 Introduction; Frequent Itemset Extension Tree; Common Techniques; Some MFI-Mining Algorithms; Concluding Remarks

13 Common Techniques
- Search Strategies
- Pruning Strategies
- Dynamic Reordering
- Data Representation for Fast Support Counting
- Frequency Determination

14 Search Strategies
We can generate the FIXTree via:
- Breadth-first
- Depth-first
- Hybrid
For MFI-mining, it is unnecessary to generate and count all nodes. Instead, we try to generate as few nodes of the FIXTree as possible, so long as MFI can be identified.

15 Pruning Strategies
- BasicPS1: Prune node N’s infrequent extension subtree.
- Example: 1 ({2,3,4}/{2,4}) has children 12 ({4}/{4}) and 14 (∅/∅); the infrequent extension 13 is pruned.
- Note: This strategy greatly improves a PURE DFS algorithm for mining long patterns.

16 Pruning Strategies (cont.)
- BasicPS2: Node N’s CX(N) comes from its parent node P’s FX(P). Let N = P ∪ {x}, x ∈ FX(P); then CX(N) = {y | y ∈ FX(P) and y > x}.
- Example: 1 ({2,3,4}/{2,4}) has children 12 ({4}/…) and 14 (∅/…).
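BasicPS1 and BasicPS2 together already give a complete depth-first enumeration of the frequent itemsets. The following Python sketch shows one way to combine them, assuming a fixed ascending item order; the toy database, threshold, and the function name `dfs` are illustrative assumptions.

```python
# Toy transaction database (hypothetical data, not from the slides).
DB = [{1, 2, 4}, {1, 2, 4}, {2, 3}, {2, 3}, {3, 4}, {3, 4}, {5}]
MINSUP = 2  # assumed absolute support threshold


def supp(itemset, db):
    """Number of transactions that contain itemset."""
    return sum(1 for t in db if itemset <= t)


def dfs(n, cx, db, minsup, out):
    """Append every frequent itemset in the subtree below node n to out."""
    # BasicPS1: infrequent extensions are dropped, so their subtrees never exist.
    fx = [x for x in cx if supp(n | {x}, db) >= minsup]
    for i, x in enumerate(fx):
        child = n | {x}
        out.append(child)
        # BasicPS2: the child's candidates are the parent's frequent
        # extensions that come after x in the item order.
        dfs(child, fx[i + 1:], db, minsup, out)


FI = []
dfs(frozenset(), [1, 2, 3, 4, 5], DB, MINSUP, FI)
print(len(FI))
```

Because each child only looks at items after x, every frequent itemset is generated exactly once.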

17 Pruning Strategies (cont.)
- MaxPS1: At node N, if N ∪ CX(N) ⊆ M for a known fi/mfi M, then the subtree of N may be pruned.
- MaxPS2 (look-ahead): At node N, if N ∪ CX(N) is found frequent by support counting, then all of N’s children may be pruned (and a possible new mfi is produced).
- Example node: 1 ({2,3,4}/…)
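A depth-first miner that applies MaxPS1 and MaxPS2 can be sketched as follows. This is an illustrative Python sketch under a fixed ascending item order, not the code of any algorithm surveyed here; the toy database, threshold, and the name `mine_mfi` are assumptions.

```python
# Toy transaction database (hypothetical data, not from the slides).
DB = [{1, 2, 4}, {1, 2, 4}, {2, 3}, {2, 3}, {3, 4}, {3, 4}, {5}]
MINSUP = 2  # assumed absolute support threshold


def supp(itemset, db):
    """Number of transactions that contain itemset."""
    return sum(1 for t in db if itemset <= t)


def mine_mfi(n, cx, db, minsup, mfi):
    """Depth-first search below node n, collecting maximal itemsets in mfi."""
    full = n | frozenset(cx)  # N ∪ CX(N)
    # MaxPS1: the whole subtree lies inside an itemset we already know.
    if any(full <= m for m in mfi):
        return
    # MaxPS2 (look-ahead): if N ∪ CX(N) is frequent, skip all children.
    if cx and supp(full, db) >= minsup:
        mfi.append(full)
        return
    fx = [x for x in cx if supp(n | {x}, db) >= minsup]
    if not fx:
        # Leaf: n has no frequent extension; keep it unless already covered.
        if n and not any(n <= m for m in mfi):
            mfi.append(n)
        return
    for i, x in enumerate(fx):
        mine_mfi(n | {x}, fx[i + 1:], db, minsup, mfi)


MFI = []
mine_mfi(frozenset(), [1, 2, 3, 4, 5], DB, MINSUP, MFI)
print([sorted(m) for m in MFI])
```

On this toy data the look-ahead fires at node 12 (yielding 124) and at node 3 (yielding 34), while MaxPS1 removes nodes 14, 24, and 4.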

18 Pruning Strategies (cont.)
- MaxPS3: At node N, if N ∪ CX(N) is frequent, then all of N’s right-hand-side siblings may be pruned. (Those branches won’t produce new mfi.)
- Example: ∅ ({1,2,3,4,5}/{1,2,3,4}) with children 1 …, 2 ({3,4}/…), 3 …, 4 …

19 Pruning Strategies (cont.)
- DFMaxPS: In DFS, AFTER the recursive call DFS(Ni), check whether the leftmost path N ∪ {i,…,n} is frequent. If yes, then Ni’s right-hand-side siblings may be pruned. (These won’t produce new mfi.)
- Figure: N (…/{1,2,…,n}) with children N1, …, Ni ({i+1,…,n}), N(i+1), …, Nn

20 Pruning Strategies (cont.)
- EquivPS: At node N, if for some x ∈ CX(N), supp(N ∪ {x}) = supp(N), then N can be replaced by N ∪ {x}, with CX(N ∪ {x}) = CX(N) − {x}.
- Figure: N ({x,y,z}/…) with children Nx (itself with children Nxy…, Nxz…), Ny…, Nz… is replaced by Nx ({y,z}/…) with children Nxy…, Nxz….
- Itemsets containing N but not x cannot be mfi.
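EquivPS can be applied to every qualifying x at once, because supp(N ∪ {x}) = supp(N) means each transaction containing N also contains x. The Python sketch below illustrates this; the toy database and the helper name `absorb_equivalent` are assumptions.

```python
# Toy transaction database (hypothetical data, not from the slides).
DB = [{1, 2, 4}, {1, 2, 4}, {2, 3}, {2, 3}, {3, 4}, {3, 4}, {5}]


def supp(itemset, db):
    """Number of transactions that contain itemset."""
    return sum(1 for t in db if itemset <= t)


def absorb_equivalent(n, cx, db):
    """Move every x in CX(N) with supp(N ∪ {x}) = supp(N) into N itself."""
    base = supp(n, db)
    same = [x for x in cx if supp(n | {x}, db) == base]
    rest = [x for x in cx if x not in same]
    return n | set(same), rest


# Items 2 and 4 occur in every transaction that contains item 1.
print(absorb_equivalent({1}, [2, 3, 4], DB))
```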

21 Dynamic Reordering
The item order in which itemsets are extended greatly affects MFI-mining algorithms. Two heuristics:
- DR1: At node N, reorder all x ∈ FX(N) in increasing order of supp(N ∪ {x}).
- Figure labels: 1 {2,3,4}; 12 {4,3}; {3}; {4}
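DR1 amounts to a sort keyed on the support of each one-item extension. A minimal Python sketch, with an assumed toy database and the hypothetical helper name `reorder_dr1`:

```python
# Toy transaction database (hypothetical data, not from the slides).
DB = [{1, 2, 4}, {1, 2, 4}, {2, 3}, {2, 3}, {3, 4}, {3, 4}, {5}]


def supp(itemset, db):
    """Number of transactions that contain itemset."""
    return sum(1 for t in db if itemset <= t)


def reorder_dr1(n, fx, db):
    """DR1: order FX(N) by increasing supp(N ∪ {x}); ties keep input order."""
    return sorted(fx, key=lambda x: supp(n | {x}, db))


ORDER = reorder_dr1(set(), [4, 3, 2, 1], DB)
print(ORDER)
```

Putting the least frequent extension first tends to leave it with the most candidate extensions, which in turn are the ones most likely to be filtered out early.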

22 Dynamic Reordering (cont.)
- DR2: Reorder the items of FX(∅) (i.e. FI_1) in decreasing order of |IF(x)| for x ∈ FI_1, where IF(x) = {y | y ∈ FI_1 and xy is infrequent}.
Notes:
1. |M(x)| ≤ |FI_1| − |IF(x)|, where M(x) is the longest mfi containing x.
2. DR2 + DR1 for FI_1.
3. Compute FI_1 and FI_2 before using DR2.
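The DR2 ordering, with DR1 as the tie-breaker suggested in note 2, can be sketched in Python as below. The toy database, threshold, and the name `reorder_dr2` are assumptions; IF(x) is computed directly from 2-itemset supports.

```python
# Toy transaction database (hypothetical data, not from the slides).
DB = [{1, 2, 4}, {1, 2, 4}, {2, 3}, {2, 3}, {3, 4}, {3, 4}, {5}]
MINSUP = 2  # assumed absolute support threshold


def supp(itemset, db):
    """Number of transactions that contain itemset."""
    return sum(1 for t in db if itemset <= t)


def reorder_dr2(fi1, db, minsup):
    """Sort FI_1 by decreasing |IF(x)|, breaking ties by increasing support."""
    def infrequent_partners(x):
        # IF(x): frequent items y whose pair {x, y} is infrequent.
        return [y for y in fi1 if y != x and supp({x, y}, db) < minsup]

    return sorted(
        fi1,
        key=lambda x: (-len(infrequent_partners(x)), supp({x}, db)),
    )


ORDER = reorder_dr2([1, 2, 3, 4], DB, MINSUP)
print(ORDER)
```

Here items 1 and 3 each have one infrequent partner (each other), so they come first, and item 1 precedes item 3 because its support is lower.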

23 Data Representation
- Data representation:
  - transaction: set of items, bitstring
  - tid-list for each item(set)
  - FP-tree
  - vertical bitmap for each item(set)
  - diffset
- Count support on the entire DB or sub-DB?
- Counting techniques

24 Frequency Determination
We can determine that an itemset N is frequent via:
- Directly counting supp(N) in DB
- A known frequent superset of N
- A lower bound on supp(N) that meets or exceeds minsup

25 Lower Bound Technique
Obtain a lower bound on supp(N) based on support information of N’s subsets.
- supp(N ∪ {x}) = supp(N) − drop(N,x) ≥ supp(N) − drop(M,x), where M ⊆ N.
- supp(N ∪ X) ≥ supp(N) − Σ_{x ∈ X} drop(M,x), where M ⊆ N.

26 Lower Bound Technique (cont.)
- LB-PS: We already have supp(N), supp(N1), supp(N2), supp(N3), so we can compute the lower bound supp(N1 ∪ {2,3}) ≥ supp(N) − drop(N,1) − drop(N,2) − drop(N,3) and check whether it is ≥ minsup. If yes, then prune the N2 and N3 branches. (cf. MaxPS3)
- Figure: N (…/{1,2,3}) with children N1 ({2,3}/…), N2 ({3}/…), N3
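The bound only needs supports that a level-wise or look-ahead search has already counted. A small Python sketch follows, using drop(M,x) = supp(M) − supp(M ∪ {x}) as implied by the first identity on slide 25; the toy database and the names `drop` and `lower_bound` are assumptions.

```python
# Toy transaction database (hypothetical data, not from the slides).
DB = [{1, 2, 3}, {1, 2, 3}, {1, 2, 3}, {1, 2, 3}, {1, 2}, {2, 3}]
MINSUP = 3  # assumed absolute support threshold


def supp(itemset, db):
    """Number of transactions that contain itemset."""
    return sum(1 for t in db if itemset <= t)


def drop(m, x, db):
    """drop(M, x): how much support M loses when x is added."""
    return supp(m, db) - supp(m | {x}, db)


def lower_bound(n, xs, db, m=None):
    """Lower bound on supp(N ∪ X), using drops measured at some M ⊆ N."""
    m = n if m is None else m
    return supp(n, db) - sum(drop(m, x, db) for x in xs)


LB = lower_bound(set(), [1, 2, 3], DB)
print(LB, LB >= MINSUP)
```

Since the bound already reaches the threshold, {1,2,3} is known to be frequent without counting it directly.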

27 Introduction; Frequent Itemset Extension Tree; Common Techniques; Some MFI-Mining Algorithms; Concluding Remarks

28 Some MFI-Mining Algorithms
- Apriori
- Pincer-Search
- FP-growth
- Max-Miner
- DepthProject
- MAFIA
- GenMax

29 Apriori
- Breadth-first
- Key steps, given FI_k:
  - Generate C_{k+1}
    - Join (extending FI_k using BasicPS2)
    - Prune (BP2)
  - Count the support of C_{k+1} to obtain FI_{k+1}
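The join and prune steps can be sketched in Python as follows, representing each itemset as a sorted tuple. The toy database, threshold, and the function name `apriori_gen` are assumptions for illustration.

```python
from itertools import combinations

# Toy transaction database (hypothetical data, not from the slides).
DB = [{1, 2, 4}, {1, 2, 4}, {2, 3}, {2, 3}, {3, 4}, {3, 4}, {5}]
MINSUP = 2  # assumed absolute support threshold


def supp(itemset, db):
    """Number of transactions that contain itemset."""
    return sum(1 for t in db if set(itemset) <= t)


def apriori_gen(fi_k):
    """Build C_{k+1} from FI_k, a set of sorted k-tuples."""
    fi_k = set(fi_k)
    candidates = set()
    for a in fi_k:
        for b in fi_k:
            # Join: a and b share their first k-1 items and a's last item is smaller.
            if a[:-1] == b[:-1] and a[-1] < b[-1]:
                c = a + (b[-1],)
                # Prune (BP2): every k-subset of c must itself be frequent.
                if all(s in fi_k for s in combinations(c, len(a))):
                    candidates.add(c)
    return candidates


FI2 = {(1, 2), (1, 4), (2, 3), (2, 4), (3, 4)}
C3 = apriori_gen(FI2)
FI3 = {c for c in C3 if supp(c, DB) >= MINSUP}
print(sorted(C3), sorted(FI3))
```

Candidate (2, 3, 4) survives the prune step because all of its 2-subsets are frequent, and is only rejected by the support count.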

30 Apriori (cont.)
Symmetry of the FI-mining problem
- Figure: extension starts from ∅ (FI_k, count C_{k+1}, FI_{k+1}); reduction starts from {1,2,…,n} (IF_{k+1}, count C_k, IF_k).
- Extension-based vs. reduction-based
- Frequent vs. infrequent

31 Pincer-Search
- Hybrid search (top-down + bottom-up)
- Key steps (initially CMFI = {I}), given FI_{k-1}, C_k, CMFI and MFI:
  - Count C_k ∪ CMFI to obtain FI_k, IFI_k and new MFI
  - Use MFI to prune FI_k (BP1, MaxPS)
  - Use IFI_k to update CMFI
  - Generate C_{k+1}
    - Join (extending FI_k using BasicPS2)
    - Recover missing candidates
    - Prune (BP2)

32 Pincer-Search (cont.)
Figure: the lattice from ∅, showing the bottom-up and top-down searches and the two pruned regions.

33 FP-Growth
- FP-tree: a compact form of DB/sub-DB
- Key steps:
  FP-growth(N, N-tree)
    if N-tree is a single path
      then a possible mfi is found
    else {
      extend N with x ∈ FX(N)
      construct Nx-tree
      FP-growth(N ∪ {x}, Nx-tree)
    }
- Figure: a single path Nx, Ny, Nz, giving N{x,y,z}

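34 FP-Growth (cont.)
- Figure: FP-tree over the header items f, c, a, b, m, p, rooted at ∅, with node counts f:4, c:3, a:3, m:2, p:2, b:1, m:1, b:1, c:1, b:1, p:1.
- Figure: extension tree from ∅ with nodes p (mbacf/c) and m (bacf/acf); other labels: bacf, cp, pruned.
- p’s subDB: fcam, fcam, cb; p’s FP-tree: c
- m’s subDB: fca, fca, fcab; m’s FP-tree: fca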

35 FP-Growth (cont.)
- Depth-first
- MaxPS (if used for MFI-mining)
- Dynamic Reordering
- Projected subDB
- Without Candidate Generation?
  - Construct subDB for N → CX(N)
  - Single path → MaxPS
  - Mining frequent 1-itemsets in subDB → FX(N)

36 MaxMiner
- Breadth-first + Pruning
- Key steps, at node N with CX(N):
  - Count N ∪ CX(N) and N ∪ {x} for each x ∈ CX(N) to get FX(N)
  - If N ∪ CX(N) is frequent, prune using MaxPS2
  - Reorder FX(N) using DR1
  - Generate N’s children N ∪ {x} for x ∈ FX(N), with CX(N ∪ {x}) = {y | y ∈ FX(N) and y > x}
  - MaxPS3 + LB-PS

37 DepthProject
- Depth-first + Pruning
- Key steps, at node N with CX(N), call DP(N, DB):
  - Count N ∪ {x} in DB to obtain FX(N)
  - Prune using DFMaxPS, MaxPS1
  - Project DB to obtain subDB (if necessary)
  - Reorder FX(N) using DR1
  - For each x ∈ FX(N): DP(N ∪ {x}, subDB)
- Output: a superset of MFI

38 DepthProject (cont.)
Projected DB
- Figure: DB and its projection for {a}, at node a ({b,c}). Labels recoverable from the figure: abc, FX(a), bc, [101], ab, ac, acd, c, abc, abe, b, [1010], bd.

39 DepthProject (cont.)
- Project DB for some nodes on a path
- Bitstring representation
- Byte Counting
- Bucket Counting

40 MAFIA
- Depth-first + Pruning
- Key steps, at node N, call MAFIA(N, MFI):
  - If N ∪ CX(N) ⊆ MFI then prune using MaxPS1
  - Count N ∪ {x} to obtain FX(N), using EquivPS
  - Reorder FX(N) using DR1
  - For each x ∈ FX(N): MAFIA(N ∪ {x}, MFI)
  - If on leftmost path, prune using DFMaxPS

41 MAFIA (cont.)
Data Representation
- Vertical bitmap and byte counting
- Bitmap of item(set) N: bmp(N), with bit j set to 0/1 for transaction j
- Extending N to N ∪ {x}: t(N ∪ {x}) = t(N) ∩ t(x), computed as bmp(N) AND bmp(x)
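The vertical bitmap idea can be illustrated with Python integers as bit vectors. This is a sketch, not MAFIA's implementation: the toy database and helper names are assumptions, and `bin(...).count("1")` stands in for the byte-counting lookup table mentioned on the slide.

```python
# Toy transaction database (hypothetical data, not from the slides).
DB = [{1, 2, 4}, {1, 2, 4}, {2, 3}, {2, 3}, {3, 4}, {3, 4}, {5}]


def bitmap(item, db):
    """bmp(item): bit j is 1 exactly when transaction j contains item."""
    bits = 0
    for j, t in enumerate(db):
        if item in t:
            bits |= 1 << j
    return bits


def popcount(bits):
    """Support of an itemset = number of 1 bits in its bitmap."""
    return bin(bits).count("1")


BMP = {x: bitmap(x, DB) for x in (1, 2, 3, 4, 5)}
bmp_12 = BMP[1] & BMP[2]    # t({1,2}) = t(1) ∩ t(2)
bmp_124 = bmp_12 & BMP[4]   # extend by one more item with one more AND
print(popcount(bmp_12), popcount(bmp_124))
```

Each extension costs one bitwise AND over the bitmap plus one population count, independent of how many items the itemset already has.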

42 GenMax
- Depth-first + Pruning
- Key steps:
  - Compute FI_1 and FI_2
  - Reorder FI_1 using DR2 + DR1
  - MFI = ∅ (used for MaxPS1)
  - LMFI(∅, FI_1, MFI) // use diffsets
  - Return MFI

43 GenMax (cont.)
MFI-subset check: progressive focusing
  LMFI(N, FX(N), LMFI)
    For each x ∈ FX(N)
      Generate Nx = N ∪ {x} with CX(Nx)
      If Nx ∪ CX(Nx) ⊆ LMFI // MaxPS1
        then return
      Count CX(Nx) to obtain FX(Nx)
      Update LMFI to obtain newLMFI
      LMFI(Nx, FX(Nx), newLMFI)

44 GenMax (cont.)
- MFI-subset check optimization: check for local MFI
- DR2
- Data Representation: diffsets
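The slide does not spell out how diffsets work, so the Python sketch below assumes the usual definition from the diffset literature: for a prefix P and item X, d(PX) = t(P) − t(PX), which gives d(PXY) = d(PY) − d(PX) and supp(PXY) = supp(PX) − |d(PXY)|. The toy database and helper name `tidset` are illustrative assumptions.

```python
# Toy transaction database (hypothetical data, not from the slides).
DB = [{1, 2, 4}, {1, 2, 4}, {2, 3}, {2, 3}, {3, 4}, {3, 4}, {5}]


def tidset(itemset, db):
    """t(N): ids of the transactions that contain itemset."""
    return {j for j, t in enumerate(db) if itemset <= t}


# Prefix P = ∅, so t(P) is every transaction id.
t_all = tidset(set(), DB)
d2 = t_all - tidset({2}, DB)   # d(2): transactions missing item 2
d3 = t_all - tidset({3}, DB)   # d(3): transactions missing item 3

supp_2 = len(t_all) - len(d2)
d23 = d3 - d2                  # d(PXY) = d(PY) − d(PX) with X = 2, Y = 3
supp_23 = supp_2 - len(d23)    # supp(PXY) = supp(PX) − |d(PXY)|
print(supp_2, supp_23)
```

Only the difference sets are stored and intersected, which stays small when extensions lose few transactions.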

45 Introduction; Frequent Itemset Extension Tree; Common Techniques; Some MFI-Mining Algorithms; Concluding Remarks

46 Concluding Remarks
- Independent components can fit together nicely:
  - Search strategy: hybrid
  - Pruning strategy and dynamic reordering
  - Data projection, bitmap representation, fast counting, compression
- Different algorithms perform well under different MFI distributions
- MAFIA and GenMax: current state-of-the-art

47 References
- R. C. Agarwal, et al. Depth first generation of long patterns.
- R. J. Bayardo. Efficiently mining long patterns from databases.
- D. Burdick, et al. MAFIA: a maximal frequent itemset algorithm for transactional databases.
- K. Gouda, et al. Efficiently mining maximal frequent itemsets.
- J. Han, et al. Mining frequent patterns without candidate generation.
- D-I Lin, et al. Pincer-search: an efficient algorithm for discovering the maximum frequent set.

48 Thank You!

