# The FP-Growth/Apriori Debate Jeffrey R. Ellis CSE 300 – 01 April 11, 2002.


## Presentation Overview

- FP-Growth algorithm: refresher & example
- Motivation: FP-Growth complexity vs. Apriori complexity (saving calculation or hiding work?)
- Real-world application: datasets are not created equal; results of a real-world implementation

## FP-Growth Algorithm

Association rule mining has two phases:

- Generate frequent itemsets: Apriori generates candidate sets; FP-Growth uses specialized data structures (no candidate sets)
- Find association rules: outside the scope of both FP-Growth and Apriori

Therefore, FP-Growth is a competitor to Apriori.

## FP-Growth Example

Transactions (min_support = 3):

| TID | Items bought | (ordered) frequent items |
|-----|--------------|--------------------------|
| 100 | f, a, c, d, g, i, m, p | f, c, a, m, p |
| 200 | a, b, c, f, l, m, o | f, c, a, b, m |
| 300 | b, f, h, j, o | f, b |
| 400 | b, c, k, s, p | c, b, p |
| 500 | a, f, c, e, l, p, m, n | f, c, a, m, p |

Header table (item: frequency): f: 4, c: 4, a: 3, b: 3, m: 3, p: 3

FP-Tree paths from the root {}:

- {} → f:4 → c:3 → a:3 → m:2 → p:2
- {} → f:4 → c:3 → a:3 → b:1 → m:1
- {} → f:4 → b:1
- {} → c:1 → b:1 → p:1

Conditional pattern bases:

| Item | Conditional pattern base |
|------|--------------------------|
| c | f:3 |
| a | fc:3 |
| b | fca:1, f:1, c:1 |
| m | fca:2, fcab:1 |
| p | fcam:2, cb:1 |

## FP-Growth Example (continued)

| Item | Conditional pattern base | Conditional FP-tree |
|------|--------------------------|---------------------|
| f | (none) | Empty |
| c | {(f:3)} | {(f:3)}\|c |
| a | {(fc:3)} | {(f:3, c:3)}\|a |
| b | {(fca:1), (f:1), (c:1)} | Empty |
| m | {(fca:2), (fcab:1)} | {(f:3, c:3, a:3)}\|m |
| p | {(fcam:2), (cb:1)} | {(c:3)}\|p |

## FP-Growth Algorithm: FP-CreateTree

Input: DB, min_support. Output: FP-Tree.

1. Scan the DB and count all frequent items.
2. Create the null root and set it as the current node.
3. For each transaction T:
   - Sort T's frequent items by descending frequency.
   - For each sorted item I: insert I into the tree as a child of the current node, and connect the new tree node to its header list.

This makes two passes through the DB; tree creation is proportional to the number of items in the DB, so the complexity of CreateTree is O(|DB|).
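The two-pass construction above can be sketched in Python. This is a minimal illustration, not Han & Pei's implementation; the `Node` class and the tie-breaking order for equal-frequency items are my own choices:

```python
from collections import defaultdict

class Node:
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count = 0
        self.children = {}

def create_tree(transactions, min_support):
    # Pass 1: count item frequencies across the whole DB.
    counts = defaultdict(int)
    for t in transactions:
        for item in t:
            counts[item] += 1
    freq = {i: c for i, c in counts.items() if c >= min_support}

    # Pass 2: insert each transaction, items sorted by descending frequency
    # (ties broken alphabetically), sharing prefixes with earlier paths.
    root = Node(None, None)
    header = defaultdict(list)  # item -> list of tree nodes holding it
    for t in transactions:
        items = sorted((i for i in t if i in freq),
                       key=lambda i: (-freq[i], i))
        node = root
        for item in items:
            if item not in node.children:
                child = Node(item, node)
                node.children[item] = child
                header[item].append(child)
            node = node.children[item]
            node.count += 1
    return root, header, freq

# The five transactions from the example slide.
transactions = [set("facdgimp"), set("abcflmo"), set("bfhjo"),
                set("bcksp"), set("afcelpmn")]
root, header, freq = create_tree(transactions, 3)
```

The header lists give, for each frequent item, every tree node carrying it, which is what the mining phase follows to build conditional pattern bases.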

## FP-Growth Example (revisited)

The same FP-Tree as before: header table f: 4, c: 4, a: 3, b: 3, m: 3, p: 3, with paths {} → f:4 → c:3 → a:3 → m:2 → p:2, {} → f:4 → c:3 → a:3 → b:1 → m:1, {} → f:4 → b:1, and {} → c:1 → b:1 → p:1, built from the five transactions listed earlier.

## FP-Growth Algorithm: FP-Growth

Input: FP-Tree, f_is (frequent itemset). Output: all frequent patterns.

- If the FP-Tree contains a single path P, then for each combination of nodes in P: generate pattern f_is ∪ combination.
- Else, for each item i in the header:
  - generate pattern f_is ∪ i
  - construct the pattern's conditional cFP-Tree
  - if cFP-Tree ≠ null, call FP-Growth(cFP-Tree, pattern)

The recursive algorithm creates FP-Tree structures and calls FP-Growth. It claims no candidate generation.
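The recursion can be illustrated without explicit tree nodes by projecting the frequency-ordered transaction lists directly, since each item's prefixes are exactly its conditional pattern base. This is a minimal sketch under that simplification; the function name `fp_growth` and the list-based representation are assumptions, not the authors' code:

```python
from collections import Counter

def fp_growth(db, min_support, suffix=frozenset()):
    # db: transactions as lists, each already sorted in one fixed
    # frequency order, so a prefix is a conditional pattern base entry.
    counts = Counter(item for t in db for item in set(t))
    patterns = {}
    for item, sup in counts.items():
        if sup < min_support:
            continue
        pattern = suffix | {item}
        patterns[pattern] = sup
        # Conditional pattern base for `item`: the prefix preceding it
        # in every transaction that contains it.
        cond_db = [t[:t.index(item)]
                   for t in db if item in t and t.index(item) > 0]
        patterns.update(fp_growth(cond_db, min_support, pattern))
    return patterns

# The five transactions from the slides, items in f, c, a, b, m, p order.
db = [list("fcamp"), list("fcabm"), list("fb"), list("cbp"), list("fcamp")]
result = fp_growth(db, 3)
```

Running it on the example DB with min_support = 3 recovers, among others, the pattern {f, c, a, m} with support 3, mirroring the conditional-pattern-base table on the earlier slide.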

## Two-Part Algorithm: Single-Path Case

If the FP-Tree contains a single path P, then for each combination of nodes in P: generate pattern f_is ∪ combination.

E.g., for path { A, B, C, D } with p = |path| = 4 and f_is = { D }, the generated patterns are AD, BD, CD, ABD, ACD, BCD, ABCD: a total of Σ (n = 1 to p−1) C(p−1, n) patterns.
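The single-path case can be checked in a few lines: for a path of p = 4 nodes with suffix { D }, enumerating every non-empty combination of the three remaining nodes yields the 2^(p−1) − 1 = 7 patterns listed above. The helper below is a hypothetical illustration, not part of either algorithm:

```python
from itertools import combinations

def single_path_patterns(path_items, f_is):
    # Every non-empty combination of the path's nodes, unioned with the
    # suffix itemset f_is, is a frequent pattern.
    return [frozenset(c) | frozenset(f_is)
            for n in range(1, len(path_items) + 1)
            for c in combinations(path_items, n)]

# Path nodes {A, B, C} above the suffix item D, as on the slide.
pats = single_path_patterns("ABC", "D")
```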

## Two-Part Algorithm: Multi-Path Case

Else, for each item i in the header: generate pattern f_is ∪ i, construct the pattern's conditional cFP-Tree, and if cFP-Tree ≠ null, call FP-Growth(cFP-Tree, pattern).

E.g., for items { A, B, C, D, E } with f_is = { D }:

- i = A: pattern base = (ABCD), (ACD), (AD)
- i = B: pattern base = (ABCD)
- i = C: pattern base = (ABCD), (ACD)
- i = E: pattern base = null

Pattern bases are generated by following the f_is path from the header to each node in the tree containing it, up to the tree root, for each header item.

## FP-Growth Complexity

Each path in the tree will be at least partially traversed once per item in that path (the depth of the tree path) times the number of items in the header. The complexity of searching through all paths is therefore bounded by O(header_count² × depth of tree). Creation of a new cFP-Tree also occurs at each step.

## Sample Data FP-Tree

[Figure: a large sample FP-tree over items A, B, C, D, E, J, K, L, M with many branches; the tree structure is not recoverable from the transcript]

## Algorithm Results (in English)

- Candidate-generation sets are exchanged for FP-Trees.
- You MUST take into account all paths that contain an itemset with a test item.
- You CANNOT determine, before a conditional FP-Tree is created, whether new frequent itemsets will occur.
- Trivial examples hide these assertions, leading to a belief that FP-Growth operates more efficiently.

## FP-Growth Mining Example

Transactions: ABGM, BGIM, FGIM, GHIM, plus single-item transactions A (×49), B (×44), F (×39), G (×34), H (×30), I (×20).

Header: A B F G H I M. FP-Tree paths from the root:

- Null → A:50 → B:1 → G:1 → M:1
- Null → B:45 → G:1 → I:1 → M:1
- Null → F:40 → G:1 → I:1 → M:1
- Null → G:35 → H:1 → I:1 → M:1
- Null → H:30
- Null → I:20

## FP-Growth Mining Example: freq_itemset = { M }, i = A

pattern = { A M }, support = 1; the pattern base yields cFP-Tree = null, so no recursion.

## FP-Growth Mining Example: freq_itemset = { M }, i = B

pattern = { B M }, support = 2; construct the pattern base's cFP-Tree and recurse.

## FP-Growth Mining Example: conditional tree for { B M }

Conditional FP-Tree: a single path with G:2, B:2, M:2 (header: G B M). Patterns mined: BM, GM, BGM. All patterns so far: BM, GM, BGM.

## FP-Growth Mining Example: freq_itemset = { M }, i = F

pattern = { F M }, support = 1; the pattern base yields cFP-Tree = null, so no recursion.

## FP-Growth Mining Example: freq_itemset = { M }, i = G

pattern = { G M }, support = 4; construct the pattern base's cFP-Tree and recurse.

## FP-Growth Mining Example: freq_itemset = { G M }, i = I

pattern = { I G M }, support = 3; construct the pattern base's cFP-Tree (header: I B G M) and recurse.

## FP-Growth Mining Example: conditional tree for { I G M }

Conditional FP-Tree: a single path with I:3, G:3, M:3 (header: I G M). Patterns mined: IM, GM, IGM. All patterns so far: BM, GM, BGM, IM, GM, IGM.

## FP-Growth Mining Example: freq_itemset = { G M }, i = B

pattern = { B G M }, support = 2; construct the pattern base's cFP-Tree and recurse.

## FP-Growth Mining Example: conditional tree for { B G M }

Conditional FP-Tree: a single path with B:2, G:2, M:2 (header: B G M). Patterns mined: BM, GM, BGM. All patterns so far: BM, GM, BGM, IM, GM, IGM, BM, GM, BGM.

## FP-Growth Mining Example: freq_itemset = { M }, i = H

pattern = { H M }, support = 1; the pattern base yields cFP-Tree = null, so no recursion.

## FP-Growth Mining Example: Wrap-Up

Complete the pass for { M }, then move on to { I }, { H }, etc. Final frequent sets:

- L2: { BM, GM, IM, GM, BM, GM, IM, GM, GI, BG }
- L3: { BGM, GIM, BGM, GIM }
- L4: none

## FP-Growth Redundancy vs. Apriori Candidate Sets

FP-Growth (support = 2) generates:

- L2: 10 sets (5 distinct)
- L3: 4 sets (2 distinct)
- Total: 14 sets

Apriori (support = 2) generates:

- C2: 21 sets
- C3: 2 sets
- Total: 23 sets

What about support = 1? Apriori: 23 sets; FP-Growth: 28 sets in { M } with i = A, B, F alone!
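Apriori's C2 count follows directly from the number of frequent 1-itemsets: every pair of the example's seven header items (A, B, F, G, H, I, M) is a candidate, which a one-liner confirms:

```python
from math import comb

# 7 frequent 1-itemsets give C(7, 2) candidate pairs for Apriori's C2.
c2 = comb(7, 2)
```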

## FP-Growth vs. Apriori

- Apriori visits each transaction when generating a new candidate set; FP-Growth does not (data structures can be used to reduce the transaction list).
- FP-Growth traces the set of concurrent items; Apriori generates candidate sets.
- FP-Growth uses more complicated data structures and mining techniques.

## Algorithm Analysis Results

- FP-Growth is NOT inherently faster than Apriori, though intuitively it appears to condense data.
- The mining scheme requires new work to replace candidate-set generation; recursion obscures the additional effort.
- FP-Growth may run faster than Apriori in some circumstances, but complexity alone gives no guarantee of which algorithm is more efficient.

## Improvements to FP-Growth

- None currently reported, apart from MLFPT (Multiple Local Frequent Pattern Tree): a new algorithm based on FP-Growth that distributes FP-Trees among processors.
- No reports of complexity analysis or accuracy relative to FP-Growth.

## Real-World Applications

Zheng, Kohavi, and Mason, "Real World Performance of Association Rule Algorithms":

- collected implementations of Apriori, FP-Growth, CLOSET, CHARM, and MagnumOpus
- tested the implementations against 1 artificial and 3 real datasets
- generated time-based comparisons

## Apriori & FP-Growth Implementations

- Apriori: implementation from its creator, Christian Borgelt (GNU Public License); written in C; entire dataset loaded into memory
- FP-Growth: implementation from its creators, Han & Pei; version of February 5, 2001

## Other Algorithms

- CHARM: based on the concept of closed itemsets; e.g., for { A, B, C }: AB → C, A → C, B → C, AC → B, A → B, C → B, etc.
- CLOSET: Han & Pei's implementation of closed-itemset mining
- MagnumOpus: generates rules directly through a search-and-prune technique
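The closed-itemset idea behind CHARM and CLOSET can be illustrated with a small sketch: an itemset is closed when no proper superset has the same support. The supports below are hypothetical example data, not from the study:

```python
def closed_itemsets(supports):
    # Keep only itemsets for which no proper superset (s < s2 tests
    # proper-subset on frozensets) has equal support.
    return {s: sup for s, sup in supports.items()
            if not any(sup == sup2 and s < s2
                       for s2, sup2 in supports.items())}

# If every transaction containing A also contains B, then {A} is not
# closed: {A, B} has the same support.
supports = {frozenset("A"): 3, frozenset("AB"): 3, frozenset("ABC"): 2}
closed = closed_itemsets(supports)
```

Mining only closed itemsets loses no support information, which is why these algorithms can report far fewer sets than plain frequent-itemset miners.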

## Datasets

- IBM-Artificial: generated at IBM Almaden (T10I4D100K); often used in association rule mining studies
- BMS-POS: years of point-of-sale data from a retailer
- BMS-WebView-1 & BMS-WebView-2: months of clickstream traffic from e-commerce web sites

## Dataset Characteristics

[Table of dataset characteristics not preserved in the transcript]

## Experimental Considerations

- Hardware: dual 550 MHz Pentium III Xeon processors, 1 GB memory
- Support levels: { 1.00%, 0.80%, 0.60%, 0.40%, 0.20%, 0.10%, 0.08%, 0.06%, 0.04%, 0.02%, 0.01% }
- Confidence = 0%
- No other applications running (the second processor handles system processes)

[Runtime charts for the IBM-Artificial and BMS-POS datasets not preserved in the transcript]

[Runtime charts for the BMS-WebView-1 and BMS-WebView-2 datasets not preserved in the transcript]

## Study Results – Real Data

- At support ≥ 0.20%, Apriori performs as fast as or better than FP-Growth.
- At support < 0.20%, Apriori completes whenever FP-Growth completes (exception: BMS-WebView-2 at 0.01%).
- Even when 2 million rules are generated, Apriori finishes in 10 minutes or less.
- Proposed: the bottleneck is NOT the rule algorithm, but rule analysis.

## Real Data Results

[Results chart not preserved in the transcript]

## Study Results – Artificial Data

- At support < 0.40%, FP-Growth performs MUCH faster than Apriori.
- At support ≥ 0.40%, FP-Growth and Apriori are comparable.

## Real-World Study Conclusions

- FP-Growth (and other non-Apriori algorithms) perform better on artificial data.
- On all datasets, Apriori performs sufficiently well, in reasonable time, for reasonable result sets.
- FP-Growth may be suitable when low support, large result counts, and fast generation are needed.
- Future research may best be directed toward analyzing association rules.

## Research Conclusions

- FP-Growth does not have better complexity than Apriori, although common sense suggests it will run faster.
- FP-Growth does not always have a better running time than Apriori; support level and dataset appear more influential.
- FP-Trees are very complex structures (Apriori's are simple).
- Location of data (memory vs. disk) is a non-factor in comparing the algorithms.

## To Use or Not To Use?

Question: should FP-Growth be used in favor of Apriori? Considerations include difficulty to code, high performance in extreme cases, and personal preference. More relevant questions:

- What kind of data is it?
- What kind of results do I want?
- How will I analyze the resulting rules?

## References

- Han, Jiawei, Pei, Jian, and Yin, Yiwen, "Mining Frequent Patterns without Candidate Generation". In Proc. of the ACM SIGMOD Int. Conf. on Management of Data, pages 1-12, Dallas, Texas, USA, 2000.
- Orlando, Salvatore, "High Performance Mining of Short and Long Patterns". 2001.
- Pei, Jian, Han, Jiawei, and Mao, Runying, "CLOSET: An Efficient Algorithm for Mining Frequent Closed Itemsets". In SIGMOD Int'l Workshop on Data Mining and Knowledge Discovery, May 2000.
- Webb, Geoffrey L., "Efficient Search for Association Rules". In Proceedings of the Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 99-107, 2000.
- Zaiane, Osmar, El-Hajj, Mohammad, and Lu, Paul, "Fast Parallel Association Rule Mining Without Candidacy Generation". In Proc. of the IEEE 2001 International Conference on Data Mining (ICDM 2001), San Jose, CA, USA, November 29 - December 2, 2001.
- Zaki, Mohammed J., "Generating Non-Redundant Association Rules". In Proceedings of the International Conference on Knowledge Discovery and Data Mining, 2000.
- Zheng, Zijian, Kohavi, Ron, and Mason, Llew, "Real World Performance of Association Rule Algorithms". In Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, California, August 2001.
