Parallel Mining Frequent Patterns: A Sampling-based Approach
Shengnan Cong


2 Talk Outline
• Background
  – Frequent pattern mining
  – Serial algorithm
• Parallel frequent pattern mining
  – Parallel framework
  – Load balancing problem
• Experimental results
• Optimization
• Summary

3 Frequent Pattern Analysis
• Frequent pattern
  – A pattern (a set of items, subsequences, substructures, etc.) that occurs frequently in a data set
• Motivation
  – To find the inherent regularities in data
• Applications
  – Basket data analysis: What products were often purchased together?
  – DNA sequence analysis: What kinds of DNA are sensitive to this new drug?
  – Web log analysis: Can we automatically classify web documents?

4 Frequent Itemset Mining
• Itemset
  – A collection of one or more items, e.g. {Milk, Juice}
  – k-itemset: an itemset that contains k items
• Transaction
  – An itemset
  – A dataset is a collection of transactions
• Support
  – The number of transactions containing an itemset, e.g. Support({Milk, Juice}) = 2 in the dataset below
• Frequent-itemset mining
  – Output all itemsets whose support is no less than a predefined threshold

Example (support threshold = 2):

  Transaction | Items
  T1          | Milk, Bread, Cookies, Juice
  T2          | Milk, Juice
  T3          | Milk, Eggs
  T4          | Bread, Cookies, Coffee

Frequent itemsets: {Milk}:3, {Bread}:2, {Cookies}:2, {Juice}:2, {Milk, Juice}:2, {Bread, Cookies}:2

5 Frequent Itemset Mining
• Frequent itemset mining is computationally expensive.
• Brute-force approach (sketched in code below):
  – Given d items, there are 2^d possible candidate itemsets.
  – Count the support of each candidate by scanning the database, matching each transaction against every candidate.
  – Complexity ~ O(NMW), with N transactions, M candidates, and maximum transaction width W => expensive since M = 2^d.
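
To make the cost concrete, here is a minimal brute-force sketch in Python. It is illustrative only, not the implementation discussed in the talk, and the function and variable names are ours.

    # Brute-force frequent-itemset mining: enumerate all 2^d candidates and
    # count each one's support with a full database scan (the O(N*M*W) cost
    # described on the slide).
    from itertools import combinations

    def brute_force_frequent_itemsets(transactions, min_support):
        items = sorted({item for t in transactions for item in t})
        frequent = {}
        # M = 2^d candidates: every non-empty subset of the item universe
        for k in range(1, len(items) + 1):
            for candidate in combinations(items, k):
                cand = set(candidate)
                # N transactions scanned per candidate, up to W comparisons each
                support = sum(1 for t in transactions if cand <= t)
                if support >= min_support:
                    frequent[frozenset(cand)] = support
        return frequent

    transactions = [
        {"Milk", "Bread", "Cookies", "Juice"},
        {"Milk", "Juice"},
        {"Milk", "Eggs"},
        {"Bread", "Cookies", "Coffee"},
    ]
    print(brute_force_frequent_itemsets(transactions, min_support=2))
    # -> {Milk}:3, {Bread}:2, {Cookies}:2, {Juice}:2, {Milk, Juice}:2, {Bread, Cookies}:2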

6 Mining Frequent Itemsets in Serial
• FP-growth algorithm [Han et al. 2000]
  – One of the most efficient serial algorithms for mining frequent itemsets [FIMI'03].
  – A divide-and-conquer algorithm.
• Mining process of FP-growth (steps 1–2 and step 3 are sketched in code after the examples below):
  – Step 1: Identify frequent 1-items with one scan of the dataset.
  – Step 2: Construct a tree structure (FP-tree) for the dataset with another dataset scan.
  – Step 3: Traverse the FP-tree and construct a projection (sub-tree) for each frequent 1-item. Recursively mine the projections.

7 Example of FP-growth Algorithm
Step 1: Identify frequent 1-items with one scan of the dataset (support threshold = 3).

Input dataset:

  TID | Items
  100 | {f, a, c, d, g, i, m, p}
  200 | {a, b, c, f, l, m, o}
  300 | {b, f, h, j, o}
  400 | {b, c, k, s, p}
  500 | {a, f, c, e, l, p, m, n}

Frequent 1-items and their frequencies:

  Item | Frequency
  f    | 4
  c    | 4
  a    | 3
  b    | 3
  m    | 3
  p    | 3

8 Example of FP-growth Algorithm (cont'd)
Step 2: Build the FP-tree with a second dataset scan. Each transaction's frequent items are sorted in descending frequency order and inserted into the tree, sharing prefixes with previously inserted transactions (support threshold = 3).

  TID | Items bought             | Ordered frequent items
  100 | {f, a, c, d, g, i, m, p} | {f, c, a, m, p}
  200 | {a, b, c, f, l, m, o}    | {f, c, a, b, m}
  300 | {b, f, h, j, o}          | {f, b}
  400 | {b, c, k, s, p}          | {c, b, p}
  500 | {a, f, c, e, l, p, m, n} | {f, c, a, m, p}

Resulting FP-tree (node:count), with a header table linking each frequent 1-item to its occurrences in the tree:

  root
  ├─ f:4
  │   ├─ c:3
  │   │   └─ a:3
  │   │       ├─ m:2
  │   │       │   └─ p:2
  │   │       └─ b:1
  │   │           └─ m:1
  │   └─ b:1
  └─ c:1
      └─ b:1
          └─ p:1
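
A minimal Python sketch of steps 1–2 (illustrative only; the talk's implementation is in C++ with MPI, and the class and function names here are ours):

    # FP-growth steps 1-2: count single items in one scan, then build the
    # FP-tree in a second scan by inserting each transaction's frequent items
    # in descending frequency order.
    from collections import defaultdict

    class FPNode:
        def __init__(self, item, parent):
            self.item = item        # item label, None for the root
            self.count = 0          # number of transactions sharing this prefix
            self.parent = parent
            self.children = {}      # item -> FPNode

    def build_fp_tree(transactions, min_support):
        # Step 1: one scan to count single items
        counts = defaultdict(int)
        for t in transactions:
            for item in t:
                counts[item] += 1
        freq = {i: c for i, c in counts.items() if c >= min_support}

        # Step 2: second scan to insert the ordered frequent items into the tree
        root = FPNode(None, None)
        header = defaultdict(list)  # item -> list of its nodes (the side links)
        for t in transactions:
            ordered = sorted((i for i in t if i in freq),
                             key=lambda i: (-freq[i], i))
            node = root
            for item in ordered:
                child = node.children.get(item)
                if child is None:
                    child = FPNode(item, node)
                    node.children[item] = child
                    header[item].append(child)
                child.count += 1
                node = child
        return root, header, freq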

9 Example of FP-growth Algorithm (cont'd)
Step 3: For each frequent 1-item, traverse the FP-tree by following the item's side links and accumulate its prefix paths. Build an FP-tree (projection) for the accumulated prefix paths. If the projection contains only one path, enumerate all combinations of its items; otherwise, recursively mine the projection.

Prefix paths of each frequent 1-item:

  Item | Prefix paths
  c    | f:3
  a    | fc:3
  b    | fca:1, f:1, c:1
  m    | fca:2, fcab:1
  p    | fcam:2, cb:1

Example for item m:
  m's prefix paths: fca:2, fcab:1
  m-projection: a single path root - f:3 - c:3 - a:3 (b is dropped because its support in the prefix paths, 1, is below the threshold 3)
  All frequent patterns containing m: m, fm, cm, am, fcm, fam, cam, fcam
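
Continuing the sketch above, step 3 can be written as a recursive function (again illustrative only; the optional `items` argument is our own addition, used by the parallel and sampling sketches later to mine only a subset of the 1-items):

    # FP-growth step 3: for each frequent 1-item, gather its conditional
    # pattern base by walking the side links upward, build a conditional
    # FP-tree, and recurse.
    def fp_growth(root, header, freq, min_support, suffix=frozenset(), items=None):
        patterns = {}
        for item in (items if items is not None else list(freq)):
            new_suffix = suffix | {item}
            patterns[new_suffix] = freq[item]
            # Conditional pattern base: the prefix path of every node labelled `item`
            conditional = []
            for node in header[item]:
                path, parent = [], node.parent
                while parent is not None and parent.item is not None:
                    path.append(parent.item)
                    parent = parent.parent
                conditional.extend([path] * node.count)
            if any(conditional):
                sub_root, sub_header, sub_freq = build_fp_tree(conditional, min_support)
                patterns.update(fp_growth(sub_root, sub_header, sub_freq,
                                          min_support, new_suffix))
        return patterns

    transactions = [
        ["f", "a", "c", "d", "g", "i", "m", "p"],
        ["a", "b", "c", "f", "l", "m", "o"],
        ["b", "f", "h", "j", "o"],
        ["b", "c", "k", "s", "p"],
        ["a", "f", "c", "e", "l", "p", "m", "n"],
    ]
    root, header, freq = build_fp_tree(transactions, min_support=3)
    print(fp_growth(root, header, freq, min_support=3))
    # Includes {m}:3, {f,m}:3, {c,m}:3, {a,m}:3, {f,c,m}:3, ..., {f,c,a,m}:3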

10 Talk Outline
• Background
  – Frequent-itemset mining
  – Serial algorithm
• Parallel frequent pattern mining
  – Parallel framework
  – Load balancing problem
• Experimental results
• Optimization
• Summary

11 Parallel Mining of Frequent Itemsets
• FP-growth algorithm (serial steps, with their share of total mining time):
  – Identify frequent single items (1.31%)
  – Build the tree structure for the whole DB (1.91%)
  – Make a projection for each frequent single item from the tree structure and mine the projection (96.78%)
• Parallelization framework for FP-growth (divide and conquer; a sketch follows below):
  – Identify frequent single items in parallel.
  – Partition the frequent single items and assign each subset of frequent items to a processor.
  – Each processor builds the tree structure related to its assigned items from the DB.
  – Each processor makes projections for its assigned items from its local tree and mines the projections independently.
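
A minimal sketch of this framework, with Python's multiprocessing standing in for the MPI processes of the actual C++ implementation; the round-robin item assignment and all names here are ours, and the sketch reuses build_fp_tree and fp_growth from the slides above.

    # Parallel FP-growth sketch: partition the frequent 1-items across workers;
    # each worker builds the tree it needs and mines its projections independently.
    from multiprocessing import Pool

    def mine_partition(args):
        transactions, min_support, assigned_items = args
        root, header, freq = build_fp_tree(transactions, min_support)
        return fp_growth(root, header, freq, min_support, items=assigned_items)

    def parallel_fp_growth(transactions, min_support, n_workers):
        # Identify the frequent single items (done in parallel in the real system).
        _, _, freq = build_fp_tree(transactions, min_support)
        items = sorted(freq)
        # Static round-robin partition of the 1-items; the load-balancing problem
        # discussed next arises exactly from this kind of naive assignment.
        partitions = [items[w::n_workers] for w in range(n_workers)]
        with Pool(n_workers) as pool:
            results = pool.map(mine_partition,
                               [(transactions, min_support, p) for p in partitions])
        merged = {}
        for r in results:
            merged.update(r)
        return merged

    if __name__ == "__main__":
        # Reuses the example transactions defined in the step-3 sketch above.
        print(parallel_fp_growth(transactions, min_support=3, n_workers=2))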

12 Load Balancing Problem
• Reason:
  – A few large projections take a long time to mine relative to the overall mining time of the dataset.
• Solution:
  – The larger projections must be partitioned further.
• Challenge:
  – Identifying the larger projections in advance.

13 How to Identify the Larger Projections?
• Static estimation
  – Based on dataset parameters: number of items, number of transactions, length of transactions, …
  – Based on the characteristics of the projection: depth, bushiness, tree size, number of leaves, fan-out, fan-in, …
• Result: no correlation found with any of the above.

14 Dynamic Estimation
• Runtime sampling
  – Use the relative mining time on a sample to estimate the relative mining time on the whole dataset.
  – Trade-off: accuracy vs. overhead.
• Random sampling: randomly select a subset of records (a sketch follows below).
  – Not accurate, e.g. Pumsb with 1% random sampling (overhead 1.03%).
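
A minimal sketch of the random-sampling baseline (illustrative only; the names are ours and it reuses build_fp_tree and fp_growth from the earlier sketches):

    # Random-sampling baseline: mine a 1% random subset of the transactions and
    # time each 1-item's projection there; as the slide notes, these estimates
    # correlate poorly with the mining times on the full dataset.
    import random, time

    def random_sample(transactions, fraction=0.01, seed=0):
        rng = random.Random(seed)
        return [t for t in transactions if rng.random() < fraction]

    def estimate_times_on(sample, min_support):
        # (A real estimator would scale min_support with the sample size;
        # it is kept fixed here for simplicity.)
        root, header, freq = build_fp_tree(sample, min_support)
        estimates = {}
        for item in freq:
            start = time.perf_counter()
            fp_growth(root, header, freq, min_support, items=[item])
            estimates[item] = time.perf_counter() - start
        return estimates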

15 Selective Sampling
• Sampling based on frequency (a sketch follows below):
  – Discard the infrequent items.
  – Discard a fraction t of the most frequent 1-items.
  – Example (Sup_min = 2, t = 20%): in transactions such as {f, a, c, d, g, i, m, p} and {f, a, b, g, i, p}, the infrequent items and the top 20% most frequent 1-items are removed before mining the sample.
• When filtering the top 20% of items, sampling takes on average 1.41% of the sequential mining time and still provides fairly good accuracy.
  – e.g. Pumsb, top 20% (overhead 0.71%).
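
A minimal sketch of selective sampling (illustrative only; `selective_sample` is our own name, and it reuses `estimate_times_on` from the random-sampling sketch above):

    # Selective sampling: keep every transaction but drop the infrequent items
    # and the top fraction t of the most frequent 1-items, then estimate each
    # projection's relative mining time on the filtered dataset.
    def selective_sample(transactions, min_support, t=0.2):
        _, _, freq = build_fp_tree(transactions, min_support)
        ranked = sorted(freq, key=lambda i: -freq[i])       # most frequent first
        dropped = set(ranked[: int(len(ranked) * t)])
        return [[i for i in trans if i in freq and i not in dropped]
                for trans in transactions]

    sample = selective_sample(transactions, min_support=3, t=0.2)
    print(estimate_times_on(sample, min_support=3))
    # The relative times guide how the projections are assigned to processors.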

16 Why Does Selective Sampling Work?
• The mining time is proportional to the number of frequent itemsets in the result (observed experimentally).
• Given a frequent L-itemset, all of its subsets are frequent itemsets, and there are 2^L - 1 non-empty subsets.
• Removing one item near the root reduces the total number of itemsets in the result, and therefore the mining time, roughly by half.
• The most frequent items are close to the root. The mining time of their projections is negligible, but their presence increases the number of itemsets in the results.

17 Talk Outline
• Background
  – Frequent-itemset mining
  – Serial algorithm
• Parallel frequent pattern mining
  – Parallel framework
  – Load balancing problem
• Experimental results
• Optimization
• Summary

18 Experimental Setups
• A Linux cluster with 64 nodes.
  – 1 GHz Pentium III processor and 1 GB memory per node.
• Implementation: C++ and MPI.
• Datasets:

  Dataset     | #Transactions | #Items | Max Trans. Length
  mushroom    | 8,124         |        | 23
  connect     | 57,557        |        | 43
  pumsb       | 49,046        | 7,116  | 74
  pumsb_star  | 49,046        | 7,116  | 63
  T40I10D100K | 100,000       |        |
  T50I5D500K  | 500,000       | 5,000  | 94

19 Speedups: Frequent-Itemset Mining
(Figure: speedup curves for parallel frequent-itemset mining on the datasets above; annotation: needs multi-level task partitioning.)

20 Experimental Results for Selective Sampling
• Overhead of selective sampling (average 1.4%):

  Dataset  | Mushroom | Connect | Pumsb | Pumsb_star | T40I10D100K | T50I5D500K
  Overhead | 0.71%    | 1.80%   | 0.71% | 2.90%      | 2.05%       | 0.28%

• Effectiveness of selective sampling:
  – Selective sampling can improve the performance by 37% on average.

21 Optimization: Multi-level Partitioning
• Problem analysis:
  – Total mining time: 576; maximal subtask with 1-level partitioning: 131.
  – Optimal speedup with 1-level partitioning: 576/131 = 4.4
• Conclusion:
  – We need multi-level task partitioning to obtain better speedup.
• Challenges:
  – How many levels are necessary?
  – Which sub-subtasks should be further partitioned?

22 Optimization: Multi-level Partitioning
• Observations (from the same example: total 576, maximal subtask with 1-level partitioning 131, maximal sub-subtask 65):
  – The mining time of the maximal subtask derived from a task is about 1/2 of the mining time of the task itself.
  – The mining time of the maximal subtask derived from the top-1 task is about the same as that of the top-2 task.
• Reason: there is one very long frequent pattern (of length L) in the dataset, and the mining time is proportional to the number of frequent itemsets in the result:
  – Task a contains about 2^(L-1) frequent itemsets; its maximal subtask ab contains about 2^(L-2).
  – Task b: about 2^(L-2); its maximal subtask bc: about 2^(L-3).
  – Task c: about 2^(L-3); its maximal subtask cd: about 2^(L-4).
  – ……
  – If we partition task a down to abcde, the derived subtask for abcde contains about 2^(L-5) itemsets, so its mining time is about 1/16 of that of task a.

23 Optimization: Multi-level Partitioning
• Multi-level partitioning heuristic (sketched in code below):
  – If a subtask's estimated mining time (based on selective sampling) is greater than a threshold, partition it so that the maximal estimated mining time of the derived sub-subtasks is less than that threshold. (The threshold is a function of N, the number of processors, M, the total number of tasks, and T_i, the mining time of a subtask.)
• Result: (figure: speedups with multi-level partitioning)
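
A minimal sketch of the heuristic (illustrative only). The slide's exact threshold formula is not reproduced here; the sketch assumes total estimated time divided by the number of processors as a stand-in, and the `split` callback, which returns the estimated times of a task's sub-subtasks (e.g. by projecting one more item as on the previous slide), is hypothetical.

    # Recursively partition any task whose estimated mining time exceeds the
    # threshold, until the largest remaining piece is below it.
    def partition_tasks(task_times, n_processors, split):
        # task_times: {task_id: estimated mining time from selective sampling}
        # split(task_id): {sub_task_id: estimated mining time} for its sub-subtasks
        threshold = sum(task_times.values()) / n_processors  # assumed stand-in formula
        final = {}
        work = list(task_times.items())
        while work:
            task, est = work.pop()
            if est > threshold:
                # Partition further; the derived sub-subtasks go back on the worklist.
                work.extend(split(task).items())
            else:
                final[task] = est
        return final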

24 Summary
• Data mining is an important application of parallel processing.
• We developed a framework for parallel frequent-itemset mining and achieved good speedups.
• We proposed the selective sampling technique to address the load-balancing problem and improved the speedups by 45% on average on 64 processors.

25 Questions?