732A02 Data Mining - Clustering and Association Analysis


FP growth algorithm. Correlation analysis.
Jose M. Peña, jospe@ida.liu.se

FP growth algorithm
Apriori = candidate generate-and-test.
Problems:
- Too many candidates to generate, e.g. if there are 10^4 frequent 1-itemsets, then there are more than 10^7 candidate 2-itemsets.
- Each candidate implies expensive operations, e.g. pattern matching and subset checking.
Can candidate generation be avoided? Yes, with the frequent pattern (FP) growth algorithm.
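To put the candidate count in perspective, here is a quick check, using only Python's standard library, of how many unordered pairs can be formed from 10^4 frequent items. The variable names are illustrative and not part of the slides.

```python
from math import comb

# Number of frequent 1-itemsets assumed in the slide's example.
n_frequent_items = 10**4

# Every unordered pair of frequent items is a candidate 2-itemset.
n_candidate_pairs = comb(n_frequent_items, 2)

print(n_candidate_pairs)  # 49995000, i.e. more than 10**7
```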

FP growth algorithm
min_support = 3

  TID   Items bought                Items bought (f-list ordered)
  100   {f, a, c, d, g, i, m, p}    {f, c, a, m, p}
  200   {a, b, c, f, l, m, o}       {f, c, a, b, m}
  300   {b, f, h, j, o, w}          {f, b}
  400   {b, c, k, s, p}             {c, b, p}
  500   {a, f, c, e, l, p, m, n}    {f, c, a, m, p}

1. Scan the database once and find the frequent items. Record them as the frequent 1-itemsets.
2. Sort the frequent items in descending order of frequency: f-list = f-c-a-b-m-p.
3. Scan the database again and construct the FP-tree.

Header table (each entry also holds a head pointer to the first tree node of that item):

  Item   Frequency
  f      4
  c      4
  a      3
  b      3
  m      3
  p      3

FP-tree (indentation shows parent/child):

  {}
    f:4
      c:3
        a:3
          m:2
            p:2
          b:1
            m:1
      b:1
    c:1
      b:1
        p:1
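The two scans described above can be sketched in Python as follows. This is a minimal illustration, not the lecturer's implementation: the names `Node`, `build_fp_tree`, `DB` and `F_LIST` are made up for the example, and the order of equally frequent items is simply copied from the slide because ties can be broken arbitrarily.

```python
from collections import Counter

# Example database from the slide (each string is a set of one-letter items).
DB = {100: "facdgimp", 200: "abcflmo", 300: "bfhjow", 400: "bcksp", 500: "afcelpmn"}
MIN_SUPPORT = 3
F_LIST = "fcabmp"  # assumption: tie order (f/c and a/b/m/p) taken from the slide


class Node:
    """One FP-tree node: an item, its count, its parent and its children."""

    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count = 0
        self.children = {}


def build_fp_tree(transactions, f_list):
    """Second scan: insert every transaction, restricted to f_list and sorted by it."""
    rank = {item: i for i, item in enumerate(f_list)}
    root = Node(None, None)
    header = {item: [] for item in f_list}  # item -> list of its tree nodes
    for t in transactions:
        node = root
        for item in sorted((i for i in t if i in rank), key=rank.get):
            child = node.children.get(item)
            if child is None:
                child = Node(item, node)
                node.children[item] = child
                header[item].append(child)
            child.count += 1
            node = child
    return root, header


# First scan: count items and keep the frequent ones.
item_counts = Counter(item for t in DB.values() for item in t)
frequent = {i: n for i, n in item_counts.items() if n >= MIN_SUPPORT}

root, header = build_fp_tree(DB.values(), F_LIST)


def show(node, depth=0):
    """Print the tree with one level of indentation per depth."""
    for child in node.children.values():
        print("  " * depth + f"{child.item}:{child.count}")
        show(child, depth + 1)


show(root)
```

Here the header table is kept as a plain list of nodes per item, which plays the role of the chained node-links in the slide.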

FP growth algorithm
For each frequent item in the header table:
1. Traverse the tree by following the corresponding link.
2. Record all prefix paths leading to the item. This is the item's conditional pattern base.

Conditional pattern bases:

  Item   Conditional pattern base
  c      f:3
  a      fc:3
  b      fca:1, f:1, c:1
  m      fca:2, fcab:1
  p      fcam:2, cb:1

Frequent itemsets found: f:4, c:4, a:3, b:3, m:3, p:3
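The conditional pattern bases can be reproduced with a few lines of Python. As a simplification, this sketch reads the prefixes directly from the f-list ordered transactions instead of following node-links in the FP-tree; both give the same prefix paths and counts. The names `ORDERED_DB` and `conditional_pattern_bases` are illustrative.

```python
from collections import Counter, defaultdict

# Transactions from the slide, already restricted to frequent items and f-list ordered.
ORDERED_DB = ["fcamp", "fcabm", "fb", "cbp", "fcamp"]


def conditional_pattern_bases(ordered_transactions):
    """Map each item to a Counter of the prefix paths that lead to it."""
    bases = defaultdict(Counter)
    for t in ordered_transactions:
        for pos, item in enumerate(t):
            prefix = t[:pos]
            if prefix:  # an item at the root level has no prefix path
                bases[item][prefix] += 1
    return bases


bases = conditional_pattern_bases(ORDERED_DB)
for item in "cabmp":
    print(item, dict(bases[item]))
```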

FP growth algorithm
For each conditional pattern base, start the process again (recursion).

m-conditional pattern base: fca:2, fcab:1
m-conditional FP-tree: {} - f:3 - c:3 - a:3 (b is dropped because its count, 1, is below min_support)
Frequent itemsets found: fm:3, cm:3, am:3

am-conditional pattern base: fc:3
am-conditional FP-tree: {} - f:3 - c:3
Frequent itemsets found: fam:3, cam:3

cam-conditional pattern base: f:3
cam-conditional FP-tree: {} - f:3
Frequent itemset found: fcam:3

Backtracking then explores the remaining branches, e.g. the cm-conditional pattern base f:3 gives fcm:3.
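The recursion can be sketched compactly in Python. Note the simplification: instead of materialising a conditional FP-tree at every step, this version recurses directly on conditional pattern bases, represented as lists of (prefix, count) pairs. It follows the same depth-first pattern growth but it is not a full FP-tree implementation, and the names `fp_growth` and `ORDERED_DB` are illustrative.

```python
from collections import Counter

# f-list ordered transactions from the slides, each with count 1.
ORDERED_DB = [("fcamp", 1), ("fcabm", 1), ("fb", 1), ("cbp", 1), ("fcamp", 1)]


def fp_growth(pattern_base, min_sup, suffix=()):
    """Return {frozenset(itemset): support} for all frequent itemsets ending in suffix.

    pattern_base is a list of (items, count) pairs, where items is a string of
    one-letter items in f-list order.
    """
    counts = Counter()
    for items, n in pattern_base:
        for item in items:
            counts[item] += n

    result = {}
    for item, support in counts.items():
        if support < min_sup:
            continue
        pattern = (item,) + suffix
        result[frozenset(pattern)] = support
        # Conditional pattern base of `pattern`: what precedes `item` on each path.
        cond = [(items[: items.index(item)], n) for items, n in pattern_base if item in items]
        cond = [(prefix, n) for prefix, n in cond if prefix]
        result.update(fp_growth(cond, min_sup, pattern))
    return result


frequent_itemsets = fp_growth(ORDERED_DB, min_sup=3)
for itemset, support in sorted(frequent_itemsets.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print("".join(sorted(itemset)), support)
```

On the example database this yields the itemsets listed on the slides (fm, cm, am, fam, cam, fcam) together with those found after backtracking, such as fcm, cp and fc.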

FP growth algorithm

FP growth algorithm
With a small support threshold, a candidate generate-and-test algorithm such as Apriori produces many long candidates, which implies a long runtime due to expensive operations such as pattern matching and subset checking.

FP growth algorithm
Exercise: Run the FP growth algorithm on the following database (min_sup = 2).

  TID   Items bought
  100   {1, 2, 5}
  200   {2, 4}
  300   {2, 3}
  400   {1, 2, 4}
  500   {1, 3}
  600   {2, 3}
  700   {1, 3}
  800   {1, 2, 3, 5}
  900   {1, 2, 3}

Frequent itemsets
Frequent itemsets can be represented as a tree (the children of a node are a subset of its siblings).
Different algorithms traverse the tree differently, e.g.
- Apriori algorithm = breadth first.
- FP growth algorithm = depth first.
Breadth-first algorithms typically cannot store the projections and thus have to scan the database more times. The opposite is typically true for depth-first algorithms.
Breadth-first search is typically less efficient but more scalable; depth-first search is typically more efficient but less scalable.
[Figure: tree of frequent itemsets, min_sup = 3]
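For contrast with the depth-first recursion of FP growth, here is a small breadth-first (level-wise) sketch in the style of Apriori, run on the five-transaction database from the FP-tree slide. It is a simplified illustration rather than an optimised implementation; the function name `apriori` and the `scans` counter are made up for the example. The counter shows that one pass over the database is needed per level of the itemset tree.

```python
from itertools import combinations

# Same example database as on the FP-tree slide.
DB = ["facdgimp", "abcflmo", "bfhjow", "bcksp", "afcelpmn"]


def apriori(transactions, min_sup):
    """Level-wise search: returns ({itemset: support}, number of database scans)."""
    transactions = [frozenset(t) for t in transactions]
    level = [frozenset([i]) for i in sorted({i for t in transactions for i in t})]
    frequent, scans = {}, 0
    while level:
        scans += 1  # one full pass over the database per level
        counts = {c: sum(c <= t for t in transactions) for c in level}
        current = {c: n for c, n in counts.items() if n >= min_sup}
        frequent.update(current)
        # Join frequent k-itemsets into (k+1)-candidates, then prune every
        # candidate that has an infrequent k-subset (Apriori principle).
        k = len(level[0]) + 1
        joined = {a | b for a in current for b in current if len(a | b) == k}
        level = [c for c in joined
                 if all(frozenset(s) in current for s in combinations(c, k - 1))]
    return frequent, scans


frequent, scans = apriori(DB, min_sup=3)
print(len(frequent), "frequent itemsets found in", scans, "scans")
```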

Correlation analysis

               Milk   Not milk   Sum (row)
  Cereal       2000   1750       3750
  Not cereal   1000    250       1250
  Sum (col.)   3000   2000       5000

Milk ⇒ cereal [40%, 66.7%] is misleading/uninteresting: the overall percentage of students buying cereal is 75% > 66.7%.
Milk ⇒ not cereal [20%, 33.3%] is more accurate (25% < 33.3%).
Measure of dependent/correlated events: lift for X ⇒ Y,

$$\mathrm{lift}(X, Y) = \frac{P(X, Y)}{P(X)\,P(Y)}$$

where P(X, Y) is the fraction of transactions containing both X and Y.
lift > 1: positive correlation; lift < 1: negative correlation; lift = 1: independence.
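The lift values for the two rules can be checked numerically. The sketch below assumes the usual definition lift(X, Y) = P(X, Y) / (P(X) P(Y)) and takes the counts from the contingency table; the function name and its parameters are illustrative.

```python
def lift(n_xy, n_x, n_y, n_total):
    """Lift of X => Y from raw counts: P(X, Y) / (P(X) * P(Y))."""
    return (n_xy / n_total) / ((n_x / n_total) * (n_y / n_total))


# Counts from the table: 5000 students, 3000 buy milk, 3750 buy cereal.
lift_milk_cereal = lift(n_xy=2000, n_x=3000, n_y=3750, n_total=5000)
lift_milk_not_cereal = lift(n_xy=1000, n_x=3000, n_y=1250, n_total=5000)

print(round(lift_milk_cereal, 3))      # 0.889 -> negative correlation
print(round(lift_milk_not_cereal, 3))  # 1.333 -> positive correlation
```

The first value is below 1, which is why the rule milk ⇒ cereal is misleading despite its 66.7% confidence.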

Correlation analysis
Exercise: Find an example where A ⇒ C has lift(A, C) < 1, but A, B ⇒ C has lift(A, B, C) > 1.