732A02 Data Mining - Clustering and Association Analysis
Jose M. Peña
Association rules, Apriori algorithm, FP-growth algorithm

Association rules

- Goal: mine data for frequent patterns. In our case, the patterns are rules of the form antecedent ⇒ consequent, with only conjunctions of bought items in the antecedent and consequent, e.g. milk ∧ eggs ⇒ bread ∧ butter.
- Applications: e.g. market basket analysis (to support business decisions):
  - Rules with "Coke" in the consequent may help to decide how to boost the sales of Coke.
  - Rules with "bagels" in the antecedent may help to determine what happens if bagels are sold out.

Association rules

- Goal: find all the rules X ⇒ Y with minimum support and confidence, where
  - support = p(X, Y) = the probability that a transaction contains X ∪ Y, and
  - confidence = p(Y | X) = p(X, Y) / p(X) = the conditional probability that a transaction containing X also contains Y.
- Example: let sup_min = 50% and conf_min = 50% for the database below. This yields the association rules A ⇒ D (support 60%, confidence 100%) and D ⇒ A (support 60%, confidence 75%).

  Transaction-id   Items bought
  10               A, B, D
  20               A, C, D
  30               A, D, E
  40               B, E, F
  50               B, C, D, E, F

[Figure: Venn diagram of customers buying beer, customers buying diapers, and customers buying both.]
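As a quick sanity check, here is a minimal Python sketch (not part of the lecture; the function names are illustrative) that computes support and confidence on the example database above:

  transactions = [
      {"A", "B", "D"},            # 10
      {"A", "C", "D"},            # 20
      {"A", "D", "E"},            # 30
      {"B", "E", "F"},            # 40
      {"B", "C", "D", "E", "F"},  # 50
  ]

  def support(itemset):
      # Fraction of transactions containing every item of `itemset`.
      return sum(itemset <= t for t in transactions) / len(transactions)

  def confidence(antecedent, consequent):
      # p(Y | X) = p(X, Y) / p(X).
      return support(antecedent | consequent) / support(antecedent)

  print(support({"A", "D"}))       # 0.6
  print(confidence({"A"}, {"D"}))  # 1.0  -> A => D (60%, 100%)
  print(confidence({"D"}, {"A"}))  # 0.75 -> D => A (60%, 75%)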

Association rules

- Goal: find all the rules X ⇒ Y with minimum support and confidence.
- Solution:
  1. Find all sets of items (itemsets) with minimum support, i.e. the frequent itemsets (Apriori and FP-growth algorithms).
  2. Generate all the rules with minimum confidence from the frequent itemsets.
- Note (the downward closure or apriori property): any subset of a frequent itemset is frequent. Equivalently, any superset of an infrequent itemset is infrequent.

Association rules

- Frequent itemsets can be represented as a tree (the children of a node are a subset of its siblings).
- Different algorithms traverse the tree differently, e.g.
  - Apriori algorithm = breadth first.
  - FP-growth algorithm = depth first.
- Breadth-first algorithms typically cannot store the projections in memory and thus have to scan the database more times. The opposite is typically true for depth-first algorithms.
- Breadth first is typically less efficient but more scalable; depth first is typically more efficient but less scalable.

Apriori algorithm

1. Scan the database once to get the frequent 1-itemsets.
2. Generate candidate (k+1)-itemsets from the frequent k-itemsets.
3. Test the candidates against the database.
4. Terminate when no frequent or candidate itemsets can be generated; otherwise, go back to step 2.

Apriori algorithm

Example with sup_min = 2.

Database:

  Tid   Items
  10    A, C, D
  20    B, C, E
  30    A, B, C, E
  40    B, E

1st scan, C1 (candidate 1-itemsets with counts):
  {A}: 2, {B}: 3, {C}: 3, {D}: 1, {E}: 3

L1 (frequent 1-itemsets):
  {A}: 2, {B}: 3, {C}: 3, {E}: 3

C2 (candidate 2-itemsets):
  {A, B}, {A, C}, {A, E}, {B, C}, {B, E}, {C, E}

2nd scan, counts for C2:
  {A, B}: 1, {A, C}: 2, {A, E}: 1, {B, C}: 2, {B, E}: 3, {C, E}: 2

L2:
  {A, C}: 2, {B, C}: 2, {B, E}: 3, {C, E}: 2

C3 (by the apriori property, the only candidate is):
  {B, C, E}

3rd scan, L3:
  {B, C, E}: 2

Apriori algorithm

- How to generate candidates?
  - Step 1: self-join L_k.
  - Step 2: pruning (by the apriori property).
- Example of candidate generation:
  - L3 = {abc, abd, acd, ace, bcd}.
  - Self-join L3 * L3: abcd from abc and abd; acde from acd and ace.
  - Pruning: acde is removed because ade is not in L3.
  - C4 = {abcd}.

Apriori algorithm

Suppose the items in L_{k-1} are listed in an order.

1. Self-joining L_{k-1}:

   insert into C_k
   select p.item_1, p.item_2, ..., p.item_{k-1}, q.item_{k-1}
   from L_{k-1} p, L_{k-1} q
   where p.item_1 = q.item_1, ..., p.item_{k-2} = q.item_{k-2}, p.item_{k-1} < q.item_{k-1}

2. Pruning (by the apriori property):

   forall itemsets c in C_k do
     forall (k-1)-subsets s of c do
       if (s is not in L_{k-1}) then delete c from C_k
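For concreteness, here is a small Python sketch of the self-join and pruning steps above, assuming itemsets are stored as sorted tuples (generate_candidates is an illustrative name, not from the slides):

  from itertools import combinations

  def generate_candidates(Lk):
      # Lk: the frequent k-itemsets, each a sorted tuple.
      Lk = sorted(Lk)
      frequent = set(Lk)
      joined = []
      # Step 1: self-join -- merge two k-itemsets sharing their first k-1 items.
      for i, p in enumerate(Lk):
          for q in Lk[i + 1:]:
              if p[:-1] == q[:-1]:  # sorting guarantees p[-1] < q[-1]
                  joined.append(p + (q[-1],))
      # Step 2: pruning -- drop candidates with an infrequent k-subset
      # (apriori property).
      return [c for c in joined
              if all(s in frequent for s in combinations(c, len(c) - 1))]

  L3 = [("a", "b", "c"), ("a", "b", "d"), ("a", "c", "d"),
        ("a", "c", "e"), ("b", "c", "d")]
  print(generate_candidates(L3))  # [('a', 'b', 'c', 'd')]; acde is pruned

This reproduces the example on the previous slide: abcd survives, while acde is removed because ade is not in L3.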

Apriori algorithm

- C_k: candidate itemsets of size k.
- L_k: frequent itemsets of size k.

  L_1 = {frequent items}
  for (k = 1; L_k != ∅; k++) do begin
    C_{k+1} = candidates generated from L_k
    for each transaction t in the database do
      increment the count of all candidates in C_{k+1} that are contained in t
    L_{k+1} = candidates in C_{k+1} with minimum support
  end
  return ∪_k L_k

Exercise: prove that all the frequent (k+1)-itemsets are in C_{k+1}.
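The pseudocode can be turned into a compact, runnable Python sketch (illustrative, not the course's reference implementation); it is applied here to the worked example from the earlier slide with sup_min = 2:

  from collections import Counter
  from itertools import combinations

  def apriori(transactions, min_sup):
      transactions = [set(t) for t in transactions]
      counts = Counter(i for t in transactions for i in t)
      Lk = sorted((i,) for i, c in counts.items() if c >= min_sup)  # L1
      result = {c: counts[c[0]] for c in Lk}
      while Lk:
          prev = set(Lk)
          # C_{k+1}: self-join L_k, then prune with the apriori property.
          Ck = [p + (q[-1],)
                for i, p in enumerate(Lk) for q in Lk[i + 1:]
                if p[:-1] == q[:-1]
                and all(s in prev
                        for s in combinations(p + (q[-1],), len(p)))]
          # Scan the database: count the candidates contained in each transaction.
          support = Counter(c for t in transactions for c in Ck if set(c) <= t)
          Lk = sorted(c for c in Ck if support[c] >= min_sup)
          result.update({c: support[c] for c in Lk})
      return result

  db = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
  print(apriori(db, 2))
  # {('A',): 2, ('B',): 3, ('C',): 3, ('E',): 3, ('A', 'C'): 2,
  #  ('B', 'C'): 2, ('B', 'E'): 3, ('C', 'E'): 2, ('B', 'C', 'E'): 2}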

Association rules

- Generate all the rules of the form a ⇒ l − a with minimum confidence from a large (= frequent) itemset l.
- If a subset a of l does not generate a rule, then neither does any subset of a (≈ apriori property).

R. Agrawal, R. Srikant: "Fast Algorithms for Mining Association Rules", IBM Research Report RJ9839.

Association rules

- Alternatively, generate all the rules of the form l − h ⇒ h with minimum confidence from a large (= frequent) itemset l.
- For a subset h of a large itemset l to generate a rule, so must all the subsets of h (≈ apriori property).
- Hence, start by generating the rules with a one-item consequent, and then build larger consequents as in the Apriori algorithm's candidate generation (a sketch follows below).

R. Agrawal, R. Srikant: "Fast Algorithms for Mining Association Rules", IBM Research Report RJ9839.
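The sketch below generates rules from the frequent itemsets in Python (illustrative; for clarity it scores every possible consequent h directly rather than growing consequents apriori-style, and it assumes `frequent` maps sorted itemset tuples to support counts, e.g. the output of the Apriori sketch above):

  from itertools import combinations

  def rules(frequent, min_conf):
      out = []
      for itemset, sup in frequent.items():
          # Try every non-empty proper subset h as consequent: (l - h) => h.
          for r in range(1, len(itemset)):
              for h in combinations(itemset, r):
                  antecedent = tuple(i for i in itemset if i not in h)
                  # By the apriori property the antecedent is also frequent,
                  # so its support count is available in `frequent`.
                  conf = sup / frequent[antecedent]
                  if conf >= min_conf:
                      out.append((antecedent, h, conf))
      return out

  freq = {("A",): 2, ("B",): 3, ("C",): 3, ("E",): 3, ("A", "C"): 2,
          ("B", "C"): 2, ("B", "E"): 3, ("C", "E"): 2, ("B", "C", "E"): 2}
  for a, h, c in rules(freq, 0.8):
      print(a, "=>", h, round(c, 2))  # e.g. ('C', 'E') => ('B',) 1.0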

FP-growth algorithm

- Apriori = candidate generate-and-test.
- Problems:
  - Too many candidates to generate, e.g. if there are 10^4 frequent 1-itemsets, then more than 10^7 candidate 2-itemsets.
  - Each candidate implies expensive operations, e.g. pattern matching and subset checking.
- Can candidate generation be avoided? Yes, with the frequent-pattern growth (FP-growth) algorithm.

FP-growth algorithm

min_support = 3

  TID   Items bought                Frequent items (f-list ordered)
  100   {f, a, c, d, g, i, m, p}    {f, c, a, m, p}
  200   {a, b, c, f, l, m, o}       {f, c, a, b, m}
  300   {b, f, h, j, o, w}          {f, b}
  400   {b, c, k, s, p}             {c, b, p}
  500   {a, f, c, e, l, p, m, n}    {f, c, a, m, p}

1. Scan the database once and find the frequent items. Record them as the frequent 1-itemsets.
2. Sort the frequent items in descending frequency order: f-list = f-c-a-b-m-p.
3. Scan the database again and construct the FP-tree.

Header table (item: frequency, with a link to the item's nodes): f: 4, c: 4, a: 3, b: 3, m: 3, p: 3.

FP-tree:

  {}
  ├── f:4
  │   ├── c:3
  │   │   └── a:3
  │   │       ├── m:2
  │   │       │   └── p:2
  │   │       └── b:1
  │   │           └── m:1
  │   └── b:1
  └── c:1
      └── b:1
          └── p:1
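A minimal Python sketch of the FP-tree construction above (illustrative names; note that ties in the f-list are broken alphabetically here, so the computed f-list is c-f-a-b-m-p rather than the slide's f-c-a-b-m-p, which changes the tree's shape but not the itemsets mined from it):

  from collections import Counter

  class Node:
      def __init__(self, item, parent):
          self.item, self.parent = item, parent
          self.count = 0
          self.children = {}  # item -> Node

  def build_fptree(transactions, min_sup):
      counts = Counter(i for t in transactions for i in t)
      # f-list: frequent items in descending frequency order.
      flist = sorted((i for i in counts if counts[i] >= min_sup),
                     key=lambda i: (-counts[i], i))
      root = Node(None, None)
      header = {i: [] for i in flist}  # header table: item -> node links
      for t in transactions:
          node = root
          for item in (i for i in flist if i in t):  # insert in f-list order
              child = node.children.get(item)
              if child is None:
                  child = node.children[item] = Node(item, node)
                  header[item].append(child)
              child.count += 1
              node = child
      return root, header, flist

  db = [set("facdgimp"), set("abcflmo"), set("bfhjow"),
        set("bcksp"), set("afcelpmn")]
  root, header, flist = build_fptree(db, 3)
  print(flist)  # ['c', 'f', 'a', 'b', 'm', 'p']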

FP-growth algorithm

The header table already gives the frequent 1-itemsets: f: 4, c: 4, a: 3, b: 3, m: 3, p: 3.

For each frequent item in the header table:
- Traverse the tree by following the corresponding node links.
- Record all the prefix paths leading to the item. This is the item's conditional pattern base.

Conditional pattern bases:

  item   conditional pattern base
  c      f:3
  a      fc:3
  b      fca:1, f:1, c:1
  m      fca:2, fcab:1
  p      fcam:2, cb:1

FP-growth algorithm

For each conditional pattern base, start the process again (recursion):

- m-conditional pattern base: fca:2, fcab:1. The m-conditional FP-tree is the single path {} → f:3 → c:3 → a:3, giving the frequent itemsets fm:3, cm:3, am:3.
- am-conditional pattern base: fc:3. The am-conditional FP-tree is {} → f:3 → c:3, giving the frequent itemsets fam:3, cam:3.
- cam-conditional pattern base: f:3. The cam-conditional FP-tree is {} → f:3, giving the frequent itemset fcam:3.
- Then backtrack and process the remaining conditional pattern bases in the same way.
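The recursion can be sketched in a few lines of Python, reusing Node, build_fptree, and db from the construction sketch above (illustrative, not the original FP-growth code; pattern keys are kept sorted, so e.g. fcam appears as ('a', 'c', 'f', 'm')):

  def fpgrowth(transactions, min_sup, suffix=()):
      root, header, flist = build_fptree(transactions, min_sup)
      patterns = {}
      for item in reversed(flist):  # least frequent first
          support = sum(n.count for n in header[item])
          pattern = tuple(sorted((item,) + suffix))
          patterns[pattern] = support
          # Conditional pattern base: the prefix paths leading to `item`,
          # each repeated as often as the node's count.
          cond_base = []
          for node in header[item]:
              path, p = [], node.parent
              while p.item is not None:
                  path.append(p.item)
                  p = p.parent
              cond_base.extend([path] * node.count)
          # Recurse: mine the conditional base for patterns ending in `item`.
          patterns.update(fpgrowth(cond_base, min_sup, pattern))
      return patterns

  all_patterns = fpgrowth(db, 3)
  print(len(all_patterns))                   # 18 frequent itemsets
  print(all_patterns[("a", "c", "f", "m")])  # 3, i.e. fcam:3 as above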

FP-growth algorithm

Exercise: run the FP-growth algorithm on the following database.

  TID   Items bought
  100   {1, 2, 5}
  200   {2, 4}
  300   {2, 3}
  400   {1, 2, 4}
  500   {1, 3}
  600   {2, 3}
  700   {1, 3}
  800   {1, 2, 3, 5}
  900   {1, 2, 3}
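If you want to check your answer programmatically, the fpgrowth sketch above can be run on this database (note: the slide does not state a minimum support threshold, so min_sup = 2 below is an assumption):

  exercise_db = [{1, 2, 5}, {2, 4}, {2, 3}, {1, 2, 4}, {1, 3},
                 {2, 3}, {1, 3}, {1, 2, 3, 5}, {1, 2, 3}]
  print(fpgrowth(exercise_db, 2))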
