CSE 711 Seminar on Data Mining: Apriori Algorithm. Sung-Hyuk Cha. 2/8/00.


Association Rules

Definition: rules that state a statistical correlation between the occurrence of certain attributes in a database table. Given a set of transactions, where each transaction is a set of items X1, ..., Xn and Y, an association rule is an expression X1, ..., Xn → Y, meaning that the attributes X1, ..., Xn predict Y. Intuitive meaning of such a rule: transactions in the database which contain the items in X tend also to contain the items in Y.

Measures for an Association Rule

Support: given the association rule X1, ..., Xn → Y, the support is the percentage of records for which X1, ..., Xn and Y both hold. It measures the statistical significance of the rule.

Confidence: given the association rule X1, ..., Xn → Y, the confidence is the percentage of records for which Y holds, within the group of records for which X1, ..., Xn hold. It measures the degree of correlation between X and Y in the dataset, i.e. the rule's strength.

Quiz #2

Problem: given the transaction table D below, find the support and confidence of the association rule B, D → E.

Database D
TID  Items
01   A B E
02   A C D E
03   B C D E
04   A B D E
05   B D E
06   A B C
07   A B D

Answer: support = 3/7, confidence = 3/4.
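The quiz answer can be checked mechanically. Below is a minimal Python sketch (not part of the slides) that counts the two percentages defined on the previous slide; the function name is our own.

```python
from typing import FrozenSet, List, Tuple

def support_confidence(transactions: List[FrozenSet[str]],
                       antecedent: FrozenSet[str],
                       consequent: FrozenSet[str]) -> Tuple[float, float]:
    """Return (support, confidence) of the rule antecedent -> consequent."""
    # Records where both sides hold (for support), and where the
    # antecedent alone holds (the denominator of confidence).
    both = sum(1 for t in transactions if antecedent | consequent <= t)
    ante = sum(1 for t in transactions if antecedent <= t)
    support = both / len(transactions)
    confidence = both / ante if ante else 0.0
    return support, confidence

# The quiz database D (TIDs 01-07).
D = [frozenset(t) for t in
     ["ABE", "ACDE", "BCDE", "ABDE", "BDE", "ABC", "ABD"]]

s, c = support_confidence(D, frozenset("BD"), frozenset("E"))
# s == 3/7 and c == 3/4, matching the slide's answer
```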

Apriori Algorithm

An efficient algorithm for finding association rules.

Procedure:
1. Find all the frequent itemsets.
2. Use the frequent itemsets to generate the association rules.

A frequent itemset is a set of items whose support exceeds a user-defined minimum.

Notation

k-itemset : an itemset having k items.
Lk : set of frequent k-itemsets (those with minimum support). Each member of this set has two fields: i) itemset and ii) support count.
Ck : set of candidate k-itemsets (potentially frequent itemsets). Each member of this set has two fields: i) itemset and ii) support count.
D : the sample transaction database.
F : the set of all frequent itemsets.

Example

Database D (four transactions):
  A C D
  B C E
  A B C E
  B E

Pass k = 1: C1 = {A} .50, {B} .75, {C} .75, {D} .25, {E} .75.
  L1 = { {A}, {B}, {C}, {E} }   ({D} fails the minimum).
Pass k = 2: C2 = {A,B} .25, {A,C} .50, {A,E} .25, {B,C} .50, {B,E} .75, {C,E} .50.
  L2 = { {A,C}, {B,C}, {B,E}, {C,E} }.
Pass k = 3: C3 = {B,C,E} .50.
  L3 = { {B,C,E} }.
Pass k = 4: C4 = {A,B,C,E} .25.
  L4 = ∅.

* Suppose a user-defined minimum support of .49.
* Do n items imply O(2^n) computational complexity?

Procedure

Apriori()
{
    F = ∅;
    L1 = {frequent 1-itemsets};
    k = 2;   /* k represents the pass number. */
    while (Lk-1 != ∅) {
        F = F ∪ Lk-1;
        Ck = new candidates of size k generated from Lk-1;
        for all transactions t ∈ D
            increment the count of all candidates in Ck that are contained in t;
        Lk = all candidates in Ck with minimum support;
        k++;
    }
    return F;
}
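The pseudocode above translates almost line for line into Python. The following is a minimal sketch (our own, not from the slides) that represents itemsets as frozensets and computes support by rescanning the transaction list on each pass:

```python
from itertools import combinations

def apriori(transactions, minsup):
    """Return the set F of all frequent itemsets (support >= minsup)."""
    n = len(transactions)
    items = {i for t in transactions for i in t}
    # L1: frequent 1-itemsets.
    L = {frozenset([i]) for i in items
         if sum(1 for t in transactions if i in t) / n >= minsup}
    F = set()
    k = 2
    while L:
        F |= L
        # Join: unions of two frequent (k-1)-itemsets that have k items.
        C = {a | b for a in L for b in L if len(a | b) == k}
        # Prune: drop candidates with an infrequent (k-1)-subset.
        C = {c for c in C
             if all(frozenset(s) in L for s in combinations(c, k - 1))}
        # Count support and keep the frequent candidates.
        L = {c for c in C
             if sum(1 for t in transactions if c <= t) / n >= minsup}
        k += 1
    return F

# The four-transaction example from the earlier slide, minimum = .49.
D = [frozenset(t) for t in ("ACD", "BCE", "ABCE", "BE")]
F = apriori(D, 0.49)
# F holds L1 ∪ L2 ∪ L3: nine itemsets, including {B,C,E} but not {D}
```

A production implementation would count candidate occurrences with a hash tree in a single scan per pass, as the original paper does; the rescan here keeps the sketch short.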

Candidate Generation

Given Lk-1, the set of all frequent (k-1)-itemsets, generate a superset of the set of all frequent k-itemsets. Idea: if an itemset X has minimum support, so do all subsets of X.

1. Join Lk-1 with Lk-1.
2. Prune: delete all itemsets c ∈ Ck such that some (k-1)-subset of c is not in Lk-1.

ex) L2 = { {A,C}, {B,C}, {B,E}, {C,E} }
1. Join: { {A,B,C}, {A,C,E}, {B,C,E} }
2. Prune: {A,B,C} is deleted because {A,B} ∉ L2, and {A,C,E} because {A,E} ∉ L2, leaving { {B,C,E} }.
Instead of 5C3 = 10 candidates, we have only 1.
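The join-then-prune step can be sketched on its own and checked against the slide's L2 example. This is our own minimal version (the function name apriori_gen follows the original paper):

```python
from itertools import combinations

def apriori_gen(L_prev, k):
    """Generate candidate k-itemsets from the frequent (k-1)-itemsets."""
    # Join: unions of two (k-1)-itemsets with exactly k items.
    # For the slide's L2 this yields {A,B,C}, {A,C,E}, {B,C,E}.
    joined = {a | b for a in L_prev for b in L_prev if len(a | b) == k}
    # Prune: delete c if some (k-1)-subset of c is not in L_prev.
    return {c for c in joined
            if all(frozenset(s) in L_prev for s in combinations(c, k - 1))}

L2 = {frozenset(s) for s in ("AC", "BC", "BE", "CE")}
C3 = apriori_gen(L2, 3)   # only {B,C,E} survives the prune
```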

Thoughts

Association rules are always defined on binary attributes, so the tables need to be flattened.

ex) A phone company DB with attributes CID, Gender, Ethnicity, and Call type is flattened into binary columns M, F, W, B, H, A, D, I keyed by CID.
- Support for Asian ethnicity will never exceed .5.
- There is no need to consider the itemsets {M,F}, {W,B}, or {D,I}.
- Rules such as M → F or D → I are of no interest at all.

* Considering the original schema before flattening may be a good idea.

Finding Association Rules with Item Constraints

When item constraints are considered, the Apriori candidate generation procedure does not generate all the potentially frequent itemsets as candidates.

Procedure:
1. Find all the frequent itemsets that satisfy the boolean expression B.
2. Find the support of all subsets of frequent itemsets that do not satisfy B.
3. Generate the association rules from the frequent itemsets found in Step 1, computing confidences from the frequent itemsets found in Steps 1 & 2.

Additional Notation

B : boolean expression with m disjuncts, B = D1 ∨ D2 ∨ ... ∨ Dm.
Di : conjunction of n conjuncts, Di = ai,1 ∧ ai,2 ∧ ... ∧ ai,n.
S : set of items such that any itemset that satisfies B contains an item from S.
Ls(k) : set of frequent k-itemsets that contain an item in S.
Lb(k) : set of frequent k-itemsets that satisfy B.
Cs(k) : set of candidate k-itemsets that contain an item in S.
Cb(k) : set of candidate k-itemsets that satisfy B.

Direct Algorithm

Procedure:
1. Scan the data and determine L1 and F.
2. Find Lb(1).
3. Generate Cb(k+1) from Lb(k):
   3-1. Ck+1 = Lb(k) × F.
   3-2. Cb(k+1): delete all candidates in Ck+1 that do not satisfy B.
   3-3. Lk+1: delete all candidates below the minimum support.
   3-4. Lb(k+1): for each Di with exactly k+1 non-negated elements, add the itemset if all its items are frequent.
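Step 3-2 needs a test for "itemset satisfies B". A minimal sketch of such a DNF check (our own encoding, not from the slides) using the expression B = (A ∧ B) ∨ (C ∧ ¬E) from the worked example:

```python
def satisfies(itemset, dnf):
    """True iff `itemset` satisfies a DNF boolean expression.

    `dnf` is a list of disjuncts; each disjunct is a list of
    (item, negated) pairs.  (A and B) or (C and not E) is encoded as
    [[("A", False), ("B", False)], [("C", False), ("E", True)]].
    """
    return any(all((item not in itemset) if neg else (item in itemset)
                   for item, neg in disjunct)
               for disjunct in dnf)

B = [[("A", False), ("B", False)], [("C", False), ("E", True)]]
ok1 = satisfies({"A", "C"}, B)   # True: C present, E absent
ok2 = satisfies({"C", "E"}, B)   # False: E blocks the second disjunct
```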

Example

Database D (four transactions):
  A C D
  B C E
  A B C E
  B E

Given B = (A ∧ B) ∨ (C ∧ ¬E):

Steps 1 & 2: C1 = { {A}, {B}, {C}, {D}, {E} }, L1 = { {A}, {B}, {C}, {E} }, Lb(1) = { {C} }.

Step 3-1: C2 = Lb(1) × F = { {A,C}, {B,C}, {C,E} }.
Step 3-2: Cb(2) = { {A,C}, {B,C} }.
Step 3-3: L2 = { {A,C}, {B,C} }.
Step 3-4: Lb(2) = { {A,B}, {A,C}, {B,C} }.

Step 3-1: C3 = Lb(2) × F = { {A,B,C}, {A,B,E}, {A,C,E}, {B,C,E} }.
Step 3-2: Cb(3) = { {A,B,C}, {A,B,E} }.
Step 3-3: L3 = ∅.
Step 3-4: Lb(3) = ∅.

The MultipleJoins and Reorder algorithms for finding association rules with item constraints will be added.

Mining Sequential Patterns

Given a database D of customer transactions, the problem of mining sequential patterns is to find the maximal sequences, among all sequences, that have a certain user-specified minimum support.
- A transaction-time field is added.
- An itemset within a sequence is written in parentheses and a sequence in angle brackets, e.g. <(30) (90)>.

Sequence Version of DB Conversion

Database D (transactions listed in time order)
CID   Transaction Time   Items
1     Jun                30
1     Jun                90
2     Jun                10, 20
2     Jun                30
2     Jun                40, 60, 70
3     Jun                30, 50, 70
4     Jun                30
4     Jun                40, 70
4     July               90
5     Jun                90

Sequential version D'
CID   Customer Sequence
1     <(30) (90)>
2     <(10 20) (30) (40 60 70)>
3     <(30 50 70)>
4     <(30) (40 70) (90)>
5     <(90)>

Answer set with support > .25 = { <(30) (90)>, <(30) (40 70)> }

* Customer sequence: all the transactions of a customer form a sequence, ordered by increasing transaction time.

Definitions

Def 1. A sequence <a1 a2 ... an> is contained in another sequence <b1 b2 ... bm> if there exist integers i1 < i2 < ... < in such that a1 ⊆ bi1, a2 ⊆ bi2, ..., an ⊆ bin.
ex) <(3) (4 5) (8)> is contained in <(7) (3 8) (9) (4 5 6) (8)> : Yes
    <(3) (5)> is contained in <(3 5)> : No

Def 2. A sequence s is maximal if s is not contained in any other sequence.

- Ti is a transaction time.
- itemset(Ti) is the set of items in Ti.
- litemset: an itemset with minimum support.
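Def 1 can be implemented with a greedy scan: match each itemset of the smaller sequence against the earliest later itemset of the larger one. A minimal sketch (ours, not from the slides):

```python
def contained_in(small, big):
    """Def 1: each itemset of `small` is a subset of some itemset of
    `big`, with the matched indices strictly increasing."""
    i = 0
    for a in small:
        # Advance until big[i] is a superset of the current itemset.
        while i < len(big) and not set(a) <= set(big[i]):
            i += 1
        if i == len(big):
            return False
        i += 1      # the next itemset must match strictly later
    return True

r1 = contained_in([{3}, {4, 5}, {8}],
                  [{7}, {3, 8}, {9}, {4, 5, 6}, {8}])   # True
r2 = contained_in([{3}, {5}], [{3, 5}])                 # False
```

The greedy choice is safe here: taking the earliest possible match never rules out a later match, which is the standard argument for subsequence tests.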

Procedure

1. Convert D into a database D' of customer sequences.
2. Litemset mapping.
3. Transform each customer sequence into a litemset representation.
4. Find the desired sequences using the set of litemsets:
   4-1. AprioriAll
   4-2. AprioriSome
   4-3. DynamicSome
5. Find the maximal sequences among the set of large sequences:
   for (k = n; k > 1; k--)
       foreach k-sequence sk
           delete from S all subsequences of sk.
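Step 5 amounts to filtering out every large sequence that is contained (Def 1) in another. A self-contained Python sketch of that filter (ours, not from the slides; it assumes duplicate sequences have already been removed):

```python
def maximal(sequences):
    """Keep only sequences not contained in any other one (Def 2)."""
    def contained_in(small, big):
        # Def 1 containment test with strictly increasing indices.
        i = 0
        for a in small:
            while i < len(big) and not set(a) <= set(big[i]):
                i += 1
            if i == len(big):
                return False
            i += 1
        return True
    return [s for s in sequences
            if not any(s is not t and contained_in(s, t)
                       for t in sequences)]

seqs = [[{1}, {2}], [{1}, {2}, {3}], [{4}]]
kept = maximal(seqs)   # [{1},{2}] is dropped: it is inside [{1},{2},{3}]
```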

Example

step 2: litemset mapping
Large Itemsets   Mapped To
(30)             1
(40)             2
(70)             3
(40 70)          4
(90)             5

step 3: each customer sequence is transformed by replacing each transaction with the set of litemsets it contains (transactions containing no litemset are dropped), e.g. <(10 20) (30) (40 60 70)> becomes <{1} {2, 3, 4}>.

AprioriAll

AprioriAll()
{
    F = ∅;
    L1 = {large 1-sequences};
    k = 2;   /* k represents the pass number. */
    while (Lk-1 != ∅) {
        F = F ∪ Lk-1;
        Ck = new candidate k-sequences generated from Lk-1;
        for each customer sequence c ∈ D'
            increment the count of all candidates in Ck that are contained in c;
        Lk = all candidates in Ck with minimum support;
        k++;
    }
    return F;
}

Example

Customer sequences (minimum support = 40%, i.e. 2 of the 5 sequences):
<1 2 3 4>
<1 3 4 3 5>
<1 2 3 4>
<1 3 5>
<4 5>

The passes compute L1 through L4, with their supports, from the candidate sets C2 through C4. The maximal large sequences are { <1 2 3 4>, <1 3 5>, <4 5> }.

The AprioriSome and DynamicSome algorithms for finding sequential patterns will be added.

GSC Features

Gradient (local): direction computed as tan^-1(Sy(i,j) / Sx(i,j)). Structural (intermediate) and Concavity (global) features.

GSC Feature Table

A Sample of GSC Features

Gradient: 192 bits
Structure: 192 bits
Concavity: 128 bits

Class A: 800 samples

Classes A, B, and C

Classes A, B, and C, Reordered by Frequency

Association Rules in GSC
- G → S, G → C
- F1, F2, F3 → "A"
- F1 → F2

References

Agrawal, R.; Imielinski, T.; and Swami, A. "Mining Association Rules between Sets of Items in Large Databases." Proc. of the ACM SIGMOD Conference on Management of Data, 1993.
Agrawal, R. and Srikant, R. "Fast Algorithms for Mining Association Rules in Large Databases." Proc. of the 20th Int'l Conference on Very Large Databases, Sept. 1994.
Agrawal, R. and Srikant, R. "Mining Sequential Patterns." Research Report RJ 9910, IBM Almaden Research Center, San Jose, California, October 1994.
Agrawal, R. and Shafer, J. "Parallel Mining of Association Rules." IEEE Transactions on Knowledge and Data Engineering, 8(6), 1996.

MultipleJoins Algorithm

GenerateSelectedItemSet()
{
    S = ∅;
    for each Di, i = 1 to m {
        for each ai,j, j = 1 to n
            cost of conjunct = support(S ∪ ai,j) - support(ai,j);
        add the ai,j with the minimum cost to S;
    }
}

ex) Database D (four transactions):
  A C D
  B C E
  A B C E
  B E

B = (A ∧ B) ∨ (C ∧ ¬E)
The algorithm gives S = { A, C }.

Procedure:
1. Scan the data and determine F.
2. Ls(1) = S ∩ F.
...