Fast Algorithms for Mining Association Rules. Rakesh Agrawal, Ramakrishnan Srikant.



©Ofer Pasternak, Data Mining Seminar. Outline: Introduction; Formal statement; Apriori algorithm; AprioriTid algorithm; Comparison; AprioriHybrid algorithm; Conclusions.

Introduction. Bar-code technology makes very large databases of basket data available. Mining association rules over basket data (Agrawal et al., 1993), e.g. tires ∧ auto accessories → automotive service. Applications: cross-marketing, attached mailings.

Notation. Items: I = {i1, i2, …, im}. A transaction is a set of items, sorted lexicographically. TID: a unique identifier for each transaction.

Notation. An association rule is an implication X → Y, where X and Y are itemsets with X ∩ Y = ∅.

Confidence and support. The rule X → Y has confidence c if c% of the transactions in D that contain X also contain Y. It has support s if s% of the transactions in D contain X ∪ Y.
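These two definitions can be sketched directly in code; the toy transaction database below is hypothetical, chosen only to illustrate them:

```python
# Minimal sketch of support and confidence over a toy database.
# The transactions are made up for illustration, not from the paper.
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers", "beer"},
    {"bread", "milk", "diapers"},
]

def support(itemset, db):
    """Fraction of transactions containing every item of `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in db) / len(db)

def confidence(x, y, db):
    """Fraction of transactions containing X that also contain Y."""
    return support(set(x) | set(y), db) / support(x, db)

print(support({"diapers", "beer"}, transactions))            # 0.5
print(confidence({"diapers"}, {"beer"}, transactions))       # ≈ 0.667
```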

Notice: X → A holding does not imply X ∪ Y → A (the latter may not have minimum support), and X → A and A → Z holding does not imply X → Z (which may not have minimum confidence).

Problem statement. Given a set of transactions D, generate all association rules whose support and confidence are at least the user-specified minimum support (minsup) and minimum confidence (minconf).

Previous algorithms: AIS, SETM. Related knowledge-discovery work: induction of classification rules, discovery of causal rules, fitting of functions to data, KID3 (machine learning).

Discovering all association rules: (1) find all large itemsets, i.e. itemsets with support above the minimum support; (2) use the large itemsets to generate the rules.

General idea. Say ABCD and AB are large itemsets. Compute conf = support(ABCD) / support(AB). If conf >= minconf, the rule AB → CD holds.

Discovering large itemsets: multiple passes over the data. First pass: count the support of individual items. Each subsequent pass: generate candidates from the previous pass's large itemsets, then scan the data to check the actual support of the candidates. Stop when no new large itemsets are found.

The trick: any subset of a large itemset is large. Therefore, to find large k-itemsets, create candidates by joining large (k-1)-itemsets, then delete every candidate containing a subset that is not large.

Algorithm Apriori: count item occurrences; in each pass, generate new candidate k-itemsets, find the support of all candidates, and keep only those with support over minsup.
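The pass structure above can be sketched as follows. This is a minimal illustration, not the paper's optimized implementation: the join here simply unions any two large (k-1)-itemsets whose union has size k, which after pruning yields the same candidates as the paper's lexicographic join. The database is the four-transaction example used later in these slides:

```python
from itertools import combinations

# Sketch of the Apriori pass structure: count items, then repeatedly
# generate candidates from the previous pass's large itemsets and
# count candidate support with a full pass over the data.
def apriori(db, minsup):
    # First pass: count individual items.
    items = sorted({i for t in db for i in t})
    large = [frozenset([i]) for i in items
             if sum(i in t for t in db) >= minsup]
    all_large = {l: sum(l <= t for t in db) for l in large}
    k = 2
    while large:
        # Join: union two large (k-1)-itemsets into a k-itemset.
        cands = {p | q for p in large for q in large if len(p | q) == k}
        # Prune: drop candidates with a (k-1)-subset that is not large.
        prev = set(large)
        cands = {c for c in cands
                 if all(frozenset(s) in prev
                        for s in combinations(c, k - 1))}
        # Count candidate support with a data pass.
        large = [c for c in cands if sum(c <= t for t in db) >= minsup]
        for l in large:
            all_large[l] = sum(l <= t for t in db)
        k += 1
    return all_large

db = [frozenset(t) for t in ({1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5})]
result = apriori(db, minsup=2)
print(result[frozenset({2, 3, 5})])  # 2
```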

Candidate generation. Join step: p and q are two large (k-1)-itemsets identical in their first k-2 items; join them by appending the last item of q to p. Prune step: check all (k-1)-subsets of each candidate and remove any candidate with a "small" (non-large) subset.

Example. L3 = { {1 2 3}, {1 2 4}, {1 3 4}, {1 3 5}, {2 3 4} }. After joining: { {1 2 3 4}, {1 3 4 5} }. After pruning: { {1 2 3 4} }, because {1 4 5} and {3 4 5} are not in L3.
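The join and prune steps can be sketched as a small function and checked against this example (itemsets represented as sorted tuples):

```python
from itertools import combinations

# Sketch of apriori-gen: join large (k-1)-itemsets that agree on their
# first k-2 items, then prune candidates with a non-large subset.
def apriori_gen(large_prev):
    """large_prev: sorted tuples, all of the same length k-1."""
    k = len(large_prev[0]) + 1
    prev = set(large_prev)
    # Join step: p and q share the first k-2 items and p[-1] < q[-1].
    cands = {p + (q[-1],) for p in large_prev for q in large_prev
             if p[:-1] == q[:-1] and p[-1] < q[-1]}
    # Prune step: every (k-1)-subset of a candidate must be large.
    return {c for c in cands
            if all(s in prev for s in combinations(c, k - 1))}

L3 = [(1, 2, 3), (1, 2, 4), (1, 3, 4), (1, 3, 5), (2, 3, 4)]
print(apriori_gen(L3))  # {(1, 2, 3, 4)}
```

The join produces (1 2 3 4) and (1 3 4 5); the prune step removes (1 3 4 5) because (1 4 5) is not in L3.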

Correctness. The join is equivalent to extending L(k-1) with all items and then removing candidates with some (k-1)-subset not in L(k-1): the join condition prevents duplicates, and pruning is sound because any subset of a large itemset must also be large.

Subset function. Candidate itemsets Ck are stored in a hash-tree, which finds in O(k) time whether a candidate itemset of size k is contained in transaction t. Total time: O(max(k, size(t))).
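The paper uses a hash-tree; a simpler (less scalable) stand-in that gives the same answer is to enumerate every k-subset of the transaction and look it up in a hash set of candidates:

```python
from itertools import combinations

# Simplified stand-in for the hash-tree subset function: enumerate
# each k-subset of the transaction and look it up in a hash set.
# The hash-tree avoids materializing all subsets; the result is the same.
def subsets_in_transaction(candidates, transaction, k):
    cand_set = set(candidates)
    return {c for c in combinations(sorted(transaction), k)
            if c in cand_set}

C2 = [(1, 2), (1, 3), (1, 5), (2, 3), (2, 5), (3, 5)]
print(subsets_in_transaction(C2, {2, 3, 5}, 2))  # {(2, 3), (2, 5), (3, 5)}
```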

Problem? Every pass scans the whole database.

Algorithm AprioriTid: scans the database only once. It builds a storage set C^k whose entries have the form <TID, {Xk}>, where each Xk is a potentially large k-itemset present in the transaction with identifier TID. For k=1, C^1 is the database itself, with each item replaced by the 1-itemset containing it. Pass k+1 uses C^k instead of the database.

Advantage: C^k can be smaller than the database. If a transaction contains no candidate k-itemset, it is excluded from C^k. For large k, each entry may be smaller than the transaction, since the transaction might contain only a few candidates.

Disadvantage: for small k, each entry may be larger than the corresponding transaction, since an entry includes all candidate k-itemsets contained in the transaction.

Algorithm AprioriTid: count item occurrences; initialize the storage set with the database. In each pass: generate new candidate k-itemsets; determine which candidates are contained in each transaction TID using the previous storage set; build the new storage set, removing empty entries; find the support of all candidates and keep only those with support over minsup.

AprioriTid worked example (minsup = 2).
Database: TID 100: {1 3 4}; TID 200: {2 3 5}; TID 300: {1 2 3 5}; TID 400: {2 5}.
C^1 (database with items as 1-itemsets): 100: {{1},{3},{4}}; 200: {{2},{3},{5}}; 300: {{1},{2},{3},{5}}; 400: {{2},{5}}.
L1 (itemset: support): {1}: 2; {2}: 3; {3}: 3; {5}: 3.
C2: {1 2}, {1 3}, {1 5}, {2 3}, {2 5}, {3 5}.
C^2: 100: {{1 3}}; 200: {{2 3},{2 5},{3 5}}; 300: {{1 2},{1 3},{1 5},{2 3},{2 5},{3 5}}; 400: {{2 5}}.
L2: {1 3}: 2; {2 3}: 2; {2 5}: 3; {3 5}: 2.
C3: {2 3 5}.
C^3: 200: {{2 3 5}}; 300: {{2 3 5}}.
L3: {2 3 5}: 2.
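One AprioriTid pass over the storage set can be sketched as follows: a candidate is in a transaction exactly when its two generating (k-1)-subsets are in that entry's set-of-itemsets, so neither the pass nor the support count touches the original database. The values below reproduce the step from C^1 to C^2 in the worked example:

```python
# Sketch of one AprioriTid pass: build the new storage set C^k and the
# candidate support counts from C^(k-1) and the candidate set Ck.
def aprioritid_pass(storage_prev, candidates):
    """storage_prev: list of (tid, set of sorted tuples); candidates: k-tuples."""
    support = {c: 0 for c in candidates}
    storage_new = []
    for tid, itemsets in storage_prev:
        # c is in the transaction iff both generating (k-1)-subsets are:
        # c minus its last item, and c minus its second-to-last item.
        ct = {c for c in candidates
              if c[:-1] in itemsets and c[:-2] + c[-1:] in itemsets}
        for c in ct:
            support[c] += 1
        if ct:                       # drop empty entries
            storage_new.append((tid, ct))
    return storage_new, support

C2 = [(1, 2), (1, 3), (1, 5), (2, 3), (2, 5), (3, 5)]
Cbar1 = [(100, {(1,), (3,), (4,)}),
         (200, {(2,), (3,), (5,)}),
         (300, {(1,), (2,), (3,), (5,)}),
         (400, {(2,), (5,)})]
Cbar2, sup = aprioritid_pass(Cbar1, C2)
print(sup[(2, 5)])  # 3
```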

Correctness: show that the set Ct generated in the kth pass is the same as the set of candidate k-itemsets in Ck contained in the transaction with identifier t.TID.

Lemma 1. For all k > 1: if C^(k-1) is correct and complete, and L(k-1) is correct, then the set Ct generated in the kth pass is the same as the set of candidate k-itemsets in Ck contained in the transaction with identifier t.TID. (Complete: for every entry t of C^k, t.set-of-itemsets includes all large k-itemsets contained in the transaction t.TID. Correct: t.set-of-itemsets includes no k-itemset not contained in that transaction. L(k-1) correct: it is the set of all large (k-1)-itemsets.)

Proof. Suppose a candidate itemset c = c[1]c[2]…c[k] is contained in the transaction t.TID. Then c1 = (c - c[k]) and c2 = (c - c[k-1]) are contained in that transaction. Since Ck was built using apriori-gen(L(k-1)), all (k-1)-subsets of c must be large; in particular c1 and c2 are large, so by completeness of C^(k-1) both are members of t.set-of-itemsets, and hence c is added to Ct.

Conversely, suppose c is not contained in the transaction t.TID. Then c1 (or c2) is not contained in the transaction, so by correctness of C^(k-1) it is not in t.set-of-itemsets, and c is not added to Ct.

Lemma 2. For all k > 1: if L(k-1) is correct and the set Ct generated in the kth pass equals the set of candidate k-itemsets in Ck contained in the transaction t.TID, then C^k is correct and complete.

Proof. Apriori-gen guarantees Lk ⊆ Ck, so Ct includes all large k-itemsets contained in t.TID; these are added to C^k, hence C^k is complete. Ct includes only itemsets contained in t.TID, and only itemsets in Ct are added to C^k, hence C^k is correct.

Theorem 1. For all k >= 1, C^k is correct and complete and Lk is correct; equivalently, for all k > 1, the set Ct generated in the kth pass is the same as the set of candidate k-itemsets in Ck contained in the transaction t.TID.

Proof (by induction on k). Base case k = 1: C^1 is the database, so it is correct and complete. Assume the claim holds for k = n. By Lemma 1, the Ct generated in pass n+1 consists of exactly those itemsets in C(n+1) contained in the transaction t.TID; since apriori-gen guarantees C(n+1) ⊇ L(n+1) and Ct is correct, L(n+1) is correct. By Lemma 2, C^(n+1) is then correct and complete. Hence C^k is correct and complete and Lk is correct for all k >= 1, and the theorem holds.

General idea (reminder). Say ABCD and AB are large itemsets. Compute conf = support(ABCD) / support(AB). If conf >= minconf, the rule AB → CD holds.

Discovering rules. For every large itemset l: find all non-empty subsets of l; for every subset a, produce the rule a → (l - a), and accept it if support(l) / support(a) >= minconf.

Checking the subsets. For efficiency, generate subsets using a recursive DFS: if a subset a does not produce a rule, we need not check subsets of a. Example: given itemset ABCD, if ABC → D does not have enough confidence, then AB → CD surely cannot hold.
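This DFS with pruning can be sketched as follows, using the support counts of the small four-transaction example database from the AprioriTid slides:

```python
# Sketch of simple rule generation: DFS over shrinking antecedents of a
# large itemset, cutting a branch as soon as a rule fails minconf
# (smaller antecedents can only have lower confidence).
def gen_rules(l, support, minconf):
    """l: frozenset (a large itemset); support: dict frozenset -> count."""
    rules, seen = [], set()
    def dfs(a):
        for item in sorted(a):
            sub = a - {item}              # shrink the antecedent by one item
            if not sub or sub in seen:
                continue
            seen.add(sub)
            conf = support[l] / support[sub]
            if conf >= minconf:
                rules.append((sub, l - sub, conf))
                dfs(sub)                  # smaller antecedents may still hold
            # else: prune; no subset of `sub` can give a confident rule
    dfs(l)
    return rules

sup = {frozenset({2}): 3, frozenset({3}): 3, frozenset({5}): 3,
       frozenset({2, 3}): 2, frozenset({2, 5}): 3, frozenset({3, 5}): 2,
       frozenset({2, 3, 5}): 2}
rules = gen_rules(frozenset({2, 3, 5}), sup, minconf=1.0)
print(sorted(tuple(sorted(a)) for a, _, _ in rules))  # [(2, 3), (3, 5)]
```

With minconf = 1.0, only {2 3} → {5} and {3 5} → {2} survive; the one-item antecedents are pruned without being extended further.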

Why? For any subset â of a: support(â) >= support(a), so confidence(â → (l - â)) = support(l) / support(â) <= support(l) / support(a) = confidence(a → (l - a)).

Simple algorithm: check all the large itemsets; for each, check all subsets, testing the confidence of each new rule. Output the rule if it holds and continue the DFS over its subsets; if the rule lacks confidence, the DFS branch is cut there.

Faster algorithm. Idea: if (l - c) → c holds, then all rules (l - ĉ) → ĉ must hold, where ĉ is any non-empty subset of c. Example: if AB → CD holds, then so do ABC → D and ABD → C.

Faster algorithm. From a large itemset l, generate all rules with a single item in the consequent; then use those consequents and apriori-gen to generate all candidate 2-item consequents, and so on. The candidate set of the faster algorithm is a subset of the candidate set of the simple algorithm.

Faster algorithm: find all 1-item consequents (using one pass of the simple algorithm); generate new (m+1)-item consequents from the surviving m-item consequents; check the confidence of each new rule; continue with bigger consequents. If a consequent does not hold, do not extend it.

Advantage, by example. Large itemset ABCDE; the 1-item-consequent rules that hold are ACDE → B and ABCE → D. The simple algorithm will still check ABC → DE, ABE → CD, BCE → AD, and ACE → BD. The faster algorithm checks only ACE → BD (consequent BD joined from B and D), which is also the only rule that holds.

Example (diagram). Large itemset ABCDE; rules with minconf: ACDE → B, ABCE → D. Simple algorithm explores ACD → BE, ADE → BC, CDE → AB, ACE → BD, BCE → AD, ABE → CD, ABC → ED. Fast algorithm: only ACE → BD.
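The consequent-growing procedure can be sketched as follows, again using the support counts of the four-transaction example database. The join here is simplified (union of two surviving consequents of size m into one of size m+1) rather than the full apriori-gen:

```python
# Sketch of the faster rule generation: grow consequents instead of
# shrinking antecedents. A consequent is extended only if it survived.
def gen_rules_fast(l, support, minconf):
    rules = []
    # 1-item consequents, found directly.
    conseqs = [frozenset({i}) for i in l
               if support[l] / support[l - {i}] >= minconf]
    rules += [(l - c, c) for c in conseqs]
    m = 1
    while conseqs and m + 1 < len(l):
        # Simplified join: union two surviving m-item consequents.
        cands = {p | q for p in conseqs for q in conseqs
                 if len(p | q) == m + 1}
        conseqs = [c for c in cands
                   if support[l] / support[l - c] >= minconf]
        rules += [(l - c, c) for c in conseqs]
        m += 1
    return rules

sup = {frozenset({2}): 3, frozenset({3}): 3, frozenset({5}): 3,
       frozenset({2, 3}): 2, frozenset({2, 5}): 3, frozenset({3, 5}): 2,
       frozenset({2, 3, 5}): 2}
rules = gen_rules_fast(frozenset({2, 3, 5}), sup, minconf=1.0)
print(len(rules))  # 2
```

It finds the same two rules as the simple algorithm ({3 5} → {2} and {2 3} → {5}) but never even forms the 2-item consequent candidates that the pruned 1-item consequent {3} would have generated.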

Results. Compare Apriori and AprioriTid with each other and with the previously known algorithms AIS and SETM. The algorithms differ in how they generate the large itemsets: AIS and SETM both generate candidates on the fly during the data pass, and SETM was designed for use over SQL.

Method: run all algorithms on the same databases, both synthetic data and real data.

Synthetic data. Transaction sizes and large-itemset sizes are each clustered around a mean. Parameters for data generation: D, the number of transactions; T, the average transaction size; I, the average size of the maximal potentially large itemsets; L, the number of maximal potentially large itemsets; N, the number of items.

Synthetic data, experiment values: N = 1000, L = 2000. Datasets: T5.I2.D100K, T10.I2.D100K, T10.I4.D100K, T20.I2.D100K, T20.I4.D100K, T20.I6.D100K (e.g. T5.I2.D100K means T = 5, I = 2, D = 100,000 transactions).

(Performance graphs.) SETM's values are too big to fit on the graphs. Apriori always beats AIS, and Apriori is better than AprioriTid on large problems.

Explaining the results. AprioriTid uses C^k instead of the database. If C^k fits in memory, AprioriTid is faster than Apriori; when C^k is too big to sit in memory, the computation time is much longer, and Apriori is faster than AprioriTid.

Reality check: retail sales data, 63 departments, transactions of average size 2.47. A small database: C^k fits in memory.

Reality check: mail order data, 2.9 million transactions (average size 2.62); mail customer data, transactions of average size 31.

So who is better? Look at the passes: in the final passes, C^k is small enough to fit in memory.

Algorithm AprioriHybrid: use Apriori in the initial passes, estimate the size of C^k, and switch to AprioriTid when C^k is expected to fit in memory. The switch takes time, but the hybrid is still better in most cases.
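The switch test can be sketched roughly as below. This is an approximation of the paper's heuristic: each candidate occurrence in this pass would become one entry in some transaction's set-of-itemsets, plus one TID slot per transaction, so the sum of candidate support counts plus the transaction count estimates what C^k would occupy. The memory-budget parameter is an assumption of this sketch:

```python
# Rough sketch of the AprioriHybrid switch test: estimate the size
# C^k would have if built at the end of this pass, and switch to
# AprioriTid once that estimate fits in the available memory.
def should_switch(cand_support, n_transactions, memory_budget):
    """cand_support: dict candidate -> support count from this pass."""
    estimate = sum(cand_support.values()) + n_transactions
    return estimate < memory_budget

print(should_switch({(1, 2): 3, (2, 3): 2}, 4, 100))  # True
```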


Scale-up experiment (performance graphs).

Conclusions. The Apriori algorithms are better than the previous algorithms: for small problems by constant factors, for large problems by orders of magnitude. The algorithms are best combined (AprioriHybrid), and show good results in scale-up experiments.

Summary. Association rules are an important tool in analyzing databases. We have seen an algorithm that finds all association rules in a database, with better running times than previous algorithms, and which maintains its performance on large databases.

End.