Association rules The goal of mining association rules is to generate all possible rules that exceed some minimum user-specified support and confidence thresholds. The problem is thus decomposed into two subproblems:

Association rules 1. Generate all itemsets whose support exceeds the support threshold. These sets are called large or frequent itemsets, because they have large support (not because of their cardinality). 2. For each large itemset, generate all the rules that have the minimum confidence: for a large itemset X and Y ⊂ X, let Z = X – Y. If support(X)/support(Z) ≥ confidence threshold, then Z ⇒ Y is a rule.

Association rules Generating rules from large (frequent) itemsets is straightforward. However, if the cardinality of the set of items is very high, the process becomes very computation-intensive: for m items, the number of distinct itemsets is 2^m (the power set). Basic algorithms for finding association rules try to reduce this combinatorial search space.

Association rules: a basic algorithm 1. Test the support for itemsets of size 1, called 1-itemsets. Discard those that do not meet the Minimum Required Support (MRS). 2. Extend the surviving 1-itemsets by appending one item at a time to generate all candidate 2-itemsets; test MRS and discard those that do not meet it. 3. Continue until no new itemsets can be found. 4. Use the itemsets found to generate rules (check confidence).

Association rules The naive version of this algorithm is a combinatorial nightmare! There are many versions and variants of this algorithm. They use different strategies, but their resulting sets of rules are all the same: any algorithm for association rules should find the same set of rules, although their computational efficiencies and memory requirements may differ. We will see the Apriori algorithm.

Association rules: The Apriori Algorithm The key idea, the Apriori property: any subset of a frequent itemset is also a frequent itemset. (Diagram: the itemset lattice over items A, B, C, D, through the 2-itemsets AB, AC, AD, BC, BD, CD, up to the 3-itemsets ABC, ABD, ACD, BCD; begin with itemsets of size 1, with the itemsets pruned by the property marked ×.)
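The pruning that the Apriori property enables can be sketched in a few lines: before counting a candidate k-itemset, check that every (k-1)-subset is already known to be frequent. This is a minimal illustration; the function name and the sample F2 are hypothetical, not from the slides.

```python
from itertools import combinations

def all_subsets_frequent(candidate, frequent_prev):
    """Apriori pruning check: a k-itemset can only be frequent if every
    one of its (k-1)-subsets is already known to be frequent."""
    k = len(candidate)
    return all(frozenset(s) in frequent_prev
               for s in combinations(candidate, k - 1))

# Hypothetical frequent 2-itemsets from some earlier pass:
F2 = {frozenset(p) for p in [("A", "B"), ("A", "C"), ("B", "C"), ("B", "D")]}

print(all_subsets_frequent(("A", "B", "C"), F2))  # all 2-subsets are in F2
print(all_subsets_frequent(("B", "C", "D"), F2))  # {C, D} is not in F2
```

The second candidate is pruned without ever touching the data, which is exactly how Apriori avoids counting most of the 2^m possible itemsets.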

Stage 1. The Apriori Algorithm: Generating large itemsets Find the large (frequent, i.e., meeting MRS) itemsets of size 1: F1. From k = 2: Ck = candidates of size k, i.e., those itemsets of size k that could be frequent, given Fk-1; Fk = those candidates that are actually large, Fk ⊆ Ck (they meet MRS).

Example Minimum support allowed = 50%. Transactions:
TID Items
1 a, c, d
2 b, c, e
3 a, b, c, e
4 b, e
Notation: itemset:count.
1. scan T → C1: {a}:2, {b}:3, {c}:3, {d}:1, {e}:3 → F1: {a}:2, {b}:3, {c}:3, {e}:3 (sets with d are discarded) → C2: {a,b}, {a,c}, {a,e}, {b,c}, {b,e}, {c,e}
2. scan T → C2: {a,b}:1, {a,c}:2, {a,e}:1, {b,c}:2, {b,e}:3, {c,e}:2 → F2: {a,c}:2, {b,c}:2, {b,e}:3, {c,e}:2 → C3: {b,c,e} (it is the only candidate because all its subsets are large!)
3. scan T → C3: {b,c,e}:2 → F3: {b,c,e}:2

Stage 1. The Apriori Algorithm: Generating large (frequent) itemsets
Algorithm Apriori(T, minsup)
// T: set of n transactions; minsup: minimum support allowed
C1 ← getC1(T); // Get 1-itemsets and their counts
F1 ← {f | f ∈ C1, f.count / n ≥ minsup}; // Check support
FOR (k = 2; Fk-1 ≠ ∅; k++) DO
  Ck ← candidate_gen(Fk-1); // Get k-itemsets using Fk-1
  FOR each transaction t ∈ T DO // Get counts of the k-itemsets
    FOR each candidate c ∈ Ck DO
      IF c is contained in t THEN c.count++;
  END FOR
  Fk ← {c ∈ Ck | c.count / n ≥ minsup} // Check support
END FOR
RETURN F ← ∪k Fk; // Union of all Fk (k > 1 for rules)
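The pseudocode above can be sketched as runnable Python. This is a minimal, unoptimized implementation (candidate generation by pairwise joins of F(k-1), pruned with the Apriori property); function and variable names are my own, not from the slides. Run on the four-transaction example, it reproduces F3 = {b,c,e} with count 2.

```python
from itertools import combinations

def apriori(transactions, minsup):
    """Levelwise Apriori sketch. transactions: list of sets of items;
    minsup: fraction in [0, 1]. Returns {frozenset: count} for all
    frequent itemsets."""
    n = len(transactions)
    # C1 / F1: count single items, keep those meeting minimum support.
    counts = {}
    for t in transactions:
        for item in t:
            s = frozenset([item])
            counts[s] = counts.get(s, 0) + 1
    F = {s: c for s, c in counts.items() if c / n >= minsup}
    frequent = dict(F)
    k = 2
    while F:
        # candidate_gen: join F(k-1) with itself, prune by Apriori property.
        prev = list(F)
        candidates = set()
        for i in range(len(prev)):
            for j in range(i + 1, len(prev)):
                union = prev[i] | prev[j]
                if len(union) == k and all(
                        frozenset(s) in F for s in combinations(union, k - 1)):
                    candidates.add(union)
        # Scan T to count the surviving candidates.
        counts = {c: 0 for c in candidates}
        for t in transactions:
            for c in candidates:
                if c <= t:
                    counts[c] += 1
        F = {s: c for s, c in counts.items() if c / n >= minsup}
        frequent.update(F)
        k += 1
    return frequent

# The four-transaction example from the slides, minsup = 50%:
T = [{"a", "c", "d"}, {"b", "c", "e"}, {"a", "b", "c", "e"}, {"b", "e"}]
freq = apriori(T, 0.5)
print(freq[frozenset({"b", "c", "e"})])  # 2, matching F3 on the slide
```

Note how the prune step discards {a,b,c} and {a,c,e} without counting them, because {a,b} and {a,e} are not in F2.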

Stage 2: The Apriori Algorithm: generating rules from large itemsets Note that frequent itemsets ≠ association rules. One more step is needed to generate association rules: For each frequent itemset X: for each subset A ⊂ X, A ≠ ∅, let B = X – A. A ⇒ B is an association rule if confidence(A ⇒ B) ≥ minconf, i.e., if support(A ∪ B)/support(A) ≥ minconf (minconf = minimum confidence).

Example Suppose {b,c,e} is a large (frequent) itemset with support = 50%. Consider the subsets of {b,c,e} and their supports: {b,c} = 50%, {b,e} = 75%, {c,e} = 50%, {b} = 75%, {c} = 75%, {e} = 75%.

Example This itemset generates the following association rules:
{b,c} ⇒ {e} confidence = 50/50 = 100%
{b,e} ⇒ {c} confidence = 50/75 = 66.6%
{c,e} ⇒ {b} confidence = 50/50 = 100%
{b} ⇒ {c,e} confidence = 50/75 = 66.6%
{c} ⇒ {b,e} confidence = 50/75 = 66.6%
{e} ⇒ {b,c} confidence = 50/75 = 66.6%
All rules have support = 50% = support({b,c,e}).
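Stage 2 can be sketched as a short function: enumerate every non-empty proper subset A of a frequent itemset X, and keep A ⇒ (X – A) when its confidence meets minconf. Names are illustrative; the support table below is the {b,c,e} example from the previous slide.

```python
from itertools import combinations

def rules_from_itemset(itemset, support, minconf):
    """Generate rules A => B from one frequent itemset X, where B = X - A.
    `support` maps frozensets to their supports; every subset of X is
    assumed present (by the Apriori property, they are frequent too)."""
    X = frozenset(itemset)
    rules = []
    for r in range(1, len(X)):  # all non-empty proper subsets A
        for A in map(frozenset, combinations(X, r)):
            conf = support[X] / support[A]
            if conf >= minconf:
                rules.append((A, X - A, conf))
    return rules

# Supports (in %) for {b,c,e} and its subsets, from the example:
sup = {frozenset(s): v for s, v in [
    ("bce", 50), ("bc", 50), ("be", 75), ("ce", 50),
    ("b", 75), ("c", 75), ("e", 75)]}

for A, B, conf in rules_from_itemset("bce", sup, minconf=1.0):
    print(sorted(A), "=>", sorted(B), f"confidence = {conf:.0%}")
```

With minconf = 100%, only {b,c} ⇒ {e} and {c,e} ⇒ {b} survive, matching the two 100%-confidence rules on the slide.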

Hash-based improvement to A-Priori During pass 1 of A-Priori (getting the 1-itemsets), most memory is idle. Use that memory to keep counts of buckets into which pairs of items are hashed. This gives an extra condition that candidate pairs must satisfy when getting the 2-itemsets (pass 2).

Hash-based improvement to A-Priori The memory is divided like this: space to count each item (getting the 1-itemsets), and the rest of the space for the described hashing process. This is the PCY algorithm (Park, Chen, and Yu).

PCY Algorithm Pass 1:
FOR each transaction t in T DO
  FOR each item in t DO // getC1
    add 1 to the item's count
  END FOR
  FOR each pair of items in t DO // hashing process
    hash the pair to a bucket and add 1 to the count for that bucket
  END FOR
END FOR

PCY Algorithm Pass 2: Count all pairs {i, j} that satisfy both conditions: 1. Both i and j (taken individually!) are frequent items. 2. The pair {i, j} hashes to a bucket whose count ≥ the support s (i.e., a frequent bucket). These two conditions are necessary for the pair to have a chance of being frequent.

PCY Example Support s = 3. Items: milk (1), Coke (2), bread (3), Pepsi (4), juice (5). The transactions are:
t1 = {1, 2, 3} (milk, Coke, bread)
t2 = {1, 4, 5}
t3 = {1, 3}
t4 = {2, 5}
t5 = {1, 3, 4}
t6 = {1, 2, 3, 5}
t7 = {2, 3, 5}
t8 = {2, 3}

PCY Example Hash a pair {i, j} to bucket k, where k = hash(i, j) = (i + j) mod 5. That is, for the pairs:
(1, 4) and (2, 3) → k = 0
(1, 5) and (2, 4) → k = 1
(2, 5) and (3, 4) → k = 2
(1, 2) and (3, 5) → k = 3
(1, 3) and (4, 5) → k = 4

PCY Example Pass 1, item counts:
Item Count
1 5
2 5
3 6
4 2
5 4
Note that item 4 does not meet the support.

PCY Example Hash each pair in each transaction: 3 pairs from t1, 3 from t2, 1 from t3, 1 from t4, 3 from t5, 6 from t6, 3 from t7, and 1 from t8. Total: 21 pairs.

PCY Example The hash table is:
Bucket Count
0 6
1 2
2 4
3 4
4 5
Bucket 1 does not meet the support, i.e., (1, 5) and (2, 4) cannot be frequent.

PCY Example Pass 2: The frequent items are {1, 2, 3, 5}. From the frequent items, the candidate pairs are (1,2), (1,3), (1,5), (2,3), (2,5), (3,5). Candidate (1,5) is discarded because bucket 1 is not frequent: discarded by PCY!

PCY Example Counts of the "surviving" pairs:
Pair Count
(1,2) 2
(1,3) 4
(2,3) 4
(2,5) 3
(3,5) 2
Note that pairs (1,2) and (3,5) do not meet the support. The frequent itemsets are {1}, {2}, {3}, {5}, {1,3}, {2,3}, {2,5}.
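The two PCY passes can be sketched end-to-end on this example. This is a minimal illustration (names are my own): pass 1 counts items and bucket totals; pass 2 counts only pairs of frequent items that fall in frequent buckets.

```python
from itertools import combinations

def pcy_frequent_pairs(transactions, s, n_buckets):
    """PCY sketch. s is the support threshold as an absolute count.
    Returns the set of frequent pairs (as sorted tuples)."""
    def bucket(i, j):
        return (i + j) % n_buckets  # the slides' hash: (i + j) mod 5

    # Pass 1: count items; hash every pair in every transaction to a bucket.
    item_count = {}
    bucket_count = [0] * n_buckets
    for t in transactions:
        for i in t:
            item_count[i] = item_count.get(i, 0) + 1
        for i, j in combinations(sorted(t), 2):
            bucket_count[bucket(i, j)] += 1

    frequent_items = {i for i, c in item_count.items() if c >= s}

    # Pass 2: count only pairs passing both PCY conditions.
    pair_count = {}
    for t in transactions:
        for i, j in combinations(sorted(t), 2):
            if (i in frequent_items and j in frequent_items
                    and bucket_count[bucket(i, j)] >= s):
                pair_count[(i, j)] = pair_count.get((i, j), 0) + 1

    return {p for p, c in pair_count.items() if c >= s}

# The example: s = 3, items 1..5, eight transactions, five buckets.
T = [{1, 2, 3}, {1, 4, 5}, {1, 3}, {2, 5},
     {1, 3, 4}, {1, 2, 3, 5}, {2, 3, 5}, {2, 3}]
print(sorted(pcy_frequent_pairs(T, 3, 5)))  # [(1, 3), (2, 3), (2, 5)]
```

The pair (1,5) never reaches pass-2 counting because bucket 1 is infrequent, and (1,2) and (3,5) are counted but fall short of s = 3, matching the slide.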

Association rules Association rules among hierarchies: it is possible to divide items into disjoint hierarchies (levels of a dimension), e.g., foods in a supermarket. It could be interesting to find association rules across hierarchies: they may occur among item groupings at different levels. Consider the following example.

(Figure: Hierarchy 1: Beverages; Hierarchy 2: Desserts.)

Association rules Associations such as {Frozen Yoghurt} ⇒ {Bottled Water} and {Rich Cream} ⇒ {Wine Coolers} may have enough confidence and support to be valid rules of interest.

Association rules: Negative associations A negative association is of the following type: "80% of customers who buy W do not buy Z." The problem of discovering negative associations is complex: there are millions of item combinations that do not appear in the database, and the problem is to find only the interesting negative rules.

Association rules: Negative associations One approach is to use hierarchies. Consider the following example. (Diagram: a positive association between the Soft drinks hierarchy, with brands Joke, Wakeup, and Topsy, and the Chips hierarchy, with brands Days, Nightos, and Parties.)

Association rules: Negative associations Suppose there is a strong positive association between chips and soft drinks. If we find large support for the fact that when customers buy Days chips they predominantly buy Topsy drinks (and not Joke, and not Wakeup), that would be interesting. The discovery of negative associations remains a challenge.

Association rules Additional considerations for association rules: For very large datasets, the efficiency of discovering rules can be improved by sampling (danger: discovering some false rules). Transactions show variability according to geographical location and season. The quality of the data is usually also variable.