Association Rules
presented by Zbigniew W. Ras *,#)
*) University of North Carolina – Charlotte
#) Warsaw University of Technology

Market Basket Analysis (MBA)
Understand customer buying habits by finding associations and correlations between the different items that customers place in their “shopping basket”.
Customer 1: Milk, eggs, sugar, bread
Customer 2: Milk, eggs, cereal, bread
Customer 3: Eggs, sugar

Market Basket Analysis
Given: a database of customer transactions, where each transaction is a set of items.
Find: groups of items which are frequently purchased together.

Goal of MBA
- Extract information on purchasing behavior.
- Actionable information: can suggest new store layouts, new product assortments, and which products to put on promotion.
MBA is applicable whenever a customer purchases multiple things in proximity.

Association Rules
- Express how products/services relate to each other and tend to group together.
- Example: “if a customer purchases three-way calling, then the customer will also purchase call-waiting”.
- Simple to understand.
- Actionable information: bundle three-way calling and call-waiting in a single package.

Basic Concepts
- Transactions may be given in relational format or compact format.
- Item: a single element; itemset: a set of items.
- Support of an itemset I, denoted sup(I): the number of transactions containing I, i.e., card({t : I ⊆ t}).
- Threshold for minimum support: σ.
- Itemset I is frequent if sup(I) ≥ σ.
- A frequent itemset represents a set of items which are positively correlated.

Frequent Itemsets
sup({dairy}) = 3
sup({fruit}) = 3
sup({dairy, fruit}) = 2
If σ = 3, then {dairy} and {fruit} are frequent while {dairy, fruit} is not.
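A minimal Python sketch of these definitions; the transaction table below is an assumed reconstruction that reproduces the counts on the slide:

# Support of an itemset: the number of transactions containing it.
transactions = [
    {"dairy", "fruit"},
    {"dairy", "bread"},
    {"dairy", "fruit"},
    {"fruit", "sugar"},
]

def sup(itemset, transactions):
    return sum(1 for t in transactions if itemset <= t)

sigma = 3
print(sup({"dairy"}, transactions))           # 3 -> frequent
print(sup({"fruit"}, transactions))           # 3 -> frequent
print(sup({"dairy", "fruit"}, transactions))  # 2 -> not frequent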

Association Rules: AR(s, c)
- {A, B} is a partition of a set of items.
- r = [A → B]
- Support of r: sup(r) = sup(A ∪ B)
- Confidence of r: conf(r) = sup(A ∪ B)/sup(A)
- Thresholds: minimum support – s, minimum confidence – c.
- r ∈ AR(s, c) if sup(r) ≥ s and conf(r) ≥ c.
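A small sketch of sup(r) and conf(r) for a rule A → B, reusing the sup() helper from the previous sketch:

def rule_support(A, B, transactions):
    # sup(r) = sup(A u B)
    return sup(A | B, transactions)

def rule_confidence(A, B, transactions):
    # conf(r) = sup(A u B) / sup(A)
    return rule_support(A, B, transactions) / sup(A, transactions)

def in_AR(A, B, transactions, s, c):
    # r is in AR(s, c) iff sup(r) >= s and conf(r) >= c
    return (rule_support(A, B, transactions) >= s
            and rule_confidence(A, B, transactions) >= c)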

Association Rules - Example
Min. support – 2 [50%], min. confidence – 50%.
For the rule A → C:
sup(A → C) = 2
conf(A → C) = sup({A,C})/sup({A}) = 2/3
The Apriori principle: any subset of a frequent itemset must be frequent.

The Apriori algorithm [Agrawal]
F_k: set of frequent itemsets of size k
C_k: set of candidate itemsets of size k

F_1 := {frequent items}; k := 1;
while card(F_k) ≥ 1 do begin
   C_{k+1} := new candidates generated from F_k;
   for each transaction t in the database do
      increment the count of all candidates in C_{k+1} that are contained in t;
   F_{k+1} := candidates in C_{k+1} with minimum support;
   k := k + 1
end
Answer := ∪ { F_k : k ≥ 1 & card(F_k) ≥ 1 }
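A runnable Python sketch of the loop above; min_sup is an absolute count, and candidate generation joins F_k with itself, pruning by the Apriori principle:

from itertools import combinations

def apriori(transactions, min_sup):
    def frequent(candidates):
        # One pass over the database counts every candidate.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        return {c for c, n in counts.items() if n >= min_sup}
    items = {i for t in transactions for i in t}
    F = frequent({frozenset([i]) for i in items})  # F_1
    answer, k = set(F), 1
    while F:
        # Join F_k with itself; keep (k+1)-sets all of whose k-subsets are frequent.
        candidates = {a | b for a in F for b in F if len(a | b) == k + 1}
        candidates = {c for c in candidates
                      if all(frozenset(s) in F for s in combinations(c, k))}
        F = frequent(candidates)  # F_{k+1}
        answer |= F
        k += 1
    return answer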

Apriori - Example
The itemset lattice over {a, b, c, d}:
1-itemsets: a, b, c, d
2-itemsets: {a,b}, {a,c}, {a,d}, {b,c}, {b,d}, {c,d}
3-itemsets: {a,b,c}, {a,b,d}, {a,c,d}, {b,c,d}
4-itemset: {a,b,c,d}
{a,d} is not frequent, so the 3-itemsets {a,b,d}, {a,c,d} and the 4-itemset {a,b,c,d} are not generated.

Algorithm Apriori: Illustration
The task of mining association rules is mainly to discover strong association rules (high confidence and strong support) in large databases. Mining association rules is composed of two steps:
1. Discover the large itemsets, i.e., the sets of items that have transaction support above a predetermined minimum support s.
2. Use the large itemsets to generate the association rules.

TID 1000: A, B, C
TID 2000: A, C
TID 3000: A, D
TID 4000: B, E, F

With MinSup = 2, the large itemsets are {A} (support 3), {B} (2), {C} (2), {A,C} (2).

Algorithm Apriori: Illustration (s = 2)
Database D:
TID 100: A, C, D
TID 200: B, C, E
TID 300: A, B, C, E
TID 400: B, E

Scan D for C_1 counts: {A}: 2, {B}: 3, {C}: 3, {D}: 1, {E}: 3
F_1: {A}: 2, {B}: 3, {C}: 3, {E}: 3
C_2: {A,B}, {A,C}, {A,E}, {B,C}, {B,E}, {C,E}
Scan D for C_2 counts: {A,B}: 1, {A,C}: 2, {A,E}: 1, {B,C}: 2, {B,E}: 3, {C,E}: 2
F_2: {A,C}: 2, {B,C}: 2, {B,E}: 3, {C,E}: 2
C_3: {B,C,E}
Scan D for C_3 count: {B,C,E}: 2
F_3: {B,C,E}: 2
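As a sanity check, running the apriori() sketch from above on database D reproduces this trace:

D = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
print(sorted(map(sorted, apriori(D, min_sup=2))))
# [['A'], ['A', 'C'], ['B'], ['B', 'C'], ['B', 'C', 'E'], ['B', 'E'],
#  ['C'], ['C', 'E'], ['E']]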

Representative Association Rules
- Definition 1. The cover C of a rule X → Y, denoted C(X → Y), is defined as follows:
  C(X → Y) = { [X ∪ Z] → V : Z, V are disjoint subsets of Y }.
- Definition 2. The set RR(s, c) of representative association rules is defined as follows:
  RR(s, c) = { r ∈ AR(s, c) : ¬(∃ r′ ∈ AR(s, c)) [r′ ≠ r ∧ r ∈ C(r′)] },
  where s is the threshold for minimum support and c is the threshold for minimum confidence.
- Representative rules (informal description): [as short as possible] → [as long as possible].
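A sketch of Definition 1; following the standard formulation of representative rules, the consequent V is assumed to be nonempty:

from itertools import combinations

def subsets(S):
    S = list(S)
    return [frozenset(c) for r in range(len(S) + 1) for c in combinations(S, r)]

def cover(X, Y):
    # C(X -> Y) = { [X u Z] -> V : Z, V disjoint subsets of Y }, V nonempty.
    return {(X | Z, V) for Z in subsets(Y) for V in subsets(Y - Z) if V}

# Example: cover(frozenset("A"), frozenset("BC")) contains five rules:
# A -> B, A -> C, A -> BC, AB -> C, AC -> B.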

Representative Association Rules - Example
Transactions: {A,B,C,D,E}, {A,B,C,D,E,F}, {A,B,C,D,E,H,I}, {A,B,E}, {B,C,D,E,H,I}
Task: find RR(2, 80%).
Representative rules from (B,C,D,E,H,I), the last maximal frequent set, with support 2:
{H} → {B,C,D,E,I}
{I} → {B,C,D,E,H}
Representative rules from (A,B,C,D,E):
{A,C} → {B,D,E}
{A,D} → {B,C,E}

Frequent Pattern (FP) Growth Strategy
Minimum support = 2.
Transactions: abcde, abc, acde, bcde, bc, bde, cde
Frequent items (with supports): c – 6, b – 5, d – 5, e – 5, a – 3
Transactions reordered by descending item support: cbdea, cba, cdea, cbde, cb, bde, cde
The reordered transactions are inserted into the FP-tree (figure).
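A sketch of the counting-and-reordering step; items are sorted by descending global support (ties broken alphabetically), which reproduces the ordering above:

from collections import Counter

def order_transactions(transactions, min_sup):
    # Count items, drop infrequent ones, and sort each transaction
    # by descending global support before FP-tree insertion.
    counts = Counter(i for t in transactions for i in t)
    freq = {i: n for i, n in counts.items() if n >= min_sup}
    return [sorted((i for i in t if i in freq), key=lambda i: (-freq[i], i))
            for t in transactions]

transactions = ["abcde", "abc", "acde", "bcde", "bc", "bde", "cde"]
print(["".join(t) for t in order_transactions(transactions, 2)])
# ['cbdea', 'cba', 'cdea', 'cbde', 'cb', 'bde', 'cde']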

Frequent Pattern (FP) Growth Strategy
Mining the FP-tree for frequent itemsets:
- Start from each item and construct a sub-database of transactions (prefix paths) with that item listed at the end.
- Reorder the prefix paths in descending support order.
- Build a conditional FP-tree.
For item a (support 3), the prefix paths are: (c b d e a, 1), (c b a, 1), (c d e a, 1).
Item supports within these paths: c – 3, b – 2, d – 2, e – 2.

Frequent Pattern (FP) Growth Strategy
For item a (support 3), with prefix paths (c b d e a, 1), (c b a, 1), (c d e a, 1), the mined frequent itemsets are:
(c a, 3), (c b a, 2), (c d a, 2), (c d e a, 2), (c e a, 2)

Multidimensional AR
Associations between values of different attributes:
RULES:
[nationality = French] → [income = high] [50%, 100%]
[income = high] → [nationality = French] [50%, 75%]
[age = 50] → [nationality = Italian] [33%, 100%]

Single-dimensional AR vs Multi-dimensional
(The slide's schema contrasts the two: a single-dimensional rule relates items of a single attribute, while a multi-dimensional rule relates values of different attributes.)

Quantitative Attributes
- Quantitative attributes (e.g., age, income)
- Categorical attributes (e.g., color of car)
Problem: too many distinct values.
Solution: transform quantitative attributes into categorical ones via discretization.

Discretization of quantitative attributes
- Static: quantitative attributes are statically discretized by using predefined concept hierarchies (an elementary use of background knowledge). Loose interaction between Apriori and the discretizer.
- Dynamic: quantitative attributes are dynamically discretized into “bins” based on the distribution of the data, e.g., considering the distance between data points. Tighter interaction between Apriori and the discretizer.
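A minimal sketch of static discretization with predefined cut points; the age bins below are illustrative assumptions, not values from the slides:

def discretize(value, edges, labels):
    # Map a quantitative value to the categorical bin it falls into.
    for edge, label in zip(edges, labels):
        if value <= edge:
            return label
    return labels[-1]

age_edges = [29, 59]  # hypothetical cut points for the age hierarchy
age_labels = ["young", "middle-aged", "old"]
print(discretize(25, age_edges, age_labels))  # young
print(discretize(45, age_edges, age_labels))  # middle-aged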

Constraint-based AR
- Preprocessing: use constraints to focus on a subset of transactions. Example: find association rules where the prices of all items are at most 200 Euro.
- Optimizations: use constraints to optimize the Apriori algorithm. Anti-monotonicity: when a set violates the constraint, so does any of its supersets; the Apriori algorithm uses this property for pruning.
- Push constraints as deep as possible inside the frequent set computation.

Apriori property revisited
- Anti-monotonicity: if a set S violates the constraint, any superset of S violates the constraint.
- Examples:
  [Price(S) ≤ v] is anti-monotone;
  [Price(S) ≥ v] is not anti-monotone;
  [Price(S) = v] is partly anti-monotone.
- Application: push [Price(S) ≤ 1000] deeply into the iterative frequent set computation.
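A sketch of pushing the anti-monotone constraint [Price(S) ≤ 1000] into pruning; the price table is hypothetical, and Price(S) is read as the total price of the items in S:

price = {"tv": 900, "dvd": 300, "cable": 50, "remote": 20}

def total_price(itemset):
    return sum(price[i] for i in itemset)

def prune(candidates, max_total=1000):
    # Any candidate violating the constraint can be dropped immediately:
    # every superset would violate it too (anti-monotonicity).
    return {c for c in candidates if total_price(c) <= max_total}

cands = {frozenset(p) for p in [("tv", "dvd"), ("dvd", "cable"), ("cable", "remote")]}
print(prune(cands))  # {tv, dvd} is pruned: 900 + 300 > 1000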

Mining Association Rules with Constraints
- Post-processing: a naive solution is to apply Apriori to find all frequent sets, and then test them for constraint satisfaction one by one.
- Optimization: Han's approach analyzes the properties of constraints comprehensively and tries to push them as deeply as possible inside the frequent set computation.

Multilevel AR
- It is difficult to find interesting patterns at too primitive a level:
  high support = too few rules;
  low support = too many rules, most of them uninteresting.
- Approach: reason at a suitable level of abstraction.
- A common form of background knowledge is that an attribute may be generalized or specialized according to a hierarchy of concepts.
- Dimensions and levels can be efficiently encoded in transactions.
- Multilevel association rules: rules which combine associations with a hierarchy of concepts.

Hierarchy of concepts

Multilevel AR
Supports in the hierarchy: Fresh [support = 20%], Dairy [support = 6%], Fruit [support = 1%], Vegetable [support = 7%].
Fresh → Bakery [20%, 60%]
Dairy → Bread [6%, 50%]
Fruit → Bread [1%, 50%] is not valid.

Support and Confidence of Multilevel Association Rules
Generalizing/specializing values of attributes affects support and confidence:
- from specialized to general: support of rules increases (new rules may become valid);
- from general to specialized: support of rules decreases (rules may become invalid because their support falls under the threshold).
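A sketch of the specialized-to-general direction: replacing each item by its parent concept can only make itemsets easier to contain, so support can only grow. The parent map below is a hypothetical example:

parent = {"milk": "dairy", "yogurt": "dairy", "apple": "fruit"}

def generalize(transaction):
    # Replace each item by its parent concept (one level up the hierarchy).
    return {parent.get(i, i) for i in transaction}

transactions = [{"milk", "apple"}, {"yogurt"}, {"apple"}]
print([generalize(t) for t in transactions])
# sup({dairy}) rises to 2 at the general level, although neither
# "milk" nor "yogurt" alone has support 2 at the specialized level.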

Mining Multilevel AR
Hierarchical attributes: age and salary.
Age hierarchy: young, middle-aged, old (over the values 18 … 80).
Salary hierarchy: low, medium, high (over the values 10k … 100k).
Association rule: (age, young) → (salary, 40k)
Candidate association rules: (age, 18) → (salary, 40k), (age, young) → (salary, low), (age, 18) → (salary, low).

Mining Multilevel AR
- Calculate frequent itemsets at each concept level until no more frequent itemsets can be found; for each level, use Apriori.
- A top-down, progressive deepening approach:
  first find high-level strong rules, e.g., fresh → bakery [20%, 60%];
  then find their lower-level “weaker” rules, e.g., fruit → bread [6%, 50%].
- Variations on mining multiple-level association rules include level-crossed association rules, e.g., fruit → wheat bread.

Multi-level Association: Uniform Support vs. Reduced Support
- Uniform support: the same minimum support for all levels. Only one minimum support threshold is needed, and there is no need to examine itemsets containing any item whose ancestors do not have minimum support. However, if the support threshold is too high, low-level associations are missed; if it is too low, too many high-level associations are generated.
- Reduced support: reduced minimum support at lower levels; different strategies are possible.

Beyond Support and Confidence
Example 1 (Aggarwal & Yu):
- {tea} → {coffee} has high support (20%) and confidence (80%).
- However, the a priori probability that a customer buys coffee is 90%.
- There is a negative correlation between buying tea and buying coffee.
- {~tea} → {coffee} has higher confidence (93%).

Correlation and Interest
- Two events are independent if P(A ∧ B) = P(A) · P(B); otherwise they are correlated.
- Interest = P(A ∧ B) / (P(A) · P(B))
- Interest expresses a measure of correlation:
  equal to 1 → A and B are independent events;
  less than 1 → A and B are negatively correlated;
  greater than 1 → A and B are positively correlated.
- In our example, I(drink tea, drink coffee) = 0.20 / (0.25 · 0.90) = 0.89, i.e., tea and coffee are negatively correlated.
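A worked check of the tea/coffee numbers, using only the values quoted in the example above:

p_tea_and_coffee = 0.20     # support of {tea, coffee}
p_tea = 0.20 / 0.80         # from confidence({tea} => {coffee}) = 80%
p_coffee = 0.90             # a priori probability of buying coffee
interest = p_tea_and_coffee / (p_tea * p_coffee)
print(round(interest, 2))   # 0.89 < 1 -> negatively correlated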

Questions? Thank You