DATA MINING: ASSOCIATION ANALYSIS (2) Instructor: Dr. Chun Yu School of Statistics Jiangxi University of Finance and Economics Fall 2015

Definition: Association Rule An association rule is an implication expression of the form X → Y, where X and Y are disjoint itemsets. Example: {Milk, Diaper} → {Beer}. Rule evaluation metrics: Support (s) – the fraction of transactions that contain both X and Y: s(X → Y) = σ(X ∪ Y)/N. Confidence (c) – how often items in Y appear in transactions that contain X: c(X → Y) = σ(X ∪ Y)/σ(X).
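To make the two metrics concrete, here is a minimal Python sketch (not from the slides; the toy transaction database is the standard five-basket example from Tan, Steinbach & Kumar) that computes s and c for {Milk, Diaper} → {Beer}:

```python
# Toy market-basket database: one Python set per transaction.
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def support(itemset, db):
    """Fraction of transactions containing every item in `itemset`."""
    return sum(itemset <= t for t in db) / len(db)

X, Y = {"Milk", "Diaper"}, {"Beer"}
s = support(X | Y, transactions)                   # s(X -> Y) = sigma(X u Y)/N
c = support(X | Y, transactions) / support(X, transactions)  # c = sigma(X u Y)/sigma(X)
print(f"s = {s:.2f}, c = {c:.2f}")                 # s = 0.40, c = 0.67
```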

Mining Association Rules Two-step approach: 1. Frequent Itemset Generation – generate all itemsets whose support ≥ minsup. 2. Rule Generation – generate high-confidence rules from each frequent itemset, where each rule is a binary partitioning of a frequent itemset.

Frequent Itemset Generation Brute-force approach: each itemset in the lattice is a candidate frequent itemset. Count the support of each candidate by scanning the database, matching each transaction against every candidate. This approach is expensive since M = 2^d candidates must be matched against all N transactions.
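A minimal sketch of this brute-force approach (illustrative, not from the slides; transactions are assumed to be Python sets): it enumerates all 2^d − 1 non-empty candidate itemsets and scans the whole database once per candidate.

```python
from itertools import combinations

def brute_force_frequent(db, minsup):
    """Enumerate every candidate itemset and scan the database for each
    one -- exactly the O(N * M * w) approach the slide warns against."""
    items = sorted(set().union(*db))
    frequent = {}
    for k in range(1, len(items) + 1):
        for cand in combinations(items, k):          # M = 2^d candidates
            count = sum(set(cand) <= t for t in db)  # one full DB scan each
            if count / len(db) >= minsup:
                frequent[cand] = count
    return frequent
```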

Computational Complexity Given d unique items: total number of itemsets = 2^d; total number of possible association rules: R = 3^d − 2^(d+1) + 1. If d = 6, R = 602 rules.
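The closed form can be checked by direct enumeration; this small sketch (illustrative, not from the slides) counts every ordered pair of disjoint non-empty itemsets:

```python
from itertools import combinations

def num_rules(d):
    """Closed form from the slide: R = 3^d - 2^(d+1) + 1."""
    return 3**d - 2**(d + 1) + 1

def num_rules_enumerated(d):
    """Count every rule X -> Y directly: X is any non-empty itemset,
    Y any non-empty subset of the remaining d - |X| items."""
    total = 0
    for k in range(1, d + 1):
        for X in combinations(range(d), k):
            total += 2 ** (d - k) - 1   # non-empty choices for Y
    return total

assert num_rules(6) == num_rules_enumerated(6) == 602
```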

Reducing Number of Candidates Apriori principle: if an itemset is frequent, then all of its subsets must also be frequent. Equivalently, if an itemset is infrequent, all of its supersets must be infrequent, which is what Apriori exploits to prune candidates. The principle holds due to the following property of the support measure: the support of an itemset never exceeds the support of its subsets. This is known as the anti-monotone property of support.
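A minimal sketch of how Apriori exploits this property (the function names apriori_gen and has_infrequent_subset are illustrative, not from the slides): k-candidates are built by joining frequent (k−1)-itemsets, and any candidate with an infrequent (k−1)-subset is pruned before its support is ever counted.

```python
from itertools import combinations

def has_infrequent_subset(candidate, prev_frequent):
    """True if some (k-1)-subset of the k-candidate is not frequent;
    by the Apriori principle such a candidate cannot be frequent."""
    k = len(candidate)
    return any(frozenset(sub) not in prev_frequent
               for sub in combinations(candidate, k - 1))

def apriori_gen(prev_frequent):
    """Build k-candidates by joining frequent (k-1)-itemsets (given as a
    set of frozensets), pruning candidates with an infrequent subset."""
    level = sorted(prev_frequent, key=sorted)   # deterministic order
    candidates = set()
    for i in range(len(level)):
        for j in range(i + 1, len(level)):
            union = level[i] | level[j]
            if (len(union) == len(level[i]) + 1       # a proper k-itemset
                    and not has_infrequent_subset(union, prev_frequent)):
                candidates.add(union)
    return candidates
```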

Reducing Number of Comparisons Candidate counting: Scan the database of transactions to determine the support of each candidate itemset To reduce the number of comparisons, store the candidates in a hash structure Instead of matching each transaction against every candidate, match it against candidates contained in the hashed buckets
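The slides' hash tree is more elaborate, but the simplified bucket-hashing sketch below (illustrative; all names are assumptions, not the textbook's data structure) shows the core idea: look up each k-subset of a transaction in a hashed bucket instead of comparing the transaction against every candidate.

```python
from collections import defaultdict
from itertools import combinations

def count_with_buckets(db, candidates, k, n_buckets=101):
    """Simplified stand-in for the hash tree: place each k-candidate in a
    bucket keyed by a hash of its items, then look up each k-subset of a
    transaction only in its own bucket."""
    buckets = defaultdict(set)
    for cand in candidates:
        fs = frozenset(cand)
        buckets[hash(fs) % n_buckets].add(fs)
    counts = defaultdict(int)
    for t in db:
        for sub in combinations(sorted(t), k):   # k-subsets of the transaction
            fs = frozenset(sub)
            if fs in buckets[hash(fs) % n_buckets]:
                counts[fs] += 1
    return counts
```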

Rule Generation How to efficiently generate rules from frequent itemsets? In general, confidence does not have an anti-monotone property: c(ABC → D) can be larger or smaller than c(AB → D). But the confidence of rules generated from the same itemset does have an anti-monotone property, e.g., for L = {A,B,C,D}: c(ABC → D) ≥ c(AB → CD) ≥ c(A → BCD). Confidence is anti-monotone w.r.t. the number of items on the RHS of the rule.
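A sketch of level-wise rule generation using this pruning (illustrative; it assumes L is a frozenset of items and support_counts maps frozensets to support counts): once a rule's confidence falls below minconf, no rule with a larger RHS is grown from it.

```python
def gen_rules(L, support_counts, minconf):
    """Generate rules from frequent itemset L, growing the RHS level by
    level and pruning via the anti-monotonicity of confidence."""
    rules = []
    rhs_level = [frozenset([x]) for x in L]        # start with 1-item RHSs
    while rhs_level and len(rhs_level[0]) < len(L):
        next_level = set()
        for Y in rhs_level:
            X = L - Y                              # LHS of candidate rule
            conf = support_counts[L] / support_counts[X]
            if conf >= minconf:
                rules.append((X, Y, conf))
                # Grow larger RHSs only from consequents that survived
                # (a simplification of the textbook's apriori-style merge).
                for item in X:
                    next_level.add(Y | {item})
        rhs_level = list(next_level)
    return rules
```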

Factors Affecting Complexity
– Choice of minimum support threshold: lowering the support threshold results in more frequent itemsets; this may increase the number of candidates and the maximum length of frequent itemsets.
– Dimensionality (number of items) of the data set: more space is needed to store the support count of each item; if the number of frequent items also increases, both computation and I/O costs may increase.
– Size of database: since Apriori makes multiple passes, the run time of the algorithm may increase with the number of transactions.
– Average transaction width: transaction width increases with denser data sets; this may increase the maximum length of frequent itemsets and the traversals of the hash tree (the number of subsets in a transaction increases with its width).

Maximal Frequent Itemset An itemset is maximal frequent if none of its immediate supersets is frequent. (Figure: itemset lattice with a border separating the frequent itemsets from the infrequent ones; the maximal frequent itemsets lie adjacent to the border on the frequent side.)

Maximal Frequent Itemset Maximal frequent itemsets effectively provide a compact representation of frequent itemsets: they form the smallest set of itemsets from which all frequent itemsets can be derived. For example, suppose the frequent itemsets fall into two groups: frequent itemsets that begin with item a and may also contain c, d, or e — this group includes itemsets such as {a}, {a,c}, {a,d}, {a,e}, and {a,c,e}; and frequent itemsets that begin with item b, c, d, or e — this group includes itemsets such as {b}, {b,c}, {c,d}, and {b,c,d,e}. Every frequent itemset in the first group is a subset of either {a,c,e} or {a,d}, and every frequent itemset in the second group is a subset of {b,c,d,e}, so the maximal itemsets {a,c,e}, {a,d}, and {b,c,d,e} represent them all.

Closed Itemset An itemset is closed if none of its immediate supersets has the same support as the itemset

Closed Itemsets (Figure: itemset lattice with each itemset annotated with the IDs of the transactions that support it; itemsets not supported by any transaction are marked.)

Closed Frequent Itemsets An itemset is a closed frequent itemset if it is closed and its support is ≥ minsup (here, minsup = 40%).
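Both definitions are easy to express directly. This sketch (illustrative, not the slides' algorithm) classifies a dict of frequent itemsets into closed and maximal ones; note that every maximal frequent itemset is also closed:

```python
def closed_and_maximal(frequent):
    """Classify frequent itemsets (dict: frozenset -> support count) as
    closed (no frequent superset with the same support) and maximal
    (no frequent superset at all)."""
    closed, maximal = [], []
    for X, sup in frequent.items():
        supersets = [Y for Y in frequent if X < Y]   # proper supersets
        if all(frequent[Y] != sup for Y in supersets):
            closed.append(X)        # no superset matches X's support
        if not supersets:
            maximal.append(X)       # no frequent superset at all
    return closed, maximal
```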

Redundant Association Rules Closed frequent itemsets are useful for removing some of the redundant association rules. An association rule X → Y is redundant if there exists another rule X′ → Y′ such that X ⊆ X′, Y ⊆ Y′, and the support and confidence of both rules are identical. Example: the two rules {b} → {d,e} and {b,c} → {d,e} have the same support and confidence; {b} is not a closed frequent itemset while {b,c} is closed, so the rule {b} → {d,e} is redundant.

Maximal vs Closed Frequent Itemsets (Figure: itemset lattice with minimum support = 2, marking the itemsets that are closed but not maximal and those that are both closed and maximal.) # Closed = 9, # Maximal = 4.

Maximal vs Closed Itemsets (Figure: Venn diagram — the maximal frequent itemsets are a subset of the closed frequent itemsets, which in turn are a subset of the frequent itemsets.)

Pattern Evaluation Association rule algorithms tend to produce too many rules; many of them are uninteresting or redundant (redundant if, e.g., {A,B,C} → {D} and {A,B} → {D} have the same support and confidence). Interestingness measures can be used to prune or rank the derived patterns. In the original formulation of association rules, support and confidence are the only measures used.

Computing Interestingness Measure Given a rule X → Y, the information needed to compute rule interestingness can be obtained from a contingency table.

Contingency table for X → Y:
        Y      ¬Y
X      f11    f10    f1+
¬X     f01    f00    f0+
       f+1    f+0     N

f11: support of X and Y; f10: support of X and ¬Y; f01: support of ¬X and Y; f00: support of ¬X and ¬Y. These counts are used to define various measures: support, confidence, lift, etc.
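All of the measures discussed on the following slides can be computed from the four cells of this table. A small illustrative Python sketch, shown here with the Tea → Coffee numbers from the next slide:

```python
import math

def measures(f11, f10, f01, f00):
    """Support, confidence, lift and phi-coefficient of X -> Y from the
    2x2 contingency table above."""
    N = f11 + f10 + f01 + f00
    f1p, fp1 = f11 + f10, f11 + f01     # row / column margins
    f0p, fp0 = f01 + f00, f10 + f00
    support = f11 / N
    confidence = f11 / f1p              # P(Y | X)
    lift = confidence / (fp1 / N)       # P(Y | X) / P(Y)
    phi = (f11 * f00 - f10 * f01) / math.sqrt(f1p * fp1 * f0p * fp0)
    return support, confidence, lift, phi

print(measures(15, 5, 75, 5))  # confidence 0.75, lift ~0.83, phi -0.25
```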

Drawback of Confidence

        Coffee   ¬Coffee   Total
Tea       15        5        20
¬Tea      75        5        80
Total     90       10       100

Association rule: Tea → Coffee. Confidence = P(Coffee | Tea) = 15/20 = 0.75, but P(Coffee) = 0.9. Although the confidence is high, the rule is misleading: P(Coffee | ¬Tea) = 75/80 = 0.9375, so customers who do not drink tea are even more likely to buy coffee.

Statistical Independence Population of 1000 students: 600 students know how to swim (S), 700 students know how to bike (B), 420 students know how to swim and bike (S, B). P(S ∧ B) = 420/1000 = 0.42 and P(S) × P(B) = 0.6 × 0.7 = 0.42. P(S ∧ B) = P(S) × P(B) ⇒ statistical independence; P(S ∧ B) > P(S) × P(B) ⇒ positively correlated; P(S ∧ B) < P(S) × P(B) ⇒ negatively correlated.

Statistical-based Measures: Lift Measures that take statistical dependence into account: Lift = P(Y | X) / P(Y), or equivalently Interest = P(X, Y) / (P(X) P(Y)). Interest = 1 if X and Y are independent; Interest > 1 if X and Y are positively correlated; Interest < 1 if X and Y are negatively correlated.

Example: Lift/Interest Using the Tea/Coffee contingency table above: Confidence = P(Coffee | Tea) = 0.75 but P(Coffee) = 0.9, so Lift = 0.75/0.9 ≈ 0.83 (< 1, therefore Tea and Coffee are negatively associated).

Drawback of Lift & Interest

        Y    ¬Y                  Y    ¬Y
X      10     0    10     X     90     0    90
¬X      0    90    90     ¬X     0    10    10
       10    90   100           90    10   100

Lift = 0.1/(0.1 × 0.1) = 10        Lift = 0.9/(0.9 × 0.9) ≈ 1.11

Even though X and Y co-occur in 90% of the transactions in the second table, its lift is far lower than in the first table, where they co-occur in only 10%. Statistical independence: if P(X,Y) = P(X)P(Y), then Lift = 1, so lift tends to be inflated for rare itemsets.

Statistical-based Measures: Correlation For binary variables, correlation can be measured using the φ-coefficient: φ = (f11 f00 − f10 f01) / √(f1+ f+1 f0+ f+0). The value of the correlation ranges from −1 to 1; if the variables are statistically independent, then φ = 0.

Correlation Analysis For the Tea/Coffee contingency table above: φ = (15 × 5 − 5 × 75) / √(20 × 90 × 80 × 10) = −300/1200 = −0.25, confirming the negative association between tea and coffee.

Drawback of Correlation Analysis (Figure: two contingency tables with symbolic counts p, q, r, s in which the roles of co-presence and co-absence are swapped, yet both yield the same φ value.) The φ-coefficient gives equal importance to co-presence (f11) and co-absence (f00) of items, which is often undesirable for sparse, asymmetric binary data such as market baskets.

Simpson's Paradox c({HDTV = yes} → {Exercise machine = yes}) = 99/180 = 55%; c({HDTV = no} → {Exercise machine = yes}) = 54/120 = 45%.

              Buy exercise machine
Buy HDTV       Yes     No     Total
Yes             99     81      180
No              54     66      120
Total          153    147      300

Simpson's Paradox For college students: c({HDTV = yes} → {Exercise machine = yes}) = 1/10 = 10%; c({HDTV = no} → {Exercise machine = yes}) = 4/34 = 11.8%. For working adults: c({HDTV = yes} → {Exercise machine = yes}) = 98/170 = 57.7%; c({HDTV = no} → {Exercise machine = yes}) = 50/86 = 58.1%.

                                Buy exercise machine
Customer group     Buy HDTV      Yes     No     Total
College students   Yes             1      9        10
                   No              4     30        34
Working adults     Yes            98     72       170
                   No             50     36        86
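The reversal is easy to verify from the counts alone; a small illustrative check using only the numbers on these two slides:

```python
def confidence(buy_both, buy_hdtv_total):
    """c({HDTV} -> {Exercise machine}) = co-occurrence / LHS count."""
    return buy_both / buy_hdtv_total

# Aggregated over all 300 customers: HDTV buyers look MORE likely to buy.
print(confidence(99, 180), confidence(54, 120))   # 0.55 vs 0.45

# Stratified by customer group, the direction reverses in BOTH strata.
print(confidence(1, 10), confidence(4, 34))       # 0.10  vs ~0.118 (students)
print(confidence(98, 170), confidence(50, 86))    # ~0.577 vs ~0.581 (adults)
```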

Lurking Variable Most customers who buy HDTVs are working adults, and working adults are also the largest group of customers who buy exercise machines. The lesson here is that proper stratification is needed to avoid generating spurious patterns resulting from Simpson's paradox. Caution: lurking variables!

Simpson's Paradox c({Transportation = Helicopter} → {Victim = died}) = 64/200 = 32%; c({Transportation = Road} → {Victim = died}) = 260/1100 = 23.64%.

All accidents   Victim died   Victim survived   Total
Helicopter           64            136            200
Road                260            840           1100
Total               324            976           1300

Lurking Variable For serious accidents: c({Transportation = Helicopter} → {Victim = died}) = 48/100 = 48%; c({Transportation = Road} → {Victim = died}) = 60/100 = 60%. For less serious accidents: c({Transportation = Helicopter} → {Victim = died}) = 16/100 = 16%; c({Transportation = Road} → {Victim = died}) = 200/1000 = 20%.

Accidents                Transportation   Victim died   Victim survived   Total
Serious accidents        Helicopter            48             52            100
                         Road                  60             40            100
Less serious accidents   Helicopter            16             84            100
                         Road                 200            800           1000

Thank you!