
1 DATA MINING: ASSOCIATION ANALYSIS (2) Instructor: Dr. Chun Yu School of Statistics Jiangxi University of Finance and Economics Fall 2015

2 Definition: Association Rule
- An association rule is an implication expression of the form X → Y, where X and Y are itemsets.
- Example: {Milk, Diaper} → {Beer}
- Rule evaluation metrics:
  - Support (s): the fraction of transactions that contain both X and Y.
  - Confidence (c): measures how often items in Y appear in transactions that contain X.
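
A minimal Python sketch of how these two metrics might be computed over a small transaction database (the example transactions and helper names are illustrative, not taken from the slides):

```python
# Illustrative transaction database (not from the slides).
transactions = [
    {"Milk", "Diaper", "Beer", "Bread"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Milk", "Bread"},
    {"Diaper", "Beer", "Coke"},
    {"Milk", "Diaper", "Bread"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(X, Y, transactions):
    """support(X ∪ Y) / support(X): how often Y appears in transactions containing X."""
    return support(set(X) | set(Y), transactions) / support(X, transactions)

print(support({"Milk", "Diaper", "Beer"}, transactions))       # 0.4
print(confidence({"Milk", "Diaper"}, {"Beer"}, transactions))   # 0.666...
```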

3 Mining Association Rules
Two-step approach:
1. Frequent itemset generation: generate all itemsets whose support ≥ minsup.
2. Rule generation: generate high-confidence rules from each frequent itemset, where each rule is a binary partitioning of a frequent itemset.

4 Frequent Itemset Generation
Brute-force approach:
- Each itemset in the lattice is a candidate frequent itemset.
- Count the support of each candidate by scanning the database.
- Match each transaction against every candidate.
- This approach is expensive since there are M = 2^d candidates!

5 Computational Complexity
Given d unique items:
- Total number of itemsets = 2^d
- Total number of possible association rules: R = 3^d − 2^(d+1) + 1
- If d = 6, R = 602 rules
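
A quick sanity check of this count (a throwaway sketch that just evaluates the closed-form expression):

```python
d = 6
R = 3**d - 2**(d + 1) + 1   # number of possible association rules over d items
print(R)                    # 602
```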

6 Reducing Number of Candidates
- Apriori principle: if an itemset is frequent, then all of its subsets must also be frequent.
- The Apriori principle holds due to the following property of the support measure: the support of an itemset never exceeds the support of its subsets.
- This is known as the anti-monotone property of support.
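
A minimal sketch of how the principle prunes candidates: a k-itemset is kept only if every one of its (k−1)-subsets was found frequent in the previous pass (function and variable names are illustrative):

```python
from itertools import combinations

def prune_candidates(candidates_k, frequent_k_minus_1):
    """Keep a k-itemset only if every (k-1)-subset of it is already frequent."""
    pruned = []
    for candidate in candidates_k:
        subsets = combinations(candidate, len(candidate) - 1)
        if all(frozenset(s) in frequent_k_minus_1 for s in subsets):
            pruned.append(candidate)
    return pruned

# Example: {A,B} and {B,C} are frequent, {A,C} is not, so {A,B,C} is pruned.
frequent_2 = {frozenset({"A", "B"}), frozenset({"B", "C"})}
print(prune_candidates([frozenset({"A", "B", "C"})], frequent_2))  # []
```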

7 Reducing Number of Comparisons
- Candidate counting: scan the database of transactions to determine the support of each candidate itemset.
- To reduce the number of comparisons, store the candidates in a hash structure: instead of matching each transaction against every candidate, match it only against the candidates contained in the hashed buckets.
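
Apriori stores candidates in a hash tree; the sketch below uses a simplified flat hash-bucket scheme to show the idea (the bucket count, hash function, and names are illustrative, not the actual hash-tree traversal):

```python
from collections import defaultdict
from itertools import combinations

def bucket(itemset, n_buckets=7):
    """Hash a candidate itemset into one of n_buckets (flat version, no tree levels)."""
    return hash(frozenset(itemset)) % n_buckets

def count_supports(transactions, candidates, k, n_buckets=7):
    """Count the support of each size-k candidate using hashed buckets."""
    buckets = defaultdict(list)
    for c in candidates:
        buckets[bucket(c, n_buckets)].append(frozenset(c))
    counts = defaultdict(int)
    for t in transactions:
        # Only candidates that land in the buckets hit by the transaction's
        # own k-subsets are compared against it.
        for subset in combinations(sorted(t), k):
            for c in buckets[bucket(subset, n_buckets)]:
                if c == frozenset(subset):
                    counts[c] += 1
    return counts
```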

8 Rule Generation
How can rules be generated efficiently from frequent itemsets?
- In general, confidence does not have an anti-monotone property: c(ABC → D) can be larger or smaller than c(AB → D).
- But the confidence of rules generated from the same itemset does have an anti-monotone property.
- e.g., for L = {A,B,C,D}: c(ABC → D) ≥ c(AB → CD) ≥ c(A → BCD)
- Confidence is anti-monotone w.r.t. the number of items on the RHS of the rule.
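
A sketch of rule generation from a single frequent itemset that uses this property: once a rule's confidence falls below minconf, none of the rules obtained by moving more items to its RHS are explored (names and the level-wise strategy are a simplification of the full Apriori rule-generation procedure):

```python
def generate_rules(frequent_itemset, support, minconf):
    """support: dict mapping frozenset -> support value (must cover all subsets)."""
    rules = []
    itemset = frozenset(frequent_itemset)
    # Start with 1-item consequents and grow the RHS; confidence can only drop
    # as items move to the RHS, so low-confidence branches are abandoned.
    consequents = [frozenset({i}) for i in itemset]
    while consequents:
        survivors = []
        for rhs in consequents:
            lhs = itemset - rhs
            if not lhs:
                continue
            conf = support[itemset] / support[lhs]
            if conf >= minconf:
                rules.append((lhs, rhs, conf))
                survivors.append(rhs)
        # Merge surviving consequents to form larger RHS candidates.
        consequents = list({a | b for a in survivors
                            for b in survivors if len(a | b) == len(a) + 1})
    return rules
```

The merge step mirrors candidate generation for itemsets: two surviving consequents are combined only when their union has exactly one more item than each of them.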

9 Factors Affecting Complexity
- Choice of minimum support threshold: lowering the support threshold results in more frequent itemsets; this may increase the number of candidates and the maximum length of frequent itemsets.
- Dimensionality (number of items) of the data set: more space is needed to store the support count of each item; if the number of frequent items also increases, both computation and I/O costs may increase.
- Size of database: since Apriori makes multiple passes, the run time of the algorithm may increase with the number of transactions.
- Average transaction width: transaction width increases with denser data sets; this may increase the maximum length of frequent itemsets and the number of hash-tree traversals (the number of subsets in a transaction increases with its width).

10 Maximal Frequent Itemset
An itemset is maximal frequent if none of its immediate supersets is frequent.
[Figure: itemset lattice showing the maximal itemsets along the border that separates frequent from infrequent itemsets]

11 Maximal Frequent Itemset
Maximal frequent itemsets effectively provide a compact representation of frequent itemsets: they form the smallest set of itemsets from which all frequent itemsets can be derived. For example:
- Frequent itemsets that begin with item a and that may contain item c, d, or e. This group includes itemsets such as {a}, {a,c}, {a,d}, {a,e}, and {a,c,e}.
- Frequent itemsets that begin with item b, c, d, or e. This group includes itemsets such as {b}, {b,c}, {c,d}, {b,c,d,e}, etc.
- Frequent itemsets in the first group are subsets of either {a,c,e} or {a,d}; frequent itemsets in the second group are subsets of {b,c,d,e}.
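
A brute-force sketch of how maximal frequent itemsets could be picked out once all frequent itemsets are known (illustrative only; real miners avoid enumerating all frequent itemsets first):

```python
def maximal_frequent(frequent_itemsets):
    """Keep the frequent itemsets that have no frequent proper superset."""
    frequent = [frozenset(s) for s in frequent_itemsets]
    return [s for s in frequent
            if not any(s < other for other in frequent)]

frequent = [{"a"}, {"a", "c"}, {"a", "d"}, {"a", "e"}, {"a", "c", "e"},
            {"b"}, {"b", "c"}, {"c", "d"}, {"b", "c", "d", "e"}]
print(maximal_frequent(frequent))
# {a,d}, {a,c,e}, {b,c,d,e} -- every other frequent itemset above is a subset
# of one of these three.
```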

12 Closed Itemset An itemset is closed if none of its immediate supersets has the same support as the itemset

13 Closed Itemsets
[Figure: itemset lattice annotated with the transaction IDs that support each itemset; itemsets not supported by any transaction are marked]

14 Closed Frequent Itemsets
minsup = 40%
An itemset is a closed frequent itemset if it is closed and its support is ≥ minsup.
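
A brute-force sketch of identifying closed frequent itemsets from a table of supports (assumes a dict mapping every itemset of interest to its support; names and the example values are illustrative):

```python
def closed_frequent(supports, minsup):
    """supports: dict frozenset -> support. An itemset is closed frequent if it is
    frequent and no immediate superset has the same support."""
    result = []
    for itemset, s in supports.items():
        if s < minsup:
            continue
        has_equal_superset = any(
            other > itemset and len(other) == len(itemset) + 1 and s_other == s
            for other, s_other in supports.items()
        )
        if not has_equal_superset:
            result.append(itemset)
    return result

supports = {frozenset({"b"}): 0.6, frozenset({"b", "c"}): 0.6,
            frozenset({"b", "c", "d"}): 0.4}
print(closed_frequent(supports, minsup=0.4))
# {b} is not closed (its superset {b,c} has the same support); {b,c} and {b,c,d} are.
```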

15 Redundant Association Rules
- Closed frequent itemsets are useful for removing some of the redundant association rules.
- An association rule X → Y is redundant if there exists another rule X' → Y', where X is a subset of X' and Y is a subset of Y', such that the support and confidence of both rules are identical.
- Example: the two rules {b} → {d,e} and {b,c} → {d,e} have the same support and confidence. {b} is not a closed frequent itemset while {b,c} is closed, so the rule {b} → {d,e} is redundant.

16 Maximal vs Closed Frequent Itemsets
Minimum support = 2
[Figure: itemset lattice with frequent itemsets marked as "closed but not maximal" or "closed and maximal"]
# Closed = 9, # Maximal = 4

17 Maximal vs Closed Itemsets

18 Pattern Evaluation
- Association rule algorithms tend to produce too many rules; many of them are uninteresting or redundant (redundant if, e.g., {A,B,C} → {D} and {A,B} → {D} have the same support and confidence).
- Interestingness measures can be used to prune or rank the derived patterns.
- In the original formulation of association rules, support and confidence are the only measures used.

19 Computing Interestingness Measure
Given a rule X → Y, the information needed to compute rule interestingness can be obtained from a contingency table.

Contingency table for X → Y:

           Y      ¬Y
X        f11     f10     f1+
¬X       f01     f00     f0+
         f+1     f+0       N

f11: support of X and Y
f10: support of X and ¬Y
f01: support of ¬X and Y
f00: support of ¬X and ¬Y

These counts are used to define various measures: support, confidence, lift, etc.
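
A small sketch of computing support, confidence, and lift directly from the four cell counts of such a table (the helper name is illustrative):

```python
def measures_from_contingency(f11, f10, f01, f00):
    """Return support, confidence, and lift of X -> Y from contingency counts."""
    n = f11 + f10 + f01 + f00
    support_xy = f11 / n
    p_y = (f11 + f01) / n
    confidence = f11 / (f11 + f10)   # P(Y | X)
    lift = confidence / p_y          # P(Y | X) / P(Y)
    return support_xy, confidence, lift

# Tea/Coffee table from the next slide: f11=15, f10=5, f01=75, f00=5.
print(measures_from_contingency(15, 5, 75, 5))
# (0.15, 0.75, 0.8333...)
```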

20 Drawback of Confidence

           Coffee   ¬Coffee   Total
Tea            15         5      20
¬Tea           75         5      80
Total          90        10     100

Association rule: Tea → Coffee
Confidence = P(Coffee|Tea) = 0.75, but P(Coffee) = 0.9 and P(Coffee|¬Tea) = 0.9375.
Although the confidence is high, the rule is misleading.

21 Statistical Independence
Population of 1000 students:
- 600 students know how to swim (S)
- 700 students know how to bike (B)
- 420 students know how to swim and bike (S,B)
P(S∧B) = 420/1000 = 0.42
P(S) × P(B) = 0.6 × 0.7 = 0.42
- P(S∧B) = P(S) × P(B) => statistical independence
- P(S∧B) > P(S) × P(B) => positively correlated
- P(S∧B) < P(S) × P(B) => negatively correlated

22 Statistical-based Measures: Lift
Measures that take into account statistical dependence:
Lift = P(Y|X) / P(Y)
Interest = P(X,Y) / (P(X) P(Y))
(for a rule X → Y the two expressions are equivalent)
- Interest = 1 if X and Y are independent
- Interest > 1 if X and Y are positively correlated
- Interest < 1 if X and Y are negatively correlated

23 Example: Lift/Interest

           Coffee   ¬Coffee   Total
Tea            15         5      20
¬Tea           75         5      80
Total          90        10     100

Association rule: Tea → Coffee
Confidence = P(Coffee|Tea) = 0.75, but P(Coffee) = 0.9
Lift = 0.75/0.9 = 0.8333 (< 1, therefore Tea and Coffee are negatively associated)

24 Drawback of Lift & Interest

           Y     ¬Y   Total
X         10      0      10
¬X         0     90      90
Total     10     90     100

           Y     ¬Y   Total
X         90      0      90
¬X         0     10      10
Total     90     10     100

Statistical independence: if P(X,Y) = P(X)P(Y), then Lift = 1.
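
A worked check of the two tables above: in the first, P(X,Y) = 0.1 and P(X) = P(Y) = 0.1, so Lift = 0.1/(0.1 × 0.1) = 10; in the second, P(X,Y) = 0.9 and P(X) = P(Y) = 0.9, so Lift = 0.9/(0.9 × 0.9) ≈ 1.11. The rarer pattern receives the far higher score even though X and Y co-occur in 90% of the transactions in the second table, which is the drawback being illustrated.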

25 Statistical-based Measures: Correlation
For binary variables, correlation can be measured using the ϕ-coefficient:
ϕ = (f11·f00 − f01·f10) / √(f1+ · f+1 · f0+ · f+0)
- The value of the correlation ranges from −1 to +1.
- If the variables are statistically independent, then ϕ = 0.
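
A small sketch of the ϕ-coefficient computed from the four cell counts, using the same f-notation as slide 19 (the helper name is illustrative):

```python
from math import sqrt

def phi_coefficient(f11, f10, f01, f00):
    """Phi (correlation) coefficient for a 2x2 contingency table."""
    f1_, f0_ = f11 + f10, f01 + f00   # row totals
    f_1, f_0 = f11 + f01, f10 + f00   # column totals
    return (f11 * f00 - f01 * f10) / sqrt(f1_ * f_1 * f0_ * f_0)

# Tea/Coffee table: f11=15, f10=5, f01=75, f00=5.
print(phi_coefficient(15, 5, 75, 5))  # -0.25
```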

26 Correlation Analysis

           Coffee   ¬Coffee   Total
Tea            15         5      20
¬Tea           75         5      80
Total          90        10     100
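
A worked check, matching the code sketch above: ϕ = (15·5 − 75·5) / √(20·80·90·10) = −300/1200 = −0.25, so Tea and Coffee are negatively correlated, in agreement with the lift analysis on the previous slides.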

27 Drawback of Correlation Analysis

           p      ¬p   Total
q        880      50     930
¬q        50      20      70
Total    930      70    1000

           r      ¬r   Total
s         20      50      70
¬s        50     880     930
Total     70     930    1000
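
A worked check using the ϕ formula above: the first table gives ϕ = (880·20 − 50·50) / √(930·70·930·70) = 15100/65100 ≈ 0.232, and the second table gives exactly the same value, even though p and q appear together in 88% of the transactions while r and s co-occur in only 2%. The ϕ-coefficient weights co-presence and co-absence equally, which is presumably the drawback these tables are meant to illustrate.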

28 Simpson's Paradox
c({HDTV = yes} → {Exercise machine = yes}) = 99/180 = 55%
c({HDTV = no} → {Exercise machine = yes}) = 54/120 = 45%

Buy HDTV    Buy exercise machine     Total
             Yes        No
Yes           99        81             180
No            54        66             120
Total        153       147             300

29 Simpson's Paradox
For college students:
c({HDTV = yes} → {Exercise machine = yes}) = 1/10 = 10%
c({HDTV = no} → {Exercise machine = yes}) = 4/34 = 11.8%
For working adults:
c({HDTV = yes} → {Exercise machine = yes}) = 98/170 = 57.7%
c({HDTV = no} → {Exercise machine = yes}) = 50/86 = 58.1%

Customer group      Buy HDTV    Buy exercise machine    Total
                                 Yes       No
College students    Yes            1        9              10
                    No             4       30              34
Working adults      Yes           98       72             170
                    No            50       36              86

30 Lurking Variable
- Most customers who buy HDTVs are working adults.
- Working adults are also the largest group of customers who buy exercise machines.
- The lesson here is that proper stratification is needed to avoid generating spurious patterns resulting from Simpson's paradox.
- Caution: lurking variable!

31 Simpson's Paradox
c({Transportation = Helicopter} → {Victim = died}) = 64/200 = 32%
c({Transportation = Road} → {Victim = died}) = 260/1100 = 23.64%

All accidents    Victim died    Victim survived    Total
Helicopter            64             136             200
Road                 260             840            1100
Total                324             976            1300

32 Lurking Variable
For serious accidents:
c({Transportation = Helicopter} → {Victim = died}) = 48/100 = 48%
c({Transportation = Road} → {Victim = died}) = 60/100 = 60%
For less serious accidents:
c({Transportation = Helicopter} → {Victim = died}) = 16/100 = 16%
c({Transportation = Road} → {Victim = died}) = 200/1000 = 20%

Accidents                Transportation    Victim died    Victim survived    Total
Serious accidents        Helicopter             48              52             100
                         Road                   60              40             100
Less serious accidents   Helicopter             16              84             100
                         Road                  200             800            1000

33 Thank you!

