
1 Data Mining Association Analysis: Basic Concepts and Algorithms
Lecture Notes for Chapter 6, Introduction to Data Mining, by Tan, Steinbach, Kumar

2 Mining Association Rules
- Two-step approach:
  1. Frequent Itemset Generation: generate all itemsets whose support ≥ minsup
  2. Rule Generation: generate high-confidence rules from each frequent itemset, where each rule is a binary partitioning of a frequent itemset

3 Apriori Algorithm
- Method:
  - Let k = 1
  - Generate frequent itemsets of length 1
  - Repeat until no new frequent itemsets are identified:
    - Generate length-(k+1) candidate itemsets from length-k frequent itemsets
    - Prune candidate itemsets containing subsets of length k that are infrequent
    - Count the support of each candidate by scanning the DB
    - Eliminate candidates that are infrequent, leaving only those that are frequent
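A minimal Python sketch of this loop, as one possible reading of the slide (the names are assumptions, not from the slides: `transactions` is a list of item sets and `minsup` an absolute count; candidates are generated by extending each frequent k-itemset with frequent single items rather than by the textbook F(k-1)×F(k-1) merge):

```python
from itertools import combinations

def apriori(transactions, minsup):
    """Return {itemset: support_count} for all frequent itemsets."""
    # k = 1: count single items and keep the frequent ones
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    frequent = {s: c for s, c in counts.items() if c >= minsup}
    all_frequent = dict(frequent)

    k = 1
    while frequent:
        # generate length-(k+1) candidates from length-k frequent itemsets
        items = sorted({i for s in frequent for i in s})
        candidates = set()
        for s in frequent:
            for i in items:
                if i not in s:
                    candidates.add(s | {i})
        # prune candidates that contain an infrequent length-k subset
        candidates = {c for c in candidates
                      if all(frozenset(sub) in frequent
                             for sub in combinations(c, k))}
        # count support by scanning the DB, then eliminate infrequent candidates
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        frequent = {c: n for c, n in counts.items() if n >= minsup}
        all_frequent.update(frequent)
        k += 1
    return all_frequent
```

For example, `apriori([{'bread', 'milk'}, {'bread', 'diaper', 'beer'}, {'bread', 'milk', 'diaper'}], minsup=2)` returns the frequent itemsets of this toy database together with their support counts.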

4 Rule Generation
- Given a frequent itemset L, find all non-empty subsets f ⊂ L such that f → L − f satisfies the minimum confidence requirement
  - If {A,B,C,D} is a frequent itemset, candidate rules:
    ABC → D, ABD → C, ACD → B, BCD → A,
    A → BCD, B → ACD, C → ABD, D → ABC,
    AB → CD, AC → BD, AD → BC, BC → AD, BD → AC, CD → AB
- If |L| = k, then there are 2^k − 2 candidate association rules (ignoring L → ∅ and ∅ → L)
- In a rule X → Y, X is the antecedent and Y is the consequent
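A short helper (hypothetical, not from the slides) that enumerates exactly these 2^k − 2 candidate rules:

```python
from itertools import combinations

def candidate_rules(L):
    """Yield (antecedent, consequent) for every non-empty proper subset of L."""
    items = sorted(L)
    for r in range(1, len(items)):              # antecedent sizes 1 .. k-1
        for antecedent in combinations(items, r):
            consequent = tuple(i for i in items if i not in antecedent)
            yield antecedent, consequent

# e.g. candidate_rules({'A', 'B', 'C', 'D'}) yields 2**4 - 2 = 14 rules,
# from ('A',) -> ('B', 'C', 'D') up to ('A', 'B', 'C') -> ('D',)
```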

5 Rule Generation
- If {A,B,C,D} is a frequent itemset, candidate rules:
  ABC → D, ABD → C, ACD → B, BCD → A,
  A → BCD, B → ACD, C → ABD, D → ABC,
  AB → CD, AC → BD, AD → BC, BC → AD, BD → AC, CD → AB
- Confidence of ABC → D:
  - count(ABCD) / count(ABC)
  - Do we need to scan the whole DB? No: both counts were already obtained during frequent itemset generation, since every subset of a frequent itemset is itself frequent.
- It is still computationally inefficient to consider all candidate rules.
  - Is there a pruning strategy?

6 Rule Generation
- Consider the following rule:
  - X ∪ {a} → Y, with confidence = count(X ∪ {a} ∪ Y) / count(X ∪ {a})
- Assume the confidence of X ∪ {a} → Y is lower than the minimum threshold.
- What about the rule X → Y ∪ {a}?
  - Its confidence is count(X ∪ {a} ∪ Y) / count(X), and
    count(X ∪ {a} ∪ Y) / count(X) ≤ count(X ∪ {a} ∪ Y) / count(X ∪ {a}),
    since count(X ∪ {a}) ≤ count(X).
  - So X → Y ∪ {a} cannot satisfy the threshold either and can be pruned.

7 Rule Generation
- How to efficiently generate rules from frequent itemsets?
  - In general, confidence does not have an anti-monotone property:
    c(ABC → D) can be larger or smaller than c(AB → D)
  - But the confidence of rules generated from the same itemset does have an anti-monotone property
  - e.g., L = {A,B,C,D}: c(ABC → D) ≥ c(AB → CD) ≥ c(A → BCD)
    - Confidence is anti-monotone w.r.t. the number of items on the RHS of the rule

8 Rule Generation for Apriori Algorithm
(figure: lattice of rules for a frequent itemset, showing a low-confidence rule and the rules pruned below it)

9 Rule Generation for Apriori Algorithm
- A candidate rule is generated by merging two rules that share the same prefix in the rule consequent
- join(CD → AB, BD → AC) would produce the candidate rule D → ABC
- Prune rule D → ABC if its subset AD → BC does not have high confidence

10 Confidence-Based Pruning
- Initially, all the high-confidence rules that have only one item in the rule consequent are extracted.
- These rules are then used to generate new candidate rules.
- For example, if {acd} → {b} and {abd} → {c} are high-confidence rules, then the candidate rule {ad} → {bc} is generated by merging the consequents of both rules.
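A rough Python sketch of this consequent-merging procedure (all names are assumptions; `support` is taken to be the itemset-to-count map produced during frequent itemset generation, which contains every subset of L because subsets of frequent itemsets are frequent):

```python
def generate_rules(L, support, minconf):
    """Generate high-confidence rules (antecedent, consequent) from one frequent itemset L."""
    L = frozenset(L)

    def conf(antecedent, consequent):
        return support[antecedent | consequent] / support[antecedent]

    # start with all rules that have a single item in the consequent
    consequents = [frozenset([i]) for i in L
                   if conf(L - {i}, frozenset([i])) >= minconf]
    rules = [(L - c, c) for c in consequents]

    # repeatedly merge surviving consequents that share all but one item
    while consequents and len(consequents[0]) < len(L) - 1:
        merged = set()
        for c1 in consequents:
            for c2 in consequents:
                union = c1 | c2
                if len(union) == len(c1) + 1:          # consequents share a prefix
                    merged.add(union)
        consequents = [c for c in merged
                       if conf(L - c, c) >= minconf]   # prune low-confidence rules
        rules += [(L - c, c) for c in consequents]
    return rules
```

Because a consequent is only extended if the smaller-consequent rule already passed the threshold, low-confidence rules such as AD → BC never spawn candidates like D → ABC, which is the pruning described on the previous slide.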

11 Confidence-Based Pruning
- With confidence threshold = 50%:
  {Bread,Milk} → {Diaper} (confidence = 3/3)
  {Bread,Diaper} → {Milk} (confidence = 3/3)
  {Diaper,Milk} → {Bread} (confidence = 3/3)
(figure: the example transaction table with its frequent items (1-itemsets), pairs (2-itemsets) and triplets (3-itemsets))

12 Confidence-Based Pruning (cont.)
- Merge:
  {Bread,Milk} → {Diaper}
  {Bread,Diaper} → {Milk}
  to obtain {Bread} → {Diaper,Milk} (confidence = 3/4)
- …

13 Case Study: Congressional Voting Records
- List of binary attributes from the 1984 US Congressional Voting Records (data source: UCI; 435 transactions with 34 items):
  1. democrat, republican
  2. handicapped-infants: 2 (y,n)
  3. water-project-cost-sharing: 2 (y,n)
  4. adoption-of-the-budget-resolution: 2 (y,n)
  5. physician-fee-freeze: 2 (y,n)
  6. el-salvador-aid: 2 (y,n)
  7. religious-groups-in-schools: 2 (y,n)
  8. anti-satellite-test-ban: 2 (y,n)
  9. aid-to-nicaraguan-contras: 2 (y,n)
  10. mx-missile: 2 (y,n)
  11. immigration: 2 (y,n)
  12. synfuels-corporation-cutback: 2 (y,n)
  13. education-spending: 2 (y,n)
  14. superfund-right-to-sue: 2 (y,n)
  15. crime: 2 (y,n)
  16. duty-free-exports: 2 (y,n)
  17. export-administration-act-south-africa: 2 (y,n)

14 Case Study: Congressional Voting Records
- Sample transactions:
  republican, n, y, n, y, y, y, n, n, n, y, ?, y, y, y, n, y
  democrat, ?, y, y, ?, y, y, n, n, n, n, y, n, y, y, n, n
  democrat, y, y, y, n, ?, y, n, n, n, n, y, n, y, n, n, y
- Data pre-processing:
  - Map each record to a set of items, one per party label and one per y/n vote, giving 34 items in total; missing values ('?') are skipped.
  - For example, with item 1 = republican, item 2 = democrat, item 3 = handicapped-infants=yes, item 4 = handicapped-infants=no, …:
    Transaction 1: 1, 4, …
    Transaction 2: 2, …
    Transaction 3: 2, 3, …
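One possible way to code this pre-processing in Python (the CSV layout and the item naming are assumptions based on the sample records above):

```python
ATTRIBUTES = [
    "handicapped-infants", "water-project-cost-sharing",
    "adoption-of-the-budget-resolution", "physician-fee-freeze",
    # ... remaining votes, in the order listed on the previous slide
]

def record_to_transaction(line):
    """Turn one CSV record into a set of items, skipping '?' votes."""
    fields = [f.strip() for f in line.split(",")]
    items = {fields[0]}                       # 'democrat' or 'republican'
    for name, vote in zip(ATTRIBUTES, fields[1:]):
        if vote in ("y", "n"):                # missing values ('?') are skipped
            items.add(f"{name}={'yes' if vote == 'y' else 'no'}")
    return items

# e.g. "republican,n,y,n,..." -> {'republican', 'handicapped-infants=no',
#                                  'water-project-cost-sharing=yes', ...}
```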

15 Case Study: Congressional Voting Records
- Association rules extracted using minsup = 30% and minconf = 90%:
  - {budget resolution = no, MX-missile = no, aid to El Salvador = yes} → {republican}  (confidence = 91%)
  - {budget resolution = yes, MX-missile = yes, aid to El Salvador = no} → {democrat}  (confidence = 97.5%)
  - {crime = yes, right-to-sue = yes, physician fee freeze = yes} → {republican}  (confidence = 93.5%)
  - {crime = no, right-to-sue = no, physician fee freeze = no} → {democrat}  (confidence = 100%)

16 Pattern Evaluation
- Association rule algorithms tend to produce too many rules
  - many of them are uninteresting or redundant
  - a rule is redundant if, for example, {A,B,C} → {D} and {A,B} → {D} have the same support and confidence
- Interestingness measures can be used to prune/rank the derived patterns
- In the original formulation of association rules, support and confidence are the only measures used

17 Application of Interestingness Measures
(figure: where interestingness measures are applied in the pattern post-processing step)

18 Computing Interestingness Measure
- Given a rule X → Y, the information needed to compute rule interestingness can be obtained from a contingency table.
- Contingency table for X → Y:

            Y      ¬Y
  X        f11    f10    f1+
  ¬X       f01    f00    f0+
           f+1    f+0    |T|

  f11: support of X and Y
  f10: support of X and ¬Y
  f01: support of ¬X and Y
  f00: support of ¬X and ¬Y
- These counts are used to define various measures: support, confidence, lift, Gini, J-measure, etc.
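A small helper (hypothetical names) showing how a few of these measures fall out of the four counts:

```python
def measures(f11, f10, f01, f00):
    """Support, confidence and lift of X -> Y from a 2x2 contingency table."""
    n = f11 + f10 + f01 + f00
    support = f11 / n                     # P(X, Y)
    confidence = f11 / (f11 + f10)        # P(Y | X)
    p_y = (f11 + f01) / n                 # P(Y)
    lift = confidence / p_y               # P(Y | X) / P(Y)
    return support, confidence, lift

# Tea -> Coffee table from the next slide: f11=15, f10=5, f01=75, f00=5
# measures(15, 5, 75, 5) -> (0.15, 0.75, 0.8333...)
```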

19 Drawback of Confidence

            Coffee   ¬Coffee
  Tea         15        5       20
  ¬Tea        75        5       80
              90       10      100

- Association rule: Tea → Coffee
- Confidence = P(Coffee | Tea) = 0.75, but P(Coffee) = 0.9
- Although the confidence is high, the rule is misleading: P(Coffee | ¬Tea) = 0.9375

20 Statistical Independence
- Population of 1000 students
  - 600 students know how to swim (S)
  - 700 students know how to bike (B)
  - 420 students know how to swim and bike (S,B)
  - P(S ∩ B) = 420/1000 = 0.42
  - P(S) × P(B) = 0.6 × 0.7 = 0.42
  - P(S ∩ B) = P(S) × P(B) => statistical independence
  - P(S ∩ B) > P(S) × P(B) => positively correlated
  - P(S ∩ B) < P(S) × P(B) => negatively correlated

21 Statistical-Based Measures
- Measures that take into account statistical dependence
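The usual definitions of such measures, stated here as standard formulas consistent with the lift and φ examples on the following slides (the exact list on the slide is not reproduced):

```latex
\mathrm{Lift}(X \to Y) = \frac{P(Y \mid X)}{P(Y)}, \qquad
\mathrm{Interest}(X, Y) = \frac{P(X, Y)}{P(X)\,P(Y)}, \qquad
PS(X, Y) = P(X, Y) - P(X)\,P(Y)

\phi(X, Y) = \frac{P(X, Y) - P(X)\,P(Y)}
                  {\sqrt{P(X)\,\bigl(1 - P(X)\bigr)\,P(Y)\,\bigl(1 - P(Y)\bigr)}}
```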

22 Example: Lift/Interest

            Coffee   ¬Coffee
  Tea         15        5       20
  ¬Tea        75        5       80
              90       10      100

- Association rule: Tea → Coffee
- Confidence = P(Coffee | Tea) = 0.75, but P(Coffee) = 0.9
- Lift = 0.75 / 0.9 = 0.8333 (< 1, therefore Tea and Coffee are negatively associated)

23 Drawback of Lift & Interest

         Y    ¬Y                       Y    ¬Y
  X     10     0     10      X       90     0     90
  ¬X     0    90     90      ¬X       0    10     10
        10    90    100               90    10    100

- Statistical independence: if P(X,Y) = P(X) P(Y), then Lift = 1
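Working the lift out for the two tables above (plain arithmetic from the counts shown) makes the drawback concrete:

```latex
\text{Left table: } \mathrm{Lift} = \frac{P(X,Y)}{P(X)\,P(Y)} = \frac{0.1}{0.1 \times 0.1} = 10
\qquad
\text{Right table: } \mathrm{Lift} = \frac{0.9}{0.9 \times 0.9} \approx 1.11
```

Lift is much higher for the left table even though X and Y co-occur far more often in the right table, so a larger lift does not necessarily indicate a stronger pattern.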

24 There are lots of measures proposed in the literature
- Some measures are good for certain applications, but not for others
- What criteria should we use to determine whether a measure is good or bad?
- What about Apriori-style support-based pruning? How does it affect these measures?

25 Properties of a Good Measure
- Piatetsky-Shapiro: three properties a good measure M must satisfy:
  - M(A,B) = 0 if A and B are statistically independent
  - M(A,B) increases monotonically with P(A,B) when P(A) and P(B) remain unchanged
  - M(A,B) decreases monotonically with P(A) [or P(B)] when P(A,B) and P(B) [or P(A)] remain unchanged
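A quick numeric illustration of the first property (a sketch, not from the slides): at statistical independence, PS = P(A,B) − P(A)P(B) evaluates to 0, whereas lift evaluates to 1:

```python
p_a, p_b = 0.6, 0.7
p_ab = p_a * p_b            # statistically independent case

lift = p_ab / (p_a * p_b)   # 1.0 under independence, not 0
ps = p_ab - p_a * p_b       # 0.0 under independence, as the property requires
```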

26 Comparing Different Measures
(figures: 10 example contingency tables, and the rankings of those tables produced by various measures)

27 Property under Variable Permutation
- Does M(A,B) = M(B,A)?
- Symmetric measures: support, lift, collective strength, cosine, Jaccard, etc.
- Asymmetric measures: confidence, conviction, Laplace, J-measure, etc.
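A quick check of what symmetry means here, reusing the Tea/Coffee counts from the earlier slide (the variable names are ours):

```python
f11, f10, f01, f00 = 15, 5, 75, 5       # X = Tea, Y = Coffee
n = f11 + f10 + f01 + f00

conf_xy = f11 / (f11 + f10)             # confidence(X -> Y) = 0.75
conf_yx = f11 / (f11 + f01)             # confidence(Y -> X) ~= 0.167  (asymmetric)

lift_xy = (f11 / n) / (((f11 + f10) / n) * ((f11 + f01) / n))
lift_yx = (f11 / n) / (((f11 + f01) / n) * ((f11 + f10) / n))   # identical (symmetric)
```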

28 Property under Row/Column Scaling
- Grade-Gender example (Mosteller, 1968):

            Male   Female                       Male   Female
  High        2       3       5      High        4      30      34
  Low         1       4       5      Low         2      40      42
              3       7      10                  6      70      76

  (the second table is the first with the Male column scaled by 2x and the Female column by 10x)
- Mosteller: the underlying association should be independent of the relative number of male and female students in the samples
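A small check on Mosteller's example, contrasting the odds ratio (which is invariant under row/column scaling) with the φ-coefficient (which is not); the choice of these two measures and their standard contingency-table formulas are our assumptions, not the slide's:

```python
import math

def phi(f11, f10, f01, f00):
    num = f11 * f00 - f10 * f01
    den = math.sqrt((f11 + f10) * (f01 + f00) * (f11 + f01) * (f10 + f00))
    return num / den

def odds_ratio(f11, f10, f01, f00):
    return (f11 * f00) / (f10 * f01)

# original table vs. Male column scaled 2x and Female column scaled 10x
print(phi(2, 3, 1, 4), phi(4, 30, 2, 40))                 # ~0.218 vs ~0.129: changes
print(odds_ratio(2, 3, 1, 4), odds_ratio(4, 30, 2, 40))   # 2.67 vs 2.67: unchanged
```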

29 Property under Inversion Operation
(figure: binary vectors over Transaction 1 … Transaction N, before and after inverting their 0s and 1s)

30 Example: φ-Coefficient
- The φ-coefficient is analogous to the correlation coefficient for continuous variables

         Y    ¬Y                       Y    ¬Y
  X     60    10     70      X       20    10     30
  ¬X    10    20     30      ¬X      10    60     70
        70    30    100               30    70    100

- The φ coefficient is the same for both tables
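Plugging the two tables into the standard φ formula (the same definition sketched after slide 21) confirms the claim:

```latex
\phi_{\text{left}} = \frac{0.6 - 0.7 \times 0.7}{\sqrt{0.7 \times 0.3 \times 0.7 \times 0.3}}
                   = \frac{0.11}{0.21} \approx 0.524
\qquad
\phi_{\text{right}} = \frac{0.2 - 0.3 \times 0.3}{\sqrt{0.3 \times 0.7 \times 0.3 \times 0.7}}
                    = \frac{0.11}{0.21} \approx 0.524
```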

31 Property under Null Addition
- Invariant measures: support, cosine, Jaccard, etc.
- Non-invariant measures: correlation, Gini, mutual information, odds ratio, etc.

32 Different Measures have Different Properties
(table: summary of which of the above properties — symmetry, scaling, inversion, null addition — each interestingness measure satisfies)

