Slide 1: CSci 8980: Data Mining (Fall 2002)
Vipin Kumar
Army High Performance Computing Research Center
Department of Computer Science, University of Minnesota
http://www.cs.umn.edu/~kumar

Slide 2: Interestingness Measures
- Association rule algorithms tend to produce too many rules:
  - many of them are uninteresting or redundant
  - redundant if {A,B,C} → {D} and {A,B} → {D} have the same support & confidence
- Interestingness measures can be used to prune/rank the derived patterns
- In the original formulation of association rules, support & confidence are the only measures used

Slide 3: Application of Interestingness Measure
(Figure: interestingness measures applied to the patterns produced by mining; only the label "Interestingness Measures" survived extraction.)

Slide 4: Computing Interestingness Measure
- Given a rule X → Y, the information needed to compute rule interestingness can be obtained from a contingency table.

Contingency table for X → Y:

             Y     ¬Y
   X       f11    f10    f1+
  ¬X       f01    f00    f0+
           f+1    f+0    |T|

  f11: support count of X and Y
  f10: support count of X and ¬Y
  f01: support count of ¬X and Y
  f00: support count of ¬X and ¬Y

- Various measures can then be applied: support, confidence, lift, Gini, J-measure, etc. (a sketch follows below)
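
A minimal sketch (not from the slides) of how a few of these measures are computed from the four cell counts; the function name measures and the exact set of measures returned are choices made here for illustration:

```python
from math import sqrt

def measures(f11, f10, f01, f00):
    """Interestingness measures from a 2x2 contingency table (a sketch)."""
    n = f11 + f10 + f01 + f00           # |T|
    f1p, fp1 = f11 + f10, f11 + f01     # margins f1+ and f+1
    f0p, fp0 = f01 + f00, f10 + f00     # margins f0+ and f+0
    support = f11 / n                   # P(X, Y)
    confidence = f11 / f1p              # P(Y | X)
    lift = confidence / (fp1 / n)       # P(Y | X) / P(Y)
    # phi-coefficient: correlation analogue for binary variables (see Slide 14)
    phi = (f11 * f00 - f10 * f01) / sqrt(f1p * f0p * fp1 * fp0)
    return {"support": support, "confidence": confidence,
            "lift": lift, "phi": phi}
```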

Slide 5: Drawback of Confidence

           Coffee   ¬Coffee
   Tea        15        5      20
  ¬Tea        75        5      80
              90       10     100

Association rule: Tea → Coffee
Confidence = P(Coffee | Tea) = 15/20 = 0.75, but P(Coffee) = 0.9.
Although confidence is high, the rule is misleading: P(Coffee | ¬Tea) = 75/80 = 0.9375.
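
Plugging the slide's numbers into the measures() sketch above makes the point concrete: the rule's lift is below 1, so Tea and Coffee are in fact negatively associated.

```python
m = measures(15, 5, 75, 5)   # Tea → Coffee table from the slide
print(m["confidence"])       # 0.75: looks like a strong rule
print(m["lift"])             # 0.75 / 0.90 ≈ 0.83 < 1: negative association
```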

Slide 6: Other Measures
(Slide content not captured in the transcript.)

Slide 7: (Slide content not captured in the transcript.)

Slide 8: Properties of a Good Measure
- Piatetsky-Shapiro: three properties a good measure M should satisfy:
  - M(A,B) = 0 if A and B are statistically independent
  - M(A,B) increases monotonically with P(A,B) when P(A) and P(B) remain unchanged
  - M(A,B) decreases monotonically with P(A) [or P(B)] when P(A,B) and P(B) [or P(A)] remain unchanged
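
As an illustration (not from the slides), the Piatetsky-Shapiro measure itself, PS(A,B) = P(A,B) − P(A)P(B), satisfies all three properties, which is easy to spot-check numerically:

```python
# Spot-check of the three properties for PS(A,B) = P(A,B) - P(A)P(B);
# the probability values below are arbitrary examples.
def ps(p_ab, p_a, p_b):
    return p_ab - p_a * p_b

assert ps(0.3 * 0.5, 0.3, 0.5) == 0.0           # 1) zero under independence
assert ps(0.25, 0.4, 0.5) > ps(0.20, 0.4, 0.5)  # 2) grows with P(A,B)
assert ps(0.20, 0.6, 0.5) < ps(0.20, 0.4, 0.5)  # 3) shrinks as P(A) grows
```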

Slide 9: Lift & Interest

             Y     ¬Y                        Y     ¬Y
   X        10      0     10       X        90      0     90
  ¬X         0     90     90      ¬X         0     10     10
            10     90    100                90     10    100

Statistical independence: if P(X,Y) = P(X)P(Y), then Lift = 1.
For the left table, Lift = 0.1/(0.1 × 0.1) = 10; for the right, Lift = 0.9/(0.9 × 0.9) ≈ 1.11, even though X and Y are perfectly associated in both tables.
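
Running the measures() sketch on both tables shows how strongly lift depends on the marginals, even though the two variables determine each other perfectly in both cases (φ = 1 for each):

```python
print(measures(10, 0, 0, 90)["lift"])  # 10.0
print(measures(90, 0, 0, 10)["lift"])  # ≈ 1.11
print(measures(10, 0, 0, 90)["phi"],   # 1.0 and 1.0: phi rates the
      measures(90, 0, 0, 10)["phi"])   # two tables identically
```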

Slide 10: Comparing Different Measures
(Figures: 10 examples of contingency tables, and rankings of the contingency tables using various measures.)

Slide 11: Property under Variable Permutation
Does M(A,B) = M(B,A)?
- Symmetric measures: support, lift, collective strength, cosine, Jaccard, etc.
- Asymmetric measures: confidence, conviction, Laplace, J-measure, etc.
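
Swapping A and B transposes the contingency table, i.e., exchanges f10 and f01. Checking an arbitrary table with the measures() sketch from Slide 4:

```python
m_ab = measures(15, 5, 75, 5)   # rule A → B
m_ba = measures(15, 75, 5, 5)   # rule B → A: f10 and f01 exchanged
print(m_ab["lift"], m_ba["lift"])              # equal: lift is symmetric
print(m_ab["confidence"], m_ba["confidence"])  # 0.75 vs ≈ 0.17: asymmetric
```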

Slide 12: Property under Row/Column Scaling
Grade-Gender example (Mosteller, 1968):

          Male  Female                     Male  Female
  High      2      3      5       High       4     30     34
  Low       1      4      5       Low        2     40     42
            3      7     10                  6     70     76
                                           (×2)   (×10)

Mosteller: the underlying association should be independent of the relative number of male and female students in the samples. (The second table scales the Male column by 2 and the Female column by 10.)
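
The odds ratio, f11·f00 / (f10·f01), is the classic example of a measure with this invariance; a quick check on the two tables (a sketch, not from the slides):

```python
# Odds ratio is invariant under row/column scaling: multiplying the Male
# column by 2 and the Female column by 10 leaves it unchanged.
def odds_ratio(f11, f10, f01, f00):
    return (f11 * f00) / (f10 * f01)

print(odds_ratio(2, 3, 1, 4))    # 8/3 ≈ 2.67, original sample
print(odds_ratio(4, 30, 2, 40))  # 8/3 ≈ 2.67, after column scaling
```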

Slide 13: Property under Inversion Operation
(Figure: item vectors over Transaction 1 ... Transaction N. Inversion flips every 0 to 1 and every 1 to 0 in both vectors, which exchanges f11 with f00 and f10 with f01.)

Slide 14: Example: φ-Coefficient
- The φ-coefficient is analogous to the correlation coefficient for continuous variables.

             Y     ¬Y                        Y     ¬Y
   X        60     10     70       X        20     10     30
  ¬X        10     20     30      ¬X        10     60     70
            70     30    100                30     70    100

φ = (60·20 − 10·10)/√(70·30·70·30) ≈ 0.524 for both tables: the coefficient is the same even though the right table is the inversion of the left.
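
Verifying with the measures() sketch from Slide 4 that φ is unchanged by inversion (swapping f11 with f00):

```python
print(measures(60, 10, 10, 20)["phi"])  # ≈ 0.524
print(measures(20, 10, 10, 60)["phi"])  # ≈ 0.524: invariant under inversion
```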

Slide 15: Property under Null Addition
Null addition adds transactions that contain neither variable, i.e., it increases only f00.
- Invariant measures: support, cosine, Jaccard, etc.
- Non-invariant measures: correlation, Gini, mutual information, odds ratio, etc.
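
Cosine and Jaccard depend only on f11, f10, and f01, so adding null transactions (growing f00) cannot move them, while φ does move. A small demonstration, with helper definitions written here for illustration and measures() reused from the Slide 4 sketch:

```python
from math import sqrt

def cosine(f11, f10, f01, f00):
    return f11 / sqrt((f11 + f10) * (f11 + f01))   # ignores f00

def jaccard(f11, f10, f01, f00):
    return f11 / (f11 + f10 + f01)                 # ignores f00

for f00 in (20, 1000):  # null addition: only f00 grows
    print(cosine(60, 10, 10, f00),          # ≈ 0.857 both times
          jaccard(60, 10, 10, f00),         # 0.750 both times
          measures(60, 10, 10, f00)["phi"]) # ≈ 0.52 -> ≈ 0.85: not invariant
```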

Slide 16: Different Measures have Different Properties

Slide 17: Support-based Pruning
- Most association rule mining algorithms use the support measure to prune rules and itemsets
- Study the effect of support pruning on the correlation of itemsets (sketched in code below):
  - generate 10,000 random contingency tables
  - compute support and pairwise correlation for each table
  - apply support-based pruning and examine the tables that are removed
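
A minimal sketch of that experiment (not the authors' code; the sampling scheme for the random tables and the 5% threshold are assumptions made here):

```python
import random

def random_table(n=100):
    """Split n transactions randomly into the four cells f11, f10, f01, f00."""
    cuts = sorted(random.randint(0, n) for _ in range(3))
    f11, f10, f01 = cuts[0], cuts[1] - cuts[0], cuts[2] - cuts[1]
    f00 = n - cuts[2]
    return f11, f10, f01, f00

tables = [random_table() for _ in range(10000)]
pruned = [t for t in tables if t[0] / 100 < 0.05]   # support below 5%
negative = sum(1 for f11, f10, f01, f00 in pruned
               if f11 * f00 - f10 * f01 < 0)        # sign of phi's numerator
print(f"{negative / len(pruned):.0%} of pruned tables are negatively correlated")
```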

Slide 18: Effect of Support-based Pruning

Slide 19: Effect of Support-based Pruning
Support-based pruning eliminates mostly negatively correlated itemsets.

Slide 20: Effect of Support-based Pruning
- Investigate how support-based pruning affects other measures
- Steps (a sketch follows below):
  - generate 10,000 contingency tables
  - rank each table according to the different measures
  - compute the pairwise correlation between the measures
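
A sketch of the ranking comparison for one pair of measures, reusing random_table() and measures() from the earlier sketches; using Pearson correlation of the rank vectors (i.e., Spearman correlation) is an assumption about the exact statistic used in the study:

```python
def ranks(xs):
    """Position of each value in sorted order (ties broken arbitrarily)."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0] * len(xs)
    for pos, i in enumerate(order):
        r[i] = pos
    return r

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# keep only tables with nonzero margins so every measure is well defined
tables = [t for t in (random_table() for _ in range(10000))
          if 0 < t[0] + t[1] < 100 and 0 < t[0] + t[2] < 100]
lift = [measures(*t)["lift"] for t in tables]
phi = [measures(*t)["phi"] for t in tables]
print(pearson(ranks(lift), ranks(phi)))  # how similarly lift and phi rank tables
```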

Slide 21: Effect of Support-based Pruning
- Without support pruning (all pairs):
  - red cells indicate correlation between the pair of measures > 0.85
  - 40.14% of pairs have correlation > 0.85
(Figure: scatter plot between the correlation and Jaccard measures.)

Slide 22: Effect of Support-based Pruning
- 0.5% ≤ support ≤ 50%:
  - 61.45% of pairs have correlation > 0.85
(Figure: scatter plot between the correlation and Jaccard measures.)

Slide 23: Effect of Support-based Pruning
- 0.5% ≤ support ≤ 30%:
  - 76.42% of pairs have correlation > 0.85
(Figure: scatter plot between the correlation and Jaccard measures.)

