
Slide 1: CSci 8980: Data Mining (Fall 2002). Vipin Kumar, Army High Performance Computing Research Center, Department of Computer Science, University of Minnesota. http://www.cs.umn.edu/~kumar

Slide 2: Mining Continuous Attributes

    Tid  A  B  C  D  E
     1   1  0  0  1  1
     2   1  0  0  1  0
     3   1  0  0  1  1
     4   1  0  1  0  0
     5   1  0  0  1  0

Example: {Refund = No, (60K <= Income <= 80K)} -> {Cheat = No}

Slide 3: Discretize Continuous Attributes

Unsupervised (see the sketch after this list):
- Equal-width binning
- Equal-depth binning
- Clustering

Supervised: place the bin boundaries using the class labels. Example class counts over attribute values v1..v9:

    Class       v1   v2   v3   v4   v5   v6   v7   v8   v9
    Anomalous    0    0   20   10   20    0    0    0    0
    Normal     150  100    0    0    0    ..   ..  150  100

Supervised binning puts the boundaries so that the values with anomalous instances (v3-v5) fall in their own bin (bin 2), with the purely normal values on either side in bins 1 and 3.
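A minimal sketch of the two unsupervised schemes; the income values, the bin count k, and the use of numpy are illustrative, not from the slides:

    import numpy as np

    values = np.array([42, 45, 51, 53, 60, 61, 75, 80, 92, 120])  # assumed incomes (K)
    k = 3

    # Equal-width binning: split the value range into k intervals of equal length.
    width_edges = np.linspace(values.min(), values.max(), k + 1)
    width_bins = np.digitize(values, width_edges[1:-1])

    # Equal-depth binning: each bin gets roughly the same number of points.
    depth_edges = np.quantile(values, np.linspace(0, 1, k + 1))
    depth_bins = np.digitize(values, depth_edges[1:-1])

    print(width_edges, width_bins)
    print(depth_edges, depth_bins)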

Slide 4: Discretization Issues

The size of the discretized intervals affects support and confidence:
- If intervals are too small, a rule may not have enough support.
- If intervals are too large, a rule may not have enough confidence.

{Refund = No, (Income = $51,250)} -> {Cheat = No}
{Refund = No, (60K <= Income <= 80K)} -> {Cheat = No}
{Refund = No, (0K <= Income <= 1B)} -> {Cheat = No}

Slide 5: Discretization Issues

Execution time:
- If an attribute takes n values, there are O(n^2) possible ranges (every contiguous range is a pair of endpoints, about n(n+1)/2 of them).

Too many rules:
{Refund = No, (Income = $51,250)} -> {Cheat = No}
{Refund = No, (51K <= Income <= 52K)} -> {Cheat = No}
{Refund = No, (50K <= Income <= 60K)} -> {Cheat = No}
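A toy count showing where the quadratic blow-up comes from (the value list is a stand-in, not from the slides):

    n = 10
    values = list(range(n))  # stand-in for n sorted attribute values

    # Every contiguous range is a (lo, hi) pair with lo <= hi.
    ranges = [(values[lo], values[hi])
              for lo in range(n) for hi in range(lo, n)]
    assert len(ranges) == n * (n + 1) // 2  # 55 for n = 10
    print(len(ranges))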

Slide 6: Approach by Srikant & Agrawal

- Discretize the attribute using equi-depth partitioning.
  - Use the partial completeness measure to determine the number of partitions.
  - C: frequent itemsets obtained by considering all ranges of attribute values.
  - P: frequent itemsets obtained by considering all ranges over the partitions.
  - P is K-complete w.r.t. C if P is a subset of C, and for every X in C there exists X' in P such that:
    1. X' is a generalization of X and support(X') <= K * support(X), where K >= 1;
    2. for every Y contained in X there exists Y' contained in X' such that support(Y') <= K * support(Y).
  - Given K (the partial completeness level), the number of intervals N can be determined.
- Merge adjacent intervals as long as their support is less than max-support (see the sketch below).
- Apply existing association rule mining algorithms.
- Determine interesting rules in the output.
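A rough sketch of the partition-and-merge step; the data, N, and max_support are made up, and the merge rule follows the slide's bullet (combine neighbors while the combined support stays below max-support):

    import numpy as np

    rng = np.random.default_rng(0)
    values = np.sort(rng.normal(60, 15, 200))  # assumed incomes (K)

    # Equi-depth partitioning: N base intervals with roughly equal counts.
    N = 10
    edges = np.quantile(values, np.linspace(0, 1, N + 1))

    # Support of each base interval (fraction of records it covers).
    counts, _ = np.histogram(values, bins=edges)
    supports = counts / len(values)

    # Merge adjacent intervals while the merged support stays below max_support.
    max_support = 0.25
    merged = [(edges[0], edges[1], supports[0])]
    for lo, hi, s in zip(edges[1:], edges[2:], supports[1:]):
        prev_lo, prev_hi, prev_s = merged[-1]
        if prev_s + s < max_support:
            merged[-1] = (prev_lo, hi, prev_s + s)  # extend the previous interval
        else:
            merged.append((lo, hi, s))

    print(merged)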

Slide 7: Interestingness Measure

Given an itemset Z = {z1, z2, ..., zk} and its generalization Z' = {z1', z2', ..., zk'}:
- P(Z): support of Z
- E_Z'(Z): expected support of Z based on Z'
- Z is R-interesting w.r.t. Z' if P(Z) >= R * E_Z'(Z).

{Refund = No, (Income = $51,250)} -> {Cheat = No}
{Refund = No, (51K <= Income <= 52K)} -> {Cheat = No}
{Refund = No, (50K <= Income <= 60K)} -> {Cheat = No}
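In Srikant & Agrawal's formulation the expected support scales the generalization's support by how much each item's support shrinks under specialization, E_Z'(Z) = P(Z') * prod_j P(z_j)/P(z_j'). A minimal sketch of the check; all support numbers below are made up:

    def expected_support(item_sups, gen_item_sups, gen_sup):
        """E_Z'(Z) = P(Z') * prod_j P(z_j) / P(z_j')."""
        e = gen_sup
        for p, p_gen in zip(item_sups, gen_item_sups):
            e *= p / p_gen
        return e

    def is_r_interesting(sup_z, item_sups, gen_item_sups, gen_sup, R=1.1):
        return sup_z >= R * expected_support(item_sups, gen_item_sups, gen_sup)

    # Assumed numbers: the items of Z cover narrower ranges than those of Z'.
    print(is_r_interesting(sup_z=0.06,
                           item_sups=[0.2, 0.3],     # P(z1), P(z2)
                           gen_item_sups=[0.5, 0.6], # P(z1'), P(z2')
                           gen_sup=0.25))            # P(Z'); expected = 0.05 -> True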

Slide 8: Interestingness Measure

For a rule S: X -> Y and its generalization S': X' -> Y':
- P(Y|X): confidence of X -> Y
- P(Y'|X'): confidence of X' -> Y'
- E_S'(Y|X): expected confidence of X -> Y based on S'

Rule S is R-interesting w.r.t. its ancestor rule S' if:
- Support: P(S) >= R * E_S'(S), or
- Confidence: P(Y|X) >= R * E_S'(Y|X).
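The rule-level test mirrors the itemset test, with confidence checked alongside support; a minimal sketch, with all numbers made up:

    def is_rule_r_interesting(sup_s, exp_sup_s, conf_s, exp_conf_s, R=1.1):
        """A rule is R-interesting if either its support or its confidence is
        at least R times the value expected from its ancestor rule."""
        return sup_s >= R * exp_sup_s or conf_s >= R * exp_conf_s

    print(is_rule_r_interesting(sup_s=0.06, exp_sup_s=0.05,
                                conf_s=0.70, exp_conf_s=0.60))  # True: 0.06 >= 0.055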

Slide 9: Min-Apriori (Han et al.)

Example: W1 and W2 tend to appear together in the same documents.

Document-term matrix: [table not captured in this transcript; each row is a document, each column the count of a word]

Slide 10: Min-Apriori

- The data contains only continuous attributes of the same "type", e.g., the frequency of each word in a document.
- Discretization does not apply: users want associations among words, not among ranges of word frequencies.
- Instead, normalize the document-term matrix so that each word's frequencies sum to 1 across documents.
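A sketch of the normalization on a small document-term matrix. The W1-W3 counts below are inferred from the support computations on slides 12-13 (the remaining columns of the slide's matrix are not recoverable from the transcript), and the column-wise scheme is what makes Sup(W1) = 1 there:

    import numpy as np

    # Rows = documents D1..D5, columns = words W1..W3 (counts inferred).
    counts = np.array([
        [2, 2, 0],
        [0, 0, 1],
        [2, 3, 0],
        [0, 0, 1],
        [1, 1, 1],
    ], dtype=float)

    # Normalize each word (column) so its frequencies sum to 1 over documents.
    normalized = counts / counts.sum(axis=0)
    print(normalized.round(2))
    # Column W1 becomes [0.4, 0, 0.4, 0, 0.2], matching Sup(W1) = 1 on slide 13.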

Slide 11: Min-Apriori

Why normalize? [Slide figure: support computed on the raw counts versus on the normalized matrix.] Without normalization, words with large absolute counts dominate the min-based support; after normalization every word carries the same total weight, so support reflects how similarly the words are distributed across documents.

Slide 12: Min-Apriori

New definition of support: for an itemset C, take in each document the minimum of the normalized frequencies of the words in C, then sum over documents:

    sup(C) = sum over documents i of min_{w in C} D(i, w)

where D(i, w) is the normalized frequency of word w in document i.

Example: Sup(W1, W2, W3) = 0 + 0 + 0 + 0 + 0.17 = 0.17

Slide 13: Anti-monotone Property of Support

Adding a word to an itemset can only lower (or keep) each per-document minimum, so this support is anti-monotone and Apriori-style pruning still applies:

Sup(W1) = 0.4 + 0 + 0.4 + 0 + 0.2 = 1
Sup(W1, W2) = 0.33 + 0 + 0.4 + 0 + 0.17 = 0.9
Sup(W1, W2, W3) = 0 + 0 + 0 + 0 + 0.17 = 0.17
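A self-contained sketch of the min-based support on the normalized matrix (W1-W3 columns inferred from the per-document terms in the sums above), reproducing the three values and the anti-monotone behavior:

    import numpy as np

    # Normalized document-term matrix: rows D1..D5, columns W1..W3.
    D = np.array([
        [0.40, 0.33, 0.00],
        [0.00, 0.00, 0.33],
        [0.40, 0.50, 0.00],
        [0.00, 0.00, 0.33],
        [0.20, 0.17, 0.33],
    ])

    def minsup(cols):
        """sup(C) = sum over documents of the minimum normalized frequency in C."""
        return D[:, cols].min(axis=1).sum()

    print(minsup([0]))        # Sup(W1)         = 1.0
    print(minsup([0, 1]))     # Sup(W1, W2)     = 0.9
    print(minsup([0, 1, 2]))  # Sup(W1, W2, W3) = 0.17
    # Each extra word can only shrink the per-document min: support never increases.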

