
Rule-Based Data Mining Methods for Classification Problems in Biomedical Domains. Jinyan Li, Limsoon Wong. Copyright © 2004 by Jinyan Li and Limsoon Wong.

Presentation transcript:

1 Rule-Based Data Mining Methods for Classification Problems in Biomedical Domains Jinyan Li, Limsoon Wong

2 Rule-Based Data Mining Methods for Classification Problems in Biomedical Domains Part 2: Rule-Based Approaches

3 Outline Overview of Supervised Learning; Decision Trees; Ensembles – Bagging, Boosting, Random forest, Randomization trees, CS4

4 Overview of Supervised Learning

5 Computational Supervised Learning Also called classification. Learn from past experience, and use the learned knowledge to classify new data; the knowledge is learned by intelligent algorithms. Examples: clinical diagnosis for patients; cell type classification.

6 Data A classification application involves more than one class of data, e.g. normal vs disease cells in a diagnosis problem. Training data is a set of instances (samples, points) with known class labels. Test data is a set of instances whose class labels are to be predicted.

7 Notation Training data {⟨x1, y1⟩, ⟨x2, y2⟩, …, ⟨xm, ym⟩}, where the xj are n-dimensional vectors and the yj are drawn from a discrete space Y, e.g. Y = {normal, disease}. Test data {⟨u1, ?⟩, ⟨u2, ?⟩, …, ⟨uk, ?⟩}.
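As an illustrative Python sketch of this notation (Python and all names here are not part of the slides; the feature values are made up):

```python
# Each training instance is a pair <x_j, y_j>: an n-dimensional feature
# vector and a label from a discrete space Y, e.g. Y = {normal, disease}.
training_data = [
    ((75, 70), "normal"),
    ((80, 90), "disease"),
    ((69, 70), "normal"),
]
# Test instances <u_j, ?> carry no label yet; it is to be predicted.
test_data = [(72, 95), (64, 65)]

Y = {label for _, label in training_data}
print(Y == {"normal", "disease"})  # True
```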

8 Process Training data X with class labels Y → learn f(X), a classifier (a mapping, a hypothesis) → apply it to test data U to produce predicted class labels f(U).

9 Relational Representation of Gene Expression Data An m × n matrix of values xij: m samples (rows) by n features (columns; n on the order of 1000), where the features are gene 1 … gene n and each sample carries a class label, P or N.

10 Features Also called attributes. Categorical features, e.g. color = {red, blue, green}. Continuous or numerical features, e.g. gene expression, age, blood pressure. Continuous features can be turned into categorical ones by discretization.
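A minimal sketch of two-bin threshold discretization (the threshold 75 and the bin names are illustrative; 75 echoes the humidity split used later in the deck):

```python
def discretize(value, threshold):
    """Map a continuous feature value to one of two categorical bins."""
    return "high" if value > threshold else "low"

# e.g. humidity readings discretized at 75
readings = [70, 90, 85, 95, 70]
bins = [discretize(v, 75) for v in readings]
print(bins)  # ['low', 'high', 'high', 'high', 'low']
```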

11 An Example

12 Overall Picture of Supervised Learning Domains: biomedical, financial, government, scientific. Classifiers ("M-Doctors"): decision trees, emerging patterns, SVM, neural networks.

13 Evaluation of a Classifier Performance on independent blind test data. K-fold cross validation: divide the dataset into k even parts; k−1 of them are used for training, and the remaining part is treated as test data. LOOCV is a special case of k-fold CV in which k equals the number of instances. Metrics: accuracy, error rate, false positive rate, false negative rate, sensitivity, specificity, precision.
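The k-fold procedure described on this slide can be sketched as follows (the round-robin assignment of instances to folds is an assumption; any even division works):

```python
def k_fold_splits(data, k):
    """Divide the data into k (near-)even parts; each part serves once as
    the test fold while the remaining k-1 parts form the training set."""
    folds = [data[i::k] for i in range(k)]  # round-robin split
    for i in range(k):
        test = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, test

data = list(range(10))
splits = list(k_fold_splits(data, 5))         # 5-fold CV
loocv = list(k_fold_splits(data, len(data)))  # LOOCV: k = number of instances
print(len(splits), len(loocv))  # 5 10
```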

14 Requirements of Biomedical Classification High accuracy High comprehensibility

15 Importance of Rule-Based Methods Systematic selection of a small number of features used for decision making. Increased comprehensibility of the knowledge patterns. C4.5 and CART are two commonly used rule induction algorithms, also called decision tree induction algorithms.

16 Structure of Decision Trees A tree consists of a root node, internal nodes, and leaf nodes; internal nodes test features (x1, x2, x3, x4) against thresholds such as a1 and a2, and leaves carry class labels A or B. Each root-to-leaf path reads as a rule, e.g.: if x1 > a1 and x2 > a2, then class A. C4.5 and CART are two of the most widely used inducers. Trees offer easy interpretation, but their accuracy is generally unattractive.
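The rule on this slide (if x1 > a1 and x2 > a2, then class A) reads directly as code; a1 and a2 are hypothetical thresholds, and the tree's other leaves are collapsed to class B here for brevity:

```python
def classify(x1, x2, a1=5.0, a2=3.0):
    """One root-to-leaf path as a rule: if x1 > a1 and x2 > a2 then A.
    (a1, a2 are placeholder thresholds; other leaves collapsed to B.)"""
    if x1 > a1 and x2 > a2:
        return "A"
    return "B"

print(classify(6.0, 4.0), classify(6.0, 2.0))  # A B
```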

17 Elegance of Decision Trees

18 Brief History of Decision Trees CLS (Hunt et al., 1966) – cost driven. ID3 (Quinlan, 1986, MLJ) – information driven. C4.5 (Quinlan, 1993) – gain ratio + pruning ideas. CART (Breiman et al., 1984) – Gini index.

19 A Simple Dataset 9 Play samples and 5 Don't, a total of 14.

20 A Decision Tree Root node outlook with branches sunny, overcast, rain; the sunny branch tests humidity (≤ 75: Play, > 75: Don't), the rain branch tests windy (false: Play, true: Don't), and overcast leads directly to Play. Constructing a smallest decision tree is an NP-complete problem.

21 Construction of a Decision Tree Determination of the root node of the tree and of the root nodes of its sub-trees.

22 Most Discriminatory Feature Every feature can be used to partition the training data. If the partitions contain a pure class of training instances, then this feature is the most discriminatory.

23 Example of Partitions Categorical feature: the number of partitions of the training data equals the number of values of this feature. Numerical feature: two partitions, split at a threshold.

24 The dataset (instance #, Outlook, Temp, Humidity, Windy, Class):
#   Outlook   Temp  Humidity  Windy  Class
1   Sunny     75    70        true   Play
2   Sunny     80    90        true   Don't
3   Sunny     85    85        false  Don't
4   Sunny     72    95        true   Don't
5   Sunny     69    70        false  Play
6   Overcast  72    90        true   Play
7   Overcast  83    78        false  Play
8   Overcast  64    65        true   Play
9   Overcast  81    75        false  Play
10  Rain      71    80        true   Don't
11  Rain      65    70        true   Don't
12  Rain      75    80        false  Play
13  Rain      68    80        false  Play
14  Rain      70    96        false  Play

25 Total 14 training instances. Outlook = sunny: instances 1, 2, 3, 4, 5 (P, D, D, D, P). Outlook = overcast: instances 6, 7, 8, 9 (P, P, P, P). Outlook = rain: instances 10, 11, 12, 13, 14 (D, D, P, P, P).

26 Total 14 training instances. Temperature ≤ 70: instances 5, 8, 11, 13, 14 (P, P, D, P, P). Temperature > 70: instances 1, 2, 3, 4, 6, 7, 9, 10, 12 (P, D, D, D, P, P, P, D, P).
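The temperature partition can be reproduced directly from the slide-24 table (only the columns needed here are carried along):

```python
# (instance number, temperature, class) from the slide-24 weather table
data = [(1, 75, "P"), (2, 80, "D"), (3, 85, "D"), (4, 72, "D"),
        (5, 69, "P"), (6, 72, "P"), (7, 83, "P"), (8, 64, "P"),
        (9, 81, "P"), (10, 71, "D"), (11, 65, "D"), (12, 75, "P"),
        (13, 68, "P"), (14, 70, "P")]

low = [(i, c) for i, t, c in data if t <= 70]   # Temperature <= 70
high = [(i, c) for i, t, c in data if t > 70]   # Temperature > 70
print(low)  # instances 5, 8, 11, 13, 14 with classes P, P, D, P, P
```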

27 Three Measures Gini index Information gain Gain ratio
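These three measures can be computed for the outlook split of the 14 instances (9 Play, 5 Don't); the code below is a sketch, with gain ratio taken as information gain divided by the entropy of the branch sizes:

```python
from math import log2

def gini(counts):
    """Gini index of a class distribution, e.g. gini([9, 5])."""
    n = sum(counts)
    return 1 - sum((c / n) ** 2 for c in counts)

def entropy(counts):
    n = sum(counts)
    return -sum((c / n) * log2(c / n) for c in counts if c > 0)

# Outlook split: [Play, Don't] counts per branch (sunny, overcast, rain)
parent, branches = [9, 5], [[2, 3], [4, 0], [3, 2]]
n = sum(parent)
info_gain = entropy(parent) - sum(sum(b) / n * entropy(b) for b in branches)
split_info = entropy([sum(b) for b in branches])  # entropy of branch sizes
gain_ratio = info_gain / split_info
print(round(info_gain, 3), round(gain_ratio, 3))  # 0.247 0.156
```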

28 Steps of Decision Tree Construction Select the best feature as the root node of the whole tree. After partitioning by this feature, select the best feature (with respect to the subset of training data) as the root node of each sub-tree. Recurse until the partitions become pure or almost pure.

29 Characteristics of C4.5 Trees Single coverage of training data (elegance). Divide-and-conquer splitting strategy. Fragmentation problem. Rules are locally reliable but globally insignificant; many globally significant rules are missed, which can mislead the system.

30 Decision Tree Ensembles Bagging Boosting Random forest Randomization trees CS4

31 Motivating Example h1, h2, h3 are independent classifiers with accuracy = 60%; C1, C2 are the only classes; t is a test instance in C1. Majority vote: h(t) = argmax over C in {C1, C2} of |{hj in {h1, h2, h3} : hj(t) = C}|. Then prob(h(t) = C1) = prob(h1(t)=C1, h2(t)=C1, h3(t)=C1) + prob(h1(t)=C1, h2(t)=C1, h3(t)=C2) + prob(h1(t)=C1, h2(t)=C2, h3(t)=C1) + prob(h1(t)=C2, h2(t)=C1, h3(t)=C1) = 60% × 60% × 60% + 3 × (60% × 60% × 40%) = 64.8%.
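The 64.8% figure can be verified by enumerating all eight voting outcomes:

```python
from itertools import product

p = 0.6  # accuracy of each of the 3 independent classifiers
maj_acc = 0.0
for votes in product([True, False], repeat=3):  # True = votes for C1
    if sum(votes) >= 2:                         # majority predicts C1
        prob = 1.0
        for correct in votes:
            prob *= p if correct else 1 - p
        maj_acc += prob
print(round(maj_acc, 3))  # 0.648, i.e. the 64.8% on the slide
```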

32 Bagging Proposed by Breiman (1996). Also called bootstrap aggregating. Makes use of randomness injected into the training data.

33 Main Ideas From the original training set of 50 p + 50 n, draw bootstrap samples such as 48 p + 52 n, 49 p + 51 n, 53 p + 47 n, …; a base inducer such as C4.5 is trained on each, yielding a committee H of classifiers h1, h2, …, hk.

34 Decision Making by Bagging Given a new test sample T, every committee member casts one vote for its predicted class, and the majority class is returned (equal voting).
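A toy sketch of bagging's two ingredients, bootstrap sampling and equal voting; the constant classifiers below merely stand in for trained C4.5 trees:

```python
import random
from collections import Counter

def bootstrap(data, rng):
    """One bagging replicate: |data| instances drawn with replacement."""
    return [rng.choice(data) for _ in data]

def bag_predict(committee, t):
    """Equal voting: each classifier casts one vote; majority wins."""
    return Counter(h(t) for h in committee).most_common(1)[0][0]

# Stand-in committee: two classifiers say P, one says D.
committee = [lambda t: "P", lambda t: "P", lambda t: "D"]
sample = bootstrap(list(range(100)), random.Random(0))
print(len(sample), bag_predict(committee, None))  # 100 P
```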

35 Boosting AdaBoost, by Freund & Schapire (1995). Also called adaptive boosting. Makes use of weighted instances and weighted voting.

36 Main Ideas Start with 100 instances of equal weight and train a classifier h1, obtaining its error e1. If the error is 0 or > 0.5, stop. Otherwise re-weight the correctly classified instances by e1/(1−e1) and renormalize the weights to sum to 1; the 100 instances, now with different weights, are used to train the next classifier h2, and so on.
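The re-weighting step can be sketched as follows; scaling the correctly classified instances by e/(1−e) and renormalizing is the standard AdaBoost.M1 update (the 4-instance example is illustrative, not from the slides):

```python
def reweight(weights, correct, error):
    """AdaBoost.M1 update: correctly classified instances are scaled by
    error/(1-error); then all weights are renormalized to sum to 1."""
    beta = error / (1 - error)
    w = [wi * (beta if ok else 1.0) for wi, ok in zip(weights, correct)]
    total = sum(w)
    return [wi / total for wi in w]

# 4 instances, equal weight; the classifier misclassifies only the last one.
w = reweight([0.25] * 4, [True, True, True, False], error=0.25)
print([round(x, 3) for x in w])  # [0.167, 0.167, 0.167, 0.5]
```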

37 Decision Making by AdaBoost.M1 Given a new test sample T, the classifiers vote with unequal weights determined by their training errors (log((1−ei)/ei) in AdaBoost.M1), and the class with the largest total weighted vote is returned.

38 Bagging vs Boosting Bagging: the classifiers are constructed independently; equal voting. Boosting: the construction of each new classifier depends on the performance of its predecessor, i.e. sequential construction (a series of classifiers); weighted voting.

39 Random Forest Proposed by Breiman (2001). Similar to bagging, but the base inducer is not the standard C4.5. Makes use of randomness twice: in the bootstrap samples and in the features considered at each node.

40 Main Ideas As in bagging, bootstrap samples (48 p + 52 n, 49 p + 51 n, 53 p + 47 n, …) are drawn from the original training set of 50 p + 50 n, but the base inducer is a revised C4.5; the result is a committee H of classifiers h1, h2, …, hk.

41 A Revised C4.5 as Base Classifier At each node, beginning with the root, the split feature is selected not from the original n features but from m_try randomly chosen ones.
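A sketch of the per-node feature restriction; taking m_try = sqrt(n) is a common default, not something the slide specifies:

```python
import math
import random

def candidate_features(n_features, m_try, rng):
    """At each node, the revised inducer considers only m_try features
    drawn at random (without replacement) from the n available."""
    return rng.sample(range(n_features), m_try)

n = 1000                   # order-1000 genes, as in the expression data
m_try = int(math.sqrt(n))  # 31 candidate features per node
cands = candidate_features(n, m_try, random.Random(0))
print(len(cands), len(set(cands)))  # 31 31
```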

42 Decision Making by Random Forest Given a new test sample T, the committee votes with equal weight, as in bagging.

43 Randomization Trees Proposed by Dietterich (2000). Makes use of randomness in the selection of the best split point.

44 Main Ideas At each node, beginning with the root, instead of taking the single best split over the n original features, one split is selected at random from the 20 best candidates (e.g. feature 1: choices 1, 2, 3; feature 2: choices 1, 2, …; feature 8: choices 1, 2, 3). Equal voting on the committee of such decision trees.

45 CS4 Proposed by Li et al. (2003). CS4: cascading and sharing for decision trees. Does not make use of randomness.

46 Main Ideas Selection of root nodes is done in a cascading manner: the top-ranked features 1, 2, …, k serve in turn as the root nodes of tree-1, tree-2, …, tree-k, for a total of k trees.

47 Decision Making by CS4 Voting is not equal.

48 Summary of Ensemble Classifiers Bagging, Random Forest, AdaBoost.M1, Randomization Trees: rules may not be correct when applied to the training data. CS4: rules correct.

49 Any Questions?

