Rule-Based Data Mining Methods for Classification Problems in Biomedical Domains. Jinyan Li and Limsoon Wong. Copyright © 2004 by Jinyan Li and Limsoon Wong.


1 Rule-Based Data Mining Methods for Classification Problems in Biomedical Domains
Jinyan Li, Limsoon Wong
Copyright © 2004 by Jinyan Li and Limsoon Wong

2 Rule-Based Data Mining Methods for Classification Problems in Biomedical Domains
Part 4: Interesting Rules and Patterns

3 Outline
Some interesting decision trees
Performance of CS4
Demo

4 Some Interesting Decision Trees

5 Decision Tree on a Prostate Data Set
Singh et al., Cancer Cell 1:203-209, 2002
102 instances: 52 tumor samples, 50 normal samples
~12,500 numeric features
–Each one represents a gene (or probe)
–Its value is the expression level of that gene

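6 C4.5 Tree
[Figure: C4.5 decision tree. The root tests 32598_at (<= 29 vs > 29). On the <= 29 branch the tree tests 33886_at (<= 10 vs > 10) and then 34950_at (<= 5 vs > 5); on the > 29 branch it tests 40707_at (<= -6 vs > -6). Leaves are labelled Tumor or Normal; two leaf annotations, 3(+1) and 6, survive from the figure.]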

7 Rule Translation
The tree can be translated into 5 rules
Two of them are significant rules; the remaining three are trivial
The two significant rules dominate in the two classes: the normal class and the tumor class

8 Significance of the Rules
Two significant rules
–If x <= 29 and y <= 10 and z <= 5, then this is a tumor cell (94%), where x, y, z represent 32598_at, 33886_at, 34950_at respectively
–If x > 29 and 40707_at > -6, then this is a normal cell (82%)
Three trivial rules: 12%, 6%, 6%
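The two significant rules can be written directly as a small function. This is a minimal sketch, not the authors' code: the function name `classify_prostate_sample` and the dictionary input keyed by probe ID are assumptions made for illustration. The slides do not state the outcomes of the three trivial rules, so the sketch returns `None` for samples that fall outside the two significant rules.

```python
from typing import Mapping, Optional


def classify_prostate_sample(expr: Mapping[str, float]) -> Optional[str]:
    """Apply the two significant rules from the slide.

    `expr` maps probe IDs to expression levels (hypothetical input format).
    Returns "tumor", "normal", or None when the sample is covered by one of
    the three trivial rules, whose class labels are not given in the slides.
    """
    if expr["32598_at"] <= 29:
        # Rule 1: x <= 29 and y <= 10 and z <= 5 -> tumor
        if expr["33886_at"] <= 10 and expr["34950_at"] <= 5:
            return "tumor"
        return None
    # Rule 2: x > 29 and 40707_at > -6 -> normal
    if expr["40707_at"] > -6:
        return "normal"
    return None


sample = {"32598_at": 12.0, "33886_at": 4.0, "34950_at": 1.0, "40707_at": 0.0}
print(classify_prostate_sample(sample))
```

Each path through the function reads at most three of the four probes, which mirrors the depth of the tree.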

9 Another Gene Expression Data Set
Yeoh et al., Cancer Cell 1:133-143, 2002
Differentiating the MLL subtype from other subtypes of childhood leukemia
Training data
–14 MLL vs 201 others
Test data
–6 MLL vs 106 others
Number of features
–12558

10 The Decision Tree
[Figure: the decision tree]
4 mistakes on test data
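To put the 4 test mistakes in context, the short calculation below uses only the test-set sizes given on the previous slide (6 MLL vs 106 others). The comparison against a predictor that always answers "other" is an added illustration, not something the slides report.

```python
# Test-set composition from the slides: 6 MLL vs 106 others.
n_mll, n_other = 6, 106
n_test = n_mll + n_other

# The decision tree made 4 mistakes on the test data.
tree_mistakes = 4
tree_accuracy = (n_test - tree_mistakes) / n_test

# Illustrative baseline (an assumption for comparison): always predict "other",
# which misclassifies every MLL sample.
baseline_mistakes = n_mll
baseline_accuracy = (n_test - baseline_mistakes) / n_test

print(f"test samples:      {n_test}")
print(f"tree accuracy:     {tree_accuracy:.3f}")
print(f"baseline accuracy: {baseline_accuracy:.3f}")
```

Because the classes are heavily imbalanced, raw accuracy alone says little; the number of mistakes relative to the size of the minority class is the more telling figure.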

11 Translating the Tree into a Mathematical Function
Given a test sample, at most 3 of the 4 genes’ expression values are needed to make a decision!

12 Performance of CS4

13 Four Points to Demonstrate
Whether top-ranked features have similar gain ratios
Whether cascading trees have similar training performance
Whether the trees have similar structure
Whether the expanding tree committees can reduce the test errors gradually

14 An Example
For differentiation between the subtype Hyperdip>50 and some other subtypes of childhood leukemia

15 Gain Ratios of Top 20 Features
Gain ratios are: 0.39, 0.36, 0.35, 0.33, 0.33, 0.33, 0.33, 0.32, 0.31, 0.30, 0.30, 0.30, 0.30, 0.29, 0.29, 0.28, 0.28, 0.28, 0.28, 0.28.
The difference between the 1st and the 20th is only 0.11.
In fact, the two features’ partitionings differ in only a few samples
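For reference, the gain ratio of a single threshold split can be computed as information gain divided by split information. The sketch below is a simplified version under stated assumptions: the helper names `entropy` and `gain_ratio` are hypothetical, only binary splits on a numeric threshold are handled, and the additional heuristics that C4.5 applies when choosing splits are omitted. The last lines re-check the 0.11 spread using the values listed on the slide.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a sequence of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def gain_ratio(values, labels, threshold):
    """Gain ratio of the binary split `value <= threshold` vs `value > threshold`."""
    left = [lab for v, lab in zip(values, labels) if v <= threshold]
    right = [lab for v, lab in zip(values, labels) if v > threshold]
    if not left or not right:
        return 0.0  # a split that separates nothing carries no information
    n = len(labels)
    conditional = len(left) / n * entropy(left) + len(right) / n * entropy(right)
    gain = entropy(labels) - conditional
    split_info = entropy(["L"] * len(left) + ["R"] * len(right))
    return gain / split_info


# Toy data (made up for illustration): a threshold that separates the classes perfectly.
print(gain_ratio([1, 2, 3, 4], ["a", "a", "b", "b"], threshold=2))

# Gain ratios of the top 20 features, copied from the slide.
top20 = [0.39, 0.36, 0.35, 0.33, 0.33, 0.33, 0.33, 0.32, 0.31, 0.30,
         0.30, 0.30, 0.30, 0.29, 0.29, 0.28, 0.28, 0.28, 0.28, 0.28]
spread = round(top20[0] - top20[-1], 2)
print(spread)
```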

16 Training and Test Performance

17 Two Observations
The first tree does not always have the best performance
Alternative trees rooted by other top-ranked features may have better performance than the first tree
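The idea of rooting alternative trees at other top-ranked features can be illustrated with a heavily simplified committee. This is not the CS4 algorithm: each member here is a one-split tree (a stump) rather than a full tree, votes are combined by unweighted majority, and the names `best_stump`, `build_committee`, and `predict` as well as the toy data are all assumptions made for this sketch.

```python
import math
from collections import Counter


def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def best_stump(column, labels):
    """Best threshold split on one feature: (gain_ratio, threshold, left_label, right_label)."""
    best = (0.0, None, None, None)
    n = len(labels)
    for t in sorted(set(column))[:-1]:  # the largest value would leave the right side empty
        left = [lab for v, lab in zip(column, labels) if v <= t]
        right = [lab for v, lab in zip(column, labels) if v > t]
        gain = entropy(labels) - (
            len(left) / n * entropy(left) + len(right) / n * entropy(right)
        )
        ratio = gain / entropy([v <= t for v in column])
        if ratio > best[0]:
            best = (ratio, t,
                    Counter(left).most_common(1)[0][0],
                    Counter(right).most_common(1)[0][0])
    return best


def build_committee(X, y, k):
    """One stump per feature, ranked by gain ratio; keep the k top-ranked ones."""
    stumps = []
    for j in range(len(X[0])):
        ratio, t, left, right = best_stump([row[j] for row in X], y)
        if t is not None:
            stumps.append((ratio, j, t, left, right))
    stumps.sort(key=lambda s: s[0], reverse=True)
    return stumps[:k]


def predict(committee, row):
    """Unweighted majority vote over the committee members."""
    votes = Counter(left if row[j] <= t else right
                    for _, j, t, left, right in committee)
    return votes.most_common(1)[0][0]


# Toy data (made up): features 0 and 1 separate the classes, feature 2 is noisy.
X = [[1, 5, 9], [2, 6, 1], [8, 1, 8], [9, 2, 2]]
y = ["A", "A", "B", "B"]
committee = build_committee(X, y, k=3)
print(predict(committee, [1.5, 5.5, 0]))
```

Growing the committee one member at a time (k = 1, 2, 3, ...) and recording the test errors at each step is one way to check whether errors fall gradually as the committee expands.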

18 The Power of Committee

19 Compared to Bagging & Boosting
Bagging made a similar number of mistakes: 2 mistakes
However, Boosting made 13 mistakes

20 Demo

