Biological data mining by Genetic Programming AI Project #2 Biointelligence lab Cho, Dong-Yeon

1 Biological Data Mining by Genetic Programming
AI Project #2
Biointelligence Lab
Cho, Dong-Yeon (dycho@bi.snu.ac.kr)

2 Project Purpose
© 2006 SNU CSE Biointelligence Lab
Medical Diagnosis
  - To predict the presence or absence of a disease given the results of various medical tests carried out on a patient
  - Human experts (M.D.) vs. machine (GP)
Two Data Sets
  - Heart Disease
  - Diabetes

3 Heart Disease
Data Description
  - Number of patients (270)
      - Absence (150)
      - Presence (120)
  - 13 attributes
      - age
      - sex
      - chest pain type (4 values)
      - resting blood pressure
      - serum cholesterol in mg/dl
      - fasting blood sugar > 120 mg/dl
      - resting electrocardiographic results (values 0, 1, 2)
      - maximum heart rate achieved
      - exercise induced angina
      - oldpeak = ST depression induced by exercise relative to rest
      - the slope of the peak exercise ST segment
      - number of major vessels (0-3) colored by fluoroscopy
      - thal: 3 = normal; 6 = fixed defect; 7 = reversible defect

4 Learning a Classifier
GP settings
  - Functions
      - Numerical and condition operators
      - {+, -, *, /, exp, log, sin, cos, sqrt, iflte, ifltz, …}
      - Some operators must be protected against illegal operations (for example, division by zero or the log of a non-positive number).
  - Terminals
      - Input attributes and constants
      - {x_1, x_2, …, x_13, R}, where R ∈ [a, b]
  - Additional parameters
      - Threshold value
      - For preprocessing (normalization)
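The protection mentioned above can be sketched as follows. This is only an illustration: the fallback values (1.0 for division, 0.0 for log), the clipping range for exp, the argument order of `iflte`, and the helper `classify` with its default threshold are all assumptions, not settings given in the slides.

```python
import math

EPS = 1e-9  # assumed tolerance for "too close to zero"

def protected_div(a: float, b: float) -> float:
    """Divide a by b, returning 1.0 (assumed fallback) when b is near zero."""
    return a / b if abs(b) > EPS else 1.0

def protected_log(a: float) -> float:
    """Natural log of |a|, returning 0.0 (assumed fallback) near zero."""
    return math.log(abs(a)) if abs(a) > EPS else 0.0

def protected_sqrt(a: float) -> float:
    """Square root of |a|, so negative inputs do not raise an error."""
    return math.sqrt(abs(a))

def protected_exp(a: float) -> float:
    """exp with the argument clipped to [-50, 50] to avoid overflow."""
    return math.exp(max(-50.0, min(50.0, a)))

def iflte(a, b, c, d):
    """Assumed semantics: return c if a <= b, otherwise d."""
    return c if a <= b else d

def classify(output: float, threshold: float = 0.0) -> int:
    """Turn a raw GP output into a class label using a threshold value."""
    return 1 if output > threshold else 0
```

With primitives like these, any randomly generated expression can be evaluated on any patient record without raising an arithmetic error, which keeps fitness evaluation well defined for the whole population.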

5 Cross Validation (1/3)
K-fold Cross Validation
  - The data set is randomly divided into k subsets.
  - One of the k subsets is used as the test set and the other k-1 subsets are put together to form a training set.
[Figure: the data set divided into six subsets, D1 to D6, labelled 45; in each round a different subset serves as the test set]
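A minimal sketch of the k-fold split described above, using only the standard library. The function name `k_fold_splits` and the fixed seed are illustrative choices; records are represented by their indices.

```python
import random

def k_fold_splits(n_samples: int, k: int, seed: int = 0):
    """Yield (train_indices, test_indices) pairs for k-fold cross validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)        # random division of the data
    folds = [indices[i::k] for i in range(k)]   # k subsets of near-equal size
    for i in range(k):
        test = folds[i]                         # one subset is the test set
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test
```

For the 270 heart disease records and k = 6, each test set produced this way holds 45 records and each training set holds 225.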

6 Cross Validation (2/3)
Confusion Matrix for test data sets
  - Number of patients = p + q + r + s
  - Accuracy = (p + s) / (p + q + r + s)
  True \ Predict   Positive   Negative
  Positive         p          q
  Negative         r          s
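The counts p, q, r, s and the accuracy can be computed as in the sketch below. It assumes labels are coded 1 (positive) and 0 (negative) and that rows of the matrix are true classes while columns are predictions; the function names are illustrative.

```python
def confusion_counts(y_true, y_pred):
    """Return (p, q, r, s) for binary labels coded as 1 and 0."""
    p = q = r = s = 0
    for truth, guess in zip(y_true, y_pred):
        if truth == 1 and guess == 1:
            p += 1      # true positive, predicted positive
        elif truth == 1 and guess == 0:
            q += 1      # true positive, predicted negative
        elif truth == 0 and guess == 1:
            r += 1      # true negative, predicted positive
        else:
            s += 1      # true negative, predicted negative
    return p, q, r, s

def accuracy(p: int, q: int, r: int, s: int) -> float:
    """Fraction of correctly classified patients: the diagonal over the total."""
    return (p + s) / (p + q + r + s)
```

Because accuracy uses only the diagonal entries p and s, it is the same whichever axis is taken to be the true class.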

7 Cross Validation (3/3)
Cross validation and Confusion Matrix
  - At least 10 runs for your k value.
  - Show the confusion matrix for the best result of your experiments.
  Run       Accuracy
  1
  2
  ⋮
  10
  Average

8 Initialization
Maximum initial depth of trees D_max is set.
Full method (each branch has depth = D_max):
  - nodes at depth d < D_max randomly chosen from function set F
  - nodes at depth d = D_max randomly chosen from terminal set T
Grow method (each branch has depth ≤ D_max):
  - nodes at depth d < D_max randomly chosen from F ∪ T
  - nodes at depth d = D_max randomly chosen from T
Common GP initialisation: ramped half-and-half, where the grow and full methods each deliver half of the initial population
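The two methods and ramped half-and-half can be sketched as below. Trees are nested tuples of the form `(function_name, child, ...)`. The small function set, the terminal names, the constant range [-1, 1], and cycling the depth limit from `min_depth` to `max_depth` (the "ramp") are assumptions made for illustration.

```python
import random

FUNCTIONS = {"+": 2, "-": 2, "*": 2, "sin": 1}   # name -> arity (assumed set)
TERMINALS = ["x1", "x2", "x3", "R"]              # "R" stands for a random constant

def random_terminal(rng):
    """Pick a terminal; "R" is replaced by a constant drawn from [-1, 1]."""
    name = rng.choice(TERMINALS)
    return rng.uniform(-1.0, 1.0) if name == "R" else name

def build_tree(rng, max_depth, method, depth=0):
    """Build one tree with the 'full' or 'grow' method."""
    if depth == max_depth:
        return random_terminal(rng)              # d = D_max: terminals only
    if method == "grow":
        # d < D_max: choose uniformly from F ∪ T
        if rng.randrange(len(FUNCTIONS) + len(TERMINALS)) >= len(FUNCTIONS):
            return random_terminal(rng)
    name = rng.choice(list(FUNCTIONS))           # full method: functions only
    children = [build_tree(rng, max_depth, method, depth + 1)
                for _ in range(FUNCTIONS[name])]
    return (name, *children)

def tree_depth(tree):
    """Depth of a tree; a lone terminal has depth 0."""
    if not isinstance(tree, tuple):
        return 0
    return 1 + max(tree_depth(child) for child in tree[1:])

def ramped_half_and_half(rng, pop_size, max_depth, min_depth=1):
    """Alternate full and grow while cycling through the depth limits."""
    depths = list(range(min_depth, max_depth + 1))
    population = []
    for i in range(pop_size):
        depth = depths[(i // 2) % len(depths)]
        method = "full" if i % 2 == 0 else "grow"
        population.append(build_tree(rng, depth, method))
    return population
```

For an even population size, exactly half of the trees come from each method, and every depth limit is used by both methods.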

9 Fitness Function
Maximization problem
  - Number of the correctly classified patients
Minimization problem
  - Number of the incorrectly classified patients
  - Mean Squared Error
      - N: number of training data
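The two minimization fitness measures can be sketched as follows. The 0/1 label coding, the default threshold of 0.5, and the usual definition of mean squared error (the average of squared differences between target and output over the N training cases) are assumptions; the function names are illustrative.

```python
def misclassified(outputs, labels, threshold=0.5):
    """Count patients whose thresholded output differs from the 0/1 label."""
    return sum(1 for out, label in zip(outputs, labels)
               if (1 if out > threshold else 0) != label)

def mean_squared_error(outputs, labels):
    """Average squared difference between label and raw output over N cases."""
    n = len(labels)
    return sum((label - out) ** 2 for out, label in zip(outputs, labels)) / n
```

The misclassification count changes only when an output crosses the threshold, while the mean squared error also rewards outputs that move closer to the target, which gives the search a smoother signal.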

10 Selection (1/2)
Fitness proportional (roulette wheel) selection
  - The roulette wheel can be constructed as follows.
      - Calculate the total fitness for the population.
      - Calculate selection probability p_k for each chromosome v_k.
      - Calculate cumulative probability q_k for each chromosome v_k.

11 Procedure: Proportional_Selection
  - Generate a random number r from the range [0, 1].
  - If r ≤ q_1, then select the first chromosome v_1; else, select the kth chromosome v_k (2 ≤ k ≤ pop_size) such that q_{k-1} < r ≤ q_k.
  k     p_k        q_k
  1     0.082407   0.082407
  2     0.110652   0.193059
  3     0.131931   0.324989
  4     0.121423   0.446412
  5     0.072597   0.519009
  6     0.128834   0.647843
  7     0.077959   0.725802
  8     0.102013   0.827802
  9     0.083663   0.911479
  10    0.088521   1.000000
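A minimal sketch of the roulette wheel: the total fitness gives p_k = fitness_k / total, the running sum of p_k gives q_k, and a binary search finds the first k with r ≤ q_k. The function names are illustrative, and the list `P` simply reuses the p_k column of the table above as example fitness values. Indices in the code are 0-based, so index 2 is chromosome v_3.

```python
import bisect
import random
from itertools import accumulate

def cumulative_probabilities(fitnesses):
    """Return the cumulative selection probabilities q_k."""
    total = sum(fitnesses)                        # total fitness
    return list(accumulate(f / total for f in fitnesses))

def select_index(q, r):
    """Return the smallest 0-based k with r <= q[k]."""
    # The clamp guards against q[-1] falling just below 1.0 from rounding.
    return min(bisect.bisect_left(q, r), len(q) - 1)

def roulette_select(rng, fitnesses):
    """Draw r uniformly from [0, 1) and return the selected index."""
    return select_index(cumulative_probabilities(fitnesses), rng.random())

# p_k values copied from the table, used here as example fitnesses.
P = [0.082407, 0.110652, 0.131931, 0.121423, 0.072597,
     0.128834, 0.077959, 0.102013, 0.083663, 0.088521]
```

This scheme assumes non-negative fitness values in a maximization setting; for an error-based fitness, the values would first have to be transformed so that better individuals receive larger shares of the wheel.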

12 Selection (2/2)
Tournament selection
  - Tournament size q
Ranking-based selection
  - 2 ≤ … ≤ POP_SIZE
  - 1 ≤ η+ ≤ 2 and η− = 2 − η+
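Both schemes can be sketched as below. The slide gives only the constraints on η+ and η−; the probability formula used here is the common linear ranking form, with the best individual at rank 0, and is an assumption about what the slide intended. Sampling tournament contenders without replacement is likewise an illustrative choice.

```python
import random

def tournament_select(rng, fitnesses, q):
    """Pick q distinct individuals at random and return the index of the fittest."""
    contenders = rng.sample(range(len(fitnesses)), q)
    return max(contenders, key=lambda i: fitnesses[i])

def linear_ranking_probabilities(pop_size, eta_plus):
    """Selection probability for each rank (rank 0 = best), linear ranking."""
    if not 1.0 <= eta_plus <= 2.0:
        raise ValueError("eta_plus must lie in [1, 2]")
    eta_minus = 2.0 - eta_plus
    return [(eta_plus - (eta_plus - eta_minus) * rank / (pop_size - 1)) / pop_size
            for rank in range(pop_size)]
```

With η+ = 1 every rank is equally likely, and with η+ = 2 the worst individual is never selected, so η+ acts as the selection pressure setting. In the tournament, that role is played by the size q.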

13 GP Flowchart
[Figure: flowcharts of the GA loop and the GP loop]

14 Bloat
Bloat = “survival of the fattest”, i.e., the tree sizes in the population increase over time
Ongoing research and debate about the reasons
Needs countermeasures, e.g.
  - Prohibiting variation operators that would deliver “too big” children
  - Parsimony pressure: penalty for being oversized
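Both countermeasures can be sketched in a few lines, assuming trees are nested tuples `(function_name, child, ...)`, fitness is an error to be minimized, and the penalty is linear in the node count. The coefficient `alpha`, the size limit, and the function names are illustrative, not values from the slides.

```python
def tree_size(tree):
    """Number of nodes in a tree; a lone terminal counts as one node."""
    if not isinstance(tree, tuple):
        return 1
    return 1 + sum(tree_size(child) for child in tree[1:])

def penalized_error(error, tree, alpha=0.01):
    """Parsimony pressure: add a penalty proportional to tree size."""
    return error + alpha * tree_size(tree)

def accept_child(child, parent, max_size):
    """Reject a 'too big' child by keeping the parent instead."""
    return child if tree_size(child) <= max_size else parent
```

The value of `alpha` trades accuracy against size: too large and the population collapses to trivial trees, too small and the penalty has no visible effect.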

16 Experiments
Two problems
  - Heart Disease
  - Pima Indian diabetes
Various experimental setups
  - Termination condition: maximum_generation
  - Various settings
      - Effects of the penalty term
      - Different function and terminal sets
      - Selection methods and their parameters
      - Crossover and mutation probabilities

17 Results
For each problem
  - Result table and your analysis
  - Present the optimal classifier
  - Draw a learning curve for the run where the best solution was found.
  - Compare with the results of neural networks (optional).
  - Different k for cross validation (optional)
              Training                        Test
              Average ± SD   Best   Worst     Average ± SD   Best   Worst
  Setting 1
  Setting 2
  Setting 3
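One row of the result table can be filled from the per-run accuracies with a small helper like the one below. The function name `summarize` is illustrative, and using the sample standard deviation (rather than the population one) is an assumption, since the slides do not say which is expected.

```python
from statistics import mean, stdev

def summarize(accuracies):
    """Return average, SD, best, and worst accuracy over a list of runs."""
    return {
        "average": mean(accuracies),
        "sd": stdev(accuracies),     # sample standard deviation (n - 1)
        "best": max(accuracies),
        "worst": min(accuracies),
    }
```

Calling it once on the training accuracies and once on the test accuracies of a setting gives the six entries of that row.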

18 [Figure: learning curve, fitness (error) versus generation]

19 References
Source Codes
  - GP libraries (C, C++, JAVA, …)
  - MATLAB toolbox
Web sites
  - http://www.cs.bham.ac.uk/~cmf/GPLib/GPLib.html
  - http://cs.gmu.edu/~eclab/projects/ecj/
  - http://www.geneticprogramming.com/GPpages/software.html
  - …

20 Pay Attention!
Due: Nov. 16, 2006
Submission
  - Source code and executable file(s)
      - Proper comments in the source code
      - Via e-mail
  - Report: Hardcopy!!
      - Running environments and libraries (or packages) which you used.
      - Results for many experiments with various parameter settings
      - Analysis and explanation about the results in your own way

