A hybrid method for gene selection in microarray datasets Yungho Leu, Chien-Pan Lee and Ai-Chen Chang National Taiwan University of Science and Technology.

A hybrid method for gene selection in microarray datasets Yungho Leu, Chien-Pan Lee and Ai-Chen Chang National Taiwan University of Science and Technology 2014/10/22

Outline
• Microarray datasets & research objective
• Related work & background
• Research method
• Experimental result
• Conclusion

Microarray datasets  Microarray technology can be used to measure the expression levels of thousands of genes at the same time.  A microarray dataset records the gene expressions of different samples in a table. 3 Mobile Computing & Data Mining Lab.

Microarray datasets  N ： Number of samples (40~200)  M ： Num. of genes (2,000~30,000)  g i,j ： expression level of gene j at sampel i  Class label ： the class label of the sample Mobile Computing & Data Mining Lab. 4 (M >> N) M genesClass label N Samples Gene 1 Gene 2 Class S1S1 0.022-0.7210 S 2 -1.0340.3310 ………… S j-1 -0.2120.1231 SjSj 0.5420.4311  The Prostate cancer dataset ： (Simplified) 0 ： Absent 1 ： Present

Research objective  M>>N pose challenges in diagnosis (or Classification) Mobile Computing & Data Mining Lab. 5 To select a minimal subset of genes with high classification accuracy rate. A gene selection problem

Related work  Ding, C., & Peng, H. used the Pearson correlation coefficient to eliminate redundant genes from microarray datasets.  Minimum redundancy feature selection from microarray gene expression data.(2003 & 2005)  Yang, et al. proposed to use information gain and genetic algorithms for gene selection.  IG-GA: A Hybrid Filter/Wrapper Method for Feature Selection of Microarray Data.(2010) 7 Mobile Computing & Data Mining Lab.

Related work  Luo, et al. clustered genes into groups and treated genes in the same group as redundant genes.  Improving the Computational Efficiency of Recursive Cluster Elimination. (2011) 8 Mobile Computing & Data Mining Lab.

Background knowledge  Information Gain: Proposed by Quinlan as a basis of attribute selection in Decision Tree.  Attributes with larger information gains are better for classification (or differentiating between different class labels of data samples). Mobile Computing & Data Mining Lab. 9

Ecological correlation (Robinson)
• Divide the dataset into groups and use the group means to calculate the Pearson correlation coefficient.
• Working with group means removes the in-group variance, which increases the magnitude of the correlation coefficient between attributes.

Example  Leukemia1 dataset grouped by class labels (0,1,2)  Cor(gene1 {μ 0, μ 1, μ 2 },gene2{μ 0, μ 1, μ 2 }) = -0.9886 Mobile Computing & Data Mining Lab. 11 gene1gene2class -0.9058-0.92980 0.8371-1.30220 1.0694-0.78261 -1.5851-0.86801 -0.1908-0.65072 -1.05780.82682 μ0μ0 μ1μ1 μ2μ2 gene1-0.0344-0.2578-0.6243 gene2-1.1160-0.82530.0881 mean

Support Vector Machine  A classification method by Cortes & Vapnik(1995)  To find a good hyper-plane to separate samples with different class labels. Mobile Computing & Data Mining Lab. 12  ∣ a 1 -a ∣ > |b 1 -b ∣  Hyper-plane a is better than hyper- plane b. margin Support Vectors b1b1 b b2b2 a2a2 a1a1 a

Research method

Data preprocessing: Normalization
• Normalize the dataset using the Z-score.
• Z-score of gene expression $X_{ij}$:

$$Z_{ij} = \frac{X_{ij} - \bar{X}_j}{S_j}$$

where
  - $X_{ij}$: the expression level of gene j in sample i.
  - $\bar{X}_j$: mean of gene j's expression over the samples.
  - $S_j$: standard deviation of gene j's expression over the samples.
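A minimal sketch of the per-gene Z-score normalization, assuming the dataset is stored as a list of sample rows with one column per gene. The function name `zscore_genes` is hypothetical, and the use of the sample standard deviation (n - 1 denominator) is an assumption, since the slide does not say which variant was used. The input values are the four simplified Prostate rows shown earlier in the deck.

```python
import statistics

def zscore_genes(samples):
    """Normalize every gene (column) to zero mean and unit standard deviation.

    `samples` is a list of rows; row i holds the expression levels of all
    genes in sample i. A gene with constant expression would have a zero
    standard deviation and is not handled in this sketch.
    """
    genes = list(zip(*samples))                     # one tuple per gene
    means = [statistics.mean(g) for g in genes]
    stdevs = [statistics.stdev(g) for g in genes]   # assumption: sample std (n - 1)
    return [
        [(x - m) / s for x, m, s in zip(row, means, stdevs)]
        for row in samples
    ]

# Rows = samples, columns = (Gene 1, Gene 2) from the simplified table.
expression = [
    [0.022, -0.721],
    [-1.034, 0.331],
    [-0.212, 0.123],
    [0.542, 0.431],
]
normalized = zscore_genes(expression)
for row in normalized:
    print([round(z, 3) for z in row])
```

After normalization every gene is on the same scale, so genes with large raw expression ranges do not dominate the later correlation and SVM steps.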

Gene filtering by information gain

Gene filtering  Most of the genes have their IG values equal to 0.  Select the gene with IG greater than 0 for candidate genes.  For example, the Leukemia1 dataset has 5,327 genes; only 263 genes left after gene filtering with IG. Mobile Computing & Data Mining Lab. 17

Grouping of genes

Grouping of genes  Gene list and threshold of cor.  Build the list of candidate genes  Set threshold = 0.8 （ strong positively correlated ）  Grouping method ：  With the first gene on the list as the basis, group the rest genes with the basis gene if their correlation coefficients is greater than 0.8. Mobile Computing & Data Mining Lab. 19 Gene IDCor. Gene 1,2 0.83 Gene 1,3 0.53 Gene 1,4 0.32 Gene 1,5 0.13... gene ID Gene 1 Gene 2 Gene 3 Gene 4 Gene 5... Build a gene list Calculate correlation coefficients

Eliminate genes from the existing group
• Eliminate the genes in the group from the list; repeat the same procedure on the remaining genes until no gene is left on the list.

| Gene pair | Cor. |
| --- | --- |
| Gene 1, 2 | 0.83 |
| Gene 1, 3 | 0.53 |
| Gene 1, 4 | 0.32 |
| Gene 1, 5 | 0.13 |

Cluster 1: Gene 1, Gene 2
Remaining gene list: Gene 3, Gene 4, Gene 5

Select one gene from each group
• Select the gene with the highest IG from each group.
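The grouping and selection steps can be sketched as a greedy loop. This is an illustrative sketch rather than the authors' implementation: the names `group_genes` and `pick_representatives` are hypothetical, the correlations for Gene 1 come from the slide, and every other correlation as well as all IG values below are made-up placeholders.

```python
def group_genes(gene_ids, corr, threshold=0.8):
    """Greedy grouping: the first gene on the list is the basis of a group,
    and every remaining gene whose correlation with the basis exceeds the
    threshold joins it. Grouped genes are removed and the loop repeats."""
    remaining = list(gene_ids)
    groups = []
    while remaining:
        basis = remaining[0]
        group = [basis] + [g for g in remaining[1:] if corr(basis, g) > threshold]
        groups.append(group)
        members = set(group)
        remaining = [g for g in remaining if g not in members]
    return groups

def pick_representatives(groups, info_gain):
    """Keep the gene with the highest information gain from each group."""
    return [max(group, key=lambda g: info_gain[g]) for group in groups]

# Correlations with Gene 1 are taken from the slide; all other pairs are
# not given there and default to 0.0 here (an assumption for illustration).
known_corr = {(1, 2): 0.83, (1, 3): 0.53, (1, 4): 0.32, (1, 5): 0.13}

def corr(a, b):
    return known_corr.get((a, b), known_corr.get((b, a), 0.0))

# Hypothetical IG values, used only to show the selection step.
info_gain = {1: 0.20, 2: 0.50, 3: 0.40, 4: 0.10, 5: 0.30}

groups = group_genes([1, 2, 3, 4, 5], corr)
selected = pick_representatives(groups, info_gain)
print(groups)    # first group is Gene 1 with Gene 2, as on the slide
print(selected)
```

Note that the result depends on the order of the gene list, because each basis gene is simply the first gene still on the list.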

Eliminate genes with no classification capability
• ANOVA:
  - For datasets with more than two class labels, use ANOVA to test whether the class means are all equal.
  - Hypothesis: $H_0: \mu_1 = \mu_2 = \cdots = \mu_k$ versus $H_1$: not all class means are equal, where k is the number of class labels.
  - Genes whose means do not differ across class labels are eliminated.

Eliminate genes
• T-test
  - The t-test is used to test whether the class means of a gene are different.
  - Genes whose class means do not differ are eliminated.
  - The significance level α is set to 0.05.
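To make the test statistic concrete, here is a minimal pure-Python sketch of the one-way ANOVA F statistic for a single gene. The function name `anova_f` and the toy expression values are assumptions for illustration. Turning F into a p-value to compare against α = 0.05 needs the F distribution, which the standard library does not provide, so that step is left out here. For two classes, this F equals the square of the pooled-variance t statistic, so the same computation covers the two-class case.

```python
def anova_f(groups):
    """One-way ANOVA F statistic for one gene.

    `groups` holds one list of expression values per class label.
    Returns (F, df_between, df_within).
    """
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n
    means = [sum(g) / len(g) for g in groups]

    # Variation of the class means around the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Variation of the samples around their own class mean.
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)

    df_between, df_within = k - 1, n - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within

# Toy expression values for one gene in three classes (hypothetical data).
f_stat, df_b, df_w = anova_f([[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [5.0, 6.0, 7.0]])
print(f_stat, df_b, df_w)
```

A large F means the class means differ by much more than the within-class scatter, which is the situation in which a gene is kept.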

Subset refinement using GA

Subset refinement  Encoding ：  Binary Encoding: ” 0 ”--- gene not selected; ” 1 ”---gene is selected. example ： 011001---select the 2 nd, 3 rd, 6 th genes from the subset.  Chromosome length: the candidate gene subset from step II.  Population size=5  Number of Iteration =1,000 Mobile Computing & Data Mining Lab. 25

Subset refinement  Fitness function ： the accuracy rate of SVM of the chromosome.  Selection method: Roulette Wheel Selection probability is in proportional to the fitness value of the chromosome  Single point crossover and mutation ： Crossover Rate =0.7 Mutation Rate = 0.3 Mobile Computing & Data Mining Lab. 26

Termination condition  Termination condition ： (any of the following) Accuracy rate = 100% # of iteration = 1,000 # of iteration is greater than 100 and the accuracy rates of the last 20 iterations are all the same.  Final solution ： the chromosome with the largest fitness value in the last iteration. Mobile Computing & Data Mining Lab. 27

The datasets

| Dataset name | # of samples | # of class labels | # of genes |
| --- | --- | --- | --- |
| 9_Tumors | 60 | 9 | 5,726 |
| Brain_Tumor1 | 90 | 5 | 5,920 |
| Brain_Tumor2 | 50 | 4 | 10,367 |
| Leukemia1 | 72 | 3 | 5,327 |
| Leukemia2 | 72 | 3 | 11,225 |
| Lung Cancer | 203 | 5 | 12,600 |
| SRBCT | 83 | 4 | 2,308 |
| 11_Tumors | 174 | 11 | 12,533 |
| Prostate Tumor | 102 | 2 | 10,509 |
| DLBCL | 77 | 2 | 5,469 |

Source: GEMS, http://www.gems-system.org/

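Genes selected in 3 steps

Number of genes remaining after each step:

| Dataset | # of original genes | IG | Grouping | GA |
| --- | --- | --- | --- | --- |
| 9_Tumors | 5,726 | 103 | 25 | 13 |
| Brain_Tumor1 | 5,920 | 185 | 19 | 10 |
| Brain_Tumor2 | 10,367 | 3,099 | 19 | 4 |
| Leukemia1 | 5,327 | 263 | 7 | 4 |
| Leukemia2 | 11,225 | 3,097 | 6 | 3 |
| Lung_Cancer | 12,600 | 3,183 | 36 | 18 |
| SRBCT | 2,308 | 351 | 14 | 7 |
| 11_Tumors | 12,533 | 3,483 | 510 | 255 |
| Prostate_Tumor | 10,509 | 671 | 235 | 119 |
| DLBCL | 5,469 | 315 | 169 | 84 |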

Comparison with other methods
• Comparison of our method (Hybrid) with GEPUBLIC, PAM, and IG-GA. Each cell gives the classification accuracy rate (%) with the number of selected genes in parentheses.

| Dataset | GEPUBLIC | PAM | IG-GA | Hybrid |
| --- | --- | --- | --- | --- |
| 9_Tumors | 66.67 (19) | 43.33 (47) | 85.00 (52) | 71.67 (13) |
| Brain_Tumor1 | 84.44 (30) | 85.56 (42) | 93.33 (244) | 91.12 (10) |
| Brain_Tumor2 | 80.00 (15) | 66.00 (25) | 88.00 (489) | 92.00 (4) |
| Leukemia1 | 97.22 (11) | 93.06 (11) | 100.00 (82) | 97.23 (4) |
| Leukemia2 | 91.67 (31) | 91.67 (52) | 98.61 (782) | 100.00 (3) |
| Lung_Cancer | 94.58 (29) | 93.60 (75) | 95.57 (2101) | 97.05 (18) |
| SRBCT | 98.80 (26) | 98.80 (41) | 100.00 (56) | 100.00 (7) |
| 11_Tumors | 86.21 (87) | 81.61 (203) | 92.53 (479) | 91.95 (255) |
| Prostate_Tumor | 95.10 (4) | 93.14 (13) | 96.08 (343) | 94.12 (119) |
| DLBCL | 97.40 (13) | 80.52 (70) | 100.00 (107) | 97.40 (84) |

Conclusion
• Each step of our method effectively removes noisy genes left over from the previous step.
• The hybrid method selects fewer genes with a higher classification accuracy rate.
• The hybrid method needs further improvement on 2-class microarray datasets.

Q & A
Thank you for listening.

Information Gain  For a dataset D with m different class lables, Info(D) measure how well the classes of D are evenly distributed ：  Info A ： The equivalent Info (weighted sum) of subsets of D, where D is split into subsets using attribute A ：  Gain(A) ： 35, P i ： prob. of a sample in D belongs to class i. A ： {a 1,a 2,…,a v } ， attr. A has v different values D ： is split into {D 1,D 2,…,D v } D i ： contains samples with A equal to a j

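Attribute Selection: Information Gain (from Data Mining: Concepts and Techniques)
• Class P: buys_computer = "yes": 9 samples
• Class N: buys_computer = "no": 5 samples

| age | income | student | credit | buy |
| --- | --- | --- | --- | --- |
| <=30 | high | no | fair | no |
| <=30 | high | no | excellent | no |
| 31…40 | high | no | fair | yes |
| >40 | medium | no | fair | yes |
| >40 | low | yes | fair | yes |
| >40 | low | yes | excellent | no |
| 31…40 | low | yes | excellent | yes |
| <=30 | medium | no | fair | no |
| <=30 | low | yes | fair | yes |
| >40 | medium | yes | fair | yes |
| <=30 | medium | yes | excellent | yes |
| 31…40 | medium | no | excellent | yes |
| 31…40 | high | yes | fair | yes |
| >40 | medium | no | excellent | no |

Class counts by age:

| age | P | N |
| --- | --- | --- |
| <=30 | 2 | 3 |
| 31…40 | 4 | 0 |
| >40 | 3 | 2 |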
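As a worked example, the information gain of the age attribute can be computed directly from the 14 rows above. This is a minimal sketch with hypothetical function names (`entropy`, `info_gain`); it works on categorical values only, so applying it to continuous gene expression levels would first require a discretization step that the slides do not describe.

```python
import math
from collections import Counter

def entropy(labels):
    """Info(D): entropy of the class label distribution, in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(values, labels):
    """Gain(A) = Info(D) - Info_A(D) for one categorical attribute A."""
    n = len(labels)
    subsets = {}
    for value, label in zip(values, labels):
        subsets.setdefault(value, []).append(label)
    # Weighted sum of the entropies of the subsets produced by the split.
    info_a = sum(len(s) / n * entropy(s) for s in subsets.values())
    return entropy(labels) - info_a

# The age and buy columns of the 14-row table, in row order.
age = ["<=30", "<=30", "31...40", ">40", ">40", ">40", "31...40",
       "<=30", "<=30", ">40", "<=30", "31...40", "31...40", ">40"]
buys = ["no", "no", "yes", "yes", "yes", "no", "yes",
        "no", "yes", "yes", "yes", "yes", "yes", "no"]

info_d = entropy(buys)
gain_age = info_gain(age, buys)
print(f"Info(D) = {info_d:.3f}, Gain(age) = {gain_age:.3f}")
```

With 9 "yes" and 5 "no" samples, Info(D) is about 0.940 bits; splitting on age leaves about 0.694 bits, so Gain(age) is about 0.247.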
