1 Comparison of Discrimination Methods for the Classification of Tumors Using Gene Expression Data
Presented by: Tun-Hsiang Yang

2 Purpose of this paper
- Compare the performance of different discrimination methods:
  - Nearest neighbor classifiers
  - Linear discriminant analysis
  - Classification trees
  - Machine learning approaches: bagging, boosting
- Investigate the use of prediction votes to assess the confidence of each prediction

3 Statistical problems
- The identification of new/unknown tumor classes using gene expression profiles → cluster analysis / unsupervised learning
- The classification of malignancies into known classes → discriminant analysis / supervised learning
- The identification of marker genes that characterize the different tumor classes → variable (gene) selection

4 Datasets
- Gene expression data on p genes for n mRNA samples: an n x p matrix X = (x_ij), where x_ij denotes the expression level of gene (variable) j in the ith mRNA sample (observation)
- Response: a vector Y = (y_i) of length n, where y_i denotes the class (one of k classes) of observation i
- Lymphoma dataset (p=4682, n=81, k=3)
- Leukemia dataset (p=3571, n=72, k=3 or 2)
- NCI 60 dataset (p=5244, n=61, k=8)

5 Data preprocessing
- Imputation of missing data (k-nearest neighbors)
- Standardization of the data (Euclidean distance used to compare samples)
- Preliminary gene selection:
  - Lymphoma dataset (p=4682 → p=50, n=81, k=3)
  - Leukemia dataset (p=3571 → p=40, n=72, k=3)
  - NCI 60 dataset (p=5244 → p=30, n=61, k=8)
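A minimal preprocessing sketch in Python (numpy / scikit-learn), assuming the expression data are held in an n x p numpy array X with integer class labels y; the imputer settings and the BSS/WSS cutoff p_keep are illustrative choices, not necessarily those of the paper.

```python
import numpy as np
from sklearn.impute import KNNImputer

def select_genes(X, y, p_keep=50):
    # keep the p_keep genes with the largest ratio of between-class to
    # within-class sums of squares (BSS/WSS)
    overall = X.mean(axis=0)
    bss = np.zeros(X.shape[1])
    wss = np.zeros(X.shape[1])
    for k in np.unique(y):
        Xk = X[y == k]
        bss += len(Xk) * (Xk.mean(axis=0) - overall) ** 2
        wss += ((Xk - Xk.mean(axis=0)) ** 2).sum(axis=0)
    return np.argsort(bss / wss)[::-1][:p_keep]

def preprocess(X, y, p_keep=50):
    # 1. Impute missing values with a k-nearest-neighbour imputer
    X = KNNImputer(n_neighbors=5).fit_transform(X)
    # 2. Standardize each observation (row) across genes, so samples are
    #    compared with Euclidean distance on a common scale
    X = (X - X.mean(axis=1, keepdims=True)) / X.std(axis=1, keepdims=True)
    # 3. Preliminary gene selection by BSS/WSS
    keep = select_genes(X, y, p_keep)
    return X[:, keep], keep
```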

6 Visual presentation of the Leukemia dataset
Correlation matrix (72 x 72) of the samples, ordered by class. Black: zero correlation / Red: positive correlation / Green: negative correlation. Shown for all p=3571 genes and for the p=40 selected genes.

7 Prediction Methods
- Supervised learning methods
- Machine learning approaches

8 Supervised Learning Methods
- Nearest neighbor classifier (NN)
- Fisher linear discriminant analysis (FLDA)
- Weighted gene voting
- Classification trees (CART)

9 Nearest Neighbor
The k-NN rule:
- Find the k closest observations in the learning set
- Predict the class of each element of the test set by majority vote among these k neighbors
- k is chosen by minimizing the cross-validation error rate
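A sketch of the k-NN rule described above, using scikit-learn; the arrays X_learn, y_learn, X_test are assumed to come from the learning/test split described later, and the grid of candidate k values and the 5-fold cross-validation are illustrative (the paper's exact cross-validation scheme may differ).

```python
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

knn = GridSearchCV(KNeighborsClassifier(metric="euclidean"),
                   param_grid={"n_neighbors": [1, 3, 5, 7, 9, 11]},
                   scoring="accuracy", cv=5)
knn.fit(X_learn, y_learn)        # picks the k minimizing the CV error rate
y_pred = knn.predict(X_test)     # majority vote among the k closest learning-set samples
```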

10 Linear Discriminant Analysis
- FLDA consists of finding linear functions a'x of the gene expression levels x = (x_1, ..., x_p) with a large ratio of between-group to within-group sums of squares
- The class of an observation is predicted as the class whose mean vector is closest in terms of the discriminant variables
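A minimal FLDA sketch using scikit-learn's LinearDiscriminantAnalysis, which maximizes the between- to within-group sum-of-squares criterion and classifies by the nearest class mean in the discriminant space; X_learn, y_learn, X_test are assumed from the data split described later.

```python
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

flda = LinearDiscriminantAnalysis()   # finds the discriminant directions a'x
flda.fit(X_learn, y_learn)
y_pred = flda.predict(X_test)         # nearest class mean in discriminant coordinates
```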

11 Maximum likelihood discriminant rules
Predicts the class of an observation x as C(x) = argmax_k pr(x | y = k)
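The conclusions below also refer to DLDA (diagonal linear discriminant analysis), which is this maximum likelihood rule with Gaussian class densities sharing a common diagonal covariance matrix. A minimal sketch, assuming numpy arrays X_learn, y_learn, X_test and equal class priors:

```python
import numpy as np

def dlda_predict(X_learn, y_learn, X_test):
    classes = np.unique(y_learn)
    means = np.array([X_learn[y_learn == k].mean(axis=0) for k in classes])
    # pooled within-class variances -> common diagonal covariance estimate
    var = np.zeros(X_learn.shape[1])
    for k in classes:
        Xk = X_learn[y_learn == k]
        var += ((Xk - Xk.mean(axis=0)) ** 2).sum(axis=0)
    var /= (len(y_learn) - len(classes))
    # discriminant score: sum_j (x_j - mean_kj)^2 / var_j, minimized over classes k
    scores = ((X_test[:, None, :] - means[None, :, :]) ** 2 / var).sum(axis=2)
    return classes[np.argmin(scores, axis=1)]
```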

12 Weighted Gene Voting
- Each selected gene j casts a vote v_j = a_j (x_j - b_j), where a_j = (mean_1j - mean_2j) / (s_1j + s_2j) and b_j = (mean_1j + mean_2j) / 2
- An observation x = (x_1, ..., x_p) is classified as class 1 iff the total vote in favor of class 1 (the sum of the positive v_j) exceeds the total vote in favor of class 2 (the sum of the absolute negative v_j)
- Prediction strength is defined as the margin of victory (p. 9)
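A hedged sketch of the weighted gene voting rule described above (the Golub et al. two-class scheme), assuming the classes are coded 1 and 2 in y_learn; the weights a_j and cutoffs b_j follow the slide's description.

```python
import numpy as np

def weighted_voting(X_learn, y_learn, X_test):
    # class-wise means and standard deviations of each gene (classes coded 1 and 2)
    m1, m2 = X_learn[y_learn == 1].mean(axis=0), X_learn[y_learn == 2].mean(axis=0)
    s1, s2 = X_learn[y_learn == 1].std(axis=0), X_learn[y_learn == 2].std(axis=0)
    a, b = (m1 - m2) / (s1 + s2), (m1 + m2) / 2.0
    votes = a * (X_test - b)                             # one vote per gene per test sample
    v1 = np.where(votes > 0, votes, 0.0).sum(axis=1)     # total vote for class 1
    v2 = np.where(votes < 0, -votes, 0.0).sum(axis=1)    # total vote for class 2
    pred = np.where(v1 > v2, 1, 2)
    strength = np.abs(v1 - v2) / (v1 + v2)               # prediction strength: margin of victory
    return pred, strength
```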

13 Classification tree
- Constructed by repeated splits of subsets (nodes); each terminal subset is assigned a class label
- The size of the tree is determined by minimizing the cross-validation error rate
- Three aspects of tree construction:
  - the selection of the splits
  - the stopping criteria (when to declare a node terminal)
  - the assignment of each terminal node to a class
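A CART-style sketch using scikit-learn's DecisionTreeClassifier; here the size of the tree is controlled by cost-complexity pruning, with ccp_alpha chosen by cross-validation as a stand-in for the CV-based tree-size selection described above. The alpha grid is an illustrative choice.

```python
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

tree = GridSearchCV(DecisionTreeClassifier(criterion="gini"),
                    param_grid={"ccp_alpha": [0.0, 0.005, 0.01, 0.02, 0.05]},
                    scoring="accuracy", cv=5)
tree.fit(X_learn, y_learn)       # prunes the tree to the size minimizing CV error
y_pred = tree.predict(X_test)
```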

14 Aggregated Predictors
There are several ways to generate perturbed learning sets:
- Bagging
- Boosting
- Convex pseudo-data (CPD)

15 Bagging
- Predictors are built for each sub-sample and aggregated by majority voting with equal weights w_b = 1
- Non-parametric bootstrap: draw at random with replacement to form perturbed learning sets of the same size as the original learning set
- By-product: the out-of-bag observations can be used to estimate the misclassification rate of the bagged predictor
- A prediction for each observation (x_i, y_i) is obtained by aggregating only the classifiers for which (x_i, y_i) is out-of-bag
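A sketch of bagging with the non-parametric bootstrap and the out-of-bag error estimate described above, using classification trees as base classifiers; B = 50 is an illustrative choice and non-negative integer class labels are assumed.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagging_oob_error(X_learn, y_learn, B=50, seed=0):
    rng = np.random.default_rng(seed)
    n = len(y_learn)
    trees, in_bag = [], []
    for _ in range(B):
        idx = rng.integers(0, n, size=n)   # non-parametric bootstrap: n cases with replacement
        trees.append(DecisionTreeClassifier().fit(X_learn[idx], y_learn[idx]))
        in_bag.append(set(idx.tolist()))
    counted = wrong = 0
    for i in range(n):
        # aggregate, by majority vote, only the trees for which observation i is out-of-bag
        votes = [t.predict(X_learn[i:i + 1])[0] for t, bag in zip(trees, in_bag) if i not in bag]
        if votes:
            counted += 1
            wrong += int(np.bincount(votes).argmax() != y_learn[i])
    return wrong / counted   # out-of-bag estimate of the bagged predictor's error rate
```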

16 Bagging (cont.)
- Parametric bootstrap: perturbed learning sets are generated according to a mixture of multivariate normal (MVN) distributions
- For each class k, the class sample mean and covariance matrix are taken as estimates of the distribution parameters
- At least one observation is sampled from each class
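A sketch of the parametric bootstrap variant: each perturbed learning set is drawn from a mixture of multivariate normals, one component per class, with the class sample means and covariances as parameter estimates. The particular way of guaranteeing at least one draw per class is one plausible reading of the slide, not necessarily the paper's exact procedure.

```python
import numpy as np

def parametric_bootstrap_set(X_learn, y_learn, seed=0):
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y_learn, return_counts=True)
    n = len(y_learn)
    # draw class sizes from the observed class proportions, forcing >= 1 per class
    sizes = rng.multinomial(n - len(classes), counts / n) + 1
    Xb, yb = [], []
    for k, size in zip(classes, sizes):
        Xk = X_learn[y_learn == k]
        mean, cov = Xk.mean(axis=0), np.cov(Xk, rowvar=False)  # class sample mean and covariance
        Xb.append(rng.multivariate_normal(mean, cov, size=size))
        yb.append(np.full(size, k))
    return np.vstack(Xb), np.concatenate(yb)
```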

17 Boosting
The b-th step of the boosting algorithm:
- Draw another learning set L_b of the same size n_L by resampling the learning set L with the current probabilities p_i
- Build a classifier based on L_b
- Run the learning set L through this classifier; let d_i = 1 if the ith case is classified incorrectly and d_i = 0 otherwise
- Define ε_b = Σ_i p_i d_i and β_b = (1 - ε_b) / ε_b
- Update the resampling probabilities: p_i ← p_i β_b^{d_i} / Σ_j p_j β_b^{d_j}
- The resampling probabilities are reset to equal if ε_b ≥ 1/2 or ε_b = 0
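A sketch of the boosting-by-resampling scheme above, with classification trees as base classifiers and final aggregation by a vote weighted by log(β_b); the number of steps B, the tree depth, and the handling of discarded steps are illustrative choices.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def boost(X_learn, y_learn, X_test, B=50, seed=0):
    rng = np.random.default_rng(seed)
    n = len(y_learn)
    p = np.full(n, 1.0 / n)                                   # resampling probabilities
    classifiers, log_betas = [], []
    for _ in range(B):
        idx = rng.choice(n, size=n, replace=True, p=p)        # perturbed learning set L_b
        clf = DecisionTreeClassifier(max_depth=3).fit(X_learn[idx], y_learn[idx])
        d = (clf.predict(X_learn) != y_learn).astype(float)   # d_i = 1 if case i misclassified
        eps = np.sum(p * d)                                   # epsilon_b
        if eps >= 0.5 or eps == 0:
            p = np.full(n, 1.0 / n)                           # reset probabilities to equal
            continue                                          # (this step's classifier is dropped here)
        beta = (1 - eps) / eps                                # beta_b
        p = p * beta ** d
        p /= p.sum()                                          # p_i <- p_i * beta_b^{d_i} / sum_j ...
        classifiers.append(clf)
        log_betas.append(np.log(beta))
    # aggregate by a vote weighted by log(beta_b)
    classes = np.unique(y_learn)
    tally = np.zeros((len(X_test), len(classes)))
    for clf, w in zip(classifiers, log_betas):
        pred = clf.predict(X_test)
        for j, k in enumerate(classes):
            tally[:, j] += w * (pred == k)
    return classes[np.argmax(tally, axis=1)]
```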

18 Prediction votes
- For aggregated classifiers, prediction votes assessing the strength of a prediction may be defined for each observation
- The prediction vote (PV) for an observation x is the proportion of the (weighted) votes that goes to the winning class; a PV close to 1 indicates that the individual classifiers agree on the prediction
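A sketch of the prediction vote for an equally weighted aggregate (e.g., bagging): the fraction of base-classifier votes received by the winning class. The (B, n_test) array of base predictions `preds` is an assumed input with non-negative integer labels.

```python
import numpy as np

def prediction_votes(preds):
    # preds: (B, n_test) array of class labels from the B aggregated base classifiers
    B = preds.shape[0]
    pv = [np.bincount(col).max() / B for col in preds.T]   # share of votes won by the winning class
    return np.array(pv)
```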

19 Study Design
- Randomly divide the dataset into a learning set and a test set (2:1 scheme)
- For each of N = 150 runs:
  - Select the subset of p genes with the largest BSS/WSS, computed from the learning set
  - Build the different predictors using the learning set restricted to these p genes
  - Apply the predictors to the observations in the test set to obtain test set error rates
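A sketch of the overall study design, reusing the select_genes helper from the preprocessing sketch and assuming a dictionary `classifiers` of the predictors above (both names are assumptions); the stratified 2:1 split and per-run seeding are illustrative choices.

```python
import numpy as np
from sklearn.model_selection import train_test_split

test_errors = {name: [] for name in classifiers}           # per-classifier test error rates
for run in range(150):                                      # N = 150 runs
    X_learn, X_test, y_learn, y_test = train_test_split(
        X, y, test_size=1 / 3, stratify=y, random_state=run)   # 2:1 learning/test split
    genes = select_genes(X_learn, y_learn, p_keep=40)       # BSS/WSS screen on the learning set only
    for name, clf in classifiers.items():
        clf.fit(X_learn[:, genes], y_learn)
        err = np.mean(clf.predict(X_test[:, genes]) != y_test)
        test_errors[name].append(err)
```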

20 Results
- Test set error rates: apply the classifier built on the learning set to the test set; summarized by box plots over the N runs
- Observation-wise error rates: for each observation, record the proportion of runs in which it was classified incorrectly; summarized by means of survival plots
- Variable selection: compare the effect of increasing or decreasing the number of genes (variables)

21 Leukemia data, two classes

22 Leukemia data, three classes

23 Lymphoma data

24 Conclusions
- In the main comparison, NN and DLDA had the smallest error rates, while FLDA had the highest
- Aggregation improved the performance of CART classifiers, with the largest gains from boosting and from bagging with CPD
- For the lymphoma and leukemia datasets, increasing the number of variables to p=200 did not affect the performance of the various classifiers much; there was an improvement for the NCI 60 dataset
- A more careful selection of a small number of genes (p=10) improved the performance of FLDA dramatically