Recursive Partitioning for Tumor Classification with Gene Expression Microarray Data Heping Zhang, Chang-Yung Yu, Burton Singer, Momiao Xiong Presented by Weihua Huang

Data used in the article: Expression profiles of 2,000 genes, measured with an Affymetrix oligonucleotide array, in 22 normal and 40 colon cancer tissues. The response is binary, indicating normal or cancer tissue, and the predictor variables are the 2,000 gene expression levels.
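As a minimal sketch of this setup, the dataset can be represented as a 62 × 2,000 predictor matrix and a binary response vector. The variable names and the random values below are illustrative placeholders, not the actual expression measurements:

```python
import numpy as np

# Hypothetical stand-in for the Affymetrix measurements: random values take
# the place of the 62 tissues (22 normal + 40 colon cancer) x 2,000 genes.
rng = np.random.default_rng(0)
X = rng.normal(size=(62, 2000))    # predictors: one column per gene
y = np.array([0] * 22 + [1] * 40)  # binary response: 0 = normal, 1 = cancer
```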

Classification Tree Using Recursive Partitioning Goal: To partition the feature space into disjoint regions by growing a tree, so that the samples falling in the same region are homogeneous in terms of the response. Algorithm: Start with a root node containing the study sample and split it into smaller and smaller nodes according to whether a particular selected predictor is above a chosen cutoff value. At each splitting step, the selected predictor and its cutoff are chosen to maximize the reduction in node impurity ΔI = P(A)I(A) − P(A_L)I(A_L) − P(A_R)I(A_R), where A_L and A_R are the left and right daughter nodes of node A.
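One splitting step can be sketched as follows, assuming entropy impurity and a single candidate predictor; the function names are illustrative, not from the paper, and the root node is taken as A (so P(A) = 1):

```python
import numpy as np

def entropy(p):
    """Entropy impurity of a binary node with class proportion p."""
    if p == 0.0 or p == 1.0:
        return 0.0                        # pure node: zero impurity
    return -p * np.log(p) - (1 - p) * np.log(1 - p)

def best_split(x, y):
    """Scan all cutoffs on one predictor x and return (cutoff, delta_I)
    maximizing the impurity reduction
    delta_I = I(A) - P(A_L) I(A_L) - P(A_R) I(A_R)."""
    n = len(y)
    I_A = entropy(y.mean())
    best_c, best_dI = None, -np.inf
    for c in np.unique(x)[:-1]:           # candidate cutoffs (both sides nonempty)
        left, right = y[x <= c], y[x > c]
        dI = I_A - (len(left) / n) * entropy(left.mean()) \
                 - (len(right) / n) * entropy(right.mean())
        if dI > best_dI:
            best_c, best_dI = c, dI
    return best_c, best_dI
```

On a toy predictor that separates the two classes perfectly, the chosen cutoff recovers the full root impurity, so ΔI = log 2.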

Classification Tree Using Recursive Partitioning Node impurity: One example of a node impurity measure is the entropy function −P log(P) − (1−P) log(1−P), where P is the probability of a tissue being normal within the node. Minimum impurity (= 0) when all tissues within the node are of the same type (P = 0 or 1). Maximum impurity (= log 2) when the node contains half normal and half cancer tissues (P = 0.5).
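The bounds above can be checked numerically; this small sketch uses the natural log, as in the slide, with 0·log 0 taken as 0:

```python
import numpy as np

def entropy_impurity(p):
    """Entropy impurity -p log p - (1-p) log(1-p);
    p is the proportion of normal tissues in the node."""
    if p == 0.0 or p == 1.0:
        return 0.0                        # pure node: minimum impurity
    return -p * np.log(p) - (1 - p) * np.log(1 - p)
```

A pure node (P = 0 or 1) has impurity 0; a half-and-half node (P = 0.5) attains the maximum, log 2 ≈ 0.693.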

Results From Classification Tree on the Data Fig. 1. Classification tree for tissue types by using expression data from three genes (M26383, R15447, M28214).

Another Way to Visualize the Recursive Partitioning Fig. 3. A scatterplot of expression data from R15447 and M28214 for a subset of tissues (node 3 in Fig. 1).

Results from Recursive Partitioning Quality of the tree-based classification was assessed using a localized 5-fold cross-validation (the same genes are kept at the same nodes; only the cutoff values are re-estimated). Randomly divide the 40 cancer tissues into 5 subsamples of 8, and the 22 normal tissues into 5 subsamples of 4, 4, 4, 5, and 5. Four subsamples each from the cancer and normal tissues were used to choose the cutoff values for the three splits; the remaining subsamples were used to count the tissues misclassified under the new cutoff values. The error rate was 6–8% over two runs of cross-validation, which is much better than that obtained by existing analyses.
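The cross-validation idea can be sketched for a simplified single-split (one-gene) classifier: the gene is held fixed and only the cutoff is re-estimated on each training portion. This is a loose stand-in for the paper's scheme — the per-class stratified subsampling and the three-split tree are not reproduced, and all function names are assumptions:

```python
import numpy as np

def fit_cutoff(x, y):
    """Choose the cutoff (and direction) minimizing training misclassification."""
    best_c, best_flip, best_err = float(x.min()), 0, np.inf
    for c in np.unique(x)[:-1]:
        for flip in (0, 1):                       # which side is called 'cancer'
            pred = (x > c).astype(int) ^ flip
            err = np.mean(pred != y)
            if err < best_err:
                best_c, best_flip, best_err = float(c), flip, err
    return best_c, best_flip

def cv_error(x, y, k=5, seed=0):
    """k-fold CV: the gene stays fixed; the cutoff is refit on each training fold."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    errors = 0
    for fold in np.array_split(idx, k):
        train = np.setdiff1d(idx, fold)
        c, flip = fit_cutoff(x[train], y[train])
        pred = (x[fold] > c).astype(int) ^ flip
        errors += int(np.sum(pred != y[fold]))
    return errors / len(y)
```

With a gene whose expression cleanly separates the 22 normal from the 40 cancer tissues, the cross-validated error rate comes out low.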

Correlation Analysis on Genes Expression levels of different genes are correlated. Examine the correlation patterns of the three genes selected in Fig. 1.

Correlation Between the Three Selected Genes and the Remaining Expression Data
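Such a correlation screen might be computed as in the sketch below. A random matrix stands in for the real expression data, and the column indices chosen for the three selected genes are arbitrary placeholders:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(62, 2000))     # stand-in for the expression matrix
selected = [10, 20, 30]             # placeholder columns for the 3 chosen genes

# Pearson correlation of each selected gene with every gene on the array.
Z = (X - X.mean(axis=0)) / X.std(axis=0)   # standardize each gene (column)
corr = Z[:, selected].T @ Z / X.shape[0]   # shape (3, 2000)

# Genes most correlated (in absolute value) with the first selected gene;
# its own self-correlation of 1.0 ranks first.
top5 = np.argsort(-np.abs(corr[0]))[:5]
```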

Another Tree Based on a Different Set of Three Genes Fig. 6. Classification tree for tissue types using expression data from three genes (R87126, T62947, X15183).

Correlation Matrix Among Genes in Fig. 1 and Fig. 6

Advantages of the Classification Tree 1. Efficient with a large number of genes. 2. Automatically selects a small set of informative genes as predictors, yielding an easily interpreted classifier. 3. More precise than some other classification methods, such as support vector machines and linear discriminant analysis.

Conclusions: 1. The information contained in a large number of genes can likely be captured by a small, well-chosen set of genes without significant loss. 2. The precision of recursive-partitioning classification is important for clinical application.