Molecular Classification of Cancer: Class Discovery and Class Prediction by Gene Expression Monitoring Golub TR, Slonim DK, Tamayo P, Huard C, Gaasenbeek.

Molecular Classification of Cancer: Class Discovery and Class Prediction by Gene Expression Monitoring T.R. Golub et al., Science 286, 531 (1999)

Molecular Classification of Cancer: Class Discovery and Class Prediction by Gene Expression Monitoring Golub TR, Slonim DK, Tamayo P, Huard C, Gaasenbeek M, Mesirov JP, Coller H, Loh ML, Downing JR, Caligiuri MA, Bloomfield CD, Lander ES. Whitehead Institute/Massachusetts Institute of Technology Center for Genome Research, Cambridge, MA 02139, USA.

Contents Background; Objective; Methods; Results; Conclusion.

Background: Cancer Classification Cancer classification is central to cancer treatment; Traditional cancer classification methods: by site (ICD-9/10); by morphology (ICD-O, etc.); Limitations of morphological classification: tumors of similar histopathological appearance can follow significantly different clinical courses and respond differently to therapy; Morphologically similar tumors can be further subdivided at the molecular level; Traditionally, cancer classification has relied on specific biological insights rather than on systematic and unbiased approaches.

Background: Cancer Classification (Continued) Cancer classification can be divided into two challenges: class discovery and class prediction. Class discovery refers to defining previously unrecognized tumor subtypes. Class prediction refers to the assignment of particular tumor samples to already-defined classes.

Background: Leukemia Acute leukemia: variability in clinical outcome and subtle differences in nuclear morphology; Subtypes: acute lymphoblastic leukemia (ALL) and acute myeloid leukemia (AML); ALL subcategories: T-lineage ALL and B-lineage ALL; Particular subtypes of acute leukemia have been found to be associated with specific chromosomal translocations; No single test is currently sufficient to establish the diagnosis, which instead relies on a combination of tests (morphology, histochemistry, immunophenotyping, etc.); Although usually accurate, leukemia classification remains imperfect and errors do occur.

Objective To develop a more systematic approach to cancer classification, based on the simultaneous expression monitoring of thousands of genes using DNA microarrays, with acute leukemia as the test case.

Method: Biological Samples Primary samples: 38 bone marrow samples (27 ALL, 11 AML) obtained from acute leukemia patients at the time of diagnosis; Independent samples: 34 leukemia samples (24 bone marrow and 10 peripheral blood samples);

Method: Microarray RNA prepared from cells was hybridized to high-density oligonucleotide Affymetrix microarrays containing probes for 6817 human genes; Samples were subjected to a priori quality control standards regarding the amount of labeled RNA and the quality of the scanned microarray image.

Statistical Methods: "Neighborhood analysis" (Fig. 1A): Briefly, one defines an "idealized expression pattern" corresponding to a gene that is uniformly high in one class and uniformly low in the other. One tests whether there is an unusually high density of genes "nearby" (or similar to) this idealized pattern, as compared to equivalent random patterns.
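The idea can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: it assumes the signal-to-noise score P(g,c) = (mu1 - mu0) / (sd1 + sd0) as the measure of similarity to the idealized pattern, and it takes "equivalent random patterns" to mean random permutations of the class labels. The function names, the toy data, and the `radius` cutoff are all hypothetical.

```python
import random
from statistics import mean, stdev

def signal_to_noise(values, labels):
    """Score one gene: (mean1 - mean0) / (sd1 + sd0); labels are 0 or 1."""
    hi = [v for v, c in zip(values, labels) if c == 1]
    lo = [v for v, c in zip(values, labels) if c == 0]
    denom = stdev(hi) + stdev(lo)
    return (mean(hi) - mean(lo)) / denom if denom else 0.0

def neighborhood_count(expr, labels, radius):
    """Number of genes whose absolute score is at least `radius`."""
    return sum(abs(signal_to_noise(g, labels)) >= radius for g in expr)

def neighborhood_analysis(expr, labels, radius, n_perm=200, seed=0):
    """Compare the observed count with counts under shuffled class labels."""
    rng = random.Random(seed)
    observed = neighborhood_count(expr, labels, radius)
    exceed = 0
    for _ in range(n_perm):
        shuffled = labels[:]
        rng.shuffle(shuffled)
        if neighborhood_count(expr, shuffled, radius) >= observed:
            exceed += 1
    # Permutation p-value with the usual +1 correction.
    return observed, (exceed + 1) / (n_perm + 1)

# Toy data (made up): 30 genes x 10 samples, the first 10 genes informative.
rng = random.Random(1)
labels = [0] * 5 + [1] * 5
expr = [[rng.gauss(4.0 * c if g < 10 else 0.0, 0.5) for c in labels]
        for g in range(30)]

observed, p_value = neighborhood_analysis(expr, labels, radius=1.0)
print(observed, p_value)
```

A small p-value here means that far more genes sit close to the idealized pattern than random relabelings of the samples produce.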

Statistical Methods (Continued) Development of class predictor Uses a fixed subset of "informative genes" chosen based on their correlation with class distinction and makes a prediction on the basis of the expression level of these genes in a new sample; Each informative gene casts a "weighted vote" for one of the classes, with the magnitude of each vote dependent on the expression level in the new sample and the degree of that gene's correlation with the class distinction (Fig. 1B); The votes were summed to determine the winning class, as well as a "prediction strength" (PS), which is a measure of the margin of victory that ranges from 0 to 1; The sample was assigned to the winning class if PS exceeded a predetermined threshold, and was otherwise considered uncertain. On the basis of previous analysis, a threshold of 0.3 was used.
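A weighted-voting predictor of this kind might look like the sketch below. It assumes each gene's weight is the signal-to-noise score (mean1 - mean0) / (sd1 + sd0), that the decision boundary is the midpoint of the two class means, and that PS = (V_win - V_lose) / (V_win + V_lose). Selecting genes by largest absolute weight is a simplification made here, and all names and data are hypothetical.

```python
from statistics import mean, stdev

def train_predictor(expr, labels, n_genes):
    """expr: one list of per-sample values per gene; labels: 0/1 per sample."""
    params = []
    for g, values in enumerate(expr):
        hi = [v for v, c in zip(values, labels) if c == 1]
        lo = [v for v, c in zip(values, labels) if c == 0]
        denom = stdev(hi) + stdev(lo)
        weight = (mean(hi) - mean(lo)) / denom if denom else 0.0
        boundary = (mean(hi) + mean(lo)) / 2
        params.append((g, weight, boundary))
    # Keep the genes most strongly correlated with the class distinction.
    params.sort(key=lambda p: abs(p[1]), reverse=True)
    return params[:n_genes]

def predict(params, sample, threshold=0.3):
    """Return (call, PS); call is 1, 0, or None when PS is not above threshold."""
    v1 = v0 = 0.0
    for g, weight, boundary in params:
        vote = weight * (sample[g] - boundary)
        if vote > 0:
            v1 += vote      # vote for class 1
        else:
            v0 += -vote     # vote for class 0
    total = v1 + v0
    ps = abs(v1 - v0) / total if total else 0.0
    winner = 1 if v1 > v0 else 0
    return (winner if ps > threshold else None), ps

# Toy training set (made up): 3 genes x 6 samples.
labels = [0, 0, 0, 1, 1, 1]
expr = [
    [1.0, 2.0, 3.0, 7.0, 8.0, 9.0],   # high in class 1
    [9.0, 8.0, 7.0, 3.0, 2.0, 1.0],   # high in class 0
    [5.0, 4.0, 6.0, 5.0, 6.0, 4.0],   # uninformative
]
model = train_predictor(expr, labels, n_genes=2)
clear_call, clear_ps = predict(model, [8.0, 2.0, 5.0])
weak_call, weak_ps = predict(model, [6.0, 5.8, 5.0])
print(clear_call, clear_ps, weak_call, weak_ps)
```

The second toy sample shows the "uncertain" outcome: its two genes vote for opposite classes with similar magnitude, so PS falls below the 0.3 threshold and no call is made.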

Statistical Methods (continued) Validity testing of class predictors Two-step procedure: (1). The accuracy of the predictors was first tested by cross-validation on the initial data set. Briefly, one withholds a sample, builds a predictor based only on the remaining samples, and predicts the class of the withheld sample. The process is repeated for each sample, and the cumulative error rate is calculated; (2). One then builds a final predictor based on the initial data set and assesses its accuracy on an independent set of samples.
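Step (1) is leave-one-out cross-validation, sketched below. The nearest-centroid classifier is only a stand-in so the example runs on its own; in the slide's setting, `fit` would rebuild the whole predictor (including the choice of informative genes) from the remaining samples in every round. All function names and the toy data are hypothetical.

```python
def leave_one_out(samples, labels, fit, predict):
    """Return (errors, uncertain) accumulated over every withheld sample."""
    errors = uncertain = 0
    for i in range(len(samples)):
        # Withhold sample i and train only on the rest.
        train_x = samples[:i] + samples[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        model = fit(train_x, train_y)
        call = predict(model, samples[i])
        if call is None:
            uncertain += 1
        elif call != labels[i]:
            errors += 1
    return errors, uncertain

def fit_centroids(samples, labels):
    """Stand-in classifier: one mean vector per class."""
    centroids = {}
    for cls in set(labels):
        rows = [s for s, c in zip(samples, labels) if c == cls]
        centroids[cls] = [sum(col) / len(rows) for col in zip(*rows)]
    return centroids

def predict_nearest(centroids, sample):
    """Assign the class whose centroid is closest (squared Euclidean)."""
    def dist(center):
        return sum((a - b) ** 2 for a, b in zip(sample, center))
    return min(centroids, key=lambda cls: dist(centroids[cls]))

# Toy data (made up): six samples, two expression values each.
samples = [[1.0, 9.0], [2.0, 8.0], [1.5, 8.5],
           [8.0, 1.0], [9.0, 2.0], [8.5, 1.5]]
labels = ["ALL", "ALL", "ALL", "AML", "AML", "AML"]
errors, uncertain = leave_one_out(samples, labels, fit_centroids, predict_nearest)
print(errors, uncertain)
```

Passing `fit` and `predict` as arguments keeps the withheld sample out of every training step, which is what makes the cumulative error rate an honest estimate.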

Statistical Methods (continued) Clustering methods for class discovery Self-organizing maps (SOMs) technique: The user specifies the number of clusters to be identified. The SOM finds an optimal set of "centroids" around which the data points appear to aggregate. It then partitions the data set, with each centroid defining a cluster consisting of the data points nearest to it.
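A toy SOM with a one-dimensional grid of nodes illustrates the procedure. This is a sketch under assumptions chosen for illustration (random-sample initialization, linearly decaying learning rate and neighborhood width, Gaussian neighborhood function); it is not the software used in the study, and the data are made up.

```python
import math
import random

def nearest(nodes, x):
    """Index of the node (centroid) closest to sample x."""
    return min(range(len(nodes)),
               key=lambda j: sum((a - b) ** 2 for a, b in zip(x, nodes[j])))

def train_som(data, n_nodes, n_iter=1000, seed=0):
    rng = random.Random(seed)
    # Start each node at a distinct, randomly chosen sample.
    nodes = [list(p) for p in rng.sample(data, n_nodes)]
    for t in range(n_iter):
        frac = t / n_iter
        lr = 0.5 * (1 - frac) + 0.01 * frac      # learning rate decays
        sigma = 1.0 * (1 - frac) + 0.1 * frac    # neighborhood shrinks
        x = rng.choice(data)
        bmu = nearest(nodes, x)                  # best-matching unit
        for j, w in enumerate(nodes):
            # Nodes near the BMU on the grid are pulled toward x as well.
            h = math.exp(-((j - bmu) ** 2) / (2 * sigma ** 2))
            for d in range(len(w)):
                w[d] += lr * h * (x[d] - w[d])
    return nodes

# Toy data (made up): two well-separated groups of five samples each.
rng = random.Random(1)
data = ([[rng.gauss(0.0, 0.5), rng.gauss(0.0, 0.5)] for _ in range(5)] +
        [[rng.gauss(5.0, 0.5), rng.gauss(5.0, 0.5)] for _ in range(5)])

nodes = train_som(data, n_nodes=2)
assignment = [nearest(nodes, x) for x in data]
print(assignment)
```

After training, each sample is assigned to its nearest centroid, which is the partitioning step described on the slide.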

Results Class prediction: (1). Were there genes whose expression pattern was strongly correlated with the class distinction to be predicted? For the 38 acute leukemia samples, neighborhood analysis showed that roughly 1100 genes were more highly correlated with the AML-ALL class distinction than would be expected by chance (Fig. 2). This suggested that classification could indeed be based on expression data.

Results (2). How to use a collection of known samples to create a "class predictor" capable of assigning a new sample to one of two classes? A set of informative genes to be used in the predictor was chosen to be the 50 genes most closely correlated with AML-ALL distinction in the known samples.

Results (3). How to test the validity of class predictors? Cross-validation tests: The 50-gene predictor assigned 36 of the 38 samples as either AML or ALL and the remaining two as uncertain (PS < 0.3). All 36 predictions agreed with the patients' clinical diagnosis; Independent test: The 50-gene predictor was applied to an independent collection of 34 leukemia samples. The predictor assigned 29 of the 34 samples, and the accuracy was 100%; Prediction strength: median PS = 0.77 in cross-validation and 0.73 in independent test (Fig. 3A).

Results (3). How to test the validity of class predictors (continued)? The average prediction strength was lower for samples from one laboratory that used a very different protocol for sample preparation; this suggests that sample preparation should be standardized in any clinical implementation.

Results (4). How many genes should be included in the class predictor? The choice to use 50 informative genes in the predictor was somewhat arbitrary: well within the total number of genes strongly correlated with the class distinction; seemed large enough to be robust against noise, and small enough to be readily applied in a clinical setting. The results were insensitive to the particular choice: predictors built with different numbers of informative genes were all found to be 100% accurate, reflecting the strong correlation of genes with the AML-ALL distinction.

Results (5). The list of informative genes used in the AML versus ALL predictor was highly instructive (Fig. 3B). Some genes, including CD11c, CD33, and MB-1, encode cell surface proteins useful in distinguishing lymphoid from myeloid lineage cells. Others provide new markers of acute leukemia subtype. For example, the leptin receptor, originally identified through its role in weight regulation, showed high relative expression in AML. Together, these data suggest that genes useful for cancer class prediction may also provide insight into cancer pathogenesis and pharmacology.

Results (6). The methodology of class prediction can be applied to any measurable distinction among tumors. Importantly, such distinctions could concern a future clinical outcome. The ability to predict response to chemotherapy was explored among the 15 adult AML patients who had been treated and for whom long-term clinical follow-up was available. No strong multigene expression signature correlated with clinical outcome was found, although this could reflect the relatively small sample size.

Results Class discovery If the AML-ALL distinction were not already known, could it have been discovered simply on the basis of gene expression?

Results Two cluster analysis (1). Cluster tumors by gene expression: A two-cluster SOM was applied to automatically group the 38 initial leukemia samples into two classes on the basis of the expression pattern of all 6817 genes.

Results (2). Determine whether putative classes produced are meaningful. The clusters were first evaluated by comparing them to the known AML-ALL classes (Fig. 4A). Class A1 contained mostly ALL (24 of 25 samples) and class A2 contained mostly AML (10 of 13 samples). The SOM was thus quite effective at automatically discovering the two types of leukemia.
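The agreement between the SOM clusters and the known classes can be checked with a little arithmetic. The counts below are derived from the figures on this slide (24 of 25 and 10 of 13), with the remaining samples in each cluster assigned to the other class; the variable names are hypothetical.

```python
# Cluster composition implied by the slide: A1 has 24 ALL of 25 samples,
# A2 has 10 AML of 13 samples; the remainder belong to the other class.
clusters = {
    "A1": {"ALL": 24, "AML": 25 - 24},
    "A2": {"ALL": 13 - 10, "AML": 10},
}

total = sum(sum(counts.values()) for counts in clusters.values())
# A sample "agrees" when it belongs to the majority class of its cluster.
agree = sum(max(counts.values()) for counts in clusters.values())
all_total = sum(counts["ALL"] for counts in clusters.values())
aml_total = sum(counts["AML"] for counts in clusters.values())

agreement = agree / total
print(total, agree, round(agreement, 3), all_total, aml_total)
```

The class totals recovered this way (27 ALL, 11 AML) match the composition of the 38 initial samples, and 34 of the 38 samples fall in the cluster dominated by their own class.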

Results How could one evaluate such putative clusters if the "right" answer were not already known? Class discovery could be tested by class prediction; If putative classes reflect true structure, then a class predictor based on these classes should perform well.

Results To test this hypothesis, the clusters A1 and A2 were evaluated: (a). We constructed predictors to assign new samples as "type A1" or "type A2."

Results (b). Cross-validation: Predictors that used a wide range of different numbers of informative genes performed well; The cross-validation thus not only showed high accuracy, but actually refined the SOM-defined classes: the subset of samples accurately classified in cross-validation largely coincided with the samples the SOM had cleanly separated into ALL and AML.

Results (c). Independent test: The median PS was 0.61, and 74% of samples were above threshold (Fig. 4B). High prediction strengths indicate that the structure seen in the initial data set is also seen in the independent data set.

Results (d). Same analyses with random clusters: Such clusters consistently yielded predictors with poor accuracy in cross-validation and low prediction strength on the independent data set (Fig. 4B). On the basis of such analysis, the A1-A2 distinction can be seen to be meaningful, rather than simply a statistical artifact of the initial data set. The results thus show that the AML-ALL distinction could have been automatically discovered and confirmed without previous biological knowledge.

Results Multiple cluster analysis (1). A four-cluster SOM divided the samples into four classes (denoted B1 to B4), which largely corresponded to AML, T-lineage ALL, B-lineage ALL, and B-lineage ALL, respectively (Fig. 4C). The four-cluster SOM thus divided the samples along another key biological distinction. (2). These classes were evaluated by constructing class predictors. The four classes could be distinguished from one another, with the exception of B3 versus B4 (Fig. 4D).

Results Multiple cluster analysis (continued) The prediction tests thus confirmed the distinctions corresponding to AML, B-ALL, and T-ALL, and suggested that it may be appropriate to merge classes B3 and B4, composed primarily of B-lineage ALL.

Conclusion Class Prediction The study described techniques for class prediction, whereby samples can be automatically assigned to already-recognized classes; These class predictors could be adapted to a clinical setting, with appropriate steps to standardize the protocol for sample preparation. Such a test would supplement rather than replace existing leukemia diagnostics.

Conclusion Class Prediction (continued): Class predictors can be constructed for known pathological categories and provide diagnostic confirmation or clarify unusual cases. The technique of class prediction can be applied to distinctions relating to future clinical outcome, such as drug response or survival. Class prediction provides an unbiased, general approach to constructing such prognostic tests.

Conclusion Class Discovery In principle, the class discovery techniques described here can be used to identify fundamental subtypes of any cancer. In general, such studies will require careful experimental design to avoid potential experimental artifacts, especially in the case of solid tumors.

Conclusion Class Discovery (continued) Various approaches could be used to avoid such artifacts; Class discovery methods could also be used to search for fundamental mechanisms that cut across distinct types of cancers.