Original Figures for "Molecular Classification of Cancer: Class Discovery and Class Prediction by Gene Expression Monitoring"

Slides:

Advertisements

Similar presentations

Instance-based Classification Examine the training samples each time a new query instance is given. The relationship between the new query instance and.

Advertisements

T. R. Golub, D. K. Slonim & Others Big Picture in 1999 The Need for Cancer Classification Cancer classification very important for advances in cancer.

Reduced Support Vector Machine

Bioinformatics Challenge  Learning in very high dimensions with very few samples  Acute leukemia dataset: 7129 # of gene vs. 72 samples  Colon cancer.

CHAPTER 6 Statistical Analysis of Experimental Data

6-2 The Standard Normal Distribution

BPS - 5th Ed. Chapter 31 The Normal Distributions.

Correlation and Regression Analysis

1 Comparison of Discrimination Methods for the Classification of Tumors Using Gene Expression Data Presented by: Tun-Hsiang Yang.

Cross Tabulation and Chi-Square Testing. Cross-Tabulation While a frequency distribution describes one variable at a time, a cross-tabulation describes.

Title: Spatial Data Mining in Geo-Business. Overview  Twisting the Perspective of Map Surfaces — describes the character of spatial distributions through.

Gene expression profiling identifies molecular subtypes of gliomas

Molecular Classification of Cancer: Class Discovery and Class Prediction by Gene Expression Monitoring Golub TR, Slonim DK, Tamayo P, Huard C, Gaasenbeek.

Quantitative Skills: Data Analysis and Graphing.

Supplementary Text 2. If the relationship between two random variables X and Y is approximately illustrated in the following figure, then how can we discretize.

Section Copyright © 2014, 2012, 2010 Pearson Education, Inc. Lecture Slides Elementary Statistics Twelfth Edition and the Triola Statistics Series.

PSYCHOLOGY: Themes and Variations Weiten and McCann Appendix B : Statistical Methods Copyright © 2007 by Nelson, a division of Thomson Canada Limited.

© Copyright McGraw-Hill CHAPTER 3 Data Description.

Quantitative Skills 1: Graphing

Essential Statistics Chapter 31 The Normal Distributions.

Scenario 6 Distinguishing different types of leukemia to target treatment.

Class Prediction and Discovery Using Gene Expression Data Donna K. Slonim, Pablo Tamayo, Jill P. Mesirov, Todd R. Golub, Eric S. Lander 발표자 : 이인희.

Evolutionary Algorithms for Finding Optimal Gene Sets in Micro array Prediction. J. M. Deutsch Presented by: Shruti Sharma.

BPS - 5th Ed. Chapter 31 The Normal Distributions.

Essential Statistics Chapter 31 The Normal Distributions.

11/23/2015Slide 1 Using a combination of tables and plots from SPSS plus spreadsheets from Excel, we will show the linkage between correlation and linear.

Application of Class Discovery and Class Prediction Methods to Microarray Data Kellie J. Archer, Ph.D. Assistant Professor Department of Biostatistics.

Molecular Classification of Cancer Class Discovery and Class Prediction by Gene Expression Monitoring.

Molecular Classification of Cancer: Class Discovery and Class Prediction by Gene Expression Monitoring T.R. Golub et al., Science 286, 531 (1999)

Computational Biology Group. Class prediction of tumor samples Supervised Clustering Detection of Subgroups in a Class.

CSE182 L14 Mass Spec Quantitation MS applications Microarray analysis.

Copyright © Cengage Learning. All rights reserved. 5 Joint Probability Distributions and Random Samples.

Chapter 7 Random Variables and Continuous Distributions.

Unsupervised Learning

Lecture Slides Elementary Statistics Twelfth Edition

CHAPTER 3 Data Description 9/17/2018 Kasturiarachi.

Elementary Statistics

Molecular Classification of Cancer

Lecture Slides Elementary Statistics Thirteenth Edition

Strategy Description Discovery Validation Application

Chapter 10 Correlation and Regression

Volume 1, Issue 2, Pages (March 2002)

Volume 6, Issue 5, Pages e5 (May 2018)

Volume 1, Issue 1, Pages (February 2002)

Basic Practice of Statistics - 3rd Edition The Normal Distributions

Revealing Global Regulatory Perturbations across Human Cancers

Pejman Mohammadi, Niko Beerenwinkel, Yaakov Benenson Cell Systems

Translation of Genotype to Phenotype by a Hierarchy of Cell Subsystems

Face Pareidolia in the Rhesus Monkey

Gene-Expression Variation Within and Among Human Populations

An Introduction to Correlational Research

Volume 3, Issue 1, Pages (July 2016)

Hippocampal “Time Cells”: Time versus Path Integration

Basic Practice of Statistics - 3rd Edition The Normal Distributions

Gene expression correlates of clinical prostate cancer behavior

Revealing Global Regulatory Perturbations across Human Cancers

Medial Axis Shape Coding in Macaque Inferotemporal Cortex

Volume 91, Issue 5, Pages (September 2016)

Colocalization of Multiple DNA Loci: A Physical Mechanism

Volume 89, Issue 6, Pages (March 2016)

Testing the Fit of a Quantal Model of Neurotransmission

Grid and Nongrid Cells in Medial Entorhinal Cortex Represent Spatial Location and Environmental Features with Complementary Coding Schemes Geoffrey W.

Assessing Exhaustiveness of Stochastic Sampling for Integrative Modeling of Macromolecular Structures Shruthi Viswanath, Ilan E. Chemmama, Peter Cimermancic,

Volume 50, Issue 4, Pages (May 2006)

Brandon Ho, Anastasia Baryshnikova, Grant W. Brown Cell Systems

Qing-Rong Chen, Gordon Vansant, Kahuku Oades, Maria Pickering, Jun S

MGS 3100 Business Analysis Regression Feb 18, 2016

Volume 11, Issue 7, Pages (May 2015)

Unsupervised Learning

Evolution-informed forecasting of seasonal influenza A (H3N2)‏

Presentation transcript:

Original Figures for "Molecular Classification of Cancer: Class Discovery and Class Prediction by Gene Expression Monitoring"

Figure 1a. Neighborhood Analysis. The class distinction is represented by an 'idealized expression pattern c, in which the expression level is uniformly high in class 1 and uniformly low in class 2. Each gene is represented by an expression vector, consisting of its expression level in each of the tumor samples. In the figure, the dataset consists of 12 samples comprised of 6 AMLs and 6 ALLs. Gene g1 is well correlated with the class distinction, while g2 is poorly correlated. Neighborhood analysis involves counting the number of genes having various levels of correlation with c. The results are compared to the corresponding distribution obtained for random idealized expression patterns c*, obtained by randomly permuting the coordinates of c. An unusually high density of genes indicates that there are many more genes correlated with the pattern than expected by chance. The precise measure of distance and other methodological details are described in notes (16,17) and on our web site.

Figure 1b. The prediction of a new sample is based on 'weighted votes' of a set of informative genes. Each such gene gi votes for either AML or ALL, depending on whether its expression level x_i in the sample is closer to mu_AML or mu_ALL (which denote, respectively, the mean expression levels of AML and ALL in a set of reference samples). The magnitude of the vote is w_i v_i, where w_i is a weighting factor that reflects how well the gene is correlated with the class distinction and v_i = |x_i - (mu_AML + mu_ALL)/2| reflects the deviation of the expression level in the sample from the average of mu_AML and mu_ALL. The votes for each class are summed to obtain total votes V_AML and V_ALL. The sample is assigned to the class with the higher vote total, provided that the prediction strength exceeds a predetermined threshold. The prediction strength reflects the margin of victory and is defined as (V_win- V_lose)/(V_win+V_lose), where as V_win and V_lose are the respective vote totals for the winning and losing classes. Methodological details are described in the paper (notes 19,20).

Figure 2. Neighborhood analysis: ALL vs AML. For the 38 leukemia samples in the initial dataset, the plot shows the number of genes within various 'neighborhoods' of the the ALL/AML class distinction together with curves showing the 5% and 1% significance levels for the number of genes within corresponding neighborhoods of the randomly permuted class distinctions (see notes 16,17 in the paper). Genes more highly expressed in ALL compared to AML are shown in the left panel; those more highly expressed in AML compared to ALL are shown in right panel. Note the large number of genes highly correlated with the class distinction. In the left panel (higher in ALL), the number of genes with correlation P(g,c) > 0.30 was 709 for the AML-ALL distinction, but had a median of 173 genes for random class distinctions. Note that P(g,c) = 0.30 is the point where the observed data intersects the 1% significance level, meaning that 1% of random neighborhoods contain as many points as the observed neighborhood round the AML- ALL distinction. Similarly, in the right panel (higher in AML), 711 genes with P(g,c) > 0.28 were observed, whereas a median of 136 genes is expected for random class distinctions.

Figure 3a. Prediction strengths. The scatterplots show the prediction strengths (PS) for the samples in cross- validation (left) and on the independent sample (right). Median PS is denoted by a horizontal line. Predictions with PS below 0.3 are considered as uncertain.

Figure 3b. Genes distinguishing ALL from AML. The 50 genes most highly correlated with the ALL/AML class distinction are shown. Each row corresponds to a gene, with the columns corresponding to expression levels in different samples. Expression levels for each gene are normalized across the samples such that the mean is 0 and the standard deviation is 1. Expression levels greater than the mean are shaded in red, and those below the mean are shaded in blue. The scale indicates standard deviations above or below the mean. The top panel shows genes highly expressed in ALL, the bottom panel shows genes more highly expressed in AML. Note that while these genes as a group appear correlated with class, no single gene is uniformly expressed across the class, illustrating the value of a multi-gene prediction method.

Figure 4. ALL/AML class discovery. (A) Schematic representation of 2-cluster SOM. A 2-cluster (2x1) SOM was generated from the 38 initial leukemia samples, using a modification of the GENECLUSTER computer package (32). Each of the 38 samples is thereby placed into one of two clusters on the basis of patterns of gene expression for the 6817 genes assayed in each sample. Note that cluster A1 contains the majority of ALL samples (grey squares) and cluster A2 contains the majority of AML samples (black circles). (B) Prediction strength (PS) distributions. The scatterplots show the distribution of PS scores for class predictors. The first two plots show the distribution for the predictor created to classify samples as 'A1-type' or 'A2-type' tested in cross-validation on the initial dataset (median PS = 0.86) and on the independent dataset (median PS = 0.61). The remaining plots show the distribution for two predictors corresponding to random classes. In these cases, the PS scores are much lower (median PS = 0.20 and 0.34, respectively) and approximately half of the samples fall below the threshold for prediction (PS = 0.3). A total of 100 such random predictors were examined, to calculate the distribution of median PS scores to evaluate statistical the significance of the predictor for A1-A2 (see note 36 in the paper). (C) Schematic representation of the 4-cluster SOM. AML samples are shown as black circles, T-lineage ALL striped squares, and B-lineage ALL as grey squares. T- and B-lineages were differentiated on the basis of cell-surface immunophenotyping. Note that class B1 is exclusively AML, class B2 contains all 8 T-ALLs, and classes B3 and B4 contain the majority of of B-ALL samples. (D) Prediction strength (PS) distributions for pairwise comparison among classes. Cross-validation prediction studies show that the four classes could be distinguished with high prediction scores, with the exception of classes B3 and B4. These two classes could not be easily distinguished from one another, consistent with their both containing-primarily B-ALL samples, and suggesting that B3 and B4 might best be merged into a single class.

Supplemetal Figures for "Molecular Classification of Cancer: Class Discovery and Class Prediction by Gene Expression Monitoring"

Supplementary fig. 1. Schematic illustration of methodology. (a) Strategy for cancer classification. Tumor classes may be known a priori or discovered on the basis of the expression data by using Self-Organizing Maps (SOMs) as described in the text. Class Prediction involves assignment of an unknown tumor sample to the appropriate class on the basis of gene expression pattern. This consists of several steps: neighborhood analysis to assess whether there is a significant excess of genes correlated with the class distinction, selection of the informative genes and construction of a class predictor, initial evaluation of class prediction by cross-validation, and final evaluation by testing in an independent data set.

Supplementary fig. 2. Expression levels of predictive genes in independent dataset. The expression levels of the 50 genes most highly correlated with the ALL-AML distinction in the initial dataset were determined in the independent dataset. Each row corresponds to a gene, with the columns corresponding to expression levels in different samples. The expression level of each gene in the independent dataset is shown relative to the mean of expression levels for that gene in the initial dataset. Expression levels greater than the mean are shaded in red, and those below the mean are shaded in blue. The scale indicates standard deviations above or below the mean. The top panel shows genes highly expressed in ALL, the bottom panel shows genes more highly expressed in AML.