Application of Class Discovery and Class Prediction Methods to Microarray Data Kellie J. Archer, Ph.D. Assistant Professor Department of Biostatistics.



Basis of Cancer Diagnosis
Pathologist makes an interpretation based upon a compendium of knowledge which may include
–Morphological appearance of the tumor
–Histochemistry
–Immunophenotyping
–Cytogenetic analysis
–etc.

Clinically Distinct DLBCL Subgroups

Improved Cancer Diagnosis: Identify sub-classes
Divide morphologically similar tumors into different groups based on response.
Application of microarrays: Characterize molecular variations among tumors by monitoring gene expression.
Goal: microarrays will lead to more reliable tumor classification and sub-classification (therefore, more appropriate treatments will be administered, resulting in improved outcomes).

Distinguishing two types of acute leukemia (AML vs. ALL) Golub, T.R. et al Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science 286: (near bottom of page)

Distinguishing AML vs. ALL
38 bone marrow (BM) samples (27 childhood ALL, 11 adult AML) were hybridized to Affymetrix GeneChips.
–GeneChip included 6,817 human genes.
–Affymetrix MAS 4.0 software was used to perform image analysis.
–MAS 4.0 Average Difference expression summary method was applied to the probe level data to obtain probe set expression summaries.
–Scaling factor was used to normalize the GeneChips.
–Samples were required to meet quality control criteria.

Distinguishing AML vs. ALL Class comparison –Neighborhood analysis Class prediction –Weighted voting

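Class Discovery: Distinguishing AML vs. ALL
The mean of a random variable X, $\mu = E[X]$, is a measure of central location of the density of X.
The variance of a random variable is a measure of spread or dispersion of the density of X:
$\mathrm{Var}(X) = E[(X - \mu)^2]$, estimated from a sample $x_1, \dots, x_n$ by $s^2 = \sum_{i=1}^{n} (x_i - \bar{x})^2 / (n - 1)$.
Standard deviation: $\sigma(X) = \sqrt{\mathrm{Var}(X)}$, estimated by $s = \sqrt{s^2}$.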
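The sample estimates above can be sketched in a few lines of Python. The expression values are made up for illustration and are not taken from the leukemia data; the hand computation is simply checked against the standard library.

```python
import math
import statistics

# Hypothetical log expression values for one gene (illustrative only).
x = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

n = len(x)
mean = sum(x) / n
# Sample variance: squared deviations from the sample mean, divided by n - 1.
var = sum((xi - mean) ** 2 for xi in x) / (n - 1)
sd = math.sqrt(var)

# The standard library implements the same n - 1 estimators.
same_var = math.isclose(var, statistics.variance(x))
same_sd = math.isclose(sd, statistics.stdev(x))
print(mean, var, sd, same_var, same_sd)
```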

Class Discovery: Distinguishing AML vs. ALL
For each gene, compute the log of the expression values. For a given gene g:
–For ALL, let $\mu_{ALL}(g)$ represent the mean log expression value and $\sigma_{ALL}(g)$ represent the stdev log expression value.
–For AML, let $\mu_{AML}(g)$ represent the mean log expression value and $\sigma_{AML}(g)$ represent the stdev log expression value.

Class Discovery: Distinguishing AML vs. ALL Illustration using ALL AML example.xls

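Class Discovery: Distinguishing AML vs. ALL
For each gene, compute a relative class separation (quasi-correlation measure) as follows:
$P(g,c) = \dfrac{\mu_1(g) - \mu_2(g)}{\sigma_1(g) + \sigma_2(g)}$
where $\mu_k(g)$ and $\sigma_k(g)$ are the mean and standard deviation of the log expression values of gene g in class k.
Define neighborhoods of radius r about classes 1 and 2 such that P(g,c) > r or P(g,c) < -r. r was chosen to be 0.3.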
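As a minimal sketch of the relative class separation and the neighborhood assignment, the following uses three made-up genes whose values are assumed to be on the log scale already. The function name `class_separation`, the gene names, and the use of the n - 1 standard deviation are assumptions of this sketch, not details given on the slide.

```python
import math
import statistics

def class_separation(class1, class2):
    """Relative class separation P(g,c) for one gene's log expression values."""
    mu1, mu2 = statistics.mean(class1), statistics.mean(class2)
    s1, s2 = statistics.stdev(class1), statistics.stdev(class2)
    return (mu1 - mu2) / (s1 + s2)

# Hypothetical log expression values: (class 1 samples, class 2 samples).
genes = {
    "gene_up": ([8.0, 9.0, 10.0], [4.0, 5.0, 6.0]),
    "gene_flat": ([5.0, 6.0, 7.0], [5.5, 6.0, 6.5]),
    "gene_down": ([3.0, 4.0, 5.0], [6.0, 8.0, 10.0]),
}

r = 0.3  # neighborhood radius used on the slide
scores = {name: class_separation(c1, c2) for name, (c1, c2) in genes.items()}
near_class1 = [name for name, p in scores.items() if p > r]
near_class2 = [name for name, p in scores.items() if p < -r]
print(scores, near_class1, near_class2)
```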

Aside
This differs from Pearson's correlation and is therefore not confined to the interval [-1, 1].
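To make the aside concrete, the sketch below computes both measures for one made-up gene: Pearson's correlation between expression and a 0/1 class indicator, and the relative class separation. The data and the helper names `pearson` and `separation` are hypothetical.

```python
import math
import statistics

def pearson(x, y):
    """Pearson correlation coefficient, written out from its definition."""
    mx, my = statistics.mean(x), statistics.mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def separation(class1, class2):
    """Relative class separation: difference in means over sum of stdevs."""
    return (statistics.mean(class1) - statistics.mean(class2)) / (
        statistics.stdev(class1) + statistics.stdev(class2)
    )

# Hypothetical gene: three class 1 samples followed by three class 2 samples.
expression = [8.0, 9.0, 10.0, 4.0, 5.0, 6.0]
indicator = [1, 1, 1, 0, 0, 0]

rho = pearson(expression, indicator)              # always within [-1, 1]
p_gc = separation(expression[:3], expression[3:])  # can fall outside [-1, 1]
print(rho, p_gc)
```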

Aside Illustration using Correlation.xls

Class Discovery: Distinguishing AML vs. ALL A permutation test was used to calculate whether the observed number of genes in a neighborhood was significantly higher than expected.

Permutation based methods
Permutation based adjusted p-values
–Under the complete null, the joint distribution of the test statistics can be estimated by permuting the columns of the gene expression matrix.
–Permuting entire columns creates a situation in which membership in the Class 1 and Class 2 groups is independent of gene expression but preserves the dependence structure between genes.

Permutation based methods

Permutation algorithm
For the b-th permutation, b = 1, …, B:
–1) Permute the n labels of the data matrix X.
–2) Compute the relative class separation $P_b(g_1,c), \dots, P_b(g_p,c)$ for each gene $g_i$.
The permutation distribution of the relative class separation P(g,c) for gene $g_i$, i = 1, …, p, is given by the empirical distribution of $P_1(g_i,c), \dots, P_B(g_i,c)$.
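The two steps above can be sketched as follows. This is an illustration on simulated data, not the analysis from the paper: the matrix `X`, the choice B = 200, the seed, and the per-gene two-sided p-value at the end are all assumptions of the sketch.

```python
import random
import statistics

def class_separation(values, labels):
    """P(g,c) for one gene, given one class label (1 or 2) per sample."""
    g1 = [v for v, lab in zip(values, labels) if lab == 1]
    g2 = [v for v, lab in zip(values, labels) if lab == 2]
    return (statistics.mean(g1) - statistics.mean(g2)) / (
        statistics.stdev(g1) + statistics.stdev(g2)
    )

def permutation_distribution(X, labels, B, rng):
    """Return perm[i][b] = P_b(g_i, c) for gene i under the b-th permutation."""
    perm = [[] for _ in X]
    shuffled = list(labels)
    for _ in range(B):
        # Step 1: permute the n sample labels; every gene shares the same
        # permutation, so the dependence between genes is preserved.
        rng.shuffle(shuffled)
        # Step 2: recompute the relative class separation for every gene.
        for i, row in enumerate(X):
            perm[i].append(class_separation(row, shuffled))
    return perm

rng = random.Random(0)
labels = [1] * 4 + [2] * 4
# Simulated matrix: p = 5 genes (rows) by n = 8 samples (columns).
X = [[rng.gauss(0.0, 1.0) for _ in labels] for _ in range(5)]
# Shift gene 0 upward in class 1 so that it is truly differential.
X[0] = [v + 3.0 if lab == 1 else v for v, lab in zip(X[0], labels)]

observed = [class_separation(row, labels) for row in X]
perm = permutation_distribution(X, labels, B=200, rng=rng)
# Share of permutations at least as extreme as the observed value.
p_values = [
    sum(abs(pb) >= abs(obs) for pb in dist) / len(dist)
    for obs, dist in zip(observed, perm)
]
print(p_values)
```

Shuffling the label vector once per permutation, rather than once per gene, is what "permuting entire columns" amounts to in this layout.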

Distinguishing AML vs. ALL
Class comparisons using neighborhood analysis revealed that approximately 1,100 genes were more highly correlated with class (AML or ALL) than would be expected by chance.

Class Prediction: Distinguishing AML vs. ALL
For the set of informative genes, each expression value $x_i$ votes for either ALL or AML, depending on whether it is closer to $\mu_{ALL}$ or $\mu_{AML}$.
–Let $\mu_{ALL}$ represent the mean expression value for ALL
–Let $\mu_{AML}$ represent the mean expression value for AML
Informative genes were the n/2 genes with the largest P(g,c) and the n/2 genes with the smallest (most negative) P(g,c). Golub et al. chose n = 50.

Class Prediction: Distinguishing AML vs. ALL
$w_i$ is a weighting factor that reflects how well the gene is correlated with the class distinction; $w_i v_i$ is the weighted vote.
For each sample, the weighted votes for each class are summed to get $V_{ALL}$ and $V_{AML}$.
The sample is assigned to the class with the higher total, provided the Prediction Strength (PS) > 0.3, where
$PS = (V_{win} - V_{lose}) / (V_{win} + V_{lose})$
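A minimal sketch of the voting rule follows. The slide does not define $v_i$, so the sketch assumes $v_i = x_i - (\mu_{ALL} + \mu_{AML})/2$ with $w_i = P(g,c)$ and ALL as the class with positive votes; the three genes, their weights, and the function name `weighted_vote` are hypothetical.

```python
def weighted_vote(sample, genes, threshold=0.3):
    """Return (predicted class or None, prediction strength) for one sample.

    genes holds one (w, mu_all, mu_aml) tuple per informative gene.
    Assumed vote: v = x - (mu_all + mu_aml) / 2, weighted by w.
    """
    v_all = v_aml = 0.0
    for x, (w, mu_all, mu_aml) in zip(sample, genes):
        vote = w * (x - (mu_all + mu_aml) / 2.0)
        if vote > 0:
            v_all += vote   # x lies closer to the ALL mean
        else:
            v_aml -= vote   # x lies closer to the AML mean
    total = v_all + v_aml
    if total == 0:
        return None, 0.0
    if v_all >= v_aml:
        winner, v_win, v_lose = "ALL", v_all, v_aml
    else:
        winner, v_win, v_lose = "AML", v_aml, v_all
    ps = (v_win - v_lose) / total
    # No call is made when the margin of victory is too small.
    return (winner if ps > threshold else None), ps

# Hypothetical informative genes: (weight, mean in ALL, mean in AML).
genes = [(2.0, 9.0, 5.0), (-1.5, 4.0, 8.0), (1.0, 6.0, 5.0)]

clear_call = weighted_vote([8.5, 4.5, 5.4], genes)
weak_call = weighted_vote([7.1, 6.1, 5.5], genes)
print(clear_call, weak_call)
```

In the second sample the votes nearly cancel, so the prediction strength falls below 0.3 and the sample is left unassigned.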

Class Prediction: Distinguishing AML vs. ALL

Checking model adequacy –Cross-validation of training dataset –Applied model to an independent dataset of 34 samples

Class Discovery Determine whether the samples can be divided based only on gene expression without regard to the class labels –Self-organizing maps

Hypothesis Testing
The hypothesis that two means $\mu_1$ and $\mu_2$ are equal is called a null hypothesis, commonly abbreviated $H_0$. This is typically written as $H_0: \mu_1 = \mu_2$.
Its antithesis is the alternative hypothesis, $H_A: \mu_1 \neq \mu_2$.

Hypothesis Testing
A statistical test of hypothesis is a procedure for assessing the compatibility of the data with the null hypothesis.
–The data are considered compatible with $H_0$ if any discrepancy from $H_0$ could readily be due to chance (i.e., sampling error).
–Data judged to be incompatible with $H_0$ are taken as evidence in favor of $H_A$.

Hypothesis Testing
If the sample means calculated are identical, we would suspect the null hypothesis is true.
Even if the null hypothesis is true, we do not really expect the sample means to be identically equal, because of sampling variability.
We would feel comfortable concluding that the data are compatible with $H_0$ if the difference in the sample means does not exceed a couple of standard errors.

T-test
In testing $H_0: \mu_1 = \mu_2$ against $H_A: \mu_1 \neq \mu_2$, note that we could have restated the hypotheses as $H_0: \mu_1 - \mu_2 = 0$ and $H_A: \mu_1 - \mu_2 \neq 0$.
To carry out the t-test, the first step is to compute the test statistic and then compare the result to a t-distribution with the appropriate degrees of freedom (df).
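The slide does not say which form of the two-sample statistic was used, so the sketch below assumes the unequal-variance (Welch) version with Welch-Satterthwaite degrees of freedom as one common choice. The data and the function name `welch_t` are made up for illustration.

```python
import math
import statistics

def welch_t(a, b):
    """Two-sample t statistic and approximate df, unequal variances assumed."""
    va, vb = statistics.variance(a), statistics.variance(b)
    na, nb = len(a), len(b)
    # Squared standard error of the difference in sample means.
    se2 = va / na + vb / nb
    t = (statistics.mean(a) - statistics.mean(b)) / math.sqrt(se2)
    # Welch-Satterthwaite approximation to the degrees of freedom.
    df = se2 ** 2 / ((va / na) ** 2 / (na - 1) + (vb / nb) ** 2 / (nb - 1))
    return t, df

# Hypothetical log expression values for one probe set in two classes.
group1 = [8.0, 9.0, 10.0, 9.0]
group2 = [5.0, 6.0, 7.0, 6.0]

t_stat, df = welch_t(group1, group2)
print(t_stat, df)
```

The p-value then comes from comparing `t_stat` with a t-distribution on `df` degrees of freedom; the Python standard library has no t-distribution, so that step is left to a table or a statistics package.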

T-test
–Data must be independent random samples from their respective populations.
–Sample size should either be large or, in the case of small sample sizes, the population distributions must be approximately normally distributed.
–When assumptions are not met, non-parametric alternatives are available (Wilcoxon Rank Sum/Mann-Whitney Test).
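For small samples, the rank-based alternative can be computed exactly by enumeration. The sketch below uses the Mann-Whitney U statistic with a two-sided exact p-value over all relabelings of the pooled data; the data and the function names are hypothetical, and full enumeration is only practical for small groups.

```python
import math
from itertools import combinations

def mann_whitney_u(a, b):
    """Number of (a, b) pairs with a > b, counting ties as one half."""
    return sum((x > y) + 0.5 * (x == y) for x in a for y in b)

def exact_p_value(a, b):
    """Two-sided exact p-value by enumerating every split of the pooled data."""
    pooled = list(a) + list(b)
    n1 = len(a)
    center = n1 * len(b) / 2.0  # expected U under the null hypothesis
    observed = abs(mann_whitney_u(a, b) - center)
    extreme = total = 0
    for idx in combinations(range(len(pooled)), n1):
        chosen = set(idx)
        g1 = [pooled[i] for i in idx]
        g2 = [pooled[i] for i in range(len(pooled)) if i not in chosen]
        total += 1
        extreme += abs(mann_whitney_u(g1, g2) - center) >= observed
    return extreme / total

# Hypothetical log expression values for one probe set in two classes.
group1 = [8.0, 9.0, 10.0, 9.0]
group2 = [5.0, 6.0, 7.0, 6.0]

u_stat = mann_whitney_u(group1, group2)
p_exact = exact_p_value(group1, group2)
print(u_stat, p_exact)
```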

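T-test: Probe set _at
[Table: expression values by sample number for ALL and AML, with the sample variance s², standard deviation s, and n = 8 for each class]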

T-test: Probe set _at P=0.039