Significance Analysis of Microarrays (SAM)

Slides:



Advertisements
Similar presentations
Shibing Deng Pfizer, Inc. Efficient Outlier Identification in Lung Cancer Study.
Advertisements

1 Parametric Empirical Bayes Methods for Microarrays 3/7/2011 Copyright © 2011 Dan Nettleton.
Relating Gene Expression to a Phenotype and External Biological Information Richard Simon, D.Sc. Chief, Biometric Research Branch, NCI
Copyright © 2006 The McGraw-Hill Companies, Inc. Permission required for reproduction or display. 1 ~ Curve Fitting ~ Least Squares Regression Chapter.
CHAPTER 25: One-Way Analysis of Variance Comparing Several Means
From the homework: Distribution of DNA fragments generated by Micrococcal nuclease digestion mean(nucs) = bp median(nucs) = 110 bp sd(nucs+ = 17.3.
1 Introduction to Experimental Design 1/26/2009 Copyright © 2009 Dan Nettleton.
Gene Shaving – Applying PCA Identify groups of genes a set of genes using PCA which serve as the informative genes to classify samples. The “gene shaving”
Genomic Profiles of Brain Tissue in Humans and Chimpanzees II Naomi Altman Oct 06.
OHRI Bioinformatics Introduction to the Significance Analysis of Microarrays application Stem.
© 2010 Pearson Prentice Hall. All rights reserved Single Factor ANOVA.
Clustering short time series gene expression data Jason Ernst, Gerard J. Nau and Ziv Bar-Joseph BIOINFORMATICS, vol
6/10/20151 Microarray Data Analysis. 6/10/20152 Copyright notice Many of the images in this power point presentation of other people. The Copyright belong.
Preprocessing Methods for Two-Color Microarray Data
Sample size computations Petter Mostad
Differentially expressed genes
Biostatistics Unit 2 Descriptive Biostatistics 1.
1 Test of significance for small samples Javier Cabrera.
Statistics 800: Quantitative Business Analysis for Decision Making Measures of Locations and Variability.
Inferences About Process Quality
Different Expression Multiple Hypothesis Testing STAT115 Spring 2012.
1 Normalization Methods for Two-Color Microarray Data 1/13/2009 Copyright © 2009 Dan Nettleton.
(4) Within-Array Normalization PNAS, vol. 101, no. 5, Feb Jianqing Fan, Paul Tam, George Vande Woude, and Yi Ren.
HAWKES LEARNING SYSTEMS math courseware specialists Copyright © 2010 by Hawkes Learning Systems/Quant Systems, Inc. All rights reserved. Chapter 14 Analysis.
T-test Mechanics. Z-score If we know the population mean and standard deviation, for any value of X we can compute a z-score Z-score tells us how far.
Correlation.
Essential Statistics in Biology: Getting the Numbers Right
1 Use of the Half-Normal Probability Plot to Identify Significant Effects for Microarray Data C. F. Jeff Wu University of Michigan (joint work with G.
Significance analysis of microarrays (SAM) SAM can be used to pick out significant genes based on differential expression between sets of samples. Currently.
CSCE555 Bioinformatics Lecture 16 Identifying Differentially Expressed Genes from microarray data Meeting: MW 4:00PM-5:15PM SWGN2A21 Instructor: Dr. Jianjun.
Wilcoxon rank sum test (or the Mann-Whitney U test) In statistics, the Mann-Whitney U test (also called the Mann-Whitney-Wilcoxon (MWW), Wilcoxon rank-sum.
Bioinformatics Expression profiling and functional genomics Part II: Differential expression Ad 27/11/2006.
Statistics for Differential Expression Naomi Altman Oct. 06.
Application of Class Discovery and Class Prediction Methods to Microarray Data Kellie J. Archer, Ph.D. Assistant Professor Department of Biostatistics.
1 Introduction to Mixed Linear Models in Microarray Experiments 2/1/2011 Copyright © 2011 Dan Nettleton.
Comp. Genomics Recitation 10 4/7/09 Differential expression detection.
Analyzing Expression Data: Clustering and Stats Chapter 16.
Copyright © Cengage Learning. All rights reserved. 15 Distribution-Free Procedures.
1 Significance analysis of Microarrays (SAM) Applied to the ionizing radiation response Tusher, Tibshirani, Chu (2001) Dafna Shahaf.
Section Copyright © 2014, 2012, 2010 Pearson Education, Inc. Lecture Slides Elementary Statistics Twelfth Edition and the Triola Statistics Series.
A Quantitative Overview to Gene Expression Profiling in Animal Genetics Armidale Animal Breeding Summer Course, UNE, Feb Analysis of (cDNA) Microarray.
Two-Sample-Means-1 Two Independent Populations (Chapter 6) Develop a confidence interval for the difference in means between two independent normal populations.
Copyright (c) 2004 Brooks/Cole, a division of Thomson Learning, Inc. Chapter 7 Inferences Concerning Means.
Micro array Data Analysis. Differential Gene Expression Analysis The Experiment Micro-array experiment measures gene expression in Rats (>5000 genes).
1 A Discussion of False Discovery Rate and the Identification of Differentially Expressed Gene Categories in Microarray Studies Ames, Iowa August 8, 2007.
Chapter 11 Analysis of Variance
ANOVA: Analysis of Variation
Estimation of Gene-Specific Variance
Multiple Testing Methods for the Analysis of Microarray Data
Estimation & Hypothesis Testing for Two Population Parameters
Multiple Regression Analysis: Inference
Differential Gene Expression
Basic Practice of Statistics - 5th Edition
12 Inferential Analysis.
Normalization Methods for Two-Color Microarray Data
Significance analysis of microarrays (SAM)
Significance Analysis of Microarrays (SAM)
Inverse Transformation Scale Experimental Power Graphing
Georgi Iskrov, MBA, MPH, PhD Department of Social Medicine
Multiple Testing Methods for the Analysis of Gene Expression Data
Comparing Three or More Means
Chapter 11 Analysis of Variance
One-Way Analysis of Variance
12 Inferential Analysis.
Parametric Empirical Bayes Methods for Microarrays
Introduction to Mixed Linear Models in Microarray Experiments
Introduction to Experimental Design
2.3. Measures of Dispersion (Variation):
Distribution-Free Procedures
pairing data values (before-after, method1 vs
Presentation transcript:

Significance Analysis of Microarrays (SAM) 3/7/2011 Copyright © 2011 Dan Nettleton

Significance Analysis of Microarrays (SAM) Tusher, V. G., Tibshirani, R., and Chu, G. (2001). Significance analysis of microarrays applied to the ionizing radiation response. Proceedings of the National Academy of Sciences 98(9), 5116-5121. See http://www-stat.stanford.edu/~tibs/SAM/ for more recent developments.

Motivating Experiment Treatment Irradiated (I) Unirradiated (U) 1 Human Cell Lines 2 One RNA sample for each combination of cell line and treatment

Motivating Experiment Treatment Irradiated (I) Unirradiated (U) 1 I1A I1B U1A U1B Human Cell Lines 2 I2A I2B U2A U2B After labeling, each RNA sample was split into two aliquots denoted A and B.

Motivating Experiment Treatment Irradiated (I) Unirradiated (U) 1 I1A I1B U1A U1B Human Cell Lines 2 I2A I2B U2A U2B 8 GeneChips, one for each sample, were used to obtain measures of expression.

The Measure of Expression The Tusher et al. paper appeared before much work had been done on computing expression measures from PM and MM intensities. It appears that they used something similar to an average of PM-MM intensities over the probe pairs in a probe set as their expression measure. They conducted a regression-based normalization in which each GeneChip was normalized to the average over all GeneChips. Furthermore they worked with normalized data on the original rather than the log scale.

Test Statistic for the ith Gene Average of 4 normalized measures from irradiated samples Average of 4 normalized measures from unirradiated samples - - xI(i) – xU(i) d(i) = s(i)+s0 The usual standard error in the denominator of a two-sample t-stat A constant common to all genes that is added to make variation in d(i) similar across genes of all intensity levels

Selecting the constant s0 At low expression levels, variance in d(i) can be high because of small values of s(i). To stabilize the variance of d(i) across genes, a small positive constant s0 was used in the denominator of the test statistic. “The coefficient of variation of d(i) was computed as a function of s(i) in moving windows across the data. The value for s0 was chosen to minimize the coefficient of variation.” s0 was chosen to be 3.3 for the ionizing radiation data.

More Detail on Selecting s0 The d(i) are separated into approximately 100 groups. The 1% of the d(i) values with the smallest s(i) values are placed in the first group, the 1% of the d(i) values with the next smallest s(i) are placed in the second group, etc. The median absolute deviation (MAD) of the d(i) values is computed separately for each group. The coefficient of variation (CV) of these 100 MAD values is computed.

More Detail on Selecting s0 (continued) This process is repeated for values of s0 equal to the minimum of s(i) over i, the 5th percentile of the s(i) values, the 10th percentile of the s(i) values,..., the 95th percentile of the s(i) values, and the maximum of the s(i) values. The value of s0 that minimizes the CV of the 100 MAD values over candidate s0 described above is selected as the constant s0.

Plot of d(i) vs. log10s(i) for the Ionizing Radiation Data

A Permutation Procedure for Assessing Significance The irradiated and unirradiated GeneChips were shuffled within each cell line. The d(i) statistic was computed for each gene and ordered across genes from smallest to largest to obtain d1(1)<d1(2)< <d1(g) where g denotes the number of genes. Steps 1 and 2 were repeated for all possible data permutations described in step 1 to obtain dp(1)<dp(2)< <dp(g) for p=1,...,36. ... ... 4 2 4 2

Example Permutations Treatment Irradiated (I) Unirradiated (U) 1 I1B U1A U1B Human Cell Lines 2 I2A I2B U2A U2B

A Permutation Procedure for Assessing Significance (continued) For each i, d1(i),...,d36(i) were averaged to obtain dE(i), the “expected relative difference.” The original d(i) statistics were also sorted so that d(1)<d(2)< <d(g). 6. Genes for which | d(i) – dE(i) | >  were declared significant, where  is a user specified cutoff for significance. ...

Plot of Observed vs. “Expected” Test Statistics Points for genes with evidence of induction d(i) 2 Points for genes with evidence of repression dE(i)

Plot of d(i) vs. log10s(i) for the Ionizing Radiation Data 24 induced genes d(i) 22 repressed genes log10s(i)

Estimating FDR for a Selected  Find the smallest d(i) among those d(i) for which d(i) – dE(i) >  and call it dup. Find the largest d(i) among those d(i) for which d(i) - dE(i) < - and call it ddown. For each permuted data set, find the number of genes with d(i) >= dup or d(i) <= ddown and denote these counts by n1,...,n36. FDR is estimated by n / n where n is the average of n1,...,n36 and n is the number of genes identified as significant in the original data. - -

Plot of Observed vs. “Expected” Test Statistics 4.054688 d(i) -4.073859 dE(i)

Plot of d(i) vs. log10s(i) for the Ionizing Radiation Data 4.054688 d(i) -4.073859 log10s(i)

Same Plot for One of the Permuted Data Sets only 5 genes beyond thresholds compared to 46 for original data 4.054688 d(i) -4.073859 log10s(i)

Counts of Genes beyond the Threshold for Each Permutation Perm Count Perm Count Perm Count 1 45 2 5 3 2 4 3 5 4 6 11 7 8 8 5 9 1 10 1 11 3 12 4 13 4 14 1 15 3 16 9 17 12 18 31 19 31 20 12 21 9 22 3 23 1 24 4 25 4 26 2 27 1 28 1 29 5 30 9 31 11 32 4 33 3 34 2 35 5 36 46

Mean Count = 8.472 FDR Estimate = 8.472/46 = 18.4% Perm Count Perm Count Perm Count 1 45 2 5 3 2 4 3 5 4 6 11 7 8 8 5 9 1 10 1 11 3 12 4 13 4 14 1 15 3 16 9 17 12 18 31 19 31 20 12 21 9 22 3 23 1 24 4 25 4 26 2 27 1 28 1 29 5 30 9 31 11 32 4 33 3 34 2 35 5 36 46

Concluding Remarks Ideally the method would be applied in situations with more true biological replications than were available in the ionizing radiation data. Can be extended in many ways. Other choices for test statistics are possible. Should permute in such a way that all permutations are equally likely under the null hypothesis of interest. The properties of SAM regarding error rate control are not well known.