1 Test of significance for small samples Javier Cabrera.

Presentation transcript:

1 Test of significance for small samples Javier Cabrera

2 Outline

3 Differential Expression for small samples
[Data table: one row per gene (G); columns are control arrays C1, C2, C3 and treatment arrays T1, T2, T3.]
1. Preprocessed data.
2. Perform a t-test for each gene.
3. Select the most significant subset.

4 The pooled variances T-test
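
The formula on this slide did not survive extraction. As a reminder, the textbook pooled-variance statistic is $t = (\bar{x}_2 - \bar{x}_1) / (s_p \sqrt{1/n_1 + 1/n_2})$ with $s_p^2 = ((n_1-1)s_1^2 + (n_2-1)s_2^2)/(n_1+n_2-2)$. A minimal sketch for one gene follows; the function name `pooled_t` is an assumption made for illustration and is not from the slides.

```python
import math
import statistics

def pooled_t(control, treatment):
    """Return (t, s_p) for two independent samples, assuming equal variances."""
    n1, n2 = len(control), len(treatment)
    s1_sq = statistics.variance(control)      # unbiased sample variance, group 1
    s2_sq = statistics.variance(treatment)    # unbiased sample variance, group 2
    # Pooled standard deviation s_p
    sp = math.sqrt(((n1 - 1) * s1_sq + (n2 - 1) * s2_sq) / (n1 + n2 - 2))
    diff = statistics.mean(treatment) - statistics.mean(control)
    t = diff / (sp * math.sqrt(1 / n1 + 1 / n2))
    return t, sp

# One gene measured on three control and three treatment arrays
t, sp = pooled_t([1.0, 2.0, 3.0], [2.0, 3.0, 4.0])
```

With three arrays per group the statistic has only 4 degrees of freedom, which is why $s_p$ is so noisy in the slides that follow.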

5 Plot of $t$ vs $s_p$
1. Only genes that have small $s_p$ are flagged as differentially expressed.
2. Moderately and highly expressed genes are unlikely to have small $s_p$, so they will not be picked up.
3. Most genes that are picked up are low expressers.

6 Is this effect statistical or biological? This graph was generated using IID normal samples
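
The point of this slide can be checked with a small simulation. The sketch below assumes 2000 null genes, 3 arrays per group, and N(0,1) noise (all of these sizes are illustrative choices, not values from the slides), then compares the pooled standard deviation of the top-ranked genes with that of all genes.

```python
import math
import random
import statistics

random.seed(0)
G, n = 2000, 3                      # assumed: 2000 genes, 3 arrays per group
records = []
for _ in range(G):
    control = [random.gauss(0, 1) for _ in range(n)]
    treated = [random.gauss(0, 1) for _ in range(n)]   # no true effect
    sp = math.sqrt((statistics.variance(control) + statistics.variance(treated)) / 2)
    t = (statistics.mean(treated) - statistics.mean(control)) / (sp * math.sqrt(2 / n))
    records.append((abs(t), sp))

records.sort(reverse=True)          # largest |t| first
top = records[:100]                 # the 100 "most significant" genes
median_sp_top = statistics.median([sp for _, sp in top])
median_sp_all = statistics.median([sp for _, sp in records])
```

Even though no gene is truly changed, the genes with the largest |t| tend to have a much smaller $s_p$ than the typical gene, so the pattern is statistical rather than biological.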

7 Comparison of the distribution of $s_p$ for differentially and non-differentially expressed genes: differentially expressed genes have small $s_p$.

8 The effect of small sample size
Often the sample size per group is small. Consequences:
- unreliable variances (inferences)
- dependence between the test statistics ($t_g$) and the standard error estimates ($s_g$)
Possible remedies:
- borrow strength across genes (LPE/EB)
- regularize the test statistics (SAM)
- work with $t_g \mid s_g$ (Conditional t)

9 SAM: Significance Analysis of Microarrays, Tibshirani (2001)
1. Determine c.
2. Obtain significant genes by doing a simulation and using the False Discovery Rate (FDR) to find $\Delta$.
3. Significant genes.

10 Determining c
Start with the pairs $\{r_g, s_g\}$.
Let $s^{\alpha}$ be the $\alpha$th percentile of the $\{s_g\}$ values and let $T_g(s^{\alpha}) = r_g / (s_g + s^{\alpha})$.
Compute the percentiles, $q_1 \le q_2 \le \dots \le q_{100}$, of the $s_g$ values.
For $\alpha \in \{0, 5, 10, \dots, 100\}$:
- compute $v_j(\alpha) = \mathrm{mad}\{T_g(s^{\alpha}) : s_g \in [q_j, q_{j+1})\}$, $j = 1, 2, \dots, n$;
- compute $cv(\alpha)$, the coefficient of variation of the $\{v_j(\alpha)\}$ values.
Choose $\hat{\alpha}$ as the value of $\alpha$ that minimizes $cv(\alpha)$. Fix $c = s^{\hat{\alpha}}$ as the value.
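
The search for c can be sketched in code. The version below is an illustration under stated assumptions, not the SAM implementation: it uses 20 bins of $s_g$ instead of 100 so that each bin holds enough genes, takes $T_g = r_g / (s_g + s^{\alpha})$, and runs on simulated pairs. The helper names `percentile`, `mad` and `choose_c` are hypothetical.

```python
import math
import random
import statistics

def percentile(sorted_vals, p):
    """Linear-interpolation percentile (p in [0, 100]) of a pre-sorted list."""
    k = (len(sorted_vals) - 1) * p / 100
    lo = int(k)
    hi = min(lo + 1, len(sorted_vals) - 1)
    return sorted_vals[lo] + (sorted_vals[hi] - sorted_vals[lo]) * (k - lo)

def mad(values):
    """Median absolute deviation from the median (unscaled)."""
    med = statistics.median(values)
    return statistics.median([abs(v - med) for v in values])

def choose_c(r, s, n_bins=20, alphas=range(0, 101, 5)):
    """Return (alpha_hat, c), where c = s^alpha_hat minimizes cv(alpha)."""
    s_sorted = sorted(s)
    edges = [percentile(s_sorted, 100 * j / n_bins) for j in range(n_bins + 1)]
    best = None
    for alpha in alphas:
        s_alpha = percentile(s_sorted, alpha)
        bins = [[] for _ in range(n_bins)]
        for rg, sg in zip(r, s):
            # Index j of the bin [q_j, q_{j+1}) that contains s_g
            j = min(sum(sg >= e for e in edges[1:]), n_bins - 1)
            bins[j].append(rg / (sg + s_alpha))
        v = [mad(b) for b in bins if len(b) > 1]
        cv = statistics.stdev(v) / statistics.mean(v)
        if best is None or cv < best[0]:
            best = (cv, alpha, s_alpha)
    return best[1], best[2]

# Simulated null data: 500 genes, 3 arrays per group, gene-specific variances
rng = random.Random(0)
r, s = [], []
for _ in range(500):
    sigma = math.sqrt(rng.gammavariate(1.5, 2))
    g1 = [rng.gauss(0, sigma) for _ in range(3)]
    g2 = [rng.gauss(0, sigma) for _ in range(3)]
    sp2 = (statistics.variance(g1) + statistics.variance(g2)) / 2
    r.append(statistics.mean(g2) - statistics.mean(g1))
    s.append(math.sqrt(sp2 * 2 / 3))
alpha_hat, c = choose_c(r, s)
```

Any constant rescaling of the mad leaves the coefficient of variation unchanged, so the unscaled mad is enough for choosing $\alpha$.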

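11 Determining c
[Figure: $T_g$ plotted against $s_g$, with the $s_g$ axis split into bins; $v_1(\alpha) = \mathrm{mad}\{T_g\}$, $v_2(\alpha)$, ..., $v_7(\alpha)$ are computed bin by bin. For each $\alpha$, $cv(\alpha)$ is tabulated against the corresponding percentile ($cv(\alpha_1)$, $s_1$; ...; $cv(\alpha_7)$, $s_7$) and the minimum is selected.]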

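12 Simulation and use of the False Discovery Rate (FDR) to find $\Delta$
For each gene, B permutations are generated. For each permutation the statistics are recomputed and ordered; averaging the ordered statistics over the permutations gives the expected order statistic.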
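
A simplified sketch of this permutation step is given below. It is not the SAM software: the cutoffs are taken as the most moderate statistics whose distance from the expected order statistic exceeds the threshold, the FDR is the median permutation count over the number of called genes, and no correction for the proportion of null genes is applied. The names `d_stat` and `sam_fdr` and the simulated data are assumptions.

```python
import math
import random
import statistics

def d_stat(row, labels, c):
    """Regularized t-like statistic r / (s + c) for one gene."""
    g1 = [x for x, lab in zip(row, labels) if lab == 0]
    g2 = [x for x, lab in zip(row, labels) if lab == 1]
    n1, n2 = len(g1), len(g2)
    sp2 = ((n1 - 1) * statistics.variance(g1)
           + (n2 - 1) * statistics.variance(g2)) / (n1 + n2 - 2)
    s = math.sqrt(sp2 * (1 / n1 + 1 / n2))
    return (statistics.mean(g2) - statistics.mean(g1)) / (s + c)

def sam_fdr(data, labels, c, delta, B=20, seed=0):
    """Return (number of called genes, estimated FDR) for threshold delta."""
    rng = random.Random(seed)
    d_obs = sorted(d_stat(row, labels, c) for row in data)
    perm_sorted = []
    for _ in range(B):
        perm = labels[:]
        rng.shuffle(perm)                       # permute the array labels
        perm_sorted.append(sorted(d_stat(row, perm, c) for row in data))
    # Expected order statistics: average the i-th ordered value over permutations
    d_exp = [statistics.mean([p[i] for p in perm_sorted]) for i in range(len(data))]
    up = [d for d, e in zip(d_obs, d_exp) if d - e > delta]
    low = [d for d, e in zip(d_obs, d_exp) if d - e < -delta]
    cut_up = min(up) if up else math.inf
    cut_low = max(low) if low else -math.inf
    called = sum(d >= cut_up or d <= cut_low for d in d_obs)
    false_counts = [sum(d >= cut_up or d <= cut_low for d in p) for p in perm_sorted]
    fdr = statistics.median(false_counts) / called if called else 0.0
    return called, fdr

# Simulated data: 200 genes, 3 vs 3 arrays, first 20 genes truly shifted
rng = random.Random(1)
labels = [0, 0, 0, 1, 1, 1]
data = []
for g in range(200):
    shift = 3.0 if g < 20 else 0.0
    data.append([rng.gauss(0, 1) for _ in range(3)]
                + [rng.gauss(shift, 1) for _ in range(3)])
called_loose, fdr_loose = sam_fdr(data, labels, c=0.3, delta=0.5)
called_strict, fdr_strict = sam_fdr(data, labels, c=0.3, delta=2.0)
```

With 3 arrays per group there are only 20 distinct ways to split the 6 labels, so the B random shuffles revisit the same splits repeatedly.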

13 SAM: The t statistics

14 SAM output table

15 Interpreting the SAM table
(1) Choose a value of the FDR (say 5% or 1%) and use the corresponding value of $\Delta$. In our example, suppose we choose FDR (90%) = 1%; this corresponds to $\Delta$ = 1.5.
(2) Some scientists find the choice of FDR a hard one to make and are more comfortable with a more 'classical' strategy of choosing the $\Delta$ that corresponds to a fixed proportion of false positives. This method would produce $\Delta$ = 1.1.
(3) A third strategy would be to start with strategy (2) and then check the FDR. If the FDR is too high, we may increase $\Delta$ as long as (i) there is an important reduction of the FDR and (ii) the number of called genes does not decrease substantially. In our example we may argue that $\Delta$ = 1.1 corresponds to an FDR of 4.5%, which may be good enough.
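
Strategy (1) amounts to a table lookup. The sketch below uses only the two threshold/FDR pairs quoted on this slide (1.1 with 4.5%, 1.5 with 1%); the rest of the SAM output table is not available here, and the function name `choose_delta` is hypothetical.

```python
def choose_delta(table, target_fdr):
    """Smallest threshold whose estimated FDR is at or below target_fdr, else None.

    Picking the smallest qualifying threshold keeps as many called genes as possible.
    """
    ok = [delta for delta, fdr in table if fdr <= target_fdr]
    return min(ok) if ok else None

# (threshold, FDR) pairs quoted on the slide
sam_table = [(1.1, 0.045), (1.5, 0.01)]

delta_5pct = choose_delta(sam_table, 0.05)   # FDR target of 5%
delta_1pct = choose_delta(sam_table, 0.01)   # FDR target of 1%
```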

16 Concerns about SAM
1. Permutations of 6?
2. c is just a first-order correction.

17 Basic Model
- Let $X_{gij}$ denote the preprocessed intensity measurement for gene $g$ in array $i$ of group $j$.
- Model: $X_{gij} = \mu_{gj} + \sigma_g \varepsilon_{gij}$
- Effect of interest: $\tau_g = \mu_{g2} - \mu_{g1}$
- Error model: $\varepsilon_{gij} \sim F$ (location = 0, scale = 1)
- Gene mean-variance model: $(\mu_{g1}, \sigma_g^2) \sim F_{\mu,\sigma}$ with marginals $\mu_{g1} \sim F_{\mu}$ and $\sigma_g^2 \sim F_{\sigma}$
- Conditional t:
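
To make the model concrete, the sketch below draws one gene from it. The distributional choices are assumptions made only for illustration: standard normal errors, a standard normal baseline mean, a chi-square variance distribution with 3 degrees of freedom, and independence between the mean and the variance. The function name `simulate_gene` is hypothetical.

```python
import math
import random

def simulate_gene(rng, n=3, effect=0.0, df=3):
    """Draw one gene: X = mean_j + sigma * error, for groups j = 1, 2."""
    mu1 = rng.gauss(0, 1)                    # baseline mean (assumed N(0,1))
    sigma2 = rng.gammavariate(df / 2, 2)     # variance ~ chi-square(df)
    sigma = math.sqrt(sigma2)
    group1 = [mu1 + sigma * rng.gauss(0, 1) for _ in range(n)]
    group2 = [mu1 + effect + sigma * rng.gauss(0, 1) for _ in range(n)]
    return group1, group2, sigma2

rng = random.Random(0)
group1, group2, sigma2 = simulate_gene(rng, n=3, effect=1.0)
```

Here `effect` plays the role of the difference between the two group means, and `gammavariate(df / 2, 2)` is the standard way to draw a chi-square variate with `df` degrees of freedom from the standard library.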

18 Possible approaches
Parametric: Assume functional forms for $F$ and $F_{\mu,\sigma}$ and apply either a Bayes or Empirical Bayes procedure.
Nonparametric:

19 Procedure

20 Procedure (cont.)

21 Roadblock
Let $\{X_{ij}\}$ be a sample from the model with $\sigma^2 \sim F_{\sigma}$ and let the variance obtained from the $\{X_{ij}\}$ be $s^2$. Then $\mathrm{Var}(s^2) > \mathrm{Var}(\sigma^2)$.
For example, if we assume that $F_{\sigma} = \chi^2_3$, $n = 4$ and $\varepsilon \sim N(0,1)$, then $\mathrm{Var}(\sigma^2) = 6$ and $\mathrm{Var}(s^2) = 15$.
Fix by target estimation.
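
The inequality can be checked by simulation. The sketch below assumes one reading of the example: each true variance is drawn from a chi-square distribution with 3 degrees of freedom, and the sample variance is the usual unbiased estimate from a single group of n = 4 normal observations. The helper `sample_var` and the number of draws are illustrative choices.

```python
import random

def sample_var(xs):
    """Unbiased sample variance."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

rng = random.Random(1)
N, n, df = 50_000, 4, 3
sigma2_draws, s2_draws = [], []
for _ in range(N):
    sigma2 = rng.gammavariate(df / 2, 2)          # true variance ~ chi-square(3)
    sd = sigma2 ** 0.5
    x = [sd * rng.gauss(0, 1) for _ in range(n)]  # n = 4 observations
    sigma2_draws.append(sigma2)
    s2_draws.append(sample_var(x))

var_sigma2 = sample_var(sigma2_draws)   # spread of the true variances
var_s2 = sample_var(s2_draws)           # spread of the estimated variances
```

Under this particular reading, the law of total variance gives Var(s^2) = Var(sigma^2) + (2/(n-1)) E[sigma^4] = 6 + 10 = 16, so the simulated value lands near 16 rather than exactly the 15 quoted on the slide, which may rest on a slightly different convention. Either way, the estimated variances are far more spread out than the true ones.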

22 Example: Checking the distribution of $\sigma_g$ (Mice Data)
Compare the distribution of $s_g$ vs simulation with: 1. Df = 0.5; 2. Df = 2; 3. Df = 6.

23 Another Example
Compare the distribution of $s_g$ vs simulation with: Df = 0.5, Df = 3, Df = 6.

24 Fixing the variance distribution

25 Fixing the variance distribution (contd) Proceed as before …

26 Plot of $t$ vs $s_p$: differentially expressed genes may have large $s_p$.

27 Comparison of the distribution of $s_p$ for differentially and non-differentially expressed genes selected by CT: differentially expressed genes may have large $s_p$.

28 Generating p-values

29 Extensions
F test:
- Condition on the sqrt(MSE).
Multiple comparisons:
- Tukey, Dunnett, Bump.
- Condition on the sqrt(MSE).
Gene Ontology:
- Test for the significance of groups.
- Use the hypergeometric statistic, mean t, mean p-value, or other.
- Condition on the log of the number of genes per group.
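
For the Gene Ontology item, the hypergeometric statistic is the upper-tail probability of seeing at least k group members among the selected genes. A minimal standard-library sketch follows; the function name and the toy numbers are assumptions, and the conditioning on group size described above is not implemented here.

```python
from math import comb

def hypergeom_upper_tail(N, K, n, k):
    """P(X >= k) when n genes are selected from N, of which K belong to the group."""
    total = comb(N, n)
    hits = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, min(K, n) + 1))
    return hits / total

# Toy case: 10 genes, 5 in the group, 5 selected, all 5 selected genes in the group
p_all_five = hypergeom_upper_tail(10, 5, 5, 5)
p_at_least_zero = hypergeom_upper_tail(10, 5, 5, 0)
```

In the toy case only one of the 252 possible selections contains all five group members, so the tail probability is 1/252.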

30 Conditional F

31 Target Estimation
Target Estimation: Cabrera, Fernholz (1999)
- Bias reduction.
- MSE reduction.
Recent applications:
- Ellipse estimation (multivariate target).
- Logistic regression: Cabrera, Fernholz, Devas (2003); Patel (2003). Target Conditional MLE (TCMLE).
Implementation in StatXact (CYTEL) and logXact Proc's in SAS (by CYTEL).

32 Target Estimation
[Diagram: a statistic $T(x_1, x_2, \dots, x_n)$ and its expectation $E_{\theta}(T) = g(\theta)$.]

33 Target Estimation: Algorithms
- Stochastic approximation.
- Simulation and iteration.
- Exact algorithm for TCMLE.
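
One plausible reading of "simulation and iteration" is sketched below: the target estimate is the parameter value at which the simulated expectation of the statistic equals the observed value. The example is an assumption chosen for simplicity (T is the divide-by-n standard deviation of n normal observations), and the names `mle_sd`, `expected_T` and `target_estimate` are hypothetical.

```python
import math
import random

def mle_sd(xs):
    """Divide-by-n standard deviation, which is biased low in small samples."""
    m = sum(xs) / len(xs)
    return math.sqrt(sum((x - m) ** 2 for x in xs) / len(xs))

def expected_T(theta, draws):
    """Monte Carlo estimate of g(theta) = E_theta[T], reusing fixed N(0,1) draws."""
    return sum(mle_sd([theta * z for z in zs]) for zs in draws) / len(draws)

def target_estimate(t_obs, n, n_sim=2000, n_iter=30, seed=0):
    """Iterate theta until the simulated expectation g(theta) matches t_obs."""
    rng = random.Random(seed)
    draws = [[rng.gauss(0, 1) for _ in range(n)] for _ in range(n_sim)]
    theta = t_obs
    for _ in range(n_iter):
        theta += t_obs - expected_T(theta, draws)   # fixed-point update
    return theta, draws

# Observed statistic of 1.0 from a sample of size 4
theta_hat, draws = target_estimate(1.0, n=4)
```

Reusing the same random draws at every iteration makes the simulated g(theta) a smooth function of theta, so the update converges instead of chasing simulation noise. Because the divide-by-n estimate is biased low, the target estimate comes out larger than the observed value.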

34 GO Ontology: Conditioning on log(n)
[Plot: Abs(T) vs Log(n).]