Test of significance for small samples. Javier Cabrera.


1 Test of significance for small samples Javier Cabrera

2 Outline

3 Differential Expression for small samples
       C1     C2     C3     T1     T2     T3
G1    4.67   4.44   4.42   4.73   4.85   4.69
G2    3.13   2.54   1.96   0.97   2.38   3.36
G3    6.22   6.77   5.32   6.40   6.94   6.87
G4   10.74  10.81  10.69  10.75  10.68  10.68
G5    3.76   4.16   5.27   3.05   3.20   2.85
G6    6.95   6.78   6.33   6.81   6.95   7.01
G7    4.98   4.61   4.56   4.57   4.90   4.44
G8    2.72   3.30   3.24   3.22   3.42   3.22
G9    5.29   4.79   5.13   3.31   4.67   5.27
G10   5.12   4.85   3.79   4.13   3.12   4.79
G11   4.67   3.50   4.77   4.09   3.86   2.88
G12   6.22   6.42   5.02   6.38   6.54   6.80
G13   2.88   3.76   2.78   2.98   4.81   4.15
...
1. Preprocessed data.
2. Perform a t-test for each gene.
3. Select the most significant subset.

4 The pooled variances T-test
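The formula on this slide did not survive extraction. In its standard textbook form the pooled-variance statistic is $t = (\bar{x}_T - \bar{x}_C) / (s_p \sqrt{1/n_C + 1/n_T})$ with $s_p^2 = ((n_C - 1) s_C^2 + (n_T - 1) s_T^2) / (n_C + n_T - 2)$. A minimal Python sketch, assuming that standard form and applying it to the first three genes of the example table; the function name `pooled_t` is illustrative and not from the slides.

```python
import math
from statistics import mean, variance

def pooled_t(control, treated):
    """Return (t, s_p) for a two-sample pooled-variance t-test."""
    n_c, n_t = len(control), len(treated)
    # Pooled variance: degrees-of-freedom weighted average of the two sample variances.
    sp2 = ((n_c - 1) * variance(control) + (n_t - 1) * variance(treated)) / (n_c + n_t - 2)
    s_p = math.sqrt(sp2)
    t = (mean(treated) - mean(control)) / (s_p * math.sqrt(1 / n_c + 1 / n_t))
    return t, s_p

# First three genes of the example table (C1-C3 versus T1-T3).
genes = {
    "G1": ([4.67, 4.44, 4.42], [4.73, 4.85, 4.69]),
    "G2": ([3.13, 2.54, 1.96], [0.97, 2.38, 3.36]),
    "G3": ([6.22, 6.77, 5.32], [6.40, 6.94, 6.87]),
}
results = {g: pooled_t(c, t) for g, (c, t) in genes.items()}
for g, (t, s_p) in results.items():
    print(f"{g}: t = {t:.2f}, s_p = {s_p:.3f}")
```

With three arrays per group each test has only 4 degrees of freedom, so $s_p$ is itself a noisy estimate.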

5 Plot t vs $s_p$
[Figure: plot of t versus $s_p$, labeled 300.]
1. Only genes that have small $s_p$ are differentially expressed.
2. Moderately and highly expressed genes are unlikely to have small $s_p$, so they will not be picked up.
3. Most genes that are picked up are low expressers.

6 Is this effect statistical or biological?
This graph was generated using IID normal samples.
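The point of this slide can be reproduced with a small simulation. The sketch below draws pure-noise "genes" (IID normal, no true differences), selects those with the largest |t|, and compares their pooled standard deviations with the rest. The sizes (5000 genes, 3 arrays per group, top 100) are illustrative assumptions, not the values behind the slide's graph.

```python
import math
import random
from statistics import mean, median, variance

random.seed(0)
N_GENES, N_PER_GROUP, N_TOP = 5000, 3, 100  # illustrative sizes, not the slide's

def pooled_t(control, treated):
    """Return (t, s_p) for a two-sample pooled-variance t-test."""
    n_c, n_t = len(control), len(treated)
    sp2 = ((n_c - 1) * variance(control) + (n_t - 1) * variance(treated)) / (n_c + n_t - 2)
    s_p = math.sqrt(sp2)
    return (mean(treated) - mean(control)) / (s_p * math.sqrt(1 / n_c + 1 / n_t)), s_p

# Every simulated gene is pure noise: no differential expression anywhere.
stats = []
for _ in range(N_GENES):
    ctl = [random.gauss(0, 1) for _ in range(N_PER_GROUP)]
    trt = [random.gauss(0, 1) for _ in range(N_PER_GROUP)]
    stats.append(pooled_t(ctl, trt))

# Select the genes with the largest |t|, as a per-gene t-test would.
stats.sort(key=lambda ts: abs(ts[0]), reverse=True)
sp_selected = median(s for _, s in stats[:N_TOP])
sp_all = median(s for _, s in stats)
print(f"median s_p, top {N_TOP} by |t|: {sp_selected:.3f}")
print(f"median s_p, all genes:        {sp_all:.3f}")
```

Because the selected genes have much smaller $s_p$ even though nothing is differentially expressed, the association between large |t| and small $s_p$ is at least partly a statistical artifact of small samples.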

7 Comparison of distribution of $s_p$ for differentially and non-differentially expressed genes
[Figure: two distributions of $s_p$, labeled 300 and 21983.]
Differentially expressed genes have small $s_p$.

8 The effect of small sample size
Often the sample size per group is small. Consequences:
- unreliable variances (inferences)
- dependence between the test statistics ($t_g$) and the standard error estimates ($s_g$)
Possible remedies:
- borrow strength across genes (LPE/EB)
- regularize the test statistics (SAM)
- work with $t_g \mid s_g$ (conditional t)

9 SAM: Significance Analysis of Microarrays, Tibshirani (2001)
1. Determine c.
2. Obtain significant genes by doing a simulation and use the False Discovery Rate (FDR) to find $\Delta$.
3. Significant genes.

10 Determining c
Start with the pairs $\{r_g, s_g\}$.
Let $s_\alpha$ be the $\alpha$th percentile of the $\{s_g\}$ values and let $T_g(s_\alpha) = r_g / (s_g + s_\alpha)$.
Compute the percentiles, $q_1 \le q_2 \le \ldots \le q_{100}$, of the $s_g$ values.
For $\alpha \in \{0, 5, 10, \ldots, 100\}$, compute $v_j(\alpha) = \mathrm{mad}\{ T_g(s_\alpha) \mid s_g \in [q_j, q_{j+1}) \}$, $j = 1, 2, \ldots, n$.
Compute $cv(\alpha)$, the coefficient of variation of the $\{v_j(\alpha)\}$ values.
Choose $\hat{\alpha}$ as the value of $\alpha$ that minimizes $cv(\alpha)$. Fix $c = s_{\hat{\alpha}}$.
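The selection of c can be sketched in Python as follows. This is an illustration under assumptions, not the SAM implementation: the regularized statistic is taken to be $r_g / (s_g + s_\alpha)$, percentiles use a simple nearest-rank rule, and the helper names (`percentile`, `mad`, `choose_c`) and the simulated input are hypothetical.

```python
import math
import random
from statistics import mean, median, pstdev, variance

def percentile(sorted_vals, pct):
    """Nearest-rank percentile (pct in 0..100) of an already sorted list."""
    return sorted_vals[round(pct / 100 * (len(sorted_vals) - 1))]

def mad(values):
    """Median absolute deviation from the median."""
    m = median(values)
    return median([abs(v - m) for v in values])

def choose_c(r, s, n_bins=100, alphas=range(0, 101, 5)):
    """Return (c, alpha_hat): the percentile of s that minimizes cv(alpha)."""
    s_sorted = sorted(s)
    edges = [percentile(s_sorted, 100 * j / n_bins) for j in range(n_bins + 1)]
    # Assign each gene to the bin [q_j, q_{j+1}) that contains its s_g.
    bins = [[] for _ in range(n_bins)]
    for rg, sg in zip(r, s):
        for j in range(n_bins):
            if edges[j] <= sg < edges[j + 1]:
                bins[j].append((rg, sg))
                break
    bins = [b for b in bins if len(b) >= 2]
    best = None
    for alpha in alphas:
        s_alpha = percentile(s_sorted, alpha)
        # v_j(alpha): spread of the regularized statistic within each bin.
        v = [mad([rg / (sg + s_alpha) for rg, sg in b]) for b in bins]
        cv = pstdev(v) / mean(v)
        if best is None or cv < best[0]:
            best = (cv, alpha, s_alpha)
    _, alpha_hat, c = best
    return c, alpha_hat

# Hypothetical input: 2000 pure-noise genes, 3 arrays per group.
random.seed(1)
r, s = [], []
for _ in range(2000):
    ctl = [random.gauss(0, 1) for _ in range(3)]
    trt = [random.gauss(0, 1) for _ in range(3)]
    r.append(mean(trt) - mean(ctl))
    s.append(math.sqrt((variance(ctl) + variance(trt)) / 2 * (2 / 3)))

c, alpha_hat = choose_c(r, s, n_bins=20)
print(f"alpha_hat = {alpha_hat}, c = {c:.3f}")
```

The criterion picks the c that makes the spread of the statistic as constant as possible across the range of $s_g$, which is what reduces the dependence of the statistic on the standard error.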

11 Determining c
[Figure: $T_g$ plotted against $s_g$, with $v_1(\alpha) = \mathrm{mad}\{T_g\}$, $v_2(\alpha)$, …, $v_7(\alpha)$ computed within successive ranges of $s_g$. For each $\alpha$, $cv(\alpha)$ is tabulated: $cv(\alpha_1)$ with $s_1$, …, $cv(\alpha_7)$ with $s_7$, and the minimum is selected.]

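12 Simulation and use of the False Discovery Rate (FDR) to find $\Delta$
For each gene, B permutations are generated. For each permutation, the statistic is recomputed and the values are ordered. The expected order statistic is the average of the ordered values across permutations.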
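A simplified Python sketch of this permutation step, under stated assumptions: the statistic is $r / (s + c)$, all relabelings with the original group sizes are enumerated (20 for 3 versus 3), a gene is called when its ordered statistic departs from the expected order statistic by more than $\Delta$, and the FDR is estimated as the median number of permuted statistics beyond the cutoffs divided by the number called. The function names and the simulated data are hypothetical, and the published SAM procedure differs in details.

```python
import itertools
import math
import random
from statistics import mean, median, variance

def d_stat(row, treated_idx, c):
    """Regularized statistic r / (s + c) for one gene under one labeling."""
    trt = [row[i] for i in treated_idx]
    ctl = [row[i] for i in range(len(row)) if i not in treated_idx]
    r = mean(trt) - mean(ctl)
    sp2 = ((len(ctl) - 1) * variance(ctl) + (len(trt) - 1) * variance(trt)) / (len(row) - 2)
    s = math.sqrt(sp2 * (1 / len(ctl) + 1 / len(trt)))
    return r / (s + c)

def sam_fdr(data, treated_idx, c, delta):
    """Return (number of called genes, estimated FDR) for a given delta."""
    n = len(data[0])
    observed = sorted(d_stat(row, treated_idx, c) for row in data)
    # All relabelings that keep the group sizes.
    perms = [set(p) for p in itertools.combinations(range(n), len(treated_idx))]
    perm_sorted = [sorted(d_stat(row, p, c) for row in data) for p in perms]
    # Expected order statistics: average the i-th smallest value over relabelings.
    expected = [mean(ps[i] for ps in perm_sorted) for i in range(len(data))]
    ups = [d for d, e in zip(observed, expected) if d - e > delta]
    downs = [d for d, e in zip(observed, expected) if e - d > delta]
    cut_up = min(ups) if ups else math.inf
    cut_down = max(downs) if downs else -math.inf
    called = sum(d >= cut_up or d <= cut_down for d in observed)
    # Statistics beyond the cutoffs under relabeling estimate the false calls.
    false_counts = [sum(d >= cut_up or d <= cut_down for d in ps) for ps in perm_sorted]
    fdr = median(false_counts) / called if called else 0.0
    return called, fdr

# Hypothetical data: 300 genes, the first 15 shifted upward in the treated group.
random.seed(2)
data = []
for g in range(300):
    shift = 3.0 if g < 15 else 0.0
    data.append([random.gauss(0, 1) for _ in range(3)]
                + [random.gauss(shift, 1) for _ in range(3)])

for delta in (0.5, 1.0, 1.5):
    called, fdr = sam_fdr(data, {3, 4, 5}, c=0.1, delta=delta)
    print(f"delta = {delta}: called = {called}, estimated FDR = {fdr:.2f}")
```

Scanning a grid of $\Delta$ values in this way produces a table of called genes and estimated FDR from which a working $\Delta$ can be chosen.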

13 SAM: The t statistics

14 SAM output table

15 Interpreting the SAM table
(1) Choose a value of the FDR (say 5% or 1%) and use the corresponding value of $\Delta$. In our example, suppose we choose FDR (90%) = 1%; this corresponds to $\Delta = 1.5$.
(2) Some scientists find the choice of FDR a hard one to make and are more comfortable with a more 'classical' strategy of choosing the $\Delta$ that corresponds to a fixed proportion of false positives, say 0.01. This method would produce $\Delta = 1.1$.
(3) A third strategy would be to start with strategy (2), then check the FDR. If the FDR is too high, we may increase $\Delta$ as long as (i) there is an important reduction of the FDR and (ii) the number of called genes does not decrease substantially. In our example we may argue that $\Delta = 1.1$ corresponds to an FDR of 4.5%, which may be good enough.

16 Concerns about SAM
1. Permutations of 6?
2. c is just a 1st-order correction.

17 Conditional t: Basic Model
- Let $X_{gij}$ denote the preprocessed intensity measurement for gene g in array i of group j.
- Model: $X_{gij} = \mu_{gj} + \sigma_g \epsilon_{gij}$
- Effect of interest: $\tau_g = \mu_{g2} - \mu_{g1}$
- Error model: $\epsilon_{gij} \sim F$ (location = 0, scale = 1)
- Gene mean-variance model: $(\mu_{g1}, \sigma_g^2) \sim F_{\mu,\sigma}$ with marginals $\mu_{g1} \sim F_\mu$ and $\sigma_g^2 \sim F_\sigma$

18 Possible approaches
Parametric: Assume functional forms for $F$ and $F_{\mu,\sigma}$ and apply either a Bayes or Empirical Bayes procedure.
Nonparametric:

19 Procedure

20 Procedure (cont.)

21 Roadblock
Let $\{X_{ij}\}$ be a sample from the model with $\sigma^2 \sim F_\sigma$ and let the variance obtained from the $\{X_{ij}\}$ be $s^2$. Then $\mathrm{Var}(s^2) > \mathrm{Var}(\sigma^2)$.
For example, if we assume that $F_\sigma = \chi^2_3$, $n = 4$ and $\epsilon \sim N(0,1)$, then $\mathrm{Var}(\sigma^2) = 6$ and $\mathrm{Var}(s^2) = 15$.
Fix by target estimation.
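The inequality follows from the law of total variance. For normal errors and a single sample of size n, $\mathrm{Var}(s^2) = \mathrm{Var}(\sigma^2) + E[\mathrm{Var}(s^2 \mid \sigma^2)] = \mathrm{Var}(\sigma^2) + 2E[\sigma^4]/(n-1)$, which always exceeds $\mathrm{Var}(\sigma^2)$. Under that particular reading (one sample of four observations, $s^2$ on 3 degrees of freedom, $E[\sigma^4] = 15$ for a $\chi^2_3$ variable) the identity evaluates to $6 + 2 \cdot 15 / 3 = 16$ rather than the 15 quoted on the slide, so the slide may be using a different convention for $s^2$. A simulation sketch under the same assumptions:

```python
import random
from statistics import variance

random.seed(3)
N_SIM, N = 20000, 4   # N = 4 observations per simulated gene (assumed reading of n)

sigma2, s2 = [], []
for _ in range(N_SIM):
    # A chi-square variable with 3 df is Gamma(shape=1.5, scale=2).
    v = random.gammavariate(1.5, 2.0)
    # Normal observations with true variance v.
    x = [random.gauss(0.0, v ** 0.5) for _ in range(N)]
    sigma2.append(v)
    s2.append(variance(x))   # sample variance with divisor N - 1

var_sigma2 = variance(sigma2)
var_s2 = variance(s2)
print(f"Var(sigma^2) ~ {var_sigma2:.2f}")
print(f"Var(s^2)     ~ {var_s2:.2f}")
```

The empirical distribution of the $s_g^2$ is therefore more spread out than the distribution of the true variances, and cannot be used directly as an estimate of $F_\sigma$.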

22 Example: Checking for the distribution of $\sigma_g$
Mice data. Compare the distribution of $s_g$ vs simulation with:
1. Df = 0.5
2. Df = 2
3. Df = 6

23 Another Example
Compare the distribution of $s_g$ vs simulation with: Df = 0.5, Df = 3, Df = 6

24 Fixing the variance distribution

25 Fixing the variance distribution (contd)
Proceed as before …

26 Plot t vs $s_p$
[Figure: plot of t versus $s_p$, labeled 130.]
Differentially expressed genes may have large $s_p$.

27 Comparison of distribution of $s_p$ for differentially and non-differentially expressed genes selected by CT
Differentially expressed genes may have large $s_p$.

28 Generating p-values

29 Extensions
F test:
- Condition on the sqrt(MSE)
Multiple comparisons:
- Tukey, Dunnett, Bump.
- Condition on the sqrt(MSE)
Gene Ontology:
- Test for the significance of groups.
- Use Hypergeometric Statistic, mean t, mean p-value, or other.
- Condition on log of the number of genes per group
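For the Gene Ontology item, the hypergeometric statistic is commonly computed as an upper-tail probability: the chance of seeing at least the observed number of selected genes in a group if the selected genes were drawn at random. A minimal sketch using only the standard library; the function name and the example counts are hypothetical, not taken from the slides.

```python
from math import comb

def hypergeom_upper_tail(n_total, n_in_group, n_selected, n_overlap):
    """P(X >= n_overlap) when n_selected genes are drawn without replacement
    from n_total genes, n_in_group of which belong to the GO group."""
    denom = comb(n_total, n_selected)
    upper = min(n_in_group, n_selected)
    numer = sum(
        comb(n_in_group, k) * comb(n_total - n_in_group, n_selected - k)
        for k in range(n_overlap, upper + 1)
    )
    return numer / denom

# Hypothetical counts: 10000 genes on the array, 50 in the GO group,
# 300 called significant, 6 of which fall in the group.
p_value = hypergeom_upper_tail(10000, 50, 300, 6)
print(f"p = {p_value:.4g}")
```

Small groups can reach extreme values with very few genes, which is one motivation for conditioning on the log of the number of genes per group.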

30 Conditional F

31 Target Estimation
Target Estimation: Cabrera, Fernholz (1999)
- Bias Reduction.
- MSE reduction.
Recent Applications:
- Ellipse Estimation (Multivariate Target).
- Logistic Regression: Cabrera, Fernholz, Devas (2003); Patel (2003). Target Conditional MLE (TCMLE).
Implementation in StatXact (CYTEL) and LogXact; PROCs in SAS (by CYTEL).

32 Target Estimation
[Diagram: the statistic $T(x_1, x_2, \ldots, x_n)$ and its expectation $E_\theta(T) = g(\theta)$.]

33 Target Estimation: Algorithms
- Stochastic approximation.
- Simulation and iteration.
- Exact algorithm for TCMLE
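As an illustration of the "simulation and iteration" idea, the sketch below assumes the usual description of target estimation: choose the parameter value $\theta$ for which the expected value of the estimator, $g(\theta) = E_\theta(T)$, equals the observed value of T. Here T is the variance estimator with divisor n, $g$ is approximated by simulation with common random numbers, and the equation is solved by bisection. Bisection is a stand-in chosen for simplicity, not the algorithm from the cited papers, and all names are hypothetical.

```python
import random
from statistics import mean

def biased_var(x):
    """Variance estimator with divisor n (biased downward for small n)."""
    m = mean(x)
    return sum((xi - m) ** 2 for xi in x) / len(x)

def target_estimate(t_obs, n, n_sim=4000, lo=1e-6, hi=100.0, tol=1e-6, seed=0):
    """Solve g(theta) = t_obs, where g(theta) = E_theta[T] is approximated
    by averaging T over simulated normal samples of size n."""
    rng = random.Random(seed)
    # Common random numbers: the same standard-normal draws for every theta.
    z = [[rng.gauss(0, 1) for _ in range(n)] for _ in range(n_sim)]

    def g(theta):
        sd = theta ** 0.5
        return mean(biased_var([sd * zi for zi in row]) for row in z)

    # Bisection on [lo, hi]; g is increasing in theta for this estimator.
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if g(mid) < t_obs:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

# Hypothetical observed value T = 3.0 from a sample of size n = 4.
theta_tilde = target_estimate(3.0, n=4)
print(f"target estimate ~ {theta_tilde:.3f} (exact correction n/(n-1) * T = 4.0)")
```

In this toy case the answer is known in closed form, which makes it a convenient check; the simulation approach matters when $g(\theta)$ has no closed form, as with the distribution of the $s_g^2$.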

34 GO Ontology: Conditioning on log(n)
[Figure axes: Abs(T), Log(n).]

