Empirical Bayes Analysis of Variance Component Models for Microarray Data S. Feng, 1 R.Wolfinger, 2 T.Chu, 2 G.Gibson, 3 L.McGraw 4 1. Department of Statistics,


1 Empirical Bayes Analysis of Variance Component Models for Microarray Data
S. Feng (1), R. Wolfinger (2), T. Chu (2), G. Gibson (3), L. McGraw (4)
(1) Department of Statistics, North Carolina State University; (2) SAS Institute, Cary, NC 27513; (3) Department of Genetics, North Carolina State University; (4) Department of Genetics and Development, Cornell University.

2 Microarray: Genome-wide gene expression
1. Originally just a term in engineering: millions of small electrodes arrayed on a (silicon) slide.
2. Introduced to genetics/genomics in 1996: ... thousands of DNA sequences arrayed on a glass slide, so that genome-wide gene expression profiles can be investigated simultaneously.
At JSM 2004:
- More than 30 sections
- More than 100 statistics papers/posters

3 Two major types of Microarray: Oligonucleotide Array and cDNA Array
Source: Nature Cell Biology, Aug. 2001, v3 (8). More refs: Nature Reviews Genetics, May 2004, v5 (5).

4 Oligonucleotide Array
PM / MM: 16-20 probe pairs per probe set; 1 probe set per gene.
This is "per gene". The PM/MM effects are considered as fixed effects.
Refs: Chu, T., Weir, B., and Wolfinger, R. (2002, 2004). Lipshutz et al. (1999), Nature Genetics, 21(1): 20-24.

5 Statistical problems in Microarray?
- Planning of experiments: design, sample sizes.
- Quality control: variation in RNA samples, variation among arrays.
- Background subtraction.
- Normalization.
- Significance analysis.
- Multiple testing: P-values.
- Variable selection.
- Discrimination.
- Clustering: of samples, of genes.
- Time course experiments.
- Clinical trials (Merck, GSK ...).
- Gene networks, pathways.
Supervised vs. non-supervised. Statistical computing.
Terry Speed homepage: www.stat.berkeley.edu/users/terry/zarray/Html

6 Significance Analysis (challenge)
An example: oligonucleotide array (supervised). Data layout: genes (Gene 1, Gene 2, Gene 3, ..., Gene n) by treatments (Trt 1, Trt 2, ..., Trt k).
Question: For the contrasts of interest (e.g., Trt1 vs. Trt2), which genes are differentially (significantly) expressed?
Common problem in genome-wide studies: the "large p, small n" problem.
- Small n: number of replications; low statistical power.
- Large p: number of features (genes, probes, bio-markers ...); multiple-testing problems.
- Computation ...?

7 Data from McGraw Lab., Cornell Univ.
Research interest: to investigate the effects of different male Drosophila genotypes (5 lines) on post-mating gene expression in female flies.
Design (from the slide diagram):
- Male line1, line2, line3, line4, line5, each crossed (X) with females: 5 Trt.
- Female flies killed, mRNA prepared: preparations (1), (2), (3), (4), ..., (9), (10) (random effect 1).
- Chips: Chip1, Chip2, ..., Chip19, Chip20 (random effect 2).
- ~15000 genes; for each gene: PM / MM (fixed effect) (... random effect 3).

8 Data from McGraw Lab., Cornell
Indices:
- i: Trt1, Trt2, Trt3, Trt4, Trt5
- j: Prep1, Prep2
- k: Chip1, Chip2 ...
- l: 1, 2, 3, ..., 19, 20
Genes g: Gene 1, Gene 2, ..., Gene 15000.
Total: 5 x 2 x 2 x 20 = 400 points $y_{gijkl}$ for each gene g.
Variance components: $\sigma_{gij}$, $\sigma_{gijk}$, $\sigma_{gijkl}$.
Linear mixed model (for each gene g):
$$Y_{gijkl} = G_g + (G \times trt)_{gi} + (G \times Probe)_{gl} + (G \times trt \times prep)_{gij} + (G \times trt \times prep \times chip)_{gijk} + \gamma_{gijkl}$$

9 Significantly Expressed Genes: by SGA
Number of significantly expressed genes for the 10 possible contrasts:
Contrast         Bonferroni (.05)   F.D.R. (.05)
Trt1 vs. Trt2    0                  0
Trt1 vs. Trt3    0                  0
Trt1 vs. Trt4    0                  0
Trt1 vs. Trt5    0                  0
Trt2 vs. Trt3    0                  0
Trt2 vs. Trt4    0                  0
Trt2 vs. Trt5    0                  0
Trt3 vs. Trt4    0                  0
Trt3 vs. Trt5    0                  0
Trt4 vs. Trt5    0                  0
Possible reasons:
1. ... lower level analysis ...
2. Large p: multiple testing problems (15000 x 10 tests). FWR vs. FDR? (Not addressed in this study.)
3. Small n: low power in each single test:
   - poorly estimated VC;
   - small d.f. in testing ...
In this study, we try to improve the power of each single test.
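The two multiplicity corrections named in the table can be sketched as follows. This is a minimal illustration, not the authors' code: the F.D.R. column is assumed here to be the Benjamini-Hochberg step-up procedure (the slide does not say which FDR method was used), and the p-values in the example are made up.

```python
def bonferroni(pvals, alpha=0.05):
    """Reject H0 for every p-value at or below alpha / (number of tests)."""
    m = len(pvals)
    return [p <= alpha / m for p in pvals]


def benjamini_hochberg(pvals, alpha=0.05):
    """Benjamini-Hochberg step-up procedure (assumed FDR method)."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    # Largest rank r such that the r-th smallest p-value is <= r * alpha / m.
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if pvals[idx] <= rank * alpha / m:
            cutoff_rank = rank
    reject = [False] * m
    for idx in order[:cutoff_rank]:
        reject[idx] = True
    return reject


# Per-test Bonferroni threshold for 15000 genes x 10 contrasts (from the slide).
per_test_threshold = 0.05 / (15000 * 10)

# Hypothetical p-values, for illustration only.
example_p = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
n_bonferroni = sum(bonferroni(example_p))
n_bh = sum(benjamini_hochberg(example_p))
```

With 150000 tests the Bonferroni per-test threshold is about 3.3 x 10^-7, which a test with only a few degrees of freedom rarely reaches; this is consistent with the all-zero table above.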

10 Our Idea: Taking advantage of "large p"
Perform SGA (by mixed model) as a pilot analysis and obtain the VC estimates:
Gene 1 (VC1, VC2, VC3);
Gene 2 (VC1, VC2, VC3);
...
Gene 15000 (VC1, VC2, VC3).
[Plot: black: 15000 estimated VC1; red: 15000 estimated VC2; blue: 15000 estimated VC3.]
Note: this is not the "density" of each VC ... but the plot does contain useful "global" information on each VC (range, "HDR" ...). The "global" information is taken as the "prior".

11 Our Empirical Bayes Approach
A 7-step algorithm:
1. Apply SGA to get 15000 VC estimates;
2. Transform to the "ANOVA Components (AC)";
3. Apply Jeffreys' prior (non-informative);
4. Fit an Inverted Gamma (IG) to each AC (prior density);
-----------------------------------------------------------------------------
5. Derive the posterior density (and the posterior estimate) of each AC;
6. Transform the posterior estimate of AC back to VC (reverse step 2);
7. Mixed model analysis: fix the VC values at their posterior estimates (the EB estimators of VC), and approximate the null distribution of the test statistics by the standard normal distribution.
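Steps 4 and 5 can be illustrated with a deliberately simplified sketch. It is not the authors' implementation: it treats a single mean-square-type quantity per gene (assumed to satisfy df * MS / sigma^2 ~ chi-square with df degrees of freedom), fits the Inverted Gamma prior by moment matching directly to the observed mean squares, and skips the Jeffreys' prior step and the VC/AC transformations. The function names, the simulated data, and the moment-matching fit are all assumptions made for illustration.

```python
import random
from statistics import mean, variance


def fit_inverse_gamma(values):
    """Moment-matching fit of IG(shape, scale): mean = scale / (shape - 1),
    variance = mean^2 / (shape - 2). Assumed fitting method."""
    m, v = mean(values), variance(values)
    shape = m * m / v + 2.0
    scale = m * (shape - 1.0)
    return shape, scale


def posterior_mean(ms, df, shape, scale):
    """Posterior mean of sigma^2 under an IG(shape, scale) prior when
    df * ms / sigma^2 is chi-square(df): the posterior is
    IG(shape + df/2, scale + df*ms/2)."""
    return (scale + 0.5 * df * ms) / (shape + 0.5 * df - 1.0)


# Hypothetical data: 2000 genes, true variances drawn from IG(6, 0.05).
rng = random.Random(11)
df = 5
true_var = [1.0 / rng.gammavariate(6.0, 1.0 / 0.05) for _ in range(2000)]
# Per-gene mean squares: sigma^2 * chi-square(df) / df.
mean_squares = [s2 * rng.gammavariate(df / 2.0, 2.0) / df for s2 in true_var]

shape, scale = fit_inverse_gamma(mean_squares)
eb = [posterior_mean(ms, df, shape, scale) for ms in mean_squares]

mse_raw = mean([(ms - s2) ** 2 for ms, s2 in zip(mean_squares, true_var)])
mse_eb = mean([(e - s2) ** 2 for e, s2 in zip(eb, true_var)])
```

The posterior mean is a weighted average of the prior mean and the gene's own mean square, so each gene-specific estimate is pulled toward the value suggested by all genes; in this toy setting that shrinkage lowers the mean squared error relative to the raw per-gene estimate.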

12 Real Data Example: Cornell Data
Significance test: number of significant genes.
                 Bonferroni (.05)      False Discovery Rate (.05)
Contrast         S.G.A.    E.B.        S.G.A.    E.B.
Trt1 vs. Trt2    0         5           0         38
Trt1 vs. Trt3    0         24          0         91
Trt1 vs. Trt4    0         9           0         28
Trt1 vs. Trt5    0         7           0         48
Trt2 vs. Trt3    0         33          0         276
Trt2 vs. Trt4    0         22          0         209
Trt2 vs. Trt5    0         37          0         183
Trt3 vs. Trt4    0         50          0         184
Trt3 vs. Trt5    0         48          0         190
Trt4 vs. Trt5    0         9           0         68

13 Simulation Studies
Design: the structure mimics the true data. Parameters are set to the values estimated from the true data set.
For the 3 VC: $\sigma_{gij} = 0.01$, $\sigma_{gijk} = 0.015$, $\sigma_{gijkl} = 0.072$.
5500 genes are simulated, of which 500 are "significantly expressed" and 5000 are "non-significantly expressed", with Trt means:
                               Trt1   Trt2   Trt3   Trt4   Trt5
Significantly expressed        0      0.15   0.30   0.45   0.60
Non-significantly expressed    0      0      0      0      0
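One gene of this simulation design could be generated roughly as below. This is a sketch under stated assumptions: the three values 0.01, 0.015 and 0.072 are treated as variances (the slide does not say whether they are variances or standard deviations), all effects are drawn as normal, the probe effects are set to zero, and the function name and seed are hypothetical.

```python
import random

TRT_MEANS_SIG = [0.0, 0.15, 0.30, 0.45, 0.60]   # "significantly expressed" genes
TRT_MEANS_NULL = [0.0, 0.0, 0.0, 0.0, 0.0]      # "non-significantly expressed" genes


def simulate_gene(trt_means, rng, var_prep=0.01, var_chip=0.015,
                  var_resid=0.072, n_prep=2, n_chip=2, n_probe=20):
    """Simulate one gene under the nested design trt > prep > chip > probe.

    Returns a list of (i, j, k, l, y) tuples. Variance values are assumed
    to be variances; probe (fixed) effects are assumed to be zero.
    """
    rows = []
    for i, mu in enumerate(trt_means):
        for j in range(n_prep):
            prep = rng.gauss(0.0, var_prep ** 0.5)       # random effect at prep level
            for k in range(n_chip):
                chip = rng.gauss(0.0, var_chip ** 0.5)   # random effect at chip level
                for l in range(n_probe):
                    resid = rng.gauss(0.0, var_resid ** 0.5)
                    rows.append((i, j, k, l, mu + prep + chip + resid))
    return rows


rng = random.Random(2004)
gene = simulate_gene(TRT_MEANS_SIG, rng)
n_obs = len(gene)  # 5 x 2 x 2 x 20 observations
# Average response per treatment (80 observations each).
trt_avg = [sum(y for i, _, _, _, y in gene if i == t) / 80 for t in range(5)]
```

Each treatment mean is estimated from only two mRNA preparations, so the prep-level variance dominates the uncertainty of a contrast; this is the "small n" situation the slides describe.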

14 Simulation Results (1)
EB estimator vs. REML estimator: bias, variance and MSE.
            VC1 (0.01)             VC2 (0.015)            VC3 (0.072)
            SGA        EB          SGA        EB          SGA        EB
Bias        1.3x10^-3  4.3x10^-4   9.2x10^-4  1.4x10^-4   4.4x10^-5  4.2x10^-5
Variance    1.6x10^-4  4.6x10^-5   6.9x10^-5  2.1x10^-5   3.2x10^-5  8.0x10^-6
MSE         1.6x10^-4  4.6x10^-5   6.9x10^-5  2.1x10^-5   3.2x10^-5  8.0x10^-6
The bias, variance and MSE of EB are only fractions of those of SGA.
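The reported figures can be checked against the identity MSE = bias^2 + variance. The short sketch below uses only the rounded numbers from the slide (so results are approximate); the dictionary layout and helper names are illustrative.

```python
# (bias, variance, MSE) as reported on the slide, rounded to two digits.
REPORTED = {
    "VC1": {"SGA": (1.3e-3, 1.6e-4, 1.6e-4), "EB": (4.3e-4, 4.6e-5, 4.6e-5)},
    "VC2": {"SGA": (9.2e-4, 6.9e-5, 6.9e-5), "EB": (1.4e-4, 2.1e-5, 2.1e-5)},
    "VC3": {"SGA": (4.4e-5, 3.2e-5, 3.2e-5), "EB": (4.2e-5, 8.0e-6, 8.0e-6)},
}


def bias_share(bias, mse):
    """Fraction of the MSE contributed by the squared bias."""
    return bias ** 2 / mse


# Ratio of EB MSE to SGA MSE for each variance component.
mse_ratio = {vc: est["EB"][2] / est["SGA"][2] for vc, est in REPORTED.items()}

# Largest squared-bias contribution across all six table cells.
max_bias_share = max(
    bias_share(bias, mse)
    for est in REPORTED.values()
    for (bias, _var, mse) in est.values()
)
```

The squared bias contributes under 2% of the MSE in every cell, which is why the Variance and MSE rows agree to two significant digits, and the EB MSE is roughly 25% to 30% of the SGA MSE for all three components.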

15 Simulation Results (2): The null distribution of the test t statistics: 1. SGA (red, expected to be t distribution with df=5); 2. EB with df=30 (blue); 3. EB with df=1000 (green); 4. Truth (black, expected to be standard normal distribution ).
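The gap between a t reference distribution with 5 degrees of freedom and the standard normal can be seen with a small Monte Carlo sketch. It does not reproduce the authors' simulation: it only draws generic t-distributed statistics (a standard normal divided by the square root of an independent chi-square over its degrees of freedom) and counts how often they exceed the normal cutoff 1.96. The function name, seed, and number of draws are assumptions.

```python
import random


def tail_fraction(df, n_draws, rng, cutoff=1.96):
    """Fraction of simulated statistics with |stat| > cutoff.

    df=None draws standard normal statistics; otherwise a t statistic
    with `df` degrees of freedom is built as z / sqrt(chi2 / df).
    """
    hits = 0
    for _ in range(n_draws):
        z = rng.gauss(0.0, 1.0)
        if df is None:
            stat = z
        else:
            chi2 = rng.gammavariate(df / 2.0, 2.0)  # chi-square(df) draw
            stat = z / (chi2 / df) ** 0.5
        if abs(stat) > cutoff:
            hits += 1
    return hits / n_draws


rng = random.Random(15)
frac_t5 = tail_fraction(5, 20000, rng)       # variance estimated on 5 d.f.
frac_t30 = tail_fraction(30, 20000, rng)
frac_normal = tail_fraction(None, 20000, rng)  # variance known
```

A statistic built on a variance estimated with 5 d.f. exceeds 1.96 roughly twice as often as a standard normal one, so a test based on it needs a larger critical value and loses power; a better variance estimate moves the null distribution toward the normal curve.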

16 Simulation Results (3): Test Size and Power Calculation
Size:
Contrast (Trt. diff)   SGA      EB(30)   EB(1000)   Truth
T1 vs. T2 (0.15)       0.047    0.044    0.051      0.044
T1 vs. T3 (0.30)       0.051    0.051    0.060      0.053
T1 vs. T4 (0.45)       0.050    0.051    0.057      0.047
T1 vs. T5 (0.60)       0.054    0.047    0.055      0.048
T2 vs. T3 (0.15)       0.051    0.054    0.063      0.054
T2 vs. T4 (0.30)       0.048    0.051    0.060      0.049
T2 vs. T5 (0.45)       0.050    0.048    0.057      0.052
T3 vs. T4 (0.15)       0.050    0.051    0.060      0.051
T3 vs. T5 (0.30)       0.049    0.047    0.055      0.050
T4 vs. T5 (0.15)       0.048    0.046    0.054      0.049
Mean                   0.0498   0.0490   0.0572     0.0497
Power (% of the Truth power):
Contrast (Trt. diff)   SGA             EB(30)          EB(1000)         Truth
T1 vs. T2 (0.15)       0.116 (66.7%)   0.172 (98.9%)   0.176 (101.1%)   0.174 (100%)
T1 vs. T3 (0.30)       0.418 (70.1%)   0.558 (93.6%)   0.596 (100%)     0.596 (100%)
T1 vs. T4 (0.45)       0.692 (76.5%)   0.870 (96.2%)   0.894 (98.9%)    0.904 (100%)
T1 vs. T5 (0.60)       0.890 (91.8%)   0.964 (99.4%)   0.972 (100.2%)   0.970 (100%)
T2 vs. T3 (0.15)       0.126 (67.7%)   0.164 (88.2%)   0.188 (101.1%)   0.186 (100%)
T2 vs. T4 (0.30)       0.388 (68.3%)   0.530 (93.3%)   0.554 (97.5%)    0.568 (100%)
T2 vs. T5 (0.45)       0.672 (83.4%)   0.772 (95.8%)   0.796 (98.6%)    0.806 (100%)
T3 vs. T4 (0.15)       0.158 (81.4%)   0.194 (100%)    0.202 (104.1%)   0.194 (100%)
T3 vs. T5 (0.30)       0.380 (78.2%)   0.446 (91.8%)   0.468 (96.3%)    0.486 (100%)
T4 vs. T5 (0.15)       0.120 (75.0%)   0.144 (90.0%)   0.156 (97.5%)    0.160 (100%)
Mean                   75.91%          94.72%          99.53%           100%

17 Discussion
Why does the EB estimator "beat" the REML estimator?
- The prior density contains "truth" information.
However, the prior is estimated from the data:
- "large p": the prior is likely to be good!
- "small p": the prior may not be good ... for "small p", the EB estimator can be biased, and a gain in MSE is not guaranteed!
Q: How to control (large p vs. small p)?
(Controlling system for the EB method: determine the shrinkage process to get the maximum gain in MSE.)

18 Applications
Especially designed for microarray data:
- Microarray (cDNA, oligonucleotide);
- Proteomics.
Extension to general data sets (mixed model), if a controlling system is built (in the near future).

