An Analysis of MicroArray Quality Control Data
  James J. Chen, Ph.D.
  Division of Biometry and Risk Assessment, National Center for Toxicological Research, U.S. Food and Drug Administration
  2006 FDA and Industry Workshop, September 29, 2006
  The views expressed in this presentation do not represent those of the U.S. Food and Drug Administration.
Outline
  Background: MAQC experimental design and data
  Microarray platform comparisons
    Inter-platform analysis
    Intra-platform analysis and platform performance
      concordance, site effects, consistency, discriminability
      sensitivity, specificity, and accuracy in gene selection
      self-consistency of titration mixture
  TaqMan and microarray platform comparability
  Conclusion
MicroArray Quality Control Project
  Objective: To compare expression data generated at multiple test sites (labs) using several microarray-based and alternative technology platforms.
  Microarray platforms:
    Applied Biosystems   ABI (1)
    Affymetrix           AFX (1)
    Agilent              AGI (1, 2)
    Eppendorf            EPP (1)
    GE Healthcare        GEH (1)
    Illumina             ILM (1)
    NCI_Operon           NCI (2)
  Alternative platforms:
    Applied Biosystems   (TAQ)
    Panomics             (QGN)
    Gene Express         (GEX)
  Reference: Nature Biotechnology 24(9), September 2006.
MAQC Experimental Design
  Four RNA samples:
    Sample A: Universal human reference RNA (Stratagene)
    Sample B: Human brain reference RNA (Ambion)
    Sample C: 75% A + 25% B
    Sample D: 25% A + 75% B
  Three sites for each microarray platform (NCI: 2 sites)
  One site each for TAQ, QGN, and GEX
  Five technical replicates for each microarray platform
  Four replicates for TAQ; three replicates for QGN and GEX
  EPP: 294 target genes; QGN: 245; GEX: 205
MAQC Data Used for Comparisons
  Platform    ABI      AFX      AGI      GEH      ILM      TAQ
  Probe       32,878   54,675   43,931   54,359   47,293   1,004
  Site        3        3        3        3        3        1
  Sample      4        4        4        4        4        4
  Array [2]   58       60       56       60       59       N/A
  Rep [1]     5        5        5        5        5        4
  [1] Technical replicates. [2] A total of 293 arrays.
  12,091 genes are common to the microarray platforms; 906 TAQ genes are among these 12,091 genes.
Hierarchical clustering of the 293 arrays on 12,091 genes, based on all pairwise correlations between two arrays.
Concordance: all pairwise inter-platform sample correlation coefficients between two arrays from different platforms. Up to 2250 (10 x 15 x 15) correlations computed for each sample.
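The counting behind the 2250 figure can be sketched in a few lines. This is a minimal illustration, not the presenter's code: the function names and the dictionary layout are assumptions, and the arrays are assumed to hold log intensities over the same set of common genes.

```python
from itertools import combinations
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def inter_platform_correlations(arrays_by_platform):
    """Correlate every array on one platform with every array on another.

    arrays_by_platform (hypothetical layout): dict mapping a platform name
    to a list of arrays, each array a list of values over the common genes.
    Returns a dict keyed by platform pair with the list of correlations.
    """
    result = {}
    for p, q in combinations(sorted(arrays_by_platform), 2):
        result[(p, q)] = [pearson(a, b)
                          for a in arrays_by_platform[p]
                          for b in arrays_by_platform[q]]
    return result

# Counting check for one sample: 5 platforms give 10 platform pairs, and each
# platform has up to 3 sites x 5 replicates = 15 arrays for that sample.
n_pairs = len(list(combinations(["ABI", "AFX", "AG1", "GEH", "ILM"], 2)))
n_correlations = n_pairs * 15 * 15
```

The count is an upper bound ("up to") because some platforms contributed fewer than 15 arrays per sample.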
Concordance: all pairwise inter-platform fold-change correlation coefficients between two arrays from different platforms. 90 (10 x 3 x 3) correlations for each fold-change.
Cross-Platform Consistency
  Proportion of genes showing a significant platform*sample interaction in the gene-by-gene ANOVA:
    y = μ + P + Sample + P*Sample + ε
  Significant interaction: the patterns of expression of the four samples are inconsistent across the platforms.
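As a rough sketch of the interaction test for a single gene, the F statistic for the P*Sample term in a balanced two-way layout can be computed directly from cell means. The function name and data layout are assumptions made for illustration, and the sketch ignores any site structure within a platform. A p-value would come from comparing F with the F distribution on the returned degrees of freedom (for example with scipy.stats.f.sf).

```python
def interaction_f(data):
    """F statistic for the interaction term in a balanced two-way ANOVA.

    data[i][j] is the list of replicate values for level i of factor 1
    (e.g. platform) and level j of factor 2 (e.g. sample). Every cell must
    hold the same number (>= 2) of replicates.
    Returns (F, df_interaction, df_error).
    """
    a, b, n = len(data), len(data[0]), len(data[0][0])

    def mean(values):
        return sum(values) / len(values)

    cell = [[mean(data[i][j]) for j in range(b)] for i in range(a)]
    row = [mean(cell[i]) for i in range(a)]
    col = [mean([cell[i][j] for i in range(a)]) for j in range(b)]
    grand = mean(row)

    # Interaction sum of squares: what is left of the cell means after
    # removing the additive row and column effects.
    ss_int = n * sum((cell[i][j] - row[i] - col[j] + grand) ** 2
                     for i in range(a) for j in range(b))
    # Error sum of squares: replicate scatter around each cell mean.
    ss_err = sum((y - cell[i][j]) ** 2
                 for i in range(a) for j in range(b) for y in data[i][j])

    df_int, df_err = (a - 1) * (b - 1), a * b * (n - 1)
    return (ss_int / df_int) / (ss_err / df_err), df_int, df_err
```

A gene whose sample profile has the same shape on every platform gives F near 0; a gene whose ordering of samples flips between platforms gives a large F.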
Plot of the p-values versus ranking proportions (axes: proportion, log10 p). The proportion of significances is 30% at α = 0.01.
Intra-Platform Analysis
  Concordance: all pairwise correlations between two arrays from different sites for samples A, B, C, and D (3 x 5 x 5 correlations).
  Site effects: ANOVA: y = μ + Sample + Site + Sample*Site + ε
    Site effect: the variance ratio, F = MSE_site / MSE_ε
    Consistency: the proportion of genes shown to have a significant Sample*Site interaction at significance level α.
  Discriminability: ANOVA: y = μ + Sample + ε
    Variability: residual mean square (total variation other than sample differences).
    Discriminability: the proportion of genes shown to have significant sample effects at significance level α.
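The discriminability model for one gene is a one-way ANOVA on sample. A minimal sketch, with a hypothetical function name, returns both the F statistic for the sample effect and the residual mean square used as the variability measure:

```python
def one_way_anova(groups):
    """Return (F, MSE) for the one-way model y = mean + group + error.

    groups: list of lists, one list of replicate values per sample
    (e.g. the arrays for samples A, B, C, and D on one platform).
    """
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]

    # Between-sample variation (k - 1 degrees of freedom).
    ss_between = sum(len(g) * (m - grand) ** 2 for g, m in zip(groups, means))
    # Residual variation: everything other than sample differences.
    ss_within = sum((y - m) ** 2 for g, m in zip(groups, means) for y in g)

    mse = ss_within / (n_total - k)
    return (ss_between / (k - 1)) / mse, mse
```

Counting the genes whose F exceeds the critical value at the chosen level, and dividing by the number of genes, would give the discriminability proportion.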
Individual Platform Performance
          Reproducibility and Consistency              Performance
          Median Correlation            Site   Consy   MSE    Discrty
          A      B      C      D        F_m
  AFX     .988   .988   .991   .992     24.    .012    .066   .618
  ABI     .968   .964   .972   .969     15.    .008    .107   .620
  AG1     .978   .982   .982   .981     28.    .063    .090   .633
  ILM     .980   .979   .980   .981     242.   .020    .266   .441
  GEH     .925   .904   .872   .862     64.    .097    .267   .453
Gold Standard Set
  A gene is differentially expressed if it was shown to be significant in at least 2 of the 5 platforms at α = 10^-5.
    H0: μ_A - μ_B = 0 versus H1: μ_A - μ_B ≠ 0 (8265 genes were selected)
  A gene is non-differentially expressed if its fold change was shown to be between 0.90 and 1/0.90 in at least 2 of the 5 platforms at α = 10^-3.
    Let δ = -log2(0.90). Equivalence test: H0: |μ_A - μ_B| > δ versus H1: |μ_A - μ_B| < δ (498 genes were selected)
  Gold standard: 8607 genes (the 78 genes that appear in both lists are deleted from each: 8265 + 498 - 2 x 78 = 8607)
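The arithmetic on this slide can be checked with a short sketch. The variable and function names are illustrative only, and the range check works on a point estimate of the log2 fold change; the equivalence test on the slide is a formal test, which this sketch does not implement.

```python
from math import log2

# Equivalence margin on the log2 scale: a fold change between 0.90 and
# 1/0.90 corresponds to |log2 FC| < delta.
delta = -log2(0.90)

# Gold-standard size: the slide's total matches removing the 78 overlapping
# genes from both the differential and the non-differential list.
n_de, n_non_de, n_overlap = 8265, 498, 78
n_gold = (n_de - n_overlap) + (n_non_de - n_overlap)

def in_equivalence_range(log2_fc, margin=delta):
    """True if a log2 fold-change estimate lies strictly inside the margin."""
    return abs(log2_fc) < margin
```

Here delta is about 0.152, so a gene with an estimated fold change of 0.95 (log2 about -0.074) falls inside the range, while one with a fold change of 0.80 does not.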
Accuracy (AC), sensitivity (SN), specificity (SP), and FDR, with FWE = 0.05* and FDR = 0.05 as thresholds.
          FWE = 0.05*                   FDR = 0.05
          AC     SN     SP     FDR      AC     SN     SP     FDR
  AFX     .77    .76    .95    .004     .92    .94    .55    .024
  ABI     .74    .73    .95    .004     .89    .91    .59    .023
  AG1     .81    .80    .80    .003     .92    .94    .55    .024
  ILM     .55    .53    1.0    .000     .88    .88    .95    .023
  GEH     .54    .52    .95    .005     .82    .82    .69    .019
  * α = 0.05/8607 = 5.8 x 10^-6
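For reference, the four measures can be computed from a confusion table against the gold standard. The slide does not spell out its definitions, so the sketch below assumes the usual ones; the function name and the counts in the usage note are hypothetical, not MAQC results.

```python
def selection_metrics(tp, fp, tn, fn):
    """Standard classification measures from confusion counts.

    tp: gold-standard differential genes that were selected
    fp: gold-standard non-differential genes that were selected
    tn: gold-standard non-differential genes that were not selected
    fn: gold-standard differential genes that were not selected
    """
    selected = tp + fp
    return {
        "AC": (tp + tn) / (tp + fp + tn + fn),        # accuracy
        "SN": tp / (tp + fn),                         # sensitivity
        "SP": tn / (tn + fp),                         # specificity
        "FDR": fp / selected if selected else 0.0,    # false discovery rate
    }

# Bonferroni-style per-gene level behind the FWE = 0.05 column.
alpha_fwe = 0.05 / 8607
```

With made-up counts tp=80, fp=5, tn=95, fn=20, the accuracy is 175/200, the sensitivity 80/100, the specificity 95/100, and the FDR 5/85.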
Comment on MAQC: Gene Selection
  The MAQC project used technical replicates (small variance) with two distinct biological samples (large difference).
  The number of differentially expressed genes is much larger than in typical microarray experiments.
  Generating a gene list is not the problem; the problem is determining the number of genes in the list.
  General principle: to identify a list of differentially expressed genes as accurately as possible.
Reproducibility of lists of differentially expressed genes: Percentage of Overlapping Genes (POG)
  For AFX, 6319 genes have p 2. For ABI, 6127 genes have p 2.
  More than 4,000 genes can be selected with an FDR estimate less than 2/4,000.
  Source: MAQC supplementary Fig. S2.
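A minimal sketch of the POG measure, assuming it is the share of one gene list that also appears in a second list. The slide does not give the exact definition, and the gene identifiers in the usage note are placeholders.

```python
def pog(list_a, list_b):
    """Percentage of genes in list_a that also appear in list_b.

    When both lists have the same length (e.g. the top-k genes from two
    platforms or sites), the result is symmetric in the two lists.
    """
    a, b = set(list_a), set(list_b)
    if not a:
        raise ValueError("list_a must contain at least one gene")
    return 100.0 * len(a & b) / len(a)
```

For example, the lists ["g1", "g2", "g3", "g4"] and ["g2", "g3", "g5", "g6"] share two of four genes, which gives a POG of 50.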
Assessment of Titration Trend
  Titration correlations: 0.75A + 0.25B and C; 0.25A + 0.75B and D
  Titration model (a two-step test):
    The titration relationship can be modelled by M1t: y = μ + Conc + Site + ε
    Full ANOVA model M1: y = μ + Sample + Site + ε
    S1: Test for sample difference under M1: H0_t1: A = B = C = D
    S2: Test for goodness of fit: H0_t2: M1t = M1
  Proportion of genes that reject H0_t1 and accept H0_t2
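The titration correlation compares the observed C and D values with the mixtures expected from A and B. The sketch below assumes the mixing is done on the intensity scale (not on logs) and uses hypothetical function names; the slide does not state which scale was used.

```python
from math import sqrt

def expected_titration(a, b, frac_a):
    """Expected per-gene intensity of a mixture with fraction frac_a of
    sample A and (1 - frac_a) of sample B."""
    return [frac_a * x + (1.0 - frac_a) * y for x, y in zip(a, b)]

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((u - mx) * (v - my) for u, v in zip(x, y))
    sxx = sum((u - mx) ** 2 for u in x)
    syy = sum((v - my) ** 2 for v in y)
    return sxy / sqrt(sxx * syy)

def titration_correlations(a, b, c_obs, d_obs):
    """Correlation of observed C with 0.75A + 0.25B and of observed D with
    0.25A + 0.75B, over genes."""
    return (pearson(expected_titration(a, b, 0.75), c_obs),
            pearson(expected_titration(a, b, 0.25), d_obs))
```

If the observed C and D values equal the expected mixtures exactly, both correlations are 1; measurement noise and nonlinearity pull them below 1.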
Linear Titration Model
  Figure panels: H0_t1 accepted; H0_t1 rejected and H0_t2 accepted; H0_t1 rejected and H0_t2 rejected.
Titration correlation for samples C and D, and the proportions of the genes that follow the titration relationship.
          Correlation            Titration model
          Sample C   Sample D    (5%, 5%)   (1%, 1%)
  AFX     .909       .911        .963       .976
  ABI     .916       .928        .954       .967
  AG1     .930       .939        .923       .944
  ILM     .930       .936        .937       .954
  GEH     .923       .934        .988       .988
TaqMan and microarray platform concordance: box plots of all pairwise sample correlation coefficients. 60 (4 x 15) correlations computed in each sample.
TaqMan and microarray platform concordance: box plots of fold-change (B/A) correlation coefficients.
Consistency of TaqMan and Microarray Platforms
  Proportions of significances, TaqMan and microarray platforms: 0.72, 0.57, 0.49, 0.65, 0.39
  Proportion of significances, microarray platforms: 0.30
Conclusion (1)
  Inter-platform (microarray and TaqMan):
    Concordance
      Sample correlations: 0.45 (D) to 0.82 (A)
      FC correlations: higher for B/A; lower for C/A
    Inconsistency
      Microarray platforms: Thirty percent (30%) of genes show inconsistent expression patterns at α = 0.01.
      TaqMan and microarray platforms: The proportions are between 0.34 and 0.74 for the five platforms.
    Comparability
      Intensities measured by different microarray platforms, and intensities measured by microarray and TaqMan platforms, are different.
Conclusion (2)
  Titration trend
    Titration correlation: The correlations between observed intensity and expected intensity are more than 90%.
    Titration trend: All five platforms follow the linear titration relationship well.
  Intra-platform microarray performance
    Concordance: Intra-platform correlations are high.
    Site effect: All platforms show site effects.
    Consistency: The patterns of expression are consistent across three sites.