Published by Devin Lee. Modified over 3 years ago.

1
Selection of Differential Expression Genes in Microarray Experiments
James J. Chen, Ph.D.
Division of Biometry and Risk Assessment, National Center for Toxicological Research, Food and Drug Administration
E-mail: Jchen@nctr.fda.gov
FDA/Industry Workshop, September 19, 2003

2
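Analysis of Microarray Data
Class comparison: identifying differentially expressed genes.
Class prediction: association between genes and samples; selecting a minimal combination of genes (classification).
Class discovery: discovering sample sub-types or gene clusters; selecting genes with similar expression patterns (cluster analysis).
Data layout (rows are samples, columns are genes):
| Sample | $g_1$ | $g_2$ | $g_3$ | ... | $g_m$ |
| --- | --- | --- | --- | --- | --- |
| $S_1$ | $y_{11}$ | $y_{21}$ | $y_{31}$ | ... | $y_{m1}$ |
| $S_2$ | $y_{12}$ | $y_{22}$ | $y_{32}$ | ... | $y_{m2}$ |
| ... | ... | ... | ... | ... | ... |
| $S_n$ | $y_{1n}$ | $y_{2n}$ | $y_{3n}$ | ... | $y_{mn}$ |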

3
Identifying Differentially Expressed Genes An important goal in the data analysis is to identify a set of genes that are differentially expressed among control and treated samples (groups). To identify disease-related, drug-response, or biomarker genes (class comparison). To enhance relationships among genes and samples for clustering or prediction (class prediction or class discovery).

4
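Ranking Genes
The normalized data are analyzed one gene at a time (when there is a sufficient number of replicates $n$) using statistical methods: ANOVA, permutation tests, ROC, etc. Each gene $g_i$ receives a p-value $p_i$ and a rank $r_i$.
| Sample | $g_1$ | $g_2$ | $g_3$ | ... | $g_m$ |
| --- | --- | --- | --- | --- | --- |
| $S_1$ | $y_{11}$ | $y_{21}$ | $y_{31}$ | ... | $y_{m1}$ |
| $S_2$ | $y_{12}$ | $y_{22}$ | $y_{32}$ | ... | $y_{m2}$ |
| ... | ... | ... | ... | ... | ... |
| $S_n$ | $y_{1n}$ | $y_{2n}$ | $y_{3n}$ | ... | $y_{mn}$ |
| Rank (p-value) | $r_1$ ($p_1$) | $r_2$ ($p_2$) | $r_3$ ($p_3$) | ... | $r_m$ ($p_m$) |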

5
P-value Approaches to Gene Selection
The genes on the array are a mixture of altered and unaltered genes; altered genes should have smaller p-values. How should a cut-off be chosen?
P-value for gene ranking: use p-values to rank the genes in order of evidence for differential expression, $p_{(1)} \le \dots \le p_{(m)}$ (an ordered evidence of differences).
Determining the cut-off: fixed p-value, number of rejections, estimated number of altered genes, or a decision rule (ROC).
Multiple testing issue: FWE or FDR approach.

6
Approaches to Multiplicity Testing
Two approaches to multiplicity testing:
Family-wise error (FWE) rate approach: controlling the probability of false rejection of unaltered genes among all hypotheses (genes in the array) tested.
False discovery rate (FDR) approach: estimating the probability of false rejection of unaltered genes among the rejected hypotheses (significant genes).

7
Testing m hypotheses
| True state | Significant | Non-significant | Total |
| --- | --- | --- | --- |
| Unaltered | $V$ | $S$ | $m_0$ |
| Altered | $U$ | $T$ | $m_1$ |
| Total | $R$ | $m - R$ | $m$ |
The number of true null hypotheses $m_0$ is fixed but unknown. $V$ and $U$ are unobservable; $R = U + V$ is observable.
The FWE is the probability $\Pr(V > 0)$.
The FDR is $E(V/R)$ (rejected unaltered genes as a proportion of the significant genes).

8
P-Value FWE Approach
FWE: the probability of rejecting at least one true null hypothesis in the given family of hypotheses.
Bonferroni adjustment: set the comparison-wise error (CWE) rate at $\alpha/m$; then FWE $\le \alpha$.
Improvements:
Holm (Scand. J. Statist., 1979) step-down procedure: $(m\,p_{(1)},\ (m-1)\,p_{(2)},\ (m-2)\,p_{(3)},\ \dots)$.
Estimating the number of unaltered genes $m_0$: set CWE $= \mathrm{FWE}/m_0$, giving $(m_0\,p_{(1)},\ m_0\,p_{(2)},\ m_0\,p_{(3)},\ \dots)$.
When $m_0 \ll m$, this is a great improvement.
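
The three adjustments on this slide can be compared side by side. The following is a minimal sketch, not the speaker's code: the function names are hypothetical, and the Holm version enforces monotone adjusted p-values with a running maximum, which is the usual convention rather than something stated on the slide.

```python
def bonferroni(pvals):
    """Bonferroni-adjusted p-values: m * p, capped at 1."""
    m = len(pvals)
    return [min(1.0, m * p) for p in pvals]


def holm(pvals):
    """Holm step-down adjusted p-values: (m - j + 1) * p_(j), made monotone."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):  # rank 0 is the smallest p-value
        running_max = max(running_max, (m - rank) * pvals[i])
        adjusted[i] = min(1.0, running_max)
    return adjusted


def m0_adjusted(pvals, m0):
    """Adjustment using an estimate m0 of the number of unaltered genes."""
    return [min(1.0, m0 * p) for p in pvals]
```

A gene is declared significant at a given FWE level when its adjusted p-value does not exceed that level.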

9
Estimating Number of True Nulls
Difference of two adjacent p-values: $d_j = p_{(j)} - p_{(j-1)}$, $j = 1, \dots, m+1$, with $p_{(0)} = 0$ and $p_{(m+1)} = 1$.
Under independence and $H_0$, $d_j \sim \mathrm{Beta}(1, m_0)$ with mean $E(d_j) = 1/(m_0 + 1)$.
An estimate of $m_0$ is $\hat m_0^{MD} = 1/\bar d - 1 \approx 1/E(d) - 1$.
Graphic algorithm to estimate $m_0$: Benjamini and Hochberg (J. Educ. Behav. Stat., 2000); Hsueh et al. (J. Biopharm. Stat., 2003).
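
A small sketch of the mean-difference idea follows. Averaging all $m+1$ differences always gives exactly $1/(m+1)$, because the differences sum to 1, so a usable estimator has to average over a subset. The version below averages only the differences among the `k` largest p-values, which are the ones most likely to come from unaltered genes. Both the function name and the tuning parameter `k` are assumptions made for illustration; the slide does not say which differences the cited graphical algorithm uses.

```python
def estimate_m0_md(pvals, k):
    """Mean-difference estimate of m0 from the upper tail of the p-values.

    Uses the k + 1 differences d_j for j = m - k + 1, ..., m + 1.
    They telescope to 1 - p_(m-k), so their mean is (1 - p_(m-k)) / (k + 1).
    k is a hypothetical tuning choice, not taken from the slide.
    """
    p = sorted(pvals)
    m = len(p)
    if not 0 < k <= m:
        raise ValueError("k must be between 1 and the number of p-values")
    lower = p[m - k - 1] if k < m else 0.0  # p_(m-k), with p_(0) = 0
    mean_diff = (1.0 - lower) / (k + 1)
    if mean_diff == 0.0:
        return float(m)
    return min(float(m), 1.0 / mean_diff - 1.0)
```

With evenly spaced null p-values the estimate recovers the number of nulls; with real data the answer will depend on `k`.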

10
Simulation results for the $\hat m_0^{MD}$ estimator for $m = 1{,}000$, based on 10,000 replicates.
Estimation: the effect size is set to have 80% power at FWE = 0.25. Means and standard deviations (s.d.):
| $m_0$ | Independent: mean | Independent: s.d. | Correlated ($\rho = .25$): mean | Correlated ($\rho = .25$): s.d. |
| --- | --- | --- | --- | --- |
| 1000 | 999.35 | 10.89 | 992.30 | 42.29 |
| 900 | 904.43 | 3.43 | 899.16 | 36.47 |
| 700 | 709.40 | 5.26 | 703.13 | 37.07 |
Testing: empirical familywise error rates at FWE = 0.05, 0.10, 0.25.
| $m_0$ | Independent: 0.05 | Independent: 0.10 | Independent: 0.25 | Correlated ($\rho = .25$): 0.05 | Correlated ($\rho = .25$): 0.10 | Correlated ($\rho = .25$): 0.25 |
| --- | --- | --- | --- | --- | --- | --- |
| 1000 | 0.049 | 0.098 | 0.223 | 0.039 | 0.071 | 0.151 |
| 900 | 0.049 | 0.095 | 0.224 | 0.040 | 0.070 | 0.142 |
| 700 | 0.047 | 0.090 | 0.213 | 0.039 | 0.070 | 0.142 |

11
P-value FDR Methods
FDR: the expected proportion of falsely rejected null hypotheses among the rejected hypotheses.
FDR-controlled (BH, 1995): q-value $= m\,p_{(r)}/r <$ FDR.
Fixed CWE $= \alpha$ (Storey, 2002): estimate pFDR.
Fixed $R = r$ (Tsai, 2003): estimate cFDR $= E(V \mid R = r)/r$. The expected number of false significances is $r \times$ cFDR.
FDRs depend on the distribution of $R$ and the conditional distribution of $V \mid R$:
FDR $=$ pFDR $\cdot \Pr(R > 0) = \sum_{r > 0} \mathrm{cFDR}(r)\,\Pr(R = r)$.
Chen (ICSA Bulletin, 2003)
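
The FDR-controlled rule can be sketched as a step-up search for the largest $r$ whose q-value stays within the target FDR. This is an illustrative sketch with a hypothetical function name. It uses the non-strict comparison of the standard Benjamini-Hochberg procedure, whereas the slide writes a strict inequality, and the optional `m0` argument (an assumption here) replaces $m$ with an estimate of the number of unaltered genes.

```python
def bh_rejections(pvals, fdr, m0=None):
    """Number of rejections r under the BH step-up rule.

    Finds the largest r with m * p_(r) / r <= fdr. If m0 is given,
    it is used in place of m (an adaptive variant).
    """
    p = sorted(pvals)
    m_eff = len(p) if m0 is None else m0
    r = 0
    for i, pv in enumerate(p, start=1):
        if m_eff * pv / i <= fdr:
            r = i  # keep the largest index that passes (step-up)
    return r
```

The genes with the `r` smallest p-values are then called significant; using a smaller `m0` can only increase `r`.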

12
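Distribution of $R$ and the cFDR for $m = 1000$ and $m_0 = 900$ at $\alpha = .01$, with effect size 2 under the alternative. Assume a paired t-test with five replicated arrays.
| $r$ | $\Pr(R=r)$ | cFDR | $r$ | $\Pr(R=r)$ | cFDR | $r$ | $\Pr(R=r)$ | cFDR |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 68 | .0009 | .0748 | 79 | .0509 | .0947 | 90 | .0369 | .1231 |
| 69 | .0016 | .0763 | 80 | .0592 | .0969 | 91 | .0289 | .1262 |
| 70 | .0026 | .0779 | 81 | .0664 | .0992 | 92 | .0218 | .1293 |
| 71 | .0042 | .0795 | 82 | .0719 | .1015 | 93 | .0158 | .1326 |
| 72 | .0065 | .0812 | 83 | .0750 | .1039 | 94 | .0111 | .1359 |
| 73 | .0097 | .0830 | 84 | .0756 | .1064 | 95 | .0075 | .1393 |
| 74 | .0140 | .0848 | 85 | .0734 | .1090 | 96 | .0049 | .1428 |
| 75 | .0195 | .0866 | 86 | .0688 | .1117 | 97 | .0031 | .1463 |
| 76 | .0261 | .0885 | 87 | .0622 | .1144 | 98 | .0019 | .1500 |
| 77 | .0338 | .0905 | 88 | .0542 | .1172 | 99 | .0011 | .1537 |
| 78 | .0422 | .0926 | 89 | .0455 | .1201 | 100 | .0006 | .1574 |
Unconditional estimates: FDR = .1067, pFDR = .1067, mFDR = .1075.
Conditional at $E(R) = 83.7 \approx 84$ (mode): cFDR = .1064, eFDR = .1071.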

13
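FDR, pFDR, cFDR, and mFDR at $\alpha = .01$ and $.001$; $m = 100$ and $1000$, with $F_0$, $F_1$ under independence. The cFDR values are evaluated at $[E(R)+1]$.
| $m$ | $m_0$ | FDR ($\alpha=.01$) | pFDR ($\alpha=.01$) | cFDR ($\alpha=.01$) | mFDR ($\alpha=.01$) | FDR ($\alpha=.001$) | pFDR ($\alpha=.001$) | cFDR ($\alpha=.001$) | mFDR ($\alpha=.001$) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 100 | 50 | .0257 | .0257 | .0261 | .0262 | .0071 | .0071 | .0071 | .0072 |
| 100 | 80 | .0933 | .0933 | .0960 | .0971 | .0258 | .0271 | .0270 | .0282 |
| 100 | 90 | .1824 | .1831 | .1857 | .1948 | .0462 | .0583 | .0586 | .0613 |
| 100 | 95 | .3012 | .3129 | .3119 | .3380 | .0650 | .1147 | .1163 | .1212 |
| 100 | 100 | .6340 | 1. | 1. | 1. | .0952 | 1. | 1. | 1. |
| 1000 | 500 | .0261 | .0261 | .0261 | .0262 | .0072 | .0072 | .0072 | .0072 |
| 1000 | 800 | .0967 | .0967 | .0969 | .0971 | .0281 | .0281 | .0282 | .0282 |
| 1000 | 900 | .1935 | .1935 | .1946 | .1948 | .0608 | .0608 | .0609 | .0613 |
| 1000 | 950 | .3351 | .3351 | .3383 | .3380 | .1193 | .1194 | .1194 | .1212 |
| 1000 | 1000 | .9999 | 1. | 1. | 1. | .6324 | 1. | 1. | 1. |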

14
Conditional Distribution of $V \mid R = r$
Given $m_0$ and $\alpha$, the number of rejections is $R = V + U$, where $V \sim \mathrm{Bin}(m_0, \alpha)$ and $U \sim \mathrm{Bin}(m_1, 1-\beta)$.
The conditional distribution of $V \mid R = r$ is a non-central hypergeometric distribution.
The cFDR $= E(V \mid R = r)/r$ is estimated from the mean of $V \mid R$. It can also be computed from the distribution of $R$.
To estimate cFDR: use $\hat m_0^{MD}$ and the distribution of $R$ (parametric or bootstrap method).
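
Under the two-binomial model on this slide, the distribution of $R$ and the cFDR can be computed by direct convolution. The sketch below is an illustration under stated assumptions, not the method used for the tables in this talk: the function names are hypothetical, independence between genes is assumed, and `power` stands for $1-\beta$ and must be supplied by the user.

```python
from math import exp, lgamma, log


def binom_pmf(n, p):
    """Binomial(n, p) probabilities for k = 0..n, computed on the log scale."""
    return [
        exp(lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
            + k * log(p) + (n - k) * log(1 - p))
        for k in range(n + 1)
    ]


def rejection_distribution(m0, m1, alpha, power):
    """Return (Pr(R = r), cFDR(r)) for r = 0..m0 + m1.

    V ~ Bin(m0, alpha) counts false rejections, U ~ Bin(m1, power) counts
    true rejections, and R = V + U. Requires 0 < alpha < 1 and 0 < power < 1.
    """
    pv = binom_pmf(m0, alpha)
    pu = binom_pmf(m1, power)
    m = m0 + m1
    pr_r = [0.0] * (m + 1)
    ev_r = [0.0] * (m + 1)  # accumulates v * Pr(V = v, R = r)
    for v, a in enumerate(pv):
        for u, b in enumerate(pu):
            pr_r[v + u] += a * b
            ev_r[v + u] += v * a * b
    cfdr = [0.0] * (m + 1)
    for r in range(1, m + 1):
        if pr_r[r] > 0.0:
            cfdr[r] = ev_r[r] / pr_r[r] / r  # E(V | R = r) / r
    return pr_r, cfdr


if __name__ == "__main__":
    # The power value 0.75 is an illustrative assumption, not from the slides.
    pr_r, cfdr = rejection_distribution(900, 100, 0.01, 0.75)
    fdr = sum(c * p for c, p in zip(cfdr, pr_r))
    print(f"FDR = {fdr:.4f}")
```

Summing `cfdr[r] * pr_r[r]` over $r > 0$ gives the unconditional FDR, which ties the conditional and unconditional quantities together.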

15
Taiwan Academia Sinica (Metal) Data*
Control and 8 metals, 55 one-channel arrays, 684 genes.
* Data from Dr. D. T. Lee's laboratory.

16
Identifying DE Genes: Sinica Data
Objective: Control vs. As vs. Cd.
Design: 6 arrays per group (I, III, IV, VI, VII, IX; 18 arrays).
Microarray: As-chip-TCL01 (one-channel membrane array).
Probes: 708 genes with 16 housekeeping genes.
Data filtering: spots with more than 3 zero/negative intensities were removed, resulting in 540 genes.
Gene expression matrix: 540 (genes) x 18 (arrays).
Normalization: GAM (lowess) to adjust for array effects.
Significance test: the p-values were computed using the F statistic from all ${}_{18}C_{12} \times {}_{12}C_{6}$ permutations.
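
A per-gene permutation F test can be sketched as follows. The slide enumerates all ${}_{18}C_{12} \times {}_{12}C_{6} = 17{,}153{,}136$ allocations; the sketch instead draws random label permutations (a Monte Carlo approximation, which is a substitution made here to keep the example short). Function names, the number of permutations, and the seed are hypothetical.

```python
import random
from math import comb


def f_statistic(values, labels):
    """One-way ANOVA F statistic for one gene across labelled arrays."""
    groups = {}
    for y, g in zip(values, labels):
        groups.setdefault(g, []).append(y)
    n, k = len(values), len(groups)
    grand = sum(values) / n
    ss_between = sum(
        len(ys) * (sum(ys) / len(ys) - grand) ** 2 for ys in groups.values()
    )
    ss_within = sum(
        (y - sum(ys) / len(ys)) ** 2 for ys in groups.values() for y in ys
    )
    if ss_within == 0.0:
        return float("inf")
    return (ss_between / (k - 1)) / (ss_within / (n - k))


def permutation_pvalue(values, labels, n_perm=2000, seed=0):
    """Monte Carlo permutation p-value for the F statistic."""
    rng = random.Random(seed)
    observed = f_statistic(values, labels)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if f_statistic(values, shuffled) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)  # add-one rule avoids a p-value of zero


# Number of distinct allocations of 18 arrays into three groups of 6.
N_ALLOCATIONS = comb(18, 12) * comb(12, 6)
```

Applying `permutation_pvalue` to each row of the 540 x 18 expression matrix would yield the p-values that the later slides rank and threshold.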

17
MCP Analysis of Sinica Data
Total number of genes: $m = 540$.
Estimated number of unaltered genes: $\hat m_0^{MD} = 444$.
Number of rejections ($r$):
FWE = 0.05: $\alpha = 0.05/444$ gives $r = 11$; $\alpha = 0.05/540$ gives $r = 9$.
FDR = 0.05: $\alpha = (0.05 \times r)/444$ gives $r = 39$; $\alpha = (0.05 \times r)/540$ gives $r = 27$.
CWE $= \alpha = 0.01$: $r = 50$.
$\hat m_1^{MD} = 96$: $r = 96$.
The FDR, pFDR, cFDR, and eFDR estimates are close.
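
The cut-offs quoted on this slide follow from simple arithmetic, which the short sketch below reproduces. The helper names are hypothetical; the numbers 444, 540, 39, and 27 are taken from the slide.

```python
def fwe_cutoff(fwe, n_hypotheses):
    """Per-gene p-value cut-off for a Bonferroni-type FWE procedure."""
    return fwe / n_hypotheses


def fdr_cutoff(fdr, r, n_hypotheses):
    """Per-gene p-value cut-off at the r-th ordered p-value for an FDR rule."""
    return fdr * r / n_hypotheses


if __name__ == "__main__":
    print(f"FWE, m0 = 444: {fwe_cutoff(0.05, 444):.3e}")
    print(f"FWE, m  = 540: {fwe_cutoff(0.05, 540):.3e}")
    print(f"FDR, m0 = 444, r = 39: {fdr_cutoff(0.05, 39, 444):.3e}")
    print(f"FDR, m  = 540, r = 27: {fdr_cutoff(0.05, 27, 540):.3e}")
```

Replacing $m = 540$ with the smaller estimate 444 raises every cut-off by the factor $540/444 \approx 1.22$, which is why more genes are rejected.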

18
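pFDR and cFDR Estimates using Different MCP Methods
| MCP | Cut-off | $r$ | $p_{(r)}$ | pFDR | cFDR | $v$ |
| --- | --- | --- | --- | --- | --- | --- |
| FWE(0.05) | $1.13 \times 10^{-4}$ * | 11 | $1.12 \times 10^{-4}$ | $4.52 \times 10^{-3}$ | $4.50 \times 10^{-3}$ | .5 |
| FDR(0.05) | $4.39 \times 10^{-3}$ * | 39 | $4.29 \times 10^{-3}$ | $4.87 \times 10^{-2}$ | $4.88 \times 10^{-2}$ | 2 |
| CWE(0.01) | 0.01 | 50 | $9.97 \times 10^{-3}$ | $8.85 \times 10^{-2}$ | $8.59 \times 10^{-2}$ | 4 |
| $\hat m_1^{MD}$ | | 96 | $3.28 \times 10^{-2}$ | $1.51 \times 10^{-1}$ | $1.53 \times 10^{-1}$ | 15 |
* FWE: 0.05/444; FDR: (0.05 x r)/444.
$v = r \times$ cFDR.
$m = 540$ and $\hat m_0^{MD} = 444$.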

19
Association Study Relationships between genes and samples: Effects of drugs (toxicants) on gene expression profiles, DNA diagnostic testing, or pathogen detection (classification). Relationships among samples: Molecular classification of different tissue types or samples on the basis of gene expression (cluster analysis). Relationships among genes: Genes of similar function yield similar expression patterns in microarray experiments (metabolic pathways, molecular function, biological process, etc.) (cluster analysis)

20
Class Prediction
Class prediction (classification): to develop a decision rule to predict the class membership of a new sample based on the expression profiles of some key genes.
Three steps:
1. Selection of the discriminatory (key) gene set.
2. Formation of the discrimination rule: Fisher's linear discriminant function, nearest-neighbor classifiers, support vector machines, and classification trees.
3. Cross-validation to estimate the accuracy of the prediction.
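
Steps 2 and 3 can be illustrated with a nearest-neighbor rule and k-fold cross-validation. This is a minimal sketch under stated assumptions: Euclidean distance, majority voting, and the function names, fold count, and seed are all choices made for the example rather than details from the talk.

```python
import random
from collections import Counter


def knn_predict(train_x, train_y, x, k=3):
    """Predict the class of profile x by majority vote of its k nearest neighbors."""
    dists = sorted(
        (sum((a - b) ** 2 for a, b in zip(row, x)), label)
        for row, label in zip(train_x, train_y)
    )
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]


def cross_validated_accuracy(xs, ys, n_folds=5, k=3, seed=0):
    """Estimate prediction accuracy by n_folds-fold cross-validation."""
    idx = list(range(len(xs)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::n_folds] for i in range(n_folds)]
    correct = 0
    for fold in folds:
        held_out = set(fold)
        train = [i for i in idx if i not in held_out]
        train_x = [xs[i] for i in train]
        train_y = [ys[i] for i in train]
        for i in fold:
            if knn_predict(train_x, train_y, xs[i], k) == ys[i]:
                correct += 1
    return correct / len(xs)
```

Here each row of `xs` would be one array's expression profile restricted to the selected gene set, and `ys` the treatment labels.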

21
Class Prediction: Sinica Data
Nine different treatments: Control, As, Cd, Ni, Cr, Sb, Pb, Cu, and AsV, for a total of 55 samples (arrays).
Number of genes: 684 genes (some in duplicate or triplicate).
Gene expression matrix: 684 (genes) x 55 (arrays).
Normalization: GAM (lowess) to adjust for array effects.
Gene sets: five gene sets are considered.
Classification methods: Fisher's linear discriminant function, nearest-neighbor classifiers (k-NN).
Cross-validation: 10-fold cross-validation, 11 arrays/group.

22
Selection of Discriminatory Genes
Significance testing approach to gene selection:
1. F: differentially expressed (global) genes among the 9 groups, using the F test with FWE = 0.05. 38 genes.
2. T: treatment-specific marker genes. A one-vs-all t-test compares each group with the 8 remaining groups at adjusted p = 0.01, giving $G_i$; $T = G_1 \cup \dots \cup G_9$. 89 genes.
3. $I = F \cap T$: intersection of F and T. 25 genes.
4. $U = F \cup T$: union of F and T. 102 genes.
5. A: original gene set. 684 genes.
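
The gene sets above are plain set operations, and their sizes satisfy inclusion-exclusion: $|U| = |F| + |T| - |I| = 38 + 89 - 25 = 102$. A short sketch with a hypothetical helper and made-up gene identifiers:

```python
def combine_gene_sets(f_genes, per_treatment_sets):
    """Build the sets F, T, I, and U from the F-test set and the one-vs-all sets G_i."""
    f = set(f_genes)
    t = set().union(*per_treatment_sets)  # T = G_1 U ... U G_9
    return {"F": f, "T": t, "I": f & t, "U": f | t}


if __name__ == "__main__":
    # Gene identifiers below are invented for illustration only.
    sets = combine_gene_sets(
        ["g1", "g2", "g3"],
        [{"g2"}, {"g3", "g4"}, set()],
    )
    for name, genes in sets.items():
        print(name, sorted(genes))
```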

23
Average accuracy (%) of k-NN multi-class classification, based on 11-fold cross-validation over 1,000 permutations.
Gene sets (# of genes): I (25), F (38), T (89), U (102), A (684).
Metal (n): Control (14), As (7), AsV (5), Cd (6), Cu (5), Ni (4), Cr (5), Sb (7), Pb (5); Total 55.
Accuracy values: 100 99.1 75.5 82.0 61.6 76.3 60.0 81.3 51.8 81.6 100 99.1 78.6 84.4 78.5 99.7 42.4 81.4 72.7 85.3 98.4 98.7 99.8 81.8 38.2 99.5 37.1 98.7 46.0 82.0 98.8 97.1 81.5 41.5 97.8 97.1 81.7 45.8 80.5 79.0 96.6 38.7 50.4 57.1 94.9 18.3 78.7 45.8 65.6
The FLDA algorithm performed poorly; for example, the overall accuracies are 67.9% and 40.5% for I and F, respectively.

24
Cluster analysis with a 2-MDS plot for the treatment-specific marker genes in I: each gene is labeled with the compound for which it shows a unique expression.
Number of genes in I per metal: Ctrl 7, As 1, AsV 1, Cd 3, Cu 2, Ni 4, Cr 1, Sb 8, Pb 0.
Clustering: (1 - correlation) metric, complete linkage.

25
Clustering results with 2-MDS plots for the 55 arrays, using gene sets I and A.
Panels: gene set I (25 genes); gene set A (684 genes).

26
Acknowledgements
Collaborators and Contributors
Dr. Frank Sistare & Staff (CDER/FDA; Merck)
Dr. Sue-Jane Wang (CDER/FDA)
Dr. T-C Lee & Staff (Academia Sinica, Taiwan)
Dr. C-h Chen & Staff (Academia Sinica, Taiwan)
Dr. Suzanne Morris & Staff (NCTR)
Dr. Jim Fuscoe & Staff (NCTR)
Dr. Ralph Kodell (NCTR)
Dr. Robert Delongchamp (NCTR)
Dr. Hueymiin Hsueh (Cheng-chi Univ., Taiwan)
Dr. Chen-an Tsai (NCTR)
Ms. Yi-Ju Chen (Penn State, NCTR)
