Selection of Differential Expression Genes in Microarray Experiments James J. Chen, Ph.D. Division of Biometry and Risk Assessment National Center for.

1 Selection of Differential Expression Genes in Microarray Experiments James J. Chen, Ph.D. Division of Biometry and Risk Assessment National Center for Toxicological Research Food and Drug Administration FDA/Industry Workshop September 19, 2003

2 Analysis of Microarray Data
Class comparison: identifying differentially expressed genes.
Class prediction: association between genes and samples; selecting a minimal combination of genes (classification).
Class discovery: discovering sample sub-types or gene clusters; selecting genes with similar expression patterns (cluster analysis).
Data layout (samples in rows, genes in columns; y_ij is the expression of gene i in sample j):
Samples   g_1    g_2    g_3    ...   g_m
S_1       y_11   y_21   y_31   ...   y_m1
S_2       y_12   y_22   y_32   ...   y_m2
...
S_n       y_1n   y_2n   y_3n   ...   y_mn

3 Identifying Differentially Expressed Genes
An important goal of the data analysis is to identify a set of genes that are differentially expressed among control and treated samples (groups). The purposes are:
To identify disease-related, drug-response, or biomarker genes (class comparison).
To enhance relationships among genes and samples for clustering or prediction (class prediction or class discovery).

4 Ranking Genes
The normalized data are analyzed one gene at a time (when there is a sufficient number of replicates n) using statistical methods: ANOVA, permutation tests, ROC, etc. Each gene i receives a p-value p_i and a rank r_i.
Samples   g_1         g_2         g_3         ...   g_m
S_1       y_11        y_21        y_31        ...   y_m1
S_2       y_12        y_22        y_32        ...   y_m2
...
S_n       y_1n        y_2n        y_3n        ...   y_mn
Rank      r_1 (p_1)   r_2 (p_2)   r_3 (p_3)   ...   r_m (p_m)

5 P-value Approaches to Gene Selection
The genes on an array are a mixture of altered and unaltered genes; altered genes should have smaller p-values. How should a cut-off be chosen?
P-value for gene ranking: use the p-values to rank the genes in order of evidence for differential expression: $p_{(1)} \le \dots \le p_{(m)}$ (an ordered evidence of differences).
Determining the cut-off: fixed p-value, number of rejections, estimated number of altered genes, decision analysis (ROC).
Multiple testing issue: FWE or FDR approach.

6 Approaches to Multiplicity Testing
Two approaches to multiplicity testing:
Family-wise error (FWE) rate approach: controlling the probability of any false rejection of unaltered genes among all hypotheses (genes in the array) tested.
False discovery rate (FDR) approach: controlling or estimating the expected proportion of falsely rejected unaltered genes among the rejected hypotheses (significant genes).

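7 Testing m hypotheses
                       Decision
True state    Significance        Non-significance    Total
Unaltered     V ($\alpha$)        S ($1-\alpha$)      $m_0$
Altered       U ($1-\beta$)       T ($\beta$)         $m_1$
Total         R                   m - R               m
(Per-test probabilities of each decision are shown in parentheses.)
The number of true null hypotheses $m_0$ is fixed but unknown. V and U are unobservable; R = U + V is observable.
The FWE is the probability $\Pr(V > 0)$.
The FDR is $E(V/R)$, the expected proportion of rejected unaltered genes among the significant genes, with V/R taken as 0 when R = 0.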

8 P-Value FWE Approach
FWE: the probability of rejecting at least one true null hypothesis in the given family of hypotheses.
Bonferroni adjustment: set the comparison-wise error (CWE) at $\alpha/m$; then FWE $\le \alpha$.
Improvements:
Holm (Scand. J. Stat., 1979) step-down procedure, with adjusted p-values $m\,p_{(1)}, (m-1)\,p_{(2)}, (m-2)\,p_{(3)}, \dots$
Estimating the number of unaltered genes $m_0$: set CWE = FWE/$m_0$, with adjusted p-values $m_0\,p_{(1)}, m_0\,p_{(2)}, m_0\,p_{(3)}, \dots$ When $m_0$ is much smaller than m, this is a substantial improvement.
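
The adjustments on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the presenter's code: the function names, the toy p-values, and the value m0 = 3 are assumptions made for the example, and the Holm values are made monotone with a running maximum, as is common practice.

```python
def bonferroni_adjust(pvals, m0=None):
    """Multiply each p-value by the number of hypotheses, capped at 1.

    If m0 (an estimate of the number of true nulls) is supplied,
    it is used as the multiplier instead of m = len(pvals).
    """
    k = len(pvals) if m0 is None else m0
    return [min(1.0, k * p) for p in pvals]


def holm_adjust(pvals):
    """Holm step-down adjusted p-values, returned in the input order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):  # rank 0 is the smallest p-value
        value = min(1.0, (m - rank) * pvals[i])
        running_max = max(running_max, value)  # keep adjusted values monotone
        adjusted[i] = running_max
    return adjusted


# Hypothetical p-values for five genes.
pvals = [0.001, 0.012, 0.02, 0.04, 0.3]
for name, adj in [
    ("Bonferroni (m)", bonferroni_adjust(pvals)),
    ("Holm", holm_adjust(pvals)),
    ("Bonferroni (m0 = 3)", bonferroni_adjust(pvals, m0=3)),
]:
    n_reject = sum(a < 0.05 for a in adj)
    print(name, [round(a, 4) for a in adj], "rejections:", n_reject)
```

On these toy values the plain Bonferroni adjustment rejects one gene at FWE = 0.05, while Holm and the $m_0$-based adjustment each reject two.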

9 Estimating Number of True Nulls
Difference of two adjacent ordered p-values: $d_j = p_{(j)} - p_{(j-1)}$, $j = 1, \dots, m+1$, with $p_{(0)} = 0$ and $p_{(m+1)} = 1$.
Under independence and $H_0$, $d_j \sim \mathrm{Beta}(1, m_0)$ with mean $E(d_j) = 1/(m_0 + 1)$.
An estimate of $m_0$ is the mean-difference (MD) estimate $\hat{m}_0^{\mathrm{MD}} = 1/\bar{d} - 1 \approx 1/E(d) - 1$.
Graphic algorithm to estimate $m_0$: Benjamini and Hochberg (J. Educ. Behav. Stat., 2000); Hsueh et al. (J. Biopharm. Stat., 2003).
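
The idea behind the mean-difference estimate can be illustrated with a deliberately simplified sketch. It is not the graphical algorithm cited on the slide: here the index `start`, from which the gaps are averaged, is simply supplied by hand (an assumption), whereas the published procedures choose it from the data. The toy p-values are also hypothetical.

```python
def estimate_m0_from_tail(pvals, start=0):
    """Estimate m0 as 1 / mean(gap) - 1 from the gaps above p_(start).

    start = 0 uses p_(0) = 0 as the left end point, i.e. all m + 1 gaps.
    The gaps between p_(start) and p_(m+1) = 1 telescope, so their mean
    is (1 - p_(start)) / (m + 1 - start).
    """
    p = sorted(pvals)
    m = len(p)
    left = 0.0 if start == 0 else p[start - 1]
    n_gaps = m + 1 - start
    mean_gap = (1.0 - left) / n_gaps
    return 1.0 / mean_gap - 1.0


# Hypothetical data: 10 altered genes with tiny p-values and
# 90 unaltered genes whose p-values are evenly spread over (0, 1).
altered = [1e-6 * k for k in range(1, 11)]
unaltered = [k / 91 for k in range(1, 91)]
pvals = altered + unaltered

print(estimate_m0_from_tail(pvals, start=0))   # uses every gap: returns m
print(estimate_m0_from_tail(pvals, start=11))  # uses only the null-like tail
```

Using every gap always returns m, because the gaps sum to one; the estimate only becomes informative when the averaging is restricted to the part of the ordered p-values that behaves like the null, which is what the graphical algorithm is designed to locate.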

10 Simulation results for the $\hat{m}_0^{\mathrm{MD}}$ estimator for m = 1,000, based on 10,000 replicates.
Estimation: means and standard deviations (s.d.) of the estimates, tabulated by $m_0$ for independent hypotheses and for correlated hypotheses (correlation = .25). The effect size is set to have 80% power at the FWE = 25.
Testing: empirical familywise error rates at nominal FWE levels including 0.05 and 0.10, tabulated by m for independent hypotheses and for correlated hypotheses (correlation = .25).
(The table values were not recoverable from the transcript.)

11 P-value FDR Methods
FDR: the expected proportion of falsely rejected null hypotheses among the rejected hypotheses.
FDR-controlled (BH, 1995): q-value $= m\,p_{(r)}/r <$ FDR.
Fixed CWE $= \alpha$ (Storey, 2002): estimate pFDR.
Fixed R = r (Tsai, 2003): estimate cFDR $= E(V \mid R = r)/r$. The expected number of false significances is $r \times$ cFDR.
The FDRs depend on the distribution of R and the conditional distribution of $V \mid R$:
FDR $=$ pFDR $\cdot \Pr(R > 0) = \sum_{r > 0}$ cFDR$(r)\,\Pr(R = r)$.
Chen (ICSA Bulletin, 2003)
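
The q-value rule $m\,p_{(r)}/r$ can be sketched as follows. This is a minimal illustration under stated assumptions: the function name and toy p-values are invented for the example, the q-values are made monotone with a running minimum (the usual step-up convention), and the optional `m0` argument simply replaces m by an estimate of the number of true nulls.

```python
def bh_qvalues(pvals, m0=None):
    """Benjamini-Hochberg style q-values, returned in the input order.

    The raw value for the r-th smallest p-value is k * p_(r) / r, where
    k = m by default or k = m0 when an estimate of m0 is supplied.
    """
    m = len(pvals)
    k = m if m0 is None else m0
    order = sorted(range(m), key=lambda i: pvals[i])
    q = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # walk from the largest p-value down
        i = order[rank - 1]
        running_min = min(running_min, k * pvals[i] / rank)
        q[i] = running_min
    return q


# Hypothetical p-values for ten genes.
pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
level = 0.07
q_m = bh_qvalues(pvals)
q_m0 = bh_qvalues(pvals, m0=8)
print("rejections with m :", sum(q < level for q in q_m))
print("rejections with m0:", sum(q < level for q in q_m0))
```

With these toy numbers, two genes are declared significant at level 0.07 when m = 10 is used, and five when the hypothetical estimate m0 = 8 is used instead.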

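12 Distribution of R and the cFDR for m = 1000 and $m_0$ = 900 at $\alpha$ = .01 and an effect parameter (symbol lost in the transcript, subscript 1) = 2. Assume a paired t-test with five replicated arrays.
Table columns: r, Pr(R = r), cFDR (values not recoverable from the transcript).
Unconditional estimates: FDR = .1067, pFDR = .1067, mFDR = .1075.
Conditional estimates at R = E(R) (the mode; value not recoverable): cFDR = .1064, eFDR = .1071.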

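13 FDR, pFDR, cFDR, and mFDR at $\alpha$ = .01 and .001, for m = 100 and 1000, with $F_0$ and $F_1$ under independence. The cFDR values are evaluated at [E(R) + 1].
Table columns: m, $m_0$, then FDR, pFDR, cFDR, mFDR for $\alpha$ = .01 and again for $\alpha$ = .001 (values not recoverable from the transcript).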

14 Conditional Distribution of V | R = r
Given $m_0$ and $\alpha$, the number of rejections is R = V + U, where $V \sim \mathrm{Bin}(m_0, \alpha)$ and $U \sim \mathrm{Bin}(m_1, 1-\beta)$.
The conditional distribution of V | R = r is a non-central hypergeometric distribution.
The cFDR $= E(V \mid R = r)/r$ is estimated from the mean of V | R. It can also be computed from the distribution of R.
To estimate the cFDR: use $\hat{m}_0^{\mathrm{MD}}$ and the distribution of R (parametric or bootstrap method).
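
For independent V and U, the conditional distribution of V given R = r and the resulting cFDR can be computed directly from the two binomial distributions. The sketch below is an illustration only: the function names and the parameter values (m0 = 90, m1 = 10, alpha = 0.01, power = 0.8) are assumptions chosen to keep the example small, not values from the slides.

```python
from math import exp, lgamma, log


def log_binom_pmf(k, n, p):
    """Log of the Binomial(n, p) probability of k, for 0 < p < 1."""
    log_comb = lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
    return log_comb + k * log(p) + (n - k) * log(1.0 - p)


def conditional_v_given_r(r, m0, m1, alpha, power):
    """Return Pr(R = r) and the dict {v: Pr(V = v | R = r)}.

    V ~ Bin(m0, alpha) counts false rejections and U ~ Bin(m1, power)
    counts true rejections; V and U are assumed independent.
    """
    joint = {}
    for v in range(max(0, r - m1), min(m0, r) + 1):
        joint[v] = exp(log_binom_pmf(v, m0, alpha)
                       + log_binom_pmf(r - v, m1, power))
    pr_r = sum(joint.values())
    return pr_r, {v: w / pr_r for v, w in joint.items()}


def cfdr(r, m0, m1, alpha, power):
    """cFDR = E(V | R = r) / r for r >= 1."""
    _, cond = conditional_v_given_r(r, m0, m1, alpha, power)
    return sum(v * w for v, w in cond.items()) / r


# Hypothetical setting.
m0, m1, alpha, power = 90, 10, 0.01, 0.8
fdr = 0.0
pr_positive = 0.0
expected_v = 0.0
for r in range(1, m0 + m1 + 1):
    pr_r, _ = conditional_v_given_r(r, m0, m1, alpha, power)
    c = cfdr(r, m0, m1, alpha, power)
    fdr += c * pr_r             # FDR = sum over r of cFDR(r) Pr(R = r)
    pr_positive += pr_r         # Pr(R > 0)
    expected_v += r * c * pr_r  # recovers E(V) = m0 * alpha
pfdr = fdr / pr_positive

print("cFDR at r = 9:", round(cfdr(9, m0, m1, alpha, power), 4))
print("FDR:", round(fdr, 4), "pFDR:", round(pfdr, 4))
```

The loop also accumulates E(V), which should equal $m_0 \alpha$; this gives a quick consistency check on the conditional distribution.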

15 Taiwan Academia Sinica (Metal) Data*
Control and 8 metals, 55 one-channel arrays, 684 genes.
* Data from Dr. D. T. Lee's laboratory.

16 Identifying DE Genes: Sinica Data
Objective: Control vs. As vs. Cd.
Design: 6 arrays per group (I, III, IV, VI, VII, IX; 18 arrays).
Microarray: As-chip-TCL01 (one-channel membrane array).
Probes: 708 genes, including 16 housekeeping genes.
Data filtering: spots with more than 3 zero or negative intensities were removed, resulting in 540 genes.
Gene expression matrix: 540 (genes) x 18 (arrays).
Normalization: GAM (lowess) to adjust for array effects.
Significance test: the p-values were computed using the F statistic from all $\binom{18}{6}\binom{12}{6}$ permutations of the 18 arrays among the three groups.
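
A permutation F test of this kind can be sketched for a single gene as below. This is a minimal illustration, not the analysis code used for the Sinica data: the helper names and the toy intensities (three groups of two arrays) are assumptions, chosen so that full enumeration stays tiny.

```python
from itertools import combinations
from math import comb


def f_statistic(groups):
    """One-way ANOVA F statistic for a list of groups of numbers."""
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]
    between = sum(len(g) * (mu - grand_mean) ** 2 for g, mu in zip(groups, means))
    within = sum((x - mu) ** 2 for g, mu in zip(groups, means) for x in g)
    return (between / (len(groups) - 1)) / (within / (n_total - len(groups)))


def split_into_groups(indices, sizes):
    """Yield every way to deal `indices` into labelled groups of the given sizes."""
    if not sizes:
        yield []
        return
    for chosen in combinations(indices, sizes[0]):
        rest = [i for i in indices if i not in chosen]
        for tail in split_into_groups(rest, sizes[1:]):
            yield [list(chosen)] + tail


def permutation_f_pvalue(groups):
    """Exact permutation p-value: share of splits with F >= the observed F."""
    values = [x for g in groups for x in g]
    sizes = [len(g) for g in groups]
    observed = f_statistic(groups)
    count = total = 0
    for split in split_into_groups(list(range(len(values))), sizes):
        stat = f_statistic([[values[i] for i in idx] for idx in split])
        total += 1
        if stat >= observed * (1 - 1e-9):  # tolerance for floating-point ties
            count += 1
    return count / total, total


# Hypothetical intensities for one gene: control, As, Cd (two arrays each).
p_value, n_splits = permutation_f_pvalue([[1.0, 2.0], [5.0, 6.0], [9.0, 10.0]])
print(p_value, n_splits)

# Number of splits for three groups of six arrays, as in the slide's design.
print(comb(18, 6) * comb(12, 6))
```

With three groups of six arrays there are 17,153,136 splits per gene, so a pure-Python enumeration like this one would be slow for 540 genes; a vectorised implementation or a random sample of permutations would be the practical choice.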

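17 MCP Analysis of Sinica Data
Total number of genes: m = 540.
Estimated number of unaltered genes: $\hat{m}_0^{\mathrm{MD}}$ = 444.
Number of rejections (r):
FWE = 0.05: with $\alpha$ = 0.05/444, r = (value lost in the transcript); with $\alpha$ = 0.05/540, r = 9.
FDR = 0.05: with $\alpha$ = (0.05 x r)/444, r = (value lost in the transcript); with $\alpha$ = (0.05 x r)/540, r = 27.
CWE = $\alpha$ = 0.01: r = 50.
$\hat{m}_1^{\mathrm{MD}}$ = 96: r = 96.
The FDR, pFDR, cFDR, and eFDR estimates are close.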

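18 pFDR and cFDR Estimates using Different MCP Methods
Table columns: MCP, r, $p_{(r)}$, pFDR, cFDR, v*. Rows: FWE(0.05), FDR(0.05), CWE(0.01), $\hat{m}_1^{\mathrm{MD}}$.
Recoverable entries: $p_{(r)} = 1.13 \times 10^{-4}$ for FWE(0.05) and $p_{(r)} = 4.39 \times 10^{-3}$ for FDR(0.05); the remaining values were lost in the transcript.
* FWE: $\alpha$ = 0.05/444; FDR: $\alpha$ = (0.05 x r)/444; v = r x cFDR.
* m = 540 and $\hat{m}_0^{\mathrm{MD}}$ = 444.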

19 Association Study Relationships between genes and samples: Effects of drugs (toxicants) on gene expression profiles, DNA diagnostic testing, or pathogen detection (classification). Relationships among samples: Molecular classification of different tissue types or samples on the basis of gene expression (cluster analysis). Relationships among genes: Genes of similar function yield similar expression patterns in microarray experiments (metabolic pathways, molecular function, biological process, etc.) (cluster analysis)

20 Class Prediction
Class prediction (classification): to develop a decision rule that predicts the class membership of a new sample based on the expression profiles of some key genes.
Three steps:
1. Selection of the discriminatory (key) gene set.
2. Formation of the discrimination rule: Fisher's linear discriminant function, nearest-neighbor classifiers, support vector machines, and classification trees.
3. Cross-validation to estimate the accuracy of the prediction.

21 Class Prediction: Sinica Data
Nine different treatments: Control, As, Cd, Ni, Cr, Sb, Pb, Cu, and AsV, for a total of 55 samples (arrays).
Number of genes: 684 genes (some in duplicate or triplicate).
Gene expression matrix: 684 (genes) x 55 (arrays).
Normalization: GAM (lowess) to adjust for array effects.
Gene sets: five gene sets are considered.
Classification methods: Fisher's linear discriminant function, nearest-neighbor classifiers (k-NN).
Cross-validation: 10-fold cross-validation, 11 arrays/group.
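
The combination of a k-NN classifier with cross-validation can be sketched as follows. This is a toy illustration, not the analysis behind the slides: the two-gene expression profiles, the class labels, k = 3, the Euclidean distance, and the way samples are dealt into folds are all assumptions made for the example.

```python
from collections import Counter


def knn_predict(train_x, train_y, x, k=3):
    """Predict the label of x by majority vote among its k nearest neighbours."""
    dists = sorted(
        (sum((a - b) ** 2 for a, b in zip(row, x)), label)
        for row, label in zip(train_x, train_y)
    )
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]


def cross_validated_accuracy(xs, ys, n_folds=5, k=3):
    """Fraction of samples classified correctly when each fold is held out once."""
    n = len(xs)
    correct = 0
    for fold in range(n_folds):
        test_idx = set(range(fold, n, n_folds))  # every n_folds-th sample
        train_x = [xs[i] for i in range(n) if i not in test_idx]
        train_y = [ys[i] for i in range(n) if i not in test_idx]
        for i in test_idx:
            if knn_predict(train_x, train_y, xs[i], k) == ys[i]:
                correct += 1
    return correct / n


# Hypothetical expression profiles on two genes for two treatments.
xs = [(0.0, 0.1), (5.0, 5.1), (0.2, 0.0), (5.2, 4.9), (0.1, 0.2),
      (4.9, 5.0), (0.0, 0.3), (5.1, 5.2), (0.3, 0.1), (5.0, 4.8)]
ys = ["Control", "Cd"] * 5

accuracy = cross_validated_accuracy(xs, ys, n_folds=5, k=3)
print(f"cross-validated accuracy: {accuracy:.2f}")
```

Because the two toy classes are well separated, every held-out sample is classified correctly; real expression data would rarely behave this cleanly.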

22 Selection of Discriminatory Genes
Significance testing approach to gene selection:
1. F: differentially expressed (global) genes among the 9 groups, using the F test with FWE = (value lost in the transcript); number of genes lost in the transcript.
2. T: treatment-specific marker genes. A one-vs-all t-test compares each group with the 8 remaining groups at adjusted p = 0.01, giving $G_i$; $T = G_1 \cup \dots \cup G_9$; 89 genes.
3. $I = F \cap T$: intersection of F and T; 25 genes.
4. $U = F \cup T$: union of F and T; 102 genes.
5. A: original gene set; 684 genes.

23 Average accuracy (%) of k-NN multi-class classification, based on 11-fold cross-validation over 1,000 permutations.
Table columns: Metal, n, I, F, T, U, A. Rows: # of genes, Control, As, AsV, Cd, Cu, Ni, Cr, Sb, Pb, Total (values not recoverable from the transcript).
The FLDA algorithm performed poorly; for example, the overall accuracies are 67.9% and 40.5% for I and F, respectively.

24 Cluster analysis with a 2-MDS plot for the treatment-specific marker genes in I: each gene is labeled with the compound to which it gives a unique expression. (1 - correlation) metric, complete linkage.
Metal   Genes in I
Ctrl    7
As      1
AsV     1
Cd      3
Cu      2
Ni      4
Cr      1
Sb      8
Pb      0

25 Clustering results with 2-MDS plots for the 55 arrays, for gene sets I and A.
Panels: gene set I (25 genes); gene set A (684 genes).

26 Acknowledgements
Collaborators and Contributors
Dr. Frank Sistare & Staff (CDER/FDA; Merck)
Dr. Sue-Jane Wang (CDER/FDA)
Dr. T-C Lee & Staff (Academia Sinica, Taiwan)
Dr. C-h Chen & Staff (Academia Sinica, Taiwan)
Dr. Suzanne Morris & Staff (NCTR)
Dr. Jim Fuscoe & Staff (NCTR)
Dr. Ralph Kodell (NCTR)
Dr. Robert Delongchamp (NCTR)
Dr. Hueymiin Hsueh (Cheng-chi Univ., Taiwan)
Dr. Chen-an Tsai (NCTR)
Ms. Yi-Ju Chen (Penn State, NCTR)
