Published by Devin Lee. Modified over 2 years ago.

1
Selection of Differential Expression Genes in Microarray Experiments
James J. Chen, Ph.D.
Division of Biometry and Risk Assessment, National Center for Toxicological Research, Food and Drug Administration
FDA/Industry Workshop, September 19, 2003

2
Analysis of Microarray Data
Class comparison: identifying differentially expressed genes.
Class prediction: association between genes and samples; selecting a minimal combination of genes (classification).
Class discovery: discovering sample sub-types or gene clusters; selecting genes with similar expression patterns (cluster analysis).
Data layout (rows are samples, columns are genes; y_ij is the expression of gene g_i in sample S_j):
Samples   Genes: g_1    g_2    g_3    ...   g_m
S_1              y_11   y_21   y_31   ...   y_m1
S_2              y_12   y_22   y_32   ...   y_m2
...
S_n              y_1n   y_2n   y_3n   ...   y_mn

3
Identifying Differentially Expressed Genes
An important goal of the data analysis is to identify a set of genes that are differentially expressed between control and treated samples (groups). The selected genes are used:
To identify disease-related, drug-response, or biomarker genes (class comparison).
To enhance relationships among genes and samples for clustering or prediction (class prediction or class discovery).

4
Ranking Genes
The normalized data are analyzed one gene at a time (when there is a sufficient number of replicates n) using statistical methods such as ANOVA, permutation tests, and ROC analysis. Each gene g_i receives a p-value p_i and a rank r_i:
Samples   Genes: g_1         g_2         g_3         ...   g_m
S_1              y_11        y_21        y_31        ...   y_m1
S_2              y_12        y_22        y_32        ...   y_m2
...
S_n              y_1n        y_2n        y_3n        ...   y_mn
Rank             r_1 (p_1)   r_2 (p_2)   r_3 (p_3)   ...   r_m (p_m)
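The gene-by-gene ranking can be sketched as follows. This is a minimal illustration, not the method used in the talk: it assumes two groups, uses an exact permutation test on the difference in group means, and the names `perm_pvalue`, `rank_genes`, and the toy genes `g1` and `g2` are hypothetical.

```python
from itertools import combinations

def perm_pvalue(control, treated):
    """Exact two-sided permutation p-value for a difference in group means."""
    pooled = list(control) + list(treated)
    n1, n = len(control), len(pooled)
    total = sum(pooled)

    def abs_diff(idx):
        # Absolute difference in means when the samples in idx form group 1.
        s1 = sum(pooled[i] for i in idx)
        return abs(s1 / n1 - (total - s1) / (n - n1))

    observed = abs_diff(range(n1))
    hits = count = 0
    # Enumerate every way of assigning n1 of the n samples to group 1.
    for idx in combinations(range(n), n1):
        count += 1
        if abs_diff(idx) >= observed - 1e-12:  # tolerance for float ties
            hits += 1
    return hits / count

def rank_genes(expr, n_control):
    """expr maps gene name -> list of values, control samples listed first.
    Returns (gene, p-value) pairs sorted from strongest to weakest evidence."""
    pvals = {g: perm_pvalue(y[:n_control], y[n_control:]) for g, y in expr.items()}
    return sorted(pvals.items(), key=lambda kv: kv[1])

# Toy data (hypothetical): g1 is clearly shifted, g2 is not.
expr = {"g1": [1, 2, 3, 10, 11, 12], "g2": [1, 5, 3, 2, 4, 6]}
ranked = rank_genes(expr, n_control=3)
print(ranked)
```

With three arrays per group there are only 20 possible splits, so the smallest attainable two-sided p-value is 2/20 = 0.1. This is why the slide stresses having a sufficient number of replicates.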

5
P-value Approaches to Gene Selection
The genes on an array are a mixture of altered and unaltered genes; altered genes should have smaller p-values. How should a cut-off be chosen?
P-values for gene ranking: use the p-values to rank the genes in order of evidence for differential expression, $p_{(1)} \le \dots \le p_{(m)}$ (an ordered evidence of differences).
Determining the cut-off: a fixed p-value, a fixed number of rejections, an estimate of the number of altered genes, or a decision rule (ROC).
Multiple testing issue: FWE or FDR approach.

6
Approaches to Multiplicity Testing
Two approaches to multiplicity testing:
Family-wise error (FWE) rate approach: controlling the probability of falsely rejecting any unaltered gene among all hypotheses (genes on the array) tested.
False discovery rate (FDR) approach: estimating the expected proportion of falsely rejected unaltered genes among the rejected hypotheses (significant genes).

7
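Testing m hypotheses
                 Decision
True state       Significance   Non-significance   Total
Unaltered        V              S                  $m_0$
Altered          U              T                  $m_1$
Total            R              m - R              m
Each unaltered gene is declared significant with probability $\alpha$; each altered gene is declared significant with probability $1 - \beta$.
The number of true null hypotheses $m_0$ is fixed but unknown. V and U are unobservable; R = U + V is observable.
The FWE is the probability $\Pr(V > 0)$.
The FDR is $E(V/R)$, the expected proportion of unaltered genes among the significant ones, with $V/R$ taken as 0 when $R = 0$.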

8
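P-Value FWE Approach
FWE: the probability of rejecting at least one true null hypothesis in the given family of hypotheses.
Bonferroni adjustment: set the comparison-wise error (CWE) at $\alpha/m$; then FWE $\le \alpha$.
Improvements:
Holm (Scand. J. Statist., 1979) step-down procedure: $(m\,p_{(1)},\ (m-1)\,p_{(2)},\ (m-2)\,p_{(3)},\ \dots)$.
Estimating the number of unaltered genes $m_0$: set CWE = FWE/$m_0$, which gives the adjusted p-values $(m_0\,p_{(1)},\ m_0\,p_{(2)},\ m_0\,p_{(3)},\ \dots)$.
Since $m_0 \le m$, this is never more conservative than Bonferroni, and the improvement is large when $m_0 \ll m$.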
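The three adjustments above can be compared on a toy example. This is a minimal sketch: the function names and p-values are hypothetical, and the value passed as `m` in the last call stands in for an estimate of m0 that would come from a separate estimator.

```python
def bonferroni(pvals, m=None):
    """Bonferroni-adjusted p-values; pass m=m0 to use an estimate of m0 instead of m."""
    m = len(pvals) if m is None else m
    return [min(1.0, m * p) for p in pvals]

def holm(pvals):
    """Holm step-down adjusted p-values, returned in the original order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):
        # Multipliers are m, m-1, m-2, ...; the running maximum keeps
        # the adjusted values monotone in the ordered p-values.
        running_max = max(running_max, (m - rank) * pvals[i])
        adjusted[i] = min(1.0, running_max)
    return adjusted

pvals = [0.01, 0.04, 0.03, 0.005]      # hypothetical p-values
print(bonferroni(pvals))               # multiplier m = 4
print(holm(pvals))                     # multipliers 4, 3, 2, 1
print(bonferroni(pvals, m=2))          # multiplier m0 = 2 (assumed estimate)
```

A gene is declared significant when its adjusted p-value is at or below the target FWE.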

9
Estimating Number of True Nulls
Difference of two adjacent ordered p-values: $d_j = p_{(j)} - p_{(j-1)}$, $j = 1, \dots, m+1$, with $p_{(0)} = 0$ and $p_{(m+1)} = 1$.
Under independence and $H_0$, $d_j \sim \mathrm{Beta}(1, m_0)$ with mean $E(d_j) = 1/(m_0 + 1)$.
An estimate of $m_0$ is $m_0^{MD} = 1/\bar{d} - 1 \approx 1/E(d) - 1$.
Graphic algorithm to estimate $m_0$: Benjamini and Hochberg (J. Educ. Behav. Stat., 2000).
Hsueh et al. (J. Biopharm. Stat., 2003).
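One way to turn the graphical idea into code is the lowest-slope rule usually attributed to Benjamini and Hochberg (2000). The sketch below is an assumption about that rule's details, not the MD estimator and not code from the talk: it uses slopes $S_i = (1 - p_{(i)})/(m + 1 - i)$, stops the first time a slope decreases, and rounds $1/S_j + 1$ up (the rounding convention is assumed). The name `lowest_slope_m0` is hypothetical.

```python
import math

def lowest_slope_m0(pvals):
    """Lowest-slope estimate of the number of true nulls (assumed form)."""
    p = sorted(pvals)
    m = len(p)
    prev = 0.0
    for i, p_i in enumerate(p, start=1):
        # Slope of the line from (i, p_(i)) to (m + 1, 1) on the quantile plot.
        slope = (1.0 - p_i) / (m + 1 - i)
        if slope < prev:
            if slope <= 0.0:          # p_(i) == 1: fall back to m
                return m
            return min(m, math.ceil(1.0 / slope + 1.0))
        prev = slope
    return m                          # slopes never decreased

# Hypothetical p-values: four small ones followed by six spread over (0, 1).
pvals = [0.001, 0.002, 0.003, 0.004, 0.2, 0.4, 0.5, 0.6, 0.8, 0.9]
print(lowest_slope_m0(pvals))
```

On this toy input the slopes rise through the four small p-values and first drop at the fifth, which gives an estimate of 9 out of 10. The estimate is conservative, in the sense that it tends to overstate m0 rather than understate it.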

10
Simulation results for the $m_0^{MD}$ estimator for m = 1,000, based on 10,000 replicates.
Estimation: the effect size is set to have 80% power at the FWE = 25. The means and standard deviations (s.d.) are tabulated by $m_0$ for independent hypotheses and for correlated hypotheses (correlation = .25).
Testing: empirical familywise error rates at the FWE = 0.05, 0.10, tabulated by m for independent hypotheses and for correlated hypotheses (correlation = .25).
[Table values are not preserved in this transcript.]

11
P-value FDR Methods
FDR: the expected proportion of falsely rejected null hypotheses among all rejections.
FDR-controlled (BH, 1995): q-value $= m\,p_{(r)}/r <$ FDR.
Fixed CWE $= \alpha$ (Storey, 2002): estimate the pFDR.
Fixed R = r (Tsai, 2003): estimate cFDR $= E(V \mid R = r)/r$. The expected number of false significances is $r \times$ cFDR.
The FDRs depend on the distribution of R and the conditional distribution of $V \mid R$:
FDR $=$ pFDR $\cdot \Pr(R > 0) = \sum_{r > 0} \mathrm{cFDR}(r)\,\Pr(R = r)$.
Chen (ICSA Bulletin, 2003).
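The FDR-controlled rule can be sketched as a step-up search for the largest r whose q-value $m\,p_{(r)}/r$ does not exceed the target FDR. The function name and the p-values are hypothetical; the optional `m0` argument replaces m by an estimate of the number of true nulls, as in the FWE slide.

```python
def bh_rejections(pvals, fdr=0.05, m0=None):
    """Number of rejections r under the Benjamini-Hochberg step-up rule."""
    m = len(pvals) if m0 is None else m0
    r = 0
    for k, p in enumerate(sorted(pvals), start=1):
        # Keep the largest k whose q-value m * p_(k) / k is within the target.
        if m * p / k <= fdr:
            r = k
    return r

# Hypothetical p-values for ten genes.
pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.07, 0.074, 0.205, 0.212, 0.216]
print(bh_rejections(pvals, fdr=0.05))         # uses m = 10
print(bh_rejections(pvals, fdr=0.05, m0=5))   # uses an assumed estimate m0 = 5
```

Using the smaller multiplier m0 raises every threshold, so the number of rejections can only stay the same or grow.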

12
Distribution of R and the cFDR for m = 1000 and $m_0$ = 900 at $\alpha$ = .01 and a subscript-1 parameter equal to 2. Assume a paired t-test with five replicated arrays.
Columns (repeated in three panels): r, Pr(R = r), cFDR.
[Table values are not preserved in this transcript.]
Unconditional estimates: FDR = .1067, pFDR = .1067, mFDR = .1075.
Conditional estimates at E(R) (the mode): cFDR = .1064, eFDR = .1071.

13
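FDR, pFDR, cFDR, and mFDR at $\alpha$ = .01 and .001; m = 100 and 1000; $F_0$, $F_1$ under independence. The cFDR values are evaluated at [E(R) + 1].
Columns: m, $m_0$, then FDR, pFDR, cFDR, mFDR for $\alpha$ = .01 and again for $\alpha$ = .001.
[Table values are not preserved in this transcript.]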

14
Conditional Distribution of V | R = r
Given $m_0$ and $\alpha$, the number of rejections is R = V + U, where $V \sim \mathrm{Bin}(m_0, \alpha)$ and $U \sim \mathrm{Bin}(m_1, 1 - \beta)$.
The conditional distribution of $V \mid R = r$ is the non-central hypergeometric distribution.
The cFDR $= E(V \mid R = r)/r$ is estimated from the mean of $V \mid R$. It can also be computed from the distribution of R.
To estimate the cFDR: use $m_0^{MD}$ and the distribution of R (parametric or bootstrap method).
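Under the binomial model above, the cFDR can be computed directly by weighting each possible value of V. This is a minimal sketch assuming V and U are independent; the function names and the small parameter values are hypothetical, and the direct use of `comb` is only suitable for modest m (a log-scale version would be needed for large arrays).

```python
from math import comb

def binom_pmf(k, n, p):
    """Binomial probability Pr(X = k) for X ~ Bin(n, p)."""
    if k < 0 or k > n:
        return 0.0
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

def cfdr(r, m0, m1, alpha, power):
    """cFDR = E(V | R = r) / r with V ~ Bin(m0, alpha), U ~ Bin(m1, power)."""
    if r <= 0:
        raise ValueError("cFDR is defined for r > 0")
    # Pr(V = v, U = r - v) for every split of the r rejections.
    weights = [binom_pmf(v, m0, alpha) * binom_pmf(r - v, m1, power)
               for v in range(r + 1)]
    total = sum(weights)              # this is Pr(R = r)
    if total == 0.0:
        raise ValueError("Pr(R = r) is zero for this r")
    return sum(v * w for v, w in enumerate(weights)) / (r * total)

# Hypothetical small example: 90 unaltered and 10 altered genes.
print(cfdr(r=9, m0=90, m1=10, alpha=0.01, power=0.8))
```

The denominator `total` is the probability Pr(R = r), so the same loop also yields the distribution of R that the slide mentions.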

15
Taiwan Academia Sinica (Metal) Data*
Control and 8 metals, 55 one-channel arrays, 684 genes.
* Data from Dr. D. T. Lee's laboratory.

16
Identifying DE Genes: Sinica Data
Objective: Control vs. As vs. Cd.
Design: 6 arrays per group (I, III, IV, VI, VII, IX; 18 arrays).
Microarray: As-chip-TCL01 (one-channel membrane array).
Probes: 708 genes, with 16 housekeeping genes.
Data filtering: spots with more than 3 zero or negative intensities were removed, leaving 540 genes.
Gene expression matrix: 540 (genes) x 18 (arrays).
Normalization: GAM (lowess) to adjust for array effects.
Significance test: the p-values were computed using the F statistic over all permutations of the 18 arrays among the three groups of 6.
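A permutation F test for one gene across three groups of six arrays might look like the sketch below. It departs from the slide in one labeled way: instead of enumerating all 18!/(6! 6! 6!) = 17,153,136 group assignments, it draws a random sample of permutations. The function names, the `n_perm` and `seed` parameters, and the toy intensities are hypothetical.

```python
import random

def f_statistic(groups):
    """One-way ANOVA F statistic for a list of groups of observations."""
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand = sum(sum(g) for g in groups) / n
    means = [sum(g) / len(g) for g in groups]
    between = sum(len(g) * (mu - grand) ** 2 for g, mu in zip(groups, means))
    within = sum((y - mu) ** 2 for g, mu in zip(groups, means) for y in g)
    if within == 0.0:
        return float("inf")
    return (between / (k - 1)) / (within / (n - k))

def permutation_f_pvalue(groups, n_perm=500, seed=0):
    """Monte Carlo permutation p-value for the F statistic (sampled, not exhaustive)."""
    rng = random.Random(seed)
    sizes = [len(g) for g in groups]
    pooled = [y for g in groups for y in g]
    observed = f_statistic(groups)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        # Re-cut the shuffled values into groups of the original sizes.
        shuffled, start = [], 0
        for s in sizes:
            shuffled.append(pooled[start:start + s])
            start += s
        if f_statistic(shuffled) >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (n_perm + 1)

# Hypothetical intensities for one gene: Control, As, Cd (6 arrays each).
control = [1.0, 2.0, 3.0, 1.5, 2.5, 2.0]
arsenic = [11.0, 12.0, 13.0, 11.5, 12.5, 12.0]
cadmium = [21.0, 22.0, 23.0, 21.5, 22.5, 22.0]
p_value = permutation_f_pvalue([control, arsenic, cadmium])
print(p_value)
```

Sampling keeps the run time small when the test is repeated for each of the 540 genes; full enumeration would reproduce the exact p-values described on the slide.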

17
MCP Analysis of Sinica Data
Total number of genes: m = 540.
Estimated number of unaltered genes: $m_0^{MD}$ = 444.
Number of rejections (r):
FWE = 0.05: with $\alpha$ = 0.05/444, r = (value missing); with $\alpha$ = 0.05/540, r = 9.
FDR = 0.05: with $\alpha$ = (0.05 x r)/444, r = (value missing); with $\alpha$ = (0.05 x r)/540, r = 27.
CWE = $\alpha$ = 0.01: r = 50.
$m_1^{MD}$ = 96: r = 96.
The FDR, pFDR, cFDR, and eFDR estimates are close.
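As a quick check on the arithmetic behind these cut-offs, the per-gene FWE thresholds and the estimated number of altered genes follow directly from m = 540 and the estimate 444; the rounded values are computed here, not quoted from the slide.

```latex
\[
\frac{0.05}{444} \approx 1.13 \times 10^{-4}, \qquad
\frac{0.05}{540} \approx 9.26 \times 10^{-5}, \qquad
m_1 = m - m_0 = 540 - 444 = 96 .
\]
```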

18
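pFDR and cFDR Estimates using Different MCP Methods
Columns: MCP, r, $p_{(r)}$, pFDR, cFDR, v*.
Rows: FWE(0.05), with $p_{(r)}$ = 1.13 x 10^-4*; FDR(0.05), with $p_{(r)}$ = 4.39 x 10^-3*; CWE(0.01); $m_1^{MD}$.
[Remaining table values are not preserved in this transcript.]
* FWE: $\alpha$ = 0.05/444; FDR: $\alpha$ = (0.05 x r)/444.
* v = r x cFDR.
* m = 540 and $m_0^{MD}$ = 444.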

19
Association Study
Relationships between genes and samples: effects of drugs (toxicants) on gene expression profiles, DNA diagnostic testing, or pathogen detection (classification).
Relationships among samples: molecular classification of different tissue types or samples on the basis of gene expression (cluster analysis).
Relationships among genes: genes of similar function yield similar expression patterns in microarray experiments (metabolic pathways, molecular function, biological process, etc.) (cluster analysis).

20
Class Prediction
Class prediction (classification): to develop a decision rule that predicts the class membership of a new sample from the expression profiles of some key genes.
Three steps:
1. Selection of the discriminatory (key) gene set.
2. Formation of the discrimination rule: Fisher's linear discriminant function, nearest-neighbor classifiers, support vector machines, or classification trees.
3. Cross-validation to estimate the accuracy of the prediction.
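Steps 2 and 3 can be sketched with a nearest-neighbor rule and cross-validation. This is a minimal illustration under stated assumptions: it uses Euclidean distance and leave-one-out cross-validation rather than the fold structure of any particular analysis, and the function names and toy profiles are hypothetical.

```python
from collections import Counter

def knn_predict(train_x, train_y, x, k=3):
    """Predict the class of profile x by majority vote among its k nearest neighbors."""
    dists = sorted(
        (sum((a - b) ** 2 for a, b in zip(row, x)), label)
        for row, label in zip(train_x, train_y)
    )
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]

def loo_accuracy(x, y, k=3):
    """Leave-one-out cross-validated accuracy of the k-NN rule."""
    correct = 0
    for i in range(len(x)):
        # Hold out sample i and train on the rest.
        train_x = x[:i] + x[i + 1:]
        train_y = y[:i] + y[i + 1:]
        if knn_predict(train_x, train_y, x[i], k) == y[i]:
            correct += 1
    return correct / len(x)

# Hypothetical expression profiles on two key genes for two classes.
profiles = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
labels = ["a", "a", "a", "b", "b", "b"]
print(knn_predict(profiles, labels, [0.5, 0.5]))
print(loo_accuracy(profiles, labels))
```

In this sketch the key genes are taken as given; step 1 (gene selection) is not part of the code.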

21
Class Prediction: Sinica Data
Nine different treatments: Control, As, Cd, Ni, Cr, Sb, Pb, Cu, and AsV, for a total of 55 samples (arrays).
Number of genes: 684 genes (some in duplicate or triplicate).
Gene expression matrix: 684 (genes) x 55 (arrays).
Normalization: GAM (lowess) to adjust for array effects.
Gene sets: five gene sets are considered.
Classification methods: Fisher's linear discriminant function and nearest-neighbor classifiers (k-NN).
Cross-validation: 10-fold cross-validation, 11 arrays/group.

22
Selections of Discriminatory genes
Significance testing approach to gene selection:
1. F: differentially expressed (global) genes among the 9 groups, using the F test with FWE control (the FWE level and the gene count are missing).
2. T: treatment-specific marker genes. A one-vs-all t-test compares each group with the 8 remaining groups at adjusted p = 0.01, giving $G_i$; $T = G_1 \cup \dots \cup G_9$. 89 genes.
3. $I = F \cap T$: intersection of F and T. 25 genes.
4. $U = F \cup T$: union of F and T. 102 genes.
5. A: original gene set. 684 genes.

23
Average accuracy (%) of k-NN multi-class classification, based on 11-fold cross-validation over 1,000 permutations.
Columns: Metal, n, and the gene sets I, F, T, U, A. Rows: # of genes, Control, As, AsV, Cd, Cu, Ni, Cr, Sb, Pb, Total.
[Table values are not preserved in this transcript.]
The FLDA algorithm performed poorly; for example, its overall accuracies are 67.9% and 40.5% for I and F, respectively.

24
Cluster analysis with a 2-MDS plot for the treatment-specific marker genes in I. Each gene is labeled with the compound for which it shows a unique expression.
Number of genes in I by metal: Ctrl 7, As 1, AsV 1, Cd 3, Cu 2, Ni 4, Cr 1, Sb 8, Pb 0.
Clustering: $(1 - \rho)$ metric, complete linkage.
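Complete-linkage clustering with a correlation-based distance can be sketched as below. The sketch assumes the metric on the slide is one minus the Pearson correlation, which the transcript does not confirm; the function names and the four toy profiles are hypothetical, and a constant profile (zero variance) is not handled.

```python
from math import sqrt

def corr_distance(a, b):
    """Distance 1 - r, where r is the Pearson correlation of two profiles."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return 1.0 - cov / sqrt(var_a * var_b)

def complete_linkage(profiles, n_clusters):
    """Agglomerative clustering; cluster distance is the largest pairwise distance."""
    clusters = [[i] for i in range(len(profiles))]
    while len(clusters) > n_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = max(corr_distance(profiles[a], profiles[b])
                        for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        merged = clusters.pop(j)      # j > i, so index i is unaffected
        clusters[i].extend(merged)
    return clusters

# Hypothetical profiles: two rising and two falling across four arrays.
profiles = [[1, 2, 3, 4], [2, 4, 6, 8.5], [4, 3, 2, 1], [8, 6, 4, 2.5]]
result = complete_linkage(profiles, n_clusters=2)
print(result)
```

Because the distance depends only on correlation, profiles with the same shape cluster together even when their absolute levels differ.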

25
Clustering results with 2-MDS plots of the 55 arrays for gene sets I and A.
Panels: gene set I (25 genes); gene set A (684 genes).

26
Acknowledgements
Collaborators and Contributors
Dr. Frank Sistare & Staff (CDER/FDA; Merck)
Dr. Sue-Jane Wang (CDER/FDA)
Dr. T-C Lee & Staff (Academia Sinica, Taiwan)
Dr. C-h Chen & Staff (Academia Sinica, Taiwan)
Dr. Suzanne Morris & Staff (NCTR)
Dr. Jim Fuscoe & Staff (NCTR)
Dr. Ralph Kodell (NCTR)
Dr. Robert Delongchamp (NCTR)
Dr. Hueymiin Hsueh (Cheng-chi Univ., Taiwan)
Dr. Chen-an Tsai (NCTR)
Ms. Yi-Ju Chen (Penn State, NCTR)
