Myths and Statistical Principles in DNA Microarray Research Richard Simon, D.Sc. Chief, Biometric Research Branch Head, Molecular Statistics & Bioinformatics National Cancer Institute

All cells of a multi-cellular organism contain essentially the same DNA.
Cells differ in function according to which genes are expressed and at what level.
Proteins do the work of cells, and gene expression determines the intracellular concentration of proteins.
mRNA is an intermediate product of gene expression: a gene is transcribed into an mRNA molecule, which is then translated into a protein molecule.

Types of DNA Microarrays
mRNA transcript quantification
Genomic DNA sequence determination
–SNP identification
–Genotyping
Detecting gene deletions or gene duplications

Types of Microarrays DNA microarrays Tissue microarrays Protein microarrays

[Affymetrix] Hybridization Array

Biology in Transition
Biotechnology
–Restriction enzymes
–Ligases
–Polymerases
–PCR
Instruments, Tools, Reagents and Information Resources of Major Impact
–DNA sequencing
–Functional whole genomic assays

How to Deal With the Plethora of Data
Development of software tools
Training of biologists to use tools
Collaboration with mathematical & computational scientists
Training of mathematical & computational scientists

Bioinformatics
An ambiguous term that helps further confuse people who are sometimes already confused
Refers to a range of activities, all of which involve multi-disciplinary collaboration among biological, mathematical, and computational scientists and software engineers
Organizations are searching for structures that will support quality inter-disciplinary research in bioinformatics

Organizing for Bioinformatics
Collaborative, not service oriented
Enable extensive interaction and education
Enable scientists to be stimulated by important problems and to accomplish organizational and personal goals in solving them

Molecular Statistics & Bioinformatics Section
Utilize mathematical and computational sciences in conjunction with data from genomics & high-throughput technologies to elucidate the biological basis of cancer
–translating this to effective means of eradicating cancer
Train statisticians, mathematicians, physical and biological scientists in cancer computational biology

Microarray Research Collaborative data analysis Methodology development Software development

Microarray Myths
That the greatest challenge is managing the mass of microarray data
That pattern recognition or data mining is the most appropriate paradigm for the analysis of microarray data
That pre-packaged analysis tools are a substitute for collaboration with statistical scientists in complex problems
That statistical collaboration can be a service function
That statisticians can be effective collaborators without substantial knowledge of biology and microarray technology

Applications of DNA Microarrays to Cancer Research
Identify genes and pathways involved in oncogenesis
–Transgenic mouse models
–Profiling pre-cancerous lesions
Identifying molecular targets for
–therapeutics
–early detection

Applications of DNA Microarrays to Cancer Research
Diagnostic classification
–For identifying disease subsets with distinctive pathogenesis
–For selecting therapy
Large cell lymphoma
Stage I breast cancer

DNA Microarray Analytics
Design issues
–Arrays
–Specimens
Labeling
Replication
Image analysis
–Pixels to feature
Feature analysis
–Background adjustment
–Normalization
–Features to genes
–Normalization
Analysis of biological objectives

Method of Analysis Should Be Tailored to Objectives
Class discovery
–Identifying expression profiles characteristic of non-predefined subsets of tumors
Class/phenotype prediction
–Identifying expression profiles that distinguish predefined subsets of tumors

Components of Class Prediction
Establish that expression “profiles” differ to a statistically significant degree and that the differences observed are not an artifact of examining thousands of genes
Identify genes that account for the differences between classes
Develop a multi-gene classifier to predict the class of a new sample, and estimate the misclassification rates

Do Expression Profiles Differ for Two Defined Classes of Arrays?
Not a clustering problem
–Global similarity measures generally used for clustering arrays may not distinguish classes
Supervised vs unsupervised methods
Requires multiple biological samples from each class

Do Expression Profiles Differ for Two Defined Classes of Arrays?
Global test
–Number of genes significantly differentially expressed among classes at specified nominal significance level
–Cross-validated misclassification rate
Multiple comparison adjustment for finding differentially expressed genes
–Experiment-wise error
–Univariate screening with p<0.001 threshold
–False discovery rate
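
The multiple-comparison options listed on this slide can be made concrete with a little arithmetic. The sketch below computes the expected number of false positives for an unadjusted screening threshold, and shows one common way to control the false discovery rate. The slides do not name a specific FDR procedure, so the Benjamini-Hochberg step-up rule used here is an assumption, and both function names are hypothetical.

```python
def expected_false_positives(n_genes, alpha):
    """Expected number of truly null genes that pass an unadjusted threshold."""
    return n_genes * alpha


def benjamini_hochberg(p_values, q=0.05):
    """Indices of genes declared significant by the BH step-up rule at FDR level q."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k (1-based) whose ordered p-value is <= k * q / m.
    k_max = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * q / m:
            k_max = rank
    # Every gene ranked at or below k_max is rejected, even if its own
    # p-value missed its own rank threshold (the "step-up" property).
    return sorted(order[:k_max])


# With 6000 null genes screened at p < 0.001, about 6 false positives are expected.
n_false = expected_false_positives(6000, 0.001)
```

Under these assumptions, a p<0.001 screen on a 6000-gene array is expected to let through roughly six genes by chance alone, which is the baseline against which the observed number of significant genes should be judged.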

Non-cross-validated Prediction (full data set: specimens × log-expression ratios)
1. Prediction rule is built using the full data set.
2. Rule is applied to each specimen for class prediction.
Cross-validated Prediction (Leave-one-out method; data split into training set and test set)
1. Full data set is divided into training and test sets (test set contains 1 specimen).
2. Prediction rule is built using the training set.
3. Rule is applied to the specimen in the test set for class prediction.
4. Process is repeated until each specimen has appeared once in the test set.

Prediction on Simulated Null Data
Generation of Gene Expression Profiles
–14 specimens ($P_i$ is the expression profile for specimen $i$)
–Log-ratio measurements on 6000 genes
–$P_i \sim \mathrm{MVN}(0, I_{6000})$
–Can we distinguish between the first 7 specimens (Class 1) and the last 7 (Class 2)?
Prediction Method
–Compound covariate prediction (discussed later)
–Compound covariate built from the log-ratios of the 10 most differentially expressed genes.
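
A minimal sketch of this null simulation, comparing the two prediction schemes from the previous slide. It assumes a pooled-variance t-statistic and a compound covariate built from the 10 genes with the largest |t|. All function names are hypothetical, and the slides do not show the author's code. The key point is that gene selection is repeated inside every leave-one-out fold.

```python
import math
import random


def t_stat(a, b):
    """Pooled-variance two-sample t-statistic (class 1 mean minus class 2 mean)."""
    na, nb = len(a), len(b)
    ma, mb = sum(a) / na, sum(b) / nb
    ss = sum((v - ma) ** 2 for v in a) + sum((v - mb) ** 2 for v in b)
    pooled = ss / (na + nb - 2)
    return (ma - mb) / math.sqrt(pooled * (1 / na + 1 / nb))


def score(weights, row):
    """Compound covariate: t-weighted sum of the selected log-ratios."""
    return sum(w * row[j] for j, w in weights.items())


def fit_ccp(X, y, n_top=10):
    """Select the n_top genes with largest |t|; threshold at the class-mean midpoint."""
    t = []
    for j in range(len(X[0])):
        a = [row[j] for row, c in zip(X, y) if c == 1]
        b = [row[j] for row, c in zip(X, y) if c == 2]
        t.append(t_stat(a, b))
    top = sorted(range(len(t)), key=lambda j: -abs(t[j]))[:n_top]
    weights = {j: t[j] for j in top}
    s1 = [score(weights, r) for r, c in zip(X, y) if c == 1]
    s2 = [score(weights, r) for r, c in zip(X, y) if c == 2]
    return weights, (sum(s1) / len(s1) + sum(s2) / len(s2)) / 2


def predict(model, row):
    weights, threshold = model
    # Class 1 has the larger mean score because each weight carries the sign of t.
    return 1 if score(weights, row) > threshold else 2


def resubstitution_errors(X, y):
    """Non-cross-validated: build the rule on all specimens, then re-predict them."""
    model = fit_ccp(X, y)
    return sum(predict(model, r) != c for r, c in zip(X, y))


def loocv_errors(X, y):
    """Leave-one-out: gene selection and threshold are redone in every fold."""
    errors = 0
    for i in range(len(X)):
        model = fit_ccp(X[:i] + X[i + 1:], y[:i] + y[i + 1:])
        errors += predict(model, X[i]) != y[i]
    return errors


def null_data(n_specimens=14, n_genes=6000, seed=0):
    """Independent standard-normal log-ratios; class labels carry no signal."""
    rng = random.Random(seed)
    X = [[rng.gauss(0, 1) for _ in range(n_genes)] for _ in range(n_specimens)]
    y = [1] * (n_specimens // 2) + [2] * (n_specimens - n_specimens // 2)
    return X, y
```

Calling `resubstitution_errors(*null_data())` and `loocv_errors(*null_data())` gives the two error counts for one simulated data set. On null data like this, one would expect the resubstitution count to be close to zero and the cross-validated count to be near chance; the exact numbers depend on the seed.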

Exact Permutation Test
Premise: Under the null hypothesis of no systematic difference in expression profiles between the two classes, it can be assumed that assignment of class labels to expression profiles is purely coincidental.
Performing the test
1. Consider every possible permutation of the class labels among the gene expression profiles.
2. Determine the proportion of the permutations that result in a misclassification error rate less than or equal to the observed error rate.
3. This proportion is the achieved significance level in a test of the null hypothesis.

Monte Carlo Permutation Test
Examining all permutations is computationally burdensome. Instead, a Monte Carlo method is used:
$n_{perm}$ permutations of the labels are randomly generated.
The proportion of these permutations that have $m$ or fewer misclassifications is an estimate of the achieved significance level in a test of the null hypothesis.
$n_{perm}$ is chosen such that the variability in the estimate is less than an acceptable level.
If the true proportion of permutations with $m \le 2$ is 0.05, $n_{perm} = 2000$ ensures the coefficient of variation of the estimate of the achieved significance level is less than 0.1.
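
The coefficient-of-variation claim can be checked directly: the estimate is a binomial proportion, so its CV is $\sqrt{(1-p)/(p \, n_{perm})}$, which for $p = 0.05$ and $n_{perm} = 2000$ is about 0.097. The sketch below also outlines the Monte Carlo test itself. `error_count` is a hypothetical callable that returns the misclassification count for a given labeling; in the setting of these slides it would rerun the full cross-validated prediction for each permuted labeling, whereas the toy rule in the usage lines is fixed, only to keep the example short.

```python
import math
import random


def coefficient_of_variation(p, n_perm):
    """CV of a binomial proportion estimate: sqrt((1 - p) / (p * n_perm))."""
    return math.sqrt((1 - p) / (p * n_perm))


def monte_carlo_p_value(error_count, labels, observed_errors, n_perm=2000, seed=0):
    """Fraction of random label permutations with error_count <= observed_errors."""
    rng = random.Random(seed)
    permuted = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(permuted)  # random reassignment of class labels to profiles
        if error_count(permuted) <= observed_errors:
            hits += 1
    return hits / n_perm


# Toy usage: a fixed rule calls a specimen class 1 when its value is positive.
values = [1.2, 0.8, 1.5, 0.3, -0.9, -1.1, -0.4, -1.6]
labels = [1, 1, 1, 1, 2, 2, 2, 2]


def rule_errors(lab):
    return sum((v > 0) != (c == 1) for v, c in zip(values, lab))


cv = coefficient_of_variation(0.05, 2000)
p_value = monte_carlo_p_value(rule_errors, labels, rule_errors(labels))
```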

Gene-Expression Profiles in Hereditary Breast Cancer
cDNA Microarrays: Parallel Gene Expression Analysis
Breast tumors studied:
–7 BRCA1+ tumors
–8 BRCA2+ tumors
–7 sporadic tumors
Log-ratio measurements of 3226 genes for each tumor after initial data filtering
RESEARCH QUESTION: Can we distinguish BRCA1+ from BRCA1– cancers and BRCA2+ from BRCA2– cancers based solely on their gene expression profiles?

The Compound Covariate Predictor (CCP)
We consider only genes that are differentially expressed between the two groups (using a two-sample t-test with small $\alpha$).
The CCP
–Motivated by J. Tukey, Controlled Clinical Trials, 1993
–Simple approach that may serve better than complex multivariate analysis
–A compound covariate is built from the basic covariates (log-ratios): $\mathrm{CCP}_i = \sum_j t_j x_{ij}$
$t_j$ is the two-sample t-statistic for gene $j$.
$x_{ij}$ is the log-ratio measure of sample $i$ for gene $j$.
Sum is over all differentially expressed genes.
Threshold of classification: midpoint of the CCP means for the two classes.
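
A small worked sketch of the compound covariate on toy numbers. To stay within the standard library, the significance level is replaced by a caller-supplied cutoff `t_crit` on |t|. That cutoff, the pooled-variance form of the t-statistic, and all function names are assumptions made for illustration.

```python
import math


def t_statistic(a, b):
    """Pooled-variance two-sample t-statistic (class 1 mean minus class 2 mean)."""
    na, nb = len(a), len(b)
    ma, mb = sum(a) / na, sum(b) / nb
    ss = sum((v - ma) ** 2 for v in a) + sum((v - mb) ** 2 for v in b)
    return (ma - mb) / math.sqrt(ss / (na + nb - 2) * (1 / na + 1 / nb))


def ccp_score(weights, row):
    """Sum of t_j * x_ij over the selected genes j for one sample."""
    return sum(t * row[j] for j, t in weights.items())


def fit_ccp(X, y, t_crit):
    """X: rows are samples, columns are gene log-ratios; y: class labels 1 or 2."""
    weights = {}
    for j in range(len(X[0])):
        a = [row[j] for row, c in zip(X, y) if c == 1]
        b = [row[j] for row, c in zip(X, y) if c == 2]
        t = t_statistic(a, b)
        if abs(t) >= t_crit:  # keep only differentially expressed genes
            weights[j] = t
    s1 = [ccp_score(weights, r) for r, c in zip(X, y) if c == 1]
    s2 = [ccp_score(weights, r) for r, c in zip(X, y) if c == 2]
    threshold = (sum(s1) / len(s1) + sum(s2) / len(s2)) / 2  # midpoint of class means
    return weights, threshold


def classify(weights, threshold, row):
    """Class 1 scores higher on average because weights carry the sign of t."""
    return 1 if ccp_score(weights, row) > threshold else 2


# Gene 0 separates the classes; gene 1 is noise and is filtered out.
X = [[2.0, 0.1], [3.0, -0.1], [4.0, 0.0], [-2.0, 0.0], [-3.0, 0.1], [-4.0, -0.1]]
y = [1, 1, 1, 2, 2, 2]
weights, threshold = fit_ccp(X, y, t_crit=3.0)
```

Because each weight is the signed t-statistic, genes that are higher in class 1 and genes that are higher in class 2 both push the score in a consistent direction, so no separate sign bookkeeping is needed.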

Accuracy of class prediction as selection stringency increases

Advantages of Compound Covariate Classifier
Good feature selection
Does not over-fit data
–Incorporates influence of multiple predictive variables without attempting to select the best small subset of variables
–Does not attempt to model the multivariate interactions among the predictors and outcome

Extensions Adjustment for covariates Paired samples Survival data Other classification methods More than 2 classes

Class Discovery For determining whether a set of tumors is homogeneous with regard to expression profile

Class Discovery Methods Cluster analysis Multi-dimensional Scaling

Melanoma Gene Expression Data
[Figure: axis 1 − correlation; 19-tumor cluster of interest marked]
Q: Can gene expression profiles of melanoma be used to distinguish sub-classes of disease? (M. Bittner et al., Nature Genetics, Aug 2000)

Validation of Clusters
Clustering algorithms find clusters, even when they are spurious
Clusters found may change with re-assaying tumors or selection of new tumors

Clustering Arrays Cluster significance Cluster reproducibility

Add perturbation noise to original data
Re-cluster perturbed data to assess stability of original clusters
D: Proportion of pairs of samples in a specified cluster of the original data that are in separate clusters after perturbation
R: Average number of specimens lost or gained in a specified cluster, $\lVert (C \cup P(C)) - (C \cap P(C)) \rVert$
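
A minimal sketch of the D measure described above. The slides do not specify the clustering algorithm or the noise level, so average-linkage agglomerative clustering cut at `k` clusters and Gaussian noise with standard deviation `sigma` are stand-in assumptions, and the function names are hypothetical.

```python
import itertools
import math
import random


def average_linkage(points, k):
    """Agglomerative clustering cut at k clusters; returns one label per point."""
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > k:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = sum(math.dist(points[i], points[j])
                        for i in clusters[a] for j in clusters[b])
                d /= len(clusters[a]) * len(clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] += clusters.pop(b)  # b > a, so index a is unaffected by the pop
    labels = [0] * len(points)
    for label, members in enumerate(clusters):
        for i in members:
            labels[i] = label
    return labels


def discrepancy(points, members, k, sigma, n_rep=100, seed=0):
    """D: mean proportion of within-cluster pairs split apart after perturbation.

    members: indices of the samples in the specified original cluster.
    """
    rng = random.Random(seed)
    pairs = list(itertools.combinations(members, 2))
    total = 0.0
    for _ in range(n_rep):
        noisy = [[v + rng.gauss(0, sigma) for v in p] for p in points]
        labels = average_linkage(noisy, k)
        total += sum(labels[i] != labels[j] for i, j in pairs) / len(pairs)
    return total / n_rep
```

A value of D near 0 indicates that the specified cluster survives perturbation intact, while values approaching 1 indicate that its members are routinely scattered across clusters. The result is only as meaningful as the choice of `sigma`, which should reflect realistic measurement noise.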

Melanoma Data: mn-error Method - Individual Clusters

Test of Cluster Significance
Multivariate Gaussian null hypothesis
Project to subspace determined by first three principal components
Compute the empirical distribution function (EDF) of nearest neighbor (NN) Euclidean distances between samples
Compare the NN EDF observed to that expected under the null distribution using a squared difference discrepancy metric
Estimate null distribution by sampling from 3D Gaussian distribution with mean and covariance matrix corresponding to observed data
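
The steps above can be sketched as follows, assuming the samples have already been projected onto the first three principal components (that projection is omitted here). The grid on which the EDFs are compared, the use of the average null EDF as the "expected" curve, and the Monte Carlo p-value are assumptions about details the slide leaves open; all names are hypothetical.

```python
import bisect
import math
import random


def nn_distances(points):
    """Sorted nearest-neighbor Euclidean distance of every sample."""
    return sorted(min(math.dist(p, q) for j, q in enumerate(points) if j != i)
                  for i, p in enumerate(points))


def mean_cov(points):
    """Sample mean vector and covariance matrix."""
    n, d = len(points), len(points[0])
    mu = [sum(p[a] for p in points) / n for a in range(d)]
    cov = [[sum((p[a] - mu[a]) * (p[b] - mu[b]) for p in points) / (n - 1)
            for b in range(d)] for a in range(d)]
    return mu, cov


def cholesky(cov):
    """Lower-triangular L with L L^T = cov (cov must be positive definite)."""
    d = len(cov)
    L = [[0.0] * d for _ in range(d)]
    for i in range(d):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = (math.sqrt(cov[i][i] - s) if i == j
                       else (cov[i][j] - s) / L[j][j])
    return L


def sample_gaussian(mu, L, n, rng):
    """Draw n points from the Gaussian with mean mu and covariance L L^T."""
    out = []
    for _ in range(n):
        z = [rng.gauss(0, 1) for _ in mu]
        out.append([mu[i] + sum(L[i][k] * z[k] for k in range(i + 1))
                    for i in range(len(mu))])
    return out


def cluster_significance(points, n_null=200, n_grid=50, seed=0):
    """Monte Carlo p-value for the NN-distance EDF against a Gaussian null."""
    rng = random.Random(seed)
    n = len(points)
    mu, cov = mean_cov(points)
    L = cholesky(cov)
    observed = nn_distances(points)
    nulls = [nn_distances(sample_gaussian(mu, L, n, rng)) for _ in range(n_null)]
    top = max(observed[-1], max(d[-1] for d in nulls))
    grid = [top * (g + 1) / n_grid for g in range(n_grid)]

    def edf(dists):
        return [bisect.bisect_right(dists, g) / n for g in grid]

    null_edfs = [edf(d) for d in nulls]
    expected = [sum(col) / n_null for col in zip(*null_edfs)]

    def discrepancy(e):
        return sum((a - b) ** 2 for a, b in zip(e, expected))

    obs = discrepancy(edf(observed))
    return sum(discrepancy(e) >= obs for e in null_edfs) / n_null
```

A small returned proportion suggests that the observed nearest-neighbor distances are tighter (or otherwise different) than a single Gaussian cloud with the same mean and covariance would produce.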

BRB ArrayTools: An integrated package for the analysis of DNA microarray data http://linus.nci.nih.gov/BRB-ArrayTools.html

BRB ArrayTools Design Objectives
Easy user interface
–Excel front-end
Ease of data loading
–integrated
Drill-down linkage to genomic databases
Educating biologists in microarray data analysis
Powerful analytic & visualization tools
Easily extensible
–R back-end
Portable
–Non-proprietary
Ease of development
–R back-end

Collaborators
Molecular Statistics & Bioinformatics
–Kevin Dobbin
–Lisa McShane
–Amy Peng
–Michael Radmacher
–Joanna Shih
–George Wright
–Yingdong Zhao
