
1 Gene Expression Profiling

2 Good Microarray Studies Have Clear Objectives
Class Comparison (gene finding)
–Find genes whose expression differs among predetermined classes
Class Prediction
–Prediction of predetermined class using information from the gene expression profile (e.g. response vs no response)
Class Discovery
–Discover clusters of specimens having similar expression profiles
–Discover clusters of genes having similar expression profiles

3 Class Comparison and Class Prediction Not clustering problems Supervised methods

4 Levels of Replication Technical replicates –RNA sample divided into multiple aliquots Biological replicates –Multiple subjects –Multiple animals –Replication of the tissue culture experiment

5 Comparing classes or developing classifiers requires independent biological replicates. The power of statistical methods for microarray data depends on the number of biological replicates

6 Microarray Platforms Single label arrays –Affymetrix GeneChips Dual label arrays –Common reference design –Other designs

7 Common Reference Design
Array 1: A1 vs R
Array 2: A2 vs R
Array 3: B1 vs R
Array 4: B2 vs R
(specimens in the RED channel, reference in the GREEN channel)
Ai = ith specimen from class A; Bi = ith specimen from class B; R = aliquot from reference pool

8 The reference generally serves to control variation in the size of corresponding spots on different arrays and variation in sample distribution over the slide. The reference provides a relative measure of expression for a given gene in a given sample that is less variable than an absolute measure. The reference is not the object of comparison.

9

10 Dye-swap technical replicates of the same two RNA samples are not necessary with the common reference design. For two-label direct comparison designs comparing two classes, dye bias is a concern and dye swaps may be needed.

11 Class Comparison Gene Finding

12

13 Controlling for Multiple Comparisons
Bonferroni-type procedures control the probability of making any false positive error, but are overly conservative for DNA microarray studies.

14 Simple Procedures
Control the expected number of false discoveries by testing each gene for differential expression between classes at a stringent significance level
–The expected number of false discoveries in testing G genes with significance threshold p* is G·p*
–e.g., to limit the expected number of false discoveries to 10 in 10,000 comparisons, conduct each test at the p < 0.001 level
Control the FDR
–Expected proportion of false discoveries among the genes declared differentially expressed
–Benjamini-Hochberg procedure
–FDR = G·p* / #(p ≤ p*)
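The two simple procedures above fit in a few lines. This is a minimal illustration; the function names and example p-values are mine, not from the slides:

```python
def expected_false_discoveries(n_genes, p_star):
    """Expected number of false discoveries when testing n_genes
    truly-null genes at significance threshold p_star: G * p*."""
    return n_genes * p_star

def fdr_estimate(pvalues, p_star):
    """Estimated FDR at threshold p_star: G * p* / #(p <= p*).
    Returns 0 when no gene is called significant."""
    n_called = sum(1 for p in pvalues if p <= p_star)
    if n_called == 0:
        return 0.0
    return min(1.0, len(pvalues) * p_star / n_called)
```

For example, `expected_false_discoveries(10000, 0.001)` gives the 10 false discoveries quoted on the slide.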

15 Additional Procedures
Multivariate permutation tests (Korn et al., Stat Med 26:4428, 2007)
SAM – Significance Analysis of Microarrays
Advantages
–Distribution-free, even when they use t statistics
–Preserve/exploit correlation among tests by permuting each profile as a unit
–More effective than univariate permutation tests, especially with a limited number of samples

16 Randomized Variance t-test
Wright G.W. and Simon R., Bioinformatics 19:2448-2455, 2003
Pr(σ⁻² = x) = x^(a−1) exp(−x/b) / (Γ(a)·b^a), i.e., the inverse variance σ⁻² is given a gamma(a, b) prior.

17

18 Class Prediction

19 Components of Class Prediction Feature (gene) selection –Which genes will be included in the model Select model type –E.g. Diagonal linear discriminant analysis, Nearest-Neighbor, … Fitting parameters (regression coefficients) for model –Selecting value of tuning parameters

20 Feature Selection
Genes that are differentially expressed among the classes at a significance level α (e.g. 0.01)
–The α level is selected only to control the number of genes in the model
–For class comparison, the false discovery rate is important
–For class prediction, predictive accuracy is important

21 Complex Gene Selection Small subset of genes which together give most accurate predictions –Genetic algorithms Little evidence that complex feature selection is useful in microarray problems

22 Linear Classifiers for Two Classes

23 Fisher linear discriminant analysis –Requires estimating correlations among all genes selected for model Diagonal linear discriminant analysis (DLDA) assumes features are uncorrelated Compound covariate predictor (Radmacher) and Golub’s method are similar to DLDA in that they can be viewed as weighted voting of univariate classifiers

24 Linear Classifiers for Two Classes
Compound covariate predictor: the weight for gene i is its two-sample t statistic, wᵢ = tᵢ, instead of the DLDA weight wᵢ = (between-class mean difference for gene i) / (within-class variance of gene i).
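A compound covariate predictor of this kind can be sketched as follows. This is an illustrative implementation, not code from the slides: the per-gene pooled-variance t statistics serve as weights, and a new sample is classified by comparing its compound score to the midpoint of the two classes' mean training scores.

```python
import numpy as np

def fit_ccp(X, y):
    """X: samples x genes matrix of log-ratios; y: 0/1 class labels.
    Returns the weight vector (per-gene t statistics) and the cutoff."""
    X0, X1 = X[y == 0], X[y == 1]
    n0, n1 = len(X0), len(X1)
    # pooled-variance two-sample t statistic for each gene
    sp2 = (((n0 - 1) * X0.var(0, ddof=1) + (n1 - 1) * X1.var(0, ddof=1))
           / (n0 + n1 - 2))
    t = (X1.mean(0) - X0.mean(0)) / np.sqrt(sp2 * (1 / n0 + 1 / n1))
    # cutoff: midpoint of the mean compound scores of the two classes
    scores = X @ t
    cut = (scores[y == 0].mean() + scores[y == 1].mean()) / 2
    return t, cut

def predict_ccp(t, cut, X_new):
    """Assign class 1 when the compound score exceeds the cutoff."""
    return (X_new @ t > cut).astype(int)
```

In practice the weights would be computed only for the genes passing the feature-selection threshold, as described on slide 20.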

25 Linear Classifiers for Two Classes Support vector machines with inner product kernel are linear classifiers with weights determined to separate the classes with a hyperplane that minimizes the length of the weight vector

26 Support Vector Machine

27 Other Linear Methods Perceptrons Principal component regression Supervised principal component regression Partial least squares Stepwise logistic regression

28 Other Simple Methods Nearest neighbor classification Nearest k-neighbors Nearest centroid classification Shrunken centroid classification

29 Nearest Neighbor Classifier
To classify a sample in the validation set, find its nearest neighbor in the training set, i.e., the training sample whose gene expression profile is most similar to it.
–The similarity measure is based on genes selected as being univariately differentially expressed between the classes
–Correlation similarity or Euclidean distance is generally used
Classify the sample as belonging to the same class as its nearest neighbor in the training set.

30 Nearest Centroid Classifier
From the training set, select the genes that are informative for distinguishing the classes.
Compute the average expression profile (centroid) of the informative genes in each class.
Classify a sample in the validation set according to which training-set centroid its gene expression profile is most similar to.
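The nearest centroid procedure is only a few lines. A minimal sketch, with Euclidean distance as the similarity measure and the set of informative genes assumed to be given (the function names are mine):

```python
import numpy as np

def nearest_centroid_fit(X, y, informative_genes):
    """Average expression profile of the informative genes per class."""
    Xg = X[:, informative_genes]
    return {c: Xg[y == c].mean(axis=0) for c in np.unique(y)}

def nearest_centroid_predict(centroids, x_new, informative_genes):
    """Assign the class whose centroid is closest to the new profile."""
    xg = x_new[informative_genes]
    return min(centroids, key=lambda c: np.linalg.norm(xg - centroids[c]))
```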

31 When p>>n The Linear Model is Too Complex It is always possible to find a set of features and a weight vector for which the classification error on the training set is zero. It may be unrealistic to expect that there is sufficient data available to train more complex non-linear classifiers

32 Other Methods Top-scoring pairs –Claim that it gives accurate prediction with few pairs because pairs of genes are selected to work well together Random Forest –Very popular in machine learning community –Complex classifier

33 Comparative studies indicate that linear methods and nearest neighbor type methods often work as well or better than more complex methods for microarray problems because they avoid over-fitting the data.

34

35

36

37 Evaluating a Classifier Fit of a model to the same data used to develop it is no evidence of prediction accuracy for independent data –Goodness of fit vs prediction accuracy

38

39 Class Prediction
A classifier is not a set of genes.
Testing whether analysis of independent data selects the same set of genes is not an appropriate test of the predictive accuracy of a classifier.
The classifier itself may be unstable for many reasons; what should be accurate and stable is its classification of independent data.

40

41

42 Hazard ratios and statistical significance levels are not appropriate measures of prediction accuracy A hazard ratio is a measure of association –Large values of HR may correspond to small improvement in prediction accuracy Kaplan-Meier curves for predicted risk groups within strata defined by standard prognostic variables provide more information about improvement in prediction accuracy Time dependent ROC curves within strata defined by standard prognostic factors can also be useful

43

44 Time-Dependent ROC Curve
M(b) = binary marker based on threshold b
PPV = Pr{S ≥ T | M(b) = 1}
NPV = Pr{S < T | M(b) = 0}
Sensitivity = Pr{M(b) = 1 | S ≥ T}
Specificity = Pr{M(b) = 0 | S < T}
The ROC curve is sensitivity vs 1 − specificity as a function of b.

45

46 Validation of a Predictor
Internal validation
–Re-substitution estimate (very biased)
–Split-sample validation
–Cross-validation
Independent data validation

47

48 Split-Sample Evaluation
Split your data into a training set and a test set
–Randomly (e.g. 2:1)
–By center
Training set
–Used to select features, select the model type, and determine parameters and cut-off thresholds
Test set
–Withheld until a single model is fully specified using the training set
–The fully specified model is applied to the expression profiles in the test set to predict class labels
–The number of errors is counted

49 Leave-one-out Cross Validation Leave-one-out cross-validation simulates the process of separately developing a model on one set of data and predicting for a test set of data not used in developing the model

50 Leave-one-out Cross Validation Omit sample 1 –Develop multivariate classifier from scratch on training set with sample 1 omitted –Predict class for sample 1 and record whether prediction is correct

51 Leave-one-out Cross Validation Repeat analysis for training sets with each single sample omitted one at a time e = number of misclassifications determined by cross-validation Subdivide e for estimation of sensitivity and specificity

52 Cross-validation is valid only if the test set is not used in any way in the development of the model. Using the complete set of samples to select genes violates this assumption and invalidates cross-validation.
With proper cross-validation, the model must be developed from scratch for each leave-one-out training set; in particular, feature selection must be repeated for each leave-one-out training set.
The cross-validated estimate of misclassification error is an estimate of the prediction error of the model obtained by applying the specified algorithm to the full dataset.
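The correct procedure, with feature selection repeated inside every leave-one-out loop, can be sketched as follows. The gene score and the nearest-centroid classifier are illustrative choices standing in for whatever algorithm is being cross-validated:

```python
import numpy as np

def loocv_error(X, y, n_features=10):
    """Leave-one-out CV error. X: samples x genes; y: 0/1 labels.
    Feature selection and fitting are redone for each training set."""
    n = len(y)
    errors = 0
    for i in range(n):
        mask = np.arange(n) != i
        Xtr, ytr = X[mask], y[mask]
        # feature selection repeated inside the loop, training data only
        m1, m0 = Xtr[ytr == 1].mean(0), Xtr[ytr == 0].mean(0)
        s = Xtr.std(0, ddof=1) + 1e-8
        genes = np.argsort(-np.abs((m1 - m0) / s))[:n_features]
        # nearest-centroid classification on the selected genes
        c1 = Xtr[ytr == 1][:, genes].mean(0)
        c0 = Xtr[ytr == 0][:, genes].mean(0)
        xi = X[i, genes]
        pred = int(np.linalg.norm(xi - c1) < np.linalg.norm(xi - c0))
        errors += int(pred != y[i])
    return errors / n
```

Selecting the genes once from all samples and then "cross-validating" only the fit would correspond to moving the `genes = ...` line outside the loop, which is exactly the error the slide warns against.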

53 Prediction on Simulated Null Data
Generation of gene expression profiles
–14 specimens (Pᵢ is the expression profile for specimen i)
–Log-ratio measurements on 6000 genes
–Pᵢ ~ MVN(0, I₆₀₀₀)
–Can we distinguish between the first 7 specimens (Class 1) and the last 7 (Class 2)?
Prediction method
–Compound covariate predictor built from the log-ratios of the 10 most differentially expressed genes

54

55 Permutation Distribution of Cross-validated Misclassification Rate of a Multivariate Classifier
Randomly permute the class labels and repeat the entire cross-validation.
Re-do for all (or 1000) random permutations of the class labels.
The permutation p value is the fraction of random permutations that give as few misclassifications as e in the real data.
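This permutation test wraps whatever full cross-validation procedure is in use. A sketch under the assumption that `cv_error_fn(X, y)` runs the *entire* cross-validation, feature selection included (the function names are mine):

```python
import numpy as np

def permutation_pvalue(X, y, cv_error_fn, n_perm=1000, seed=0):
    """Fraction of label permutations whose cross-validated error is
    as small as the error observed with the real labels."""
    rng = np.random.default_rng(seed)
    observed = cv_error_fn(X, y)
    as_extreme = sum(cv_error_fn(X, rng.permutation(y)) <= observed
                     for _ in range(n_perm))
    # add-one correction so the p value is never exactly zero
    return (as_extreme + 1) / (n_perm + 1)
```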

56

57 Simulated Data: 40 cases, 10 genes selected from 5000

Method              Estimate  Std Deviation
True                .078
Resubstitution      .007      .016
LOOCV               .092      .115
10-fold CV          .118      .120
5-fold CV           .161      .127
Split sample 1-1    .345      .185
Split sample 2-1    .205      .184
.632+ bootstrap     .274      .084

58

59 Simulated Data: 40 cases

Method              Estimate  Std Deviation
True                .078
10-fold             .118      .120
Repeated 10-fold    .116      .109
5-fold              .161      .127
Repeated 5-fold     .159      .114
Split 1-1           .345      .185
Repeated split 1-1  .371      .065

60 Does an Expression Profile Classifier Predict More Accurately Than Standard Prognostic Variables? Not an issue of which variables are significant after adjusting for which others or which are independent predictors –Predictive accuracy and inference are different Kaplan-Meier curves of high and low risk groups can be compared within strata defined by standard prognostic variables Gene expression risk group classifier can be compared to a classifier based on standard prognostic variables based on time-dependent ROC curves

61 Survival Risk Group Prediction
Screen individual genes by fitting single-variable proportional hazards regression models.
Select genes based on a p-value threshold.
Compute the first k (e.g. 3) principal components of the selected genes.
Fit a proportional hazards regression model with the k principal components as predictors.

62 Survival Risk Group Prediction LOOCV loop: –Create training set T -i by omitting i’th case Develop supervised pc PH model on T -i Compute cross-validated predictive index for i’th case using PH model developed on T -i Compute rank of predictive index for i’th case among predictive indices for cases in T -i

63 Survival Risk Group Prediction Plot Kaplan Meier survival curves for groups based on cross-validated ranks Compute log-rank statistic comparing the cross-validated Kaplan Meier curves

64 Survival Risk Group Prediction Repeat the entire procedure for a large number of permutations of survival times and censoring indicators to generate the null distribution of the log-rank statistic –The usual chi-square null distribution is not valid because the cross-validated ranks are correlated among cases Evaluate statistical significance of the association of survival and expression profiles by referring the log-rank statistic for the unpermuted data to the permutation null distribution

65 Survival Risk Group Prediction The supervised pc method is implemented in BRB-ArrayTools BRB-ArrayTools also provides for comparing the risk group classifier based on expression profiles to one based on standard covariates and one based on a combination of both types of variables

66

67 Types of Clinical Outcome Survival or disease-free survival Response to therapy

68 90 publications identified that met criteria –Abstracted information for all 90 Performed detailed review of statistical analysis for the 42 papers published in 2004

69 Major Flaws Found in 40 Studies Published in 2004
Inadequate control of multiple comparisons in gene finding
–9/23 studies had unclear or inadequate methods to deal with false positives
–10,000 genes × .05 significance level = 500 false positives
Misleading reports of prediction accuracy
–12/28 reports based on incomplete cross-validation
Misleading use of cluster analysis
–13/28 studies invalidly claimed that expression clusters based on differentially expressed genes could help distinguish clinical outcomes
50% of studies contained one or more major flaws

70 Gene Finding
Don't use only fold-changes between groups to select genes.
Don't use a .05 significance threshold to select the differentially expressed genes.
Do use a method that controls the number of false positive differentially expressed genes.

71

72 Supervised Prediction Do frame a therapeutically relevant question and select a homogeneous set of patients Don’t use any information from the test set in developing the classifier –e.g. don’t use the test set to re-calibrate the classifier for a new platform Do access the test set once and only for testing the fully specified classifier developed with the training data

73 Supervised Prediction Don’t use the same set of features for all loops of the cross validation Do report error estimates for all classification methods tried, not just the one with the smallest error estimate Don’t consider that retaining a small separate test set adds value to a correctly cross-validated estimate of accuracy Do report the fully specified classifier with its parameters Do report the permutation statistical significance of the cross-validation estimate of prediction error

74

75

76 Conclusion Do Use BRB-ArrayTools

77 Sample Size Planning for Gene Expression Profiling
K Dobbin & R Simon. Sample size determination in microarray experiments for class comparison and prognostic classification. Biostatistics 6:27-35, 2005
K Dobbin & R Simon. Sample size planning for developing classifiers using high dimensional DNA microarray data. Biostatistics 8:101-17, 2007
K Dobbin, Y Zhao & R Simon. How large a training set is needed to develop a classifier for microarray data. Clinical Cancer Research 14:108-14, 2008

78 Sample Size Planning for Class Comparison
GOAL: identify genes differentially expressed in a comparison of two pre-defined classes of specimens, on single-label arrays or two-color arrays using a reference design.
Compare classes gene by gene, with adjustment for multiple comparisons.
Approximate expression levels as normally distributed.
Determine the number of samples n/2 per class to give power 1 − β for detecting mean difference δ at significance level α.

79 n = 4(z_(α/2) + z_β)² / (δ/σ)²
δ = mean difference in log expression between classes
σ = within-class standard deviation
Choose α small, e.g. α = .001
Use percentiles of the t distribution for improved accuracy
Comparing 2 equal-size classes; no technical replicates

80 Total Number of Samples for Two-Class Comparison
α = 0.001, β = 0.05, δ = 1 (2-fold)
σ = 0.5 (human tissue): 26 total samples
σ = 0.25 (transgenic mice): 12 total samples (t approximation)
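The sample-size formula on slide 79 is easy to evaluate with the standard library. A sketch using the normal approximation, rounding up to an even total so the two classes are equal (the slides use t percentiles, which give somewhat larger n at small sample sizes):

```python
from math import ceil
from statistics import NormalDist

def two_class_total_n(delta, sigma, alpha=0.001, beta=0.05):
    """Total samples across both classes from
    n = 4 (z_{alpha/2} + z_beta)^2 / (delta/sigma)^2,
    rounded up to an even number."""
    z = NormalDist().inv_cdf
    n = 4 * (z(1 - alpha / 2) + z(1 - beta)) ** 2 / (delta / sigma) ** 2
    return 2 * ceil(n / 2)
```

With δ = 1 and σ = 0.5 this gives 26, matching the human-tissue entry; with σ = 0.25 the normal approximation gives 8, below the t-based 12 quoted for small samples.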

81 π = proportion of genes on the array that are differentially expressed between classes
N = number of genes on the array
FD = expected number of false discoveries
TD = expected number of true discoveries
FDR ≈ FD/(FD + TD)

82 FD = α(1 − π)N
TD = (1 − β)πN
FDR = α(1 − π)N / {α(1 − π)N + (1 − β)πN} = 1 / {1 + (1 − β)π / α(1 − π)}
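This expected-FDR calculation is a one-liner (a direct transcription of the formula above; the function name is mine):

```python
def expected_fdr(alpha, beta, pi):
    """Expected FDR when a fraction pi of genes is truly differentially
    expressed, each tested at level alpha with power 1 - beta."""
    fd = alpha * (1 - pi)   # expected false discoveries, per gene on the array
    td = (1 - beta) * pi    # expected true discoveries, per gene on the array
    return fd / (fd + td)
```

Slide 83's table appears consistent with β = 0.10: for example, π = 0.01 and α = 0.001 give an expected FDR of about 9.9%.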

83 Controlling Expected False Discovery Rate (β = 0.10)

π     α      FDR
0.01  0.001  9.9%
0.01  0.005  35.5%
0.05  0.001  2.1%
0.05  0.005  9.5%

84 Number of Events Needed to Detect Gene-Specific Effects on Survival
σ = standard deviation in log2 ratios for each gene
δ = hazard ratio (>1) corresponding to a 2-fold change in gene expression

85 Number of Events Required to Detect Gene-Specific Effects on Survival (α = 0.001, β = 0.05)

Hazard ratio δ  σ    Events required
2               0.5  26
1.5             0.5  76
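The events formula itself was an image lost from slide 84, so the sketch below is an assumption on my part: the analogue of the two-class sample-size formula with log2(hazard ratio) in place of the standardized mean difference. It reproduces the table's numbers up to the normal-vs-t approximation (both rows come out about 7% below the quoted values):

```python
from math import log2
from statistics import NormalDist

def events_required(hazard_ratio, sigma, alpha=0.001, beta=0.05):
    """Assumed formula (not from the slides):
    D = 4 sigma^2 (z_{alpha/2} + z_beta)^2 / (log2 hazard_ratio)^2,
    with sigma the SD of the log2 ratios and hazard_ratio the hazard
    ratio per 2-fold change in expression."""
    z = NormalDist().inv_cdf
    return (4 * sigma ** 2 * (z(1 - alpha / 2) + z(1 - beta)) ** 2
            / log2(hazard_ratio) ** 2)
```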

86 Sample Size Planning for Class Prediction
x = expression vector
Linear classifier l(x) = w′x
Weight vector w determined from the training set T: w = a(T)
For x ∈ C₁, x ~ N(μ, Σ); for x ∈ C₂, x ~ N(−μ, Σ)
Classify a specimen to C₁ if l(x) ≥ 0

87

88

89

90 Compound Covariate Predictor

91 Simple Scenario
Σ = σ²I
μᵢ = δ for i ∈ G (gᵢ = 1)
μᵢ = 0 for i ∉ G (gᵢ = 0)

92 Optimal significance level cutoffs for gene selection (50 differentially expressed genes out of 22,000 genes on the microarrays)

2δ/σ  n=10   n=30     n=50
1     0.167  0.003    0.00068
1.25  0.085  0.0011   0.00035
1.5   0.045  0.00063  0.00016
1.75  0.026  0.00036  0.00006
2     0.015  0.0002   0.00002

93

94

