Summer Institute of Epidemiology and Biostatistics, 2008: Gene Expression Data Analysis. 8:30am-12:30pm in Room W2017. Carlo Colantuoni.

Class Outline
– Basic Biology & Gene Expression Analysis Technology
– Data Preprocessing, Normalization, & QC
– Measures of Differential Expression
– Multiple Comparison Problem
– Clustering and Classification
– The R Statistical Language and Bioconductor
– GRADES: independent project with Affymetrix data

Class Outline - Detailed
Basic Biology & Gene Expression Analysis Technology
– The Biology of Our Genome & Transcriptome
– Genome and Transcriptome Structure & Databases
– Gene Expression & Microarray Technology
Data Preprocessing, Normalization, & QC
– Intensity Comparison & Ratio vs. Intensity Plots (log transformation)
– Background correction (PM-MM, RMA, GCRMA)
– Global Mean Normalization
– Loess Normalization
– Quantile Normalization (RMA & GCRMA)
– Quality Control: batches, plates, pins, hybs, washes, and other artifacts
– Quality Control: PCA and MDS for dimension reduction
Measures of Differential Expression
– Basic Statistical Concepts
– T-tests and Associated Problems
– Significance Analysis of Microarrays (SAM) [& Empirical Bayes]
– Complex ANOVAs (limma package in R)
Multiple Comparison Problem
– Bonferroni
– False Discovery Rate (FDR) Analysis
Differential Expression of Functional Gene Groups
– Functional Annotation of the Genome
– Hypergeometric test, Χ², KS, pDens, Wilcoxon Rank Sum
– Gene Set Enrichment Analysis (GSEA)
– Parametric Analysis of Gene Set Enrichment (PAGE)
– geneSetTest
– Notes on Experimental Design
Clustering and Classification
– Hierarchical clustering
– K-means
– Classification: LDA (PAM), kNN, Random Forests
– Cross-Validation
Additional Topics
– The R Statistical Language
– Bioconductor
– Affymetrix data processing example!

DAY #4: Clustering, Classification, R and Bioconductor, Affymetrix Example

Clustering and Classification In Gene Expression Data. Carlo Colantuoni. Slide acknowledgements: Elizabeth Garrett-Mayer, Rafael Irizarry, Giovanni Parmigiani, David Madigan, Kevin Coombes, Richard Simon, Ingo Ruczinski. Classification based in part on Chapter 10 of Hand, Mannila & Smyth and Chapter 7 of Han and Kamber.

Data from Garber et al. PNAS (98), 2001.

Clustering. Clustering is an exploratory tool to see who's running with whom: genes and samples. It is "unsupervised": NOT for classification of samples, and NOT for identification of differentially expressed genes.

Hierarchical and K-Means Clustering. Clustering organizes things that are close into groups. What does it mean for two genes to be close? What does it mean for two samples to be close? Once we know this, how do we define groups?

Distance. We need a mathematical definition of distance between two points. What are points? If each gene is a point, what is the mathematical definition of a point?

Points. Gene1 = (E_11, E_12, …, E_1N)′; Gene2 = (E_21, E_22, …, E_2N)′; Sample1 = (E_11, E_21, …, E_G1)′; Sample2 = (E_12, E_22, …, E_G2)′; where E_gi = expression of gene g in sample i, taken from the G × N data matrix (G genes, N samples).

Most Famous Distance: Euclidean distance. Example, the distance between genes 1 and 2: d(1,2) = sqrt( Σ_{i=1..N} (E_1i − E_2i)² ). When N is 2, this is distance as we know it (remember high school? the Pythagorean Theorem; think of the Baltimore-DC distance on a map). When N is 20,000, you have to think abstractly.
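
As a minimal illustration (not from the original slides; the toy matrix is invented), base R's dist() computes exactly these Euclidean distances:

    # Toy expression matrix: 10 genes (rows) x 4 samples (columns)
    set.seed(1)
    x <- matrix(rnorm(40), nrow = 10,
                dimnames = list(paste0("gene", 1:10), paste0("sample", 1:4)))
    d.genes   <- dist(x)      # Euclidean distances between genes (rows), the default
    d.samples <- dist(t(x))   # transpose first to get distances between samples
    round(as.matrix(d.samples), 2)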

Correlation can also be used to compute distance: Pearson correlation (r), Spearman correlation, uncentered correlation, absolute value of correlation (or r²).

The difference is that, if you have two vectors X and Y with identical shape, but which are offset relative to each other by a fixed value, they will have a standard Pearson correlation (centered correlation) of 1 but will not have an uncentered correlation of 1.
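
A small sketch in R of the offset behavior described above (my own illustration, not from the slides; uncentered() is a hypothetical helper written out from the definition):

    # Pearson (centered) vs. uncentered correlation for two vectors with the
    # same shape but offset by a fixed value
    uncentered <- function(a, b) sum(a * b) / sqrt(sum(a^2) * sum(b^2))
    a <- c(1, 2, 3, 4)
    b <- a + 10                  # identical shape, constant offset
    cor(a, b)                    # 1: Pearson subtracts the means first
    uncentered(a, b)             # about 0.95: the offset now matters
    # A correlation-based distance between genes (rows of a matrix x):
    # d <- 1 - cor(t(x))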

The similarity/distance matrices: the G × N data matrix is converted into a G × G gene-gene similarity matrix.

The similarity/distance matrices: the G × N data matrix is converted into an N × N sample-sample similarity matrix.

Gene and Sample Selection. Do you want all genes included? What to do about replicates from the same individual/tumor? Genes that contribute noise will affect your results, and if all genes are included, the dendrogram can't all be seen at the same time. Perhaps screen the genes?

Two commonly seen clustering approaches in gene expression data analysis:
Hierarchical clustering
– Dendrogram (red-green picture)
– Allows us to cluster both genes and samples in one picture and see the whole dataset "organized"
K-means/K-medoids
– Partitioning method
– Requires user to define K = # of clusters a priori
– No picture to (over)interpret

Hierarchical Clustering: the most overused statistical method in gene expression analysis. It gives us a pretty red-green picture with patterns, but the pretty picture tends to be pretty unstable. There are many different ways to perform hierarchical clustering, and they tend to be sensitive to small changes in the data. You are provided with clusters of every size: where to "cut" the dendrogram is user-determined.

Choose clustering direction.
Agglomerative clustering (bottom-up)
– Starts with each gene in its own cluster
– Joins the two most similar clusters
– Then joins the next two most similar clusters
– Continues until all genes are in one cluster
Divisive clustering (top-down)
– Starts with all genes in one cluster
– Chooses the split so that genes in the two clusters are most similar (maximize "distance" between clusters)
– Finds the next split in the same manner
– Continues until all genes are in single-gene clusters

Choose linkage method (if bottom-up):
– Single linkage: join clusters whose distance between closest genes is smallest (elliptical clusters)
– Complete linkage: join clusters whose distance between furthest genes is smallest (spherical clusters)
– Average linkage: join clusters whose average distance is the smallest
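
In R, agglomerative clustering with each of these linkage methods is available through base hclust(); a minimal sketch on invented toy data:

    set.seed(1)
    x <- matrix(rnorm(200), nrow = 20)             # 20 genes x 10 samples (toy data)
    d <- dist(x)                                   # Euclidean distance between genes
    hc.single   <- hclust(d, method = "single")    # favors elongated clusters
    hc.complete <- hclust(d, method = "complete")  # favors compact, spherical clusters
    hc.average  <- hclust(d, method = "average")
    plot(hc.average)                               # the dendrogram
    cutree(hc.average, k = 4)                      # the user decides where to "cut"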

Dendrogram Creation + Interpretation

Cluster Assignment

Simulated data with 4 clusters of samples (1-10, 11-20, 21-30, 31-40): 450 relevant genes plus "noise" genes.

K-means and K-medoids: partitioning methods. You don't get a pretty picture, and you MUST choose the number of clusters K a priori. More of a "black box" because the output is most commonly looked at purely as assignments. Each object (gene or sample) gets assigned to a cluster. Begin with an initial partition, then iterate so that objects within clusters are most similar.

K-means (continued). Euclidean distance is most often used, giving spherical clusters. It can be hard to choose or figure out K. The solution is not unique: the clustering can depend on the initial partition. No pretty figure to (over)interpret.

K-means Algorithm
1. Choose K centroids at random.
2. Make an initial partition of objects into K clusters by assigning each object to its closest centroid.
3. Calculate the centroid (mean) of each of the K clusters.
4. For each object i = 1,…,N: (a) calculate its distance to each of the centroids; (b) allocate object i to the cluster with the closest centroid; (c) if object i was reallocated, recalculate the centroids based on the new clusters.
5. Repeat steps 3 and 4 until no reallocations occur.
6. Assess cluster structure for fit and stability.
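
Base R's kmeans() implements this algorithm; a minimal sketch (the data are invented for the example):

    set.seed(1)                                  # results depend on starting values
    x  <- matrix(rnorm(200), nrow = 20)          # 20 objects x 10 dimensions
    km <- kmeans(x, centers = 4, nstart = 25)    # nstart: try 25 random starts
    km$cluster                                   # cluster assignment per object
    km$centers                                   # final centroids
    km$tot.withinss                              # within-cluster sum of squares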

K-means. We start with some data. Interpretation: we could be showing expression for two samples across 14 genes, or expression for two genes across 14 samples; here it is the 2-gene view. Iteration = 0.

K-means. Choose K centroids. These are starting values that the user picks; there are some data-driven ways to do it. Iteration = 0.

K-means. Make the first partition by finding the closest centroid for each point; this is where distance is used. Iteration = 1.

K-means. Now re-compute the centroids by taking the middle of each cluster. Iteration = 2.

K-means. Repeat until the centroids stop moving, or until you get tired of waiting. Iteration = 3.

K-means Limitations. Final results depend on the starting values. How do we choose K? There are methods, but not much theory saying what is best. And where are the pretty pictures?

Assessing cluster fit and stability. Most often ignored: cluster structure is treated as reliable and precise, yet it can be VERY sensitive to noise and to outliers. Measures include homogeneity and separation, and cluster silhouettes: how similar genes within a cluster are to genes in other clusters (Rousseeuw, Journal of Computational and Applied Mathematics, 1987).

Silhouettes. The silhouette of gene i is defined as s_i = (b_i − a_i) / max(a_i, b_i), where a_i = average distance of gene i to the other genes in the same cluster, and b_i = average distance of gene i to the genes in its nearest neighboring cluster.
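
The cluster package (which accompanies Rousseeuw's methods) provides silhouette(); a minimal sketch, reusing a k-means clustering like the one above:

    library(cluster)                         # provides silhouette()
    set.seed(1)
    x   <- matrix(rnorm(200), nrow = 20)
    km  <- kmeans(x, centers = 4, nstart = 25)
    sil <- silhouette(km$cluster, dist(x))   # needs labels + a distance matrix
    summary(sil)$avg.width                   # overall average silhouette width
    plot(sil)                                # silhouette plot, one bar per object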

WADP: Weighted Average Discrepancy Pairs. Add perturbations to the original data, then count the paired samples that clustered together in the original clustering but do not in the perturbed one. Repeat for every cutoff (i.e., for each k), do this iteratively, and estimate for each k the proportion of discrepant pairs.
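
A rough sketch in R of the perturb-and-recluster idea (my reconstruction for illustration only; it omits the weighting of the published WADP statistic, and the noise level sd is an assumption):

    set.seed(1)
    x <- matrix(rnorm(200), nrow = 20)              # 20 samples x 10 toy features
    k <- 4
    orig <- cutree(hclust(dist(x)), k = k)          # original clustering
    same.orig <- outer(orig, orig, "==")            # pairs together originally
    ut <- upper.tri(same.orig)                      # count each pair once
    disc <- replicate(100, {
      pert <- x + rnorm(length(x), sd = 1.0)        # perturb the data
      same.new <- outer(cutree(hclust(dist(pert)), k = k),
                        cutree(hclust(dist(pert)), k = k), "==")
      # proportion of originally-together pairs that broke apart
      sum(same.orig & !same.new & ut) / sum(same.orig & ut)
    })
    mean(disc)                                      # discrepancy estimate for this k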

WADP. Different levels of noise have been added. By Bittner's recommendation, 1.0 is appropriate for our dataset, but this is not well justified; external information would help determine the level of noise for the perturbation. We look for the largest k before WADP gets big.

Classification. Diagnostic tests are good examples of classifiers: a patient has a given disease or not (D vs. not-D). The classifier is a machine that accepts some clinical parameters as input and spits out a prediction for the patient. Classes must be mutually exclusive and exhaustive.

Components of Class Prediction:
– Select features (genes): which genes will be included in the model
– Select type of classifier: e.g., (D)LDA, SVM, k-nearest-neighbor, …
– Fit parameters for the model (train the classifier)
– Quantify predictive accuracy: cross-validation

Feature Selection. The goal is to identify a small subset of genes which together give accurate predictions. Methods will vary depending on the nature of the classification problem; e.g., choose genes with significant t-statistics to distinguish between two simple classes.
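
A hedged sketch in R of t-statistic-based gene screening for a two-class problem (data and the cut-off of 50 genes are invented for illustration):

    set.seed(1)
    X   <- matrix(rnorm(1000 * 20), nrow = 1000)   # 1000 genes x 20 samples
    cls <- factor(rep(c("A", "B"), each = 10))     # two simple classes
    # t-statistic for each gene across the two classes
    tstat <- apply(X, 1, function(g) t.test(g ~ cls)$statistic)
    keep  <- order(abs(tstat), decreasing = TRUE)[1:50]   # top 50 genes
    X.sel <- X[keep, ]                             # reduced matrix for the classifier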

Classifier Selection In microarray classification, the number of features is (almost) always much greater than the number of samples. Overfitting is a distinct risk, and increases with more complicated methods.

How microarrays differ from the rest of the world. Complex classification algorithms, such as neural networks, that perform better elsewhere don't do as well as simpler methods for expression data. Comparative studies have shown that simpler methods work as well or better for microarray problems because the number of candidate predictors exceeds the number of samples by orders of magnitude (Dudoit, Fridlyand and Speed, JASA 2002).

Statistical Methods Appropriate for Class Comparison may not be Appropriate for Class Prediction Demonstrating statistical significance of prognostic factors is not the same as demonstrating predictive accuracy. Demonstrating goodness of fit of a model to the data used to develop it is not a demonstration of predictive accuracy. Most statistical methods were not developed for p>>n prediction problems

Many Classification Algorithms have been Applied to and Modified for Expression Analysis LDA, QDA, DLDA, Weighted Gene Voting, kNN, SVM, Random Forests, CART, Logistic Regression, Bayesian Classifiers, very simple gene combination rules, very complex algorithms.

Linear discriminant analysis. If there are K classes, simply draw lines (planes) to divide the space of expression profiles into K regions, one for each class. If profile X falls in region k, predict class k.
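
A minimal sketch of LDA in R via MASS::lda() (data are invented; the 5-gene feature subset stands in for a pre-selection step, since with p >> n LDA cannot use all genes):

    library(MASS)                             # provides lda()
    set.seed(1)
    X <- matrix(rnorm(20 * 100), nrow = 20)   # 20 samples x 100 genes
    y <- factor(rep(c("D", "notD"), each = 10))
    fit  <- lda(X[, 1:5], grouping = y)       # fit on 5 pre-selected genes
    pred <- predict(fit, X[, 1:5])$class      # predicted class per sample
    table(predicted = pred, truth = y)        # training-set performance only!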

LDA vs. QDA (figure).

Nearest Neighbor Classification. To classify a new observation X, measure the distance d(X, X_i) between X and every sample X_i in the training set. Assign to X the class label of its "nearest neighbor" in the training set.
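
The class package's knn() implements this; a minimal sketch with invented training and test matrices:

    library(class)                                 # provides knn()
    set.seed(1)
    train <- matrix(rnorm(20 * 5), nrow = 20)      # 20 training samples x 5 features
    test  <- matrix(rnorm(4 * 5),  nrow = 4)       # 4 new observations
    cls   <- factor(rep(c("A", "B"), each = 10))   # training labels
    knn(train, test, cl = cls, k = 3)              # majority vote of 3 nearest neighbors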

Evaluating a classifier. We want to estimate the error rate when the classifier is used to predict the class of a new observation. The ideal approach is to get a set of new observations with known class labels and see how frequently the classifier makes the correct prediction. Performance on the training set is a poor substitute: it deflates the error estimate. Cross-validation methods are used to get less biased estimates of error using only the training data.

Split-Sample Evaluation.
Training set
– Used to select features, select model type, and determine parameters and cut-off thresholds
Test set
– Withheld until a single model is fully specified using the training set
– The fully specified model is applied to the expression profiles in the test set to predict class labels
– The number of errors is counted

V-fold cross-validation. Divide the data into V groups. Hold one group back, train the classifier on the other V-1 groups, and use it to predict the held-back group. Rotate through all V groups, holding each back. The error estimate is the total error rate on all V test groups.

Leave-one-out Cross-Validation. Hold one data point back, train the classifier on the other n-1 data points, and use it to predict the held-back point. Rotate through all n points, holding each back. The error estimate is the total error rate on all n test values.
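
A hedged sketch of both schemes in R (invented data; lda() stands in for any fit/predict pair). Setting V = n turns V-fold into leave-one-out:

    library(MASS)
    set.seed(1)
    X <- matrix(rnorm(30 * 10), nrow = 30)           # 30 samples x 10 features
    y <- factor(rep(c("A", "B"), each = 15))
    V <- nrow(X)                                     # V = n gives leave-one-out
    folds <- sample(rep(1:V, length.out = nrow(X)))  # random fold assignment
    errs <- sapply(1:V, function(v) {
      fit  <- lda(X[folds != v, ], grouping = y[folds != v])
      pred <- predict(fit, X[folds == v, , drop = FALSE])$class
      sum(pred != y[folds == v])                     # errors in this test group
    })
    sum(errs) / nrow(X)                              # total error rate over all folds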

Non-cross-validated prediction: 1. The prediction rule is built using the full data set. 2. The rule is applied to each specimen for class prediction. Cross-validated prediction (leave-one-out method): 1. The full data set is divided into training and test sets (the test set contains 1 specimen). 2. The prediction rule is built from scratch using the training set. 3. The rule is applied to the specimen in the test set for class prediction. 4. The process is repeated until each specimen has appeared once in the test set.

Which to use depends mostly on sample size. If the sample is large enough, split into test and training groups. If the sample is barely adequate for either testing or training, use leave-one-out. In between, consider V-fold: it can give more accurate estimates than leave-one-out, but it reduces the size of the training set.

Beware: cross-validation of a model cannot begin after the genes to be used in the model have already been selected on the full data; gene selection is part of model building and must be repeated within each cross-validation loop.

Incomplete (incorrect) Cross-Validation. Publications are using all the data to select genes and then cross-validating only the parameter-estimation component of model development. This is highly biased: many published complex methods make strong claims based on incorrect cross-validation. It is frequently seen in complex feature-set selection algorithms, and some software encourages inappropriate cross-validation.
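
A sketch of the correct procedure (my illustration, not code from the slides): gene selection is redone inside every leave-one-out fold. On pure-noise data the cross-validated error should hover near 0.5; selecting genes on the full data first would make it look far better than chance:

    library(MASS)
    set.seed(1)
    X <- matrix(rnorm(30 * 1000), nrow = 30)   # 30 samples x 1000 pure-noise genes
    y <- factor(rep(c("A", "B"), each = 15))
    errs <- sapply(seq_len(nrow(X)), function(i) {
      Xtr <- X[-i, ]; ytr <- y[-i]             # leave sample i out entirely
      # feature selection uses ONLY the training fold
      tstat <- apply(Xtr, 2, function(g) t.test(g ~ ytr)$statistic)
      top   <- order(abs(tstat), decreasing = TRUE)[1:10]
      fit   <- lda(Xtr[, top], grouping = ytr)
      predict(fit, X[i, top, drop = FALSE])$class != y[i]
    })
    mean(errs)                                 # near 0.5, as it should be for noise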

R, Bioconductor, Affymetrix Data Example