Analyzing Expression Data: Clustering and Stats Chapter 16.

Goals We’ve measured the expression of genes or proteins using the technologies discussed previously. What can we do with that information? –Identify significant differences in expression –Identify similar patterns of expression (clustering)

Analysis steps 1. Data normalization 2. Statistical analysis 3. Cluster analysis

I. Data Normalization Why normalize? –Removes systematic errors –Makes the data easier to analyze statistically

Sources of Error Measurements always contain errors. –Systematic (oops) –Random (noise!) Subtracting the background level can remove some systematic error. –Using the ratio in two-channel experiments does this. –Subtracting the overall average intensity can be used with one-channel data. Taking averages over replicates of the experiment reduces the random error. Advanced error models are mentioned on p. 628 and covered in “Further Reading”.

Expression data are usually not Gaussian (normal) Many statistical tests assume that the data is normally distributed. Expression microarray spot intensity data (for example) is not. Intensity ratio data (two-channel) is not normal either. Both range from 0 to infinity, whereas a normal distribution is symmetric about its mean and has no lower bound.

Taking the logarithm helps normalize expression ratio data The expression ratio plotted versus the expression level (geometric mean) in both channels. Plotting the log ratio vs. the log expression level gives data that is centered around y=0 and fairly “normal looking”.
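The log transform described on this slide can be sketched in a few lines. This is a minimal illustration, assuming two-channel intensities named `red` and `green` (hypothetical names, not from the slides) and base-2 logarithms, which is a common but not the only choice.

```python
import math

def log_ratio_and_level(red, green):
    """Return (log2 ratio, log2 geometric-mean intensity) for one spot."""
    log_ratio = math.log2(red / green)
    # log2 of the geometric mean sqrt(red * green)
    log_level = 0.5 * math.log2(red * green)
    return log_ratio, log_level

# Hypothetical spots: two-fold up, two-fold down, unchanged.
spots = [(2000.0, 1000.0), (1000.0, 2000.0), (500.0, 500.0)]
for red, green in spots:
    print(log_ratio_and_level(red, green))
```

On the raw scale the two-fold changes are 2 and 0.5, which are not symmetric about 1; on the log scale they become +1 and -1, symmetric about 0.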

Taking the log of the expression ratio “fixes” the left tail

LOWESS Normalization Sometimes there is still a bias that depends on the expression level. This can be removed by a type of regression called “Locally Weighted Scatterplot Smoothing”. This computes the local mean of the log ratio at various values of the expression level (the geometric mean $\sqrt{RG}$ of the two channels) and subtracts it.
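The idea of subtracting a local mean can be sketched as follows. This is a simplified stand-in, not full LOWESS: it uses a tricube-weighted local average rather than local line fits with robustness iterations, and the function name, the `frac` parameter, and the sample data are assumptions made for illustration.

```python
def local_mean_normalize(levels, log_ratios, frac=0.3):
    """Subtract a tricube-weighted local mean of log_ratios along levels."""
    n = len(levels)
    k = max(2, int(round(frac * n)))  # points per local window
    corrected = []
    for x0, y0 in zip(levels, log_ratios):
        # Window half-width: distance to the k-th nearest point (self included).
        dists = sorted(abs(x - x0) for x in levels)
        h = dists[k - 1] or 1e-12
        num = den = 0.0
        for x, y in zip(levels, log_ratios):
            u = abs(x - x0) / h
            w = (1.0 - u ** 3) ** 3 if u < 1.0 else 0.0  # tricube weight
            num += w * y
            den += w
        corrected.append(y0 - num / den)
    return corrected

# Hypothetical data: a bias in the log ratio that grows with expression level.
levels = [8.0 + 0.2 * i for i in range(30)]
biased = [0.05 * (a - 8.0) for a in levels]
corrected = local_mean_normalize(levels, biased)
print(max(abs(b) for b in biased), max(abs(c) for c in corrected))
```

A local average leaves some residual bias at the ends of the range, which is one reason real LOWESS fits local lines instead.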

II. Statistical Analysis Determining what differences in expression are statistically significant Controlling false positives

When are two measurements significantly different? We want to say that an expression ratio is significant if it is big enough (>1) or small enough (<1). A two-fold ratio (for example) is only significant if the variances of the underlying measurements are sufficiently small. The significance is related to the area of the overlap of the underlying distributions.
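The link between variance and overlap can be made concrete with a small sketch. It assumes the two underlying measurements are normal on the log2 scale with equal spread; the means and standard deviations below are made-up values chosen only to show the contrast.

```python
from statistics import NormalDist

# A two-fold difference is 1 unit on the log2 scale.
tight_a, tight_b = NormalDist(10.0, 0.2), NormalDist(11.0, 0.2)
noisy_a, noisy_b = NormalDist(10.0, 1.5), NormalDist(11.0, 1.5)

# overlap() returns the shared area under the two density curves (0 to 1).
overlap_tight = tight_a.overlap(tight_b)
overlap_noisy = noisy_a.overlap(noisy_b)
print(overlap_tight, overlap_noisy)
```

With small variances the distributions barely overlap and the two-fold ratio is convincing; with large variances most of the area is shared and the same ratio is not.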

The Z-test If the data is approximately normal, convert it to a Z-score: $Z = (\bar{X} - \mu) / (\sigma / \sqrt{n})$. –$\bar{X}$ can be the mean log expression ratio; $\mu$ is then 0 –$\sigma$ is the sample standard deviation; n is the number of repeats The Z-score is distributed N(0,1) (standard normal). The significance level is the area in the tail(s) of the standard normal distribution.
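A minimal sketch of the Z-test as described, assuming the Z-score is the sample mean minus the hypothesized mean, divided by the sample standard deviation over the square root of n, and that a two-tailed significance level is wanted. The function name and the replicate values are illustrative assumptions.

```python
import math
from statistics import NormalDist, mean, stdev

def z_test(values, mu=0.0):
    """Return (Z-score, two-tailed p-value) for the mean of values against mu."""
    n = len(values)
    z = (mean(values) - mu) / (stdev(values) / math.sqrt(n))
    # Area in both tails of the standard normal beyond |z|.
    p = 2.0 * (1.0 - NormalDist().cdf(abs(z)))
    return z, p

# Hypothetical replicate log2 expression ratios for one gene.
log_ratios = [0.9, 1.1, 1.0, 1.2, 0.8]
z, p = z_test(log_ratios)
print(z, p)
```

With only a handful of replicates the sample standard deviation is itself uncertain, which is the situation the t-test is designed for.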

The t-test The t-test makes fewer assumptions about the data than the Z-test. It can be applied to compare two average measurements which can have –Different variances –Different numbers of observations You compute the t-statistic (see pages ) and then look up the significance level of the Student's t distribution in a table.
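The following sketch computes a t-statistic for two groups with different variances and sizes. It assumes the unequal-variance (Welch) form with Welch-Satterthwaite degrees of freedom, which the slides do not name explicitly; the function name and sample values are hypothetical.

```python
import math
from statistics import mean, variance

def welch_t(a, b):
    """Return (t-statistic, approximate degrees of freedom) for samples a and b."""
    na, nb = len(a), len(b)
    va, vb = variance(a) / na, variance(b) / nb  # squared standard errors
    t = (mean(a) - mean(b)) / math.sqrt(va + vb)
    # Welch-Satterthwaite approximation to the degrees of freedom.
    df = (va + vb) ** 2 / (va ** 2 / (na - 1) + vb ** 2 / (nb - 1))
    return t, df

# Hypothetical log2 expression levels in control and treated samples.
control = [7.9, 8.1, 8.0, 8.2]
treated = [9.0, 9.4, 8.8, 9.1, 9.3, 9.2]
t, df = welch_t(control, treated)
print(t, df)
```

The returned t and degrees of freedom are what you would take to a Student's t table to read off the significance level.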

III. Cluster Analysis Similar expression patterns –Groups of genes/proteins with similar expression profiles Similar expression sub-patterns –Groups of genes/proteins with similar expression profiles in a subset of conditions Different clustering methods Assessing the value of clusters

Example: Gene Expression Profiles Expression level of a gene is measured at different time points after treating cells. Many different expression profiles are possible. –No effect –Immediate increase or decrease –Delayed increase or decrease –Transient increase or decrease

Clustering by Eye n genes or proteins m different samples (or conditions) Represent a gene as a point: –$X = (x_1, x_2, \ldots, x_m)$ If m is 1 or 2 (or even 3) you can plot the points and look for clusters of genes with similar expression. –But what if m is bigger than 3? –Need to reduce the dimensionality: PCA

Reducing the Dimensionality of Data: Principal Components Analysis PCA linearly maps each point to a small set of dimensions (components). –The principal components are dimensions that capture the maximum variation in the data. The principal components capture most of the important information in the data (usually). Plotting each point’s values in two of the principal component dimensions allows us to see clusters. Figure: 2-D gel data.

PCA: An Illustration Yeast Cell Cycle Gene Expression Singular value decomposition (SVD) of a matrix X is –$X = U \Sigma V^T$ The mapped value of X is –$Y = X V$ (equivalently $Y = U \Sigma$) The rows of Y give the mapping of each gene. –Mapped gene i: $Y_i$ (2000)
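A minimal sketch of PCA through the SVD, assuming NumPy is available, that rows of X are genes and columns are conditions, and that the columns are mean-centered first. The projection used here is onto the right singular vectors, Y = XV = UΣ. The data are random numbers with an artificial group shift, made up purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 6))   # 50 hypothetical genes, 6 conditions
X[:25, :3] += 3.0              # give half the genes a shared pattern

Xc = X - X.mean(axis=0)        # center each condition (column)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

Y = Xc @ Vt.T                  # coordinates of each gene on the components
Y2 = Y[:, :2]                  # first two components, ready for a scatter plot
explained = s ** 2 / np.sum(s ** 2)  # fraction of variance per component
print(Y2.shape, explained[:2])
```

Row i of `Y2` is the two-dimensional position of gene i; plotting these rows is the "clustering by eye" step in reduced dimensions.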

Clustering Using Statistics Algorithm identifies groups. –Example: similar expression profiles Distance measure between pairs of points is needed.

Distance Measures Between Pairs of Points In order to cluster the points (genes or conditions), we need some concept of which points are “close” to each other. So we need a measure of distance (or, conversely, similarity) between two rows (or columns) in our n by m matrix. We can then compute all the pair-wise distances between rows (or columns).

Standard Distance Measures Euclidean Distance Pearson Correlation Coefficient Mahalanobis Distance

Euclidean Distance Standard, everyday distance –Treats all dimensions equally –If some genes vary more than others (have higher variance), they influence the distance more.

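Mahalanobis Distance The “normalized” Euclidean distance Scales each dimension by the variance in that dimension. (In its general form it uses the inverse of the full covariance matrix; per-dimension scaling is the special case of a diagonal covariance.) –This is useful if the genes tend to vary much more in one sample than in others since it reduces the effect of that sample on the distances.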

Pearson Correlation Coefficient Distances (for example 1 − r, where r is the correlation) are small when two genes have similar patterns of change even if the sizes of the changes are different. This is accomplished by centering each gene’s expression levels and scaling by their sample standard deviation across conditions.
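The three measures can be compared side by side in a short sketch. It assumes the per-dimension (diagonal) form of the variance-scaled distance and uses 1 − r as the correlation-based distance; the function names and the two example profiles are illustrative assumptions.

```python
import math

def euclidean(x, y):
    """Everyday distance; every dimension counts equally."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def variance_scaled(x, y, variances):
    """Euclidean distance with each dimension divided by its variance."""
    return math.sqrt(sum((a - b) ** 2 / v for a, b, v in zip(x, y, variances)))

def pearson_distance(x, y):
    """1 - r, where r is the Pearson correlation of the two profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return 1.0 - cov / (sx * sy)

# Two hypothetical genes with the same pattern but a ten-fold size difference.
gene_a = [1.0, 2.0, 3.0, 4.0]
gene_b = [10.0, 20.0, 30.0, 40.0]
print(euclidean(gene_a, gene_b), pearson_distance(gene_a, gene_b))
```

The Euclidean distance between the two profiles is large while the correlation distance is essentially zero, which is why the choice of measure changes the clusters.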

Choice of Distance Matters Hierarchical clustering (dendrogram) of tissues. –Corresponds to clustering the columns of the matrix. Branches are different (cancer B/C vs A/B).

Clustering Algorithms Hierarchical Clustering K-means clustering Self-organizing maps and trees

Hierarchical Clustering Algorithms progressively merge clusters or split clusters. –Merging criterion can be single-linkage or complete-linkage. Produce dendrograms –Can be interpreted at different thresholds.

Types of Linkage A. Single Linkage B. Complete Linkage C. Centroid Method
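A minimal sketch of merge-based (agglomerative) hierarchical clustering with single and complete linkage. It assumes Euclidean distance between points and stops at a requested number of clusters instead of building the full dendrogram; the centroid method is not shown. The function name and the points are hypothetical.

```python
import math

def agglomerate(points, n_clusters, linkage="single"):
    """Merge the two closest clusters until n_clusters remain."""
    # Single linkage: closest pair of members; complete linkage: farthest pair.
    pick = min if linkage == "single" else max
    clusters = [[i] for i in range(len(points))]

    def cluster_distance(c1, c2):
        return pick(math.dist(points[i], points[j]) for i in c1 for j in c2)

    while len(clusters) > n_clusters:
        pairs = ((i, j) for i in range(len(clusters))
                 for j in range(i + 1, len(clusters)))
        a, b = min(pairs, key=lambda ij: cluster_distance(clusters[ij[0]],
                                                          clusters[ij[1]]))
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters

points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0), (10.0, 0.0)]
print(agglomerate(points, 3, linkage="single"))
print(agglomerate(points, 3, linkage="complete"))
```

Each returned cluster is a list of indices into `points`. Recording the distance at which each merge happens would give the heights needed to draw a dendrogram and cut it at different thresholds.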

K-means Clustering Related to Expectation Maximization. You specify the number of clusters. Iteratively moves the means of the clusters to maximize the likelihood (minimize total error).
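The iterative update can be sketched as below. This assumes the standard assign-then-average loop (often called Lloyd's algorithm), Euclidean distance, and caller-supplied starting centers; how to pick the starting centers is not covered by the slides, so the values here are arbitrary.

```python
import math

def k_means(points, centers, n_iter=20):
    """Alternate between assigning points to the nearest center and re-averaging."""
    centers = [tuple(c) for c in centers]
    labels = []
    for _ in range(n_iter):
        # Assignment step: index of the nearest center for every point.
        labels = [min(range(len(centers)), key=lambda k: math.dist(p, centers[k]))
                  for p in points]
        # Update step: move each center to the mean of its members.
        new_centers = []
        for k, old in enumerate(centers):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                new_centers.append(tuple(sum(col) / len(members)
                                         for col in zip(*members)))
            else:
                new_centers.append(old)  # keep an empty cluster's center in place
        if new_centers == centers:
            break  # converged: no center moved
        centers = new_centers
    return centers, labels

points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
centers, labels = k_means(points, [(0.0, 0.0), (10.0, 10.0)])
print(centers, labels)
```

Different starting centers can lead to different final clusters, so in practice the procedure is usually run several times and the result with the smallest total error is kept.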