Clustering Manpreet S. Katari.


Steps for Clustering
1. Calculate distance. To group similar observations or attributes, we need a way to measure their difference: Euclidean, Pearson, or Spearman.
2. Group similar observations or attributes: K-means or hierarchical clustering.
3. Determine the quality of the clustering: silhouette plots.

Euclidean Distance
Geometric difference between expression profiles.
[Figure: gene expression (y-axis) plotted across experiments (x-axis)]
expvalues.dist <- dist(expvalues, method="euclidean")
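As a small illustration of what dist() computes, the sketch below uses a made-up 3 x 4 matrix (the values and the names gene1..gene3 and manual.d12 are assumptions for the example, not data from the slides) and checks one entry by hand.

```r
# Toy expression matrix: 3 genes (rows) x 4 experiments (columns); values are made up.
expvalues <- matrix(c(1, 2, 3, 4,
                      2, 4, 6, 8,
                      4, 3, 2, 1),
                    nrow = 3, byrow = TRUE,
                    dimnames = list(c("gene1", "gene2", "gene3"),
                                    paste0("exp", 1:4)))

# dist() computes distances between the rows (genes) of the matrix.
expvalues.dist <- dist(expvalues, method = "euclidean")

# The same distance for gene1 vs gene2, computed directly from the definition:
# square root of the summed squared differences across experiments.
manual.d12 <- sqrt(sum((expvalues["gene1", ] - expvalues["gene2", ])^2))

print(expvalues.dist)
print(manual.d12)
```

Note that gene1 and gene2 have the same shape (gene2 is gene1 doubled) but are still far apart in Euclidean terms, because this distance is sensitive to magnitude.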

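Pearson correlation
Start with the concept of covariance:
$$\mathrm{Cov}_{xy} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{n-1}$$
Implication for gene expression: the shape of the gene expression responses will determine similarity.
Normalize the measure by the variances of the two measurements, VarX and VarY, to obtain the Pearson correlation coefficient:
$$r_{xy} = \frac{\mathrm{Cov}_{xy}}{\sqrt{(\mathrm{Var}X)(\mathrm{Var}Y)}} = \frac{\frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{(\mathrm{Var}X)(\mathrm{Var}Y)}}$$
The Pearson correlation has the nice property of varying between -1 and 1.
[Figure: gene expression (y-axis) plotted across experiments (x-axis)]
expvalues.dist <- as.dist(1 - cor(t(expvalues), method="pearson"))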
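The covariance and normalization steps can be checked numerically. This is a minimal sketch on made-up vectors (x, y, and the toy matrix are assumptions for illustration); it computes the coefficient from the definition and then builds the correlation-based distance used on the slide.

```r
# Two made-up expression profiles with the same shape but different magnitude.
x <- c(1, 2, 3, 4)
y <- c(2, 4, 6, 8)
n <- length(x)

# Covariance from the definition, then normalization by the two variances.
cov.xy <- sum((x - mean(x)) * (y - mean(y))) / (n - 1)
r.xy   <- cov.xy / sqrt(var(x) * var(y))

# Correlation-based distance on a toy matrix (rows are genes).
expvalues <- rbind(gene1 = x, gene2 = y, gene3 = c(4, 3, 2, 1))
expvalues.dist <- as.dist(1 - cor(t(expvalues), method = "pearson"))

print(r.xy)            # matches cor(x, y)
print(expvalues.dist)  # 0 for same-shape genes, 2 for opposite-shape genes
```

With 1 - r as the distance, perfectly correlated genes are at distance 0 and perfectly anti-correlated genes are at distance 2, regardless of magnitude.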

Spearman Rank Correlation
A Pearson correlation of ranks.
Gives a better correlation when the relationship is monotonic but not linear.
Less sensitive to outliers than Pearson.
expvalues.dist <- as.dist(1 - cor(t(expvalues), method="spearman"))
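To make "a Pearson correlation of ranks" concrete, here is a short sketch on a made-up monotonic but non-linear relationship (the vectors x and y are assumptions for the example).

```r
# A monotonic but strongly non-linear relationship (made-up values).
x <- 1:6
y <- exp(x)

pearson  <- cor(x, y, method = "pearson")
spearman <- cor(x, y, method = "spearman")

# Spearman is the Pearson correlation computed on the ranks of the values.
spearman.by.hand <- cor(rank(x), rank(y), method = "pearson")

print(c(pearson = pearson, spearman = spearman, by.hand = spearman.by.hand))
```

Because y always increases with x, the ranks agree perfectly and the Spearman value is 1, while the Pearson value is below 1.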

Hierarchical Clustering: Example
[Figure: a gene-by-gene correlation matrix (gene 1, gene 2, gene 3; values shown: 1, 0.5, 0.8, 0.6) next to an expression matrix with rows Gene 1 … Gene 10,000 and columns Experiment 1 … Experiment 10]
1. Find the cells with the shortest distance (e.g., the highest correlation).
2. Group the two genes.
3. Recalculate distances from the joined group to the rest of the matrix.
4. Find the next shortest distance.
5. Repeat 3-4 until done.
expvalues.hclust <- hclust(expvalues.dist, method="average")

Some linkage methods
Single linkage
Complete linkage
Centroid linkage
Average linkage
expvalues.hclust <- hclust(expvalues.dist, method="average")
expvalues.groups <- cutree(expvalues.hclust, h=0.7)   # cut the tree at height 0.7
expvalues.groups <- cutree(expvalues.hclust, k=10)    # or cut the tree into 10 groups
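The following sketch runs the hierarchical clustering calls end to end on a made-up matrix of six genes (three rising, three falling; all values and the names up1..down3 are assumptions for illustration). It compares the height of the final merge under three linkage methods and shows both ways of cutting the tree. Centroid linkage is left out because in R's hclust() it is intended for squared Euclidean distances.

```r
# Made-up expression matrix: three "rising" and three "falling" genes, 5 experiments.
expvalues <- rbind(
  up1   = c(1.0, 2.0, 3.0, 4.0, 5.0),
  up2   = c(1.1, 2.0, 3.2, 3.9, 5.1),
  up3   = c(2.0, 3.0, 4.0, 5.0, 6.5),
  down1 = c(5.0, 4.0, 3.0, 2.0, 1.0),
  down2 = c(5.2, 4.1, 2.9, 2.0, 0.8),
  down3 = c(6.0, 5.0, 4.2, 3.0, 2.0)
)
expvalues.dist <- as.dist(1 - cor(t(expvalues), method = "pearson"))

# Height of the last merge under different linkage rules:
# single uses the closest pair, complete the farthest pair, average the mean.
linkages <- c("single", "complete", "average")
top.heights <- sapply(linkages, function(m) {
  max(hclust(expvalues.dist, method = m)$height)
})

# Cut the average-linkage tree either into a fixed number of groups (k)
# or at a fixed height (h).
expvalues.hclust <- hclust(expvalues.dist, method = "average")
groups.k <- cutree(expvalues.hclust, k = 2)
groups.h <- cutree(expvalues.hclust, h = 0.7)

print(top.heights)
print(groups.k)
print(groups.h)
```

On this toy data both cuts separate the rising genes from the falling genes; on real data the choice of h or k usually needs inspection of the dendrogram.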

K-means
Decide how many clusters you want, K (Principal Components Analysis can help).
Step 1: Make random assignments and compute centroids (big dots).
Step 2: Assign points to the nearest centroids.
Step 3: Re-compute centroids (in this example, the solution is now stable).
Optimize by maximizing the distance between groups and minimizing the distance within groups.
# kmeans() takes the data matrix itself, not a distance object
expvalues.kmeans <- kmeans(expvalues, 10)

K-means in action: creates round clouds

K-means weaknesses: can give you a different result each time with exactly the same data
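The run-to-run variation comes from the random starting assignments. One common way to make results repeatable, sketched below on made-up data (the matrix and the seed value are assumptions for the example), is to fix the random seed and to let kmeans() try several random starts through its nstart argument, keeping the best one.

```r
# Made-up expression matrix: three "rising" and three "falling" genes.
expvalues <- rbind(
  up1   = c(1.0, 2.0, 3.0, 4.0, 5.0),
  up2   = c(1.1, 2.0, 3.2, 3.9, 5.1),
  up3   = c(2.0, 3.0, 4.0, 5.0, 6.5),
  down1 = c(5.0, 4.0, 3.0, 2.0, 1.0),
  down2 = c(5.2, 4.1, 2.9, 2.0, 0.8),
  down3 = c(6.0, 5.0, 4.2, 3.0, 2.0)
)

# Fixing the seed makes the random starts identical between runs;
# nstart = 25 keeps the best of 25 random starts.
set.seed(42)
fit1 <- kmeans(expvalues, centers = 2, nstart = 25)
set.seed(42)
fit2 <- kmeans(expvalues, centers = 2, nstart = 25)

same.result <- identical(fit1$cluster, fit2$cluster)
print(fit1$cluster)
print(same.result)
```

The cluster labels themselves (1 or 2) are arbitrary, so comparisons between runs should look at which genes are grouped together rather than at the label values.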

Interpreting Clusters
Take-home message: you can cluster anything you want, and anything in the right format will cluster. That doesn't make it relevant.

Judging the Quality of a Cluster
The idea is to classify distinct groups; other methods seek to directly optimize this trait in classification.

Measuring the Quality of Clusters
[Figure: example points annotated "this one has a low a_i and a big b_i, so a high silhouette value" and "these will lead to low or negative silhouette values"]
Silhouette width: Sil_i = (b_i - a_i) / max(a_i, b_i)
a_i: average within-cluster distance with respect to gene i
b_i: average between-cluster distance with respect to gene i (the average distance from gene i to the members of the nearest other cluster)
library(cluster)
expvalues.sil <- silhouette(expvalues.groups, expvalues.dist)
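To show how the silhouette width is assembled from a_i and b_i, the sketch below recomputes it in base R on four made-up points. The helper silhouette.width is hypothetical and written only for this illustration; it is not the cluster package's silhouette() function, although it is meant to follow the same definition.

```r
# Hypothetical helper: silhouette width for every observation, from a
# distance object and a vector of cluster labels.
silhouette.width <- function(d, groups) {
  d <- as.matrix(d)
  sapply(seq_along(groups), function(i) {
    own <- groups == groups[i]
    own[i] <- FALSE                      # exclude the point itself
    if (!any(own)) return(0)             # convention for singleton clusters
    a <- mean(d[i, own])                 # average distance within own cluster
    other <- groups != groups[i]
    # average distance to each other cluster; keep the nearest one
    b <- min(tapply(d[i, other], groups[other], mean))
    (b - a) / max(a, b)
  })
}

# Four made-up one-dimensional points forming two tight, well-separated groups.
values <- c(1, 2, 10, 11)
groups <- c(1, 1, 2, 2)
sil <- silhouette.width(dist(values), groups)
print(sil)
```

For the first point, a = 1 and b = 9.5, so the width is 8.5 / 9.5; values near 1 indicate a point that sits well inside its own cluster.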

Silhouette Plot
plot(expvalues.sil, col="blue")

Heatmaps
Heatmaps allow us to visually inspect how well the chosen clustering method separates genes in terms of their expression values.
The defaults that the heatmap uses for clustering can be changed by supplying new functions that calculate the distance and perform the clustering.
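A minimal sketch of overriding those defaults, assuming the base R heatmap() function is the one meant here: its distfun and hclustfun arguments accept replacement functions. The toy matrix is made up for illustration.

```r
# Made-up expression matrix: three "rising" and three "falling" genes.
expvalues <- rbind(
  up1   = c(1.0, 2.0, 3.0, 4.0, 5.0),
  up2   = c(1.1, 2.0, 3.2, 3.9, 5.1),
  up3   = c(2.0, 3.0, 4.0, 5.0, 6.5),
  down1 = c(5.0, 4.0, 3.0, 2.0, 1.0),
  down2 = c(5.2, 4.1, 2.9, 2.0, 0.8),
  down3 = c(6.0, 5.0, 4.2, 3.0, 2.0)
)
colnames(expvalues) <- paste0("exp", 1:5)

# Replace the default Euclidean distance and complete linkage with
# Pearson-correlation distance and average linkage.
pearson.dist   <- function(x) as.dist(1 - cor(t(x), method = "pearson"))
average.hclust <- function(d) hclust(d, method = "average")

hm <- heatmap(expvalues,
              distfun   = pearson.dist,
              hclustfun = average.hclust,
              scale     = "row")

# heatmap() invisibly returns the row and column orderings it used.
print(hm$rowInd)
```

The same two functions are applied to rows and, on the transposed matrix, to columns, so the experiments are also ordered by correlation.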

Genome-wide analysis results in gene lists
When analyzing high-throughput data, such as a microarray experiment, the end result is often a list of genes:
- Differentially expressed genes
- A cluster of highly correlated genes
A natural next step is to identify the commonality between the genes in the list:
- Similar annotations
- Same pathway
- Components of a protein complex

Gene lists as a discovery tool
Depending on how the gene list was created, the genes can be used for discovering new things.
For example, if you have a cluster of highly correlated genes, you can look for novel transcription factor binding sites by aligning the promoter regions of the genes in the cluster.
Many genes in the genome are still annotated as "unknown function". Finding an "unknown" gene in a list consisting only of genes up-regulated by a given treatment allows biologists to propose a putative function for the unknown gene.

Gene Set Enrichment
Often when we characterize this list of genes, we use statistics to show that a property or annotation is significantly over-represented compared to a list created at random.
Two of the common statistical methods are:
- Hypergeometric test
- Fisher's exact test
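As an illustration of how the two tests are applied to a gene list, the sketch below uses entirely hypothetical counts (a genome of 10,000 genes, 200 of them carrying some annotation, and a list of 100 genes of which 10 carry it). It asks how likely it is to see 10 or more annotated genes in a random list of that size.

```r
# Hypothetical counts, for illustration only.
genome.size      <- 10000  # genes in the genome
annotated        <- 200    # genes carrying the annotation of interest
list.size        <- 100    # genes in our list
annotated.inlist <- 10     # genes in the list carrying the annotation

# Hypergeometric test: P(X >= 10) when drawing list.size genes at random.
p.hyper <- phyper(annotated.inlist - 1,
                  annotated, genome.size - annotated,
                  list.size, lower.tail = FALSE)

# One-sided Fisher's exact test on the corresponding 2 x 2 table.
tab <- matrix(c(annotated.inlist,             list.size - annotated.inlist,
                annotated - annotated.inlist,
                genome.size - annotated - (list.size - annotated.inlist)),
              nrow = 2,
              dimnames = list(c("annotated", "not annotated"),
                              c("in list", "not in list")))
p.fisher <- fisher.test(tab, alternative = "greater")$p.value

print(tab)
print(c(hypergeometric = p.hyper, fisher = p.fisher))
```

For over-representation the one-sided Fisher's exact test and the upper tail of the hypergeometric distribution give the same p-value, which is why the two are often used interchangeably.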

Gene Ontology
"The Gene Ontology (GO) project is a collaborative effort to address the need for consistent descriptions of gene products in different databases."
"The GO project has developed three structured controlled vocabularies (ontologies) that describe gene products in terms of their associated biological processes, cellular components and molecular functions in a species-independent manner."
"A gene product might be associated with or located in one or more cellular components; it is active in one or more biological processes, during which it performs one or more molecular functions."

GO is a directed acyclic graph

Categorical Data Analysis
Biomedical data are often represented in groups of more than one category.
Often we want to know whether there is any association between the categories (for example, age and ER).
The chi-square test can be used to test the association. The null hypothesis is that there is no association.
When values in the table are less than 10, Fisher's exact test is more appropriate.

Chi-Square Test
The first step of the chi-square test is to determine the expected number for each count.
For the value in the table at position [1,1], the expected value is (sum(row1) * sum(col1)) / totalcount.
The chi-square statistic is Σ (E – O)^2 / E, summed over all cells of the table.
The p-value can be determined from the chi-square distribution with (number of rows – 1) x (number of columns – 1) degrees of freedom.
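The steps above can be followed by hand on a small table. This is a sketch with made-up counts (the matrix observed is an assumption for the example); it is compared against chisq.test() with correct = FALSE, because R applies a continuity correction to 2 x 2 tables by default.

```r
# Made-up 2 x 2 table of observed counts.
observed <- matrix(c(30, 20,
                     10, 40),
                   nrow = 2, byrow = TRUE,
                   dimnames = list(c("row1", "row2"), c("col1", "col2")))

# Expected count for each cell: (row total * column total) / grand total.
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

# Chi-square statistic and degrees of freedom.
chisq.stat <- sum((expected - observed)^2 / expected)
df <- (nrow(observed) - 1) * (ncol(observed) - 1)
p.value <- pchisq(chisq.stat, df = df, lower.tail = FALSE)

# Same test with the built-in function, without the continuity correction.
builtin <- chisq.test(observed, correct = FALSE)

print(expected)
print(c(statistic = chisq.stat, df = df, p.value = p.value))
```

Here every expected count is 20 or 30, and each cell differs from its expectation by 10, giving a statistic of 50/3 on 1 degree of freedom.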

Fisher's Exact Test
Fisher's exact test can be used regardless of the size of the values; however, it becomes hard to calculate when samples are large or too well balanced, and in those cases the chi-square test can be used instead.
Fisher showed that the probability of obtaining such values is given by the hypergeometric distribution:
$$P = \frac{(A+B)!\,(C+D)!\,(A+C)!\,(B+D)!}{A!\,B!\,C!\,D!\,N!}, \qquad N = A+B+C+D$$

                  Factor1 Level1   Factor1 Level2   Total
Factor2 Level1    A                B                A+B
Factor2 Level2    C                D                C+D
Total             A+C              B+D              A+B+C+D
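The factorial formula can be evaluated directly for a small table. The sketch below uses made-up counts (A = 3, B = 1, C = 1, D = 3 are assumptions for the example), works on the log scale with lfactorial() to avoid overflow for larger tables, and compares the result with the hypergeometric density.

```r
# Made-up 2 x 2 table: A, B in the first row; C, D in the second row.
A <- 3; B <- 1; C <- 1; D <- 3
N <- A + B + C + D

# Probability of this exact table from the factorial formula (log scale).
log.p <- lfactorial(A + B) + lfactorial(C + D) +
         lfactorial(A + C) + lfactorial(B + D) -
         (lfactorial(A) + lfactorial(B) + lfactorial(C) +
          lfactorial(D) + lfactorial(N))
p.table <- exp(log.p)

# The same probability from the hypergeometric distribution:
# A "successes" when drawing A + C items from A + B successes and C + D failures.
p.dhyper <- dhyper(A, A + B, C + D, A + C)

# fisher.test() sums the probabilities of all tables at least as extreme.
tab <- matrix(c(A, B, C, D), nrow = 2, byrow = TRUE)
p.fisher <- fisher.test(tab)$p.value

print(c(table = p.table, dhyper = p.dhyper, fisher.two.sided = p.fisher))
```

The formula gives the probability of one specific table (16/70 here); the test's p-value is larger because it adds up every table with the same margins that is no more probable than the observed one.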

Hypergeometric Test
The hypergeometric distribution is a discrete probability distribution that describes the number of successes in a sequence of n draws from a finite population without replacement.
Think of an urn with two types of marbles, blue (5) and red (45), where blue is success and red is failure. Stand next to the urn with your eyes closed and select 10 marbles without replacement. What is the probability that 4 of the 10 will be blue?
http://en.wikipedia.org/wiki/Hypergeometric_distribution
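The urn question can be answered directly in R. The sketch below uses the numbers from the slide (5 blue, 45 red, 10 draws, 4 blue) and checks the built-in dhyper() against the counting formula; the variable names are chosen for the example.

```r
blue  <- 5    # successes in the urn
red   <- 45   # failures in the urn
draws <- 10   # marbles drawn without replacement
k     <- 4    # blue marbles we ask about

# Probability of exactly 4 blue marbles among the 10 drawn.
p.exact <- dhyper(k, blue, red, draws)

# The same value by counting: ways to pick 4 of 5 blue and 6 of 45 red,
# divided by the ways to pick any 10 of 50 marbles.
p.by.counting <- choose(blue, k) * choose(red, draws - k) /
                 choose(blue + red, draws)

# Probability of 4 or more blue marbles (the tail used in enrichment tests).
p.at.least <- phyper(k - 1, blue, red, draws, lower.tail = FALSE)

print(c(exactly.4 = p.exact, counting = p.by.counting, at.least.4 = p.at.least))
```

The probability of exactly 4 blue marbles is about 0.004, so drawing that many blue marbles by chance is rare.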