CZ5225: Modeling and Simulation in Biology Lecture 3: Clustering Analysis for Microarray Data I Prof. Chen Yu Zong Tel: 6874-6877


CZ5225: Modeling and Simulation in Biology Lecture 3: Clustering Analysis for Microarray Data I Prof. Chen Yu Zong Tel: Room 07-24, level 7, SOC1, NUS

2 Clustering Algorithms Be wary: confounding computational artifacts are associated with all clustering algorithms. You should always understand the basic concepts behind an algorithm before using it. Anything will cluster! Garbage in means garbage out.

3 Supervised vs. Unsupervised Learning Supervised: there is a teacher, class labels are known Support vector machines Backpropagation neural networks Unsupervised: No teacher, class labels are unknown Clustering Self-organizing maps

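4 Gene Expression Data Gene expression data on p genes for n samples: a matrix with genes as rows and mRNA samples as columns (sample1, sample2, sample3, sample4, sample5, …). Gene expression level of gene i in mRNA sample j = log(Red intensity / Green intensity) for two-color arrays, or log(Avg. PM - Avg. MM) for arrays with perfect-match (PM) and mismatch (MM) probes.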
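A minimal sketch of the two-color log ratio from this slide. The slide does not state the base of the logarithm; base 2 is assumed here, and the function name `log_ratio` and the intensity values are illustrative only.

```python
import math

def log_ratio(red, green):
    """Expression level as log2(red / green); base 2 is an assumption."""
    return math.log2(red / green)

# Illustrative intensities, not data from the lecture.
up = log_ratio(800.0, 200.0)         # red four times brighter than green
down = log_ratio(200.0, 800.0)       # green four times brighter than red
unchanged = log_ratio(500.0, 500.0)  # equal intensities

print(up, down, unchanged)  # 2.0 -2.0 0.0
```

With a log ratio, a four-fold increase and a four-fold decrease sit at the same distance from zero, which is why the log is taken before clustering.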

5 Expression Vectors Gene Expression Vectors encapsulate the expression of a gene over a set of experimental conditions or sample types. [Figure: the same expression vector shown as a numeric vector, a line graph, and a heatmap with a color scale from -2 to 2]

6 Expression Vectors As Points in ‘Expression Space’ [Figure: genes G1 to G4 with values at t1, t2, t3, plotted as points on the axes Experiment 1, Experiment 2 and Experiment 3; genes with similar expression lie close together]

7 Cluster Analysis Group a collection of objects into subsets or “clusters” such that objects within a cluster are more closely related to one another than to objects assigned to different clusters.

8 How can we do this? What is closely related? Distance or similarity metric What is close? Clustering algorithm How do we minimize distance between objects in a group while maximizing distances between groups?

9 Distance Metrics Euclidean distance measures the straight-line distance between two points. Manhattan (city block) distance sums the absolute differences in each dimension. Correlation measures difference with respect to linear trends. [Figure: two points, (5.5, 6) and (3.5, 4), plotted on the axes Gene Expression 1 and Gene Expression 2]

10 Clustering Gene Expression Data Cluster across the rows: group genes together that behave similarly across different conditions. Cluster across the columns: group different conditions together that behave similarly across most genes. [Figure: expression matrix with genes as rows (index i) and expression measurements as columns (index j)]

11 Clustering Time Series Data Measure gene expression on consecutive days Gene Measurement matrix G1= [ ] G2= [ ] G3= [ ] G4= [ ]

12 Euclidean Distance Distance is the square root of the sum of the squared differences between coordinates

13 City Block or Manhattan Distance G1= [ ] G2= [ ] G3= [ ] G4= [ ] Distance is the sum of the absolute differences between coordinates
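The two definitions above can be sketched directly. The example uses the two points shown on the Distance Metrics slide, (5.5, 6) and (3.5, 4); the function names are illustrative.

```python
import math

def euclidean(a, b):
    """Square root of the sum of squared coordinate differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    """Sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))

p, q = (5.5, 6.0), (3.5, 4.0)
d_euclid = euclidean(p, q)  # sqrt(2^2 + 2^2) = sqrt(8), about 2.83
d_city = manhattan(p, q)    # 2 + 2 = 4.0
print(d_euclid, d_city)
```

The Manhattan distance is never smaller than the Euclidean distance between the same two points, because it follows the axes rather than the diagonal.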

14 Correlation Distance Pearson correlation measures the degree of linear relationship between variables, with values in [-1, 1]. Distance is 1 - (Pearson correlation), with a range of [0, 2].
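A short sketch of the correlation distance, assuming plain Python lists as expression profiles. The profile values are made up for illustration: one profile rises, one is the same profile on a larger scale, and one falls.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def correlation_distance(x, y):
    """1 - Pearson correlation, so the result lies in [0, 2]."""
    return 1.0 - pearson(x, y)

g_up = [1.0, 2.0, 3.0, 4.0]
g_scaled = [10.0, 20.0, 30.0, 40.0]  # same trend, ten times larger
g_down = [4.0, 3.0, 2.0, 1.0]        # opposite trend

d_same_trend = correlation_distance(g_up, g_scaled)  # close to 0
d_opposite = correlation_distance(g_up, g_down)      # close to 2
print(d_same_trend, d_opposite)
```

The distance ignores the size of the expression change: the scaled profile is at distance 0 from the original, while the Euclidean distance between them would be large.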

15 Similarity Measurements Pearson Correlation Defined for two profiles (vectors). +1 ≥ Pearson Correlation ≥ -1

16 Similarity Measurements Cosine Correlation +1 ≥ Cosine Correlation ≥ -1
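The formulas on these two slides did not survive extraction. The sketch below uses the standard definitions, under which the Pearson correlation is the cosine correlation of the mean-centered profiles. The names `x` and `y` and the values are illustrative, not taken from the slides.

```python
import math

def cosine(x, y):
    """Cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

def center(x):
    """Subtract the mean so the profile averages to zero."""
    m = sum(x) / len(x)
    return [a - m for a in x]

def pearson(x, y):
    """Pearson correlation as the cosine of the centered profiles."""
    return cosine(center(x), center(y))

x = [1.0, 2.0, 3.0]
y = [11.0, 12.0, 13.0]  # same shape as x, shifted up by 10

cos_xy = cosine(x, y)   # below 1: the shift changes the angle
r_xy = pearson(x, y)    # 1 up to rounding: the shift is removed by centering
print(cos_xy, r_xy)
```

The practical difference is that a constant offset between two profiles lowers the cosine correlation but leaves the Pearson correlation unchanged.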

17 Hierarchical Clustering (HCL-1) IDEA: Iteratively combines genes into groups based on similar patterns of observed expression. By combining genes with genes OR genes with groups, the algorithm produces a dendrogram of the hierarchy of relationships. Display the data as a heatmap and dendrogram. Cluster genes, samples or both.

18 Hierarchical Clustering Dendrogram Venn Diagram of Clustered Data

19 Hierarchical clustering Merging (agglomerative): start with every measurement as a separate cluster then combine Splitting: make one large cluster, then split up into smaller pieces What is the distance between two clusters?

20 Distance between clusters Single-link: distance is the shortest distance from any member of one cluster to any member of the other cluster. Complete link: distance is the longest distance from any member of one cluster to any member of the other cluster. Average: distance between the averages (centroids) of all points in each cluster. Ward: merges the two clusters whose union gives the smallest increase in the within-cluster sum of squares.

21 Hierarchical Clustering-Merging Euclidean distance Average linking Gene expression time series Distance between clusters when combined

22 Manhattan Distance Average linking Gene expression time series Distance between clusters when combined

23 Correlation Distance

24 Data Standardization Data points are normalized with respect to mean and variance, “sphering” the data. After sphering, Euclidean and correlation distance are equivalent: both rank pairs of profiles in the same order. Standardization makes sense if you are not interested in the size of the effects, but in the effect itself. Results are misleading for noisy data.
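The equivalence claimed on this slide can be checked numerically. The sketch assumes that each profile is standardized to mean 0 and population variance 1; under that assumption the squared Euclidean distance between two standardized profiles of length n equals 2n(1 - r), where r is their Pearson correlation. The profile values are made up.

```python
import math

def standardize(x):
    """Shift to mean 0 and scale to variance 1 (population variance)."""
    n = len(x)
    m = sum(x) / n
    sd = math.sqrt(sum((a - m) ** 2 for a in x) / n)
    return [(a - m) / sd for a in x]

def pearson(x, y):
    """Pearson correlation as the mean product of standardized values."""
    zx, zy = standardize(x), standardize(y)
    return sum(a * b for a, b in zip(zx, zy)) / len(x)

def squared_euclidean(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y))

g1 = [0.2, 1.5, 3.1, 2.0, 0.7]
g2 = [1.0, 0.4, 2.2, 2.9, 1.8]
n = len(g1)

lhs = squared_euclidean(standardize(g1), standardize(g2))
rhs = 2 * n * (1 - pearson(g1, g2))
print(lhs, rhs)  # the two values agree up to rounding
```

Because the squared distance is a decreasing function of r, sorting gene pairs by Euclidean distance after sphering gives the same order as sorting them by correlation distance.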

25 Distance Comments Every clustering method is based SOLELY on the measure of distance or similarity E.G. Correlation: measures linear association between two genes What if data are not properly transformed? What about outliers? What about saturation effects? Even good data can be ruined with the wrong choice of distance metric

26-27 Hierarchical Clustering Initial data items: A, B, C, D. Distance matrix:
Dist   A    B    C    D
A           20   7    2
B                10   25
C                     3
D

28 Single Linkage Hierarchical Clustering Current clusters: A, B, C, D. Distance matrix as above. A and D are merged at distance 2.

29-30 Single Linkage Hierarchical Clustering Current clusters: AD, B, C. Distance matrix:
Dist   AD   B    C
AD          20   3
B                10
C

31 Single Linkage Hierarchical Clustering Current clusters: AD, B, C. Distance matrix as above. AD and C are merged at distance 3.

32-33 Single Linkage Hierarchical Clustering Current clusters: ADC, B. Distance matrix:
Dist   ADC  B
ADC         10
B

34 Single Linkage Hierarchical Clustering Current clusters: ADC, B. Distance matrix as above. ADC and B are merged at distance 10.

35 Single Linkage Hierarchical Clustering Final result: one cluster ABCD, built by merges at distances 2, 3 and 10.
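The worked example on slides 26 to 35 can be reproduced with a short sketch. The pairwise distances are read from the slide's distance matrix (A-B 20, A-C 7, A-D 2, B-C 10, B-D 25, C-D 3); the function and variable names are illustrative, and the cluster distance is recomputed from the original pairs at every step rather than by updating the matrix.

```python
from itertools import combinations

def single_linkage(pair_dist, labels):
    """Merge the two closest clusters until one remains.

    pair_dist maps frozenset({a, b}) to the distance between items a and b.
    Returns a list of (cluster_1, cluster_2, height) in merge order.
    """
    clusters = [(label,) for label in labels]
    merges = []

    def cluster_dist(c1, c2):
        # Single linkage: the smallest distance over all cross-cluster pairs.
        return min(pair_dist[frozenset((a, b))] for a in c1 for b in c2)

    while len(clusters) > 1:
        c1, c2 = min(combinations(clusters, 2),
                     key=lambda pair: cluster_dist(*pair))
        height = cluster_dist(c1, c2)
        clusters.remove(c1)
        clusters.remove(c2)
        clusters.append(tuple(sorted(c1 + c2)))
        merges.append((c1, c2, height))
    return merges

pair_dist = {
    frozenset("AB"): 20, frozenset("AC"): 7, frozenset("AD"): 2,
    frozenset("BC"): 10, frozenset("BD"): 25, frozenset("CD"): 3,
}
merges = single_linkage(pair_dist, "ABCD")
for c1, c2, height in merges:
    print(c1, c2, height)
```

The merge heights come out as 2, 3 and 10, matching the labels on the dendrogram in the slides.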

36-43 Hierarchical Clustering [Figure sequence over eight slides: hierarchical clustering of Gene 1, Gene 2, Gene 3, Gene 4, Gene 5, Gene 6, Gene 7 and Gene 8]

44 Hierarchical Clustering HL

45 Hierarchical Clustering The Leaf Ordering Problem: Find the ‘optimal’ layout of branches for a given dendrogram architecture. There are 2^(N-1) possible orderings of the branches. For a small microarray dataset of 500 genes, there are about 1.6 × 10^150 branch configurations. [Figure axes: Samples, Genes]
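The count on this slide can be checked in a few lines. A binary dendrogram with N leaves has N-1 internal nodes, and the two branches under each node can be swapped independently without changing the tree, which gives 2^(N-1) leaf orderings. The function name is illustrative.

```python
def branch_orderings(n_leaves):
    """Number of leaf orders obtained by flipping the n-1 internal nodes."""
    return 2 ** (n_leaves - 1)

count = branch_orderings(500)
print(f"{count:.1e}")  # 1.6e+150
```

Exhaustive search over orderings is therefore out of reach even for small datasets, which is why leaf ordering is treated as a problem in its own right.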

46 Hierarchical Clustering The Leaf Ordering Problem:

47 Hierarchical Clustering Pros: –Commonly used algorithm –Simple and quick to calculate Cons: –Real genes probably do not have a hierarchical organization

48 Using Hierarchical Clustering
1. Choose what samples and genes to use in your analysis
2. Choose similarity/distance metric
3. Choose clustering direction
4. Choose linkage method
5. Calculate the dendrogram
6. Choose height/number of clusters for interpretation
7. Assess results
8. Interpret cluster structure

49 Choose what samples/genes to include Very important step Do you want to include housekeeping genes or genes that didn’t change in your results? How do you handle replicates from the same sample? Noisy samples? Dendrogram is a mess if everything is included in large datasets Gene screening

50 No Filtering

51 Filtering 100 relevant genes
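The slides do not say how the relevant genes were selected. One common screening rule, shown here purely as an assumption, keeps the genes whose expression varies most across samples. The gene names, values and the function `top_variable_genes` are hypothetical.

```python
def variance(values):
    """Population variance of a list of numbers."""
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values) / len(values)

def top_variable_genes(expression, k):
    """Names of the k genes with the largest variance across samples."""
    ranked = sorted(expression,
                    key=lambda gene: variance(expression[gene]),
                    reverse=True)
    return ranked[:k]

# Hypothetical expression profiles (one value per sample).
expression = {
    "gene_flat": [1.0, 1.0, 1.1, 0.9],
    "gene_up": [0.1, 0.8, 1.9, 3.2],
    "gene_noisy": [0.5, 0.7, 0.4, 0.6],
    "gene_switch": [-2.0, -1.8, 2.1, 1.9],
}
kept = top_variable_genes(expression, 2)
print(kept)  # ['gene_switch', 'gene_up']
```

Genes that barely change, such as housekeeping genes, drop out under this rule, which keeps the dendrogram readable on large datasets.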

52 2. Choose distance metric Metric should be a valid measure of the distance/similarity of genes Examples –Applying Euclidean distance to categorical data is invalid –Correlation metric applied to highly skewed data will give misleading results

53 3. Choose clustering direction Merging clustering (bottom up). Divisive (top down): split so that the genes within each of the two clusters are the most similar, maximizing the distance between clusters.

54 Nearest Neighbor Algorithm Nearest Neighbor Algorithm is an agglomerative approach (bottom-up). Starts with n nodes (n is the size of our sample), merges the 2 most similar nodes at each step, and stops when the desired number of clusters is reached.

55 Nearest Neighbor, Level 3, k = 6 clusters.

56 Nearest Neighbor, Level 4, k = 5 clusters.

57 Nearest Neighbor, Level 5, k = 4 clusters.

58 Nearest Neighbor, Level 6, k = 3 clusters.

59 Nearest Neighbor, Level 7, k = 2 clusters.

60 Nearest Neighbor, Level 8, k = 1 cluster.
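The nearest neighbor procedure illustrated on the preceding slides can be sketched as follows, assuming one-dimensional points and the closest-pair distance between clusters. The eight values are made up, chosen only so that the example also starts from eight nodes; the function name is illustrative.

```python
from itertools import combinations

def nearest_neighbor_clustering(points, k):
    """Start with one cluster per point and merge the closest pair until k remain."""
    clusters = [[p] for p in points]

    def gap(i, j):
        # Distance between the closest members of clusters i and j.
        return min(abs(a - b) for a in clusters[i] for b in clusters[j])

    while len(clusters) > k:
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda ij: gap(*ij))
        clusters[i] = clusters[i] + clusters[j]  # i < j, so index i stays valid
        del clusters[j]
    return clusters

points = [1.0, 1.2, 1.9, 5.0, 5.3, 9.0, 9.1, 12.0]
result = nearest_neighbor_clustering(points, 3)
print(result)  # [[1.0, 1.2, 1.9], [5.0, 5.3], [9.0, 9.1, 12.0]]
```

Each pass through the loop corresponds to one level in the slides: the number of clusters drops by one until the requested k is reached.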

61 Hierarchical Clustering Keys: Similarity, Clustering. Calculate the similarity between all possible combinations of two profiles. The two most similar clusters are grouped together to form a new cluster. Calculate the similarity between the new cluster and all remaining clusters.

62 Hierarchical Clustering C1 C2 C3 Merge which pair of clusters?

63 Hierarchical Clustering Single Linkage C1 C2 Dissimilarity between two clusters = Minimum dissimilarity between the members of two clusters. Tends to generate “long chains”.

64 Hierarchical Clustering Complete Linkage C1 C2 Dissimilarity between two clusters = Maximum dissimilarity between the members of two clusters. Tends to generate “clumps”.

65 Hierarchical Clustering Average Linkage C1 C2 Dissimilarity between two clusters = Averaged distances of all pairs of objects (one from each cluster).

66 Hierarchical Clustering Average Group Linkage C1 C2 Dissimilarity between two clusters = Distance between two cluster means.
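The four linkage definitions above can be compared on a small example. This is a sketch that assumes Euclidean distance between two-dimensional points; the two clusters and the function names are illustrative.

```python
import math
from itertools import product

def single_link(c1, c2):
    """Minimum distance over all cross-cluster pairs."""
    return min(math.dist(a, b) for a, b in product(c1, c2))

def complete_link(c1, c2):
    """Maximum distance over all cross-cluster pairs."""
    return max(math.dist(a, b) for a, b in product(c1, c2))

def average_link(c1, c2):
    """Mean distance over all cross-cluster pairs."""
    return sum(math.dist(a, b) for a, b in product(c1, c2)) / (len(c1) * len(c2))

def average_group_link(c1, c2):
    """Distance between the two cluster means."""
    mean1 = tuple(sum(coord) / len(c1) for coord in zip(*c1))
    mean2 = tuple(sum(coord) / len(c2) for coord in zip(*c2))
    return math.dist(mean1, mean2)

c1 = [(0.0, 0.0), (0.0, 2.0)]
c2 = [(3.0, 0.0), (3.0, 2.0)]

d_single = single_link(c1, c2)        # 3.0
d_complete = complete_link(c1, c2)    # sqrt(13), about 3.61
d_average = average_link(c1, c2)      # between the two values above
d_group = average_group_link(c1, c2)  # 3.0, means are (0, 1) and (3, 1)
print(d_single, d_complete, d_average, d_group)
```

Single and complete linkage always bracket average linkage, so the choice of linkage changes which pair of clusters looks closest and therefore which pair is merged next.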

67 Which one? Both methods are “step-wise” optimal: at each step the optimal split or merge is performed. This doesn’t mean that the final result is optimal. Merging: computationally simple; precise at the bottom of the tree; good for many small clusters. Divisive: more complex, but more precise at the top of the tree; good for looking at large and/or few clusters. For gene expression applications, divisive makes more sense.