Computational Biology

Presentation transcript:

Computational Biology Clustering Parts taken from Introduction to Data Mining by Tan, Steinbach, Kumar Lecture Slides Week 9

Clustering: A clustering is a set of clusters. There is an important distinction between hierarchical and partitional sets of clusters. Partitional clustering: a division of data objects into non-overlapping subsets (clusters) such that each data object is in exactly one subset. Hierarchical clustering: a set of nested clusters organized as a hierarchical tree.

Partitional Clustering [Figure: the original points and a partitional clustering of them]

Hierarchical Clustering [Figure: a traditional hierarchical clustering with its dendrogram, and a non-traditional hierarchical clustering with its dendrogram]

Notion of a Cluster can be Ambiguous: How many clusters? [Figure: a set of points shown as two, four, and six clusters]

What is Cluster Analysis? Finding groups of objects such that the objects in a group are similar (or related) to one another and different from (or unrelated to) the objects in other groups. Intra-cluster distances are minimized; inter-cluster distances are maximized.

10 Clusters Example: starting with two initial centroids in one cluster of each pair of clusters.

10 Clusters Example: starting with some pairs of clusters having three initial centroids, while others have only one.

Limitations of K-means: Differing Sizes [Figure: original points versus the K-means result (3 clusters)]

Limitations of K-means: Differing Density [Figure: original points versus the K-means result (3 clusters)]

Limitations of K-means: Non-globular Shapes [Figure: original points versus the K-means result (2 clusters)]

Hierarchical Clustering: produces a set of nested clusters organized as a hierarchical tree. It can be visualized as a dendrogram, a tree-like diagram that records the sequence of merges or splits.

Starting Situation: start with clusters of individual points and a proximity matrix. [Figure: points p1, p2, p3, p4, p5, ... and their proximity matrix]

Intermediate Situation: after some merging steps, we have some clusters. [Figure: clusters C1 to C5 and their proximity matrix]

Intermediate Situation: we want to merge the two closest clusters (C2 and C5) and update the proximity matrix. [Figure: clusters C1 to C5 and their proximity matrix]

After Merging: the question is “How do we update the proximity matrix?” [Figure: clusters C1, C2 ∪ C5, C3, C4; the proximity-matrix entries involving C2 ∪ C5 are marked “?”]

How to Define Inter-Cluster Similarity: MIN; MAX; Group Average; Distance Between Centroids; other methods driven by an objective function (Ward’s Method uses squared error). [Figure: points p1, p2, p3, p4, p5, ... and their proximity matrix]
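
The four proximity definitions listed on this slide can be sketched in a few lines of plain Python. This is only an illustration under some assumptions: it works with Euclidean distances rather than similarities (so "most similar" becomes "closest"), and the function names and the toy clusters `a` and `b` are made up for the example. Ward's method is left out.

```python
from itertools import product
from math import dist  # Euclidean distance between two points (Python 3.8+)


def pairwise(cluster_a, cluster_b):
    """All point-to-point distances between two clusters."""
    return [dist(p, q) for p, q in product(cluster_a, cluster_b)]


def min_link(cluster_a, cluster_b):
    """MIN (single link): distance of the closest cross-cluster pair."""
    return min(pairwise(cluster_a, cluster_b))


def max_link(cluster_a, cluster_b):
    """MAX (complete link): distance of the farthest cross-cluster pair."""
    return max(pairwise(cluster_a, cluster_b))


def group_average(cluster_a, cluster_b):
    """Group average: mean distance over all cross-cluster pairs."""
    d = pairwise(cluster_a, cluster_b)
    return sum(d) / len(d)


def centroid(cluster):
    """Coordinate-wise mean of the points in a cluster."""
    return tuple(sum(c) / len(cluster) for c in zip(*cluster))


def centroid_distance(cluster_a, cluster_b):
    """Distance between the two cluster centroids."""
    return dist(centroid(cluster_a), centroid(cluster_b))


# Hypothetical toy clusters lying on the x-axis.
a = [(0.0, 0.0), (1.0, 0.0)]
b = [(3.0, 0.0), (5.0, 0.0)]
print(min_link(a, b), max_link(a, b), group_average(a, b), centroid_distance(a, b))
```

On the same pair of clusters the definitions can give quite different values, which is why the choice of linkage changes the resulting hierarchy.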

Cluster Similarity: MIN or Single Link. The similarity of two clusters is based on the two most similar (closest) points in the different clusters. It is determined by one pair of points, i.e., by one link in the proximity graph.
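
As a rough sketch of how the merge loop and the proximity-matrix update fit together under MIN linkage: after merging clusters a and b, the distance from the merged cluster to any other cluster c is min(d(a, c), d(b, c)). The code below assumes Euclidean distances; the function name and the representation of clusters as tuples of point indices are choices made for this example, not something taken from the slides.

```python
from math import dist


def single_link_merges(points):
    """Agglomerative clustering with MIN (single link).

    Returns the merge history as (cluster_a, cluster_b, distance) tuples,
    where each cluster is a sorted tuple of point indices.
    """
    # Starting situation: every point is its own cluster.
    clusters = [(i,) for i in range(len(points))]
    # Proximity matrix, stored as a dict keyed by an unordered cluster pair.
    prox = {
        frozenset((a, b)): dist(points[a[0]], points[b[0]])
        for i, a in enumerate(clusters)
        for b in clusters[i + 1:]
    }
    merges = []
    while len(clusters) > 1:
        # Pick the two closest clusters.
        pair = min(prox, key=prox.get)
        a, b = sorted(pair)
        merged = tuple(sorted(a + b))
        merges.append((a, b, prox[pair]))
        # Update the proximity matrix: under MIN, the merged cluster is as
        # close to c as the closer of its two parts was.
        others = [c for c in clusters if c not in (a, b)]
        for c in others:
            prox[frozenset((merged, c))] = min(
                prox.pop(frozenset((a, c))), prox.pop(frozenset((b, c)))
            )
        del prox[pair]
        clusters = others + [merged]
    return merges


# Hypothetical points on a line; the gaps between neighbours are 1, 2 and 4.
history = single_link_merges([(0, 0), (1, 0), (3, 0), (7, 0)])
for a, b, d in history:
    print(a, "+", b, "at distance", d)
```

The merge distances in `history` are the heights at which the corresponding branches would join in a dendrogram.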

Hierarchical Clustering: MIN [Figure: nested clusters for six points and the corresponding dendrogram]

Measuring Cluster Validity Via Correlation: Two matrices are compared, the proximity matrix and the “incidence” matrix. The incidence matrix has one row and one column for each data point; an entry is 1 if the associated pair of points belongs to the same cluster and 0 if the pair belongs to different clusters. Compute the correlation between the two matrices. Since the matrices are symmetric, only the correlation between the n(n-1)/2 entries above the diagonal needs to be calculated. High correlation indicates that points that belong to the same cluster are close to each other. Not a good measure for some density- or contiguity-based clusters.
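
A minimal sketch of this validity measure, assuming the proximity matrix holds Euclidean distances. With distances, a good clustering gives a strongly negative correlation, because an incidence entry of 1 goes together with a small distance. The function names, the toy points and the two labelings are hypothetical.

```python
from math import dist, sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def incidence_proximity_correlation(points, labels):
    """Correlate incidence and distance over the n(n-1)/2 upper entries."""
    incidence, proximity = [], []
    n = len(points)
    for i in range(n):
        for j in range(i + 1, n):
            incidence.append(1.0 if labels[i] == labels[j] else 0.0)
            proximity.append(dist(points[i], points[j]))
    return pearson(incidence, proximity)


# Two well-separated hypothetical groups of three points each.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
good_labels = [0, 0, 0, 1, 1, 1]  # follows the two groups
bad_labels = [0, 1, 0, 1, 0, 1]   # ignores the geometry

corr_good = incidence_proximity_correlation(points, good_labels)
corr_bad = incidence_proximity_correlation(points, bad_labels)
print(f"good labeling: {corr_good:.3f}, bad labeling: {corr_bad:.3f}")
```

The labeling that follows the geometry should come out close to -1, while the arbitrary labeling stays near 0.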

Measuring Cluster Validity Via Correlation: correlation of the incidence and proximity matrices for the K-means clusterings of two data sets. [Figure: the two data sets, with Corr = -0.9235 and Corr = -0.5810]

Using Similarity Matrix for Cluster Validation Order the similarity matrix with respect to cluster labels and inspect visually.

End Theory I 5 min mindmapping 10 min break

Practice I

Clustering Dataset: We will use the same datasets as last week. Have fun clustering with Orange. Try K-means clustering, hierarchical clustering, and MDS. Analyze the differences in the results.

K? What is the k in k-means again? How many clusters are in my dataset? Solution: iterate over a reasonable range of values for k and try to find out how many clusters there are in your data.
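
Outside of Orange, the "iterate over k" idea can be sketched in plain Python by running K-means for several values of k and watching how the sum of squared errors (SSE) drops. Everything below is an assumption made for the illustration: the deterministic farthest-first seeding (chosen so that the run is reproducible), the function names, and the toy data with three obvious groups.

```python
from math import dist


def farthest_first(points, k):
    """Deterministic seeding: start at points[0], then repeatedly add the
    point that is farthest from its nearest already-chosen centroid."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(dist(p, c) for c in centroids))
        )
    return centroids


def kmeans_sse(points, k, max_iter=100):
    """Run basic K-means and return the sum of squared errors (SSE)."""
    centroids = farthest_first(points, k)
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid.
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: dist(p, centroids[c]))
            groups[nearest].append(p)
        # Update step: each centroid moves to the mean of its group
        # (an empty group keeps its old centroid).
        updated = [
            tuple(sum(x) / len(g) for x in zip(*g)) if g else centroids[i]
            for i, g in enumerate(groups)
        ]
        if updated == centroids:
            break
        centroids = updated
    return sum(min(dist(p, c) for c in centroids) ** 2 for p in points)


# Hypothetical data: three small, well-separated groups.
points = [
    (0, 0), (0, 1), (1, 0), (1, 1),
    (10, 0), (10, 1), (11, 0), (11, 1),
    (5, 9), (5, 10), (6, 9), (6, 10),
]
sse_by_k = {k: kmeans_sse(points, k) for k in range(1, 6)}
for k, sse in sse_by_k.items():
    print(f"k={k}  SSE={sse:.2f}")
```

The SSE always shrinks as k grows, so the useful signal is the "elbow": the value of k after which extra clusters bring only a small further drop (k = 3 for this toy data).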

End Practice I 15 min break

Theory II Microarrays

Microarrays: Gene Expression. We see differences between cells because of differential gene expression. A gene is expressed by transcribing DNA into single-stranded mRNA; the mRNA is later translated into a protein. Microarrays measure the level of mRNA expression.

Microarrays: Gene Expression. mRNA expression represents the dynamic aspects of a cell. mRNA is isolated and labeled with a fluorescent material, then hybridized to the target; the level of hybridization corresponds to the light emission, which is measured with a laser.

Processing Microarray Data: Differentiating gene expression from the red (R) and green (G) intensities. R = G: not differentiated; R > G: up-regulated; R < G: down-regulated.
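
In practice the red and green intensities are rarely exactly equal, so the comparison is often made on the log2 ratio with some tolerance. The sketch below illustrates that idea; the function name and the `threshold` cut-off (a two-fold change by default) are assumptions for the example and are not given on the slide.

```python
from math import log2


def classify_spot(red, green, threshold=1.0):
    """Classify a two-colour spot from its red and green intensities.

    `threshold` is a hypothetical cut-off on |log2(R/G)|; the slide itself
    only states the three cases R = G, R > G and R < G.
    """
    log_ratio = log2(red / green)
    if log_ratio >= threshold:
        return "up-regulated"
    if log_ratio <= -threshold:
        return "down-regulated"
    return "not differentiated"


# Made-up intensities for three spots.
for red, green in [(800, 200), (200, 800), (510, 500)]:
    print(red, green, classify_spot(red, green))
```

Using the logarithm makes up- and down-regulation symmetric: a four-fold increase gives +2 and a four-fold decrease gives -2.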

Processing Microarray Data: Problems. Extract the data from the microarrays; analyze the meaning of the multiple arrays.

Processing Microarray Data: Clustering. Find classes in the data, identify new classes, identify gene correlations. Methods: K-means clustering, hierarchical clustering, Self-Organizing Maps (SOM).

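Processing Microarray Data: Distance Measures. Euclidean distance: $d(x, y) = \sqrt{\sum_{i=1}^{m} (x_i - y_i)^2}$. Manhattan distance: $d(x, y) = \sum_{i=1}^{m} |x_i - y_i|$. Here $x$ and $y$ are two expression vectors of length $m$.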
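
Both distance measures are short enough to write out directly. The sketch below uses the standard definitions; the two expression profiles are made-up numbers.

```python
def euclidean(x, y):
    """Square root of the summed squared coordinate differences."""
    return sum((a - b) ** 2 for a, b in zip(x, y)) ** 0.5


def manhattan(x, y):
    """Sum of the absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(x, y))


# Hypothetical expression profiles of two genes over four time points.
gene_a = [0.0, 1.0, 2.0, 3.0]
gene_b = [3.0, 5.0, 2.0, 3.0]
print(euclidean(gene_a, gene_b), manhattan(gene_a, gene_b))
```

The Manhattan distance is never smaller than the Euclidean distance for the same pair of vectors, and it gives less weight to a single large difference.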

Processing Microarray Data: K-means Clustering. Break the data into K clusters; start with a random partitioning; improve it by iterating.
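
The three steps on this slide can be sketched as follows. This is a minimal illustration, not a reference implementation: the function names are invented, squared Euclidean distance is assumed, and re-seeding an empty cluster with a random point is an extra assumption the slide does not mention.

```python
import random


def mean(points):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(c) / len(points) for c in zip(*points))


def sq_dist(p, q):
    """Squared Euclidean distance."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans_random_partition(points, k, seed=0, max_iter=100):
    """K-means that starts from a random partitioning of the points."""
    rng = random.Random(seed)
    # Start with a random partitioning: each point gets a random cluster.
    labels = [rng.randrange(k) for _ in points]
    for _ in range(max_iter):
        # Centroid of every cluster under the current partitioning.
        centroids = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            # Assumption: an empty cluster is re-seeded with a random point.
            centroids.append(mean(members) if members else rng.choice(points))
        # Improve the partitioning: move each point to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: sq_dist(p, centroids[c])) for p in points
        ]
        if new_labels == labels:  # nothing changed, so we have converged
            break
        labels = new_labels
    return labels, centroids


# Hypothetical data: two tight groups far apart.
data = [(0.0, 0.0), (1.0, 1.0), (10.0, 10.0), (11.0, 11.0)]
labels, centroids = kmeans_random_partition(data, k=2)
print(labels, centroids)
```

Which group ends up with label 0 depends on the random start, so only the grouping itself is meaningful, not the label numbers.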

Processing Microarray Data Agglomerative Hierarchical Clustering:

Processing Microarray Data: Self-Organizing Feature Maps. Introduced by Teuvo Kohonen, a SOM is a data visualization technique that helps us understand high-dimensional data by reducing its dimensions to a map.

Processing Microarray Data: Self-Organizing Feature Maps. Humans simply cannot visualize high-dimensional data as it is; SOMs help us understand such data.

Processing Microarray Data: Self-Organizing Feature Maps. SOMs are based on competitive learning. They produce a map of usually 1 or 2 dimensions and show the similarities in the data by placing similar data items close together on the map.

Processing Microarray Data: Self-Organizing Feature Maps. Input vector: $x = [x_1, x_2, \dots, x_m]^T$. Synaptic weight vector: $w_j = [w_{j1}, w_{j2}, \dots, w_{jm}]^T$, $j = 1, 2, \dots, l$. Best-matching (winning) neuron: $i(x) = \arg\min_j \lVert x - w_j \rVert$, $j = 1, 2, \dots, l$. The weights $w_i$ are then updated.
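
The slide gives the rule for finding the winning neuron but not the update itself. The sketch below fills that gap with the commonly used Kohonen update, in which each weight vector moves towards x by a step scaled by a learning rate and a neighbourhood factor. The learning rate, the Gaussian neighbourhood, the 1-D map layout and all function names are assumptions made for this example.

```python
from math import dist, exp


def best_matching_unit(x, weights):
    """i(x) = arg min_j ||x - w_j||: index of the closest weight vector."""
    return min(range(len(weights)), key=lambda j: dist(x, weights[j]))


def som_update(x, weights, rate=0.5, radius=1.0):
    """One training step on a 1-D map; modifies `weights` in place.

    Assumed update rule: w_j <- w_j + rate * h(j, winner) * (x - w_j),
    with a Gaussian neighbourhood h over the neuron indices.
    """
    winner = best_matching_unit(x, weights)
    for j, w in enumerate(weights):
        # Neurons near the winner on the map are pulled more strongly.
        h = exp(-((j - winner) ** 2) / (2 * radius ** 2))
        weights[j] = [wi + rate * h * (xi - wi) for wi, xi in zip(w, x)]
    return winner


# Hypothetical map with l = 3 neurons and m = 2 input dimensions.
weights = [[0.0, 0.0], [5.0, 5.0], [10.0, 10.0]]
x = [6.0, 4.0]
before = [dist(x, w) for w in weights]
winner = som_update(x, weights)
after = [dist(x, w) for w in weights]
print(winner, weights)
```

Training a full map would repeat this step over all input vectors for many passes, usually while shrinking `rate` and `radius`; that schedule is also not specified on the slide.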

Figure 2. Output map containing the distributions of genes from the alpha30 database. Genes are mapped according to their similarity in expression level. Gene clusters with similar expression levels in the time series are mapped to areas that are closer to each other. Arrows point to the neurons that correspond to the expression level graphs shown. Clusters that are closer on the map have more similar expression levels. Genes with different behavior are located farther away on the map. Furthermore, genes with opposite behavior tend to be located in opposite neurons on the map. doi:10.1371/journal.pone.0093233.g002 Chavez-Alvarez R, Chavoya A, Mendez-Vazquez A (2014) Discovery of Possible Gene Relationships through the Application of Self-Organizing Maps to DNA Microarray Databases. PLoS ONE 9(4): e93233. doi:10.1371/journal.pone.0093233 http://www.plosone.org/article/info:doi/10.1371/journal.pone.0093233

Figure 5. Color-coded output maps representing the final weight of neurons from the samples of the alpha30 database. Maps are labeled with the sampling time and the cell cycle phase. The cluster centroid value is coded with a range of colors from blue for the lowest expression level value, to red for the highest value. Green rectangles enclose the region where the chromosomal DNA replication genes are located on the maps; these genes have their highest expression level during late G1 phase and early S phase. Yellow rectangles correspond to neurons with the genes coding for histones; these genes have their highest expression level during the S phase. doi:10.1371/journal.pone.0093233.g005 Chavez-Alvarez R, Chavoya A, Mendez-Vazquez A (2014) Discovery of Possible Gene Relationships through the Application of Self-Organizing Maps to DNA Microarray Databases. PLoS ONE 9(4): e93233. doi:10.1371/journal.pone.0093233 http://www.plosone.org/article/info:doi/10.1371/journal.pone.0093233

End Theory II 5 min mindmapping 10 min break

Practice II

Microarray http://www.biolab.si/supp/bi-visprog/dicty/dictyExample.htm Use the data as described on the page