Clustering [Idea only, Chapter 10.1, 10.2, 10.4].

Slides:



Advertisements
Similar presentations
Chapter 4 Partition I. Covering and Dominating.
Advertisements

SEEM Tutorial 4 – Clustering. 2 What is Cluster Analysis?  Finding groups of objects such that the objects in a group will be similar (or.
Clustering.
Hierarchical Clustering. Produces a set of nested clusters organized as a hierarchical tree Can be visualized as a dendrogram – A tree-like diagram that.
O(N 1.5 ) divide-and-conquer technique for Minimum Spanning Tree problem Step 1: Divide the graph into  N sub-graph by clustering. Step 2: Solve each.
PARTITIONAL CLUSTERING
Cluster Analysis Measuring latent groups. Cluster Analysis - Discussion Definition Vocabulary Simple Procedure SPSS example ICPSR and hands on.
Prim’s Algorithm from a matrix A cable TV company is installing a system of cables to connect all the towns in the region. The numbers in the network are.
Author: Jie chen and Yousef Saad IEEE transactions of knowledge and data engineering.
Clustering short time series gene expression data Jason Ernst, Gerard J. Nau and Ziv Bar-Joseph BIOINFORMATICS, vol
Region Segmentation. Find sets of pixels, such that All pixels in region i satisfy some constraint of similarity.
Image Segmentation Chapter 14, David A. Forsyth and Jean Ponce, “Computer Vision: A Modern Approach”.
HCS Clustering Algorithm
Introduction to Bioinformatics Algorithms Clustering.
University of CreteCS4831 The use of Minimum Spanning Trees in microarray expression data Gkirtzou Ekaterini.
L16: Micro-array analysis Dimension reduction Unsupervised clustering.
4. Ad-hoc I: Hierarchical clustering
Distance methods. UPGMA: similar to hierarchical clustering but not additive Neighbor-joining: more sophisticated and additive What is additivity?
Chapter 9: Greedy Algorithms The Design and Analysis of Algorithms.
Computational Biology, Part 12 Expression array cluster analysis Robert F. Murphy, Shann-Ching Chen Copyright  All rights reserved.
Introduction to Bioinformatics Algorithms Clustering.
CSE182-L17 Clustering Population Genetics: Basics.
Gene Expression 1. Methods –Unsupervised Clustering Hierarchical clustering K-means clustering Expression data –GEO –UCSC EPCLUST 2.
Comp602 Bioinformatics Algorithms -m werner 2011
Introduction to Bioinformatics Algorithms Clustering and Microarray Analysis.
Chapter 3: Cluster Analysis  3.1 Basic Concepts of Clustering  3.2 Partitioning Methods  3.3 Hierarchical Methods The Principle Agglomerative.
B IOINFORMATICS Dr. Aladdin HamwiehKhalid Al-shamaa Abdulqader Jighly Lecture 8 Analyzing Microarray Data Aleppo University Faculty of technical.
Clustering Unsupervised learning Generating “classes”
BIONFORMATIC ALGORITHMS Ryan Tinsley Brandon Lile May 9th, 2014.
Gene expression & Clustering (Chapter 10)
DNA microarray technology allows an individual to rapidly and quantitatively measure the expression levels of thousands of genes in a biological sample.
Lecture 11. Microarray and RNA-seq II
A Clustering Algorithm based on Graph Connectivity Balakrishna Thiagarajan Computer Science and Engineering State University of New York at Buffalo.
Evolutionary tree reconstruction
Microarray Data Analysis (Lecture for CS498-CXZ Algorithms in Bioinformatics) Oct 13, 2005 ChengXiang Zhai Department of Computer Science University of.
Clustering.
NP-COMPLETE PROBLEMS. Admin  Two more assignments…  No office hours on tomorrow.
By Timofey Shulepov Clustering Algorithms. Clustering - main features  Clustering – a data mining technique  Def.: Classification of objects into sets.
NP-Complete problems.
Lecture 3 1.Different centrality measures of nodes 2.Hierarchical Clustering 3.Line graphs.
Gene expression & Clustering. Determining gene function Sequence comparison tells us if a gene is similar to another gene, e.g., in a new species –Dynamic.
CS 8751 ML & KDDData Clustering1 Clustering Unsupervised learning Generating “classes” Distance/similarity measures Agglomerative methods Divisive methods.
Graphs A ‘Graph’ is a diagram that shows how things are connected together. It makes no attempt to draw actual paths or routes and scale is generally inconsequential.
Computational Biology Clustering Parts taken from Introduction to Data Mining by Tan, Steinbach, Kumar Lecture Slides Week 9.
1 Microarray Clustering. 2 Outline Microarrays Hierarchical Clustering K-Means Clustering Corrupted Cliques Problem CAST Clustering Algorithm.
Example Apply hierarchical clustering with d min to below data where c=3. Nearest neighbor clustering d min d max will form elongated clusters!
Network Partition –Finding modules of the network. Graph Clustering –Partition graphs according to the connectivity. –Nodes within a cluster is highly.
Visualization of Biological Information with Circular Drawings.
Hierarchical clustering approaches for high-throughput data Colin Dewey BMI/CS 576 Fall 2015.
Distance-based methods for phylogenetic tree reconstruction Colin Dewey BMI/CS 576 Fall 2015.
Spanning Trees Dijkstra (Unit 10) SOL: DM.2 Classwork worksheet Homework (day 70) Worksheet Quiz next block.
Graph clustering to detect network modules
Unsupervised Learning
All-pairs Shortest paths Transitive Closure
Unsupervised Learning: Clustering
Unsupervised Learning: Clustering
PREDICT 422: Practical Machine Learning
3.1 Clustering Finding a good clustering of the points is a fundamental issue in computing a representative simplicial complex. Mapper does not place any.
dij(T) - the length of a path between leaves i and j
3.1 Clustering Finding a good clustering of the points is a fundamental issue in computing a representative simplicial complex. Mapper does not place any.
Clustering.
Hierarchical clustering approaches for high-throughput data
Clustering BE203: Functional Genomics Spring 2011 Vineet Bafna and Trey Ideker Trey Ideker Acknowledgements: Jones and Pevzner, An Introduction to Bioinformatics.
Clustering.
DATA MINING Introductory and Advanced Topics Part II - Clustering
CSE 417: Algorithms and Computational Complexity
SEEM4630 Tutorial 3 – Clustering.
Clustering.
Unsupervised Learning
Presentation transcript:

Clustering [Idea only, Chapter 10.1, 10.2, 10.4]

Why Clustering Viewing and analyzing vast amount of biological data as a whole set can be perplexing It is easier to interpret the data if they are partitioned into clusters combining similar data points. Researchers want to know the functions of newly sequenced genes Simply comparing the new gene sequences to known DNA sequences often does not give away the function of gene “Gene activity measure”, also called “gene expression level”, gives more accurate information

Experiments Attach phosphor to genes to see when a particular gene is “expressed” Different color phosphors are available to compare many samples at once Different time stamps

Experiments

Experimental Data Experiment data are usually transformed into an intensity matrix (below) The intensity matrix allows biologists to make correlations between different genes Time:Time XTime YTime Z Gene 1108 Gene Gene Gene 4783 Gene 5123

Clustering of Data Clusters Bonus: How is distance matrix created from intensity matrix ?

Homogeneity and Separation Principles Homogeneity: Elements within a cluster are close to each other Separation: Elements in different clusters are further apart from each other …clustering is not an easy task! Given these points a clustering algorithm might make two distinct clusters as follows

Bad Clustering This clustering violates both Homogeneity and Separation principles Close distances from points in separate clusters Far distances from points in the same cluster

Good Clustering This clustering satisfies both Homogeneity and Separation principles

Clustering Techniques Agglomerative: Start with every element in its own cluster, and iteratively join clusters together Divisive: Start with one cluster and iteratively divide it into smaller clusters Hierarchical: Organize elements into a tree, leaves represent genes and the length of the pathes between leaves represents the distances between genes. Similar genes lie within the same subtrees

Hierarchical Clustering

Hierarchical Clustering: Example

Hierarchical Clustering (cont’d) Hierarchical Clustering is often used to reveal evolutionary history

Hierarchical Clustering Algorithm 1.Hierarchical Clustering (d, n) 2. Form n clusters each with one element 3. Construct a graph T by assigning one vertex to each cluster 4. while there is more than one cluster 5. Find the two closest clusters C 1 and C 2 6. Merge C 1 and C 2 into new cluster C with |C 1 | +|C 2 | elements 7. Compute distance from C to all other clusters 8. Add a new vertex C to T and connect to vertices C 1 and C 2 9. Remove rows and columns of d corresponding to C 1 and C Add a row and column to d corrsponding to the new cluster C 11. return T

Hierarchical Clustering Algorithm 1.Hierarchical Clustering (d, n) 2. Form n clusters each with one element 3. Construct a graph T by assigning one vertex to each cluster 4. while there is more than one cluster 5. Find the two closest clusters C 1 and C 2 6. Merge C 1 and C 2 into new cluster C with |C 1 | +|C 2 | elements 7. Compute distance from C to all other clusters 8. Add a new vertex C to T and connect to vertices C 1 and C 2 9. Remove rows and columns of d corresponding to C 1 and C Add a row and column to d corrsponding to the new cluster C 11. return T Different ways to define distances between clusters may lead to different clusterings

Clique Graphs A clique is a graph with every vertex connected to every other vertex A clique graph is a graph where each connected component is a clique

Transforming an Arbitrary Graph into a Clique Graphs A graph can be transformed into a clique graph by adding or removing edges

Corrupted Cliques Problem Input: A graph G Output: The smallest number of additions and removals of edges that will transform G into a clique graph NP-complete !

Distance Graph to Clique Graph Turn the distance matrix into a distance graph Genes are represented as vertices in the graph Choose a distance threshold θ If the distance between two vertices is below θ, draw an edge between them The resulting graph may contain cliques These cliques represent clusters of closely located data points!

Distance Graph to Clique Graph The distance graph (threshold θ=7) is transformed into a clique graph after removing the two highlighted edges After transforming the distance graph into the clique graph, the dataset is partitioned into three clusters