Lecture 10. Clustering Algorithms The Chinese University of Hong Kong CSCI3220 Algorithms for Bioinformatics.



Lecture outline
1. Numeric datasets and clustering
2. Some clustering algorithms
   - Hierarchical clustering and dendrograms
   - K-means
   - Subspace and bi-clustering algorithms
Last update: 21-Nov-2015 | CSCI3220 Algorithms for Bioinformatics | Kevin Yip (CSE, CUHK) | Fall 2015

NUMERIC DATASETS AND CLUSTERING Part 1

Numeric datasets in bioinformatics
So far we have mainly studied problems related to biological sequences.
- Sequences represent the static states of an organism, like a program and data stored on a hard disk
- Numeric measurements represent the dynamic states, like a program and data loaded into memory at run time

Numeric measurements
- Activity level of a gene (expression level): mRNA level (number of copies in a cell) or protein level
- Occupancy of a transcription factor at a binding site
- Fraction of C's in CpG dinucleotides being methylated
- Frequency of a residue of a histone protein being acetylated
- Fraction of A's on an mRNA being edited to I's (inosines), i.e., adenosine (A) converted to inosine (I)
- Higher-level measurements: heart beat, blood pressure and cholesterol level of patients, ...
Image source: Wikipedia

Gene expression
Protein abundance is the best way to measure the activity of a protein-coding gene.
- However, not much data is available because the experiments are difficult
mRNA levels are not ideal indicators of gene activity.
- mRNA level and protein level are not very correlated, due to mRNA degradation, translational efficiency, translational and post-translational modifications, and so on
- However, it is very easy to measure mRNA levels
- High-throughput experiments can measure the mRNA levels of many genes at a time: microarrays and RNA (cDNA) sequencing

Microarrays
The basic idea is "hybridization".
- For each gene, we design short probe sequences that are unique to the gene
- When RNA is converted back to DNA, if it is complementary to a probe, it will bind to the probe ("hybridization")
- Ideally this happens only for perfect matches, but sometimes hybridization also happens with some mismatches
- Note: we need to know the DNA sequences of the genes

Hybridization (figure)
Image source: Wikipedia

The arrays (figures)
Image sources:

Processing workflows (figure)
Image source:

RNA sequencing
Microarray data are quite noisy because of cross-hybridization, narrow signal range, analog measurements, etc. RNA sequencing is a newer technology that gives better quality:
- Convert RNAs back to cDNAs, sequence them, and identify which genes they correspond to
- Better signal-to-noise ratio than microarrays
- "Digital": expression level represented by read counts
- No need to have prior knowledge about the sequences
- If a sequence is not unique to a gene, we cannot determine which gene it comes from (also a problem for microarrays)

RNA sequencing (figure)
Image credit: Wang et al., Nature Reviews Genetics 10(1):57-63, (2009)

Processing RNA-seq data
Many steps; we will not go into the details:
- Quality check
- Read trimming and filtering
- Read mapping (BWT, suffix array, etc.)
- Data normalization
- ...

Gene expression data
Final form of the data from microarray or RNA-seq:
- A matrix of real numbers
- Each row corresponds to a gene
- Each column corresponds to a sample/experiment: a particular condition, a cell type (e.g., cancer)
Questions:
- Are there genes that show similar changes to their expression levels across experiments? Such genes may have related functions
- Are there samples with similar sets of genes expressed? Such samples may be of the same type

Clustering of gene expression data
Clustering: grouping of related objects into clusters.
- An object could be a gene or a sample
- Usually clustering is done on both: when genes are the objects, each sample is an attribute; when samples are the objects, each gene is an attribute
Goals:
- Similar objects are in the same cluster
- Dissimilar objects are in different clusters
- A scoring function can be defined to evaluate how good a set of clusters is
Most clustering problems are NP-hard, so we will study heuristic algorithms.

Heatmap and clustering results (figure)
Color: expression level. Clustering is applied to both genes and samples.
Image credit: Borries and Wang, Computational Statistics & Data Analysis 53(12), (2009); Alizadeh et al., Nature 403(6769), (2000)

SOME CLUSTERING ALGORITHMS Part 2

Hierarchical clustering
One of the most commonly used clustering algorithms is agglomerative hierarchical clustering (agglomerative: merging; there are also divisive hierarchical clustering algorithms).
The algorithm:
1. Treat each object as a cluster by itself
2. Compute the distance between every pair of clusters
3. Merge the two closest clusters
4. Re-compute the distances between the merged cluster and each other cluster
5. Repeat #3 and #4 until only one cluster is left
Same as UPGMA, but without the phylogenetic context.
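The steps above can be sketched as a naive O(n^3) Python implementation (illustrative names; this sketch uses average-link with Euclidean distance and recomputes cluster distances on demand rather than maintaining them incrementally as step 4 describes):

```python
import math

def euclidean(p, q):
    """Distance between two points (tuples of coordinates)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def average_link(c1, c2, points):
    """Average-link: mean distance over all cross-cluster pairs of points."""
    pairs = [(i, j) for i in c1 for j in c2]
    return sum(euclidean(points[i], points[j]) for i, j in pairs) / len(pairs)

def hierarchical_clustering(points):
    """Naive agglomerative clustering; returns the merge history."""
    clusters = {i: [i] for i in range(len(points))}  # step 1: singleton clusters
    merges = []
    while len(clusters) > 1:
        # steps 2-3: find and merge the two closest clusters
        a, b = min(
            ((a, b) for a in clusters for b in clusters if a < b),
            key=lambda ab: average_link(clusters[ab[0]], clusters[ab[1]], points),
        )
        clusters[a] = clusters[a] + clusters.pop(b)
        merges.append((a, b))  # step 4 happens implicitly: distances are recomputed on demand
    return merges
```

Each merge records the indices of the two clusters combined; the sequence of merges is exactly the information a dendrogram encodes.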

Hierarchical clustering
2D illustration (figure):
- Each point is a gene
- The coordinates of a point indicate the expression values of the gene in two samples (Sample 1 and Sample 2)
- The result is a dendrogram (similar to a phylogenetic tree) with leaves A, B, C, D, E, F

Representing a dendrogram
Since a dendrogram is essentially a tree, we can represent it using any tree format, for example the Newick format: (((A,B),(C,(D,E))),F);
We can also use the Newick format to specify how the leaves should be ordered in a visualization. For example, the Newick string (F,((B,A),((D,E),C))); describes the same merge order of the clusters but with the leaves ordered differently, so the corresponding dendrogram is drawn with its leaves in a different order.
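As a small illustration (a sketch, not from the slides), a dendrogram stored as nested tuples can be serialized to the Newick string above in a few lines; the nested-tuple representation is an assumption of this sketch:

```python
def to_newick(tree):
    """Convert a nested-tuple dendrogram into a Newick string.
    A leaf is a string label; an internal node is a 2-tuple of subtrees."""
    def fmt(node):
        if isinstance(node, str):
            return node  # leaf: emit its label
        left, right = node
        return "(%s,%s)" % (fmt(left), fmt(right))
    return fmt(tree) + ";"
```

Swapping the order of the two children of any internal node yields a different leaf ordering for the same merge structure, as the slide describes.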

More details
Three questions:
- How to compute the distance between two points?
- How to compute the distance between two clusters?
- How to efficiently perform these computations?

Distance
Most common: Euclidean distance
  d(i1, i2) = sqrt( Σ_{j=1..m} (x_{i1,j} - x_{i2,j})^2 )
where x_{i,j} is the expression level of the i-th object (say, gene) at the j-th attribute (say, sample) and m is the total number of attributes. The attributes need to be normalized.
Also common: (1 - Pearson correlation) / 2. Pearson correlation is a similarity measure with value between -1 and 1:
  r(i1, i2) = Σ_j (x_{i1,j} - μ_{i1})(x_{i2,j} - μ_{i2}) / [ sqrt(Σ_j (x_{i1,j} - μ_{i1})^2) sqrt(Σ_j (x_{i2,j} - μ_{i2})^2) ]
where μ_i = (Σ_j x_{i,j}) / m is the mean expression level of the i-th object.
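Both measures can be written directly from the formulas above; a plain-Python sketch (attribute normalization is not included):

```python
import math

def euclidean_distance(x, y):
    """Euclidean distance between two expression profiles of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def pearson_distance(x, y):
    """(1 - Pearson correlation) / 2: maps correlation in [-1, 1] to a distance in [1, 0]."""
    m = len(x)
    mx, my = sum(x) / m, sum(y) / m
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return (1 - cov / (sx * sy)) / 2
```

Note how a profile and a scaled copy of it (e.g., [1,2,3] vs. [2,4,6]) have Pearson distance 0 but a non-zero Euclidean distance, which is the contrast drawn on the next slide.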

Euclidean distance vs. correlation
- Two points have a small Euclidean distance if their attribute values are close (but not necessarily correlated)
- Two points have a large Pearson correlation if their attribute values have consistent trends (but are not necessarily close)

Which one to use?
Sometimes absolute expression values are more important.
- Example: when there is a set of homogeneous samples (e.g., all of a certain cancer type), and the goal is to find genes that are all highly expressed or all lowly expressed
Usually the increase-decrease trend is more important than the absolute expression values.
- Example: when detecting changes between two sets of samples or across a number of time points

Similarity between two clusters
Several schemes (using Euclidean distance as an example):
- Average-link: average distance between all pairs of points (used by UPGMA)
- Single-link: closest among all pairs of points
- Complete-link: farthest among all pairs of points
- Centroid-link: distance between the centroids

Similarity between two clusters
- Average-link: an equal vote by all members of the clusters, preferring to merge clusters liked by many pairs
- Single-link: merges two clusters even if just one pair likes it very much
- Complete-link: refuses to merge two clusters if even one pair does not like it
- Centroid-link: similar to average-link, but easier to compute

Similarity between two clusters
Suppose clusters C1 and C2 have already been formed (figure, with points A-I):
- Average-link prefers to merge I and C2 next, as their points are close on average
- Single-link prefers to merge C1 and E next, as C and E are very close
- Complete-link prefers to merge I and C2 next, as I is not too far from F, G or H (as compared to A-E, C-H, E-H, etc.)
- Centroid-link prefers to merge C1 and C2 next, as their centroids are close (and not so affected by the long distance between C and H)

Updating
To determine which two clusters are most similar, we need to compute the distance between every pair of clusters.
- At the beginning, this involves O(n^2) computations for n objects, followed by a way to find the smallest value, either:
  - Linear scan, which takes O(n^2) time, OR
  - Sorting, which takes O(n^2 log n^2) = O(n^2 log n) time
- After a merge, we need to remove the distances involving the two merging clusters, and add back the distances between the new cluster and all other clusters: O(n) between-cluster distance calculations (assume for now each takes constant time; we will come back to this later), followed by either:
  - Linear scan of the new list, which takes O(n^2) time, OR
  - Re-sorting, which takes O(n^2 log n) time, OR
  - Binary search and removing/inserting distances, which takes O(n log n^2) = O(n log n) time

Updating
Summary:
- At the beginning: linear scan, O(n^2) time, OR sorting, O(n^2 log n) time
- After each merge: linear scan, O(n^2) time, OR re-sorting, O(n^2 log n) time, OR binary search and removing/inserting distances, O(n log n) time
- In total: linear scan, O(n^3) time; maintaining a sorted list, O(n^2 log n) time
Can these be done faster?

Heap
A heap (also called a priority queue) is for maintaining the minimum value of a list of numbers without sorting.
Ideas:
- Build a binary tree structure with each node storing one of the numbers, where the root of a sub-tree is always smaller than all other nodes in the sub-tree
- Store the tree in an array that allows efficient updates

Heap
Example: tree representation (each node is the distance between two clusters) and the corresponding array representation (notice that the array is not entirely sorted).
- If the first entry has index 0, then the children of the node at entry i are at entries 2i+1 and 2i+2
- The smallest value is always at the first entry of the array

Constructing a heap
Starting with any input array:
- From the node at entry ⌊N/2⌋ down to the node at entry 1, iteratively swap it with its smallest child whenever that child is smaller than the current node
- N is the total number of nodes, which is equal to n(n-1)/2, the number of pairs for n clusters
Example input: 13, 5, 10, 4, 11, 1, 9, 12, 8, 6


Constructing a heap
Input: 13, 5, 10, 4, 11, 1, 9, 12, 8, 6
Output: 1, 4, 9, 5, 6, 10, 13, 12, 8, 11
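The bottom-up construction just described can be sketched as follows (a 0-indexed array with children of entry i at 2i+1 and 2i+2, as on the earlier slide; function names are illustrative):

```python
def sift_down(a, i, n):
    """Swap a[i] with its smallest child until the min-heap property holds below i."""
    while True:
        left, right = 2 * i + 1, 2 * i + 2
        smallest = i
        if left < n and a[left] < a[smallest]:
            smallest = left
        if right < n and a[right] < a[smallest]:
            smallest = right
        if smallest == i:
            return
        a[i], a[smallest] = a[smallest], a[i]
        i = smallest

def heapify(a):
    """Bottom-up construction of a binary min-heap in place."""
    n = len(a)
    for i in range(n // 2 - 1, -1, -1):  # internal nodes, last one first
        sift_down(a, i, n)
    return a
```

Running this on the example input from the slides gives the array 1, 4, 9, 5, 6, 10, 13, 12, 8, 11: not sorted, but every parent is no larger than its children, and the minimum sits at entry 0.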

Constructing a heap
Time needed:
- Apparently, for each of the O(N) nodes, up to O(log N) swaps are needed, so O(N log N) time in total, the same as sorting
- However, by a careful amortized analysis, actually only O(N) time is needed
- Why? Because only one node could need ⌊log N⌋ swaps, two nodes could need ⌊log N⌋ - 1 swaps, etc. For example, for 15 nodes:
  - N log N: 15 ⌊log2 15⌋ = 15(3) = 45
  - Amortized: 1(3) + 2(2) + 4(1) + 8(0) = 11

Deleting a value
- For many applications of heaps, deletion only removes the value at the root
- In our clustering application, for each cluster we maintain the entries corresponding to the distances related to it, so after the cluster is merged we remove all these distance values from the heap
- In both cases, deletion is done by moving the last value to the deleted entry, then re-"heapifying"

Deleting a value
Example: deleting 4 (figure).
- Each deletion takes O(log N) time
- After each merge of two clusters, we need to remove O(n) distances from the heap: O(n log N) = O(n log n) time in total

Inserting a new value
- Add the new value to the end of the array
- Iteratively swap it with its parent while it is smaller than the parent
Example: adding the value 3 (figure).
- Each insertion takes O(log N) time
- After each merge of two clusters, we need to insert O(n) distances into the heap: O(n log N) = O(n log n) time in total
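Insertion, and deletion at an arbitrary entry (as needed when removing a merged cluster's distances), can be sketched as follows; `_sift_up` and `_sift_down` are helper names introduced for this sketch:

```python
def _sift_up(a, i):
    """Swap a[i] with its parent while it is smaller than the parent."""
    while i > 0 and a[i] < a[(i - 1) // 2]:
        p = (i - 1) // 2
        a[i], a[p] = a[p], a[i]
        i = p

def _sift_down(a, i):
    """Swap a[i] with its smallest child until the heap property holds below i."""
    n = len(a)
    while True:
        l, r, s = 2 * i + 1, 2 * i + 2, i
        if l < n and a[l] < a[s]:
            s = l
        if r < n and a[r] < a[s]:
            s = r
        if s == i:
            return
        a[i], a[s] = a[s], a[i]
        i = s

def heap_insert(a, value):
    """O(log N): append at the end, then sift up."""
    a.append(value)
    _sift_up(a, len(a) - 1)

def heap_delete(a, i):
    """O(log N): move the last value into entry i, then re-heapify around i."""
    last = a.pop()
    if i < len(a):
        a[i] = last
        _sift_up(a, i)    # the moved value may belong higher up...
        _sift_down(a, i)  # ...or further down
```

Because the moved value may be smaller or larger than the deleted one, deletion at an arbitrary entry must try both directions.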

Total time and space
- O(N) = O(n^2) space
- Initial construction: O(N) = O(n^2) time
- After each merge: O(n log n) time, with O(n) merges in total
- Therefore in total O(n^2 log n) time is needed
Next we study another structure that needs O(n^2) space but only O(n^2) time in total.

Quad tree
Proposed in Eppstein, Proceedings of the Ninth Annual ACM-SIAM Symposium on Discrete Algorithms, (1998).
Main idea: group the objects iteratively to form a tree, with the minimum distance among all objects in a sub-tree stored at its root.
Example: distances between 9 objects (figure).


Updating the quad tree
After a merge, the algorithm needs to:
- Delete the distance values in two rows and two columns
- Add back distance values into one row and one column; if we do not want to compact the tree, simply fill in ∞ for the other row and column
- Re-compute the minimum values at the upper levels
Example: merging clusters 5 and 6, with the new distances between the merged cluster and the other clusters given (figure).

Merging clusters 5 and 6 (figure): the rows and columns of clusters 5 and 6 are filled with ∞, the distances with the new cluster are inserted, and the minima at the upper levels are re-computed.

Space and time analysis
- Space needed: O(n^2)
- Initial construction: O(n^2 + n^2/4 + n^2/16 + ...) = O(n^2) time
- After each merge, the number of values to update is O(2n + 2n/2 + 2n/4 + ...) = O(n), each taking a constant amount of time
- Time needed for the whole clustering process: O(n^2), more efficient than using a heap
- There are data structures that require less space but more time

Computing between-cluster distances
If Ci and Cj are merged, how to compute d(Ci ∪ Cj, Ck) based on d(Ci, Ck) and d(Cj, Ck)?
- Single-link: d(Ci ∪ Cj, Ck) = min{d(Ci, Ck), d(Cj, Ck)}
- Complete-link: d(Ci ∪ Cj, Ck) = max{d(Ci, Ck), d(Cj, Ck)}
- Average-link: d(Ci ∪ Cj, Ck) = [d(Ci, Ck)|Ci||Ck| + d(Cj, Ck)|Cj||Ck|] / [(|Ci| + |Cj|)|Ck|]
- Centroid-link: Cen(Ci ∪ Cj) = [Cen(Ci)|Ci| + Cen(Cj)|Cj|] / (|Ci| + |Cj|)
All can be performed in constant time.
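The constant-time update rules above can be written as one small function (a sketch; the centroid rule updates the centroid rather than a distance, so only the three distance rules are shown, with |Ck| cancelled in the average-link formula):

```python
def merged_distance(link, d_ik, d_jk, size_i, size_j):
    """Distance between the merged cluster Ci ∪ Cj and another cluster Ck,
    computed in O(1) from the old distances d(Ci,Ck) and d(Cj,Ck)."""
    if link == "single":
        return min(d_ik, d_jk)
    if link == "complete":
        return max(d_ik, d_jk)
    if link == "average":
        # size-weighted mean of the two old average distances
        return (d_ik * size_i + d_jk * size_j) / (size_i + size_j)
    raise ValueError("unknown linkage: %s" % link)
```

This is what makes step 4 of the hierarchical clustering algorithm cheap: no pairwise point distances need to be revisited after a merge.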

K-means
K-means is another classical clustering algorithm (MacQueen, Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability, (1967)).
Instead of hierarchically merging clusters, k-means iteratively partitions the objects into k clusters by repeating two steps until the assignments stabilize:
1. Determining cluster representatives (randomly determined initially; centroids of the current members in subsequent iterations)
2. Assigning each object to the cluster with the closest representative

Example (k=2) (figure): starting from random initial representatives, the algorithm alternates between assigning each of the points A-I to the cluster (C1 or C2) with the closest representative and re-determining the representatives, until the assignments no longer change.
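The two alternating steps can be sketched as a minimal Lloyd-style k-means (illustrative names; assumes numeric point tuples and a fixed random seed for reproducibility):

```python
import math
import random

def kmeans(points, k, iters=100, seed=0):
    """Iteratively partition points into k clusters; returns (assignments, representatives)."""
    rng = random.Random(seed)
    reps = [list(p) for p in rng.sample(points, k)]  # random initial representatives
    assign = None
    for _ in range(iters):
        # step 2: assign each object to the cluster with the closest representative
        new_assign = [min(range(k), key=lambda c: math.dist(p, reps[c])) for p in points]
        if new_assign == assign:
            break  # assignments stabilized
        assign = new_assign
        # step 1 (subsequent iterations): representatives become centroids of current members
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:
                reps[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return assign, reps
```

Note that k must be chosen in advance, and the result can depend on the random initialization, which is one of the drawbacks listed on the next slide.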

Hierarchical clustering vs. k-means
Hierarchical clustering:
- Advantages: provides the whole clustering tree (dendrogram), which can be cut to get any number of clusters; no need to pre-determine k
- Disadvantages: slow; high memory consumption; once assigned, an object always stays in a cluster
k-means:
- Advantages: fast; low memory consumption; an object can move to another cluster
- Disadvantages: provides only the final clusters; need to pre-determine k
There are hundreds of other clustering algorithms proposed:
- Model-based
- Density-based
- Less sensitive to outliers
- More efficient
- Allowing other data types
- Considering domain knowledge
- Finding clusters in subspaces (coming up next)
- ...

Embedded clusters
Euclidean distance and Pearson correlation consider all attributes equally. It is possible that for each cluster, only some attributes are relevant.
Image credit: Pomeroy et al., Nature 415(6870), (2002)

Finding clusters in a subspace
One way is not to distinguish between objects and attributes, but to find a subset of rows and a subset of columns (a bicluster) such that the values inside the bicluster exhibit some coherent pattern.
Here we study one bi-clustering algorithm: Cheng and Church, 8th Annual International Conference on Intelligent Systems for Molecular Biology, (2000).

Cheng and Church biclustering
Notations:
- I is a subset of the rows
- J is a subset of the columns
- (I, J) defines a bicluster
Model: each value a_ij (at row i and column j) in a cluster is influenced by:
- The background of the whole cluster
- The effect of the i-th row
- The effect of the j-th column

Cheng and Church biclustering
Assumption: in the ideal case, a_ij = a_iJ + a_Ij - a_IJ, where
- a_iJ = (Σ_{j∈J} a_ij) / |J| is the mean of the values in the cluster at row i
- a_Ij = (Σ_{i∈I} a_ij) / |I| is the mean of the values in the cluster at column j
- a_IJ = (Σ_{i∈I, j∈J} a_ij) / (|I||J|) is the mean of all values in the cluster
The goal of the algorithm is to find I and J such that the following mean squared residue score is minimized:
  H(I, J) = (1 / (|I||J|)) Σ_{i∈I, j∈J} (a_ij - a_iJ - a_Ij + a_IJ)^2

Example
Suppose the values in a cluster are generated exactly according to additive row and column effects. Then every residue is zero; for instance, with a_11 = 12 and a_1J = 12.5 in the slides' example, a_11 - a_1J - a_I1 + a_IJ = 0. You can verify this for the other i's and j's.
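The score H(I, J) can be computed directly from its definition; a sketch, which reproduces the behavior of the example above (zero residue for a perfectly additive bicluster, positive once an entry deviates):

```python
def mean_squared_residue(matrix, I, J):
    """H(I, J) for the bicluster of rows I and columns J (Cheng & Church score)."""
    a_iJ = {i: sum(matrix[i][j] for j in J) / len(J) for i in I}   # row means
    a_Ij = {j: sum(matrix[i][j] for i in I) / len(I) for j in J}   # column means
    a_IJ = sum(matrix[i][j] for i in I for j in J) / (len(I) * len(J))  # overall mean
    return sum(
        (matrix[i][j] - a_iJ[i] - a_Ij[j] + a_IJ) ** 2 for i in I for j in J
    ) / (len(I) * len(J))
```

The row effects, column effects and background used below are made-up numbers for illustration, not the (lost) table from the slides.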

Why this model?
We assume the expression level of a gene in a particular sample is determined by three additive effects:
- The cluster background, e.g., the activity of the whole functional pathway
- The gene, e.g., some genes are intrinsically more active
- The sample, e.g., in some samples, all the genes in the cluster are activated

Algorithm
How to find clusters (i.e., pairs (I, J)) that have small H values?
- It is proved that finding the largest cluster with H less than a fixed threshold δ is NP-hard
- Heuristic method:
  1. Randomly determine I and J
  2. Try all possible additions/deletions of one row/column, and accept the one that results in the smallest H (some variations involve addition or deletion only, or allow addition or deletion of multiple rows/columns)
  3. Repeat #2 until H does not decrease or it is smaller than the threshold δ
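A deletion-only variant of this heuristic (one of the simplifications the slide mentions; the published algorithm also has multiple-node deletion and addition phases, which are omitted here) might look like the following sketch, with an inlined residue function `msr`:

```python
def msr(matrix, I, J):
    """Mean squared residue H(I, J) of the bicluster (I, J)."""
    a_iJ = {i: sum(matrix[i][j] for j in J) / len(J) for i in I}
    a_Ij = {j: sum(matrix[i][j] for i in I) / len(I) for j in J}
    a_IJ = sum(a_iJ[i] for i in I) / len(I)
    return sum((matrix[i][j] - a_iJ[i] - a_Ij[j] + a_IJ) ** 2
               for i in I for j in J) / (len(I) * len(J))

def greedy_bicluster(matrix, delta, min_rows=2, min_cols=2):
    """Start from all rows/columns and repeatedly delete the single row or
    column whose removal lowers H the most, until H <= delta or no deletion helps."""
    I = set(range(len(matrix)))
    J = set(range(len(matrix[0])))
    while msr(matrix, I, J) > delta:
        candidates = []
        if len(I) > min_rows:
            candidates += [("row", i, msr(matrix, I - {i}, J)) for i in I]
        if len(J) > min_cols:
            candidates += [("col", j, msr(matrix, I, J - {j})) for j in J]
        if not candidates:
            break
        kind, idx, h = min(candidates, key=lambda c: c[2])
        if h >= msr(matrix, I, J):
            break  # no single deletion decreases H
        (I if kind == "row" else J).discard(idx)
    return I, J
```

The `min_rows`/`min_cols` limits implement the point made on the next slide: without them, the trivial one-row, one-column cluster always reaches H = 0.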

More details
- Obviously, if the cluster contains only one row and one column, the residue H must be 0; we could limit the minimum number of rows/columns
- A cluster containing genes that do not change their expression values across different samples may not be interesting; we could use the variance of expression values across samples as a secondary score
- How to find more than one cluster? After finding a cluster, replace its values with random values before calling the algorithm again

Some clusters found (figure)
Each line is a gene. The horizontal axis represents different time points.

CASE STUDY, SUMMARY AND FURTHER READINGS Epilogue

Case study: successful stories
Clustering of gene expression data has led to the discovery of disease subtypes and of key genes in some biological processes.
Example 1: automatic identification of the cancer subtypes acute myeloid leukemia (AML) and acute lymphoblastic leukemia (ALL) without prior knowledge of these classes (figure: clusterings into 2 and 4 clusters).
Image credit: Golub et al., Science 286(5439), (1999)

Case study: successful stories
Example 2: identification of genes involved in the response to external stress.
- Each triangle: multiple time points after an environmental change, such as heat shock or amino acid starvation
Image credit: Gasch et al., Molecular Biology of the Cell 11(12), (2000)

Case study: successful stories
Example 3: segmentation of the human genome into distinct region classes.
Image credit: The ENCODE Project Consortium, Nature 489(7414):57-74, (2012)

Summary
Clustering is the process of grouping similar things into clusters.
- It has many applications in bioinformatics; the most well-known one is gene expression analysis
Classical clustering algorithms:
- Agglomerative hierarchical clustering
- K-means
Subspace/bi-clustering algorithms.

Further readings
Leonard Kaufman and Peter J. Rousseeuw, Finding Groups in Data: An Introduction to Cluster Analysis, Wiley-Interscience, 1990.
- A classical reference book on cluster analysis