Presentation on theme: "Introduction to Bioinformatics" — Presentation transcript:
1 Introduction to Bioinformatics: Biological Networks
Department of Computing, Imperial College London
March 4, 2010, Lecture hours 14-15
Nataša Pržulj
2 Data Clustering
Find relationships and patterns in the data to gain insight into the underlying biology.
Clustering algorithms can be applied to the data to find groups of similar genes/proteins, or groups of similar samples.
3 What is data clustering?
Clustering is a method by which a large set of data is grouped into clusters (groups) of smaller sets of similar data.
Example: there are 10 balls of three different colours, and we are interested in clustering the balls into three groups. An intuitive solution is to group balls of the same colour together.
Identifying similarity by colour was easy; however, we want to extend this to numerical values to be able to deal with biological data, and also to cases with more features (not just colour).
4 Clustering
Partition a set of elements into subsets called clusters such that:
– elements of the same cluster are similar to each other (homogeneity property, H)
– elements from different clusters are different (separation property, S)
5 Clustering Algorithms
A clustering algorithm attempts to find natural groups of components (or data) based on some notion of similarity over the features describing them.
The clustering algorithm also finds the centroid of each group of data points.
To determine cluster membership, many algorithms evaluate the distance between a point and the cluster centroids.
The output from a clustering algorithm is essentially a statistical description of the cluster centroids, with the number of components in each cluster.
6 Clustering Algorithms
Cluster centroid: the centroid of a cluster is a point whose parameter values are the mean of the parameter values of all the points in the cluster.
Distance: generally, the distance between two points is taken as a common metric to assess the similarity among the components of a population. The most commonly used distance measure is the Euclidean distance, which defines the distance between two points p = (p1, p2, ...) and q = (q1, q2, ...) as:
d(p, q) = sqrt( (p1 - q1)^2 + (p2 - q2)^2 + ... )
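The two definitions above can be sketched in a few lines of Python; this is an illustrative sketch, not code from the lecture, and the names `centroid` and `euclidean` are my own.

```python
import math

def centroid(points):
    """Centroid of a cluster: the mean of each coordinate over all points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def euclidean(p, q):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))

cluster = [(1.0, 2.0), (3.0, 4.0), (5.0, 6.0)]
print(centroid(cluster))                 # mean of each coordinate
print(euclidean((0.0, 0.0), (3.0, 4.0)))  # the classic 3-4-5 triangle: 5.0
```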
7 Clustering Algorithms
There are many possible distance metrics. Some theoretical (and intuitive) properties of distance metrics:
– The distance between two items (elements) must be greater than or equal to zero; distances cannot be negative.
– The distance between an item and itself must be zero; conversely, if the distance between two items is zero, then the items must be identical.
– The distance between item A and item B must be the same as the distance between item B and item A (symmetry).
– The distance between item A and item C must be less than or equal to the sum of the distances between items A and B and between items B and C (triangle inequality).
8 Clustering Algorithms
Example distances:
– Euclidean (L2) distance
– Manhattan (L1) distance
– Lm: (|x1-x2|^m + |y1-y2|^m)^(1/m)
– L∞: max(|x1-x2|, |y1-y2|)
– Inner product: x1·x2 + y1·y2
– Correlation coefficient
For simplicity we will concentrate on the Euclidean and Manhattan distances.
9 Clustering Algorithms
Distance Measures: Minkowski Metric
Suppose two objects x and y both have n features: x = (x1, x2, ..., xn) and y = (y1, y2, ..., yn).
The Minkowski metric of order m is defined as:
d(x, y) = (|x1 - y1|^m + |x2 - y2|^m + ... + |xn - yn|^m)^(1/m)
10 Clustering Algorithms
Commonly used Minkowski metrics:
– m = 2: Euclidean distance
– m = 1: Manhattan (city-block) distance
– m → ∞: sup distance, max(|x1-y1|, ..., |xn-yn|)
11 Clustering Algorithms
Examples of Minkowski metrics:
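The Minkowski metric for the common choices of m can be sketched as follows; this is an illustrative example (not from the slides), using the points (0, 0) and (3, 4).

```python
def minkowski(p, q, m):
    """Minkowski (Lm) distance: m=1 gives Manhattan, m=2 gives Euclidean."""
    return sum(abs(pi - qi) ** m for pi, qi in zip(p, q)) ** (1 / m)

def sup_distance(p, q):
    """Limit m -> infinity: the largest per-coordinate difference (L-infinity)."""
    return max(abs(pi - qi) for pi, qi in zip(p, q))

p, q = (0, 0), (3, 4)
print(minkowski(p, q, 1))   # Manhattan: 3 + 4 = 7.0
print(minkowski(p, q, 2))   # Euclidean: sqrt(9 + 16) = 5.0
print(sup_distance(p, q))   # L-infinity: max(3, 4) = 4
```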
12 Clustering Algorithms
Distance/similarity matrices: clustering is based on distances, captured in a distance/similarity matrix:
– represents the distance between every pair of objects
– only half the matrix is needed, since it is symmetric
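A distance matrix of the kind described above can be built directly from any pairwise distance function; this is a minimal sketch (the helper names are illustrative), shown here with the Manhattan distance. The diagonal is zero and the matrix is symmetric, which is why only half of it is really needed.

```python
def manhattan(p, q):
    """Manhattan (L1) distance between two points."""
    return sum(abs(pi - qi) for pi, qi in zip(p, q))

def distance_matrix(points, dist):
    """Full N x N matrix of pairwise distances (symmetric, zero diagonal)."""
    n = len(points)
    return [[dist(points[i], points[j]) for j in range(n)] for i in range(n)]

M = distance_matrix([(0, 0), (3, 4), (0, 1)], manhattan)
print(M)  # [[0, 7, 1], [7, 0, 6], [1, 6, 0]]
```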
13 Clustering Algorithms
Hierarchical vs non-hierarchical:
Hierarchical clustering is the most commonly used method for identifying groups of closely related genes or tissues. It successively links genes or samples with similar profiles to form a tree structure.
K-means clustering is a non-hierarchical (flat) clustering method that requires the analyst to supply the number of clusters in advance and then allocates genes and samples to clusters appropriately.
14 Clustering Algorithms
Hierarchical Clustering: given a set of N items to be clustered, and an N×N distance (or similarity) matrix, the basic process of hierarchical clustering is:
1. Start by assigning each item to its own cluster, so that if you have N items, you now have N clusters, each containing just one item.
2. Find the closest (most similar) pair of clusters and merge them into a single cluster, so that you now have one cluster less.
3. Compute distances (similarities) between the new cluster and each of the old clusters.
4. Repeat steps 2 and 3 until all items are clustered into a single cluster of size N.
15 Clustering Algorithms
Hierarchical Clustering:
1. Scan the matrix for the minimum distance.
2. Join the two items into one node.
3. Update the matrix and repeat from step 1.
16 Clustering Algorithms
Hierarchical Clustering:
Distance between two points: easy to compute.
Distance between two clusters: harder to compute, with several options:
– Single-Link Method / Nearest Neighbor
– Complete-Link / Furthest Neighbor
– Average of all cross-cluster pairs
17 Clustering Algorithms
Hierarchical Clustering:
– Single-Link Method / Nearest Neighbor (also called the connectedness, or minimum, method): the distance between one cluster and another is equal to the shortest distance from any member of one cluster to any member of the other cluster.
– Complete-Link / Furthest Neighbor (also called the diameter, or maximum, method): the distance between one cluster and another is equal to the longest distance from any member of one cluster to any member of the other cluster.
– Average-link clustering: the distance between one cluster and another is equal to the average distance from any member of one cluster to any member of the other cluster.
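The agglomerative loop from slide 14, with the three linkage methods above, can be sketched as follows. This is an illustrative implementation, not the lecture's code: the function names are my own, and for brevity it merges until k clusters remain rather than building the full dendrogram.

```python
import math

def euclidean(p, q):
    """Euclidean distance between two points."""
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))

def linkage_distance(c1, c2, dist, method="single"):
    """Cluster-to-cluster distance under the chosen linkage method."""
    pair_dists = [dist(a, b) for a in c1 for b in c2]
    if method == "single":       # nearest neighbor: shortest cross-cluster pair
        return min(pair_dists)
    if method == "complete":     # furthest neighbor: longest cross-cluster pair
        return max(pair_dists)
    return sum(pair_dists) / len(pair_dists)  # average-link

def hierarchical(points, dist, method="single", k=1):
    """Agglomerative clustering: repeatedly merge the closest pair of clusters."""
    clusters = [[p] for p in points]          # start: one cluster per item
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):        # scan all pairs for the minimum
            for j in range(i + 1, len(clusters)):
                d = linkage_distance(clusters[i], clusters[j], dist, method)
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best                        # merge the closest pair
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

# Two well-separated pairs of points collapse into two clusters.
result = hierarchical([(0, 0), (0, 1), (5, 5), (5, 6)], euclidean, "single", k=2)
print(result)
```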
20 Clustering Algorithms
Hierarchical Clustering:
In a dendrogram, the length of each tree branch represents the distance between the clusters it joins.
Different dendrograms may arise when different linkage methods are used.
21 Clustering Algorithms
K-Means Clustering:
Basic idea: use cluster centroids (means) to represent clusters.
Assign data elements to the closest cluster (centroid).
Goal: minimize intra-cluster dissimilarity.
22 Clustering Algorithms
K-Means Clustering:
1. Pick (usually randomly) k points as the centers of k clusters.
2. Compute the distances between a non-center point v and each of the k center points; find the minimum distance, say it is to center point Ci, and assign v to the cluster defined by Ci.
3. Do this for all non-center points to obtain k non-overlapping clusters containing all the points.
4. For each cluster, compute its new center, which is the point with the minimum sum of distances from that point to all other points in the cluster.
5. Repeat until the algorithm converges, i.e., the same set of centers is chosen as in the previous iteration.
This results in non-overlapping clusters of potentially different sizes.
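The loop above can be sketched in Python. One simplification to flag: following slide 21, this sketch recomputes each center as the coordinate-wise mean of its cluster (the standard Lloyd's k-means), rather than as the cluster point minimizing the sum of distances; all names are illustrative.

```python
import math
import random

def euclidean(p, q):
    """Euclidean distance between two points."""
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))

def kmeans(points, k, max_iter=100, seed=0):
    """Lloyd's k-means: assign points to the nearest center, recompute
    centers as cluster means, and repeat until the centers stop changing."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)           # step 1: pick k initial centers
    clusters = [[] for _ in range(k)]
    for _ in range(max_iter):
        clusters = [[] for _ in range(k)]
        for p in points:                      # steps 2-3: nearest-center assignment
            i = min(range(k), key=lambda c: euclidean(p, centers[c]))
            clusters[i].append(p)
        new_centers = [                       # step 4: new center = cluster mean
            tuple(sum(x) / len(cl) for x in zip(*cl)) if cl else centers[i]
            for i, cl in enumerate(clusters)
        ]
        if new_centers == centers:            # step 5: converged
            break
        centers = new_centers
    return centers, clusters

# Two well-separated pairs of points should end up in separate clusters.
centers, clusters = kmeans([(0, 0), (0, 1), (10, 10), (10, 11)], 2)
print(clusters)
```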
24 Clustering Algorithms
K-means vs. Hierarchical clustering:
Computation time:
– Hierarchical clustering: O(m n^2 log n)
– K-means clustering: O(k t m n)
where t is the number of iterations, n the number of objects, m the dimension of the vectors, and k the number of clusters.
Memory requirements:
– Hierarchical clustering: O(mn + n^2)
– K-means clustering: O(mn + kn)
Other:
– Hierarchical clustering: need to select a linkage method; to perform any analysis, it is necessary to partition the dendrogram into k disjoint clusters by cutting the dendrogram at some point. A limitation is that it is not clear how to choose this k.
– K-means: need to select k.
– In both cases: need to select a distance/similarity measure.