University of Crete, CS483. The use of Minimum Spanning Trees in microarray expression data. Gkirtzou Ekaterini.
Introduction: Classic clustering algorithms such as K-means and self-organizing maps have certain drawbacks. They give no guarantee of globally optimal results, and they depend on the geometric shape of the cluster boundaries (K-means).
Introduction: MST clustering algorithms covered here: expression data clustering analysis (Xu et al., 2001); iterative clustering algorithm (Varma et al., 2004); dynamically growing self-organizing tree, DGSOT (Luo et al., 2004).
Definitions: A minimum spanning tree (MST) of a weighted, undirected graph $G = (V, E)$ with edge weights $w$ is an acyclic subset $T \subseteq E$ that connects all of the vertices and whose total weight $w(T) = \sum_{(u,v) \in T} w(u,v)$ is minimum.
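The definition above can be made concrete with a small sketch of Prim's algorithm, one standard way to build an MST. The function name `prim_mst`, the edge-list format, and the example graph are illustrative assumptions, not part of the presentation.

```python
import heapq

def prim_mst(n, edges):
    """Return (total_weight, tree_edges) for a connected undirected graph.

    n     -- number of vertices, labelled 0..n-1
    edges -- iterable of (u, v, weight) tuples
    """
    adj = [[] for _ in range(n)]
    for u, v, w in edges:
        adj[u].append((w, u, v))
        adj[v].append((w, v, u))

    visited = [False] * n
    visited[0] = True
    heap = list(adj[0])          # candidate edges leaving the growing tree
    heapq.heapify(heap)
    tree, total = [], 0.0

    while heap and len(tree) < n - 1:
        w, u, v = heapq.heappop(heap)   # cheapest candidate edge
        if visited[v]:
            continue                    # would close a cycle, skip it
        visited[v] = True
        tree.append((u, v, w))
        total += w
        for edge in adj[v]:
            if not visited[edge[2]]:
                heapq.heappush(heap, edge)
    return total, tree

# Hypothetical 4-vertex graph used only for illustration.
example_edges = [(0, 1, 1.0), (1, 2, 2.0), (0, 2, 2.5), (2, 3, 1.5), (1, 3, 4.0)]
mst_weight, mst_edges = prim_mst(4, example_edges)
```

For this example the tree keeps three of the five edges, the minimum needed to connect four vertices without a cycle.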
Definitions: DNA microarray technology enables the massively parallel measurement of the expression of thousands of genes simultaneously. It is used to compare the activity of genes in diseased and healthy cells, to categorize a disease into subgroups, and to support drug discovery and toxicology studies.
Definitions: Clustering is a common technique for data analysis. It partitions the data set into subsets (clusters) so that the data in each subset share some common trait.
Expression data clustering analysis: Let $D = \{d_i\}$ be a set of expression data, with each $d_i = (e_i^1, \dots, e_i^t)$ representing the expression levels at time 1 through time $t$ of gene $i$. We define a weighted, undirected graph $G(D) = (V, E)$ as follows: the vertex set is $V = \{d_i \mid d_i \in D\}$ and the edge set is $E = \{(d_i, d_j) \mid d_i, d_j \in D,\ i \neq j\}$.
Expression data clustering analysis: G is a complete graph. The weight of each edge is the distance between its two vertices, e.g. the Euclidean distance or a distance derived from the correlation coefficient. Each cluster corresponds to one subtree of the MST. No essential information is lost for clustering.
Clustering through removing long MST edges: Based on the intuitive notion of a cluster: cutting the K-1 longest edges of the MST leaves K subtrees, one per cluster. Works very well when inter-cluster edges are longer than intra-cluster ones.
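A minimal sketch of this edge-removal idea, assuming Euclidean distance on a complete graph and a simple O(n^2) Prim construction. The helper names `mst_complete` and `cluster_by_cutting` and the toy profiles are hypothetical.

```python
import math

def mst_complete(points):
    """Prim's algorithm on the complete Euclidean graph; returns (u, v, dist) edges."""
    n = len(points)
    in_tree = [False] * n
    best = [math.inf] * n      # cheapest known connection of each vertex to the tree
    parent = [-1] * n
    best[0] = 0.0
    edges = []
    for _ in range(n):
        u = min((i for i in range(n) if not in_tree[i]), key=lambda i: best[i])
        in_tree[u] = True
        if parent[u] >= 0:
            edges.append((parent[u], u, best[u]))
        for v in range(n):
            if not in_tree[v]:
                d = math.dist(points[u], points[v])
                if d < best[v]:
                    best[v], parent[v] = d, u
    return edges

def cluster_by_cutting(points, k):
    """Cut the k-1 longest MST edges and label the resulting subtrees."""
    # An MST has n-1 edges, so keeping the n-k shortest drops the k-1 longest.
    kept = sorted(mst_complete(points), key=lambda e: e[2])[: len(points) - k]
    label = list(range(len(points)))

    def find(i):
        # Union-find lookup with path halving.
        while label[i] != i:
            label[i] = label[label[i]]
            i = label[i]
        return i

    for u, v, _ in kept:
        label[find(u)] = find(v)
    return [find(i) for i in range(len(points))]

# Two well-separated toy groups of expression profiles (made-up values).
profiles = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels = cluster_by_cutting(profiles, k=2)
```

With the two groups this far apart, the single cut edge is the long bridge between them, so each group receives its own label.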
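An iterative clustering: Minimizes the total distance between the center of each cluster and its data, $\sum_{i=1}^{K} \sum_{d \in T_i} \mathrm{dist}(d, \mathrm{center}(T_i))$, where $T_i$ is the subtree of cluster $i$. Starts with K arbitrary clusters (subtrees) of the MST; then, for each pair of adjacent clusters, finds the edge to cut that optimizes this objective.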
A globally optimal clustering: Tries to partition the tree into K subtrees and to select K representatives, one per subtree, so as to minimize the total distance between each representative and the data in its subtree.
Iterative clustering algorithm: The clustering measure used here is Fukuyama-Sugeno, $FS(S) = \sum_{k=1}^{2} \sum_{j=1}^{N_k} \left[ \lVert x_j^k - \mu_k \rVert^2 - \lVert \mu_k - \mu \rVert^2 \right]$, where $S_1, S_2$ are the two partitions of the set $S$, each $S_k$ containing $N_k$ samples; $\mu_k$ denotes the mean of the samples in $S_k$, $\mu$ the global mean of all samples, and $x_j^k$ the $j$-th sample in cluster $S_k$.
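As a rough illustration, the measure can be computed as below. This assumes the usual two-cluster form of Fukuyama-Sugeno: for each sample, its squared distance to its cluster mean minus the squared distance between that cluster mean and the global mean, summed over both clusters. The function names and data are made up for the example.

```python
def mean_vector(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def sq_dist(a, b):
    """Squared Euclidean distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def fukuyama_sugeno(part1, part2):
    """F-S measure of a binary partition; lower values mean a better split."""
    global_mean = mean_vector(part1 + part2)
    score = 0.0
    for part in (part1, part2):
        cluster_mean = mean_vector(part)
        separation = sq_dist(cluster_mean, global_mean)
        # Compactness term minus separation term, once per sample.
        score += sum(sq_dist(x, cluster_mean) - separation for x in part)
    return score

# Same four samples, split along the natural gap versus across it.
tight = fukuyama_sugeno([(0, 0), (0, 2)], [(10, 0), (10, 2)])
mixed = fukuyama_sugeno([(0, 0), (10, 0)], [(0, 2), (10, 2)])
```

The split along the gap scores lower than the split across it, which is why the algorithm looks for the partition with the minimum value.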
Iterative clustering algorithm: Feature selection measures each gene's support for a partition. The feature selection used here is the t-statistic with pooled variance, used as a heuristic measure. Genes whose absolute t-statistic exceeds a threshold are selected.
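A sketch of this selection step, assuming the standard two-sample t-statistic with pooled variance, $t = (\bar{x}_1 - \bar{x}_2) / \sqrt{s_p^2 (1/n_1 + 1/n_2)}$. The gene names, expression values, and threshold are invented for illustration.

```python
import math
from statistics import mean, variance

def pooled_t(a, b):
    """Two-sample t-statistic with pooled variance.

    Assumes each group has at least two values and the pooled variance is nonzero.
    """
    n1, n2 = len(a), len(b)
    pooled_var = ((n1 - 1) * variance(a) + (n2 - 1) * variance(b)) / (n1 + n2 - 2)
    return (mean(a) - mean(b)) / math.sqrt(pooled_var * (1 / n1 + 1 / n2))

def select_genes(expr, group1, group2, threshold):
    """Keep genes whose |t| between the two sample groups exceeds the threshold.

    expr           -- dict mapping gene name to a list of per-sample values
    group1, group2 -- lists of sample indices forming the binary partition
    """
    selected = []
    for gene, values in expr.items():
        t = pooled_t([values[i] for i in group1], [values[i] for i in group2])
        if abs(t) > threshold:
            selected.append(gene)
    return selected

# Hypothetical data: geneA differs between the groups, geneB does not.
expr = {
    "geneA": [1.0, 2.0, 3.0, 7.0, 8.0, 9.0],
    "geneB": [5.0, 6.0, 4.0, 5.0, 4.0, 6.0],
}
selected = select_genes(expr, group1=[0, 1, 2], group2=[3, 4, 5], threshold=2.0)
```

Only the gene whose group means differ is kept. The threshold is a tuning choice, which is why the slide calls the measure a heuristic.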
Iterative clustering algorithm: Create an MST from all genes. Delete edges from the MST, one at a time, to obtain binary partitions, and select the partition with the minimum F-S clustering measure. Feature selection is then used to select the subset of genes that discriminates between the two clusters.
Iterative clustering algorithm: In the next iteration, the clustering is done on this selected set of genes. This repeats until the selected gene subset converges. The converged genes are then removed from the pool and the procedure continues.
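The partition step of this procedure (remove one MST edge, score the two sides, keep the best) could be sketched as follows. The scoring function is passed in as a parameter; the within-cluster sum of squares used in the example is only a stand-in for the F-S measure, and every name here is hypothetical.

```python
def binary_partitions(n, tree_edges):
    """Yield (removed_edge, side_a, side_b) for every edge of a spanning tree."""
    for removed in tree_edges:
        adj = {i: [] for i in range(n)}
        for u, v in tree_edges:
            if (u, v) != removed:
                adj[u].append(v)
                adj[v].append(u)
        # Depth-first search from one endpoint collects its side of the cut.
        side_a, stack = {removed[0]}, [removed[0]]
        while stack:
            node = stack.pop()
            for nxt in adj[node]:
                if nxt not in side_a:
                    side_a.add(nxt)
                    stack.append(nxt)
        side_b = set(range(n)) - side_a
        yield removed, sorted(side_a), sorted(side_b)

def best_partition(n, tree_edges, score):
    """Return the cut whose two sides minimize score(side_a, side_b)."""
    return min(binary_partitions(n, tree_edges), key=lambda p: score(p[1], p[2]))

# Toy 1-D data on a path-shaped tree 0-1-2-3 (made-up values).
values = [0.0, 1.0, 10.0, 11.0]
tree = [(0, 1), (1, 2), (2, 3)]

def within_ss(side_a, side_b):
    """Stand-in score: within-cluster sum of squares (not the F-S measure)."""
    total = 0.0
    for side in (side_a, side_b):
        m = sum(values[i] for i in side) / len(side)
        total += sum((values[i] - m) ** 2 for i in side)
    return total

cut_edge, side_a, side_b = best_partition(4, tree, within_ss)
```

Because a tree with n vertices has only n-1 edges, the search covers n-1 candidate partitions rather than every possible two-way split of the data.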
Dynamically growing self-organizing tree (DGSOT): In the previous algorithms the MST is constructed on the original set of data and used to test the intra-cluster quantity, while here the MST is used as a criterion to test the inter-cluster property.
DGSOT algorithm: A tree-structured self-organizing neural network that grows vertically and horizontally. It starts with a single root node, which is also a leaf. In every vertical growth step, two descendants are created for every leaf node whose heterogeneity is too high, and the learning process takes place.
DGSOT algorithm: Heterogeneity can be measured as variability (the maximum distance between the input data and the node) or as the average distortion of a leaf, $d_i = \frac{1}{D} \sum_{j=1}^{D} d(x_j, n_i)$, where $D$ is the total number of input data of leaf $i$, $d(x_j, n_i)$ is the distance between data $j$ and leaf $i$, and $n_i$ is the reference vector of leaf $i$.
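Both heterogeneity measures are short to compute. The sketch below assumes Euclidean distance and takes the average distortion to be the mean distance between a leaf's reference vector and the data assigned to it; the function names and values are illustrative.

```python
import math

def variability(data, reference):
    """Maximum distance between the leaf's input data and its reference vector."""
    return max(math.dist(x, reference) for x in data)

def average_distortion(data, reference):
    """Mean distance between the leaf's input data and its reference vector."""
    return sum(math.dist(x, reference) for x in data) / len(data)

# Hypothetical leaf with three assigned inputs and reference vector (0, 0).
leaf_data = [(0, 0), (3, 4), (6, 8)]
leaf_ref = (0, 0)
leaf_variability = variability(leaf_data, leaf_ref)
leaf_distortion = average_distortion(leaf_data, leaf_ref)
```

Variability reacts to a single far-away input, while the average distortion reflects the leaf as a whole.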
DGSOT algorithm: In every horizontal growth step, a child is added to every lowest non-leaf node until the validation criterion is satisfied, and the learning process takes place. The learning process distributes the data to the leaves in the best way: each input is assigned to its best matching node, the one with the minimum distance to it.
The validation criterion of DGSOT: Calculated without human intervention and based on geometric characteristics of the clusters. Create the Voronoi diagram for the input data. The Voronoi diagram divides the data set $D$ into $n$ regions $V(p)$: $V(p) = \{x \in D \mid \mathrm{dist}(x, p) \le \mathrm{dist}(x, q)\ \forall q\}$.
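For a finite data set, the Voronoi regions amount to assigning each data point to its nearest site. A minimal sketch, assuming Euclidean distance and breaking ties in favour of the first site (names and points are illustrative):

```python
import math

def voronoi_regions(data, sites):
    """Group data points by nearest site; regions[i] belongs to sites[i]."""
    regions = [[] for _ in sites]
    for x in data:
        nearest = min(range(len(sites)), key=lambda i: math.dist(x, sites[i]))
        regions[nearest].append(x)
    return regions

# Hypothetical 2-D inputs and two sites.
points = [(0, 0), (1, 0), (9, 0), (10, 1)]
sites = [(0, 0), (10, 0)]
regions = voronoi_regions(points, sites)
```

This is the same nearest-node rule that the learning process uses to pick the best matching node.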
The validation criterion of DGSOT: Define a weighted, undirected graph $G = (V, E)$. The vertex set $V$ is the set of centroids of the Voronoi cells, and the edge set is $E = \{(p_i, p_j) \mid p_i, p_j \in V,\ p_i \neq p_j\}$. Create the MST of the graph $G$.
Voronoi diagram of a 2D dataset (figure): In A, the dataset is partitioned into three Voronoi cells, and the MST of the centroids is 'even'. In B, the dataset is partitioned into four Voronoi cells, and the MST of the centroids is not 'even'.
The validation criterion of DGSOT: Cluster separation is $CS = E_{min} / E_{max}$, where $E_{min}$ is the minimum-length edge and $E_{max}$ the maximum-length edge of the MST. A low CS value means that two centroids are too close to each other and the Voronoi partition is not valid, while a high CS value means that the Voronoi partition is valid.
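A sketch of the cluster separation check, assuming CS is the ratio of the shortest to the longest edge in the MST of the centroids, with Euclidean edge lengths. The function names and centroid coordinates are invented to mimic an 'even' and a not 'even' tree.

```python
import math

def mst_edge_lengths(points):
    """Edge lengths of the Euclidean MST, built with a simple Prim loop."""
    n = len(points)
    in_tree = {0}
    lengths = []
    while len(in_tree) < n:
        # Shortest edge from the current tree to any vertex outside it.
        d, v = min(
            (math.dist(points[u], points[w]), w)
            for u in in_tree
            for w in range(n)
            if w not in in_tree
        )
        in_tree.add(v)
        lengths.append(d)
    return lengths

def cluster_separation(centroids):
    """Shortest MST edge over longest MST edge; needs at least two centroids."""
    lengths = mst_edge_lengths(centroids)
    return min(lengths) / max(lengths)

# Evenly spaced centroids versus the same set plus one crowded centroid.
cs_even = cluster_separation([(0, 0), (4, 0), (8, 0)])
cs_uneven = cluster_separation([(0, 0), (4, 0), (8, 0), (8, 1)])
```

The ratio lies between 0 and 1. Adding a centroid right next to an existing one pulls it down, which signals that the extra cell is not a valid separate cluster.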
Conclusions: The tree algorithms presented in this report have provided results comparable to those obtained by classic clustering algorithms, without their drawbacks, and superior to those obtained by standard hierarchical clustering.