1 An Efficient Optimal Leaf Ordering for Hierarchical Clustering in Microarray Gene Expression Data Analysis Jianting Zhang Le Gruenwald School of Computer.

Presentation transcript:

1 An Efficient Optimal Leaf Ordering for Hierarchical Clustering in Microarray Gene Expression Data Analysis Jianting Zhang Le Gruenwald School of Computer Science The University of Oklahoma Norman, Oklahoma, 73019, USA {jianting,

2 Outline Introduction Related Work The Proposed Ordering Method – The Objective Function – The Ordering Algorithm – An Example Experiment Conclusions and Future Work Directions

3 Introduction DNA microarray technologies have in recent years produced a tremendous amount of gene expression data. Clustering is one of the most popular methods for extracting patterns hidden in the data and for understanding the functions of genomes. Hierarchical clustering gives users an initial impression of the distribution of the data and encourages thorough inspection of the whole data set.

4 Introduction Hierarchical clustering trees are usually displayed with their leaves in linear order. Adjacent genes in a linear ordering are often hypothesized to be related in some manner => the ordering of the tree leaves is important for human perception. The genes in a hierarchical clustering tree are unordered: either orientation of the two sub-trees at each internal node is valid, so there are 2^(n-1) possible orderings for a balanced binary clustering tree with n genes.
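The 2^(n-1) count can be checked by enumeration on a small tree. The sketch below is only an illustration: the nested-tuple tree encoding and the function name `leaf_orderings` are assumptions made for this example, not part of the original slides.

```python
from itertools import product

def leaf_orderings(tree):
    """Yield every leaf ordering obtainable by flipping internal nodes.

    A tree is either a leaf label (any non-tuple value) or a pair (left, right).
    """
    if not isinstance(tree, tuple):
        yield [tree]
        return
    left, right = tree
    left_orders = list(leaf_orderings(left))
    right_orders = list(leaf_orderings(right))
    for l_order, r_order in product(left_orders, right_orders):
        yield l_order + r_order  # children kept as drawn
        yield r_order + l_order  # children flipped

# Hypothetical 4-gene clustering tree with n - 1 = 3 internal nodes.
tree = ((0, 1), (2, 3))
orderings = list(leaf_orderings(tree))
print(len(orderings), 2 ** (4 - 1))
```

Each internal node contributes one binary choice, which is where the exponent n-1 comes from.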

5 Introduction Additional criteria are needed to define a good ordering, together with algorithms that generate an optimal ordering under those criteria. An example: assume Gene 1 is more similar to Gene 4 (figure not included in the transcript).

6 Introduction Our research objective: to provide an efficient and optimal leaf ordering method for hierarchical clustering in microarray gene expression data analysis.

7 Related Work Heuristic Local Ordering Methods: – (Alizadeh, et al., 2000): proposed simple methods of weighting genes, such as average expression level, time of maximal induction, or chromosomal position; the element with the lower average weight is placed earlier in the final ordering. – (Alon, et al., 1999): each pair of sibling branches is ordered according to the proximity of their centroids to the centroid of their parent's sibling.
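A minimal sketch of the weight-based heuristic, assuming the "average weight" of a branch is the mean weight of its leaves. The tree encoding, the weights, and the name `order_by_weight` are hypothetical and serve only to illustrate the idea of a purely local decision at each node.

```python
def order_by_weight(tree, weight):
    """Return (leaf_order, weight_sum), placing the lighter child first.

    A tree is a leaf label or a pair (left, right); weight maps leaf -> number.
    """
    if not isinstance(tree, tuple):
        return [tree], weight[tree]
    left, right = tree
    l_order, l_sum = order_by_weight(left, weight)
    r_order, r_sum = order_by_weight(right, weight)
    # Compare mean weights and put the branch with the lower mean first.
    if r_sum / len(r_order) < l_sum / len(l_order):
        l_order, r_order = r_order, l_order
    return l_order + r_order, l_sum + r_sum

# Hypothetical per-gene weights (for example, average expression level).
weight = {0: 0.9, 1: 0.2, 2: 0.4, 3: 0.1}
order, _ = order_by_weight(((0, 1), (2, 3)), weight)
print(order)
```

Because every node is decided independently, nothing in this procedure looks at the similarity between genes that end up adjacent across branch boundaries.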

8 Related Work Ordering Based on Global Optimization – (Bar-Joseph, et al., 2001): for an ordering π of tree T, the objective function is defined in terms of the gene similarity matrix S. The proposed ordering algorithm, which maximizes the objective function, has O(n^4) time complexity and O(n^2) space complexity.
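The formula itself did not survive in the transcript. As commonly stated, the objective of (Bar-Joseph, et al., 2001) is the sum of the similarities between leaves that are adjacent in the ordering; the sketch below assumes that form. The matrix values and the function name `adjacent_similarity` are hypothetical.

```python
def adjacent_similarity(order, S):
    """Sum S over consecutive pairs of leaves in the ordering (to be maximized)."""
    return sum(S[a][b] for a, b in zip(order, order[1:]))

# Hypothetical symmetric similarity matrix for 4 genes.
S = [
    [1.0, 0.6, 0.5, 0.1],
    [0.6, 1.0, 0.2, 0.9],
    [0.5, 0.2, 1.0, 0.7],
    [0.1, 0.9, 0.7, 1.0],
]
print(adjacent_similarity([0, 1, 2, 3], S))
print(adjacent_similarity([1, 0, 2, 3], S))
```

Only neighbouring pairs contribute, so two orderings that differ in where distant but similar genes sit can receive the same score.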

9 Related Work Ordering Based on Global Optimization – (Ding, 2002): proposed a new objective function, arguing that the objective function used in (Bar-Joseph, et al., 2001) ignores the similarity between genes that are far apart in the ordering. Provided an approximate algorithm with O(n^2) complexity to minimize their objective function.

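10 Related Work Genes in linear order: U, V, W, X, Y. Similarity terms grouped by the distance between positions:
distance 1: 2 * [S(U,V)+S(V,W)+S(W,X)+S(X,Y)]
distance 2: 2 * [S(U,W)+S(V,X)+S(W,Y)]
distance 3: 2 * [S(U,X)+S(V,Y)]
distance 4: 2 * [S(U,Y)]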
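The grouping by distance can be reproduced mechanically. In the sketch below the factor 2 reflects counting each symmetric pair twice, as on the slide; the similarity values, the `power` parameter, and the function names are assumptions for illustration (`power=1` weights each group by its distance, `power=2` by the squared distance).

```python
def terms_by_distance(order, S):
    """Map each position distance d to 2 * (sum of S over leaf pairs d apart)."""
    n = len(order)
    return {
        d: 2 * sum(S[order[i]][order[i + d]] for i in range(n - d))
        for d in range(1, n)
    }

def weighted_objective(order, S, power=1):
    """Combine the grouped terms, weighting group d by d ** power."""
    return sum(d ** power * t for d, t in terms_by_distance(order, S).items())

genes = "UVWXY"
# Hypothetical similarities: higher for letters that are closer in the alphabet.
S = {a: {b: 5 - abs(i - j) for j, b in enumerate(genes)}
     for i, a in enumerate(genes)}

order = list(genes)
print(terms_by_distance(order, S))
print(weighted_objective(order, S, power=1), weighted_objective(order, S, power=2))
```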

11 The Proposed Ordering Method Our Objective Function: linear, instead of quadratic as defined in (Ding, 2002); it can be rewritten in an equivalent second form. Both forms sum the weighted distances (weighted by similarity) of all possible edges in a similarity graph under ordering π. π_i denotes the node (gene) at position i, while π(i) denotes the position of node (gene) i.
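The two formulas are missing from the transcript. A plausible reconstruction, assuming the objective has the standard minimum-linear-arrangement form and using the slide's notation (π_i for the gene at position i, π(i) for the position of gene i), is:

```latex
J(\pi) \;=\; \sum_{d=1}^{n-1} d \sum_{i=1}^{n-d} S(\pi_i, \pi_{i+d})
\;=\; \sum_{i<j} S(i,j)\,\bigl|\pi(i) - \pi(j)\bigr|
```

The first form groups pairs by their distance d in the ordering; the second sums over gene pairs directly. They agree because every pair of genes occupies exactly one pair of positions. If the sum runs over the full symmetric matrix rather than over i<j, both sides pick up a factor of 2. The symbol J and the exact normalization are assumptions, not taken from the slides.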

12 The Proposed Ordering Method Minimizing this objective function is a graph Minimum Linear Arrangement (MinLA) problem, with one difference: we order the leaf nodes of a tree, not the nodes of a regular graph. (Bar-Yehuda, 2001) proposed an algorithm that approximately solves the regular graph MinLA problem by imposing a global constraint through a Binary Decomposition Tree (BDT).

13 The Proposed Ordering Method We propose to use a binary clustering tree as the BDT and to apply the algorithm to the leaf ordering problem for hierarchical binary clustering trees. The algorithm of (Bar-Yehuda, 2001) produces the optimal leaf ordering in O(n^2) time. The algorithm is exact, not approximate, for leaf ordering in a hierarchical clustering tree for genes.

14 The Proposed Ordering Method [Figure: the two orientations of a node's left (L) and right (R) sub-trees, labeled 0-orientation and 1-orientation.] Number of possible orderings: 2^(n-1) if a BDT is full and balanced. Orientation tree (or_tree): the orientations at each intermediate node of the BDT form a tree that has the same structure as the BDT.

15 The Proposed Ordering Method Recursively test the two possible orientations of each BDT node and choose the better one (the one with the lower cost). Positions are used only implicitly in computing the cost, which makes the computation very efficient.
cost_0 = cost(left(0)) + cost(right(0)) + |V(t_2)| * right_cut(or_tree(t_1)) + |V(t_1)| * left_cut(or_tree(t_2))
cost_1 = cost(left(1)) + cost(right(1)) + |V(t_1)| * right_cut(or_tree(t_2)) + |V(t_2)| * left_cut(or_tree(t_1)) (1)

16 The Proposed Ordering Method
left_cut(or_tree(t)) = left_cut(left(or_tree(t))) + left_cut(right(or_tree(t))) - in_cut(t)
right_cut(or_tree(t)) = right_cut(left(or_tree(t))) + right_cut(right(or_tree(t))) - in_cut(t) (2)
– in_cut: sum of the weights of the edges whose beginning and ending nodes are both within sub-tree t
– left_cut: sum of the weights of the edges whose beginning node is not within sub-tree t and whose ending node is within sub-tree t
– right_cut: sum of the weights of the edges whose beginning node is within sub-tree t and whose ending node is not within sub-tree t
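The three cut quantities can be computed directly from their definitions. The sketch below does that for a sub-tree whose two children are leaves and checks formula (2) on it. The edge list is hypothetical, chosen so that the leaf cuts match the values quoted for nodes 0 and 1 in the worked example; the function name `cuts` is also an assumption.

```python
def cuts(nodes, edges):
    """Return (in_cut, left_cut, right_cut) for the sub-tree holding `nodes`.

    edges is an iterable of (begin, end, weight) triples.
    """
    inside = set(nodes)
    in_cut = sum(w for b, e, w in edges if b in inside and e in inside)
    left_cut = sum(w for b, e, w in edges if b not in inside and e in inside)
    right_cut = sum(w for b, e, w in edges if b in inside and e not in inside)
    return in_cut, left_cut, right_cut

# Hypothetical directed edges (begin, end, weight).
edges = [(2, 0, 0.5), (0, 1, 0.6), (3, 1, 1.15)]

_, left0, right0 = cuts([0], edges)          # leaf 0
_, left1, right1 = cuts([1], edges)          # leaf 1
in_t, left_t, right_t = cuts([0, 1], edges)  # sub-tree containing both leaves

# Formula (2) for a sub-tree whose children are both leaves.
assert abs(left_t - (left0 + left1 - in_t)) < 1e-9
assert abs(right_t - (right0 + right1 - in_t)) < 1e-9
print(left_t, right_t)
```

One point worth noting: the subtraction in formula (2) removes the edges that join the two children, since those are counted once in a child's left_cut or right_cut but do not leave t. For a sub-tree with two leaf children this is every edge inside t. For deeper sub-trees the recurrence balances only if in_cut(t) is read as the weight of the edges between the two children of t.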

17 The Proposed Ordering Method Algorithm of (Bar-Yehuda, 2001)
1. Pre-compute the in_cuts for all the non-leaf nodes in the BDT.
2. Set tree t to the root of the BDT and do the following recursively.
3. If t is an intermediate node of the BDT:
a) Compute the costs, the left_cuts, and the right_cuts under both the 0-orientation and the 1-orientation of its two sub-trees t_1 and t_2, according to formula (1) and formula (2), respectively.
b) Set the orientation of t to the one with the lower cost (the winner orientation).
4. If t is a leaf node of the BDT, its left_cut is the sum of the weights of the edges that have the node as their ending node, and its right_cut is the sum of the weights of the edges that have the node as their beginning node. The cost of the node is the same as its left_cut.

18 The Proposed Ordering Method An Example [Figures: the similarity graph, the clustering tree (BDT), and the in_cuts of the sub-trees rooted at T_0, T_12, and T_11.]

19 The Proposed Ordering Method Subtree at 0: left_cut=0.5, right_cut=0.60. Subtree at 1: left_cut=1.75, right_cut=0. T_11: left_cut=1.65, right_cut=0, cost=3.4.

20 The Proposed Ordering Method T_11: cost=2.75 (winner). T_12: 0-orientation: 1.6 (winner); 1-orientation: 1.65; left_cut=0, right_cut=1.65. T_0: 1-orientation: cost=4.35 => leaf ordering is ([2,3],[1,0]).

21 The Proposed Ordering Method The total cost for T_0 under the 0-orientation happens to be the same (4.35). By convention, we choose the 0-orientation if the costs of both orientations are the same. The optimized ordering is therefore ([0,1], [3,2]), with cost=4.35, the best among the 4!=24 possible orderings. The initial ordering is ([1,0], [2,3]), with cost=5.05.
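On a tree this small, the result of any ordering algorithm can be cross-checked by brute force. The sketch below is not the dynamic program from the slides. It enumerates the 8 tree-consistent orderings and all 24 permutations and compares the best linear arrangement cost in each set. The similarity graph of the worked example is not recoverable from the transcript, so the edge weights here are hypothetical, as are the function names.

```python
from itertools import permutations

def la_cost(order, edges):
    """Linear arrangement cost: sum of weight * position distance over edges."""
    pos = {gene: i for i, gene in enumerate(order)}
    return sum(w * abs(pos[u] - pos[v]) for u, v, w in edges)

def tree_orderings(tree):
    """Yield all leaf orderings obtainable by flipping internal nodes."""
    if not isinstance(tree, tuple):
        yield [tree]
        return
    left, right = tree
    for l_order in list(tree_orderings(left)):
        for r_order in list(tree_orderings(right)):
            yield l_order + r_order
            yield r_order + l_order

# Hypothetical undirected similarity edges (u, v, weight) on 4 genes.
edges = [(0, 1, 0.6), (0, 3, 0.5), (1, 2, 0.4), (2, 3, 0.9)]
tree = ((0, 1), (2, 3))

best_tree = min(tree_orderings(tree), key=lambda o: la_cost(o, edges))
best_any = min(permutations(range(4)), key=lambda o: la_cost(o, edges))
print(best_tree, la_cost(best_tree, edges))
print(list(best_any), la_cost(best_any, edges))
```

For these particular weights the best tree-consistent ordering is also best over all permutations. That need not hold in general, because the tree restricts the search to 2^(n-1) of the n! orderings.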

22 Experiment Data Set: 800 cell-cycle-regulated genes of the yeast Saccharomyces cerevisiae. The 800 genes have been assigned to five classes, termed G1, S, S/G2, G2/M and M/G1, based on domain knowledge in (Spellman, et al., 1998); the classes can be used to visually examine the quality of an ordering. The same genes are used in (Bar-Joseph, et al., 2001) to demonstrate the effectiveness of its leaf ordering technique.

23 Experiment Preprocessing: Remove genes with an excessive number of missing values before computing the gene similarity matrix; the final number of genes in our study is 765. Only consider the gene pairs whose similarities are above a certain threshold (avg of similarity + 1.5 * std_dev of similarity); this leaves a reduced set of gene-similarity pairs, and the average out-degree of the resulting similarity graph is about 50. Use a graph-partition package called Metis to generate a clustering tree by recursive graph bisection, and use that tree as our BDT.
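The thresholding step can be sketched as follows. The slide does not say whether the standard deviation is the population or the sample version; the population form (`statistics.pstdev`) is assumed here. The similarity values and the function name `strong_pairs` are hypothetical.

```python
from statistics import mean, pstdev

def strong_pairs(similarity, k=1.5):
    """Keep the pairs whose similarity exceeds mean + k * standard deviation."""
    values = list(similarity.values())
    threshold = mean(values) + k * pstdev(values)  # population std: an assumption
    return {pair: s for pair, s in similarity.items() if s > threshold}

# Hypothetical pairwise similarities for 4 genes.
similarity = {
    (0, 1): 0.90, (0, 2): 0.10, (0, 3): 0.20,
    (1, 2): 0.15, (1, 3): 0.10, (2, 3): 0.25,
}
kept = strong_pairs(similarity)
print(kept)
```

Only the retained pairs become edges of the similarity graph, which keeps the graph sparse enough for the partitioning and ordering steps.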

24 Experiment Computation Environment: Dell Dimension HZ personal computer with 512 MB of memory running Windows 2000 Professional. The ordering optimization takes about 27 seconds. Performance Metric: Normalized index

25 Experiment Results: original ordering (81.11), optimized ordering (69.69), average of 1000 random orderings (255.34). The optimized ordering is about 16.2% lower than the original ordering. The average of the 1000 random orderings is 3.14 times the original ordering and 3.66 times the optimized ordering. Both the graph-bisection-based clustering method and the proposed leaf ordering method are effective.
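The quoted ratios can be recomputed from the three reported index values. The short check below uses only the rounded numbers given on the slide; the variable names are introduced here for illustration.

```python
original, optimized, random_avg = 81.11, 69.69, 255.34

# Relative reduction of the optimized ordering with respect to the original.
reduction = (original - optimized) / original
print(f"reduction vs. original: {reduction:.1%}")

# Ratios of the random-ordering average to the other two values.
print(round(random_avg / original, 2), round(random_avg / optimized, 2))
```

With these rounded inputs the random-to-optimized ratio reproduces the quoted 3.66 and the random-to-original ratio comes out at about 3.15. The relative reduction computed against the original ordering is about 14.1%, which differs from the 16.2% on the slide; the slide's figure may rest on unrounded values or on a different baseline.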

26 Experiment Visual examination of 22 genes: the optimized ordering is more compact. For example, G1 is scattered across 6 segments in the original ordering but across only 3 segments in the optimized ordering.

27 Conclusions Leaf ordering of hierarchical clusters is significant in presenting gene expression data. Previously proposed methods are either inefficient or approximate in nature. The proposed method is efficient: the time complexity is O(n^2) when the clustering tree is balanced. The proposed method is optimal: it accounts for all 2^(n-1) possible orderings without relying on approximation. Preliminary experiments show good results.

28 Future Work Apply this graph-theoretical leaf ordering optimization method to more gene expression data, using clustering trees from a variety of hierarchical clustering methods as the BDTs for comparison purposes. Develop novel approaches to visually presenting the original and optimized cluster orderings to domain experts, and examine the validity of our optimization objective function.

29 Thanks! Questions?