GPX: Interactive Exploration of Time-series Microarray Data

Presentation transcript:

GPX: Interactive Exploration of Time-series Microarray Data
Daxin Jiang, Jian Pei, and Aidong Zhang
Motivations: time-series microarray data have specific features, and the domain of biology imposes special requirements. Most clustering algorithms may not address these problems effectively.

Time-series Microarray Data Gene expression levels are monitored at different time points during a time series.

Co-expressed Genes and Coherent Patterns
Examples of co-expressed genes and coherent patterns in gene expression data, shown in parallel coordinates for Iyer’s data [1].
[1] Iyer, V.R. et al. The transcriptional program in the response of human fibroblasts to serum. Science, 283:83–87, 1999.

Example – Cell Cycle
The cell cycle: early G1 phase, late G1 phase, S phase, G2 phase, M phase.
Expression patterns of cell-cycle regulated genes of yeast reported by Spellman et al. [2]
[2] Spellman et al. (1998). Comprehensive identification of cell cycle-regulated genes of the yeast Saccharomyces cerevisiae by microarray hybridization. Molecular Biology of the Cell 9, 3273–3297.

Cluster Analysis
Partition the data set into several disjoint clusters. Each cluster is a group of co-expressed genes, and the centroid of the cluster is the coherent pattern.
Various clustering methods exist: partition-based approaches, hierarchical approaches, density-based approaches, ...

What the Data Look Like L. Zhang et al. Enhanced Visualization of Time Series through Higher Fourier Harmonics. BIOKDD 2003

High Connectivity of the Data
Two genes, ga and gb, with completely different patterns can be connected by a “bridge”.

Hierarchies of Co-expressed Genes and Coherent Patterns The interpretation of co-expressed genes and coherent patterns mainly depends on the domain knowledge

To Split or Not
Whether to split a group into group A1 and group A2 depends on “domain knowledge”.

Which Split Option to Choose
The choice depends on “domain knowledge”. Various split options may correspond to different hypotheses regarding gene function.

What is a “Good” Clustering Algorithm
It should form a hierarchical structure, be flexible and convenient for deriving clusters, support users’ domain knowledge, and handle the high connectivity effectively.

Partition-based Approaches
Form a hierarchical structure? Yes, if we use it as the split strategy in a divisive approach.
Flexible and convenient to derive clusters? No, since the parameters are hard to determine.
Handle the high connectivity effectively? No, since it partitions the data set by force.

Cut cluster borders by force

Hierarchical Approaches
Form a hierarchical structure? Sure.
Flexible and convenient to derive clusters? A global threshold is convenient but not flexible.
Handle the high connectivity effectively? That depends on which inter-cluster measure is used; e.g., complete-link may be better than single-link.

Density-based Approaches
Form a hierarchical structure? Not explicitly, but it is possible if we adjust the parameters level by level.
Flexible and convenient to derive clusters? DBSCAN and DENCLUE use global thresholds, so they are not flexible. OPTICS plots the cluster structure, which is both flexible and convenient.

Density-based Approaches
Handle the high connectivity effectively? DBSCAN and OPTICS are not effective, because “indirectly density-reachable” objects form a chain. DENCLUE cuts the cluster border by force: it uses center-defined clusters, in which a local maximum of density is the “center” of a cluster and other objects in the cluster are “attracted” to the local maximum.

Our Solution – An Interactive Approach
Adopt a divisive approach to form a hierarchical structure. Users can choose whether to split or not. The approach still needs one parameter, but that parameter is robust and easy to determine.
Plot the cluster structure of the data set. Users can explore the data set by “drill down” and “roll up” operations based on their domain knowledge.
Apply a novel strategy to handle the high connectivity. Users can determine the cluster border.

Pattern-based Strategy
Goal: find co-expressed genes and coherent expression patterns.
Cluster-based strategy: first find clusters as co-expressed genes, then use centroids as coherent expression patterns.
Pattern-based strategy: first find coherent expression patterns, then determine the co-expressed genes conforming to the pattern.
[Figure: genes g_i1, ..., g_in listed by similarity to a coherent pattern, with similarities ranging from 0.99 down to -0.55.]

Distance Measure
Users are interested in the overall shape of an expression profile, so Euclidean distance on the raw values does not work well for shifting patterns and scaling patterns.
Normalize each data object O to O' with a mean of 0 and a variance of 1: $o'_i = (o_i - \mu) / \sigma$ for $i = 1, \dots, m$, where $m$ is the number of attributes, and $\mu$ and $\sigma$ are the mean and the standard deviation of O, respectively.
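The normalization step can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' code: the function name `normalize` and the example profiles are hypothetical, and the population standard deviation (ddof=0) is an assumption, since the slide does not say which estimator is used.

```python
import numpy as np

def normalize(profile):
    """Shift and scale one expression profile to mean 0 and variance 1."""
    profile = np.asarray(profile, dtype=float)
    mu = profile.mean()
    sigma = profile.std()  # population standard deviation (assumption)
    if sigma == 0:
        raise ValueError("a constant profile has no shape to normalize")
    return (profile - mu) / sigma

# Hypothetical profiles: the same shape, shifted and scaled.
base = np.array([1.0, 3.0, 2.0, 5.0])
shifted = base + 10.0   # shifting pattern
scaled = base * 4.0     # scaling pattern

norm_base = normalize(base)
norm_shifted = normalize(shifted)
norm_scaled = normalize(scaled)
```

After normalization the three profiles coincide, which is why a distance computed on O' reflects shape rather than absolute expression level.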

Distance Measure
Similarity and distance are defined between two genes (objects). The similarity and distance measures defined above are consistent: given objects O1, O2, O3 and O4, Similarity(O1,O2) ≥ Similarity(O3,O4) if and only if Distance(O1,O2) ≤ Distance(O3,O4).
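One way to see why such a consistency property can hold is to take Pearson's correlation coefficient as the similarity and the Euclidean distance between normalized objects as the distance. That pairing is an assumption here (the formulas on the slide are not preserved), and the function names and sample profiles are hypothetical. With the population standard deviation, the squared distance equals 2m(1 - similarity), so a larger similarity always means a smaller distance.

```python
import numpy as np

def zscore(profile):
    """Normalize a profile to mean 0 and variance 1 (population std)."""
    profile = np.asarray(profile, dtype=float)
    return (profile - profile.mean()) / profile.std()

def similarity(a, b):
    """Pearson's correlation coefficient between two profiles."""
    return float(np.mean(zscore(a) * zscore(b)))

def distance(a, b):
    """Euclidean distance between the normalized profiles."""
    return float(np.linalg.norm(zscore(a) - zscore(b)))

# Hypothetical profiles with m = 5 attributes.
o1 = [1.0, 2.0, 3.0, 4.0, 5.0]
o2 = [2.0, 4.1, 5.9, 8.2, 9.9]   # nearly the same shape as o1
o3 = [5.0, 1.0, 4.0, 2.0, 3.0]   # a different shape
m = len(o1)

# Identity under these assumptions: distance^2 == 2 * m * (1 - similarity)
lhs = distance(o1, o2) ** 2
rhs = 2 * m * (1 - similarity(o1, o2))
```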

A Density-based Model A group of co-expressed genes form a dense area; Genes at the core area have high density, while genes at the boundary area have low density; Genes at the boundary area are “attracted” towards the local maximum level by level.

Density Measures Radius-based density KNN-based density DENCLUE density

Definition of Density
We modify the density definition of DENCLUE [3], which is built on an influence function (attraction function) over a data set D. Here d(Oi,Oj) is the distance between Oi and Oj, σ is a parameter, and the definition also uses the estimated average similarity within a cluster.
[3] Hinneburg, A. et al. An efficient approach to clustering in large multimedia databases with noise. Proc. 4th Int. Conf. on Knowledge Discovery and Data Mining, 1998.
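For orientation, the unmodified Gaussian influence function from DENCLUE can be sketched as follows. This is the standard DENCLUE form, not the modified definition used by GPX, whose formula is not reproduced in this transcript. The function name, the sample points, and the value of `sigma` are hypothetical.

```python
import numpy as np

def gaussian_density(data, sigma):
    """Density of each object as the sum of Gaussian influences.

    influence(Oi, Oj) = exp(-d(Oi, Oj)^2 / (2 * sigma^2)),
    density(Oi) = sum of influence(Oi, Oj) over all Oj in the data set
    (including Oi itself, which contributes 1).
    """
    data = np.asarray(data, dtype=float)
    diff = data[:, None, :] - data[None, :, :]
    sq_dist = np.sum(diff ** 2, axis=-1)
    influence = np.exp(-sq_dist / (2.0 * sigma ** 2))
    return influence.sum(axis=1)

# Three objects close together and one far away.
points = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [5.0, 5.0]]
density = gaussian_density(points, sigma=1.0)
```

Objects in the dense group receive a density close to 3, while the isolated object receives a density close to 1, matching the idea that core genes have high density and boundary genes have low density.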

Attraction Tree
The “attractor” of object O is its nearest neighbor with a higher density than O, denoted by O → Attractor(O).
We can derive an attraction tree based on the “attractor” relationship. The weight of each edge e(Oi,Oj) on the attraction tree is defined as the similarity between Oi and Oj, using Pearson’s correlation coefficient as the similarity measure.
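The attractor relationship can be sketched as a parent array computed from a distance matrix and a density vector. This is a minimal illustration under stated assumptions: the function name, the 1-D positions, and the hand-picked densities are hypothetical, and objects tied for the highest density are simply treated as roots.

```python
import numpy as np

def attraction_tree(dist, density):
    """Return parent[i], the index of the attractor of object i.

    The attractor is the nearest neighbor with a strictly higher density.
    Objects with no denser neighbor get parent -1 and act as roots.
    """
    density = np.asarray(density, dtype=float)
    n = len(density)
    parent = np.full(n, -1)
    for i in range(n):
        higher = np.flatnonzero(density > density[i])
        if higher.size:
            parent[i] = higher[np.argmin(dist[i, higher])]
    return parent

# Hypothetical 1-D example: positions and hand-picked densities.
positions = np.array([0.0, 0.1, 0.2, 3.0, 3.1])
dist = np.abs(positions[:, None] - positions[None, :])
density = [2.0, 3.0, 2.5, 1.5, 1.0]

parent = attraction_tree(dist, density)
```

Here object 1 is the densest and becomes the root; objects 0 and 2 attach to it directly, and the two outlying objects attach level by level (4 to 3, 3 to 2). The edge weight for each pair would then be the similarity between the object and its attractor.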

An Example of Attraction Tree
[Figure: an example data set and its attraction tree.]
Three features of the attraction tree:
Self-closed: a group of objects conforming to the same coherent pattern forms an attraction subtree.
Robust to intermediate genes (noise).
Three levels of edge weights.

Index List
The index list is a serialization of the attraction tree. Search the attraction tree based on the edge weight, and order the genes in the “index list”.
[Figure: the attraction tree and the resulting index list.]
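The slide does not spell out the exact search order, so the following is only one plausible serialization: a depth-first traversal from the root that visits children in order of decreasing edge weight. The function name, the traversal rule, and the sample weights are all assumptions made for illustration.

```python
def index_list(parent, weight):
    """Serialize an attraction tree into an ordered list of object indices.

    parent[i] is the attractor of object i, or -1 for a root.
    weight[i] is the similarity between object i and parent[i]
    (ignored for roots). Children with higher weight are visited first.
    """
    children = {}
    roots = []
    for node, par in enumerate(parent):
        if par == -1:
            roots.append(node)
        else:
            children.setdefault(par, []).append(node)

    order = []
    stack = list(reversed(roots))
    while stack:
        node = stack.pop()
        order.append(node)
        # Push in ascending weight so the highest-weight child is popped first.
        kids = sorted(children.get(node, []), key=lambda k: weight[k])
        stack.extend(kids)
    return order

# Hypothetical tree: object 1 is the root.
parent = [1, -1, 1, 2, 3]
weight = [0.99, 0.0, 0.95, 0.60, 0.97]
order = index_list(parent, weight)
```

With this rule, genes joined by high-similarity edges end up adjacent in the list, and a low-weight edge (0.60 here) marks a point where the list moves from one group of genes to another.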

Index List
Similarity curve for Iyer’s data set.

Coherent Pattern Index Graph
Compute the “coherent pattern index (CPI)” for each gene. p is a parameter, and Sim(gi) is the similarity between gi and its parent gj on the attraction tree.
[Figure: the index list and the coherent pattern index graph.]

[Figure labels: M phase, S phase, Early G1 phase, Late G1 phase, G2 phase.]

Validation Measure
Ground truth patterns P1, P2, ..., Pn are matched against reported patterns C1, C2, ..., Cm.
For example, P1 is matched by C4 with similarity 0.95 (suppose Sim(P1,C4) = 0.95), and P2 is matched by C1 with similarity 0.90 (suppose Sim(P2,C1) = 0.90).
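The matching step can be sketched as follows, assuming that each ground truth pattern is matched by the reported pattern with the highest Pearson correlation. The slide does not state how ties or repeated matches are handled, so this sketch lets several ground truth patterns match the same reported pattern. The function name and the sample patterns are hypothetical.

```python
import numpy as np

def match_patterns(truth, reported):
    """For each ground truth pattern, find the most similar reported pattern.

    Returns a list of (truth_index, reported_index, similarity) tuples,
    where similarity is Pearson's correlation coefficient.
    """
    def zscore_rows(x):
        x = np.asarray(x, dtype=float)
        return (x - x.mean(axis=1, keepdims=True)) / x.std(axis=1, keepdims=True)

    zt = zscore_rows(truth)
    zr = zscore_rows(reported)
    sim = zt @ zr.T / zt.shape[1]      # sim[i, j] = correlation(P_i, C_j)
    best = sim.argmax(axis=1)
    return [(i, int(j), float(sim[i, j])) for i, j in enumerate(best)]

# Hypothetical patterns over four time points.
truth = [[1.0, 2.0, 3.0, 4.0],
         [4.0, 3.0, 2.0, 1.0]]
reported = [[4.1, 3.0, 2.2, 1.0],
            [1.0, 1.0, 2.0, 1.0],
            [0.0, 1.0, 2.0, 3.1]]

matches = match_patterns(truth, reported)
```

In this example the first ground truth pattern is matched by the third reported pattern and the second by the first reported pattern, each with a similarity above 0.99.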

Comparison With Other Approaches
The similarity between the pattern reported by different approaches and the corresponding pattern in the ground truth (if any). Where an approach has no corresponding pattern the cell is empty, so rows with fewer than seven values cannot be aligned to specific columns here.
Columns: GPX (10), Kmeans (10), SOM (10), SOTA (50), Adapt (11), CLICK (7), CAST (9)
Pattern 1: 0.998 0.973 0.983 0.962 0.956 0.884 0.955
Pattern 2: 0.996 0.950 0.992 0.936 0.911 0.991 0.887
Pattern 3: 0.993 0.910 0.872 0.947 0.994 0.997
Pattern 4: 0.995 0.989 0.984 0.883 0.968
Pattern 5: 0.964 0.882 0.716 0.868 0.855
Pattern 6: 0.940 0.965 0.764 0.972 0.970
Pattern 7: 0.880 0.892 0.988 0.976 0.990 0.719
Pattern 8: 0.963 0.917 0.958 0.914 0.999
Pattern 9: 0.907 0.848 0.824 0.844 0.800
Pattern 10: 0.987 0.930 0.960 0.981

Comparison With Other Approaches
Columns: GPX (7), Kmeans (5), SOM (5), SOTA (99), Adapt (21), CLICK (9), CAST (19)
Pattern 1: 0.901 0.928 0.194 0.938 0.884 0.855 0.900
Pattern 2: 0.970 0.976 0.972 0.968 0.978
Pattern 3: 0.980 0.950 0.552 0.940 0.953 0.888
Pattern 4: 0.773 0.437 0.961 0.796 0.984
Pattern 5: 0.945 0.965 0.964 0.956 0.962

Comparison with OPTICS
[Figures: results on Iyer’s data set and Spellman’s data set.]

Effects of Parameters
[Figures: results on Spellman’s data set and Iyer’s data set.]

Scalability The algorithm scales well with large data sets. The computation time is dominated by the distance calculation.