The University of Kansas, EECS 800 Research Seminar: Mining Biological Data. Instructor: Luke Huan, Fall 2006.
Lecture: Model-Based Clustering (10/04/2006).

Slide 2: Model-Based Clustering
What is model-based clustering? It attempts to optimize the fit between the given data and some mathematical model, based on the assumption that the data are generated by a mixture of underlying probability distributions.
Typical methods:
Statistical approach: EM (Expectation Maximization), AutoClass
Machine learning approach: COBWEB, CLASSIT
Neural network approach: SOM (Self-Organizing Feature Map)

Slide 3: EM (Expectation Maximization)
EM is a popular iterative refinement algorithm and an extension of k-means: each object is assigned to a cluster according to a weight (a probability distribution), and new means are computed from these weighted assignments.
General idea:
Start with an initial estimate of the parameter vector.
Iteratively rescore the patterns against the mixture density produced by the current parameter vector.
Use the rescored patterns to update the parameter estimates.
Patterns are placed in the same cluster when their scores assign them to the same mixture component.
The algorithm converges quickly but may not reach the global optimum.
AutoClass (Cheeseman and Stutz, 1996) is a related statistical approach.

Slide 4: 1D Gaussian Mixture Model
Given a set of data points in a 1-D space, how do we cluster them?
General idea: factorize the p.d.f. into a mixture of simple models.
Discrete values: Bernoulli distribution.
Continuous values: Gaussian distribution.
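For reference, the mixture density that this slide factorizes (continuous case) is the standard 1-D Gaussian mixture,

$$p(x) \;=\; \sum_{k=1}^{K} \pi_k\, \mathcal{N}\!\left(x \mid \mu_k, \sigma_k^2\right), \qquad \sum_{k=1}^{K} \pi_k = 1,\; \pi_k \ge 0,$$

whose mixing weights $\pi_k$, means $\mu_k$, and variances $\sigma_k^2$ are the parameters that EM estimates on the following slides.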

Slide 5: The EM (Expectation Maximization) Algorithm
Initially, randomly assign k cluster centers.
Iteratively refine the clusters with two steps:
Expectation step: assign each data point $x_i$ to cluster $C_k$ with probability $P(C_k \mid x_i) = \dfrac{\pi_k\, \mathcal{N}(x_i \mid \mu_k, \sigma_k^2)}{\sum_j \pi_j\, \mathcal{N}(x_i \mid \mu_j, \sigma_j^2)}$, i.e. the component density weighted by its mixing proportion and normalized over all components.
Maximization step: re-estimate the model parameters from these weighted assignments.
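As a concrete illustration of the two steps above, here is a minimal sketch of EM for a 1-D Gaussian mixture; it is our own example (the function name, initialization, and the small variance floor are ours), not code from the lecture.

```python
import numpy as np

def em_gmm_1d(x, k, n_iter=100, seed=0):
    """Minimal EM for a 1-D Gaussian mixture with k components."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    n = x.size
    # Initialization: random cluster centers, equal weights, common variance.
    mu = rng.choice(x, size=k, replace=False)
    var = np.full(k, x.var() + 1e-9)
    pi = np.full(k, 1.0 / k)
    for _ in range(n_iter):
        # E-step: responsibility of component j for point i (posterior weights).
        dens = (np.exp(-0.5 * (x[:, None] - mu) ** 2 / var)
                / np.sqrt(2 * np.pi * var))                 # shape (n, k)
        resp = pi * dens
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: re-estimate mixing weights, means, and variances.
        nk = resp.sum(axis=0)
        pi = nk / n
        mu = (resp * x[:, None]).sum(axis=0) / nk
        var = (resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk + 1e-9
    return pi, mu, var
```

Each pass performs exactly the rescoring (E-step) and parameter update (M-step) described on the slide; like k-means, the result depends on the initialization and may be a local optimum.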

Slide 6: Another Way of Doing K-Means?
Pros: AutoClass can adapt to different (convex) cluster shapes, whereas k-means assumes spheres; it has a solid statistical foundation.
Cons: computationally expensive.

Slide 7: Model-Based Subspace Clustering
Outline: microarray data; bi-clustering; δ-clustering; p-clustering; OP-clustering.

Slide 8: Microarray Dataset

Slide 9: Gene Expression Matrix
[Figure: a genes × conditions expression matrix; the conditions may be time points or cancer tissues.]

Slide 10: Data Mining: Clustering
K-means clustering minimizes the within-cluster sum of squared distances,
$\sum_{k=1}^{K} \sum_{x_i \in C_k} \lVert x_i - \mu_k \rVert^2$,
where $\mu_k$ is the mean of cluster $C_k$.
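For completeness, a tiny helper (ours, not from the slides) that evaluates this K-means objective for a given assignment:

```python
import numpy as np

def kmeans_objective(X, labels, centers):
    """Total within-cluster sum of squared distances (the quantity K-means minimizes)."""
    X = np.asarray(X, dtype=float)
    centers = np.asarray(centers, dtype=float)
    return float(((X - centers[labels]) ** 2).sum())
```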

Slide 11: Clustering by Pattern Similarity (p-Clustering)
[Figure: parallel-coordinates plot of the raw microarray values of 3 genes in a multi-dimensional space.]
It is difficult to spot their common pattern directly, which motivates a "non-traditional" form of clustering.

Slide 12: Clusters Are Clear After Projection

Slide 13: Why p-Clustering?
Microarray data analysis may require:
Clustering on thousands of dimensions (attributes).
Discovery of both shift and scaling patterns.
Clustering with a Euclidean distance measure cannot find shift patterns.
Clustering on derived attributes $A_{ij} = a_i - a_j$ introduces N(N-1) extra dimensions.
Bi-clustering uses the mean-squared residue score of a submatrix (I, J),
$H(I, J) = \frac{1}{|I||J|} \sum_{i \in I,\, j \in J} (a_{ij} - a_{iJ} - a_{Ij} + a_{IJ})^2$,
where $a_{iJ}$, $a_{Ij}$, and $a_{IJ}$ are the row, column, and submatrix means; a submatrix is a δ-cluster if H(I, J) ≤ δ for some δ > 0.
Problems with the bi-cluster approach:
No downward closure property.
Due to averaging, a cluster may contain outliers and still stay within the δ threshold.
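The mean-squared residue above translates directly into code; the sketch below is ours (names are illustrative), following the Cheng & Church definition cited later in the deck, and is reused by later sketches.

```python
import numpy as np

def mean_squared_residue(A, rows, cols):
    """H(I, J) of the submatrix of A indexed by rows I and columns J."""
    sub = np.asarray(A, dtype=float)[np.ix_(rows, cols)]
    row_mean = sub.mean(axis=1, keepdims=True)   # a_iJ
    col_mean = sub.mean(axis=0, keepdims=True)   # a_Ij
    all_mean = sub.mean()                        # a_IJ
    residue = sub - row_mean - col_mean + all_mean
    return float((residue ** 2).mean())

# A submatrix (I, J) is a delta-bicluster if mean_squared_residue(A, I, J) <= delta.
```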

Slide 14: Motivation: DNA Microarray Analysis
[Table: an example expression submatrix of ten yeast genes (CTFC…, VPS…, EFB…, SSA…, FUN…, SP…, MDM…, CYS…, DEP…, NTG…) under conditions CH1I, CH1B, CH1D, CH2I, CH2B; the full gene names and expression values are not recoverable.]

Slide 15: Motivation

Slide 16: Motivation
The selected objects exhibit strong coherence on the selected attributes: they are not necessarily close to each other, but they bear a constant shift. This is an object/attribute-bias bi-cluster.

Slide 17: Challenges
The set of objects and the set of attributes forming a cluster are usually unknown.
Different objects/attributes may possess different biases; such biases may be local to the selected set of objects/attributes and are usually unknown in advance.
The data may have many unspecified entries.

Slide 18: Previous Work
Subspace clustering: identify a set of objects and a set of attributes such that the objects are physically close to each other in the subspace formed by those attributes.
Collaborative filtering (Pearson R): only considers a global offset for each object/attribute.

Slide 19: Bi-Cluster Terms
A bi-cluster consists of a (sub)set of objects and a (sub)set of attributes, and corresponds to a submatrix.
Occupancy threshold: each object/attribute has to be filled to a certain percentage.
Volume: the number of specified entries in the submatrix.
Base: the average value of each object/attribute within the bi-cluster.
(Biclustering of Expression Data, Cheng & Church, ISMB'00)

Slide 20: Bi-Cluster Example
[Table: the same yeast-gene submatrix as on slide 14 (genes CTFC3, VPS…, EFB…, SSA1, FUN14, SP07, MDM10, CYS…, DEP1, NTG1 under conditions CH1I, CH1B, CH1D, CH2I, CH2B), extended with an object base per row and an attribute base per column; the numeric values are not recoverable.]

Slide 21: [Figure: 40 genes over 17 conditions.]

Slide 22: Motivation

Slide 23: [Figure: 40 genes over 17 conditions.]

Slide 24: Motivation: Co-Regulated Genes

Slide 25: Bi-Cluster Residue
A perfect δ-cluster has zero residue; in an imperfect δ-cluster, the residue of an entry is $r_{ij} = d_{ij} - d_{iJ} - d_{Ij} + d_{IJ}$, where $d_{iJ}$, $d_{Ij}$, and $d_{IJ}$ are the row, column, and submatrix averages.

Slide 26: Bi-Cluster
The smaller the average residue, the stronger the coherence.
Objective: identify δ-clusters with residue smaller than a given threshold.

Slide 27: The Cheng-Church Algorithm
Find one bi-cluster, replace the data in it with random values, find the second bi-cluster, and so on.
The quality of later bi-clusters degrades (smaller volume, higher residue) because of the inserted random data.
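A hedged sketch of the masking loop described on this slide, with greedy single-node deletion (the core step of Cheng & Church's method) standing in for "find one bi-cluster"; it reuses mean_squared_residue() from the earlier sketch, and all names and thresholds are ours.

```python
import numpy as np

def find_one_bicluster(A, delta):
    """Greedy single-node deletion until the residue drops to delta."""
    rows, cols = list(range(A.shape[0])), list(range(A.shape[1]))
    while mean_squared_residue(A, rows, cols) > delta and len(rows) > 2 and len(cols) > 2:
        sub = A[np.ix_(rows, cols)]
        res = sub - sub.mean(axis=1, keepdims=True) - sub.mean(axis=0, keepdims=True) + sub.mean()
        row_score, col_score = (res ** 2).mean(axis=1), (res ** 2).mean(axis=0)
        # Drop whichever row or column contributes the largest mean residue.
        if row_score.max() >= col_score.max():
            rows.pop(int(row_score.argmax()))
        else:
            cols.pop(int(col_score.argmax()))
    return rows, cols

def mine_biclusters(A, delta, k, seed=0):
    """Find k bi-clusters, masking each one with random data as on the slide."""
    rng = np.random.default_rng(seed)
    A = np.asarray(A, dtype=float).copy()
    found = []
    for _ in range(k):
        rows, cols = find_one_bicluster(A, delta)
        found.append((rows, cols))
        # Masking is what degrades later bi-clusters (smaller volume, higher residue).
        A[np.ix_(rows, cols)] = rng.uniform(A.min(), A.max(), size=(len(rows), len(cols)))
    return found
```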

Slide 28: The FLOC Algorithm
1. Generate the initial clusters.
2. Determine the best action for each row and each column.
3. Perform the best action of each row and column sequentially.
4. If the clustering improved, return to step 2; otherwise stop.
(Yang et al., δ-Clusters: Capturing Subspace Correlation in a Large Data Set, ICDE'02)

Slide 29: The FLOC Algorithm: Actions
An action is the change of membership of a row (or column) with respect to a cluster. With M rows and N columns, M + N actions are performed at each iteration.
[Figure: an example grid with M = 4 rows and N = 3 columns.]

Slide 30: The FLOC Algorithm: Gain and Ordering
Gain of an action: the residue reduction obtained by performing the action.
Order of actions: fixed order, random order, or weighted random order.
Complexity: O((M+N)MNkp).
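As an illustration of the "gain" notion (ours, simplified from FLOC's own bookkeeping): the gain of toggling a row's membership in a cluster is the resulting reduction in mean-squared residue, reusing mean_squared_residue() from the earlier sketch.

```python
def row_action_gain(A, rows, cols, r):
    """Residue reduction from adding row r to (or removing it from) the cluster."""
    before = mean_squared_residue(A, rows, cols)
    new_rows = [i for i in rows if i != r] if r in rows else sorted(rows + [r])
    if len(new_rows) < 2:
        return float("-inf")        # disallow degenerate clusters
    return before - mean_squared_residue(A, new_rows, cols)
```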

Slide 31: The FLOC Algorithm: Additional Features
Maximum allowed overlap among clusters.
Minimum coverage of the clusters.
Minimum volume of each cluster.
These constraints can be enforced by "temporarily blocking" an action during the mining process if it would violate one of them.

Slide 32: Performance
Microarray data: 2884 genes, 17 conditions. The 100 bi-clusters with the smallest residue were returned.
Average residue: [value not preserved]; for comparison, the average residue of the clusters found by the state-of-the-art method in the computational biology field is [value not preserved]. The average volume is 25% bigger and the response time is an order of magnitude faster.

Slide 33: Concluding Remarks
The bi-cluster model is proposed to capture coherent objects in an incomplete data set, using the notions of base and residue.
Many additional features can be accommodated (nearly for free).

Slide 34: p-Clustering: Clustering by Pattern Similarity
Given objects x, y in O and attributes a, b in T, consider the 2 × 2 submatrix of their values; its pScore is
$pScore\!\left(\begin{bmatrix} d_{xa} & d_{xb} \\ d_{ya} & d_{yb} \end{bmatrix}\right) = \left|(d_{xa} - d_{xb}) - (d_{ya} - d_{yb})\right|$.
A pair (O, T) is a δ-pCluster if, for every 2 × 2 submatrix X in (O, T), pScore(X) ≤ δ for some δ > 0.
For scaling patterns, taking the logarithm of the raw values turns ratio-based coherence into exactly this pScore form.
(H. Wang et al., Clustering by pattern similarity in large data sets, SIGMOD'02.)
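The pScore test is simple enough to state in a few lines; the sketch below is our transcription of the definition above (names are ours).

```python
def pscore(d_xa, d_xb, d_ya, d_yb):
    """pScore of the 2x2 submatrix [[d_xa, d_xb], [d_ya, d_yb]]."""
    return abs((d_xa - d_xb) - (d_ya - d_yb))

def is_pcluster_pair(d_xa, d_xb, d_ya, d_yb, delta):
    """True if this 2x2 submatrix satisfies the delta-pCluster condition."""
    return pscore(d_xa, d_xb, d_ya, d_yb) <= delta
```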

Slide 35: Coherent Cluster
We want to accommodate noise, but not outliers.

Slide 36: Coherent Cluster: Pair-Wise Disparity
Coherent clusters versus subspace clustering.
For a 2 × 2 (sub)matrix consisting of objects {x, y} and attributes {a, b}, the pair-wise disparity compares the mutual bias of attribute a, $d_{xa} - d_{ya}$, with the mutual bias of attribute b, $d_{xb} - d_{yb}$.
[Figure: the entries $d_{xa}$, $d_{ya}$, $d_{xb}$, $d_{yb}$ plotted per attribute to show the two mutual biases.]

Slide 37: Coherent Cluster: Definition
A 2 × 2 (sub)matrix is a δ-coherent cluster if its disparity value D is less than or equal to δ.
An m × n matrix X is a δ-coherent cluster if every 2 × 2 submatrix of X is a δ-coherent cluster.
A δ-coherent cluster is a maximum δ-coherent cluster if it is not a submatrix of any other δ-coherent cluster.
Objective: given a data matrix and a threshold δ, find all maximum δ-coherent clusters.
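A brute-force check of this definition (ours; it tests every 2 × 2 submatrix, so it is only meant for small examples), using pscore() from the previous sketch as the 2 × 2 disparity D:

```python
from itertools import combinations

def is_delta_coherent(A, rows, cols, delta):
    """True if every 2x2 submatrix of the selected rows/columns has disparity <= delta."""
    for x, y in combinations(rows, 2):
        for a, b in combinations(cols, 2):
            if pscore(A[x][a], A[x][b], A[y][a], A[y][b]) > delta:
                return False
    return True
```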

Slide 38: Coherent Cluster: Challenges
Finding subspace clusters based on distance alone is already a difficult task because of the curse of dimensionality.
The (sub)set of objects and the (sub)set of attributes that form a cluster are unknown in advance and may not be adjacent to each other in the data matrix.
The actual values of the objects in a coherent cluster may be far apart from each other.
Each object or attribute in a coherent cluster may bear some relative bias (unknown in advance), and such bias may be local to the coherent cluster.

Slide 39: Coherent Cluster: Algorithm Outline
1. Compute the maximum coherent attribute sets for each pair of objects.
2. Construct the lexicographical tree.
3. Post-order traverse the tree to find the maximum coherent clusters.
Two-way pruning.

Slide 40: Coherent Cluster: Observation (object pairs)
Observation: given a pair of objects {o_1, o_2} and a (sub)set of attributes {a_1, a_2, ..., a_k}, the 2 × k submatrix is a δ-coherent cluster iff, for every attribute a_i, the mutual bias $d_{o_1 a_i} - d_{o_2 a_i}$ differs from that of every other attribute by at most δ.
[Figure: attributes a_1 through a_5 whose mutual biases for (o_1, o_2) fall in the range [2, 3.5]; if δ = 1.5, then {a_1, a_2, a_3, a_4, a_5} is a coherent attribute set (CAS) of (o_1, o_2).]
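The observation gives an easy algorithm for a single object pair: sort the attributes by mutual bias and report every maximal window whose bias range is at most δ. The sketch below is ours (names and the example values are illustrative).

```python
def maximal_cas(row1, row2, delta):
    """Maximum coherent attribute sets (MCAS) for the object pair (row1, row2)."""
    n = len(row1)
    bias = [row1[a] - row2[a] for a in range(n)]        # mutual bias per attribute
    order = sorted(range(n), key=lambda a: bias[a])     # attributes sorted by bias
    result, start = [], 0
    for end in range(n):
        while bias[order[end]] - bias[order[start]] > delta:
            start += 1
        # Keep the window only if it cannot be extended to the right.
        blocked = end == n - 1 or bias[order[end + 1]] - bias[order[start]] > delta
        if blocked and end - start + 1 >= 2:            # need at least two attributes
            result.append(sorted(order[start:end + 1]))
    return result

# Illustrative values only: biases spanning [2, 3.5] with delta = 1.5
# yield a single CAS containing all five attributes, as on the slide.
print(maximal_cas([2.0, 3.5, 2.5, 3.0, 2.2], [0, 0, 0, 0, 0], 1.5))   # [[0, 1, 2, 3, 4]]
```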

Slide 41: Coherent Cluster: Observation (object sets)
Observation: given a subset of objects {o_1, o_2, ..., o_l} and a subset of attributes {a_1, a_2, ..., a_k}, the l × k submatrix is a δ-coherent cluster iff {a_1, a_2, ..., a_k} is a coherent attribute set for every pair of objects (o_i, o_j), 1 ≤ i, j ≤ l.
[Figure: an example grid over objects o_1 through o_6 and attributes a_1 through a_7.]

Slide 42: Coherent Cluster: Finding MCAS
Strategy: find the maximum coherent attribute sets for each pair of objects with respect to the given threshold δ.
[Table: two rows r_1, r_2 over attributes a_1 through a_5 together with their mutual biases; the numeric values are not recoverable.]
The maximum coherent attribute sets define the search space for maximum coherent clusters.

Slide 43: Two-Way Pruning
[Table: a small example matrix over attributes a0, a1, a2; the visible rows are o0 = (1, 4, 2), o1 = (2, 5, 5), o2 = (3, 6, 5); the rows for o3 and o4 are not recoverable.]
With delta = 1, nc = 3, and nr = 3:
MCAS (per object pair): (o0,o2) → (a0,a1,a2); (o1,o2) → (a0,a1,a2)
MCOS (per attribute pair): (a0,a1) → (o0,o1,o2); (a0,a2) → (o1,o2,o3); (a1,a2) → (o1,o2,o4); (a1,a2) → (o0,o2,o4)
The two lists prune each other's search space.

Slide 44: Coherent Cluster: Grouping Pairs
Strategy: group object pairs by their CAS and, for each group, find the maximum clique(s).
Implementation: use a lexicographical tree to organize the object pairs and generate all maximum coherent clusters with a single post-order traversal of the tree.
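A small sketch (ours) of the grouping step, bucketing object pairs by each of their maximum coherent attribute sets using maximal_cas() from the earlier sketch; the clique search and the lexicographical-tree traversal are not shown.

```python
from collections import defaultdict
from itertools import combinations

def group_pairs_by_cas(data, delta):
    """Map each coherent attribute set to the object pairs that share it."""
    groups = defaultdict(list)
    for o1, o2 in combinations(range(len(data)), 2):
        for cas in maximal_cas(data[o1], data[o2], delta):
            groups[frozenset(cas)].append((o1, o2))
    return groups
```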

Slide 45: Example (assume δ = 1)
Maximum coherent attribute sets per object pair:
(o0,o1): {a0,a1}, {a2,a3}
(o0,o2): {a0,a1,a2,a3}
(o0,o4): {a1,a2}
(o1,o2): {a0,a1,a2}, {a2,a3}
(o1,o3): {a0,a2}
(o1,o4): {a1,a2}
(o2,o3): {a0,a2}
(o2,o4): {a1,a2}
Grouping the pairs by attribute set:
{a0,a1}: (o0,o1)
{a0,a2}: (o1,o3), (o2,o3)
{a1,a2}: (o0,o4), (o1,o4), (o2,o4)
{a2,a3}: (o0,o1), (o1,o2)
{a0,a1,a2}: (o1,o2)
{a0,a1,a2,a3}: (o0,o2)
[Figure: the corresponding lexicographical tree over attributes a0 through a3 with the object pairs attached at its nodes; the underlying 5 × 4 data matrix values are not recoverable.]

Slide 46: Coherent Cluster: Summary
High expressive power: coherent clusters can capture many interesting and meaningful patterns overlooked by previous clustering methods.
Efficient and highly scalable.
Wide applications: gene expression analysis, collaborative filtering.
[Figure: a subspace cluster versus a coherent cluster.]

Slide 47: Remarks: Comparison with Bi-Clustering
Can separate noise and outliers well.
No random data insertion and replacement.
Produces an optimal solution.

Slide 48: Definition of OP-Cluster
Let I be a subset of genes in the database and J a subset of conditions. We say (I, J) forms an Order Preserving Cluster (OP-Cluster) if, for any pair of conditions in J, one of the two order relationships (less-than-or-equal or greater-than-or-equal on expression level) holds for every gene in I; that is, all genes in I rank the conditions in J in the same order.
[Figure: expression levels of several genes over conditions A1, A2, A3, A4 that follow the same ordering.]

Slide 49: Problem Statement
Given a gene expression matrix, the goal is to find all statistically significant OP-Clusters. Significance is ensured by the minimum size thresholds n_c and n_r.

Slide 50: Conversion to a Sequence Mining Problem
[Figure: a gene's expression levels over conditions A1, A2, A3, A4 are converted into the sequence of condition labels ordered by increasing expression level.]
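The conversion itself is just a sort: each gene's row of expression levels becomes the sequence of condition labels ordered by increasing expression. The sketch and the example values below are ours.

```python
def row_to_sequence(row, labels=None):
    """Condition labels of one gene, ordered by increasing expression level."""
    labels = labels or list(range(len(row)))
    return [labels[i] for i in sorted(range(len(row)), key=lambda i: row[i])]

# Illustrative expression levels for conditions A1..A4 (values are made up):
print(row_to_sequence([0.8, 0.2, 1.5, 0.4], ["A1", "A2", "A3", "A4"]))
# -> ['A2', 'A4', 'A1', 'A3']
```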

Slide 51: Mining OP-Clusters: A Naive Approach
Enumerate all possible subsequences in a prefix tree; for each subsequence, collect all genes that contain it.
Challenge: the total number of distinct subsequences grows factorially with the number of conditions.
[Figure: a complete prefix tree over the 4 items {a, b, c, d}.]

Slide 52: Mining OP-Clusters: Prefix Tree
Goal: build a compact prefix tree that includes only the subsequences occurring in the input database.
Strategies:
1. Depth-first traversal.
2. Suffix concatenation: visit only subsequences that actually exist in the input sequences.
3. Apriori property: visit only subsequences that are sufficiently supported, in order to derive longer subsequences.
[Figure: the prefix tree built from the input sequences g1 = adbc, g2 = abdc, g3 = badc, with each node labelled by the ids of the supporting sequences.]

Slide 53: References
J. Yang, W. Wang, H. Wang, P. Yu, δ-Clusters: Capturing Subspace Correlation in a Large Data Set, Proceedings of the 18th IEEE International Conference on Data Engineering (ICDE), 2002.
H. Wang, W. Wang, J. Yang, P. Yu, Clustering by Pattern Similarity in Large Data Sets, Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD), 2002.
S. Yoon, C. Nardini, L. Benini, G. De Micheli, Enhanced pClustering and Its Applications to Gene Expression Data, IEEE Symposium on Bioinformatics and Bioengineering.
J. Liu and W. Wang, OP-Cluster: Clustering by Tendency in High Dimensional Space, ICDM'03.