The UNIVERSITY of Kansas EECS 800 Research Seminar Mining Biological Data Instructor: Luke Huan Fall, 2006.


Mining Biological Data KU EECS 800, Luke Huan, Fall’06 slide2 10/04/2006 Model-based Clustering
Model-Based Clustering
- What is model-based clustering?
  - It attempts to optimize the fit between the given data and some mathematical model.
  - It assumes that the data are generated by a mixture of underlying probability distributions.
- Typical methods
  - Statistical approach: EM (Expectation Maximization), AutoClass
  - Machine learning approach: COBWEB, CLASSIT
  - Neural network approach: SOM (Self-Organizing Feature Map)

Slide 3: EM (Expectation Maximization)
- EM is a popular iterative refinement algorithm and can be seen as an extension of k-means:
  - Each object is assigned to a cluster according to a weight (a probability distribution) rather than a hard label.
  - New means are computed from the weighted measures.
- General idea
  - Start with an initial estimate of the parameter vector.
  - Iteratively rescore the patterns against the mixture density produced by the parameter vector.
  - Use the rescored patterns to update the parameter estimates.
  - Patterns belong to the same cluster if their scores place them in the same mixture component.
- The algorithm converges fast but may stop at a local optimum rather than the global one.
- Related system: AutoClass (Cheeseman and Stutz, 1996)

Slide 4: 1D Gaussian Mixture Model
- Given a set of data distributed in a 1D space, how do we cluster the data set?
- General idea: factorize the p.d.f. into a mixture of simple models.
  - Discrete values: Bernoulli distribution
  - Continuous values: Gaussian distribution

Slide 5: The EM (Expectation Maximization) Algorithm
- Initially, randomly assign k cluster centers.
- Iteratively refine the clusters in two steps:
  - Expectation step: assign each data point X_i to each cluster C_k with a membership probability [equation not preserved].
  - Maximization step: re-estimate the model parameters [equation not preserved].
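The two steps can be sketched for a 1D Gaussian mixture as follows. This is a minimal illustration, not the formulas from the original slide (which were not preserved): the function names `gaussian_pdf` and `em_1d_gmm`, the unit starting variance, the uniform starting weights, and the variance floor are all assumptions made for the example. Initial means are passed in explicitly so the run is reproducible; a random choice of k data points would match the slide's description.

```python
import math

def gaussian_pdf(x, mean, var):
    """Density of a 1D Gaussian with the given mean and variance."""
    return math.exp(-((x - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def em_1d_gmm(data, init_means, iterations=50):
    """Fit a 1D Gaussian mixture with EM; returns (weights, means, variances)."""
    k = len(init_means)
    means = list(init_means)
    variances = [1.0] * k        # assumed starting variance
    weights = [1.0 / k] * k      # assumed uniform starting weights
    for _ in range(iterations):
        # Expectation step: soft-assign each point to every component.
        resp = []
        for x in data:
            scores = [w * gaussian_pdf(x, m, v)
                      for w, m, v in zip(weights, means, variances)]
            total = sum(scores)
            resp.append([s / total for s in scores])
        # Maximization step: re-estimate parameters from the weighted points.
        for j in range(k):
            n_j = sum(r[j] for r in resp)
            weights[j] = n_j / len(data)
            means[j] = sum(r[j] * x for r, x in zip(resp, data)) / n_j
            var = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, data)) / n_j
            variances[j] = max(var, 1e-6)   # floor keeps a component from collapsing
    return weights, means, variances

points = [0.8, 1.0, 1.2, 4.8, 5.0, 5.2]
weights, means, variances = em_1d_gmm(points, init_means=[0.0, 6.0])
```

With hard 0/1 responsibilities and fixed equal variances, the same loop reduces to k-means, which is the sense in which EM extends it.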

Slide 6: Another Way of K-means?
- Pros:
  - AutoClass can adapt to different (convex) cluster shapes, whereas k-means assumes spherical clusters.
  - Solid statistical foundation.
- Cons: computationally expensive.

Slide 7: Model Based Subspace Clustering
- Microarray
- Bi-clustering
- δ-clustering
- p-clustering
- OP-clustering

Slide 8: MicroArray Dataset
[Figure: microarray dataset]

Slide 9: Gene Expression Matrix
[Figure: gene expression matrices, genes by conditions; example condition types: time points, cancer tissues]

Slide 10: Data Mining: Clustering
- K-means clustering minimizes an objective function [equation not preserved].
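The objective on this slide was lost in the transcript. A common textbook form, assumed here, is the within-cluster sum of squared Euclidean distances to the cluster means. The helper name `kmeans_sse` is hypothetical, and the sketch assumes every cluster label in 0..k-1 has at least one member.

```python
def kmeans_sse(points, labels, k):
    """Sum of squared distances from each point to the mean of its cluster."""
    dim = len(points[0])
    centroids = []
    for c in range(k):
        members = [p for p, lab in zip(points, labels) if lab == c]
        # Assumes no cluster is empty.
        centroids.append([sum(p[d] for p in members) / len(members)
                          for d in range(dim)])
    return sum((p[d] - centroids[lab][d]) ** 2
               for p, lab in zip(points, labels)
               for d in range(dim))

pts = [(0, 0), (0, 2), (10, 0), (10, 2)]
good = kmeans_sse(pts, [0, 0, 1, 1], k=2)   # left pair vs right pair
bad = kmeans_sse(pts, [0, 1, 0, 1], k=2)    # bottom pair vs top pair
```

Here the left/right split gives the lower objective value, which is the assignment k-means would prefer.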

Slide 11: Clustering by Pattern Similarity (p-Clustering)
- The microarray "raw" data show 3 genes and their values in a multi-dimensional space.
- Parallel coordinates plots make it difficult to find their patterns.
- This calls for "non-traditional" clustering.

Slide 12: Clusters Are Clear After Projection
[Figure: the same genes after projection onto selected attributes]

Slide 13: Why p-Clustering?
- Microarray data analysis may need:
  - Clustering on thousands of dimensions (attributes)
  - Discovery of both shift and scaling patterns
- Clustering with a Euclidean distance measure? It cannot find shift patterns.
- Clustering on derived attributes A_ij = a_i − a_j? It introduces N(N−1) dimensions.
- Bi-cluster: uses the mean squared residue score of a submatrix (I, J),
  $H(I, J) = \frac{1}{|I||J|} \sum_{i \in I, j \in J} (d_{ij} - d_{iJ} - d_{Ij} + d_{IJ})^2$,
  where $d_{iJ}$ is the mean of row i over J, $d_{Ij}$ is the mean of column j over I, and $d_{IJ}$ is the mean of the whole submatrix.
  - A submatrix is a δ-cluster if H(I, J) ≤ δ for some δ > 0.
- Problems with bi-cluster:
  - No downward closure property.
  - Due to averaging, a bi-cluster may contain outliers and still stay within the δ threshold.

Slide 14: Motivation
DNA microarray analysis

         CH1I   CH1B   CH1D   CH2I   CH2B
CTFC3    4392    284   4108    280    228
VPS8      401    281    120    275    298
EFB1      318    280     37    277    215
SSA1      401    292    109    580    238
FUN14    2857    285   2576    271    226
SP07      228    290     48    285    224
MDM10     538    272    266    277    236
CYS3      322    288     41    278    219
DEP1      312    272     40    273    232
NTG1      329    296     33    274    228

Slide 15: Motivation
[Figure]

Slide 16: Motivation
- Strong coherence is exhibited by the selected objects on the selected attributes.
- The objects are not necessarily close to each other but rather bear a constant shift.
- Key notions: object/attribute bias, bi-cluster.

Slide 17: Challenges
- The set of objects and the set of attributes are usually unknown.
- Different objects/attributes may possess different biases; such biases may be local to the set of selected objects/attributes and are usually unknown in advance.
- The data may have many unspecified entries.

Slide 18: Previous Work
- Subspace clustering: identifies a set of objects and a set of attributes such that the objects are physically close to each other in the subspace formed by the attributes.
- Collaborative filtering (Pearson R): only considers the global offset of each object/attribute.

Slide 19: bi-cluster Terms
- A bi-cluster consists of a (sub)set of objects and a (sub)set of attributes, and corresponds to a submatrix.
- Occupancy threshold α: each object/attribute has to be filled by a certain percentage.
- Volume: number of specified entries in the submatrix.
- Base: average value of each object/attribute (in the bi-cluster).
Source: Biclustering of Expression Data, Cheng & Church, ISMB’00

Slide 20: bi-cluster
The bi-cluster selects objects VPS8, EFB1, CYS3 and attributes CH1I, CH1D, CH2B from the microarray table.

            CH1I   CH1D   CH2B   Obj base
VPS8         401    120    298        273
EFB1         318     37    215        190
CYS3         322     41    219        194
Attr base    347     66    244        219
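The bases in this table can be recomputed directly from the selected entries. The sketch below assumes a fully specified submatrix (no missing entries); the function name `bases` is hypothetical.

```python
objects = ["VPS8", "EFB1", "CYS3"]
attributes = ["CH1I", "CH1D", "CH2B"]
submatrix = [
    [401, 120, 298],   # VPS8
    [318, 37, 215],    # EFB1
    [322, 41, 219],    # CYS3
]

def bases(matrix):
    """Return (object bases, attribute bases, bi-cluster base) as plain averages."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    obj_base = [sum(row) / n_cols for row in matrix]
    attr_base = [sum(row[j] for row in matrix) / n_rows for j in range(n_cols)]
    cluster_base = sum(sum(row) for row in matrix) / (n_rows * n_cols)
    return obj_base, attr_base, cluster_base

obj_base, attr_base, cluster_base = bases(submatrix)
```

The results match the "Obj base" column, the "Attr base" row, and the corner value 219 of the table.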

Slide 21
[Figure: expression profiles, 40 genes over 17 conditions]

Slide 22: Motivation
[Figure]

Slide 23
[Figure: expression profiles, 40 genes over 17 conditions]

Slide 24: Motivation
[Figure: co-regulated genes]

Slide 25: bi-cluster
- Perfect δ-cluster
- Imperfect δ-cluster
- Residue of an entry: $r_{ij} = d_{ij} - d_{iJ} - d_{Ij} + d_{IJ}$, where $d_{iJ}$ is the object base, $d_{Ij}$ is the attribute base, and $d_{IJ}$ is the base of the whole bi-cluster.

Slide 26: bi-cluster
- The smaller the average residue, the stronger the coherence.
- Objective: identify δ-clusters with residue smaller than a given threshold.
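A small sketch of the average residue follows. It assumes the entry residue is d_ij minus the object base, minus the attribute base, plus the bi-cluster base, and that "average residue" means the mean of absolute residues; Cheng and Church's score uses the mean of squared residues instead. The function names are hypothetical and the matrices are assumed to be fully specified.

```python
def residue_matrix(d):
    """Residue of every entry: d_ij - object base - attribute base + cluster base."""
    n_rows, n_cols = len(d), len(d[0])
    row_base = [sum(row) / n_cols for row in d]
    col_base = [sum(row[j] for row in d) / n_rows for j in range(n_cols)]
    base = sum(row_base) / n_rows
    return [[d[i][j] - row_base[i] - col_base[j] + base for j in range(n_cols)]
            for i in range(n_rows)]

def average_residue(d):
    """Mean absolute residue (assumed aggregation; squared is another option)."""
    r = residue_matrix(d)
    return sum(abs(v) for row in r for v in row) / (len(d) * len(d[0]))

# Rows differ only by a constant shift, so every residue is zero.
perfect = [[401, 120, 298], [318, 37, 215], [322, 41, 219]]
# One entry breaks the shift pattern, so the residue is positive.
imperfect = [[1, 2], [3, 5]]
```

A perfect δ-cluster therefore has average residue 0, and the threshold δ bounds how far an imperfect one may deviate.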

Slide 27: Cheng-Church Algorithm
- Find one bi-cluster.
- Replace the data in the first bi-cluster with random data.
- Find the second bi-cluster, and so on.
- The quality of the bi-clusters degrades (smaller volume, higher residue) because of the inserted random data.

Slide 28: The FLOC algorithm
1. Generate initial clusters.
2. Determine the best action for each row and each column.
3. Perform the best action of each row and column sequentially.
4. If the clustering improved, go back to step 2; otherwise stop.
Source: Yang et al., delta-Clusters: Capturing Subspace Correlation in a Large Data Set, ICDE’02

Slide 29: The FLOC algorithm
- Action: the change of membership of a row (or column) with respect to a cluster.
- M+N actions are performed at each iteration.
[Figure: example data matrix with rows and columns labeled, N = 3 and M = 4]

Slide 30: The FLOC algorithm
- Gain of an action: the residue reduction incurred by performing the action.
- Order of actions:
  - Fixed order
  - Random order
  - Weighted random order
- Complexity: O((M+N)MNkp)
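The gain of a row action can be illustrated as below. This is a simplified sketch under stated assumptions, not the FLOC implementation: gain is taken as the plain difference in mean absolute residue before and after toggling a row's membership, an empty cluster is given residue 0, and the helper names are hypothetical. Volume, overlap, and coverage constraints are ignored.

```python
def mean_abs_residue(data, rows, cols):
    """Mean absolute residue of the submatrix picked by the rows and cols sets."""
    rows, cols = sorted(rows), sorted(cols)
    if not rows or not cols:
        return 0.0   # assumed convention for an empty cluster
    row_base = {i: sum(data[i][j] for j in cols) / len(cols) for i in rows}
    col_base = {j: sum(data[i][j] for i in rows) / len(rows) for j in cols}
    base = sum(row_base.values()) / len(rows)
    total = sum(abs(data[i][j] - row_base[i] - col_base[j] + base)
                for i in rows for j in cols)
    return total / (len(rows) * len(cols))

def row_action_gain(data, rows, cols, row):
    """Residue reduction obtained by toggling the membership of one row."""
    toggled = set(rows) ^ {row}   # remove the row if present, add it otherwise
    return mean_abs_residue(data, rows, cols) - mean_abs_residue(data, toggled, cols)

def best_row_action(data, rows, cols):
    """Row whose membership change yields the largest gain."""
    return max(range(len(data)), key=lambda r: row_action_gain(data, rows, cols, r))

data = [[1, 2, 3], [11, 12, 13], [6, 7, 8], [9, 0, 4]]   # last row breaks the shift pattern
cluster_rows, cluster_cols = {0, 1, 2, 3}, {0, 1, 2}
```

On this toy matrix the best row action is to drop row 3, after which the remaining three rows form a perfect cluster.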

Slide 31: The FLOC algorithm
- Additional features:
  - Maximum allowed overlap among clusters
  - Minimum coverage of clusters
  - Minimum volume of each cluster
- These can be enforced by "temporarily blocking" an action during the mining process if the action would violate some constraint.

Slide 32: Performance
- Microarray data: 2884 genes, 17 conditions.
- The 100 bi-clusters with the smallest residue were returned.
  - Average residue = 10.34
  - The average residue of clusters found by the state-of-the-art method in computational biology is 12.54.
- The average volume is 25% bigger.
- The response time is an order of magnitude faster.

Slide 33: Concluding Remarks
- The bi-cluster model is proposed to capture coherent objects in an incomplete data set.
  - It is built on the notions of base and residue.
- Many additional features can be accommodated (nearly for free).

Slide 34: p-Clustering: Clustering by Pattern Similarity
- Given objects x, y in O and features a, b in T, the pScore is defined on the 2 × 2 matrix of their values:
  $pScore\left(\begin{bmatrix} d_{xa} & d_{xb} \\ d_{ya} & d_{yb} \end{bmatrix}\right) = |(d_{xa} - d_{xb}) - (d_{ya} - d_{yb})|$
- A pair (O, T) is a δ-pCluster if, for any 2 × 2 submatrix X in (O, T), pScore(X) ≤ δ for some δ > 0.
- For scaling patterns, taking the logarithm of the ratio condition [equation not preserved] leads to the pScore form.
Source: H. Wang, et al., Clustering by pattern similarity in large data sets, SIGMOD’02.
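A brute-force check of the δ-pCluster condition can be sketched as follows. The pScore form used here, |(d_xa − d_xb) − (d_ya − d_yb)|, is the usual one from the cited paper and is assumed because the slide's formula was not preserved; the function names are hypothetical. The second half shows the log transform that turns a scaling pattern into a shift pattern.

```python
import math
from itertools import combinations

def pscore(d_xa, d_xb, d_ya, d_yb):
    """pScore of the 2 x 2 matrix [[d_xa, d_xb], [d_ya, d_yb]]."""
    return abs((d_xa - d_xb) - (d_ya - d_yb))

def is_delta_pcluster(data, objects, attrs, delta):
    """True if every 2 x 2 submatrix of (objects, attrs) has pScore <= delta."""
    return all(
        pscore(data[x][a], data[x][b], data[y][a], data[y][b]) <= delta
        for x, y in combinations(objects, 2)
        for a, b in combinations(attrs, 2)
    )

shift = [[1, 4, 2], [3, 6, 4]]        # second row = first row + 2
scaling = [[1, 2, 4], [3, 6, 12]]     # second row = first row * 3
log_scaling = [[math.log(v) for v in row] for row in scaling]

shift_ok = is_delta_pcluster(shift, [0, 1], [0, 1, 2], delta=0)
scaling_raw_ok = is_delta_pcluster(scaling, [0, 1], [0, 1, 2], delta=0.5)
scaling_log_ok = is_delta_pcluster(log_scaling, [0, 1], [0, 1, 2], delta=1e-9)
```

The check enumerates all object pairs and attribute pairs, so it is only meant to make the definition concrete, not to be an efficient miner.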

Slide 35: Coherent Cluster
- We want to accommodate noise but not outliers.

Slide 36: Coherent Cluster
- Coherent cluster vs. subspace clustering
- Pair-wise disparity: for a 2 × 2 (sub)matrix consisting of objects {x, y} and attributes {a, b} with entries d_xa, d_xb, d_ya, d_yb,
  $D = |(d_{xa} - d_{ya}) - (d_{xb} - d_{yb})|$
  where $d_{xa} - d_{ya}$ is the mutual bias of attribute a and $d_{xb} - d_{yb}$ is the mutual bias of attribute b.
[Figure: objects x, y (and z) plotted over attributes a and b]

Slide 37: Coherent Cluster
- A 2 × 2 (sub)matrix is a δ-coherent cluster if its D value is less than or equal to δ.
- An m × n matrix X is a δ-coherent cluster if every 2 × 2 submatrix of X is a δ-coherent cluster.
- A δ-coherent cluster is a maximum δ-coherent cluster if it is not a submatrix of any other δ-coherent cluster.
- Objective: given a data matrix and a threshold δ, find all maximum δ-coherent clusters.

Slide 38: Coherent Cluster
Challenges:
- Finding subspace clusters based on distance is already a difficult task because of the curse of dimensionality.
- The (sub)set of objects and the (sub)set of attributes that form a cluster are unknown in advance and may not be adjacent to each other in the data matrix.
- The actual values of the objects in a coherent cluster may be far apart from each other.
- Each object or attribute in a coherent cluster may bear some relative bias that is unknown in advance, and such a bias may be local to the coherent cluster.

Slide 39: Coherent Cluster
- Compute the maximum coherent attribute sets for each pair of objects.
- Construct the lexicographical tree.
- Traverse the tree in post-order to find the maximum coherent clusters.
- Two-way pruning

Slide 40: Coherent Cluster
- Observation: given a pair of objects {o1, o2} and a (sub)set of attributes {a1, a2, …, ak}, the 2 × k submatrix is a δ-coherent cluster iff the mutual biases (d_o1ai − d_o2ai) of the attributes a_i do not differ from each other by more than δ.
- Example: the mutual biases of (o1, o2) are

    attribute     a1   a2   a3    a4   a5
    mutual bias    3    2   3.5    2   2.5

  All biases lie in [2, 3.5]. If δ = 1.5, then {a1, a2, a3, a4, a5} is a coherent attribute set (CAS) of (o1, o2).

Slide 41: Coherent Cluster
- Observation: given a subset of objects {o1, o2, …, ol} and a subset of attributes {a1, a2, …, ak}, the l × k submatrix is a δ-coherent cluster iff {a1, a2, …, ak} is a coherent attribute set for every pair of objects (oi, oj) where 1 ≤ i, j ≤ l.
[Figure: objects o1 to o6 and attributes a1 to a7]

Slide 42: Coherent Cluster
- Strategy: find the maximum coherent attribute sets for each pair of objects with respect to the given threshold δ.
- Example with δ = 1: the mutual biases of (r1, r2) are a1 = 3, a2 = 2, a3 = 3.5, a4 = 2, a5 = 2.5. Sorted by bias:

    attribute     a2   a4   a5    a1   a3
    mutual bias    2    2   2.5    3   3.5

- The maximum coherent attribute sets define the search space for maximum coherent clusters.
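One way to obtain the maximum coherent attribute sets of an object pair is to sort the attributes by mutual bias and slide a window of width δ over the sorted list. This sliding-window formulation, the function name, and the minimum set size of 2 are assumptions made for illustration; the slide itself only shows the sorted biases.

```python
def maximal_coherent_attribute_sets(bias, delta, min_size=2):
    """bias maps attribute name -> mutual bias of one object pair.

    Returns the maximal groups of attributes whose biases differ by at
    most delta, each as a sorted list of attribute names.
    """
    items = sorted(bias.items(), key=lambda kv: kv[1])
    result = []
    end = 0        # exclusive right edge of the current window
    prev_end = 0   # right edge of the previous window
    for start in range(len(items)):
        while end < len(items) and items[end][1] - items[start][1] <= delta:
            end += 1
        # The window is maximal only if it reaches past the previous one.
        if end > prev_end and end - start >= min_size:
            result.append(sorted(name for name, _ in items[start:end]))
        prev_end = end
    return result

bias = {"a1": 3, "a2": 2, "a3": 3.5, "a4": 2, "a5": 2.5}
sets_delta_1 = maximal_coherent_attribute_sets(bias, delta=1)
sets_delta_1_5 = maximal_coherent_attribute_sets(bias, delta=1.5)
```

With δ = 1 the biases split into two overlapping sets, {a1, a2, a4, a5} and {a1, a3, a5}; with δ = 1.5 all five attributes form a single set.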

Slide 43: Two-Way Pruning
Data matrix (delta = 1, nc = 3, nr = 3):

       a0    a1   a2
o0      1     4    2
o1      2     5    5
o2      3     6    5
o3      4   200    7
o4    300     7    6

MCAS (maximum coherent attribute sets, per object pair):
  (o0,o2) → (a0,a1,a2)
  (o1,o2) → (a0,a1,a2)

MCOS (maximum coherent object sets, per attribute pair):
  (a0,a1) → (o0,o1,o2)
  (a0,a2) → (o1,o2,o3)
  (a1,a2) → (o1,o2,o4)
  (a1,a2) → (o0,o2,o4)

Slide 44: Coherent Cluster
- Strategy: group object pairs by their CAS and, for each group, find the maximum clique(s).
- Implementation: use a lexicographical tree to organize the object pairs and generate all maximum coherent clusters with a single post-order traversal of the tree.
[Figure: lexicographical tree with objects and attributes]

Slide 45
Data matrix (assume δ = 1):

      a0   a1   a2   a3
o0     1    4    2    5
o1     2    5    5    8
o2     3    6    5    7
o3     4   20    7    2
o4    30    7    6    6

Maximum coherent attribute sets per object pair:
  (o0,o1) : {a0,a1}, {a2,a3}
  (o0,o2) : {a0,a1,a2,a3}
  (o0,o4) : {a1,a2}
  (o1,o2) : {a0,a1,a2}, {a2,a3}
  (o1,o3) : {a0,a2}
  (o1,o4) : {a1,a2}
  (o2,o3) : {a0,a2}
  (o2,o4) : {a1,a2}

Object pairs grouped by coherent attribute set:
  {a0,a1} : (o0,o1)
  {a0,a2} : (o1,o3), (o2,o3)
  {a1,a2} : (o0,o4), (o1,o4), (o2,o4)
  {a2,a3} : (o0,o1), (o1,o2)
  {a0,a1,a2} : (o1,o2)
  {a0,a1,a2,a3} : (o0,o2)

[Figure: lexicographical tree over attributes a0 to a3 with the object pairs attached to its nodes]
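The grouping of object pairs by coherent attribute set in this example can be reproduced with a short script. Two things are assumptions: the reading of the flattened matrix rows for o3 and o4 as (4, 20, 7, 2) and (30, 7, 6, 6), and the sliding-window helper `maximal_windows`, which is an illustrative way to compute the sets rather than the method from the slides. The tree construction and clique search are not shown.

```python
from itertools import combinations

def maximal_windows(bias, delta, min_size=2):
    """Maximal attribute groups whose mutual biases differ by at most delta."""
    items = sorted(bias.items(), key=lambda kv: kv[1])
    windows, end, prev_end = [], 0, 0
    for start in range(len(items)):
        while end < len(items) and items[end][1] - items[start][1] <= delta:
            end += 1
        if end > prev_end and end - start >= min_size:
            windows.append(tuple(sorted(name for name, _ in items[start:end])))
        prev_end = end
    return windows

def group_pairs_by_cas(data, attrs, delta):
    """Map each coherent attribute set to the object pairs that share it."""
    groups = {}
    for x, y in combinations(sorted(data), 2):
        bias = {a: data[x][j] - data[y][j] for j, a in enumerate(attrs)}
        for cas in maximal_windows(bias, delta):
            groups.setdefault(cas, []).append((x, y))
    return groups

attrs = ["a0", "a1", "a2", "a3"]
data = {
    "o0": [1, 4, 2, 5],
    "o1": [2, 5, 5, 8],
    "o2": [3, 6, 5, 7],
    "o3": [4, 20, 7, 2],   # assumed parse of the flattened row
    "o4": [30, 7, 6, 6],   # assumed parse of the flattened row
}
groups = group_pairs_by_cas(data, attrs, delta=1)
```

Each key of `groups` corresponds to one node of the lexicographical tree, and the pairs stored under it are the edges in which the maximum cliques are searched.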

Slide 46: Coherent Cluster
- High expressive power: the coherent cluster can capture many interesting and meaningful patterns overlooked by previous clustering methods.
- Efficient and highly scalable
- Wide applications:
  - Gene expression analysis
  - Collaborative filtering
[Figure: subspace cluster vs. coherent cluster]

Slide 47: Remark
Compared to bi-cluster, the coherent cluster model:
- Can separate noise from outliers well
- Needs no random data insertion or replacement
- Produces an optimal solution

Slide 48: Definition of OP-Cluster
- Let I be a subset of genes in the database and let J be a subset of conditions.
- We say (I, J) forms an Order Preserving Cluster (OP-Cluster) if one of the following relationships exists for any pair of conditions [equations not preserved].
[Figure: expression levels across conditions A1, A2, A3, A4]

Slide 49: Problem Statement
- Given a gene expression matrix, our goal is to find all the statistically significant OP-Clusters.
- The significance is ensured by the minimal size thresholds n_c and n_r.

Slide 50: Conversion to Sequence Mining Problem
[Figure: expression levels of a gene across conditions A1, A2, A3, A4, and the resulting sequence of conditions (sequence not preserved)]
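The conversion can be sketched as sorting a gene's conditions by expression level, so that each gene becomes a sequence of condition labels. The function name `to_sequence` and the sample values are hypothetical; ties between nearly equal expression levels are simply broken by position here, which is a simplification.

```python
def to_sequence(expression, conditions):
    """Order the condition labels by increasing expression level."""
    order = sorted(range(len(conditions)), key=lambda j: expression[j])
    return [conditions[j] for j in order]

conditions = ["A1", "A2", "A3", "A4"]
gene_expression = [3.0, 1.0, 4.0, 2.0]   # made-up levels for one gene
sequence = to_sequence(gene_expression, conditions)
```

Genes that rise and fall across the same conditions in the same order then share a common subsequence, which is what the sequence mining step looks for.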

Slide 51: Mining OP-Clusters: A Naïve Approach
- Enumerate all possible subsequences in a prefix tree.
- For each subsequence, collect all genes that contain the subsequence.
- Challenge: the total number of distinct subsequences is very large [equation not preserved].
[Figure: a complete prefix tree with 4 items {a, b, c, d}; the root has children a, b, c, d, and paths include abcd, abdc, acbd, acdb, …]
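A direct rendering of the naïve approach is sketched below: enumerate every ordered selection of conditions and keep those contained as a subsequence in at least n_r genes. The function names are hypothetical, and the thresholds n_c (minimum number of conditions) and n_r (minimum number of genes) follow the problem statement. The three example sequences are the ones used on the prefix-tree slide.

```python
from itertools import permutations

def is_subsequence(pattern, sequence):
    """True if pattern occurs in sequence in order (not necessarily contiguously)."""
    remaining = iter(sequence)
    return all(item in remaining for item in pattern)

def naive_op_clusters(sequences, n_c, n_r):
    """Enumerate every candidate subsequence and collect its supporting genes."""
    items = sorted(set().union(*sequences.values()))
    found = {}
    for length in range(n_c, len(items) + 1):
        for pattern in permutations(items, length):
            genes = [g for g, seq in sequences.items() if is_subsequence(pattern, seq)]
            if len(genes) >= n_r:
                found["".join(pattern)] = genes
    return found

sequences = {"g1": "adbc", "g2": "abdc", "g3": "badc"}
clusters = naive_op_clusters(sequences, n_c=3, n_r=3)

# Number of candidates for 4 items: all ordered selections of length 1 to 4.
candidate_count = sum(1 for k in range(1, 5) for _ in permutations("abcd", k))
```

Even with only 4 items there are 64 candidate subsequences, and the count grows factorially with the number of conditions, which motivates the compact prefix tree.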

Slide 52: Mining OP-Clusters: Prefix Tree
- Goal: build a compact prefix tree that includes only the subsequences occurring in the original database.
- Strategies:
  1. Depth-first traversal.
  2. Suffix concatenation: visit only subsequences that exist in the input sequences.
  3. Apriori property: visit only subsequences that are sufficiently supported, in order to derive longer subsequences.
- Example input sequences: g1 = adbc, g2 = abdc, g3 = badc
[Figure: prefix tree built from g1, g2, g3; each node is labeled item:supporting genes, e.g. a:1,2,3, d:1,2,3, c:1,2,3]

Slide 53: References
- J. Yang, W. Wang, H. Wang, P. Yu, Delta-cluster: capturing subspace correlation in a large data set, Proceedings of the 18th IEEE International Conference on Data Engineering (ICDE), pp. 517-528, 2002.
- H. Wang, W. Wang, J. Yang, P. Yu, Clustering by pattern similarity in large data sets, to appear in Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD), 2002.
- S. Yoon, C. Nardini, L. Benini, G. De Micheli, Enhanced pClustering and its applications to gene expression data, Bioinformatics and Bioengineering, 2004.
- J. Liu and W. Wang, OP-Cluster: clustering by tendency in high dimensional space, ICDM’03.
