The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL Bi-Clustering COMP Seminar Spring 2008.

The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL Bi-Clustering COMP 790-90 Seminar Spring 2008

The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL 2 Coherent Cluster Want to accommodate noises but not outliers

The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL 3 Coherent Cluster Coherent cluster  Subspace clustering pair-wise disparity  For a 2  2 (sub)matrix consisting of objects {x, y} and attributes {a, b} x y ab d xa d ya d xb d yb x y ab z attribute mutual bias of attribute a mutual bias of attribute b

The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL 4 Coherent Cluster  A 2  2 (sub)matrix is a  -coherent cluster if its D value is less than or equal to .  An m  n matrix X is a  -coherent cluster if every 2  2 submatrix of X is  -coherent cluster.  A  -coherent cluster is a maximum  -coherent cluster if it is not a submatrix of any other  -coherent cluster.  Objective: given a data matrix and a threshold , find all maximum  -coherent clusters.

The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL 5 Coherent Cluster Challenges:  Finding subspace clustering based on distance itself is already a difficult task due to the curse of dimensionality.  The (sub)set of objects and the (sub)set of attributes that form a cluster are unknown in advance and may not be adjacent to each other in the data matrix.  The actual values of the objects in a coherent cluster may be far apart from each other.  Each object or attribute in a coherent cluster may bear some relative bias (that are unknown in advance) and such bias may be local to the coherent cluster.

The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL 6 Coherent Cluster Compute the maximum coherent attribute sets for each pair of objects Construct the lexicographical tree Post-order traverse the tree to find maximum coherent clusters Two-way Pruning

The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL 7 Coherent Cluster Observation: Given a pair of objects {o 1, o 2 } and a (sub)set of attributes {a 1, a 2, …, a k }, the 2  k submatrix is a  -coherent cluster iff, for every attribute a i, the mutual bias (d o1ai – d o2ai ) does not differ from each other by more than . a1a1 a2a2 a3a3 a4a4 a5a5 1 3 5 7 323.522.5 o1o1 o2o2  [2, 3.5] If  = 1.5, then {a 1,a 2,a 3,a 4,a 5 } is a coherent attribute set (CAS) of (o 1,o 2 ).

The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL 8 Coherent Cluster Observation: given a subset of objects {o 1, o 2, …, o l } and a subset of attributes {a 1, a 2, …, a k }, the l  k submatrix is a  -coherent cluster iff {a 1, a 2, …, a k } is a coherent attribute set for every pair of objects (o i,o j ) where 1  i, j  l. a1a1 a5a5 a6a6 a7a7 a2a2 a3a3 a4a4 o1o1 o3o3 o4o4 o5o5 o6o6 o2o2

The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL 9 a1a1 a2a2 a3a3 a4a4 a5a5 1 3 5 7 323.522.5 r1r1 r2r2 Coherent Cluster Strategy: find the maximum coherent attribute sets for each pair of objects with respect to the given threshold .  = 1 3 5 7 r1r1 r2r2 a2a2 2 a3a3 3.5 a4a4 2 a5a5 2.5 a1a1 3 1 The maximum coherent attribute sets define the search space for maximum coherent clusters.

The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL 10 Two Way Pruning a0a1a2 o0142 o1255 o2365 o342007 o430076 (o0,o2) →(a0,a1,a2) (o1,o2) →(a0,a1,a2) (a0,a1) →(o0,o1,o2) (a0,a2) →(o1,o2,o3) (a1,a2) →(o1,o2,o4) (a1,a2) →(o0,o2,o4) (o0,o2) →(a0,a1,a2) (o1,o2) →(a0,a1,a2) (a0,a1) →(o0,o1,o2) (a0,a2) →(o1,o2,o3) (a1,a2) →(o1,o2,o4) (a1,a2) →(o0,o2,o4) delta=1 nc =3 nr = 3 MCAS MCOS

The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL 11 Coherent Cluster Strategy: grouping object pairs by their CAS and, for each group, find the maximum clique(s). Implementation: using a lexicographical tree to organize the object pairs and to generate all maximum coherent clusters with a single post-order traversal of the tree. objects attributes

The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL 12 (o 0,o 1 ) : {a 0,a 1 }, {a 2,a 3 } (o 0,o 2 ) : {a 0,a 1,a 2,a 3 } (o 0,o 4 ) : {a 1,a 2 } (o 1,o 2 ) : {a 0,a 1,a 2 }, {a 2,a 3 } (o 1,o 3 ) : {a 0,a 2 } (o 1,o 4 ) : {a 1,a 2 } (o 2,o 3 ) : {a 0,a 2 } (o 2,o 4 ) : {a 1,a 2 } a0a0 a1a1 a2a2 a3a3 o0o0 1425 o1o1 2558 o2o2 3657 o3o3 42072 o4o4 30766 a0a0 a1a1 a2a2 a2a2 a3a3 a1a1 a2a2 a2a2 a3a3 (o0,o1)(o0,o1) (o1,o2)(o1,o2) (o0,o2)(o0,o2) (o1,o3)(o1,o3) (o2,o3)(o2,o3) (o0,o4)(o0,o4) (o1,o4)(o1,o4) (o2,o4)(o2,o4) (o0,o1)(o0,o1) (o1,o2)(o1,o2) assume  = 1 {a 0,a 1 } : (o 0,o 1 ) {a 0,a 2 } : (o 1,o 3 ),(o 2,o 3 ) {a 1,a 2 } : (o 0,o 4 ),(o 1,o 4 ),(o 2,o 4 ) {a 2,a 3 } : (o 0,o 1 ),(o 1,o 2 ) {a 0,a 1,a 2 } : (o 1,o 2 ) {a 0,a 1,a 2,a 3 } : (o 0,o 2 ) (o1,o2)(o1,o2) (o1,o2)(o1,o2) (o1,o2)(o1,o2) (o0,o2)(o0,o2) (o0,o2)(o0,o2) (o0,o2)(o0,o2) (o0,o2)(o0,o2) (o0,o2)(o0,o2)

The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL 13 a0a0 a1a1 a2a2 a2a2 a3a3 a1a1 a2a2 a2a2 a3a3 (o0,o1)(o0,o1) (o1,o2)(o1,o2) (o0,o2)(o0,o2) (o1,o3)(o1,o3) (o2,o3)(o2,o3) (o0,o4)(o0,o4) (o1,o4)(o1,o4) (o2,o4)(o2,o4) (o0,o1)(o0,o1) (o1,o2)(o1,o2) (o0,o2)(o0,o2) (o0,o2)(o0,o2) a3a3 (o0,o2)(o0,o2) a3a3 (o0,o2)(o0,o2) a3a3 {o 0,o 2 }  {a 0,a 1,a 2,a 3 } (o1,o2)(o1,o2) (o0,o2)(o0,o2)(o1,o2)(o1,o2) (o0,o2)(o0,o2)(o1,o2)(o1,o2) (o0,o2)(o0,o2) {o 1,o 2 }  {a 0,a 1,a 2 } {o 0,o 1,o 2 }  {a 0,a 1 } a3a3 (o0,o2)(o0,o2) a3a3 (o0,o2)(o0,o2) (o0,o2)(o0,o2) {o 1,o 2,o 3 }  {a 0,a 2 } {o 0,o 2,o 4 }  {a 1,a 2 } {o 1,o 2,o 4 }  {a 1,a 2 } {o 0,o 1,o 2 }  {a 2,a 3 }

The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL 14 Coherent Cluster High expressive power  The coherent cluster can capture many interesting and meaningful patterns overlooked by previous clustering methods. Efficient and highly scalable Wide applications  Gene expression analysis  Collaborative filtering subspace cluster coherent cluster

The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL 15 Remark Comparing to Bicluster  Can well separate noises and outliers  No random data insertion and replacement  Produce optimal solution

The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL 16 Definition of OP-Cluster Let I be a subset of genes in the database. Let J be a subset of conditions. We say forms an Order Preserving Cluster (OP-Cluster), if one of the following relationships exists for any pair of conditions. A 1 A 2 A 3 A 4 Expression Levels when

The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL 17 Problem Statement Given a gene expression matrix, our goal is to find all the statistically significant OP-Clusters. The significance is ensured by the minimal size threshold n c and n r.

The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL 18 Conversion to Sequence Mining Problem A 1 A 2 A 3 A 4 Expression Levels Sequence:

The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL 19 Ming OP-Clusters: A naïve approach A naïve approach  Enumerate all possible subsequences in a prefix tree.  For each subsequences, collect all genes that contain the subsequences. Challenge:  The total number of distinct subsequences are abcd bcd cdbdbc dcbdbc acd cdad… dcad… … A Complete Prefix Tree with 4 items {a,b,c,d} root a b d

The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL 20 Mining OP-Clusters: Prefix Tree Goal: Build a compact prefix tree that includes all sub-sequences only occurring in the original database. Strategies: 1.Depth-First Traversal 2.Suffix concatenation: Visit subsequences that only exist in the input sequences. 3.Apriori Property: Visit subsequences that are sufficiently supported in order to derive longer subsequences. g1g1 adbc g2g2 abdc g3g3 badc a:1,2 d:1b:2 d:2b:1c:1,3 b:3 Root c:1c:2 a:3 d:3 c:3 a:3 d:3 c:3 a:1,2 d:1,3 a:1,2,3 d:1,3d:1,2,3 c:1,2,3 d:2 c:2 a:1,2,3 d:1,2,3

The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL 21 References J. Young, W. Wang, H. Wang, P. Yu, Delta-cluster: capturing subspace correlation in a large data set, Proceedings of the 18th IEEE International Conference on Data Engineering (ICDE), pp. 517-528, 2002. H. Wang, W. Wang, J. Young, P. Yu, Clustering by pattern similarity in large data sets, to appear in Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD), 2002. Y. Sungroh, C. Nardini, L. Benini, G. De Micheli, Enhanced pClustering and its applications to gene expression data Bioinformatics and Bioengineering, 2004. J. Liu and W. Wang, OP-Cluster: clustering by tendency in high dimensional space, ICDM’03.

The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL Bi-Clustering COMP Seminar Spring 2008.

Similar presentations

Presentation on theme: "The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL Bi-Clustering COMP Seminar Spring 2008."— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL Bi-Clustering COMP Seminar Spring 2008.

Similar presentations

Presentation on theme: "The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL Bi-Clustering COMP Seminar Spring 2008."— Presentation transcript:

Similar presentations

About project

Feedback