The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL Bi-Clustering COMP Seminar Spring 2008.


Slide 2: Coherent Cluster
We want to accommodate noise, but not outliers.

Slide 3: Coherent Cluster
Coherent cluster: subspace clustering based on pair-wise disparity.
For a 2 x 2 (sub)matrix consisting of objects {x, y} and attributes {a, b}, with entries d_xa, d_xb, d_ya, d_yb:
- mutual bias of attribute a: d_xa - d_ya
- mutual bias of attribute b: d_xb - d_yb
- pair-wise disparity: D = |(d_xa - d_ya) - (d_xb - d_yb)|
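The disparity measure can be sketched in a few lines of Python (the function name `pscore` is my own label; the slides just call the quantity D):

```python
def pscore(d_xa, d_xb, d_ya, d_yb):
    """Pair-wise disparity of a 2x2 (sub)matrix over objects {x, y}
    and attributes {a, b}: how far the mutual bias of attribute a
    (d_xa - d_ya) drifts from the mutual bias of attribute b."""
    return abs((d_xa - d_ya) - (d_xb - d_yb))

# A perfectly coherent 2x2 block: both rows shift by the same offset.
print(pscore(1, 4, 2, 5))  # each mutual bias is -1, so disparity is 0
```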

Slide 4: Coherent Cluster
- A 2 x 2 (sub)matrix is a δ-coherent cluster if its D value is less than or equal to δ.
- An m x n matrix X is a δ-coherent cluster if every 2 x 2 submatrix of X is a δ-coherent cluster.
- A δ-coherent cluster is a maximal δ-coherent cluster if it is not a submatrix of any other δ-coherent cluster.
- Objective: given a data matrix and a threshold δ, find all maximal δ-coherent clusters.
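A direct, brute-force check of this definition (illustrative only; the name `is_delta_coherent` and the list-of-rows representation are my choices, and the real algorithm avoids this quadratic enumeration):

```python
from itertools import combinations

def is_delta_coherent(X, delta):
    """True iff every 2x2 submatrix of X has pair-wise disparity <= delta.
    X is a list of equal-length rows (objects x attributes)."""
    for x, y in combinations(range(len(X)), 2):
        for a, b in combinations(range(len(X[0])), 2):
            if abs((X[x][a] - X[y][a]) - (X[x][b] - X[y][b])) > delta:
                return False
    return True

# Rows (1,4,2) and (3,6,5): mutual biases -2, -2, -3, so every
# 2x2 disparity is at most 1 and the matrix is 1-coherent.
print(is_delta_coherent([[1, 4, 2], [3, 6, 5]], 1))  # True
```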

Slide 5: Coherent Cluster
Challenges:
- Finding subspace clusters based on distance alone is already a difficult task due to the curse of dimensionality.
- The (sub)set of objects and the (sub)set of attributes that form a cluster are unknown in advance and may not be adjacent to each other in the data matrix.
- The actual values of the objects in a coherent cluster may be far apart from each other.
- Each object or attribute in a coherent cluster may carry some relative bias (unknown in advance), and that bias may be local to the coherent cluster.

Slide 6: Coherent Cluster
1. Compute the maximal coherent attribute sets for each pair of objects.
2. Construct the lexicographical tree.
3. Post-order traverse the tree to find maximal coherent clusters.
Two-way pruning is applied throughout.

Slide 7: Coherent Cluster
Observation: given a pair of objects {o1, o2} and a (sub)set of attributes {a1, a2, ..., ak}, the 2 x k submatrix is a δ-coherent cluster iff, for every pair of attributes, the mutual biases (d_o1ai - d_o2ai) differ from each other by no more than δ.
Example: if the mutual biases over {a1, a2, a3, a4, a5} all fall in the range [2, 3.5] and δ = 1.5, then {a1, a2, a3, a4, a5} is a coherent attribute set (CAS) of (o1, o2).
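This observation reduces the 2 x k check to a spread test on the mutual biases. A sketch (the two rows below are made-up numbers whose biases fall in [2, 3.5], mirroring the slide's example):

```python
def is_pair_coherent(x, y, attrs, delta):
    """A 2xk submatrix over objects (x, y) and the chosen attributes is
    delta-coherent iff the max and min mutual bias differ by <= delta."""
    biases = [x[a] - y[a] for a in attrs]
    return max(biases) - min(biases) <= delta

# Mutual biases here are [2, 2, 3.5, 2, 2.5] -- a spread of 1.5.
x = [5, 4, 6.5, 4, 5.5]
y = [3, 2, 3.0, 2, 3.0]
print(is_pair_coherent(x, y, range(5), 1.5))  # True
print(is_pair_coherent(x, y, range(5), 1.0))  # False
```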

Slide 8: Coherent Cluster
Observation: given a subset of objects {o1, o2, ..., ol} and a subset of attributes {a1, a2, ..., ak}, the l x k submatrix is a δ-coherent cluster iff {a1, a2, ..., ak} is a coherent attribute set for every pair of objects (oi, oj), where 1 ≤ i, j ≤ l.

Slide 9: Coherent Cluster
Strategy: find the maximal coherent attribute sets for each pair of objects with respect to the given threshold δ.
[Figure: two rows r1, r2 over attributes a1..a5; their mutual biases determine the maximal coherent attribute sets.]
The maximal coherent attribute sets define the search space for maximal coherent clusters.
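This per-pair step can be sketched by sorting the attributes by mutual bias and sliding a window of spread ≤ δ; every window that cannot be extended on either side is a maximal CAS. The function name, the `min_size` handling, and the example rows are my own choices:

```python
def mcas(x, y, delta, min_size=2):
    """Maximal coherent attribute sets for the object pair (x, y):
    sort attribute indices by mutual bias x[i] - y[i]; every window
    of spread <= delta that cannot be extended left or right is maximal."""
    n = len(x)
    order = sorted(range(n), key=lambda i: x[i] - y[i])
    bias = [x[i] - y[i] for i in order]
    result = []
    for s in range(n):
        e = s
        while e + 1 < n and bias[e + 1] - bias[s] <= delta:
            e += 1
        extendable_left = s > 0 and bias[e] - bias[s - 1] <= delta
        if not extendable_left and e - s + 1 >= min_size:
            result.append(sorted(order[s:e + 1]))
    return result

# Biases of (1,4,2) vs (3,6,5) are -2, -2, -3: one window of spread 1.
print(mcas([1, 4, 2], [3, 6, 5], 1))  # [[0, 1, 2]]
```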

Slide 10: Two-Way Pruning
Example (δ = 1, nc = 3, nr = 3):

      a0  a1  a2
  o0   1   4   2
  o1   2   5   5
  o2   3   6   5
  o3   ...
  o4   ...

MCAS (maximal coherent attribute sets):
  (o0,o2) → (a0,a1,a2)
  (o1,o2) → (a0,a1,a2)
MCOS (maximal coherent object sets):
  (a0,a1) → (o0,o1,o2)
  (a0,a2) → (o1,o2,o3)
  (a1,a2) → (o1,o2,o4)
  (a1,a2) → (o0,o2,o4)

Slide 11: Coherent Cluster
Strategy: group object pairs by their CAS and, for each group, find the maximal clique(s) of objects.
Implementation: use a lexicographical tree to organize the object pairs and generate all maximal coherent clusters with a single post-order traversal of the tree.
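The grouping step is a dictionary inversion; a minimal sketch (the clique search within each group, which the tree traversal handles, is omitted, and the sample data uses two object pairs from the example on the next slide):

```python
from collections import defaultdict

def group_pairs_by_cas(cas_per_pair):
    """Invert {object pair: [attribute sets]} into
    {attribute set: [object pairs]}. Object subsets whose pairs are all
    present in a group form coherent clusters over that attribute set."""
    groups = defaultdict(list)
    for pair, attr_sets in cas_per_pair.items():
        for s in attr_sets:
            groups[tuple(sorted(s))].append(pair)
    return dict(groups)

# (o0,o1) and (o1,o2) both support the CAS {a2, a3}.
cas = {(0, 1): [{0, 1}, {2, 3}], (1, 2): [{0, 1, 2}, {2, 3}]}
print(group_pairs_by_cas(cas)[(2, 3)])  # [(0, 1), (1, 2)]
```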

Slide 12: Example (assume δ = 1)
Maximal coherent attribute sets per object pair:
  (o0,o1): {a0,a1}, {a2,a3}
  (o0,o2): {a0,a1,a2,a3}
  (o0,o4): {a1,a2}
  (o1,o2): {a0,a1,a2}, {a2,a3}
  (o1,o3): {a0,a2}
  (o1,o4): {a1,a2}
  (o2,o3): {a0,a2}
  (o2,o4): {a1,a2}
Grouping the object pairs by CAS:
  {a0,a1}: (o0,o1)
  {a0,a2}: (o1,o3), (o2,o3)
  {a1,a2}: (o0,o4), (o1,o4), (o2,o4)
  {a2,a3}: (o0,o1), (o1,o2)
  {a0,a1,a2}: (o1,o2)
  {a0,a1,a2,a3}: (o0,o2)
[Figure: the 5 x 4 data matrix and the lexicographical tree over attributes a0..a3, with each object pair attached to its CAS node.]

Slide 13: Post-Order Traversal
Maximal coherent clusters emitted during the post-order traversal:
  {o0,o2} x {a0,a1,a2,a3}
  {o1,o2} x {a0,a1,a2}
  {o0,o1,o2} x {a0,a1}
  {o1,o2,o3} x {a0,a2}
  {o0,o2,o4} x {a1,a2}
  {o1,o2,o4} x {a1,a2}
  {o0,o1,o2} x {a2,a3}
[Figure: the lexicographical tree from the previous slide, annotated with the clusters found at each node.]

Slide 14: Coherent Cluster
- High expressive power: the coherent cluster can capture many interesting and meaningful patterns overlooked by previous clustering methods.
- Efficient and highly scalable.
- Wide applications: gene expression analysis, collaborative filtering.
[Figure: a subspace cluster contrasted with a coherent cluster.]

Slide 15: Remark
Compared to the Bicluster model:
- Separates noise and outliers well
- No random data insertion or replacement
- Produces the optimal solution

Slide 16: Definition of OP-Cluster
Let I be a subset of genes in the database and let J be a subset of conditions. We say (I, J) forms an Order Preserving Cluster (OP-Cluster) if, for any pair of conditions Ai and Aj in J, one of the following relationships holds for every gene in I: its expression level under Ai is no greater than under Aj, or its expression level under Ai is no less than under Aj. Equivalently, every gene in I induces the same ordering of the conditions in J.
[Figure: expression levels of several genes under conditions A1, A2, A3, A4, all rising in the same order.]
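The definition translates directly into a membership test; a minimal sketch, assuming the expression matrix is a list of gene rows (the function name is mine):

```python
from itertools import combinations

def is_op_cluster(matrix, genes, conds):
    """True iff every gene in `genes` orders the conditions in `conds`
    the same way: for each pair of conditions, all genes agree on the
    direction of the inequality between the two expression levels."""
    for a, b in combinations(conds, 2):
        if not (all(matrix[g][a] <= matrix[g][b] for g in genes) or
                all(matrix[g][a] >= matrix[g][b] for g in genes)):
            return False
    return True

# Genes 0 and 1 both rise across conditions 0 < 1 < 2; gene 2 falls.
m = [[1, 2, 3], [2, 4, 9], [5, 1, 0]]
print(is_op_cluster(m, [0, 1], [0, 1, 2]))  # True
print(is_op_cluster(m, [0, 2], [0, 1, 2]))  # False
```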

Slide 17: Problem Statement
Given a gene expression matrix, our goal is to find all the statistically significant OP-Clusters. Significance is ensured by the minimal size thresholds nc (conditions) and nr (genes).

Slide 18: Conversion to a Sequence Mining Problem
For each gene, sort the conditions by expression level; the gene's row becomes a sequence of condition labels. Finding OP-Clusters then amounts to finding subsequences shared by sufficiently many genes.
[Figure: expression levels of a gene under A1, A2, A3, A4 and the resulting condition sequence.]
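The conversion itself is a one-liner; a sketch (the helper name and the sample values are mine):

```python
def to_sequence(row, labels):
    """Sort condition labels by the gene's expression level, ascending;
    a row of measurements becomes a sequence of condition labels."""
    return [label for _, label in sorted(zip(row, labels), key=lambda p: p[0])]

# Expression 0.4, 0.1, 0.9, 0.3 under A1..A4 -> sequence A2 A4 A1 A3.
print(to_sequence([0.4, 0.1, 0.9, 0.3], ["A1", "A2", "A3", "A4"]))
```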

Slide 19: Mining OP-Clusters: A Naive Approach
A naive approach:
- Enumerate all possible subsequences in a prefix tree.
- For each subsequence, collect all genes that contain that subsequence.
Challenge: the total number of distinct subsequences is prohibitively large.
[Figure: a complete prefix tree over 4 items {a, b, c, d}.]
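To see why the naive enumeration blows up: the complete prefix tree over n items has one node per ordered arrangement of every non-empty subset, i.e. sum over k of n!/(n-k)! nodes. A quick count (using the stdlib's `math.perm`, Python 3.8+):

```python
from math import perm

def num_subsequences(n):
    """Number of nodes in the complete prefix tree over n items:
    the count of all ordered arrangements of every non-empty subset."""
    return sum(perm(n, k) for k in range(1, n + 1))

# For the 4-item tree on the slide: 4 + 12 + 24 + 24 = 64 nodes.
print(num_subsequences(4))   # 64
print(num_subsequences(20))  # already astronomically large
```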

Slide 20: Mining OP-Clusters: Prefix Tree
Goal: build a compact prefix tree that includes only the subsequences occurring in the original database.
Strategies:
1. Depth-first traversal.
2. Suffix concatenation: visit only subsequences that exist in the input sequences.
3. Apriori property: visit only subsequences that are sufficiently supported, in order to derive longer subsequences.
[Figure: the compact prefix tree built from the input sequences g1 = adbc, g2 = abdc, g3 = badc.]

Slide 21: References
- J. Yang, W. Wang, H. Wang, P. Yu. Delta-clusters: capturing subspace correlation in a large data set. Proceedings of the 18th IEEE International Conference on Data Engineering (ICDE), 2002.
- H. Wang, W. Wang, J. Yang, P. Yu. Clustering by pattern similarity in large data sets. Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD), 2002.
- S. Yoon, C. Nardini, L. Benini, G. De Micheli. Enhanced pClustering and its applications to gene expression data. IEEE Symposium on Bioinformatics and Bioengineering (BIBE), 2004.
- J. Liu, W. Wang. OP-Cluster: clustering by tendency in high dimensional space. ICDM 2003.