The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL Bi-Clustering COMP 790-90 Seminar Spring 2011.

Presentation transcript:
Data Mining: Clustering
K-means clustering minimizes the within-cluster sum of squared distances: sum over clusters k of sum over points x in C_k of ||x - mu_k||^2, where mu_k is the centroid of cluster C_k.
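As a sketch of the quantity being minimized (the function name and toy data are mine, not from the slides), the K-means objective can be computed directly in NumPy:

```python
import numpy as np

def kmeans_objective(X, labels, centroids):
    """Within-cluster sum of squared distances that K-means minimizes."""
    return sum(np.sum((X[labels == k] - c) ** 2)
               for k, c in enumerate(centroids))

# Toy example: two well-separated 1-D clusters.
X = np.array([[0.0], [1.0], [10.0], [11.0]])
labels = np.array([0, 0, 1, 1])
centroids = np.array([[0.5], [10.5]])
print(kmeans_objective(X, labels, centroids))  # 1.0
```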

Clustering by Pattern Similarity (p-Clustering)
The raw micro-array data shows 3 genes and their values in a multi-dimensional space, drawn as a parallel-coordinates plot. It is difficult to find their patterns, which motivates "non-traditional" clustering.

Clusters Are Clear After Projection

Motivation
E-Commerce: collaborative filtering.
[Table: viewer-by-movie rating matrix, Viewers 1-5 by Movies 1-7]

Motivation

Motivation
[Table: the same viewer-by-movie rating matrix]

Motivation

Motivation
DNA microarray analysis.
[Table: expression matrix of genes (CTFC, VPS, EFB, SSA, FUN, SP, MDM, CYS, DEP, NTG) under conditions CH1I, CH1B, CH1D, CH2I, CH2B]

Motivation

Motivation
Strong coherence is exhibited by the selected objects on the selected attributes.
- They are not necessarily close to each other, but rather bear a constant shift.
- This is an object/attribute-biased bi-cluster.

Challenges
The set of objects and the set of attributes are usually unknown. Different objects/attributes may possess different biases, and such biases
- may be local to the set of selected objects/attributes
- are usually unknown in advance
The matrix may also have many unspecified entries.

Previous Work
Subspace clustering
- Identifies a set of objects and a set of attributes such that the objects are physically close to each other in the subspace formed by those attributes.
Collaborative filtering: Pearson R
- Only considers the global offset of each object/attribute.

bi-cluster
Consists of a (sub)set of objects and a (sub)set of attributes, corresponding to a submatrix.
- Occupancy threshold: each object/attribute has to be filled to a certain percentage.
- Volume: number of specified entries in the submatrix.
- Base: average value of each object/attribute (in the bi-cluster).

bi-cluster
[Table: the expression matrix with an object base (row average) column and an attribute base (column average) row annotated]

bi-cluster
Perfect δ-cluster vs. imperfect δ-cluster.
Residue of an entry d_ij (object set I, attribute set J): r_ij = d_ij - d_iJ - d_Ij + d_IJ, where d_iJ is the row mean, d_Ij the column mean, and d_IJ the mean over the whole bi-cluster.

bi-cluster
The smaller the average residue, the stronger the coherence.
Objective: identify δ-clusters whose residue is smaller than a given threshold δ.
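The (mean squared) residue can be computed directly from its definition; a minimal NumPy sketch (function name mine):

```python
import numpy as np

def mean_squared_residue(sub):
    """Mean squared residue of a bi-cluster: the average of r_ij^2,
    where r_ij = d_ij - d_iJ - d_Ij + d_IJ."""
    residues = (sub
                - sub.mean(axis=1, keepdims=True)   # d_iJ, row means
                - sub.mean(axis=0, keepdims=True)   # d_Ij, column means
                + sub.mean())                       # d_IJ, overall mean
    return float((residues ** 2).mean())

# A perfect shift pattern (row 2 = row 1 + 3) has zero residue.
perfect = np.array([[1.0, 3.0, 2.0],
                    [4.0, 6.0, 5.0]])
print(mean_squared_residue(perfect))  # 0.0
```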

Cheng-Church Algorithm
Find one bi-cluster; replace its data with random values; find the second bi-cluster; and so on.
The quality of later bi-clusters degrades (smaller volume, higher residue) because of the inserted random data.
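A simplified sketch of finding one δ-cluster by greedy deletion, a stripped-down version of Cheng and Church's node deletion (the full algorithm also re-adds rows/columns and masks each found cluster with random values before repeating; all names here are mine):

```python
import numpy as np

def msr(sub):
    """Mean squared residue of a submatrix."""
    res = sub - sub.mean(1, keepdims=True) - sub.mean(0, keepdims=True) + sub.mean()
    return float((res ** 2).mean())

def greedy_delta_cluster(X, delta):
    """Greedily delete the row or column contributing most to the residue
    until the mean squared residue is <= delta."""
    rows, cols = list(range(X.shape[0])), list(range(X.shape[1]))
    while len(rows) > 2 or len(cols) > 2:
        sub = X[np.ix_(rows, cols)]
        if msr(sub) <= delta:
            break
        res = sub - sub.mean(1, keepdims=True) - sub.mean(0, keepdims=True) + sub.mean()
        row_scores = (res ** 2).mean(axis=1)   # each row's average contribution
        col_scores = (res ** 2).mean(axis=0)
        if len(cols) <= 2 or (len(rows) > 2 and row_scores.max() >= col_scores.max()):
            rows.pop(int(row_scores.argmax()))
        else:
            cols.pop(int(col_scores.argmax()))
    return rows, cols

X = np.array([[1.0, 2.0, 3.0],
              [2.0, 3.0, 4.0],
              [3.0, 4.0, 5.0],
              [10.0, 0.0, 5.0]])   # last row breaks the shift pattern
rows, cols = greedy_delta_cluster(X, delta=0.5)
print(rows, cols)  # [0, 1, 2] [0, 1, 2]
```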

The FLOC algorithm
1. Generate initial clusters.
2. Determine the best action for each row and each column.
3. Perform the best action for each row and column sequentially.
4. If the clustering improved, go back to step 2; otherwise stop.

The FLOC algorithm
Action: the change of membership of a row (or column) with respect to a cluster. With M rows and N columns, M+N actions are performed at each iteration.

The FLOC algorithm
Gain of an action: the residue reduction incurred by performing the action.
Order of actions:
- Fixed order
- Random order
- Weighted random order
Complexity: O((M+N)MNkp).
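The gain of a row action can be sketched as the change in mean squared residue before and after toggling the row's membership (helper names are mine; the real FLOC also handles column actions and partially specified entries):

```python
import numpy as np

def msr(sub):
    """Mean squared residue of a submatrix."""
    res = sub - sub.mean(1, keepdims=True) - sub.mean(0, keepdims=True) + sub.mean()
    return float((res ** 2).mean())

def row_action_gain(X, rows, cols, row):
    """Gain of toggling `row`'s membership in the cluster (rows, cols):
    the residue reduction the action would bring (positive = improvement)."""
    before = msr(X[np.ix_(rows, cols)])
    new_rows = [r for r in rows if r != row] if row in rows else sorted(rows + [row])
    after = msr(X[np.ix_(new_rows, cols)])
    return before - after

X = np.array([[1.0, 2.0, 3.0],
              [2.0, 3.0, 4.0],
              [3.0, 4.0, 5.0],
              [10.0, 0.0, 5.0]])
# Removing the noisy row 3 reduces the residue; removing row 0 does not.
print(row_action_gain(X, [0, 1, 2, 3], [0, 1, 2], 3) > 0)  # True
```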

The FLOC algorithm
Additional features:
- Maximum allowed overlap among clusters
- Minimum coverage of clusters
- Minimum volume of each cluster
These can be enforced by "temporarily blocking" an action during the mining process if it would violate some constraint.

Performance
Microarray data: 2884 genes, 17 conditions.
- The 100 bi-clusters with the smallest residue were returned.
- The average residue was compared with that of clusters found via the state-of-the-art method in the computational biology field.
- The average volume is 25% bigger.
- The response time is an order of magnitude faster.

Concluding Remarks
The bi-cluster model is proposed to capture coherent objects in an incomplete data set, built on the notions of base and residue.
Many additional features can be accommodated, nearly for free.

Coherent Cluster
Want to accommodate noise, but not outliers.

Coherent Cluster
Coherent clusters generalize subspace clustering.
Pair-wise disparity: for a 2x2 (sub)matrix consisting of objects {x, y} and attributes {a, b},
D = |(d_xa - d_ya) - (d_xb - d_yb)|,
the difference between the mutual bias of attribute a (d_xa - d_ya) and the mutual bias of attribute b (d_xb - d_yb).

Coherent Cluster
- A 2x2 (sub)matrix is a δ-coherent cluster if its D value is less than or equal to δ.
- An m x n matrix X is a δ-coherent cluster if every 2x2 submatrix of X is a δ-coherent cluster.
- A δ-coherent cluster is a maximum δ-coherent cluster if it is not a submatrix of any other δ-coherent cluster.
- Objective: given a data matrix and a threshold δ, find all maximum δ-coherent clusters.
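The definition can be checked directly by brute force over all 2x2 submatrices, at O(m^2 n^2) cost (function names are mine):

```python
import numpy as np
from itertools import combinations

def pscore(sub2x2):
    """Disparity of a 2x2 submatrix: the difference between the two
    attributes' mutual biases."""
    (dxa, dxb), (dya, dyb) = sub2x2
    return abs((dxa - dya) - (dxb - dyb))

def is_delta_coherent(X, delta):
    """True iff every 2x2 submatrix of X has disparity <= delta."""
    m, n = X.shape
    return all(pscore(X[np.ix_([i, j], [a, b])]) <= delta
               for i, j in combinations(range(m), 2)
               for a, b in combinations(range(n), 2))

shift = np.array([[1.0, 3.0, 2.0],
                  [4.0, 6.0, 5.0]])   # row 2 = row 1 + 3
print(is_delta_coherent(shift, 0.0))  # True
```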

Coherent Cluster
Challenges:
- Finding a subspace clustering based on distance alone is already a difficult task due to the curse of dimensionality.
- The (sub)set of objects and the (sub)set of attributes that form a cluster are unknown in advance and may not be adjacent to each other in the data matrix.
- The actual values of the objects in a coherent cluster may be far apart from each other.
- Each object or attribute in a coherent cluster may bear some relative bias (unknown in advance), and such bias may be local to the coherent cluster.

Coherent Cluster
- Compute the maximum coherent attribute sets for each pair of objects.
- Construct the lexicographical tree.
- Post-order traverse the tree to find maximum coherent clusters.
- Two-way pruning.

Coherent Cluster
Observation: given a pair of objects {o1, o2} and a (sub)set of attributes {a1, a2, ..., ak}, the 2 x k submatrix is a δ-coherent cluster iff, for every attribute ai, the mutual bias (d_o1ai - d_o2ai) does not differ from the others by more than δ.
Example: if the mutual biases over {a1, a2, a3, a4, a5} all fall in the range [2, 3.5] and δ = 1.5, then {a1, a2, a3, a4, a5} is a coherent attribute set (CAS) of (o1, o2).
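The maximal CAS of an object pair can be found by sorting the mutual biases and sliding a window of span δ over them; a sketch under that formulation (not necessarily the paper's exact procedure):

```python
def max_coherent_attribute_sets(row1, row2, delta):
    """Maximal coherent attribute sets (CAS) of an object pair: maximal sets
    of attribute indices whose mutual biases row1[i] - row2[i] span at most
    delta, found with a sliding window over the sorted biases."""
    order = sorted(range(len(row1)), key=lambda i: row1[i] - row2[i])
    bias = lambda i: row1[i] - row2[i]
    result, start = [], 0
    for end in range(len(order)):
        while bias(order[end]) - bias(order[start]) > delta:
            start += 1
        # emit the window only if it cannot be extended to the right
        if end == len(order) - 1 or bias(order[end + 1]) - bias(order[start]) > delta:
            result.append(sorted(order[start:end + 1]))
    return result

# Mutual biases 3, 2, 3.5, 2, 2.5 span 1.5, so with delta = 1.5
# every attribute belongs to a single CAS.
o1 = [3.0, 2.0, 3.5, 2.0, 2.5]
o2 = [0.0, 0.0, 0.0, 0.0, 0.0]
print(max_coherent_attribute_sets(o1, o2, 1.5))  # [[0, 1, 2, 3, 4]]
```

With a tighter delta = 1.0 the same pair yields two overlapping maximal sets, which is why each object pair can contribute several CAS to the search space.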

Coherent Cluster
Observation: given a subset of objects {o1, o2, ..., ol} and a subset of attributes {a1, a2, ..., ak}, the l x k submatrix is a δ-coherent cluster iff {a1, a2, ..., ak} is a coherent attribute set for every pair of objects (oi, oj), where 1 <= i, j <= l.

Coherent Cluster
Strategy: find the maximum coherent attribute sets for each pair of objects with respect to the given threshold δ.
The maximum coherent attribute sets define the search space for maximum coherent clusters.
[Figure: the maximum coherent attribute sets of a pair of rows r1, r2 over attributes a1-a5]

Two-Way Pruning
Example (δ = 1, nc = 3, nr = 3): a 5x3 matrix with objects o0-o4 and attributes a0-a2.
MCAS: (o0,o2) → (a0,a1,a2); (o1,o2) → (a0,a1,a2)
MCOS: (a0,a1) → (o0,o1,o2); (a0,a2) → (o1,o2,o3); (a1,a2) → (o1,o2,o4); (a1,a2) → (o0,o2,o4)

Coherent Cluster
Strategy: group object pairs by their CAS and, for each group, find the maximum clique(s).
Implementation: use a lexicographical tree to organize the object pairs and generate all maximum coherent clusters with a single post-order traversal of the tree.

Example (assume δ = 1). CAS of each object pair:
(o0,o1): {a0,a1}, {a2,a3}
(o0,o2): {a0,a1,a2,a3}
(o0,o4): {a1,a2}
(o1,o2): {a0,a1,a2}, {a2,a3}
(o1,o3): {a0,a2}
(o1,o4): {a1,a2}
(o2,o3): {a0,a2}
(o2,o4): {a1,a2}
Grouped by attribute set:
{a0,a1}: (o0,o1)
{a0,a2}: (o1,o3), (o2,o3)
{a1,a2}: (o0,o4), (o1,o4), (o2,o4)
{a2,a3}: (o0,o1), (o1,o2)
{a0,a1,a2}: (o1,o2)
{a0,a1,a2,a3}: (o0,o2)
[Figure: the lexicographical tree built over these attribute sets, with the object pairs attached to its nodes]

A post-order traversal of the tree yields the maximum coherent clusters:
{o0,o2} x {a0,a1,a2,a3}
{o1,o2} x {a0,a1,a2}
{o0,o1,o2} x {a0,a1}
{o1,o2,o3} x {a0,a2}
{o0,o2,o4} x {a1,a2}
{o1,o2,o4} x {a1,a2}
{o0,o1,o2} x {a2,a3}

Coherent Cluster
- High expressive power: coherent clusters can capture many interesting and meaningful patterns overlooked by previous clustering methods.
- Efficient and highly scalable.
- Wide applications: gene expression analysis, collaborative filtering.
[Figure: relationship between subspace clusters and coherent clusters]

Remark
Compared with the bi-cluster model, the coherent-cluster approach:
- can well separate noise and outliers
- needs no random data insertion and replacement
- produces an optimal solution