Subspace Clustering/Biclustering


Subspace Clustering/Biclustering CS 685: Special Topics in Data Mining Spring 2008 Jinze Liu

Data Mining: Clustering
K-means clustering minimizes the within-cluster sum of squared distances,
$\sum_{k=1}^{K} \sum_{x_i \in C_k} \| x_i - \mu_k \|^2$,
where $\mu_k$ is the centroid of cluster $C_k$.
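As a point of reference before the biclustering material, here is a minimal sketch (not from the slides; names such as kmeans_sse are my own) of evaluating that objective on a toy dataset with given assignments and centroids:

```python
import numpy as np

def kmeans_sse(X, labels, centroids):
    """Within-cluster sum of squared distances that K-means minimizes."""
    return sum(np.sum((X[labels == k] - centroids[k]) ** 2)
               for k in range(len(centroids)))

X = np.array([[1.0, 2.0], [1.2, 1.8], [8.0, 9.0], [8.2, 9.1]])
labels = np.array([0, 0, 1, 1])
centroids = np.array([X[labels == k].mean(axis=0) for k in range(2)])
print(kmeans_sse(X, labels, centroids))
```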

Clustering by Pattern Similarity (p-Clustering)
The micro-array "raw" data shows 3 genes and their values in a multi-dimensional space.
In parallel coordinates plots it is difficult to find their patterns, which motivates "non-traditional" clustering.

Clusters Are Clear After Projection

Motivation
E-Commerce: collaborative filtering.
[Figure: a sparse Viewer x Movie ratings matrix (Viewers 1-5, Movies 1-7) with many unspecified entries.]

Motivation

Motivation
[Figure: the Viewer x Movie ratings matrix again (Viewers 1-5, Movies 1-7), with many entries unspecified.]

Motivation

Gene Expression Data

Biclustering of Gene Expression Data
Genes are not regulated under all conditions.
Genes are regulated by multiple factors/processes concurrently.
Biclusters are key to determining the function of genes and the classification of conditions.

Motivation
DNA microarray analysis. Expression values for ten genes under conditions CH1I, CH1B, CH1D, CH2I, CH2B (rows with fewer than five values have unspecified entries):
CTFC3: 4392, 284, 4108, 280, 228
VPS8: 401, 281, 120, 275, 298
EFB1: 318, 37, 277, 215
SSA1: 292, 109, 580, 238
FUN14: 2857, 285, 2576, 271, 226
SP07: 290, 48, 224
MDM10: 538, 272, 266, 236
CYS3: 322, 288, 41, 278, 219
DEP1: 312, 40, 273, 232
NTG1: 329, 296, 33, 274

Motivation

Motivation
The selected objects exhibit strong coherence on the selected attributes: they are not necessarily close to each other, but rather bear a constant shift.
Object/attribute bias → bi-cluster.

Challenges
The set of objects and the set of attributes that form a bi-cluster are usually unknown in advance.
Different objects/attributes may possess different biases, and such biases may be local to the selected set of objects/attributes.
The data may have many unspecified entries.

What's Biclustering?
Given an n × m matrix A, find a set of submatrices B_k such that the contents of each B_k follow a desired pattern.
Row/column order need not be consistent between different B_k's.
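A tiny illustrative sketch (my own toy data and index sets, not from the slides): a bicluster is just a pair of row and column index sets, and its contents are the corresponding submatrix of A.

```python
import numpy as np

A = np.arange(30, dtype=float).reshape(5, 6)   # toy n x m matrix
rows, cols = [0, 3, 4], [1, 2, 5]              # hypothetical index sets for one B_k
B = A[np.ix_(rows, cols)]                      # the submatrix defined by this bicluster
print(B.shape)                                  # (3, 3); rows/cols need not be contiguous
```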

Bipartite Graphs
The matrix can be thought of as a graph: rows are one set of vertices L, columns are another set R, and edges are weighted by the corresponding entries in the matrix.
If all weights are binary, biclustering becomes biclique finding.
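A minimal sketch of that view, using plain dictionaries rather than a graph library (an assumption for illustration only):

```python
import numpy as np

A = np.array([[1, 0, 1],
              [0, 1, 1]])
# Rows form vertex set L ("r0", "r1"), columns form vertex set R ("c0".."c2");
# entry A[i, j] is the weight of edge (ri, cj).
edges = {(f"r{i}", f"c{j}"): int(A[i, j])
         for i in range(A.shape[0])
         for j in range(A.shape[1])
         if A[i, j] != 0}          # with 0/1 weights, a bicluster is a biclique
print(edges)
```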

Bicluster Structures

Previous Work
Subspace clustering: identifying a set of objects and a set of attributes such that the objects are physically close to each other in the subspace formed by those attributes.
Collaborative filtering (Pearson R): only considers a global offset of each object/attribute.

bi-cluster
Consists of a (sub)set of objects and a (sub)set of attributes, and corresponds to a submatrix.
Occupancy threshold α: each object/attribute has to be filled to a certain percentage.
Volume: number of specified entries in the submatrix.
Base: average value of each object/attribute (within the bi-cluster).
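A short sketch of these definitions, representing unspecified entries as NaN (my own encoding, not prescribed by the slides):

```python
import numpy as np

sub = np.array([[401., 120., 298.],
                [318.,  37., 215.],
                [322., np.nan, 219.]])           # candidate bi-cluster with one missing entry

volume = np.count_nonzero(~np.isnan(sub))        # number of specified entries
row_occupancy = np.mean(~np.isnan(sub), axis=1)  # filled fraction per object
col_occupancy = np.mean(~np.isnan(sub), axis=0)  # filled fraction per attribute
obj_base = np.nanmean(sub, axis=1)               # base (average) of each object
attr_base = np.nanmean(sub, axis=0)              # base of each attribute
print(volume, obj_base, attr_base)
```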

bi-cluster
Example: the bi-cluster formed by objects {VPS8, EFB1, CYS3} on attributes {CH1I, CH1D, CH2B} of the microarray data above:

           CH1I   CH1D   CH2B   Obj base
VPS8        401    120    298    273
EFB1        318     37    215    190
CYS3        322     41    219    194
Attr base   347     66    244

Bicluster Structures

Cheng and Church
Example: correlation between any two columns = correlation between any two rows = 1.
$a_{ij} = a_{iJ} + a_{Ij} - a_{IJ}$, where $a_{iJ}$ = mean of row i, $a_{Ij}$ = mean of column j, $a_{IJ}$ = mean of A.
Biological meaning: the genes have the same (amount of) response to the conditions.
Example with background 5, column effects (1, 3, 2) and row effects (2, 4, 1); each entry is background + row effect + column effect:

              Col 0 (1)   Col 1 (3)   Col 2 (2)
Row 0 (2)         8          10           9
Row 1 (4)        10          12          11
Row 2 (1)         7           9           8
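A minimal sketch reproducing that additive example: every entry is background + row effect + column effect, so $a_{ij} = a_{iJ} + a_{Ij} - a_{IJ}$ holds exactly and the residue of such a perfect bicluster is zero.

```python
import numpy as np

background = 5.0
row_effect = np.array([2.0, 4.0, 1.0])
col_effect = np.array([1.0, 3.0, 2.0])
A = background + row_effect[:, None] + col_effect[None, :]   # the 3x3 example above

aiJ = A.mean(axis=1, keepdims=True)   # row means a_iJ
aIj = A.mean(axis=0, keepdims=True)   # column means a_Ij
aIJ = A.mean()                        # overall mean a_IJ
print(A)
print(np.allclose(A, aiJ + aIj - aIJ))  # True for a perfect additive bicluster
```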

bi-cluster
Perfect δ-cluster: every specified entry equals its object base plus its attribute base minus the bi-cluster base.
Imperfect δ-cluster: each entry $d_{ij}$ has residue $r_{ij} = d_{ij} - d_{iJ} - d_{Ij} + d_{IJ}$, where $d_{iJ}$ is the base of object i, $d_{Ij}$ the base of attribute j, and $d_{IJ}$ the base of the bi-cluster.

Cheng and Church
Model: a bicluster is represented by a submatrix A of the whole expression matrix (the involved rows and columns need not be contiguous in the original matrix).
Each entry A_ij in the bicluster is the superposition (summation) of: the background level, the row (gene) effect, and the column (condition) effect.
A dataset contains a number of biclusters, which are not necessarily disjoint.

Cheng and Church
Finding the largest δ-bicluster: the problem of finding the largest square δ-bicluster (|I| = |J|) is NP-hard.
The objective function for heuristic methods (to minimize) is the mean squared residue
$H(I, J) = \frac{1}{|I||J|} \sum_{i \in I, j \in J} (a_{ij} - a_{iJ} - a_{Ij} + a_{IJ})^2$,
a sum of components from each row and column, which suggests simple greedy algorithms that evaluate each row and column independently.
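A minimal sketch of computing H(I, J) for a fully specified submatrix (numpy; the helper name is my own):

```python
import numpy as np

def mean_squared_residue(sub):
    aiJ = sub.mean(axis=1, keepdims=True)   # row means a_iJ
    aIj = sub.mean(axis=0, keepdims=True)   # column means a_Ij
    aIJ = sub.mean()                        # overall mean a_IJ
    residue = sub - aiJ - aIj + aIJ
    return np.mean(residue ** 2)

sub = np.array([[8., 10., 9.], [10., 12., 11.], [7., 9., 8.]])
print(mean_squared_residue(sub))            # 0.0 for the perfect additive example above
```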

Cheng and Church
Greedy methods:
Algorithm 0: brute-force deletion (skipped).
Algorithm 1: single node deletion.
Parameter(s): δ (maximum mean squared residue).
Initialization: the bicluster contains all rows and columns.
Iteration: compute all a_Ij, a_iJ, a_IJ and H(I, J) for reuse; remove the row or column that gives the maximum decrease of H.
Termination: when no action will decrease H, or H <= δ.
Time complexity: O(MN).
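A rough, self-contained sketch of Algorithm 1 under two simplifying assumptions of mine: the matrix is fully specified, and the "maximum decrease of H" step is approximated by removing the row or column with the largest mean squared residue contribution.

```python
import numpy as np

def single_node_deletion(A, delta):
    rows, cols = list(range(A.shape[0])), list(range(A.shape[1]))
    while True:
        sub = A[np.ix_(rows, cols)]
        aiJ = sub.mean(axis=1, keepdims=True)
        aIj = sub.mean(axis=0, keepdims=True)
        res2 = (sub - aiJ - aIj + sub.mean()) ** 2
        H = res2.mean()                               # mean squared residue H(I, J)
        if H <= delta or len(rows) <= 2 or len(cols) <= 2:
            return rows, cols
        row_scores, col_scores = res2.mean(axis=1), res2.mean(axis=0)
        if row_scores.max() >= col_scores.max():      # drop the worst row or column
            del rows[int(row_scores.argmax())]
        else:
            del cols[int(col_scores.argmax())]

A = np.random.default_rng(0).random((20, 10))
print(single_node_deletion(A, delta=0.02))
```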

Cheng and Church
Greedy methods:
Algorithm 2: multiple node deletion (takes one more parameter α; in the iteration step, delete all rows and columns whose row/column residue exceeds α · H(I, J)).
Algorithm 3: node addition (allows both additions and deletions of rows/columns).
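A sketch of a single multiple-node-deletion pass under the same assumptions as above (fully specified data, my own function name); the full Algorithm 2 alternates row and column passes and recomputes H between them.

```python
import numpy as np

def multiple_node_deletion_step(sub, alpha):
    aiJ = sub.mean(axis=1, keepdims=True)
    aIj = sub.mean(axis=0, keepdims=True)
    res2 = (sub - aiJ - aIj + sub.mean()) ** 2
    H = res2.mean()
    keep_rows = np.where(res2.mean(axis=1) <= alpha * H)[0]   # drop rows over alpha * H
    keep_cols = np.where(res2.mean(axis=0) <= alpha * H)[0]   # drop columns over alpha * H
    return sub[np.ix_(keep_rows, keep_cols)], keep_rows, keep_cols
```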

Cheng and Church
Handling missing values and masking discovered biclusters: replace the entries by random numbers so that no recognizable structures are introduced.
Data preprocessing:
Yeast: x → 100 · log(10^5 · x)
Lymphoma: x → 100 · x (original data is already log-transformed)
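For concreteness, a tiny sketch of those transforms (assuming positive raw expression values; the log base is not stated in the transcript, so natural log is used here as an assumption):

```python
import numpy as np

def preprocess_yeast(x):
    # x -> 100 * log(1e5 * x); log base assumed, see note above
    return 100.0 * np.log(1e5 * np.asarray(x, dtype=float))

def preprocess_lymphoma(x):
    # data already log-transformed; just rescale
    return 100.0 * np.asarray(x, dtype=float)
```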

Cheng and Church
Some results on yeast cell cycle data (2884 genes × 17 conditions):

Cheng and Church
Some results on lymphoma data (4026 genes × 96 conditions). Discovered bicluster sizes (no. of genes, no. of conditions):
(4, 96), (10, 29), (11, 25), (103, 25), (127, 13), (13, 21), (10, 57), (2, 96), (25, 12), (9, 51), (3, 96)

Cheng and Church
Discussion:
Biological validation: comparison with the clusters in previously published results.
No evaluation of the statistical significance of the clusters.
Neither the model nor the algorithm is tailored for discovering multiple non-disjoint clusters.
Normalization is of utmost importance for the model, but this issue is not well discussed.

The FLOC algorithm
1. Generate initial clusters.
2. Determine the best action for each row and each column.
3. Perform the best action of each row and column sequentially.
4. If the clustering improved, return to step 2; otherwise stop.

The FLOC algorithm
Action: the change of membership of a row (or column) with respect to a cluster.
With N rows and M columns, M + N actions are performed at each iteration.
[Figure: an N = 3 by M = 4 matrix with each row and column annotated by the cluster whose membership its best action would change.]
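A minimal runnable sketch (my own simplification, not the full FLOC algorithm) of evaluating one such action: the gain of toggling a row's membership in a cluster is taken here as the reduction in mean squared residue that the toggle would produce.

```python
import numpy as np

def msr(sub):
    aiJ = sub.mean(axis=1, keepdims=True)
    aIj = sub.mean(axis=0, keepdims=True)
    return np.mean((sub - aiJ - aIj + sub.mean()) ** 2)

def row_action_gain(A, rows, cols, i):
    # Toggle row i's membership (add if absent, remove if present) and
    # report the resulting decrease in residue.
    new_rows = [r for r in rows if r != i] if i in rows else sorted(rows + [i])
    return msr(A[np.ix_(rows, cols)]) - msr(A[np.ix_(new_rows, cols)])

A = np.array([[1., 2., 3.], [2., 3., 4.], [9., 1., 7.], [3., 4., 5.]])
print(row_action_gain(A, rows=[0, 1, 2], cols=[0, 1, 2], i=2))  # removing the noisy row
print(row_action_gain(A, rows=[0, 1, 2], cols=[0, 1, 2], i=3))  # adding a coherent row
```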

Performance
Microarray data: 2884 genes, 17 conditions.
The 100 bi-clusters with the smallest residue were returned; average residue = 10.34.
The average residue of clusters found via the state-of-the-art method in the computational biology field is 12.54.
The average volume is 25% bigger, and the response time is an order of magnitude faster.

Coherent Cluster
We want to accommodate noise but not outliers.

Coherent Cluster
For a 2×2 (sub)matrix consisting of objects {x, y} and attributes {a, b}:
Subspace clustering uses a distance-based pair-wise disparity between the objects.
A coherent cluster uses the pattern-based pair-wise disparity
$D = |(d_{xa} - d_{ya}) - (d_{xb} - d_{yb})|$,
where $d_{xa} - d_{ya}$ is the mutual bias of attribute a and $d_{xb} - d_{yb}$ is the mutual bias of attribute b.

Coherent Cluster
A 2×2 (sub)matrix is a δ-coherent cluster if its D value is less than or equal to δ.
An m×n matrix X is a δ-coherent cluster if every 2×2 submatrix of X is a δ-coherent cluster.
A δ-coherent cluster is a maximum δ-coherent cluster if it is not a submatrix of any other δ-coherent cluster.
Objective: given a data matrix and a threshold δ, find all maximum δ-coherent clusters.
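A short sketch checking this definition by brute force (my own helper name; enumeration of all 2×2 submatrices is only meant to illustrate the definition, not to be efficient), applied to the bi-cluster shown earlier:

```python
from itertools import combinations
import numpy as np

def is_delta_coherent(X, delta):
    n, m = X.shape
    for x, y in combinations(range(n), 2):
        for a, b in combinations(range(m), 2):
            D = abs((X[x, a] - X[y, a]) - (X[x, b] - X[y, b]))  # pair-wise disparity
            if D > delta:
                return False
    return True

X = np.array([[401., 120., 298.],
              [318.,  37., 215.],
              [322.,  41., 219.]])     # {VPS8, EFB1, CYS3} x {CH1I, CH1D, CH2B}
print(is_delta_coherent(X, delta=5.0))  # True: rows differ by constant shifts
```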

Coherent Cluster
Challenges:
Finding subspace clusters based on distance alone is already difficult because of the curse of dimensionality.
The (sub)set of objects and the (sub)set of attributes that form a cluster are unknown in advance and may not be adjacent to each other in the data matrix.
The actual values of the objects in a coherent cluster may be far apart from each other.
Each object or attribute in a coherent cluster may bear some relative bias (unknown in advance), and such bias may be local to the coherent cluster.

References
J. Yang, W. Wang, H. Wang, P. Yu. δ-Clusters: capturing subspace correlation in a large data set. Proceedings of the 18th IEEE International Conference on Data Engineering (ICDE), pp. 517-528, 2002.
H. Wang, W. Wang, J. Yang, P. Yu. Clustering by pattern similarity in large data sets. Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD), 2002.
S. Yoon, C. Nardini, L. Benini, G. De Micheli. Enhanced pClustering and its applications to gene expression data. IEEE Symposium on Bioinformatics and Bioengineering (BIBE), 2004.
J. Liu and W. Wang. OP-Cluster: clustering by tendency in high dimensional space. ICDM 2003.