1 Multi-view Sparse Co-clustering via Proximal Alternating Linearized Minimization
Jinbo Bi, joint work with Jiangwen Sun, Jin Lu, and Tingyang Xu
Department of Computer Science and Engineering, University of Connecticut

2 Outline
Existing co-clustering methods
Motivation of our problem – multi-view sparse co-clustering
Our formulation – low-rank matrix approximation
Optimization of our formulated problem – proximal alternating linearized minimization
Experimental results
Conclusion
We review what existing co-clustering methods do, then we talk about why we want to solve multi-view sparse co-clustering. Our approach is formulated based on low-rank matrix approximation. We developed a proximal alternating linearized minimization algorithm to optimize the resulting problem. We discuss our experimental results and then conclude.

3 Existing co-clustering methods
Bi-clustering – jointly cluster rows and columns of a data matrix
Clustering in subspaces – search subspaces, and find different cluster solutions in the different subspaces
Existing co-clustering methods are largely divided into two lines. One line is called bi-clustering, where we jointly cluster rows and columns of a data matrix, so each cluster of subjects differs from the others on a subset of features rather than on all features. Bi-clustering is similar to subspace clustering, where we cluster subjects in subspaces, or more generally, we search for subspaces of the feature dimensions to find different subject groupings in the different subspaces.
(Figure: a data matrix)

5 Existing co-clustering methods
Multi-view co-clustering – e.g., co-regularized spectral clustering (using all features to compute a similarity matrix for each view)
Another line is relevant to multiple data matrices and is called multi-view co-clustering, where the same subjects are viewed in different data sources. We could do cluster analysis in each view separately, but we want the resultant subject clusters to be consistent across the views.
(Figure: data matrix 1 and data matrix 2)

6 Multi-view sparse co-clustering
Find subspaces in each view so as to identify clusters in those subspaces that are consistent across the views
Now the problem we want to solve is called multi-view sparse co-clustering, where we find subspaces in each view so as to identify clusters that agree across the views. The fundamental assumption here is that the subject clusters may exist in subspaces rather than in all dimensions.
(Figure: data matrix 1 and data matrix 2)

7 Motivation – example applications
Derive subtypes of a complex disease in both clinical symptoms and genetic variants
A disease subtype may be characterized by a subset of symptoms (not all symptoms)
Usually only very few genetic variants from DNA are associated with a disease subtype
Detect conserved gene co-regulation among multiple species
Genes are co-regulated (up- or down-regulated) only at certain stages (not all stages) for every species
The same subset of genes may be co-regulated at different stages for different species
This problem is encountered in many scientific domains. Here, we give two bioinformatics problems as examples. When we derive subtypes of a complex disorder, the subtypes need to be defined in both clinical symptoms and genetic variations. A disease subtype may be characterized by a subset of symptoms rather than all symptoms, and only a few specific genetic variants, rather than the entire DNA, may be associated with a disease subtype. Hence, the subtype clusters exist in subspaces of the two views. When we detect gene co-regulation that is conserved across species, each species gives us a data matrix containing gene expression levels observed at different developmental stages. The genes can be up- or down-regulated at certain stages, which may differ among species, but not at all stages.

8 Our formulation – single view
Based on low-rank matrix approximation – perform a sequence of rank-one approximations u vᵀ to the data matrices
Require sparse vectors to be used in the decomposition
Our solution to multi-view sparse co-clustering is based on low-rank matrix approximation. We propose to perform a sequence of rank-one approximations to the data matrices. For one data matrix, our task is similar to bi-clustering. Suppose we approximate this data matrix by the outer product of vectors u and v, and we require both u and v to be sparse. For instance, if they have non-zero values here, the first cluster is identified. This means we impose an ℓ0-norm regularization on u and v when solving the approximation. Note that in this process, we do not care about the actual values of the non-zero entries, but only how many non-zero entries there are and where they are located, so an ℓ0-norm penalty is appropriate.
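A plausible way to write this single-view objective in math is the following sketch (our own notation; the constrained form with sparsity levels s_u and s_v is an assumption for illustration, and a penalized ℓ0 form is equivalent in spirit):

```latex
% Single-view sparse rank-one approximation (illustrative form)
\min_{u,\,v}\;\; \bigl\| X - u v^{\top} \bigr\|_F^2
\quad \text{subject to} \quad \|u\|_0 \le s_u, \;\; \|v\|_0 \le s_v .
```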

9 Our formulation – multiple views
Given m data matrices, we want the u's to have non-zero entries at the same positions
We hence use a binary vector to connect the different views
Now, given multiple, say m, data matrices, to find the same subject clusters we need the left vectors u to have their non-zero entries at the same positions. We hence use a binary indicator vector ω to connect the different views when we solve the approximation problems. In each view's data approximation, we multiply u component-wise by this shared ω. Now we no longer need u to be sparse, because once ω is sparse, it enforces the left vectors of all views to have the same sparsity pattern. Equivalently, this amounts to solving a non-convex and non-smooth minimization problem in which ω is constrained to lie in {0,1}^n, the set of binary vectors of length n.
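A sketch of the multi-view objective described above, in our own notation (the paper's exact formulation may differ in normalization constraints and in how the sparsity levels are imposed):

```latex
% Multi-view rank-one co-approximation with a shared binary indicator (illustrative form)
\min_{\{u_k, v_k\}_{k=1}^{m},\;\omega}\;\;
\sum_{k=1}^{m} \bigl\| X_k - (\omega \circ u_k)\, v_k^{\top} \bigr\|_F^2
\quad \text{subject to} \quad
\omega \in \{0,1\}^n, \;\; \|\omega\|_0 \le s_\omega, \;\; \|v_k\|_0 \le s_{v_k},
```

where ∘ denotes the component-wise (Hadamard) product.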

10 Optimization algorithm
The framework of proximal alternating linearized minimization (PALM) (Bolte et al., 2014)
Can solve an optimization problem with multiple blocks of variables
Only requires the term of the objective that couples all blocks of variables to be smooth
Only requires that the smooth part of the objective has block-wise Lipschitz continuous gradients for convergence
Has been proved to globally converge to a critical point of the problem if the problem satisfies certain conditions
To effectively optimize the problem, we developed an algorithm based on the framework of proximal alternating linearized minimization, in short PALM. PALM is a framework for solving optimization problems with multiple blocks of variables. It only requires that the term of the objective involving all variables be smooth, and that this smooth part have block-wise Lipschitz continuous gradients, for convergence. Most proximal gradient methods instead require a globally Lipschitz continuous gradient. PALM has been proved to globally converge to a critical point of the problem if the problem satisfies certain conditions.
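For reference, the generic PALM update for one block x, with smooth coupling term H and non-smooth regularizer f, can be written as follows (our paraphrase of Bolte et al., 2014):

```latex
% One PALM step for block x (y collects the other blocks, held fixed)
x^{t+1} \in \operatorname{prox}^{f}_{c_t}\!\Bigl( x^{t} - \tfrac{1}{c_t}\, \nabla_x H(x^{t}, y^{t}) \Bigr),
\qquad c_t = \gamma\, L_x(y^{t}), \;\; \gamma > 1,
```

where L_x(y^t) is the Lipschitz modulus of ∇_x H(·, y^t).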

11 Optimization algorithm
We derive a PALM algorithm that alternates between optimizing the u's, the v's and ω
To solve for each block of variables, we use a proximal gradient step; for instance, we solve u_k by a gradient step on the smooth part h with step size 1/(γL), where h is the smooth part of the objective, γ is a pre-chosen constant, and L is the Lipschitz modulus of the partial gradient of h with respect to u_k
We hence alternate between optimizing the u's, the v's and ω. For each group of variables, we use a proximal gradient method. That gives us a closed-form updating formula; for instance, the formula to update u for each view takes a step of size 1/(γL), where γ is a pre-chosen constant and L is the Lipschitz modulus of the gradient.
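A minimal numerical sketch of such an update for one view, assuming the smooth part is h(u_k) = ||X_k − (ω∘u_k) v_kᵀ||_F² (the function and variable names are illustrative, not the paper's implementation):

```python
import numpy as np

def update_u(X_k, u_k, v_k, w, gamma=1.5):
    """One gradient step on the smooth part h(u) = ||X_k - (w * u) v_k^T||_F^2
    with step size 1/(gamma * L), as in a proximal linearized update."""
    # Gradient of h with respect to u: -2 * w * ((X_k - (w*u) v_k^T) v_k)
    residual = X_k - np.outer(w * u_k, v_k)
    grad = -2.0 * w * (residual @ v_k)
    # A valid Lipschitz modulus of this gradient in u: L = 2 * ||v_k||^2 * max(w^2)
    L = 2.0 * np.dot(v_k, v_k) * np.max(w ** 2) + 1e-12
    return u_k - grad / (gamma * L)
```

Because the multi-view formulation places no ℓ0 penalty on u itself, the proximal step for u reduces to this plain gradient step; any additional constraint on u in the actual paper would change the final projection.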

12 Optimization algorithm
When solve for v’s and ω, we can similarly derive the proximal operator problems but now with L0-regularizer When we solve for v’s and w, we similarly derive the proximal operator problems but now with the 0-norm regularization.

13 Optimization algorithm
Both of the proximal operator problems have closed-form solutions
These two sub-problems have closed-form solutions as follows. For instance, for the shared vector ω, we first compute the solution of the unconstrained problem, then we threshold this vector and only keep those entries whose magnitude is greater than a threshold. This threshold is determined by the hyper-parameter s_ω. The thresholds α and β are determined by the sparsity hyper-parameters for ω and the v's.
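A minimal sketch of the thresholding step described above, assuming a keep-the-top-s hard-thresholding rule (the paper's exact proximal operator, including how ω is binarized, may differ):

```python
import numpy as np

def hard_threshold_top_s(z, s):
    """Keep only the s entries of z with largest magnitude and zero out the rest.
    This is the closed-form proximal operator of an L0 cardinality constraint."""
    out = np.zeros_like(z)
    if s > 0:
        keep = np.argsort(np.abs(z))[-s:]  # indices of the s largest-magnitude entries
        out[keep] = z[keep]
    return out

def update_omega(z_unconstrained, s_omega):
    """For the shared vector omega: threshold the unconstrained solution, then set
    the surviving entries to 1, since omega is constrained to {0,1}^n (assumption)."""
    return (hard_threshold_top_s(z_unconstrained, s_omega) != 0).astype(float)
```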

14 Optimization algorithm
Convergence analysis shows the following result.
Theorem: Let z be the vector consisting of all variables of the proposed problem, and {z^t} be a sequence generated by our PALM algorithm. Then the sequence {z^t} has finite length and converges to a critical point of the problem.
The algorithm takes computation time of O(nmd).
Our convergence analysis shows this PALM-based algorithm can globally converge.

15 Computational results
The proposed algorithm was tested on:
Simulations
Benchmark datasets
Comparison methods:
Single-view sparse low-rank approximation
Kernel addition
Kernel product
Co-regularized spectral clustering (Kumar et al., 2011)
Co-trained spectral clustering (Kumar & Daume III, 2011)
Multi-view CCA (Chaudhuri et al., 2009)
Multi-view feature learning (Wang et al., 2013)
We tested our algorithm in simulations and on benchmark data. We compared it with seven other methods; the first three are baseline methods, and the other four are state-of-the-art multi-view co-clustering methods.

16 Simulations
We synthesized two views of data
Genetic view: 1092 subjects, 100 genetic markers from the 1000 Genomes Project
Clinical view: synthesized 9 clinical variables for the 1092 subjects
Created two clusters in each view; each of the two clusters is associated with 10 randomly picked genetic markers and 3 randomly picked clinical variables
Subjects not in the two clusters form a third cluster
We added noise measured by a parameter e; the larger e is, the more the cluster solutions in the two views agree
In simulation, please read the slides.

17 Results in simulations
This table shows the normalized mutual information, which computes the mutual information between the synthesized cluster assignments and the cluster assignments produced by each method, normalized by the cluster entropies. It ranges from 0 to 1, and the higher the better. The proposed method clearly outperformed the other methods.
NMI: normalized mutual information computes the mutual information between two cluster assignments, normalized by the cluster entropies.
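For reference, an NMI score of this kind can be computed with scikit-learn (the labels below are illustrative, and the exact entropy normalization may differ slightly from the variant used in the paper):

```python
from sklearn.metrics import normalized_mutual_info_score

# Illustrative cluster assignments: synthesized ground truth vs. a method's output
true_labels = [0, 0, 1, 1, 2, 2]
pred_labels = [0, 0, 1, 2, 2, 2]

# NMI is 0 for unrelated assignments and 1 for identical partitions
print(normalized_mutual_info_score(true_labels, pred_labels))
```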

18 Results in simulations
Among all of the comparison methods, only our method can identify features for the clusters. This table summarizes the feature selection performance. Our algorithm recovers the true features quite accurately.
TF: true features, TPF: true positive features, FPF: false positive features

19 Benchmark datasets
UCI Handwritten Digits dataset:
2000 examples, 6 views
The views have different features, e.g., 240 pixel averages in 2-by-3 sub-images in one view, 76 Fourier coefficients in another view
Crowd-sourcing dataset:
584 images, 2 views
One view has 15,369 image features, the other has 108 labels provided by 27 online labelers
We used two benchmark datasets: the UCI handwritten digits data, which has 6 views and 2000 examples, and the crowd-sourcing dataset, which has 2 views and 584 examples.

20 Results on benchmark data
Again, here are the normalized mutual information values. Our method achieved the highest values among all methods.
NMI: normalized mutual information computes the mutual information between two cluster assignments, normalized by the cluster entropies.

21 Conclusion
We believe this is the first method that searches subspaces for multi-view consistent clusters.
The proposed PALM-based algorithm is efficient: at each alternating step, it evaluates an analytical formula, and its computation time has linear complexity.
The algorithm can globally converge to a critical point of the problem.
Our approach directly solves a formulation with the ℓ0 regularization (rather than its approximation).
The take-home message is the above.

22 Thank you! References
Bolte et al., Proximal alternating linearized minimization for nonconvex and nonsmooth problems, Mathematical Programming, 146(1), 2014.
Chaudhuri et al., Multi-view clustering via canonical correlation analysis, International Conference on Machine Learning, 2009.
Kumar et al., A co-training approach for multi-view spectral clustering, International Conference on Machine Learning, 2011.
Kumar et al., Co-regularized multi-view spectral clustering, Advances in Neural Information Processing Systems, 2011.
Wang et al., Multi-view clustering and feature learning via structured sparsity, International Conference on Machine Learning, JMLR 28, 2013.
Thank you! This work is supported by NSF and NIH grants.

