Consistent Bipartite Graph Co-Partitioning for High-Order Heterogeneous Co-Clustering. Tie-Yan Liu, WSM Group, Microsoft Research Asia, 2005.11.11. Joint work with Bin Gao, Peking University.
Talk at NTU, Tie-Yan Liu
Outline
- Motivation
  - What is high-order heterogeneous co-clustering?
  - Why previous methods cannot work well on this problem
- Consistent Bipartite Graph Co-partitioning (CGBC)
- Experimental Evaluation
- Conclusions and Future Work
Clustering
- Clustering groups data objects into clusters so that objects in the same cluster are similar to each other.
- Spectral clustering
  - Models the similarity of data objects by an affinity graph and assumes that the best clustering result corresponds to the minimal (ratio, normalized, or min-max) graph cut.
  - It can be proven that the minimum of the (relaxed) normalized cut can be achieved by minimizing a generalized Rayleigh quotient, and the corresponding solution q is the eigenvector associated with the second smallest eigenvalue of the generalized eigenvalue problem Lq = λDq, where L is the graph Laplacian and D is the degree matrix.
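As a rough illustration of the spectral clustering step described above, here is a minimal NumPy sketch. The function name `normalized_cut_embedding` and the toy affinity matrix are made up for this example and do not come from the talk. It solves the generalized problem Lq = λDq through the equivalent symmetric matrix D^{-1/2} L D^{-1/2}, and it splits on the sign of q, which is one simple choice of partitioning rule.

```python
import numpy as np

def normalized_cut_embedding(W):
    """Return the relaxed normalized-cut indicator q for affinity matrix W."""
    W = np.asarray(W, dtype=float)
    d = W.sum(axis=1)                          # node degrees
    L = np.diag(d) - W                         # graph Laplacian L = D - W
    d_inv_sqrt = 1.0 / np.sqrt(d)
    # D^{-1/2} L D^{-1/2} has the same eigenvalues as L q = lambda D q.
    L_sym = d_inv_sqrt[:, None] * L * d_inv_sqrt[None, :]
    eigvals, eigvecs = np.linalg.eigh(L_sym)   # eigenvalues in ascending order
    return d_inv_sqrt * eigvecs[:, 1]          # map back: q = D^{-1/2} v

# Toy affinity graph (illustrative only): two triangles joined by one weak edge.
W = np.array([
    [0, 1, 1, 0.0, 0, 0],
    [1, 0, 1, 0.0, 0, 0],
    [1, 1, 0, 0.1, 0, 0],
    [0, 0, 0.1, 0, 1, 1],
    [0, 0, 0.0, 1, 0, 1],
    [0, 0, 0.0, 1, 1, 0],
])
q = normalized_cut_embedding(W)
labels = (q > 0).astype(int)   # the sign of q gives a two-way partition
print(labels)
```

On this toy graph the two triangles end up on opposite sides of the cut. Which triangle gets label 0 or 1 depends on the arbitrary sign of the eigenvector.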
Co-Clustering
- Co-clustering groups two types of objects into their own clusters simultaneously.
- Bipartite graph partitioning (Dhillon and Zha)
  - Uses a bipartite graph to model the inter-relationship between the two types of objects. All edges in the bipartite graph are of the same type, so the graph cut is still easy to define.
  - It can be proven that the solutions are the singular vectors associated with the second largest singular value of the normalized inter-relationship matrix (equivalently, the second smallest eigenvalue of the corresponding generalized eigenvalue problem).
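The bipartite case can be sketched in the same spirit. The snippet below is an illustrative NumPy example, not code from the talk: the function name, the toy relation matrix, and the sign-based split are all assumptions. It normalizes the relation matrix A as D1^{-1/2} A D2^{-1/2}, takes the singular vectors of the second largest singular value, and rescales them into row and column embeddings.

```python
import numpy as np

def bipartite_coclustering_embedding(A):
    """Embed rows and columns of relation matrix A via the second singular vectors."""
    A = np.asarray(A, dtype=float)
    d1 = A.sum(axis=1)                         # degrees of the row objects
    d2 = A.sum(axis=0)                         # degrees of the column objects
    # Normalized inter-relationship matrix D1^{-1/2} A D2^{-1/2}.
    A_n = A / np.sqrt(d1)[:, None] / np.sqrt(d2)[None, :]
    U, s, Vt = np.linalg.svd(A_n, full_matrices=False)  # singular values descending
    x = U[:, 1] / np.sqrt(d1)                  # row embedding
    y = Vt[1, :] / np.sqrt(d2)                 # column embedding
    return x, y

# Toy relation matrix (illustrative only): two blocks linked by one weak edge.
A = np.array([
    [1, 1, 0.0, 0, 0],
    [1, 1, 0.1, 0, 0],
    [0, 0, 1.0, 1, 1],
    [0, 0, 1.0, 1, 1],
])
x, y = bipartite_coclustering_embedding(A)
row_labels = (x > 0).astype(int)
col_labels = (y > 0).astype(int)
print(row_labels, col_labels)
```

Rows and columns that belong to the same block receive the same label, so both object types are clustered at once.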
High-Order Heterogeneous Co-Clustering (HHCC)
- HHCC groups multiple (more than two) types of objects into clusters simultaneously.
  - Order is defined as the number of types of objects.
- If we use graphs to represent the inter-relationships between data objects, the edges within each bipartite graph are of the same type, but the edges in different bipartite graphs are of different types. This is what "heterogeneous" refers to, in contrast to spectral clustering and bipartite graph co-clustering.
HHCC is not a Rare Problem
- Typical examples
  - Surrounding Text – Web Image – Visual Features
  - User – Query – Click-through
- Many other examples
  - Category – Document – Term
  - Reader – Newspaper – Article
  - Passenger – Airplane – Airways
  - Webpage – Website – Site-group
  - Article – Magazine – Category
  - Hardware – Computer – Usage
  - Software – People – Community
Why is HHCC a new problem?
- Although bipartite graph partitioning is just a trivial extension of spectral clustering, the extension to HHCC is non-trivial.
  - Since there are different types of edges in the HHCC problem, the cut of high-order data is difficult to define. It may not be reasonable to assign weights to heterogeneous edges in order to make their contributions to the graph cut comparable.
  - Simply applying spectral clustering may cause the high-order problem to degrade to a 2-order problem.
An Example of Weighting Heterogeneous Edges
[Figure: embeddings produced by spectral clustering for α = 0.01, α = 1, and α = 100]
No matter how we adjust the weights to balance the different types of edges, we can never cluster X into two groups successfully.
An Example of Weighting Heterogeneous Edges (Cont.) Mathematical Proof. Including X and Z
Order Degradation
[Figure: 3-order heterogeneous graph; 2-order heterogeneous graph]
Our Solution
- We tackle the aforementioned problems by proposing a new solution to HHCC: Consistent Bipartite Graph Co-Partitioning (CGBC).
- Where should we get started?
  - Star-structured HHCC
  - The concept of consistency
  - An SDP-based solution
Why Star-Structured?
- Star-structured means that the heterogeneous graph has a central type of objects that connects to all the other types of objects, and there are no direct connections between any of the other object types.
- The star structure is the simplest but a very common case of HHCC.
Why Star-Structured?
- The star structure is the simplest but a very common case of HHCC.
[Figure labels: Surrounding text, Web Images, Visual features; Author, Conference, Paper, Key Word; Customer, Shareholder, Shop, Supplier, Advertisement Media]
The Concept of Consistency
- Divide the star-structured HHCC problem into a set of bipartite sub-problems, where each sub-problem has only homogeneous edges.
- Solve each sub-problem separately to avoid order degradation.
- Add a global constraint on the central type of objects so as to get a feasible cut for the original problem.
The Concept of Consistency
- Divide this tripartite graph into two bipartite graphs.
- Partition these two graphs simultaneously and consistently.
Formulating the Optimization Problem
- Minimize the cuts of the two bipartite graphs, subject to the constraint that their partitioning results on the central type of objects are the same.
- Objective function: the definitions of q and p encode the consistency between the two graphs. The y component is the same in both embeddings, so the partitioning of the central type of objects is forced to be the same.
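The objective function on this slide did not survive in the transcript. The following is a sketch of one formulation consistent with the description above, not necessarily the exact formula shown in the talk. It assumes that A relates object types X and Y, that B relates Y and Z, that x, y, z are the embedding vectors of the three types, that e is the all-ones vector, and that D^{(i)} and L^{(i)} are the degree and Laplacian matrices of the two bipartite graphs.

```latex
\begin{aligned}
&\min_{q \neq 0} \frac{q^{T} L^{(1)} q}{q^{T} D^{(1)} q}
\quad\text{and}\quad
\min_{p \neq 0} \frac{p^{T} L^{(2)} p}{p^{T} D^{(2)} p} \\
&\text{s.t.}\quad q^{T} D^{(1)} e = 0,\qquad p^{T} D^{(2)} e = 0, \\
&\phantom{\text{s.t.}\quad} q = \begin{pmatrix} x \\ y \end{pmatrix},\qquad
p = \begin{pmatrix} y \\ z \end{pmatrix}.
\end{aligned}
```

Because y appears in both q and p, any solution assigns the central objects a single embedding, which is the consistency requirement stated in words above.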
How to Solve the Optimization Problem #1: Convert it to a QCQP Problem
- Simplify the original problem to single-objective programming.
- Introduce auxiliary notation.
- This yields a sum-of-ratios quadratic fractional programming problem, which is then converted to quadratically constrained quadratic programming (QCQP).
- Since the normalized Rayleigh quotient is already a scalar measure of the graph structure, a linear combination of the two Rayleigh quotients is reasonable, and its weight indicates which graph we should trust more.
- Linear combination is only one approach to multi-objective programming. Other methods that do not raise this concern can also be used.
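The formulas for these steps are missing from the transcript. A plausible reconstruction, offered as an assumption rather than as the slide's exact content, stacks the embeddings into ω = (x, y, z)^T, zero-pads L^{(1)}, L^{(2)}, D^{(1)}, D^{(2)} to the size of ω to obtain Γ1, Γ2, Π1, Π2, and uses the parameters β, θ1, θ2 named in the final algorithm. The way θ1 and θ2 enter the constraints is a guess.

```latex
\begin{aligned}
\text{Sum of ratios:}\quad
&\min_{\omega}\;
\beta\,\frac{\omega^{T}\Gamma_{1}\omega}{\omega^{T}\Pi_{1}\omega}
+(1-\beta)\,\frac{\omega^{T}\Gamma_{2}\omega}{\omega^{T}\Pi_{2}\omega},
\qquad 0<\beta<1,\\
&\text{s.t.}\quad \omega^{T}\Pi_{1}e=0,\qquad \omega^{T}\Pi_{2}e=0.\\[6pt]
\text{QCQP:}\quad
&\min_{\omega}\;\omega^{T}\Gamma\,\omega,
\qquad
\Gamma=\frac{\beta}{\theta_{1}\,e^{T}\Pi_{1}e}\,\Gamma_{1}
+\frac{1-\beta}{\theta_{2}\,e^{T}\Pi_{2}e}\,\Gamma_{2},\\
&\text{s.t.}\quad
\omega^{T}\Pi_{1}\omega=\theta_{1}\,e^{T}\Pi_{1}e,\qquad
\omega^{T}\Pi_{2}\omega=\theta_{2}\,e^{T}\Pi_{2}e,\\
&\phantom{\text{s.t.}\quad}
\omega^{T}\Pi_{1}e=0,\qquad \omega^{T}\Pi_{2}e=0.
\end{aligned}
```

Fixing the two denominators turns the sum of ratios into a single quadratic objective with quadratic equality constraints, which is what makes the problem a QCQP.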
How to Solve the Optimization Problem #2: Convert QCQP to SDP
Semi-definite Programming (SDP)
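The SDP itself is not preserved in the transcript. The sketch below shows the standard lifting of a QCQP to an SDP, under the assumption that the QCQP has the form min ω^T Γ ω subject to ω^T Π_i ω = θ_i e^T Π_i e and ω^T Π_i e = 0 for i = 1, 2. The lifted variable W stands in for the rank-one matrix (1, ω)^T (1, ω), and the relaxation keeps only W ⪰ 0. The symbol • denotes the matrix inner product trace(M^T W). This may differ in detail from the formulation used in the talk.

```latex
\begin{aligned}
&\min_{W}\;
\begin{pmatrix} 0 & 0 \\ 0 & \Gamma \end{pmatrix}\bullet W \\
&\text{s.t.}\quad
\begin{pmatrix} -\theta_{i}\,e^{T}\Pi_{i}e & 0 \\ 0 & \Pi_{i} \end{pmatrix}\bullet W = 0,
\qquad i=1,2,\\
&\phantom{\text{s.t.}\quad}
\begin{pmatrix} 0 & \tfrac{1}{2}e^{T}\Pi_{i} \\ \tfrac{1}{2}\Pi_{i}e & 0 \end{pmatrix}\bullet W = 0,
\qquad i=1,2,\\
&\phantom{\text{s.t.}\quad}
\begin{pmatrix} 1 & 0 \\ 0 & 0 \end{pmatrix}\bullet W = 1,
\qquad W \succeq 0 .
\end{aligned}
```

Under this lifting, ω can be read off the first column of W (excluding its first entry).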
The Final Algorithm (CGBC)
1. Set the parameters β, θ1, and θ2.
2. Given the inter-relation matrices A and B, form the corresponding diagonal matrices and Laplacian matrices D(1), D(2), L(1), and L(2).
3. Extend D(1), D(2), L(1), and L(2) to Π1, Π2, Γ1, and Γ2, and form Γ, so that the coefficient matrices in the SDP problem can be computed.
4. Solve the above SDP problem with an iterative solver such as SDPA.
5. Extract ω from W and regard it as the embedding vector of the heterogeneous objects.
6. Run the k-means algorithm on ω to obtain the desired partitioning of the heterogeneous objects.
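Steps 2 and 3 can be sketched with NumPy as follows. This is an illustrative sketch, not the authors' implementation. It assumes that A is the X-Y relation matrix and B is the Y-Z relation matrix, that "extending" means zero-padding to the size of ω = (x, y, z), and that Γ is the weighted combination β/(θ1 e^T Π1 e) Γ1 + (1−β)/(θ2 e^T Π2 e) Γ2. The helper names, default parameter values, and toy matrices are hypothetical. The SDP solve in step 4 is left out because it needs an external solver.

```python
import numpy as np

def bipartite_degree_and_laplacian(R):
    """Degree matrix D and Laplacian L of the bipartite graph with relation matrix R."""
    r, c = R.shape
    M = np.block([[np.zeros((r, r)), R], [R.T, np.zeros((c, c))]])
    D = np.diag(M.sum(axis=1))
    return D, D - M

def pad(M, before, after):
    """Embed a square matrix in a larger zero matrix (hypothetical helper)."""
    return np.pad(M, ((before, after), (before, after)))

def build_sdp_matrices(A, B, beta=0.5, theta1=1.0, theta2=1.0):
    """Form Pi1, Pi2, Gamma1, Gamma2 and Gamma; defaults are placeholders."""
    m, n = A.shape
    n2, k = B.shape
    assert n == n2, "A and B must share the central object type"
    D1, L1 = bipartite_degree_and_laplacian(A)    # acts on (x, y)
    D2, L2 = bipartite_degree_and_laplacian(B)    # acts on (y, z)
    Pi1, Gamma1 = pad(D1, 0, k), pad(L1, 0, k)    # zero block for z
    Pi2, Gamma2 = pad(D2, m, 0), pad(L2, m, 0)    # zero block for x
    e = np.ones(m + n + k)
    # Assumed combination rule for Gamma; see the lead-in.
    Gamma = (beta / (theta1 * (e @ Pi1 @ e))) * Gamma1 \
        + ((1 - beta) / (theta2 * (e @ Pi2 @ e))) * Gamma2
    return Pi1, Pi2, Gamma1, Gamma2, Gamma

# Toy relation matrices (illustrative only): 3 X-objects, 2 Y-objects, 4 Z-objects.
A = np.array([[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]])
B = np.array([[1.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 1.0]])
Pi1, Pi2, Gamma1, Gamma2, Gamma = build_sdp_matrices(A, B)
print(Gamma.shape)
```

The zero-padding is what lets both bipartite graphs act on the same vector ω, so the shared y block carries the consistency constraint into the single matrix Γ.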
CGBC's Extension to k-Star-Structured HHCC
Experiment on Toy Problem
[Figure panels: Relation Matrix A; Relation Matrix B; Embedding values of heterogeneous objects]
- Totally based on the first graph: Y(8:12)
- Totally based on the second graph: Y(12:8)
- A more reasonable cut, based on the information from both the first and the second graph: β=
Experiment on Web Image Clustering
Embedding of the Clustering
[Figure panels: Hill vs Owl; Flying vs Map]
Average Performance; Performance Comparison
Conclusions
- We propose a new problem named high-order heterogeneous co-clustering (HHCC).
- We propose a consistent bipartite graph co-partitioning algorithm to solve the HHCC problem with star-structured inter-relationship.
- Various experiments demonstrate the effectiveness of our proposed algorithm.
References
- Bin Gao, Tie-Yan Liu, et al., Consistent Bipartite Graph Co-Partitioning for Star-Structured High-Order Heterogeneous Data Co-Clustering, in Proceedings of the Eleventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD 2005), pp. 41-50.
- Bin Gao, Tie-Yan Liu, Tao Qin, Qian-Sheng Cheng, Wei-Ying Ma, Web Image Clustering by Consistent Utilization of Low-level Features and Surrounding Texts, in Proceedings of ACM Multimedia 2005.