Andreas Papadopoulos - [DEXA 2015] Clustering Attributed Multi-graphs with Information Ranking 26th International.

Presentation transcript:

Andreas Papadopoulos - [DEXA 2015] Clustering Attributed Multi-graphs with Information Ranking 26th International Conference on Database and Expert Systems Applications Sep. 1-4, 2015 Valencia, Spain Andreas Papadopoulos, Dimitrios Rafailidis, George Pallis, Marios D. Dikaiakos

Slide 2 of 35: The Real World: Information Networks
[Figure: information networks whose edges are labeled with the types Friendship and Coauthor]

Slide 4 of 35: Challenges
- Identify the importance of each edge-type/attribute property
- For instance, when clustering a bibliography network:
  - The attribute ‘area of interest’ is important
  - The attributes ‘name’ and ‘gender’ may introduce noise and reduce the clustering accuracy
- Combine the attribute and structural vertex properties
- Edges and attributes are of different types

Slide 5 of 35: Related Work
- Limited attention to the different importance of attributes/edge-types
- Weights are mainly updated at each iteration
- Ignore the existence of multiple edge-types
- Increases computational cost and complexity
- Spectral clustering is not used for clustering attributed graphs
- Used to identify dense clusters in attribute subspaces
- Model-Based: BAGC [SIGMOD ‘12, TKDD ‘14], CESNA [ICDM ‘13]
- Distance-Based: SACluster [VLDB ‘09, TKDD ‘11], PICS [SDM ‘12], HASCOP [WI ‘13]

Slide 6 of 35: Proposed Approach: CAMIR
Clustering Attributed Multi-graphs with Information Ranking (CAMIR):
1. Rank edge-type and attribute properties
2. Construct a unified similarity matrix
3. Adopt a spectral clustering technique to generate the final clusters

Presentation Outline
- Motivation
- Problem Definition
- Related Work
- Background
- Proposed Approach: CAMIR
- Evaluation
- Summary

Slide 9 of 35: Background: Graph Partitioning
- An edge represents the similarity of the two connected vertices
- Find the minimum cut of a graph
  - Minimizes inter-cluster similarities
  - Identifies an optimal partitioning of the graph
- Identifying a minimum cut is computationally difficult
  - Efficient approximations using linear algebra
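To make the cut objective concrete, here is a minimal sketch that scores a given partition by the total similarity crossing cluster boundaries. The function name `cut_weight` and the toy graph are illustrative assumptions, not part of the presentation.

```python
def cut_weight(W, labels):
    """Sum of similarities on edges whose endpoints lie in different clusters.

    W is a symmetric similarity matrix given as a list of lists.
    """
    n = len(W)
    total = 0.0
    for i in range(n):
        for j in range(i + 1, n):  # visit each undirected pair once
            if labels[i] != labels[j]:
                total += W[i][j]
    return total

# Toy graph (assumed for illustration): two triangles joined by one weak edge.
W = [[0.0] * 6 for _ in range(6)]
for i, j, w in [(0, 1, 1.0), (0, 2, 1.0), (1, 2, 1.0),
                (3, 4, 1.0), (3, 5, 1.0), (4, 5, 1.0),
                (2, 3, 0.1)]:
    W[i][j] = W[j][i] = w

good = cut_weight(W, [0, 0, 0, 1, 1, 1])  # cuts only the weak edge
bad = cut_weight(W, [0, 0, 1, 1, 1, 0])   # cuts four strong edges
```

A partition that separates the two triangles crosses only the weak edge, so it has the smaller cut; searching over all partitions for the best one is what the slide describes as computationally difficult.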

Slide 10 of 35: Background: Spectral Clustering
- Based on the graph Laplacian, or Laplacian matrix
- Given a similarity matrix, the normalized symmetric Laplacian L is defined [formula not preserved in the transcript]
- The eigenvectors corresponding to the top k eigenvalues are the projection of the graph into R^(|V| x k)
- Data is easily separable into clusters, e.g. using k-means
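The formula for L did not survive extraction. One common definition that is consistent with taking the eigenvectors of the top k eigenvalues is the Ng-Jordan-Weiss normalization shown below. This is an assumption about what the slide showed, with W denoting the similarity matrix and D its diagonal degree matrix (both symbols are introduced here for illustration).

```latex
D_{ii} = \sum_{j} W_{ij}, \qquad
L = D^{-1/2} \, W \, D^{-1/2}, \qquad
U = [\, u_1, \dots, u_k \,] \in \mathbb{R}^{|V| \times k}
```

Here u_1, ..., u_k are the eigenvectors of L with the k largest eigenvalues. If the Laplacian is instead written as I - D^{-1/2} W D^{-1/2}, the same eigenvectors correspond to the k smallest eigenvalues.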

Slide 11 of 35: Background: Spectral Clustering
[Figure: an adjacency matrix, its Laplacian matrix, and the top 3 eigenvectors]

Slide 12 of 35: How do we define the similarity matrix for an attributed multi-graph?

Slide 13 of 35: Background: Similarity Matrices
[Figure: attribute values (e.g. IR, DM, AI) are mapped through a Gaussian kernel to an N x N matrix with entries in [0,1]; edges are mapped to an N x N matrix with entries in [0,1]]
- #Edge types + #Attributes symmetric, non-negative similarity matrices
- How do we efficiently combine the similarity matrices?
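As a rough illustration of how one similarity matrix per property might be built, the sketch below applies a Gaussian kernel to a numeric attribute and scales the edge weights of one edge type into [0, 1]. The function names, the bandwidth `sigma`, the scaling by the maximum weight, and the sample data are all assumptions made for this example.

```python
import math

def gaussian_similarity(values, sigma=1.0):
    """N x N Gaussian-kernel similarity for one numeric attribute."""
    n = len(values)
    return [[math.exp(-((values[i] - values[j]) ** 2) / (2.0 * sigma ** 2))
             for j in range(n)]
            for i in range(n)]

def edge_similarity(n, edges):
    """N x N similarity for one edge type.

    edges is a list of (u, v, weight) with positive weights; weights are
    divided by the largest weight so every entry lies in [0, 1].
    """
    S = [[0.0] * n for _ in range(n)]
    max_w = max((w for _, _, w in edges), default=1.0)
    for u, v, w in edges:
        S[u][v] = S[v][u] = w / max_w  # keep the matrix symmetric
    return S

# Hypothetical data: three vertices with one numeric attribute and one edge type.
ages = [30.0, 31.0, 60.0]
S_attr = gaussian_similarity(ages, sigma=5.0)
S_edge = edge_similarity(3, [(0, 1, 2.0), (1, 2, 1.0)])
```

Both matrices are symmetric and non-negative with entries in [0, 1], which is the common form the slide requires before the matrices are combined.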

Slide 15 of 35: CAMIR Overview
1. Rank vertex properties and calculate their weights accordingly
   - By considering the agreement among vertex properties
2. Compute a unified similarity matrix
   - By combining all vertex properties based on their ranking
3. Generate the final clusters
   - By adopting a spectral clustering approach

Slide 16 of 35: Presentation Outline
- Motivation
- Problem Definition
- Related Work
- Background
- Proposed Approach: CAMIR
  1. Information Ranking
  2. Unified Similarity Matrix
  3. Generate the final clusters
- Evaluation
- Summary

Slide 17 of 35: Information Ranking
- Rank attribute and edge-type properties
- Iteratively select the most informative property from the set of unranked properties
- Most informative property [NIPS ’11]:
  - Has the highest ‘agreement’ with other properties
  - Properties ‘agree’ when they assign vertices the same cluster labels when used individually

Slide 18 of 35: Information Ranking
- Rank attribute and edge-type properties
- Iteratively select the most informative property from the set of unranked properties
- From the set of properties, the most informative property is p [NIPS ‘11]
- The highest rank (the number of properties) is assigned to the most informative property, i.e. the one that best separates the vertices
- The lowest rank (1.0) is assigned to the property that is selected last, i.e. the one that does not ‘agree’ with the rest of the properties
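The iterative selection can be sketched as follows. The transcript does not preserve the agreement measure from [NIPS ‘11], so this example substitutes the Rand index between the cluster labelings that each property produces on its own, and it scores each candidate against the properties that are still unranked. Both choices, and all names and sample labelings, are assumptions for illustration only.

```python
from itertools import combinations

def rand_index(a, b):
    """Fraction of vertex pairs on which two labelings agree (same/different)."""
    pairs = list(combinations(range(len(a)), 2))
    agree = sum((a[i] == a[j]) == (b[i] == b[j]) for i, j in pairs)
    return agree / len(pairs)

def rank_properties(labelings):
    """Assign rank |P| to the first selected property, down to 1.0 for the last.

    labelings maps a property name to the cluster labels obtained when that
    property is used individually.
    """
    unranked = list(labelings)
    ranks = {}
    rank = float(len(unranked))
    while unranked:
        def mean_agreement(p):
            others = [q for q in unranked if q != p]
            if not others:
                return 0.0
            return sum(rand_index(labelings[p], labelings[q])
                       for q in others) / len(others)
        best = max(unranked, key=mean_agreement)  # ties go to the earliest entry
        ranks[best] = rank
        unranked.remove(best)
        rank -= 1.0
    return ranks

# Hypothetical labelings for a six-vertex bibliography network.
labelings = {
    "coauthor": [0, 0, 0, 1, 1, 1],
    "area":     [0, 0, 0, 1, 1, 1],
    "gender":   [0, 1, 0, 1, 0, 1],
}
ranks = rank_properties(labelings)
```

In this toy input the noisy `gender` labeling disagrees with the other two, so it is selected last and receives the lowest rank of 1.0.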

Slide 20 of 35: Unified Similarity Matrix
- Combines the multiple edge-type and attribute properties with respect to the identified ranking
- Defined as the weighted sum of the individual similarity matrices
- Weights are defined by normalizing the rankings
- Contains all the similarity information about the network under study
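A minimal sketch of the weighted sum follows. The slide states only that the weights come from normalizing the rankings; dividing each rank by the sum of all ranks is an assumption here, as are the function name and the toy matrices.

```python
def unified_similarity(matrices, ranks):
    """Weighted sum of per-property similarity matrices.

    matrices maps a property name to an N x N list of lists; ranks maps the
    same names to their ranks. Weights are rank / sum(ranks) (assumed scheme).
    """
    total = sum(ranks[p] for p in matrices)
    weights = {p: ranks[p] / total for p in matrices}
    n = len(next(iter(matrices.values())))
    U = [[sum(weights[p] * matrices[p][i][j] for p in matrices)
          for j in range(n)]
         for i in range(n)]
    return U, weights

# Hypothetical two-vertex example with one edge type and one attribute.
matrices = {
    "coauthor": [[1.0, 1.0], [1.0, 1.0]],
    "gender":   [[1.0, 0.0], [0.0, 1.0]],
}
ranks = {"coauthor": 3.0, "gender": 1.0}
U, weights = unified_similarity(matrices, ranks)
```

Because the weights sum to 1 and each input matrix has entries in [0, 1], the combined matrix stays symmetric, non-negative, and within [0, 1], so it can be fed directly to spectral clustering.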

Slide 22 of 35: Generating the Final Clusters
- Calculate the normalized Laplacian of the Unified Similarity Matrix
- Perform eigendecomposition
- Apply k-means to the eigenspace of the top k eigenvectors
- Generate the final clusters
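These steps can be sketched with NumPy as below. The normalization D^(-1/2) S D^(-1/2), the row normalization of the embedding, and the small deterministic k-means with farthest-point initialization are assumptions chosen to keep the example self-contained; they are not taken from the presentation.

```python
import numpy as np

def spectral_embedding(S, k):
    """Rows of the top-k eigenvectors of D^(-1/2) S D^(-1/2), unit-normalized."""
    S = np.asarray(S, dtype=float)
    d = S.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(d)          # assumes every vertex has positive degree
    L = d_inv_sqrt[:, None] * S * d_inv_sqrt[None, :]
    _, eigvecs = np.linalg.eigh(L)         # eigenvalues come back in ascending order
    U = eigvecs[:, -k:]                    # eigenvectors of the k largest eigenvalues
    return U / np.linalg.norm(U, axis=1, keepdims=True)

def kmeans(X, k, iters=50):
    """Plain Lloyd iterations with deterministic farthest-point initialization."""
    centers = [X[0]]
    for _ in range(1, k):
        dist = np.min([np.sum((X - c) ** 2, axis=1) for c in centers], axis=0)
        centers.append(X[np.argmax(dist)])
    centers = np.array(centers)
    for _ in range(iters):
        sq = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = np.argmin(sq, axis=1)
        new_centers = np.array([
            X[labels == c].mean(axis=0) if np.any(labels == c) else centers[c]
            for c in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels

# Toy unified similarity matrix (assumed): two triangles joined by a weak edge.
S = np.zeros((6, 6))
for i, j, w in [(0, 1, 1.0), (0, 2, 1.0), (1, 2, 1.0),
                (3, 4, 1.0), (3, 5, 1.0), (4, 5, 1.0),
                (2, 3, 0.1)]:
    S[i, j] = S[j, i] = w

labels = kmeans(spectral_embedding(S, k=2), k=2)
```

On this toy input the two triangles end up in different clusters. For real data a library routine such as scikit-learn's `SpectralClustering` would be a more robust choice than this hand-written k-means.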

Slide 23 of 35: CAMIR Clustering Process Diagram
- Step 1. Identify importance of vertex properties: iteratively select the most informative property (output: properties ranking)
- Step 2. Efficiently combine vertex properties: normalize rankings and compute the unified similarity matrix (output: Unified Similarity Matrix)
- Step 3. Cluster the attributed multi-graph: apply spectral clustering to generate the final clusters (output: Cluster 1, Cluster 2, …, Cluster k)

Slide 25 of 35: Evaluation - Datasets
Real-world datasets:
- DBLP: bibliography networks (DBLP-1K, DBLP-10K)
- GoogleSP-23: Google software packages

Dataset                  DBLP-1K  DBLP-10K  GoogleSP-23
Attributes               2        2         5
Edge Types               1        1         2
Total Vertex Properties  3        3         7
[Node and edge counts are not preserved in the transcript]

Synthetic datasets (cells as extracted, some values lost): {100, 500, 1 000, 5 000, } 1 000 {1 000 – } ~ {2, 4, 8, 16, 32} 11 5 {3, 5, 9, 17, 33}

Slide 26 of 35: Evaluation Measures
- Entropy
  - Low entropy corresponds to high attribute homogeneity
- Normalized Mutual Information (NMI)
  - High NMI corresponds to high similarity between the resulting clustering and the ground truth
  - An NMI of 1 indicates a perfect match
- Runtime
  - Quad-core i7 2.8 GHz, 8 GB RAM
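The two quality measures can be sketched as follows. The exact formulas used in the evaluation are not given in the transcript, so this example assumes a cluster-size-weighted average of per-cluster attribute entropy and NMI normalized by the geometric mean of the two label entropies. All function names and sample labels are hypothetical.

```python
import math
from collections import Counter

def entropy_of(labels):
    """Shannon entropy (natural log) of a list of discrete labels."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())

def cluster_attribute_entropy(clusters, attribute):
    """Average entropy of one attribute inside each cluster, weighted by size."""
    n = len(clusters)
    total = 0.0
    for c in set(clusters):
        members = [attribute[i] for i in range(n) if clusters[i] == c]
        total += (len(members) / n) * entropy_of(members)
    return total

def nmi(a, b):
    """Mutual information of two labelings divided by sqrt(H(a) * H(b))."""
    n = len(a)
    joint = Counter(zip(a, b))
    ca, cb = Counter(a), Counter(b)
    mi = sum((nab / n) * math.log(nab * n / (ca[x] * cb[y]))
             for (x, y), nab in joint.items())
    ha, hb = entropy_of(a), entropy_of(b)
    if ha == 0.0 or hb == 0.0:
        return 1.0 if ha == hb else 0.0  # degenerate single-cluster labelings
    return mi / math.sqrt(ha * hb)

# Hypothetical ground truth and a prediction that only renames the clusters.
truth = [0, 0, 0, 1, 1, 1]
pred = [1, 1, 1, 0, 0, 0]
area = ["IR", "IR", "IR", "DM", "DM", "DM"]

score = nmi(truth, pred)
homogeneous = cluster_attribute_entropy(pred, area)
mixed = cluster_attribute_entropy([0] * 6, area)
```

Renaming clusters does not change NMI, so the prediction above matches the ground truth perfectly, and clusters whose members all share one attribute value have zero entropy.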

Slide 27 of 35: State-of-the-Art Competitors
- SACluster [VLDB 2009]
  - Similarity is defined as the Random Walk distance in the augmented graph
- BAGC [SIGMOD 2012]
  - Uses Bayesian inference to update the parameters of the cluster distributions
- PICS [SDM 2012]
  - Compresses adjacency and attribute matrices
- HASCOP [WI 2013]
  - Heuristic, distance-based
  - Applies to attributed multi-graphs

Slide 28 of 35: Evaluation - Synthetic Datasets
- CAMIR entropy is always less than 0.5
  - High attribute homogeneity
- CAMIR NMI is at least 0.8 in all experiments
  - High-quality results
- Similar behavior as we increase the number of attributes

Slide 29 of 35: Evaluation - Synthetic Datasets
- CAMIR is the 2nd fastest algorithm
  - Less than 10 secs for up to 5000 vertices
- CAMIR on average outperforms almost all its competitors

Slide 30 of 35: Evaluation - Real-world Datasets
[Figure panels: DBLP-1K, DBLP-10K]
- CAMIR achieves the lowest entropy among its competitors
- Efficiently ranks and combines vertex properties
- Identifies clusters of arbitrary shapes and sizes (spectral clustering)

Slide 31 of 35: Evaluation - Real-world Datasets
[Figure panel: GoogleSP-23]
- CAMIR achieves low entropy
- CAMIR achieves high NMI
- Identifies a high percentage of software packages

Slide 32 of 35: Evaluation - Runtime and Entropy
[Table: runtime (secs) and entropy of CAMIR, BAGC, SACluster, PICS, and HASCOP on DBLP-1K, DBLP-10K, and GoogleSP-23; cell values not preserved in the transcript]
- CAMIR requires:
  - Less than 6 secs for ~1000 vertices
  - About 8 minutes for [count not preserved] vertices
- CAMIR achieves on average a 55% time and 60% entropy improvement
- BAGC is the fastest method, but achieves limited clustering quality
- HASCOP achieves slightly better results than CAMIR, but it is the slowest method

Slide 34 of 35: Summary
- A new approach for Clustering Attributed Multi-graphs with Information Ranking: CAMIR
- A new mechanism to rank and weigh vertex properties
  - Identifies the importance of each attribute and edge-type property
- A unified similarity matrix for attributed multi-graphs
  - Efficiently combines vertex properties
- Identifies clusters of arbitrary sizes and shapes
- Effective in terms of clustering accuracy and computational time

Andreas Papadopoulos - [DEXA 2015] Clustering Attributed Multi-graphs with Information Ranking Andreas Papadopoulos, Dimitrios Rafailidis, George Pallis, Marios D. Dikaiakos Department of Computer Science University of Cyprus Thank You!