Andreas Papadopoulos - [DEXA 2015] Clustering Attributed Multi-graphs with Information Ranking 26th International.

http://linc.ucy.ac.cy
Andreas Papadopoulos - andpapad@cs.ucy.ac.cy
[DEXA 2015] Clustering Attributed Multi-graphs with Information Ranking
26th International Conference on Database and Expert Systems Applications, Sep. 1-4, 2015, Valencia, Spain
Andreas Papadopoulos, Dimitrios Rafailidis, George Pallis, Marios D. Dikaiakos

The Real World: Information Networks
[Figure: example information networks whose edges are of type Friendship and Coauthor]

Challenges
- Identify the importance of each edge-type/attribute property
  - For instance, when clustering a bibliography network:
  - The attribute ‘area of interest’ is important
  - The attributes ‘name’ and ‘gender’ may introduce noise and reduce the clustering accuracy
- Combine the attribute and structural vertex properties
  - Edges and attributes are of different types

Related Work
- Limited attention to the different importance of attributes/edge-types
- Weights are mainly updated at each iteration
- Ignore the existence of multiple edge-types
- Increases computational cost and complexity
- Spectral clustering is not used for clustering attributed graphs
- Used to identify dense clusters in attribute subspaces
- Model-Based: BAGC [SIGMOD ‘12, TKDD ‘14], CESNA [ICDM ‘13]
- Distance-Based: SACluster [VLDB ‘09, TKDD ‘11], PICS [SDM ‘12], HASCOP [WI ‘13]

Proposed Approach: CAMIR
Clustering Attributed Multi-graphs with Information Ranking: CAMIR
1. Rank edge-type and attribute properties
2. Construct a unified similarity matrix
3. Adopt a spectral clustering technique to generate the final clusters

Presentation Outline
- Motivation
- Problem Definition
- Related Work
- Background
- Proposed Approach: CAMIR
- Evaluation
- Summary

Background: Graph Partitioning
- An edge represents the similarity of the two connected vertices
- Find the minimum cut of a graph
  - Minimizes inter-cluster similarities
  - Identifies an optimal partitioning of the graph
- Identifying a minimum cut is computationally difficult
  - Efficient approximations using linear algebra
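To make the cut objective above concrete, here is a minimal sketch that scores a given partitioning by the total similarity on edges crossing cluster boundaries. The function name `cut_weight` and the toy similarity matrix are assumptions made for illustration; they are not taken from the slides.

```python
def cut_weight(similarity, labels):
    """Sum of similarities on edges whose endpoints lie in different clusters."""
    n = len(labels)
    total = 0.0
    for i in range(n):
        for j in range(i + 1, n):  # each undirected pair is counted once
            if labels[i] != labels[j]:
                total += similarity[i][j]
    return total


# Hypothetical toy graph: two triangles joined by one weak edge (2-3).
S = [
    [0, 1, 1, 0, 0, 0],
    [1, 0, 1, 0, 0, 0],
    [1, 1, 0, 0.1, 0, 0],
    [0, 0, 0.1, 0, 1, 1],
    [0, 0, 0, 1, 0, 1],
    [0, 0, 0, 1, 1, 0],
]

good = cut_weight(S, [0, 0, 0, 1, 1, 1])  # cuts only the weak edge
bad = cut_weight(S, [0, 0, 1, 1, 1, 0])   # cuts four strong edges
```

A partitioning with a smaller cut weight separates vertices that are less similar to each other. Trying every partitioning does not scale, which is why the slides turn to linear-algebra approximations.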

Background: Spectral Clustering
- Based on the graph Laplacian, or Laplacian matrix
- Given a similarity matrix S with diagonal degree matrix D, the normalized symmetric Laplacian L is defined as L = I - D^{-1/2} S D^{-1/2}
- The eigenvectors corresponding to the top k eigenvalues are the projection of the graph into R^{|V| x k}
- Data is then easily separable into clusters, e.g. using k-means
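As a small illustration of the Laplacian step, the sketch below computes the normalized symmetric Laplacian, assuming the standard form L = I - D^{-1/2} S D^{-1/2} with D the diagonal matrix of row sums of S. The function name and the handling of isolated vertices are assumptions of this sketch.

```python
import math


def normalized_laplacian(S):
    """Return L = I - D^{-1/2} S D^{-1/2} for a symmetric similarity matrix S."""
    n = len(S)
    degree = [sum(row) for row in S]
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            identity = 1.0 if i == j else 0.0
            if degree[i] > 0 and degree[j] > 0:
                L[i][j] = identity - S[i][j] / math.sqrt(degree[i] * degree[j])
            else:
                # Assumption: an isolated vertex keeps only its identity entry.
                L[i][j] = identity
    return L


# Hypothetical path graph 0 - 1 - 2 with unit similarities (degrees 1, 2, 1).
S = [
    [0, 1, 0],
    [1, 0, 1],
    [0, 1, 0],
]
L = normalized_laplacian(S)
```

With degrees 1 and 2, the off-diagonal entry is -1/sqrt(1*2), about -0.707, and the diagonal is 1 because the toy graph has no self-similarity.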

Background: Spectral Clustering
[Figure: a 10-vertex example graph shown with its adjacency matrix and its Laplacian matrix]

Top 3 eigenvectors
Vertex | U1 | U2 | U3
1 | -0.659 | -0.705 | 0.263
2 | -0.620 | 0.747 | 0.241
3 | -0.595 | -0.486 | -0.640
4 | -0.668 | 0.711 | -0.221
5 | -0.723 | 0.395 | 0.566
6 | -0.669 | 0.414 | -0.617
7 | -0.332 | -0.486 | -0.808
8 | -0.668 | 0.711 | -0.221
9 | -0.379 | -0.491 | 0.784
10 | -0.659 | -0.705 | 0.263

How do we define the similarity matrix for an attributed multi-graph?

Background: Similarity Matrices
[Figure: attribute values such as IR, DM, AI, IR mapped to a similarity matrix in [0,1]^{N x N}; numeric values mapped through a Gaussian kernel to a similarity matrix in [0,1]^{N x N}; edges mapped to a similarity matrix in [0,1]^{N x N}]
- Result: (#Edge types + #Attributes) symmetric, non-negative similarity matrices
- How do we efficiently combine the similarity matrices?
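The sketch below shows one way to build a similarity matrix per vertex property. Only the Gaussian kernel is named on the slide. The exact-match rule for categorical values, the binary edge similarity, the value of `sigma`, and all sample data are assumptions made for illustration.

```python
import math


def gaussian_similarity(values, sigma=1.0):
    """Gaussian kernel on a numeric attribute; entries fall in (0, 1]."""
    return [
        [math.exp(-((a - b) ** 2) / (2 * sigma ** 2)) for b in values]
        for a in values
    ]


def categorical_similarity(values):
    """Assumed rule: 1.0 when two vertices share the value, else 0.0."""
    return [[1.0 if a == b else 0.0 for b in values] for a in values]


def edge_similarity(n, edges):
    """Assumed rule: 1.0 for each undirected edge of one edge type."""
    S = [[0.0] * n for _ in range(n)]
    for u, v in edges:
        S[u][v] = S[v][u] = 1.0
    return S


# Hypothetical data for four vertices.
area = ["IR", "DM", "AI", "IR"]
paper_count = [5, 1, 7, 12]
coauthor_edges = [(0, 1), (0, 3), (2, 3)]

S_area = categorical_similarity(area)
S_count = gaussian_similarity(paper_count, sigma=5.0)
S_coauthor = edge_similarity(4, coauthor_edges)
```

Each property yields one symmetric, non-negative N x N matrix with entries in [0, 1], so matrices from attributes and from edge types can later be combined on equal footing.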

CAMIR Overview
1. Rank vertex properties and calculate their weights accordingly
   - By considering the agreement among vertex properties
2. Compute a unified similarity matrix
   - By combining all vertex properties based on their ranking
3. Generate the final clusters
   - By adopting a spectral clustering approach

Presentation Outline
- Motivation
- Problem Definition
- Related Work
- Background
- Proposed Approach: CAMIR
  1. Information Ranking
  2. Unified Similarity Matrix
  3. Generate the final clusters
- Evaluation
- Summary

Information Ranking
- Rank attribute and edge-type properties
- Iteratively select the most informative property from the set of unranked properties
- Most informative property [NIPS ’11]: the one with the highest ‘agreement’ with the other properties
  - Properties ‘agree’ when they assign vertices the same cluster labels when used individually

Information Ranking
- From the set of properties, the most informative property p is selected [NIPS ‘11]
- The highest rank (the number of properties) is assigned to the most informative property
  - i.e. the property that best separates the vertices
- The lowest rank (1.0) is assigned to the property that is selected last
  - i.e. the property that does not ‘agree’ with the rest of the properties
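The ranking loop can be sketched as follows. This is not the criterion from [NIPS ‘11], which the slides do not spell out. As a stand-in, agreement between two properties is measured here with the Rand index of the cluster labels each property produces on its own, and a property's score is its mean agreement with all other properties. The function names, the tie-breaking by name, and the sample labelings are assumptions.

```python
from itertools import combinations


def rand_index(a, b):
    """Fraction of vertex pairs on which two labelings agree (same vs. different cluster)."""
    pairs = list(combinations(range(len(a)), 2))
    agree = sum((a[i] == a[j]) == (b[i] == b[j]) for i, j in pairs)
    return agree / len(pairs)


def rank_properties(labelings):
    """Map each property to a rank from len(labelings) (most informative) down to 1.0."""
    names = sorted(labelings)  # sorted so that ties are broken by name

    def mean_agreement(p):
        others = [q for q in names if q != p]
        if not others:
            return 0.0
        return sum(rand_index(labelings[p], labelings[q]) for q in others) / len(others)

    unranked = list(names)
    ranks = {}
    rank = float(len(names))
    while unranked:
        # Select the unranked property that agrees most with the others.
        best = max(unranked, key=mean_agreement)
        ranks[best] = rank
        unranked.remove(best)
        rank -= 1.0
    return ranks


# Hypothetical cluster labels obtained by clustering on each property alone.
labelings = {
    "coauthor": [0, 0, 0, 1, 1, 1],
    "area": [0, 0, 0, 1, 1, 1],
    "gender": [0, 1, 0, 1, 0, 1],
}
ranks = rank_properties(labelings)
```

In this toy input, `area` and `coauthor` produce identical labelings and therefore receive the two highest ranks, while `gender` disagrees with both and receives rank 1.0. Because the agreement scores are fixed up front, the loop amounts to a sort here. A criterion that re-evaluates agreement after each selection would keep the same loop structure.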

Unified Similarity Matrix
- Combines the multiple edge-type and attribute properties with respect to the identified ranking
- Defined as the weighted sum of the individual similarity matrices
  - Weights are defined by normalizing the rankings
- Contains all the similarity information about the network under study
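A minimal sketch of the weighted sum follows. The slide says only that weights come from normalizing the rankings, so dividing each rank by the sum of all ranks is an assumption, as are the function name and the toy matrices.

```python
def unified_similarity(matrices, ranks):
    """Weighted sum of per-property similarity matrices.

    Assumption: weight = rank / sum of ranks, so the weights sum to 1.
    """
    total_rank = sum(ranks.values())
    weights = {p: r / total_rank for p, r in ranks.items()}
    n = len(next(iter(matrices.values())))
    U = [[0.0] * n for _ in range(n)]
    for p, S in matrices.items():
        for i in range(n):
            for j in range(n):
                U[i][j] += weights[p] * S[i][j]
    return U, weights


# Hypothetical ranks and 2 x 2 similarity matrices for three properties.
ranks = {"area": 3.0, "coauthor": 2.0, "gender": 1.0}
matrices = {
    "area": [[1.0, 1.0], [1.0, 1.0]],
    "coauthor": [[0.0, 1.0], [1.0, 0.0]],
    "gender": [[1.0, 0.0], [0.0, 1.0]],
}
U, weights = unified_similarity(matrices, ranks)
```

With these ranks the weights are 1/2, 1/3 and 1/6, so the highest-ranked property contributes most to every entry. If each input matrix is symmetric with entries in [0, 1], the result is as well.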

Generating the Final Clusters
1. Calculate the normalized Laplacian of the Unified Similarity Matrix
2. Perform eigendecomposition
3. Apply k-means to the eigenspace of the top k eigenvectors
4. Generate the final clusters
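The four steps can be sketched end to end with NumPy. Several choices are assumptions of this sketch rather than details from the slides. The Laplacian is taken as L = I - D^{-1/2} S D^{-1/2}, so the sketch uses the eigenvectors of the k smallest eigenvalues of L, which are the same eigenvectors as those of the k largest eigenvalues of D^{-1/2} S D^{-1/2}. Rows of the embedding are scaled to unit length, and k-means uses a deterministic farthest-first initialization.

```python
import numpy as np


def spectral_clusters(S, k, n_iter=50):
    """Cluster vertices from a symmetric similarity matrix S into k groups."""
    S = np.asarray(S, dtype=float)
    d = S.sum(axis=1)
    d_inv_sqrt = np.where(d > 0, 1.0 / np.sqrt(np.where(d > 0, d, 1.0)), 0.0)
    L = np.eye(len(S)) - d_inv_sqrt[:, None] * S * d_inv_sqrt[None, :]

    # eigh returns eigenvalues in ascending order; keep the first k eigenvectors.
    _, eigvecs = np.linalg.eigh(L)
    U = eigvecs[:, :k]
    norms = np.linalg.norm(U, axis=1, keepdims=True)
    U = U / np.where(norms > 0, norms, 1.0)  # unit-length rows (assumption)

    # Farthest-first initialization keeps the sketch deterministic.
    centers = [U[0]]
    for _ in range(1, k):
        dist = np.min([np.sum((U - c) ** 2, axis=1) for c in centers], axis=0)
        centers.append(U[np.argmax(dist)])
    centers = np.array(centers)

    # Plain Lloyd iterations.
    for _ in range(n_iter):
        dists = ((U[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        new_centers = np.array([
            U[labels == c].mean(axis=0) if np.any(labels == c) else centers[c]
            for c in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels


# Hypothetical unified similarity matrix: two triangles joined by a weak edge.
S = [
    [0, 1, 1, 0, 0, 0],
    [1, 0, 1, 0, 0, 0],
    [1, 1, 0, 0.1, 0, 0],
    [0, 0, 0.1, 0, 1, 1],
    [0, 0, 0, 1, 0, 1],
    [0, 0, 0, 1, 1, 0],
]
labels = spectral_clusters(S, k=2)
```

On this toy matrix the two triangles end up in different clusters. The numeric label given to each cluster is arbitrary.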

CAMIR Clustering Process Diagram
- Step 1. Identify importance of vertex properties
  - Properties ranking: iteratively select the most informative property
- Step 2. Efficiently combine vertex properties
  - Unified Similarity Matrix: normalize rankings and compute the unified similarity matrix
- Step 3. Cluster the attributed multi-graph
  - Generate the final clusters: apply spectral clustering to obtain Cluster 1, Cluster 2, …, Cluster k

Evaluation - Datasets
Real-World Datasets
- DBLP: Bibliography Networks
- GoogleSP23: Google Software Packages

Dataset | DBLP-1K | DBLP-10K | GoogleSP-23
Nodes | 1 000 | 10 000 | 1 297
Edges | 17 128 | 65 734 | 268 956
Attributes | 2 | 2 | 5
Edge Types | 1 | 1 | 2
Total Vertex Properties | 3 | 3 | 7

Synthetic Datasets

Dataset | Synthetic (set 1) | Synthetic (set 2)
Nodes | {100, 500, 1 000, 5 000, 10 000} | 1 000
Edges | {1 000 – 1 230 000} | ~ 40 000
Attributes | 4 | {2, 4, 8, 16, 32}
Edge Types | 1 | 1
Total Vertex Properties | 5 | {3, 5, 9, 17, 33}

Evaluation Measures
- Entropy
  - Low entropy corresponds to high attribute homogeneity
- Normalized Mutual Information (NMI)
  - High NMI corresponds to high similarity between the resulting clustering and the ground truth
  - An NMI value of 1 indicates a perfect match
- Runtime
  - Quad-core i7 2.8 GHz, 8 GB RAM
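Both quality measures can be sketched in a few lines. The slides do not give the exact formulas, so the definitions below are assumptions. Entropy is computed for a single attribute as the size-weighted average of the Shannon entropy of that attribute's values within each cluster. NMI is normalized by the arithmetic mean of the two label entropies. Other normalizations exist and may be what the authors used.

```python
import math
from collections import Counter


def shannon(values):
    """Shannon entropy (base 2) of the empirical distribution of values."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())


def cluster_entropy(labels, attribute):
    """Size-weighted average of the attribute's entropy inside each cluster."""
    n = len(labels)
    total = 0.0
    for c in set(labels):
        members = [attribute[i] for i in range(n) if labels[i] == c]
        total += len(members) / n * shannon(members)
    return total


def nmi(a, b):
    """Mutual information of two labelings divided by their mean entropy."""
    h_a, h_b = shannon(a), shannon(b)
    mutual = h_a + h_b - shannon(list(zip(a, b)))
    mean_entropy = (h_a + h_b) / 2
    return 1.0 if mean_entropy == 0 else mutual / mean_entropy


# Hypothetical ground truth, attribute values and two candidate clusterings.
truth = [0, 0, 0, 1, 1, 1]
area = ["IR", "IR", "IR", "DM", "DM", "DM"]
matching = [1, 1, 1, 0, 0, 0]   # same grouping as truth, labels swapped
mixed = [0, 1, 0, 1, 0, 1]

entropy_matching = cluster_entropy(matching, area)
entropy_mixed = cluster_entropy(mixed, area)
nmi_matching = nmi(truth, matching)
nmi_mixed = nmi(truth, mixed)
```

The matching clustering has zero entropy and an NMI of 1 even though its label numbers are swapped, while the mixed clustering scores worse on both measures.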

State-of-the-Art Competitors
- SACluster [VLDB 2009]
  - Similarity is defined as the Random Walk distance in the augmented graph
- BAGC [SIGMOD 2012]
  - Uses Bayesian inference to update the parameters of the clusters' distributions
- PICS [SDM 2012]
  - Compresses adjacency and attribute matrices
- HASCOP [WI 2013]
  - Heuristic distance-based
  - Applies to attributed multi-graphs

Evaluation - Synthetic Datasets
- CAMIR entropy is always less than 0.5
  - High attribute homogeneity
- CAMIR NMI is at least 0.8 in all experiments
  - High-quality results
- Similar behavior as we increase the number of attributes

Evaluation - Synthetic Datasets
- CAMIR is the 2nd fastest algorithm
  - Less than 10 secs for up to 5000 vertices
- CAMIR on average outperforms almost all its competitors

Evaluation - Real-world Datasets
[Figure: results on DBLP-1K and DBLP-10K]
- CAMIR achieves the lowest entropy among its competitors
  - Efficiently ranks and combines vertex properties
  - Identifies clusters of arbitrary shapes and sizes (spectral clustering)

Evaluation - Real-world Datasets
[Figure: results on GoogleSP-23]
- CAMIR achieves low entropy
- CAMIR achieves high NMI
  - Identifies a high percentage of software packages

Evaluation - Runtime and Entropy

Algorithm | DBLP-1K Runtime (secs) | DBLP-1K Entropy | DBLP-10K Runtime (secs) | DBLP-10K Entropy | GoogleSP23 Runtime (secs) | GoogleSP23 Entropy
CAMIR | 1.20 | 0.299 | 520.48 | 0.255 | 5.98 | 0.387
BAGC | 0.15 | 1.448 | 0.35 | 1.649 | 0.81 | 1.573
SACluster | 3.22 | 0.729 | 433.228 | 1.066 | 30.57 | 1.513
PICS | 4.87 | 1.280 | 495.17 | 1.877 | 476.49 | 2.178
HASCOP | 882.17 | 0.838 | 32957 | 1.306 | 4675 | 0.061

- CAMIR requires:
  - Less than 6 secs for ~1000 vertices
  - About 8 minutes for 10000 vertices
- CAMIR achieves on average 55% time and 60% entropy improvement
- BAGC is the fastest method, but achieved limited clustering quality
- HASCOP achieved slightly better results than CAMIR, but it is the slowest method

Summary
- A new approach for Clustering Attributed Multi-graphs with Information Ranking: CAMIR
- A new mechanism to rank and weigh vertex properties
  - Identifies the importance of each attribute and edge-type property
- A unified similarity matrix for attributed multi-graphs
  - Efficiently combines vertex properties
  - Identifies clusters of arbitrary sizes and shapes
- Effective in terms of clustering accuracy and computational time

Clustering Attributed Multi-graphs with Information Ranking
Andreas Papadopoulos, Dimitrios Rafailidis, George Pallis, Marios D. Dikaiakos
Department of Computer Science, University of Cyprus
Thank You!
