Andreas Papadopoulos - [DEXA 2015] Clustering Attributed Multi-graphs with Information Ranking 26th International.

1 http://linc.ucy.ac.cy Andreas Papadopoulos - andpapad@cs.ucy.ac.cy [DEXA 2015] Clustering Attributed Multi-graphs with Information Ranking 26th International Conference on Database and Expert Systems Applications Sep. 1-4, 2015 Valencia, Spain Andreas Papadopoulos, Dimitrios Rafailidis, George Pallis, Marios D. Dikaiakos

2 The Real World: Information Networks. [Figure: example information network whose edges are labeled Friendship or Coauthor.]

4 Challenges. Identify the importance of each edge-type and attribute property. For instance, when clustering a bibliography network, the attribute 'area of interest' is important, while the attributes 'name' and 'gender' may introduce noise and reduce the clustering accuracy. Combine the attribute and structural vertex properties, given that edges and attributes are of different types.

5 Related Work. Existing methods pay limited attention to the differing importance of attributes and edge types: weights are mainly updated at each iteration, which increases computational cost and complexity, and the existence of multiple edge types is ignored. Spectral clustering is not used for clustering attributed graphs; it is used to identify dense clusters in attribute subspaces. Model-based: BAGC [SIGMOD '12, TKDD '14], CESNA [ICDM '13]. Distance-based: SACluster [VLDB '09, TKDD '11], PICS [SDM '12], HASCOP [WI '13].

6 Proposed Approach: CAMIR. Clustering Attributed Multi-graphs with Information Ranking (CAMIR): 1. Rank edge-type and attribute properties. 2. Construct a unified similarity matrix. 3. Adopt a spectral clustering technique to generate the final clusters.

7 Presentation Outline: Motivation; Problem Definition; Related Work; Background; Proposed Approach: CAMIR; Evaluation; Summary.

9 Background: Graph Partitioning. An edge represents the similarity of the two connected vertices. Finding the minimum cut of a graph minimizes inter-cluster similarities and identifies an optimal partitioning of the graph. Identifying a minimum cut is computationally difficult, but efficient approximations exist using linear algebra.
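To make the cut objective concrete, here is a minimal sketch that scores a bipartition by the total similarity crossing it and finds the best bipartition by exhaustive search. The toy matrix `S` and the function names are illustrative assumptions, not part of the talk; the exhaustive search is only meant to show why enumeration does not scale (the number of bipartitions grows exponentially with the number of vertices).

```python
from itertools import combinations

def cut_weight(similarity, part_a):
    """Total similarity on edges that cross from part_a to the remaining vertices."""
    part_a = set(part_a)
    n = len(similarity)
    return sum(
        similarity[i][j]
        for i in part_a
        for j in range(n)
        if j not in part_a
    )

def brute_force_min_cut(similarity):
    """Try every bipartition (smaller side first); feasible only for tiny graphs."""
    n = len(similarity)
    best = None
    for size in range(1, n // 2 + 1):
        for part in combinations(range(n), size):
            weight = cut_weight(similarity, part)
            if best is None or weight < best[0]:
                best = (weight, set(part))
    return best

# Hypothetical toy graph: two triangles joined by one weak edge (2-3, weight 0.2).
S = [
    [0, 1, 1, 0, 0, 0],
    [1, 0, 1, 0, 0, 0],
    [1, 1, 0, 0.2, 0, 0],
    [0, 0, 0.2, 0, 1, 1],
    [0, 0, 0, 1, 0, 1],
    [0, 0, 0, 1, 1, 0],
]

best_weight, best_part = brute_force_min_cut(S)
print(best_weight, sorted(best_part))
```

On this toy graph the cheapest cut separates the two triangles, because only the weak edge crosses it.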

10 Background: Spectral Clustering. Based on the graph Laplacian, or Laplacian matrix. Given a similarity matrix, the normalized symmetric Laplacian L is defined from it. The eigenvectors corresponding to the top k eigenvalues give the projection of the graph into R^(|V| x k). In this space the data is easily separable into clusters, e.g. using k-means.
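The formulas on this slide are not reproduced in the transcript. As a hedged reconstruction, the standard definition of the normalized symmetric Laplacian is given below; the symbols S (similarity matrix) and D (diagonal degree matrix) are chosen here for illustration and may differ from the authors' notation. The entry-wise form assumes S has a zero diagonal and every vertex has positive degree.

```latex
D_{ii} = \sum_{j} S_{ij}, \qquad
L = I - D^{-1/2}\, S\, D^{-1/2}, \qquad
L_{ij} =
\begin{cases}
1 & \text{if } i = j, \\[4pt]
-\dfrac{S_{ij}}{\sqrt{D_{ii}\, D_{jj}}} & \text{if } i \neq j .
\end{cases}
```

With this sign convention, the embedding uses the eigenvectors of the k smallest eigenvalues of L, which are the same vectors as those of the k largest eigenvalues of D^{-1/2} S D^{-1/2}.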

11 Background: Spectral Clustering (example). [Figure: a 10-vertex example graph with its adjacency matrix and its normalized Laplacian matrix, which has 1 on the diagonal and negative off-diagonal entries.]
Top 3 eigenvectors (one row per vertex):
Vertex   U1       U2       U3
1        -0.659   -0.705    0.263
2        -0.620    0.747    0.241
3        -0.595   -0.486   -0.640
4        -0.668    0.711   -0.221
5        -0.723    0.395    0.566
6        -0.669    0.414   -0.617
7        -0.332   -0.486   -0.808
8        -0.668    0.711   -0.221
9        -0.379   -0.491    0.784
10       -0.659   -0.705    0.263

12 How do we define the similarity matrix for an attributed multi-graph?

13 Background: Similarity Matrices. [Figure: example attribute values (labels IR, DM, AI, IR; numeric values passed through a Gaussian kernel) and edges, each mapped to an N x N matrix with entries in [0,1].] This yields (#edge types + #attributes) symmetric, non-negative similarity matrices. How do we efficiently combine the similarity matrices?
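A minimal sketch of how per-attribute similarity matrices with entries in [0,1] could be built. The exact-match rule for categorical labels, the kernel bandwidth `sigma`, and all names below are assumptions for illustration; the slide only states that a Gaussian kernel is used and that every matrix is symmetric and non-negative.

```python
import math

def categorical_similarity(values):
    """1.0 where two vertices carry the same label, 0.0 otherwise (assumed rule)."""
    return [[1.0 if a == b else 0.0 for b in values] for a in values]

def gaussian_similarity(values, sigma=1.0):
    """Gaussian kernel exp(-(x - y)^2 / (2 * sigma^2)) on a numeric attribute."""
    return [
        [math.exp(-((a - b) ** 2) / (2.0 * sigma ** 2)) for b in values]
        for a in values
    ]

# Hypothetical vertex attributes for four vertices.
areas = ["IR", "DM", "AI", "IR"]
counts = [5, 1, 7, 12]

S_area = categorical_similarity(areas)
S_counts = gaussian_similarity(counts, sigma=5.0)
```

Both matrices are symmetric with ones on the diagonal, so they can be treated uniformly with an edge-based similarity matrix of the same size.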

15 CAMIR Overview. 1. Rank vertex properties and calculate their weights accordingly, by considering the agreement among vertex properties. 2. Compute a unified similarity matrix, by combining all vertex properties based on their ranking. 3. Generate the final clusters, by adopting a spectral clustering approach.

16 Presentation Outline: Motivation; Problem Definition; Related Work; Background; Proposed Approach: CAMIR (1. Information Ranking, 2. Unified Similarity Matrix, 3. Generate the final clusters); Evaluation; Summary.

17 Information Ranking. Rank the attribute and edge-type properties: iteratively select the most informative property from the set of unranked properties. The most informative property [NIPS '11] has the highest 'agreement' with the other properties, where properties 'agree' if they assign vertices the same cluster labels when used individually.

18 Information Ranking. Rank the attribute and edge-type properties: iteratively select the most informative property from the set of unranked properties. From the set of properties, the most informative property p is selected [NIPS '11]. The highest rank (the number of properties) is assigned to the most informative property, i.e. the one that best separates the vertices. The lowest rank (1.0) is assigned to the property that is selected last, i.e. the one that does not 'agree' with the rest of the properties.
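The iterative selection can be sketched as follows. This is an illustration under stated assumptions, not the authors' procedure: agreement between two properties is measured here with the Rand index of the clusterings each property produces on its own, and a property's score is its mean agreement with the other still-unranked properties. The [NIPS '11] criterion the slides cite may differ. All names and the toy labels are hypothetical.

```python
from itertools import combinations

def pair_agreement(labels_a, labels_b):
    """Rand index: fraction of vertex pairs that both clusterings treat the same way."""
    pairs = list(combinations(range(len(labels_a)), 2))
    agree = sum(
        (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
        for i, j in pairs
    )
    return agree / len(pairs)

def rank_properties(clusterings):
    """Map each property to a rank: len(clusterings) for the first pick, 1 for the last."""
    unranked = dict(clusterings)
    ranks = {}
    next_rank = len(clusterings)
    while unranked:
        def mean_agreement(name):
            others = [o for o in unranked if o != name]
            if not others:
                return 0.0
            total = sum(pair_agreement(unranked[name], unranked[o]) for o in others)
            return total / len(others)
        # Ties are broken by insertion order, because max() keeps the first maximum.
        best = max(unranked, key=mean_agreement)
        ranks[best] = next_rank
        next_rank -= 1
        del unranked[best]
    return ranks

# Hypothetical per-property cluster labels for six vertices.
clusterings = {
    "area":       [0, 0, 0, 1, 1, 1],
    "coauthor":   [0, 0, 0, 1, 1, 1],
    "friendship": [0, 0, 1, 1, 1, 1],
    "gender":     [0, 1, 0, 1, 0, 1],
}
ranks = rank_properties(clusterings)
print(ranks)
```

In this toy input 'gender' agrees least with the other properties and ends up with the lowest rank, which matches the intuition on the Challenges slide that such attributes mostly add noise.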

20 Unified Similarity Matrix. Combines the multiple edge-type and attribute properties with respect to the identified ranking. Defined as the weighted sum of the individual similarity matrices, where the weights are obtained by normalizing the rankings. Contains all the similarity information about the network under study.
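A minimal sketch of the weighted sum described above. The slide says only that the weights come from normalizing the rankings; dividing each rank by the sum of all ranks is an assumption made here, and the function names and toy matrices are hypothetical.

```python
def rank_weights(ranks):
    """Normalize ranks so that the weights sum to 1 (assumed normalization)."""
    total = sum(ranks.values())
    return {name: rank / total for name, rank in ranks.items()}

def unified_similarity(matrices, weights):
    """Entry-wise weighted sum of the per-property similarity matrices."""
    names = list(matrices)
    n = len(matrices[names[0]])
    return [
        [sum(weights[p] * matrices[p][i][j] for p in names) for j in range(n)]
        for i in range(n)
    ]

# Hypothetical ranks and 3 x 3 similarity matrices for three properties.
ranks = {"area": 3, "coauthor": 2, "gender": 1}
matrices = {
    "area":     [[1, 1, 0], [1, 1, 0], [0, 0, 1]],
    "coauthor": [[1, 0.5, 0], [0.5, 1, 0.2], [0, 0.2, 1]],
    "gender":   [[1, 0, 1], [0, 1, 0], [1, 0, 1]],
}

weights = rank_weights(ranks)
U = unified_similarity(matrices, weights)
```

Because every input matrix has entries in [0,1] and the weights sum to 1, the unified matrix also stays in [0,1] and remains symmetric.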

22 Generating the Final Clusters. Calculate the normalized Laplacian of the unified similarity matrix. Perform eigendecomposition. Apply k-means to the eigenspace of the top k eigenvectors. Generate the final clusters.
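The four steps can be sketched with NumPy as below. This is a generic normalized spectral clustering sketch, not the authors' code: it assumes the Laplacian L = I - D^(-1/2) S D^(-1/2), takes the eigenvectors of the k smallest eigenvalues of that L, row-normalizes the embedding, and uses a small k-means with deterministic farthest-point initialization so the example is reproducible. The function names and the toy matrix are hypothetical.

```python
import numpy as np

def spectral_embedding(S, k):
    """Rows are vertices embedded with the k smallest eigenvectors of the normalized Laplacian."""
    S = np.asarray(S, dtype=float)
    degree = S.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(degree)  # assumes every vertex has positive degree
    L = np.eye(len(S)) - d_inv_sqrt[:, None] * S * d_inv_sqrt[None, :]
    _, eigenvectors = np.linalg.eigh(L)  # eigenvalues come back in ascending order
    U = eigenvectors[:, :k]
    return U / np.linalg.norm(U, axis=1, keepdims=True)

def kmeans(X, k, iterations=50):
    """Plain Lloyd iterations with farthest-point initialization (deterministic)."""
    centers = [X[0]]
    for _ in range(1, k):
        dist = np.min([np.sum((X - c) ** 2, axis=1) for c in centers], axis=0)
        centers.append(X[np.argmax(dist)])
    centers = np.array(centers)
    for _ in range(iterations):
        sq_dist = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = np.argmin(sq_dist, axis=1)
        new_centers = np.array([
            X[labels == c].mean(axis=0) if np.any(labels == c) else centers[c]
            for c in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels

# Hypothetical unified similarity matrix: two triangles joined by one weak edge.
S = np.array([
    [0, 1, 1, 0, 0, 0],
    [1, 0, 1, 0, 0, 0],
    [1, 1, 0, 0.2, 0, 0],
    [0, 0, 0.2, 0, 1, 1],
    [0, 0, 0, 1, 0, 1],
    [0, 0, 0, 1, 1, 0],
], dtype=float)

labels = kmeans(spectral_embedding(S, k=2), k=2)
print(labels)
```

The cluster ids themselves are arbitrary; what matters is that vertices 0-2 share one label and vertices 3-5 share the other.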

23 CAMIR Clustering Process Diagram. Step 1, identify the importance of vertex properties: iteratively select the most informative property (properties ranking). Step 2, efficiently combine vertex properties: normalize the rankings and compute the unified similarity matrix. Step 3, cluster the attributed multi-graph: apply spectral clustering to generate the final clusters (Cluster 1, Cluster 2, …, Cluster k).

25 Evaluation - Datasets
Real-world datasets: DBLP (bibliography networks) and GoogleSP-23 (Google software packages).
Dataset                    DBLP-1K    DBLP-10K    GoogleSP-23
Nodes                      1 000      10 000      1 297
Edges                      17 128     65 734      268 956
Attributes                 2          2           5
Edge Types                 1          1           2
Total Vertex Properties    3          3           7
Synthetic datasets (two configurations):
                           Configuration A                      Configuration B
Nodes                      {100, 500, 1 000, 5 000, 10 000}     1 000
Edges                      {1 000 – 1 230 000}                  ~ 40 000
Attributes                 4                                    {2, 4, 8, 16, 32}
Edge Types                 1                                    1
Total Vertex Properties    5                                    {3, 5, 9, 17, 33}

26 Evaluation Measures. Entropy: low entropy corresponds to high attribute homogeneity. Normalized Mutual Information (NMI): high NMI corresponds to high similarity between the resulting clustering and the ground truth; an NMI of 1 indicates a perfect match. Runtime: measured on a quad-core i7 at 2.8 GHz with 8 GB RAM.
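A minimal sketch of the two quality measures. The slides do not give the exact formulas, so two assumptions are made here: attribute entropy is the cluster-size-weighted average of the per-cluster entropy of one attribute, and NMI is the mutual information divided by the geometric mean of the two label entropies (other normalizations exist). Function names are hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (base 2) of a list of labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def clustering_entropy(clusters, attribute):
    """Size-weighted average entropy of one attribute inside each cluster."""
    n = len(clusters)
    total = 0.0
    for c in set(clusters):
        members = [attribute[i] for i in range(n) if clusters[i] == c]
        total += len(members) / n * entropy(members)
    return total

def nmi(predicted, truth):
    """Mutual information normalized by sqrt(H(predicted) * H(truth))."""
    n = len(predicted)
    joint = Counter(zip(predicted, truth))
    count_p, count_t = Counter(predicted), Counter(truth)
    mutual = sum(
        (c / n) * math.log2(c * n / (count_p[a] * count_t[b]))
        for (a, b), c in joint.items()
    )
    h_p, h_t = entropy(predicted), entropy(truth)
    if h_p == 0.0 or h_t == 0.0:
        # Degenerate case: at least one side puts everything in a single cluster.
        return 1.0 if h_p == h_t else 0.0
    return mutual / math.sqrt(h_p * h_t)

# Relabeling clusters does not change NMI: these two clusterings match perfectly.
print(nmi([0, 0, 1, 1], [1, 1, 0, 0]))
```

A clustering whose clusters are pure in an attribute has entropy 0 for that attribute, while clusters that mix two values evenly give entropy 1.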

27 State-of-the-Art Competitors. SACluster [VLDB 2009]: similarity is defined as the random walk distance in the augmented graph. BAGC [SIGMOD 2012]: uses Bayesian inference to update the parameters of the cluster distributions. PICS [SDM 2012]: compresses the adjacency and attribute matrices. HASCOP [WI 2013]: heuristic and distance-based; applies to attributed multi-graphs.

28 Evaluation - Synthetic Datasets. CAMIR entropy is always less than 0.5, indicating high attribute homogeneity. CAMIR NMI is at least 0.8 in all experiments, indicating high-quality results. The behavior is similar as the number of attributes increases.

29 Evaluation - Synthetic Datasets. CAMIR is the second-fastest algorithm, taking less than 10 secs for up to 5 000 vertices. On average, CAMIR outperforms almost all its competitors.

30 Evaluation - Real-world Datasets (DBLP-1K, DBLP-10K). CAMIR achieves the lowest entropy among its competitors. It efficiently ranks and combines vertex properties and, through spectral clustering, identifies clusters of arbitrary shapes and sizes.

31 Evaluation - Real-world Datasets (GoogleSP-23). CAMIR achieves low entropy and high NMI, and identifies a high percentage of software packages.

32 Evaluation - Runtime and Entropy
              DBLP-1K                  DBLP-10K                 GoogleSP-23
Algorithm     Runtime (secs)  Entropy  Runtime (secs)  Entropy  Runtime (secs)  Entropy
CAMIR         1.20            0.299    520.48          0.255    5.98            0.387
BAGC          0.15            1.448    0.35            1.649    0.81            1.573
SACluster     3.22            0.729    433.228         1.066    30.57           1.513
PICS          4.87            1.280    495.17          1.877    476.49          2.178
HASCOP        882.17          0.838    32957           1.306    4675            0.061
CAMIR requires less than 6 secs for ~1 000 vertices and about 8 minutes for 10 000 vertices. CAMIR achieves on average 55% time and 60% entropy improvement. BAGC is the fastest method, but achieved limited clustering quality. HASCOP achieved slightly better results than CAMIR, but it is the slowest method.

34 Summary. A new approach for Clustering Attributed Multi-graphs with Information Ranking: CAMIR. A new mechanism to rank and weigh vertex properties, which identifies the importance of each attribute and edge-type property. A unified similarity matrix for attributed multi-graphs, which efficiently combines vertex properties. Identifies clusters of arbitrary sizes and shapes. Effective in terms of clustering accuracy and computational time.

35 Clustering Attributed Multi-graphs with Information Ranking. Andreas Papadopoulos, Dimitrios Rafailidis, George Pallis, Marios D. Dikaiakos. Department of Computer Science, University of Cyprus. Thank You!

