Detecting Community Structure in Network Seung Woo Son KAIST 2004 summer intensive studies on complex networks 2004. 8. 11.



Clustering of data
Partitional clustering methods
- Important technique in data analysis
- Divide the data according to natural classes
- Pattern recognition, learning, astrophysics, and network analysis
N multivariate data points in a D-dimensional vector space with a metric

On network
- N vertices (nodes)
- No prior information
- Only the edge (link) connectivity is known: structural information
How can we divide the network into several parts? = How can we find the “community” structure?
Applications: web pages on the same topic, hidden social relationships, distributing processes to processors in a parallel computer, etc.

Community, cluster
Functional modules in cellular and genetic networks
- P. Holme, M. Huss, and H. Jeong, Bioinformatics 19, 532 (2003).
- D. Wilkinson and B. A. Huberman, Proc. Natl. Acad. Sci. USA /pnas (2004).
- A. Vespignani, Nature Genetics 35, 118 (2003).
Cultural society or an important source of a person’s identity in a social network
- J. Scott, Social Network Analysis: A Handbook, Sage Publications, 2nd ed. (2000).
A bundle of web pages on common topics, etc.
Related terms: community, module, (cohesive) subgroup, cluster, clique, etc.
Computer science, mathematics, sociology, biology, and physics all meet in this community-finding problem.

Structural definition of community
Groups of vertices within which connections are dense, but between which connections are sparser.
- The definition is purely structural because we don’t have any prior information about the network.
Quality measure: modularity

Key points (Highlight)
What property or measure of the network is used in this algorithm or method?
- Eigenvalues and eigenvectors, spectrum of the adjacency matrix.
- Edge betweenness, information centrality.
- Distance, dissimilarity index, edge clustering coefficient, etc.
Agglomerative or divisive?
What prior information is required?
- Whether there is community structure or not.
- How many modules there are.
Quality of the partitioning results and computational complexity.
Physical meaning?
We will review about 11 different recently studied methods. If you are bored, ask me a question.

1. Spectral bisection (old one)
M. Fiedler, Czech. Math. J. 23, 298 (1973)
A. Pothen, H. Simon, and K.-P. Liou, SIAM J. Matrix Anal. Appl. 11, 430 (1990)
F. R. K. Chung, Spectral Graph Theory, Amer. Math. Soc. (1997)
Laplacian of an n-vertex undirected graph G: $L = D - A$
- D is the diagonal matrix of vertex degrees k.
- A is the adjacency matrix.
The vector $(1, 1, \dots, 1)$ is always an eigenvector of L with eigenvalue 0.

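The eigenvector corresponding to the lowest non-zero eigenvalue $\lambda_2$ must have both positive and negative elements, because it is orthogonal to the constant eigenvector $(1, 1, \dots, 1)$ that belongs to eigenvalue 0. Bisect according to the signs of its elements!
Algebraic connectivity $\lambda_2$: how good the split is, with smaller values corresponding to better splits.
The spectral bisection method is reasonably fast.
- For a general $n \times n$ matrix, the time complexity is $O(n^3)$.
- For a sparse matrix, the Lanczos method reduces it to approximately $O(m / (\lambda_3 - \lambda_2))$, where m is the number of edges.
G. H. Golub and C. F. Van Loan, Matrix Computations, Johns Hopkins University Press, Baltimore, MD (1989)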
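The bisection step can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it uses NumPy's dense `eigh` solver rather than the Lanczos method, and the function name `spectral_bisection` and the six-vertex test graph (two triangles joined by one edge) are made up for the example.

```python
import numpy as np

def spectral_bisection(adj):
    """Split an undirected graph in two by the signs of the Fiedler vector."""
    A = np.asarray(adj, dtype=float)
    L = np.diag(A.sum(axis=1)) - A           # Laplacian L = D - A
    eigvals, eigvecs = np.linalg.eigh(L)     # eigenvalues in ascending order
    fiedler = eigvecs[:, 1]                  # eigenvector of the second-lowest eigenvalue
    group_a = [i for i, x in enumerate(fiedler) if x >= 0]
    group_b = [i for i, x in enumerate(fiedler) if x < 0]
    return eigvals[1], group_a, group_b

# Hypothetical test graph: triangles (0,1,2) and (3,4,5) joined by the edge 2-3.
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
A = np.zeros((6, 6))
for u, v in edges:
    A[u, v] = A[v, u] = 1

lam2, group_a, group_b = spectral_bisection(A)
parts = sorted([sorted(group_a), sorted(group_b)])
print(lam2, parts)
```

Which triangle ends up in `group_a` depends on the arbitrary overall sign of the eigenvector, so only the partition itself is meaningful.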

2. The Kernighan-Lin (KL) algorithm
B. W. Kernighan and S. Lin, Bell System Technical Journal 49, 291 (1970)
Benefit function Q: the number of edges that lie within the two groups A and B minus the number that lie between them.
1. Specify the sizes of the two groups, N(A) and N(B).
2. Calculate ∆Q for all possible exchange pairs from A and B.
3. Choose the pair that maximizes the change in Q (greedy algorithm).
4. Repeat 2 & 3 until all vertices have been swapped once (a vertex that has been swapped is never swapped again).
5. Go back over the sequence of swaps and find the point with the highest Q. Bisect!
- This algorithm requires the sizes of the groups to be known a priori.
- It runs moderately quickly, in worst-case time $O(n^2)$. However, if we don’t know the sizes, it increases to $O(n^3)$.
- If the sizes are left unconstrained, the best values of Q are always achieved for very asymmetric, trivial divisions.
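Steps 2 to 5 above can be sketched as a single pass in Python. This is a deliberately naive version: it recomputes Q from scratch for every candidate swap instead of updating ∆Q incrementally, so it does not reach the running time quoted above. The helper names (`benefit`, `swapped`, `kernighan_lin_pass`) and the starting partition are assumptions made for the example.

```python
def benefit(edges, side):
    """Q = (edges inside the two groups) - (edges between them)."""
    return sum(1 if side[u] == side[v] else -1 for u, v in edges)

def swapped(side, a, b):
    """Copy of the assignment in which a and b exchange groups."""
    trial = dict(side)
    trial[a], trial[b] = side[b], side[a]
    return trial

def kernighan_lin_pass(edges, group_a, group_b):
    side = {v: "A" for v in group_a}
    side.update({v: "B" for v in group_b})
    free_a, free_b = sorted(group_a), sorted(group_b)   # vertices not yet swapped
    best_q, best_side = benefit(edges, side), dict(side)
    while free_a and free_b:
        # Steps 2-3: evaluate every exchange pair and take the best one greedily.
        pairs = [(a, b) for a in free_a for b in free_b]
        a, b = max(pairs, key=lambda p: benefit(edges, swapped(side, *p)))
        side = swapped(side, a, b)
        free_a.remove(a)                                # step 4: never swap again
        free_b.remove(b)
        # Step 5: remember the best point along the sequence of swaps.
        q = benefit(edges, side)
        if q > best_q:
            best_q, best_side = q, dict(side)
    return best_q, best_side

# Hypothetical test graph: two triangles joined by the edge 2-3, badly split at the start.
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
best_q, best_side = kernighan_lin_pass(edges, [0, 1, 3], [2, 4, 5])
print(best_q, best_side)
```

Because each swap exchanges one vertex from each group, the group sizes fixed in step 1 are preserved throughout the pass.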

3. Newman fast algorithm
M. E. J. Newman, cond-mat/ (PRE in press)
Maximize the modularity Q by a greedy algorithm!
Generally, the number of ways to divide n vertices into g non-empty groups is given by the Stirling number of the second kind S(n,g), and hence the number of distinct community divisions is $\sum_{g=1}^{n} S(n,g)$, which makes an exhaustive search infeasible.
Agglomerative hierarchical clustering method!
1. Start with each vertex as the sole member of one of n communities.
2. Calculate the increase in Q for all possible community pairs.
3. Choose the merger that gives the greatest increase in Q.
4. Repeat 2 & 3 until the modularity Q reaches its maximal value.
Time complexity: $O(mn)$, i.e. $O(n^2)$ on a sparse graph.
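The agglomerative steps can be sketched as follows. The sketch assumes Newman's usual definition of modularity, $Q = \sum_c (e_{cc} - a_c^2)$, where $e_{cc}$ is the fraction of edges inside community c and $a_c$ is the fraction of edge ends attached to it; the transcript does not preserve the formula, so treat this as an assumption. It recomputes Q from scratch for every candidate merger, so it illustrates the logic only and is far slower than the complexity quoted above. It also merges all the way down to one community and reports the best Q seen. Function names are made up for the example.

```python
from itertools import combinations

def modularity(edges, community):
    """Q = sum over communities of (fraction of edges inside - (fraction of edge ends)^2)."""
    m = len(edges)
    inside, ends = {}, {}
    for u, v in edges:
        cu, cv = community[u], community[v]
        ends[cu] = ends.get(cu, 0) + 1
        ends[cv] = ends.get(cv, 0) + 1
        if cu == cv:
            inside[cu] = inside.get(cu, 0) + 1
    return sum(inside.get(c, 0) / m - (d / (2 * m)) ** 2 for c, d in ends.items())

def greedy_modularity(edges):
    nodes = sorted({v for e in edges for v in e})
    community = {v: v for v in nodes}            # step 1: one community per vertex
    best_q, best = modularity(edges, community), dict(community)

    def merged(pair):
        keep, drop = pair
        return {v: (keep if c == drop else c) for v, c in community.items()}

    while len(set(community.values())) > 1:
        labels = sorted(set(community.values()))
        # Steps 2-3: try every pair of communities and keep the best merger.
        pair = max(combinations(labels, 2),
                   key=lambda p: modularity(edges, merged(p)))
        community = merged(pair)
        q = modularity(edges, community)
        if q > best_q:                           # remember the division with maximal Q
            best_q, best = q, dict(community)
    return best_q, best

# Hypothetical test graph: two triangles joined by the edge 2-3.
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
best_q, best = greedy_modularity(edges)
groups = sorted(sorted(v for v in best if best[v] == c) for c in set(best.values()))
print(best_q, groups)
```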

4. q-state Potts method or RB method (Reichardt-Bornholdt method)
J. Reichardt and S. Bornholdt, cond-mat/ (2004)
q-state Potts model on the network
Hamiltonian:
- Nearest-neighbor ferromagnetic interaction of the Potts model: homogeneous distribution of spins
- Global anti-ferromagnetic interaction: diversity
q = N/5 is reasonable for applications.
- Monte Carlo heat-bath algorithm and simulated annealing
- Magnetization

Test network: 128-node computer-generated network (proposed by Newman), 4 groups of 32 nodes each. Each node has 16 links on average ($z_{in} + z_{out} = 16$).

5. Hierarchical clustering
Dendrogram
1. Measure of similarity $x_{ij}$ between pairs (i,j) of vertices (metric).
2. Single linkage, complete linkage, or average linkage.
Structural equivalence: two vertices are said to be structurally equivalent if they have the same set of neighbours, i.e. how many friends they have in common.
- Euclidean distance
- Pearson correlation
K-components: two vertices in the same community have at least k independent paths between them.
- The count of edge-independent paths (max-flow) between vertices.
Time complexity: $\max(O(mn), O(n^2 \log n))$, because of the sorting of the $n^2$ similarities.
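The two structural-equivalence measures named above can be sketched as follows. The slide's formulas are not preserved, so the definitions here are the common ones from the network literature and should be read as assumptions: the Euclidean distance compares rows i and j of the adjacency matrix while skipping columns i and j, and the Pearson correlation is taken over the full rows. The function names are made up for the example.

```python
import math

def euclidean_distance(adj, i, j):
    """Distance between rows i and j of the adjacency matrix, skipping columns i and j."""
    return math.sqrt(sum((adj[i][k] - adj[j][k]) ** 2
                         for k in range(len(adj)) if k not in (i, j)))

def pearson_correlation(adj, i, j):
    """Correlation between rows i and j; assumes neither row is constant."""
    n = len(adj)
    mu_i, mu_j = sum(adj[i]) / n, sum(adj[j]) / n
    cov = sum((adj[i][k] - mu_i) * (adj[j][k] - mu_j) for k in range(n)) / n
    sd_i = math.sqrt(sum((adj[i][k] - mu_i) ** 2 for k in range(n)) / n)
    sd_j = math.sqrt(sum((adj[j][k] - mu_j) ** 2 for k in range(n)) / n)
    return cov / (sd_i * sd_j)

# Hypothetical graph: vertices 0 and 1 both link to 2 and 3 and to nothing else.
adj = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [1, 1, 0, 0],
    [1, 1, 0, 0],
]
same = (euclidean_distance(adj, 0, 1), pearson_correlation(adj, 0, 1))
diff = (euclidean_distance(adj, 0, 2), pearson_correlation(adj, 0, 2))
print(same, diff)
```

Vertices 0 and 1 share exactly the same neighbours, so they are structurally equivalent: distance zero and correlation one. Either measure can then feed the linkage step of the hierarchical clustering.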

6. Zhou dissimilarity index method
H. Zhou, Phys. Rev. E 67, (2003)
H. Zhou, Phys. Rev. E 67, (2003)
The distance $d_{ij}$ from vertex i to vertex j is defined as the average number of steps needed for a Brownian particle on this network to move from vertex i to vertex j.
- Transfer matrix (jumping probability) P
- Distance: I is the N by N identity matrix; B(j) is equal to P except that $B_{lj}(j) = 0$ for all l.
- Dissimilarity index
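The Brownian distance can be computed directly from the matrices named above. The slide's formulas are not preserved, so this sketch relies on assumptions: P is taken as the adjacency matrix with each row divided by the vertex degree, $d_{ij}$ as the i-th row sum of $(I - B(j))^{-1}$ (the standard mean first-passage time), $d_{ii}$ is set to 0 by convention, and the dissimilarity index is taken as the root of the summed squared differences of distances to all other vertices, divided by n - 2. The function names are made up, and the graph must be connected with at least three vertices.

```python
import numpy as np

def brownian_distances(adj):
    """d[i, j] = mean number of random-walk steps needed to go from i to j."""
    A = np.asarray(adj, dtype=float)
    n = len(A)
    P = A / A.sum(axis=1, keepdims=True)     # transfer matrix (jumping probabilities)
    d = np.zeros((n, n))
    for j in range(n):
        B = P.copy()
        B[:, j] = 0.0                        # B(j): no jumps into the target j
        # Row sums of (I - B)^-1, obtained by solving (I - B) h = 1.
        d[:, j] = np.linalg.solve(np.eye(n) - B, np.ones(n))
        d[j, j] = 0.0                        # convention assumed here
    return d

def dissimilarity(d, i, j):
    """How differently i and j 'see' the rest of the network (assumed definition)."""
    n = len(d)
    others = [k for k in range(n) if k not in (i, j)]
    return np.sqrt(sum((d[i, k] - d[j, k]) ** 2 for k in others)) / (n - 2)

# Hypothetical test graph: the path 0 - 1 - 2.
A = [[0, 1, 0],
     [1, 0, 1],
     [0, 1, 0]]
d = brownian_distances(A)
lam = dissimilarity(d, 0, 2)
print(d, lam)
```

The two end vertices of the path are equally far from the middle vertex, so their dissimilarity index is zero even though they are not adjacent.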

7. Girvan-Newman (GN) algorithm
M. Girvan and M. E. J. Newman, PNAS 99, 7821 (2002)
M. E. J. Newman and M. Girvan, Phys. Rev. E 69, (2004)
The few edges that lie between communities can be thought of as forming “bottlenecks” between the communities.
Edge betweenness: the number of geodesic (i.e., shortest) paths between vertex pairs that run along the edge in question, summed over all vertex pairs.
Edge removal: after calculating the betweenness of all edges in the network, remove the one with the highest betweenness. Recalculate after each edge removal and repeat until the modularity Q is maximal.
Time complexity: $O(m^2 n)$
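The edge-removal loop can be sketched as follows. The betweenness routine is a breadth-first accumulation in the style of Brandes' algorithm, which is an implementation choice of this sketch rather than something stated on the slide. For brevity the loop stops at the first split of the network instead of tracking the modularity Q, and it assumes a connected graph with at least one edge. All function names are made up for the example.

```python
from collections import deque

def edge_betweenness(adj):
    """Number of shortest paths through each edge; adj maps vertex -> set of neighbours."""
    score = {frozenset((u, v)): 0.0 for u in adj for v in adj[u]}
    for s in adj:
        dist, sigma, preds, order = {s: 0}, {s: 1.0}, {s: []}, []
        queue = deque([s])
        while queue:                                  # BFS that counts shortest paths
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w], sigma[w], preds[w] = dist[v] + 1, 0.0, []
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = {v: 0.0 for v in order}
        for w in reversed(order):                     # push path counts back toward s
            for v in preds[w]:
                c = sigma[v] / sigma[w] * (1.0 + delta[w])
                score[frozenset((v, w))] += c
                delta[v] += c
    return {e: b / 2 for e, b in score.items()}       # each pair was seen from both ends

def components(adj):
    seen, parts = set(), []
    for s in adj:
        if s in seen:
            continue
        stack, part = [s], set()
        while stack:
            v = stack.pop()
            if v not in part:
                part.add(v)
                stack.extend(adj[v] - part)
        seen |= part
        parts.append(part)
    return parts

def first_split(adj):
    """Remove highest-betweenness edges, recalculating each time, until the graph splits."""
    adj = {v: set(nb) for v, nb in adj.items()}
    while len(components(adj)) == 1:
        scores = edge_betweenness(adj)
        u, v = max(scores, key=scores.get)
        adj[u].discard(v)
        adj[v].discard(u)
    return components(adj)

# Hypothetical test graph: two triangles joined by the edge 2-3.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
scores = edge_betweenness(adj)
bridge = max(scores, key=scores.get)
split = sorted(sorted(p) for p in first_split(adj))
print(bridge, scores[bridge], split)
```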

8. Tyler-Wilkinson-Huberman (TWH) method
J. R. Tyler, D. M. Wilkinson, and B. A. Huberman, cond-mat/ (2003)
Variation of the Girvan-Newman algorithm to improve the calculation speed.
Tyler et al. suggest that, instead of summing over all vertices, only a subset of vertices i be summed over, giving a partial betweenness score for all edges; if a random sample is chosen, this gives a Monte Carlo estimate of the betweenness.
The number of vertices sampled is chosen so as to make the betweenness of at least one edge in the network greater than a certain threshold.
This stochastic approach reduces the time complexity from $O(m^2 n)$ to $O(m^2)$.

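9. RCCLP method or Parisi method (Radicchi-Castellano-Cecconi-Loreto-Parisi method)
F. Radicchi, C. Castellano, F. Cecconi, V. Loreto, and D. Parisi, PNAS 101, 2658 (2004)
For a subgraph V, let $k_i^{in}(V)$ and $k_i^{out}(V)$ be the numbers of links from vertex i to vertices inside and outside V.
Definition of community in a strong sense: $k_i^{in}(V) > k_i^{out}(V)$ for every $i \in V$
Definition of community in a weak sense: $\sum_{i \in V} k_i^{in}(V) > \sum_{i \in V} k_i^{out}(V)$
Edge clustering coefficient: $C_{ij}^{(3)} = \dfrac{z_{ij}^{(3)} + 1}{\min(k_i - 1, k_j - 1)}$
- $z_{ij}^{(3)}$: the number of triangles built on that edge.
Edge clustering coefficient of order g: $C_{ij}^{(g)} = \dfrac{z_{ij}^{(g)} + 1}{s_{ij}^{(g)}}$
- $z_{ij}^{(g)}$: the number of cycles of order g built on that edge; $s_{ij}^{(g)}$: the number of such cycles that are possible given the degrees of i and j.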

Time complexity: $O(m^4 / n^2)$, i.e. $O(n^2)$ on a sparse graph.
The edge clustering coefficient is strongly negatively correlated with edge betweenness.
This algorithm relies on the presence of triangles in the network. Clearly, if a network has few triangles in the first place, then the edge clustering coefficient will be small for all edges, and the algorithm will fail to find the communities.
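A short sketch of the triangle-based edge clustering coefficient follows. It assumes the order-3 definition from Radicchi et al., $(z_{ij} + 1) / \min(k_i - 1, k_j - 1)$ with $z_{ij}$ the number of triangles on the edge, and treats the coefficient as infinite when the denominator is zero. The function name and the test graph are made up for the example.

```python
def edge_clustering(adj, i, j):
    """(triangles on edge i-j, plus one) / (triangles the degrees would allow)."""
    triangles = len(adj[i] & adj[j])                  # common neighbours close triangles
    possible = min(len(adj[i]) - 1, len(adj[j]) - 1)
    return float("inf") if possible == 0 else (triangles + 1) / possible

# Hypothetical test graph: two triangles joined by the edge 2-3.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
coeff = {(i, j): edge_clustering(adj, i, j) for i in adj for j in adj[i] if i < j}
weakest = min(coeff, key=coeff.get)   # the edge a divisive step would remove first
print(coeff, weakest)
```

The edge joining the two triangles lies on no triangle itself, so it gets the smallest coefficient, in line with the negative correlation with edge betweenness noted above.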

10. Information centrality method (Fortunato-Latora-Marchiori method)
S. Fortunato, V. Latora, and M. Marchiori, cond-mat/ (2004)
- Network efficiency E
- Information centrality $C_I$
- Iterative removal of the edges with the highest information centrality
Time complexity: $O(m^3 n)$
Test network: 64-node computer-generated network with 256 edges, 4 groups of 16 nodes each.
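The two quantities can be sketched as follows. The slide's formulas are not preserved, so the definitions are assumptions taken from the usual Latora-Marchiori form: the efficiency E is the average of $1/d_{ij}$ over all ordered vertex pairs (unreachable pairs contribute 0), and the information centrality of an edge is the relative drop in E when that edge is removed. The function names are made up for the example, and the graph is unweighted.

```python
from collections import deque

def efficiency(adj):
    """Average of 1/d(i, j) over ordered pairs; unreachable pairs contribute 0."""
    n = len(adj)
    total = 0.0
    for s in adj:
        dist = {s: 0}
        queue = deque([s])
        while queue:                                   # BFS shortest-path lengths from s
            v = queue.popleft()
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
        total += sum(1.0 / d for d in dist.values() if d > 0)
    return total / (n * (n - 1))

def information_centrality(adj, u, v):
    """Relative loss of efficiency caused by removing the edge u-v."""
    reduced = {x: set(nb) for x, nb in adj.items()}
    reduced[u].discard(v)
    reduced[v].discard(u)
    e = efficiency(adj)
    return (e - efficiency(reduced)) / e

# Hypothetical test graph: two triangles joined by the edge 2-3.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
centrality = {(i, j): information_centrality(adj, i, j)
              for i in adj for j in adj[i] if i < j}
top = max(centrality, key=centrality.get)   # first edge to remove
print(top, centrality[top])
```

Each centrality needs a full recomputation of all shortest paths, which is where the high running time quoted above comes from.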

11. Flake’s max-flow method (Flake-Lawrence-Giles-Coetzee method)
G. W. Flake, S. R. Lawrence, C. L. Giles, and F. M. Coetzee, IEEE Computer 35, 66 (2002)
Web community
- Starting page or seed Web sites
- Find the boundary of the community using max-flow and min-cut.
- Uses only link information, without the text information. Ex) PageRank, Hyperlink-Induced Topic Search (HITS)

Simple example of max-flow, min-cut
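Since the slide's figure is not preserved, a small worked example may help. The sketch below computes a maximum flow with the Edmonds-Karp scheme (shortest augmenting paths found by BFS) and then reads off a minimum cut as the edges leaving the set of vertices still reachable from the source. The choice of Edmonds-Karp, the function name, and the toy capacities are assumptions made for the example and are not taken from the slide.

```python
from collections import deque

def max_flow_min_cut(capacity, source, sink):
    """capacity maps a directed edge (u, v) to its capacity."""
    residual = {}
    for (u, v), c in capacity.items():
        residual.setdefault(u, {})
        residual.setdefault(v, {})
        residual[u][v] = residual[u].get(v, 0) + c
        residual[v].setdefault(u, 0)                  # reverse edge for undoing flow

    def bfs_parents():
        parent = {source: None}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v, c in residual[u].items():
                if c > 0 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        return parent

    flow = 0
    while True:
        parent = bfs_parents()
        if sink not in parent:                        # no augmenting path is left
            break
        path, v = [], sink
        while parent[v] is not None:
            path.append((parent[v], v))
            v = parent[v]
        push = min(residual[u][v] for u, v in path)   # bottleneck capacity on the path
        for u, v in path:
            residual[u][v] -= push
            residual[v][u] += push
        flow += push

    source_side = set(parent)                         # vertices still reachable from source
    cut = sorted(e for e in capacity if e[0] in source_side and e[1] not in source_side)
    return flow, cut

# Hypothetical toy network: wide links out of s, narrow links into t.
capacity = {("s", "a"): 4, ("s", "b"): 4, ("a", "t"): 2, ("b", "t"): 1, ("a", "b"): 1}
flow, cut = max_flow_min_cut(capacity, "s", "t")
print(flow, cut)
```

The capacities of the cut edges add up to the maximum flow. In the Web-community setting above, the vertices on the source side of such a cut would form the community around the seed pages.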

Spectral analysis: eigenvalues and eigenvectors of the Laplacian or transfer matrix
Optimization approach: Hamiltonian, benefit function, or modularity Q
Edge removal: betweenness, information centrality, clustering coefficient, etc.
Hierarchical clustering: metric (Euclidean, correlation, similarity, etc.)

12. ESMS method or K. Sneppen method (Eriksen-Simonsen-Maslov-Sneppen method)
K. A. Eriksen, I. Simonsen, S. Maslov, and K. Sneppen, Phys. Rev. Lett. 90, (2003)

13. CSCC method or Capocci method (Capocci-Servedio-Caldarelli-Colaiori method)
A. Capocci, V. D. P. Servedio, G. Caldarelli, and F. Colaiori, cond-mat/

14. Donetti-Muñoz (DM) method L. Donetti and M. A. Muñoz, cond-mat/

15. Wu-Huberman (WH) method F. Wu and B. A. Huberman, cond-mat/310600

16. Costa’s Hub-based flooding method L. F. Costa, cond-mat/ (2004)