Clustering of data Partitional clustering methods Important technique in data analysis Divide the data according to natural classes Pattern recognition, learning, astrophysics, and network analysis N multivariable data points o D-dimensional vector space metric
On network N vertices (nodes) No prior information Only know the edge (link) connectivity : Structural information How can we divide the network into several parts? = How can we find the “community” structure? = How can we find the “community” structure? Web page having same topic, hidden social relationship, distribute processes to processors in a parallel computer, etc. applications
Community, cluster Functional modules in cellular and genetic network P. Holme, M. Huss, and H. Jeong, Bioinformatics 19, 532 (2003). D. Wilkinson and B. A. Huberman, Proc. Natl. Acad. Sci. USA 10.1073/pnas.0307740100 (2004). A. Vespignani, Nature Genetics 35, 118 (2003). Cultural society or important source of a person’s identity in social network J. Scott, Social Network Analysis: A Handbook, Sage Publications 2 nd ed. (2000). A bundle of web pages on common topics etc. Community, module, (cohesive) subgroup, cluster, clique etc. Computer science, mathematics, sociology, biology, and physics are related in this community finding problem.
Structural definition of community Groups of vertices within which connections are dense, but between which connections are sparser. Because we don’t have any prior information about network.Modularity
Key points (Highlight) What property or measure of network is used in this algorithm or method? eigenvalue and eigenvector, spectrum of adjacency matrix. Edge betweenness, information centrality. Distance, dissimilarity index, edge clustering coefficient, etc. Agglomerative or divisive? What is the required prior information here? Whether there is community or not. How many modules are there. Performance of partitioning results and computational complexity. We will review about 11 different methods recently studied. If you are boring, ask me a question. Physical meaning?
1. Spectral bisection (old one) M. Fiedler, Czrch. Math. J. 23, 298 (1973) A. Pothen, H. Simon, and K.-P. Liou, SIAM J. Matrix Anal. Appl. 11, 430 (1990) F. R. K. Chung, Spectral Graph Theory, Amer. Math. Soc. (1997) http://www.cs.berkeley.edu/~demmel/cs267/lecture20/lecture20.html Laplacian L of n-vertex undirected graph G - D is the diagonal matrix of vertex degree k. - A is the adjacency matrix. 1 2 3 4 5 is always eigenvector with eigenvalue 0.
1 2 3 4 5 The eigenvector corresponding to the lowest eigenvalue must haveboth positive and negative elements. Algebraic connectivity : How good the split is, with smaller values corresponding to better splits. Bisect ! The spectral bisection method is reasonably fast. General n by n matrix case, O(n 3 ) time complexity. However, sparse matrix case, Lancozos method reduces it to approximately. G. H. Golub and C. F. Van Loan, Matrix computations. Johns Hopkins University Press, Baltimore, MD (1989)
2. The Kernighan-Lin (KL) algorithm B. W. Kernighan and S. Lin, Bell System Technical Journal 49, 291 (1970) http://www.cs.berkeley.edu/~demmel/cs267/lecture18/lecture18.html Benefit function Q The number of edges that lie within the two groups minus the number that lie between them. A B 1. We should Specify the size of the two groups. N(A), N(B) 2. Calculate the ∆Q for all possible exchange pair from A and B. 3. Choose the pair that maximizes the change of Q. (greedy algorithm) 4. Repeat 2 & 3 until all vertices have been swapped once. (any vertex that has been swapped is never swapped. ) 5. Go back over the sequence of swaps and find the highest Q. Bisect ! - This algorithm requires a priori what the size of the groups will be. - It runs moderately quickly, in worst case time O(n 2 ). However, if we don’t know the size, It will increase to O(n 3 ). - The best values of Q are always achieved for very asymmetric trivial division.
3. Newman fast algorithm M. E. J. Newman, cond-mat/0309508 (PRE in press)Modularity Maximize Q by greedy algorithm ! Generally the number of ways to divide n vertices into g non-empty groups is given by the Stirling number of the second kind S(n,g), and hence the number of distinct community divisions is. Agglomerative hierarchical clustering method! 1.Separate each vertex solely into n community. 2.Calculate the increase of Q for all possible community pairs. 3.Choose the mergence of the greatest increase in Q. 4.Repeat 2 & 3 until the modularity Q reaches the maximal value. Time Complexity - O(mn) O(n 2 ) on sparse graph.
4. q-state Potts method or RB method (Reihardt-Bornholdt method) J. Reichardt and S. Bornholdt, cond-mat/0402349 (2004) q-state Potts model on network Hamiltonian : Nearest neighbor ferromagnetic interaction of the Potts model : homogeneous distribution of spin Diversity : global anti-ferromagnetic interaction. q = N/5 is reasonable for application. Monte-Carlo heat-bath algorithm and simulated annealing magnetization
128 nodes computer-generated (proposed by Newman) network, 4 groups of 32 nodes each. Average of 16 links ( z in +z out =16 )
5. Hierarchical clustering Dendrogram 1.Measure of similarity x ij between pairs (i,j) of vertices. 2.Single linkage, complete linkage, or average linkage. metric Structural equivalence : Two vertices are said to be structurally equivalent if they have the same set of neighbours. How many same friends they have. Euclidean distance Pearson correlation K-components : Two vertices in the same community have at least k independent paths between them. The count of edge-independent path (max-flow) betweenvertices. Time complexity Max(O(mn), O(n2logn) ) because of the sorting of n 2 similarity.
6. Zhou dissimilarity index method H. Zhou, Phys. Rev. E 67, 061901 (2003) H. Zhou, Phys. Rev. E 67, 041908 (2003) The distance d ij from vertex i to vertex j is defined as the average number of steps needed for a Brownian particle on this network to move from vertex i to vertex j. Transfer matrix (jumping probability) Distance I is N by N identity matrix. B(j) is equals to P except that B lj (j) = 0 for all l. Dissimilarity index
7. Girvan-Newman (GN) algorithm M. Girvan and M. E. J. Newman, PNAS 99, 7821 (2002) M. E. J. Newman and M. Girvan, Phys. Rev. E 69, 026113 (2004) A B The few edges that lie between communities can be thought of as forming “bottlenecks” between the communities. Betweenness and edge betweenness : The number of geodesic (i.e., shortest) paths between vertex pairs that run along the edge in question, summed over all vertex pairs. Edge removal : After calculating the betweenness of all edges in the network, remove the one with highest betweenness. Recalculate after edge removal and repeat it until the modularity Q is maximum. Time complexity O(m 2 n)
8. Tyler-Wilkinson-Huberman (TWM) method J. R. Tyler, D. M. Wilkinson, and B. A. Huberman, cond-mat/0303264 (2003) Variation of Girvan-Newman algorithm to improve the calculating speed. Tyler et al. suggest instead summing up over all node only a subset of vertices i be summed over, giving partial betweenness score for all edges; if a random sample is chosen, this will give a Monte Carlo estimate of betweenness. The number of vertices sampled is chosen so as to make the betweenness of at least one edge in the network greater than a certain threshold. This stochastic approach reduces the time complexity from O(m 2 n) to O(m 2 )
9. RCCLP method or Parisi method (Radicchi-Castellano-Cecconi-Loreto-Parisi method) F. Radicchi, C. Castellano, F. Cecconi, V. Loreto, and D. Parisi, PNAS 101, 2658 (2004) Edge clustering coefficient Definition of community in a strong sense : Definition of community in a weak sense : : the number of triangles built on that edge. ij 65 ij 65 Edge coefficient of order g :
Time complexity O(m 4 /n 2 ) ~ O(n 2 ) Edge clustering coefficient is strongly negatively correlated with edge betweenness. This algorithm relies on the presence of triangles in the network. Clearly if a network has few triangles in the first place, then the edge clustering coefficient will be small for all edges, and the algorithm will be fail to find the communities.
10. Information centrality method (Fortunato-Latora-Marchiori method) S. Fortunato, V. Latora, and M. Marchiori, cond-mat/042522 (2004) Network efficiency E Information centrality C I Iterative removal of the edges with the highest information centrality Time complexity O(m 3 n) 64 nodes computer-generated network. 256 edges, 4 groups of 16 nodes each.
11. Flake’s max-flow method (Flake-Lawrence-Giles-Coetzee method) G. W. Flake, S. R. Lawrence, C. L. Giles, and F. M. Coetzee, IEEE Computer 35, 66 (2002) Web community Starting page or seed Web sites Find the boundary of community using max-flow and min-cut. Without the text information only link information. Ex) Page Rank, Hyperlink Induced Topic Search(HIT)
Spectral analysis : eignevalue and eigenvector of Laplacian or transfer matrix Optimization Approach : Hamiltonian, benefit function, or modularity Q Edge removal : betweenness, information centrality, clustering coefficient, etc. Hierarchical clustering : metric ( Euclidian, correlation, similarity, etc. )