
Finding community structure in very large networks


1 Finding community structure in very large networks
By Aaron Clauset, M. E. J. Newman, and Cristopher Moore

2 Talk outline
Introduction and reminder
The algorithm
Example: Amazon.com
Summary

3 Girvan & Newman: betweenness clustering
Edge betweenness: the number of shortest paths between vertex pairs that run along a given edge.
Divisive algorithm (it splits the graph apart rather than building communities up):
1. Compute all-pairs shortest paths.
2. For each edge, count the number of such paths it lies on.
3. Remove the edge with the highest betweenness.
4. Repeat from step 1 until no edges are left.
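As a rough illustration (not from the original slides), one round of this divisive procedure can be sketched with the networkx library, which computes edge betweenness directly; networkx also ships a ready-made girvan_newman generator.

```python
# A minimal sketch of one round of Girvan-Newman, assuming the networkx library.
# It removes the highest-betweenness edge repeatedly until the graph splits.
import networkx as nx

def split_once(G):
    G = G.copy()
    start = nx.number_connected_components(G)
    while G.number_of_edges() > 0:
        betweenness = nx.edge_betweenness_centrality(G)  # recomputed after every removal
        u, v = max(betweenness, key=betweenness.get)      # edge lying on the most shortest paths
        G.remove_edge(u, v)
        if nx.number_connected_components(G) > start:     # a community has split off
            break
    return list(nx.connected_components(G))

# Example: Zachary's karate club separates into two factions.
print(split_once(nx.karate_club_graph()))
```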

4 Girvan & Newman: disadvantages
Betweenness must be recalculated at each step, since the removal of one edge can change the betweenness of other edges.
Very expensive: calculating edge betweenness (all-pairs shortest paths) costs O(mn), and with m iterations of edge removal the total is O(m²n), or O(n³) on a sparse graph.
Does not scale to more than a few hundred nodes.

5 Dendrogram (hierarchical tree)
A dendrogram (hierarchical tree) illustrates the output of a hierarchical clustering algorithm. Leaves represent the graph's nodes and the top represents the original graph; as we move down the tree, larger communities are partitioned into smaller ones.

6 Quality functions
Hierarchical clustering algorithms produce many possible partitions, and in general we do not know in advance how many communities we should seek.
How do we know that a given clustering is "good"? We need a quality function.

7 The modularity quality function
Modularity Q is designed to measure the strength of a division of a network into clusters/communities. It measures whether the division is a good one, in the sense that there are many edges within communities and only a few between them. If a high value of Q represents a good community division, why not simply optimize Q over all possible divisions to find the best one?

8 Is there community structure in very large networks? How can we find it?

9 Newman's fast algorithm (2003)
Greedy optimization of modularity: starting with each vertex as the sole member of one of n communities, we repeatedly join communities together in pairs, choosing at each step the join that results in the greatest increase (or smallest decrease) in Q. This is an agglomerative algorithm.
A naive implementation runs in time O((m + n)n), or O(n²) on a sparse graph.
The progress of the algorithm can be represented as a "dendrogram," a tree that shows the order of the joins.
Thank you, Shlomi, for introducing it to us!
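To make the naive, slow version concrete, here is a small sketch (an illustration, not the original code) that at every step evaluates every possible pairwise join with networkx's modularity function and keeps the best one; the per-step cost of doing this explicitly is exactly what the fast algorithm below is designed to avoid.

```python
# A deliberately naive sketch of the greedy agglomeration: try every pair of
# communities, keep the join with the highest modularity Q. Assumes networkx.
import networkx as nx
from itertools import combinations
from networkx.algorithms.community import modularity

def naive_greedy_modularity(G):
    communities = [{v} for v in G]            # start: every vertex is its own community
    best_Q = modularity(G, communities)
    while len(communities) > 1:
        candidates = []
        for i, j in combinations(range(len(communities)), 2):
            merged = [c for k, c in enumerate(communities) if k not in (i, j)]
            merged.append(communities[i] | communities[j])
            candidates.append((modularity(G, merged), merged))
        Q, communities = max(candidates, key=lambda t: t[0])   # best (or least bad) join
        best_Q = max(best_Q, Q)
    return best_Q

print(round(naive_greedy_modularity(nx.karate_club_graph()), 3))
```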

10 The algorithm
Introduction and reminder
The algorithm
Example: Amazon.com
Summary

11 The algorithm (2004)
Here we propose a new algorithm that performs the same greedy optimization, but uses more sophisticated data structures to reduce the time complexity and memory use. It runs far more quickly, in time O(md log n), where d is the depth of the "dendrogram" describing the network's community structure.
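For reference, this greedy modularity maximization (Clauset–Newman–Moore) is available off the shelf in networkx, so the method described in the talk can be tried in a few lines:

```python
# Quick usage example: networkx's greedy_modularity_communities implements the
# Clauset-Newman-Moore greedy modularity maximization described in this talk.
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities, modularity

G = nx.karate_club_graph()
communities = greedy_modularity_communities(G)
print(len(communities), "communities found, Q =", round(modularity(G, communities), 3))
```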

12 Definitions
A_vw = the adjacency matrix of the graph.
c_v = the community to which vertex v belongs.

13 Definitions (cont.)
The δ-function δ(c_i, c_j) is 1 if c_i = c_j and 0 otherwise.
The degree k_v of a vertex v is defined to be the number of edges incident upon it: k_v = Σ_w A_vw.

14 Definitions (cont.)
e_ij = the fraction of edges that join vertices in community i to vertices in community j: e_ij = (1/2m) Σ_vw A_vw δ(c_v, i) δ(c_w, j).
a_i = the fraction of ends of edges that are attached to vertices in community i: a_i = Σ_j e_ij = (1/2m) Σ_v k_v δ(c_v, i).

15 The modularity calculation
Q = (1/2m) Σ_vw [ A_vw − k_v k_w / 2m ] δ(c_v, c_w)
where A_vw is the graph adjacency matrix, m is the number of edges, k_v k_w / 2m is the probability of an edge in a random graph in which that probability is only a function of the node degrees, δ(c_v, c_w) is the in-same-cluster indicator variable, and Q is the resulting modularity value.
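A direct transcription of this formula (a sketch, assuming a dense NumPy adjacency matrix A and an array c of community labels) makes it easy to see where each term enters:

```python
# Transcription of the modularity formula above (a sketch for small dense graphs).
import numpy as np

def modularity(A, c):
    k = A.sum(axis=1)                  # node degrees k_v
    two_m = A.sum()                    # 2m: every edge counted from both ends
    expected = np.outer(k, k) / two_m  # k_v * k_w / 2m, the random-graph term
    same = (c[:, None] == c[None, :])  # delta(c_v, c_w)
    return ((A - expected) * same).sum() / two_m

# Two triangles joined by one edge, split into their natural communities.
A = np.array([[0,1,1,0,0,0],[1,0,1,0,0,0],[1,1,0,1,0,0],
              [0,0,1,0,1,1],[0,0,0,1,0,1],[0,0,0,1,1,0]])
print(modularity(A, np.array([0,0,0,1,1,1])))   # about 0.357
```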

16

17 The modularity calculation (cont.)
Summing over communities, the same quantity can be written as Σ_i ( e_ii − a_i² ) = Q.

18 Purpose of the algorithm
The operation of the algorithm involves finding the changes in Q that would result from the join of each pair of communities, choosing the largest of them, and performing the corresponding join

19 Updates on the previous algorithm
In the previous algorithm the operation is done explicitly on the entire matrix. But what if the adjacency matrix is sparse? Then the operation can be carried out more efficiently using data structures for sparse matrices.

20 Data structures
1. A sparse matrix containing ∆Q_ij for each pair i, j of communities with at least one edge between them. We store each row of the matrix both as a balanced binary tree (so that elements can be found or inserted in O(log n) time) and as a max-heap (so that the largest element can be found in constant time).
(Note: a_i counts every edge end that touches community i in any way, including edges internal to the community.)

21 Data structures (cont.)
2. A max-heap H containing the largest element of each row of the matrix ∆Q_ij, along with the labels i, j of the corresponding pair of communities.
3. An ordinary vector (array) with elements a_i.
4. Q = the maximal modularity found so far.
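One way these structures might be laid out in Python (an illustrative sketch with hypothetical names; heapq only provides min-heaps, so gains are stored negated):

```python
# Illustrative layout of the bookkeeping structures (names are hypothetical).
import heapq

dQ = {}         # 1. sparse matrix: dQ[i][j] exists only when communities i and j share an edge
row_heaps = {}  # per-row max-heaps of entries (-gain, i, j)
H = []          # 2. global max-heap holding the best entry of every row
a = {}          # 3. a[i] = fraction of edge ends attached to community i
Q = 0.0         # 4. running modularity of the current partition

def push(i, j, gain):
    """Record dQ[i][j] and expose it through the row heap and the global heap H."""
    dQ.setdefault(i, {})[j] = gain
    heapq.heappush(row_heaps.setdefault(i, []), (-gain, i, j))
    heapq.heappush(H, (-gain, i, j))
```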

22 Initialization
We start off with each vertex being the sole member of a community of one, in which case e_ij = 1/2m if i and j are connected and zero otherwise, and a_i = k_i/2m. The corresponding initial value of ∆Q_ij is 1/2m − k_i k_j/(2m)² if i and j are connected, and zero otherwise.
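A sketch of this initialization for a networkx graph G, storing ∆Q only for pairs of vertices joined by an edge; the 1/2m − k_i k_j/(2m)² value follows from the definitions above:

```python
# Initialization sketch: dQ as a dict-of-dicts holding only connected pairs,
# a as the fraction of edge ends per (singleton) community. Assumes networkx.
import networkx as nx

def initialize(G):
    m = G.number_of_edges()
    a = {v: G.degree(v) / (2 * m) for v in G}
    dQ = {v: {} for v in G}
    for u, v in G.edges():
        gain = 1 / (2 * m) - G.degree(u) * G.degree(v) / (2 * m) ** 2
        dQ[u][v] = dQ[v][u] = gain
    return dQ, a
```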

23 The algorithm
1. Calculate the initial values of ∆Q_ij and a_i according to the initialization above, and populate the max-heap H with the largest element of each row of the matrix ∆Q.
2. Select the largest ∆Q_ij from H, join the corresponding communities, update the matrix ∆Q, the heap H and a_i (as described on the next slide), and increment Q by ∆Q_ij.
3. Repeat step 2 until only one community remains.
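A compact sketch of this outer loop, assuming initialize() from the previous sketch and a merge() routine sketched after the update rules on the next slide. For simplicity it replaces the per-row heaps with a single global heap whose stale entries are skipped lazily, which is cruder than the paper's structures but follows the same greedy rule:

```python
# Greedy loop sketch: repeatedly pop the best join, apply it, and track the best
# modularity seen. Assumes initialize() above and merge() (next slide's rules).
import heapq

def cnm(G):
    dQ, a = initialize(G)
    Q = -sum(x * x for x in a.values())        # modularity of the all-singletons partition
    best_Q = Q
    heap = [(-gain, i, j) for i in dQ for j, gain in dQ[i].items()]
    heapq.heapify(heap)
    while heap:
        neg_gain, i, j = heapq.heappop(heap)
        if i not in dQ or j not in dQ[i] or dQ[i][j] != -neg_gain:
            continue                           # stale entry left over from an earlier merge
        Q += -neg_gain                         # increment Q by the chosen dQ_ij
        merge(dQ, a, i, j)                     # join i into j and update row/column j
        for k, gain in dQ[j].items():          # re-expose the fresh entries of row j
            heapq.heappush(heap, (-gain, j, k))
        best_Q = max(best_Q, Q)
    return best_Q
```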

24 Updating the matrix ∆Q
If we join communities i and j, labeling the combined community j, say, we need only update the jth row and column, and remove the ith row and column altogether.
If community k is connected to both i and j, then ∆Q'_jk = ∆Q_ik + ∆Q_jk. (10a)
If k is connected to i but not to j, then ∆Q'_jk = ∆Q_ik − 2 a_j a_k. (10b)
If k is connected to j but not to i, then ∆Q'_jk = ∆Q_jk − 2 a_i a_k. (10c)
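The same three cases, written as the merge() routine used by the loop sketched above (heap maintenance is handled there by skipping stale entries, so this routine only touches ∆Q and a):

```python
# Merge sketch: join community i into j, rebuilding row/column j from Eqs. (10a)-(10c)
# and deleting row/column i. dQ is the dict-of-dicts and a the dict from earlier sketches.
def merge(dQ, a, i, j):
    neighbours = (set(dQ[i]) | set(dQ[j])) - {i, j}
    for k in neighbours:
        if k in dQ[i] and k in dQ[j]:        # k connected to both i and j      (10a)
            new = dQ[i][k] + dQ[j][k]
        elif k in dQ[i]:                     # k connected to i but not to j    (10b)
            new = dQ[i][k] - 2 * a[j] * a[k]
        else:                                # k connected to j but not to i    (10c)
            new = dQ[j][k] - 2 * a[i] * a[k]
        dQ[j][k] = dQ[k][j] = new
        dQ[k].pop(i, None)                   # column i disappears
    dQ[j].pop(i, None)
    del dQ[i]                                # row i disappears
    a[j] += a[i]                             # the joined community absorbs i's edge ends
    a[i] = 0.0
```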

25 Reminder of how modularity can help us visualize large networks

26 Reminder: run times
Insertion into a balanced binary tree: O(log n).
Updating the max-heap for the kth row by inserting, raising, or lowering ∆Q_kj takes O(log |k|) ≤ O(log n) time.
Binary heap operations: find-max Θ(1), insert Θ(log n), delete-max Θ(log n), merge Θ(n).

27 Run time
|i| = the degree of community i, i.e. the number of neighbouring communities.
Joining i and j costs O((|i| + |j|) log n):
(10a): inserting each of the |i| elements into the jth row costs O(log |j|).
(10b + 10c): inserting each of the |i| + |j| updated elements costs O(log(|i| + |j|)).
Updating a single element of the kth row costs O(log |k|), at most O(log n), and there are at most |i| + |j| values of k for which we have to do this.
Total: O((|i| + |j|) log n), for both the binary trees and the max-heaps.

28 Run time (cont.)
The total running time is therefore at most O(log n) times the sum, over all nodes of the dendrogram, of the degrees of the corresponding communities.
Worst case: the degree of a community is the sum of the degrees of all the vertices in the original network comprising it. In that case, each vertex of the original network contributes its degree to all of the communities it is a part of, along the path in the dendrogram from it to the root.

29 Run time (cont.): O(md log n)
If the dendrogram has depth d, there are at most d nodes in this path, and since the total degree of all the vertices is 2m, we have a running time of O(md log n), as stated.

30 Practical situations
It is usually unnecessary to maintain the separate max-heaps for each row: their maintenance takes a moderate amount of effort, and this effort is wasted if the largest element in a row does not change when two rows are joined.
If the largest element of the kth row was ∆Q_ki or ∆Q_kj and is now reduced by Eq. (10b) or (10c), we simply scan the kth row to find the new largest element.
With this simplification the average-case running time is often better than that of the more sophisticated algorithm.

31 Example: Amazon.com
Introduction and reminder
The algorithm
Example: Amazon.com
Summary

32 The connections: Amazon
The network we study consists of items listed on the Amazon web site; items can be books, music, video games, etc. There is an edge from A to B iff B was frequently purchased by buyers of A.

33

34 Bridge: an edge that, when removed, splits off a community. Bridges can act as bottlenecks for information flow.
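In the strict graph-theoretic sense (an edge whose removal disconnects its component), bridges can be listed directly with networkx; a small sketch, with the caveat that the slide's notion of a community-splitting edge is looser:

```python
# Sketch: list graph-theoretic bridges with networkx. The barbell graph (two
# cliques joined through one node) has an obvious pair of bridges.
import networkx as nx

G = nx.barbell_graph(5, 1)
print(list(nx.bridges(G)))   # the two linking edges (4, 5) and (5, 6)
```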

35 Looking at the largest communities in the network, we find that they tend to consist of items (books, music) in similar genres or on similar topics

36 Power law
Partitioned at the point of maximum modularity, the distribution of community sizes s appears to have a power-law form.
(The figure shows the cumulative distribution of community sizes when the network is partitioned at the maximum modularity found by the algorithm.)

37 Summary
Introduction and reminder
The algorithm
Example: Amazon.com
Summary

38 Summary
Run time: O(md log n), where n = number of vertices, m = number of edges, d = depth of the dendrogram.
For a balanced dendrogram (d ∼ log n) and a sparse network (m ∼ n), the run time is O(n log² n).
The algorithm should allow researchers to analyze even larger networks, with millions of vertices and tens of millions of edges, using current computing resources.

39 Improvements
Unfortunately, the algorithm does not scale well and its use is practically limited to networks of up to about 500,000 nodes. Follow-up work shows that this inefficiency is caused by merging communities in an unbalanced manner, and that simple heuristics that attempt to merge communities in a balanced manner can dramatically improve community structure analysis.
(A heuristic is a simple rule of thumb based on plain logic or intuition, offering a quick and easy way to make decisions without deep analysis, at the cost of lower accuracy.)

40 Improvements (cont.)
The proposed techniques were tested using data sets obtained from an existing social networking service that hosts 5.5 million users. Two variations of the heuristics were tested: the fastest method processes an SNS friendship network with 1 million users in 5 minutes (about 70 times faster than the original algorithm), and another friendship network with 4 million users in 35 minutes.

41 Credits

42 The End

